Examples

In the following example, we will imagine we are training a model to predict the temperature and humidity in a city for each hour.

First we load some hourly weather data:

julia> using DataFrames, Dates, FeatureTransforms

julia> using FeatureTransforms: fit!

julia> df = DataFrame(
            :time => DateTime(2018, 9, 10):Hour(1):DateTime(2018, 9, 10, 23),
            :temperature => [10.6, 9.5, 8.9, 8.9, 8.4, 8.4, 7.7, 8.9, 11.7, 13.9, 16.2, 17.7, 18.9, 20.0, 21.2, 21.7, 21.7, 21.2, 20.0, 18.4, 16.7, 15.0, 13.9, 12.7],
            :humidity => [93.8, 96.1, 94.8, 92.4, 92.7, 97.3, 100.2, 96.2, 89.2, 83.2, 77.4, 69.7, 65.1, 59.2, 55.1, 54.9, 54.5, 56.8, 60.3, 64.8, 70.8, 77.3, 83.1, 87.0],
        )
24×3 DataFrame
 Row │ time                 temperature  humidity
     │ DateTime             Float64      Float64
─────┼────────────────────────────────────────────
   1 │ 2018-09-10T00:00:00         10.6      93.8
   2 │ 2018-09-10T01:00:00          9.5      96.1
   3 │ 2018-09-10T02:00:00          8.9      94.8
   4 │ 2018-09-10T03:00:00          8.9      92.4
   5 │ 2018-09-10T04:00:00          8.4      92.7
   6 │ 2018-09-10T05:00:00          8.4      97.3
   7 │ 2018-09-10T06:00:00          7.7     100.2
   8 │ 2018-09-10T07:00:00          8.9      96.2
  ⋮  │          ⋮                ⋮          ⋮
  18 │ 2018-09-10T17:00:00         21.2      56.8
  19 │ 2018-09-10T18:00:00         20.0      60.3
  20 │ 2018-09-10T19:00:00         18.4      64.8
  21 │ 2018-09-10T20:00:00         16.7      70.8
  22 │ 2018-09-10T21:00:00         15.0      77.3
  23 │ 2018-09-10T22:00:00         13.9      83.1
  24 │ 2018-09-10T23:00:00         12.7      87.0
                                    9 rows omitted

We want to create some data features based on the time of day. One way to do this is with the Periodic transform, specifying a period of 1 day:

julia> periodic = Periodic(sin, Day(1));

julia> feature_df = FeatureTransforms.apply_append(df, periodic, cols=:time, header=[:hour_of_day_sin])
24×4 DataFrame
 Row │ time                 temperature  humidity  hour_of_day_sin
     │ DateTime             Float64      Float64   Float64
─────┼─────────────────────────────────────────────────────────────
   1 │ 2018-09-10T00:00:00         10.6      93.8         0.0
   2 │ 2018-09-10T01:00:00          9.5      96.1         0.258819
   3 │ 2018-09-10T02:00:00          8.9      94.8         0.5
   4 │ 2018-09-10T03:00:00          8.9      92.4         0.707107
   5 │ 2018-09-10T04:00:00          8.4      92.7         0.866025
   6 │ 2018-09-10T05:00:00          8.4      97.3         0.965926
   7 │ 2018-09-10T06:00:00          7.7     100.2         1.0
   8 │ 2018-09-10T07:00:00          8.9      96.2         0.965926
  ⋮  │          ⋮                ⋮          ⋮             ⋮
  18 │ 2018-09-10T17:00:00         21.2      56.8        -0.965926
  19 │ 2018-09-10T18:00:00         20.0      60.3        -1.0
  20 │ 2018-09-10T19:00:00         18.4      64.8        -0.965926
  21 │ 2018-09-10T20:00:00         16.7      70.8        -0.866025
  22 │ 2018-09-10T21:00:00         15.0      77.3        -0.707107
  23 │ 2018-09-10T22:00:00         13.9      83.1        -0.5
  24 │ 2018-09-10T23:00:00         12.7      87.0        -0.258819
                                                     9 rows omitted

Now suppose we want to use the first 22 hours as training data and the last 2 hours as test data. Our input features are the temperature, humidity, and periodic encodings for the current hour, and the outputs to predict are the temperature and humidity for the next hour.

julia> train_df = feature_df[1:end-2, :];

julia> test_df = feature_df[end-1:end, :];

julia> output_cols = [:temperature, :humidity];

For many models it is helpful to standardise the training data. We can use StandardScaling for that purpose. Note that we are mutating the data frame in-place using apply! one column at a time.

julia> temp_scaling = StandardScaling();

julia> fit!(temp_scaling, train_df; cols=:temperature);

julia> hum_scaling = StandardScaling();

julia> fit!(hum_scaling, train_df; cols=:humidity);

julia> FeatureTransforms.apply!(train_df, temp_scaling; cols=:temperature);

julia> FeatureTransforms.apply!(train_df, hum_scaling; cols=:humidity)
22×4 DataFrame
 Row │ time                 temperature  humidity     hour_of_day_sin
     │ DateTime             Float64      Float64      Float64
─────┼────────────────────────────────────────────────────────────────
   1 │ 2018-09-10T00:00:00   -0.807635    0.98858         0.0
   2 │ 2018-09-10T01:00:00   -1.01916     1.12684         0.258819
   3 │ 2018-09-10T02:00:00   -1.13454     1.04869         0.5
   4 │ 2018-09-10T03:00:00   -1.13454     0.904422        0.707107
   5 │ 2018-09-10T04:00:00   -1.23068     0.922456        0.866025
   6 │ 2018-09-10T05:00:00   -1.23068     1.19897         0.965926
   7 │ 2018-09-10T06:00:00   -1.36529     1.3733          1.0
   8 │ 2018-09-10T07:00:00   -1.13454     1.13285         0.965926
  ⋮  │          ⋮                ⋮            ⋮              ⋮
  16 │ 2018-09-10T15:00:00    1.32683    -1.3498         -0.707107
  17 │ 2018-09-10T16:00:00    1.32683    -1.37385        -0.866025
  18 │ 2018-09-10T17:00:00    1.23068    -1.23559        -0.965926
  19 │ 2018-09-10T18:00:00    0.99993    -1.02519        -1.0
  20 │ 2018-09-10T19:00:00    0.692259   -0.754687       -0.965926
  21 │ 2018-09-10T20:00:00    0.365359   -0.394011       -0.866025
  22 │ 2018-09-10T21:00:00    0.0384588  -0.00327887     -0.707107
                                                        7 rows omitted

We can use the same scaling transform to standardise the test data:

julia> FeatureTransforms.apply!(test_df, temp_scaling; cols=:temperature);

julia> FeatureTransforms.apply!(test_df, hum_scaling; cols=:humidity)
2×4 DataFrame
 Row │ time                 temperature  humidity  hour_of_day_sin
     │ DateTime             Float64      Float64   Float64
─────┼─────────────────────────────────────────────────────────────
   1 │ 2018-09-10T22:00:00    -0.173065  0.345374        -0.5
   2 │ 2018-09-10T23:00:00    -0.403818  0.579814        -0.258819

Suppose we then train our model, and get a prediction for the test points. We can scale this back to the original units of temperature and humidity by using the inverse scaling:

julia> predictions = DataFrame([-0.36 0.61; -0.45 0.68], output_cols);

julia> FeatureTransforms.apply!(predictions, temp_scaling; cols=:temperature, inverse=true);

julia> FeatureTransforms.apply!(predictions, hum_scaling; cols=:humidity, inverse=true)
2×2 DataFrame
 Row │ temperature  humidity 
     │ Float64      Float64  
─────┼───────────────────────
   1 │     12.9279   87.5022
   2 │     12.4598   88.6666