Example

In this example, we're going to step through a set of common operations we typically perform when converting a collection of individually fetched features into a simple set of training features (X, y) and predict/testing features (X̂, ŷ).

Let's start by loading the packages we'll need.

using AxisKeys, AxisSets, DataFrames, Dates, Impute, Random, Statistics, TimeZones
using AxisSets: Pattern, flatten, rekey

Data

To see how AxisSets.jl can help with data wrangling problems, we're going to assume our dataset is a nested NamedTuple of DataFrames. We're using DataFrames for simplicity, but we could also construct our dataset from a LibPQ.Result or anything else that follows the Tables interface. Let's start by taking a look at our data. To make things easier, we're going to flatten the nested structure and display the column names for each DataFrame.

flattened = AxisSets.flatten(data)
d = Dict(flattened...)
Dict(k => names(v) for (k, v) in flattened)
Dict{Tuple{Symbol, Symbol, Symbol}, Vector{String}} with 8 entries:
  (:train, :input, :temp)      => ["time", "id", "temp"]
  (:predict, :output, :prices) => ["time", "id", "price"]
  (:predict, :input, :load)    => ["time", "id", "load"]
  (:predict, :input, :temp)    => ["time", "id", "temp"]
  (:train, :output, :prices)   => ["time", "id", "price"]
  (:train, :input, :prices)    => ["time", "id", "lag", "price"]
  (:train, :input, :load)      => ["time", "id", "load"]
  (:predict, :input, :prices)  => ["time", "id", "lag", "price"]

Something you may notice about our data is that each component has :time and :id columns which (together with :lag for the input prices) uniquely identify each value. Therefore we can represent our components more compactly as AxisKeys.KeyedArrays.

components = (
    k => allowmissing(wrapdims(v, Tables.columnnames(v)[end], Tables.columnnames(v)[1:end-1]...))
    for (k, v) in flattened
)
Base.Generator{Vector{Pair}, Main.ex-full.var"#19#20"}(Main.ex-full.var"#19#20"(), Pair[(:train, :input, :prices) => 2320×4 DataFrame
  Row │ time                 id      lag        price
      │ DateTime             Symbol  Dates.Day  Float64?
──────┼─────────────────────────────────────────────────────────
    1 │ 2021-01-01T00:00:00  a       -1 day           0.766797
    2 │ 2021-01-01T01:00:00  a       -1 day           0.460085
    3 │ 2021-01-01T02:00:00  a       -1 day           0.854147
    4 │ 2021-01-01T03:00:00  a       -1 day           0.298614
    5 │ 2021-01-01T04:00:00  a       -1 day           0.579672
    6 │ 2021-01-01T05:00:00  a       -1 day           0.0109059
    7 │ 2021-01-01T06:00:00  a       -1 day           0.956753
    8 │ 2021-01-01T07:00:00  a       -1 day           0.112486
  ⋮   │          ⋮             ⋮         ⋮             ⋮
 2314 │ 2021-01-06T18:00:00  d       -4 days          0.785768
 2315 │ 2021-01-06T19:00:00  d       -4 days          0.299575
 2316 │ 2021-01-06T20:00:00  d       -4 days    missing
 2317 │ 2021-01-06T21:00:00  d       -4 days          0.5247
 2318 │ 2021-01-06T22:00:00  d       -4 days          0.334544
 2319 │ 2021-01-06T23:00:00  d       -4 days          0.708389
 2320 │ 2021-01-07T00:00:00  d       -4 days          0.475858
                                               2305 rows omitted, (:train, :input, :load) => 290×3 DataFrame
 Row │ time                 id      load
     │ DateTime             Symbol  Float64?
─────┼────────────────────────────────────────
   1 │ 2021-01-01T00:00:00  p       0.498267
   2 │ 2021-01-01T01:00:00  p       0.767626
   3 │ 2021-01-01T02:00:00  p       0.920107
   4 │ 2021-01-01T03:00:00  p       0.590117
   5 │ 2021-01-01T04:00:00  p       0.594357
   6 │ 2021-01-01T05:00:00  p       0.58704
   7 │ 2021-01-01T06:00:00  p       0.993471
   8 │ 2021-01-01T07:00:00  p       0.279791
  ⋮  │          ⋮             ⋮         ⋮
 284 │ 2021-01-06T18:00:00  q       0.446769
 285 │ 2021-01-06T19:00:00  q       0.392469
 286 │ 2021-01-06T20:00:00  q       0.463188
 287 │ 2021-01-06T21:00:00  q       0.308557
 288 │ 2021-01-06T22:00:00  q       0.258191
 289 │ 2021-01-06T23:00:00  q       0.908562
 290 │ 2021-01-07T00:00:00  q       0.0673735
                              275 rows omitted, (:train, :input, :temp) => 435×3 DataFrame
 Row │ time                 id      temp
     │ DateTime             Symbol  Float64?
─────┼──────────────────────────────────────────────
   1 │ 2021-01-01T00:00:00  x             0.0852417
   2 │ 2021-01-01T01:00:00  x             0.863001
   3 │ 2021-01-01T02:00:00  x             0.842769
   4 │ 2021-01-01T03:00:00  x             0.536419
   5 │ 2021-01-01T04:00:00  x             0.455309
   6 │ 2021-01-01T05:00:00  x             0.449724
   7 │ 2021-01-01T06:00:00  x             0.32601
   8 │ 2021-01-01T07:00:00  x             0.713125
  ⋮  │          ⋮             ⋮            ⋮
 429 │ 2021-01-06T18:00:00  z             0.120013
 430 │ 2021-01-06T19:00:00  z             0.194426
 431 │ 2021-01-06T20:00:00  z             0.282915
 432 │ 2021-01-06T21:00:00  z             0.462426
 433 │ 2021-01-06T22:00:00  z             0.97443
 434 │ 2021-01-06T23:00:00  z             0.194879
 435 │ 2021-01-07T00:00:00  z       missing
                                    420 rows omitted, (:train, :output, :prices) => 580×3 DataFrame
 Row │ time                 id      price
     │ DateTime             Symbol  Float64?
─────┼────────────────────────────────────────
   1 │ 2021-01-02T00:00:00  a       0.303532
   2 │ 2021-01-02T01:00:00  a       0.25275
   3 │ 2021-01-02T02:00:00  a       0.799366
   4 │ 2021-01-02T03:00:00  a       0.555203
   5 │ 2021-01-02T04:00:00  a       0.189023
   6 │ 2021-01-02T05:00:00  a       0.0735751
   7 │ 2021-01-02T06:00:00  a       0.91517
   8 │ 2021-01-02T07:00:00  a       0.0538971
  ⋮  │          ⋮             ⋮         ⋮
 574 │ 2021-01-07T18:00:00  d       0.529445
 575 │ 2021-01-07T19:00:00  d       0.831529
 576 │ 2021-01-07T20:00:00  d       0.959442
 577 │ 2021-01-07T21:00:00  d       0.600782
 578 │ 2021-01-07T22:00:00  d       0.278689
 579 │ 2021-01-07T23:00:00  d       0.423005
 580 │ 2021-01-08T00:00:00  d       0.99701
                              565 rows omitted, (:predict, :input, :prices) => 400×4 DataFrame
 Row │ time                 id      lag        price
     │ DateTime             Symbol  Dates.Day  Float64?
─────┼──────────────────────────────────────────────────
   1 │ 2021-01-07T00:00:00  a       -1 day     0.827425
   2 │ 2021-01-07T01:00:00  a       -1 day     0.509469
   3 │ 2021-01-07T02:00:00  a       -1 day     0.582964
   4 │ 2021-01-07T03:00:00  a       -1 day     0.037312
   5 │ 2021-01-07T04:00:00  a       -1 day     0.668785
   6 │ 2021-01-07T05:00:00  a       -1 day     0.931221
   7 │ 2021-01-07T06:00:00  a       -1 day     0.88757
   8 │ 2021-01-07T07:00:00  a       -1 day     0.35648
  ⋮  │          ⋮             ⋮         ⋮         ⋮
 394 │ 2021-01-07T18:00:00  d       -4 days    0.148835
 395 │ 2021-01-07T19:00:00  d       -4 days    0.177516
 396 │ 2021-01-07T20:00:00  d       -4 days    0.956917
 397 │ 2021-01-07T21:00:00  d       -4 days    0.184215
 398 │ 2021-01-07T22:00:00  d       -4 days    0.461396
 399 │ 2021-01-07T23:00:00  d       -4 days    0.599939
 400 │ 2021-01-08T00:00:00  d       -4 days    0.77756
                                        385 rows omitted, (:predict, :input, :load) => 50×3 DataFrame
 Row │ time                 id      load
     │ DateTime             Symbol  Float64
─────┼────────────────────────────────────────
   1 │ 2021-01-07T00:00:00  p       0.227583
   2 │ 2021-01-07T01:00:00  p       0.554615
   3 │ 2021-01-07T02:00:00  p       0.587662
   4 │ 2021-01-07T03:00:00  p       0.810217
   5 │ 2021-01-07T04:00:00  p       0.832929
   6 │ 2021-01-07T05:00:00  p       0.153034
   7 │ 2021-01-07T06:00:00  p       0.862698
   8 │ 2021-01-07T07:00:00  p       0.486309
  ⋮  │          ⋮             ⋮         ⋮
  44 │ 2021-01-07T18:00:00  q       0.975243
  45 │ 2021-01-07T19:00:00  q       0.306591
  46 │ 2021-01-07T20:00:00  q       0.327716
  47 │ 2021-01-07T21:00:00  q       0.998151
  48 │ 2021-01-07T22:00:00  q       0.279987
  49 │ 2021-01-07T23:00:00  q       0.3156
  50 │ 2021-01-08T00:00:00  q       0.929448
                               35 rows omitted, (:predict, :input, :temp) => 75×3 DataFrame
 Row │ time                 id      temp
     │ DateTime             Symbol  Float64?
─────┼───────────────────────────────────────
   1 │ 2021-01-07T00:00:00  x       0.425157
   2 │ 2021-01-07T01:00:00  x       0.930817
   3 │ 2021-01-07T02:00:00  x       0.803252
   4 │ 2021-01-07T03:00:00  x       0.316607
   5 │ 2021-01-07T04:00:00  x       0.853816
   6 │ 2021-01-07T05:00:00  x       0.389236
   7 │ 2021-01-07T06:00:00  x       0.925086
   8 │ 2021-01-07T07:00:00  x       0.532332
  ⋮  │          ⋮             ⋮        ⋮
  69 │ 2021-01-07T18:00:00  z       0.669497
  70 │ 2021-01-07T19:00:00  z       0.176072
  71 │ 2021-01-07T20:00:00  z       0.704747
  72 │ 2021-01-07T21:00:00  z       0.356378
  73 │ 2021-01-07T22:00:00  z       0.208576
  74 │ 2021-01-07T23:00:00  z       0.992507
  75 │ 2021-01-08T00:00:00  z       0.235793
                              60 rows omitted, (:predict, :output, :prices) => 100×3 DataFrame
 Row │ time                 id      price
     │ DateTime             Symbol  Float64?
─────┼────────────────────────────────────────────────
   1 │ 2021-01-08T00:00:00  a             0.000600888
   2 │ 2021-01-08T01:00:00  a             0.232931
   3 │ 2021-01-08T02:00:00  a             0.30747
   4 │ 2021-01-08T03:00:00  a             0.507047
   5 │ 2021-01-08T04:00:00  a             0.573479
   6 │ 2021-01-08T05:00:00  a             0.82323
   7 │ 2021-01-08T06:00:00  a             0.553864
   8 │ 2021-01-08T07:00:00  a             0.264023
  ⋮  │          ⋮             ⋮             ⋮
  94 │ 2021-01-08T18:00:00  d             0.33525
  95 │ 2021-01-08T19:00:00  d             0.939387
  96 │ 2021-01-08T20:00:00  d             0.193733
  97 │ 2021-01-08T21:00:00  d             0.252985
  98 │ 2021-01-08T22:00:00  d             0.558636
  99 │ 2021-01-08T23:00:00  d             0.929831
 100 │ 2021-01-09T00:00:00  d       missing
                                       85 rows omitted])

This representation avoids storing duplicate :time and :id column values and allows us to perform normal n-dimensional array operations over the dataset more efficiently.
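As a minimal sketch of that conversion, the toy table below (made-up values, with integers standing in for timestamps) is wrapped into a KeyedArray using the table form of wrapdims from AxisKeys, taking :price as the values and :time and :id as the dimensions. The table and its contents are assumptions for illustration only.

```julia
using AxisKeys, DataFrames

# Toy long-format table: one row per (time, id) pair. Values are made up.
df = DataFrame(
    time = [1, 1, 2, 2],
    id = [:a, :b, :a, :b],
    price = [0.1, 0.2, 0.3, 0.4],
)

# The value column comes first, followed by the columns that become dimensions.
A = wrapdims(df, :price, :time, :id)

# Each key is now stored once per dimension rather than once per row.
axiskeys(A, :time)
axiskeys(A, :id)

# Values can be looked up by key instead of by row position.
A(2, :b)
```

The four rows collapse into a 2×2 array, and the repeated :time and :id entries are replaced by one key vector per dimension.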

If we look a little closer we'll also find that several of these "key" columns align across the dataframes, while others do not.

For example, the :time columns align across the (:train, :input, ...) tables. Similarly, the :id columns match for (:train, :input, :prices) and (:train, :output, :prices).

@assert issetequal(d[(:train, :input, :temp)].time, d[(:train, :input, :load)].time)
@assert issetequal(d[(:train, :input, :prices)].id, d[(:train, :output, :prices)].id)

However, not all time or id columns need to align.

@assert !issetequal(d[(:train, :input, :prices)].time, d[(:train, :output, :prices)].time)
@assert !issetequal(d[(:train, :input, :temp)].id, d[(:train, :input, :load)].id)

It turns out we can summarize these alignment "constraints" pretty concisely.

  1. All time columns must align within each of the 4 train/predict x input/output combinations.
  2. All id columns must align for each of prices, temp and load.

With AxisSets.jl we can declaratively state these alignment assumptions using Patterns.

Constraint patterns on :time

time_constraints = Pattern[
    # All train input time keys should match
    (:train, :input, :_, :time),

    # All train output time keys should match
    (:train, :output, :_, :time),

    # All predict input time keys should match
    (:predict, :input, :_, :time),

    # All predict output time keys should match
    (:predict, :output, :_, :time),
]
4-element Vector{AxisSets.Pattern}:
 Pattern((:train, :input, :_, :time))
 Pattern((:train, :output, :_, :time))
 Pattern((:predict, :input, :_, :time))
 Pattern((:predict, :output, :_, :time))

Constraint patterns on :id

id_constraints = Pattern[
    # All ids for each data type should align across train/predict and input/output
    (:__, :prices, :id),
    (:__, :temp, :id),
    (:__, :load, :id),
]
3-element Vector{AxisSets.Pattern}:
 Pattern((:__, :prices, :id))
 Pattern((:__, :temp, :id))
 Pattern((:__, :load, :id))
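To make the wildcard semantics concrete, here is a small plain-Julia sketch of how such patterns could be matched against a full (component..., dimension) key. It assumes that :_ stands for exactly one segment and that :__ stands for any number of segments (zero or more). The matches function is a hypothetical helper written for this illustration and is not AxisSets.jl's implementation.

```julia
# Hypothetical helper, not part of AxisSets.jl.
# Assumption: :_ matches exactly one segment, :__ matches zero or more segments.
function matches(pattern::Tuple, key::Tuple)
    isempty(pattern) && return isempty(key)
    head, rest = first(pattern), Base.tail(pattern)
    if head === :__
        # Let :__ absorb 0, 1, 2, ... leading segments of `key`.
        return any(matches(rest, key[(n + 1):end]) for n in 0:length(key))
    end
    isempty(key) && return false
    return (head === :_ || head == first(key)) && matches(rest, Base.tail(key))
end

# Time patterns pin train/predict and input/output, but accept any data type.
matches((:train, :input, :_, :time), (:train, :input, :temp, :time))    # true
matches((:train, :input, :_, :time), (:train, :output, :prices, :time)) # false

# Id patterns pin the data type, but accept any leading segments.
matches((:__, :prices, :id), (:predict, :input, :prices, :id))          # true
matches((:__, :prices, :id), (:train, :input, :temp, :id))              # false
```

Under these assumptions, every dimension whose full key matches the same pattern is expected to share one set of key values.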

KeyedDataset

How can we make the constraint Patterns and component KeyedArrays more useful to us? Well, we can now combine our constraints and component KeyedArrays into a KeyedDataset.

ds = KeyedDataset(components...; constraints=vcat(time_constraints, id_constraints))
KeyedDataset with:
  8 components
    (:train, :input, :prices) => 145x4x4 KeyedArray{Union{Missing, Float64}} with dimension time[1], id[5], lag
    (:train, :input, :load) => 145x2 KeyedArray{Union{Missing, Float64}} with dimension time[1], id[7]
    (:train, :input, :temp) => 145x3 KeyedArray{Union{Missing, Float64}} with dimension time[1], id[6]
    (:train, :output, :prices) => 145x4 KeyedArray{Union{Missing, Float64}} with dimension time[2], id[5]
    (:predict, :input, :prices) => 25x4x4 KeyedArray{Union{Missing, Float64}} with dimension time[3], id[5], lag
    (:predict, :input, :load) => 25x2 KeyedArray{Union{Missing, Float64}} with dimension time[3], id[7]
    (:predict, :input, :temp) => 25x3 KeyedArray{Union{Missing, Float64}} with dimension time[3], id[6]
    (:predict, :output, :prices) => 25x4 KeyedArray{Union{Missing, Float64}} with dimension time[4], id[5]
  7 constraints
    [1] (:train, :input, :_, :time) ∈ 145-element Vector{Dates.DateTime}
    [2] (:train, :output, :_, :time) ∈ 145-element Vector{Dates.DateTime}
    [3] (:predict, :input, :_, :time) ∈ 25-element Vector{Dates.DateTime}
    [4] (:predict, :output, :_, :time) ∈ 25-element Vector{Dates.DateTime}
    [5] (:__, :prices, :id) ∈ 4-element Vector{Symbol}
    [6] (:__, :temp, :id) ∈ 3-element Vector{Symbol}
    [7] (:__, :load, :id) ∈ 2-element Vector{Symbol}

The objective of this type is to address two primary issues:

  1. Ensure that our data wrangling operations won't violate our constraints outlined above.
  2. Provide batched operations to minimize verbose data wrangling operations.

Let's perform some common operations:

We often want to filter out ids if they have too many missing values. Let's try applying this filtering rule to each component of our dataset independently.

unique(
    axiskeys(Impute.filter(x -> mean(ismissing, x) < 0.1, v; dims=:id), :id)
    for (k, v) in ds.data
)
4-element Vector{Vector{Symbol}}:
 [:a, :b, :c, :d]
 [:p, :q]
 [:x, :y, :z]
 [:a, :b, :d]

We can see that doing this results in inconsistent :id keys across our components. Now let's try applying a batched version of that filtering rule across the entire dataset.

ds = Impute.filter(x -> mean(ismissing, x) < 0.1, ds; dims=:id)
unique(axiskeys(ds, :id))
3-element Vector{Vector{Symbol}}:
 [:a, :b, :d]
 [:p, :q]
 [:x, :y, :z]
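The difference between the two results is consistent with keeping only the ids that pass the rule in every component sharing a constraint. The plain-Julia sketch below mimics that observable effect on two toy matrices whose columns share the same ids. The data and the passing helper are assumptions made for illustration, and this is not how AxisSets.jl is implemented internally.

```julia
using Statistics

# Two toy components whose columns are keyed by the same ids (values are made up).
ids = [:a, :b, :c]
input = [1.0 2.0 3.0; 4.0 missing 6.0]
output = [1.0 2.0 missing; 4.0 5.0 6.0]

# Hypothetical helper: ids whose column has less than 10% missing values.
passing(A) = [id for (id, col) in zip(ids, eachcol(A)) if mean(ismissing, col) < 0.1]

# Filtering each component on its own gives inconsistent ids.
kept_input = passing(input)    # :b is dropped here
kept_output = passing(output)  # :c is dropped here

# Keeping only ids that pass everywhere restores a single shared set of keys.
shared = intersect(kept_input, kept_output)
```

Here only :a survives in both components, so it is the only id that can be kept without breaking the alignment between them.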

Notice how our returned KeyedDataset respects the :id constraints we provided above. Another kind of filtering we often do is dropping hours with any missing data after this point.

ds = Impute.filter(ds; dims=:time)
unique(axiskeys(ds, :time))
4-element Vector{Vector{Dates.DateTime}}:
 [Dates.DateTime("2021-01-01T01:00:00"), Dates.DateTime("2021-01-01T02:00:00"), Dates.DateTime("2021-01-01T06:00:00"), Dates.DateTime("2021-01-01T07:00:00"), Dates.DateTime("2021-01-01T08:00:00"), Dates.DateTime("2021-01-01T09:00:00"), Dates.DateTime("2021-01-01T10:00:00"), Dates.DateTime("2021-01-01T13:00:00"), Dates.DateTime("2021-01-01T14:00:00"), Dates.DateTime("2021-01-01T15:00:00")  …  Dates.DateTime("2021-01-06T02:00:00"), Dates.DateTime("2021-01-06T04:00:00"), Dates.DateTime("2021-01-06T05:00:00"), Dates.DateTime("2021-01-06T08:00:00"), Dates.DateTime("2021-01-06T09:00:00"), Dates.DateTime("2021-01-06T12:00:00"), Dates.DateTime("2021-01-06T13:00:00"), Dates.DateTime("2021-01-06T15:00:00"), Dates.DateTime("2021-01-06T19:00:00"), Dates.DateTime("2021-01-06T21:00:00")]
 [Dates.DateTime("2021-01-02T00:00:00"), Dates.DateTime("2021-01-02T01:00:00"), Dates.DateTime("2021-01-02T02:00:00"), Dates.DateTime("2021-01-02T03:00:00"), Dates.DateTime("2021-01-02T04:00:00"), Dates.DateTime("2021-01-02T05:00:00"), Dates.DateTime("2021-01-02T07:00:00"), Dates.DateTime("2021-01-02T08:00:00"), Dates.DateTime("2021-01-02T09:00:00"), Dates.DateTime("2021-01-02T10:00:00")  …  Dates.DateTime("2021-01-07T13:00:00"), Dates.DateTime("2021-01-07T14:00:00"), Dates.DateTime("2021-01-07T15:00:00"), Dates.DateTime("2021-01-07T16:00:00"), Dates.DateTime("2021-01-07T17:00:00"), Dates.DateTime("2021-01-07T18:00:00"), Dates.DateTime("2021-01-07T19:00:00"), Dates.DateTime("2021-01-07T20:00:00"), Dates.DateTime("2021-01-07T21:00:00"), Dates.DateTime("2021-01-07T22:00:00")]
 [Dates.DateTime("2021-01-07T00:00:00"), Dates.DateTime("2021-01-07T01:00:00"), Dates.DateTime("2021-01-07T06:00:00"), Dates.DateTime("2021-01-07T07:00:00"), Dates.DateTime("2021-01-07T08:00:00"), Dates.DateTime("2021-01-07T09:00:00"), Dates.DateTime("2021-01-07T10:00:00"), Dates.DateTime("2021-01-07T12:00:00"), Dates.DateTime("2021-01-07T13:00:00"), Dates.DateTime("2021-01-07T14:00:00"), Dates.DateTime("2021-01-07T15:00:00"), Dates.DateTime("2021-01-07T16:00:00"), Dates.DateTime("2021-01-07T17:00:00"), Dates.DateTime("2021-01-07T18:00:00"), Dates.DateTime("2021-01-07T19:00:00"), Dates.DateTime("2021-01-07T20:00:00"), Dates.DateTime("2021-01-07T23:00:00")]
 [Dates.DateTime("2021-01-08T00:00:00"), Dates.DateTime("2021-01-08T01:00:00"), Dates.DateTime("2021-01-08T02:00:00"), Dates.DateTime("2021-01-08T03:00:00"), Dates.DateTime("2021-01-08T04:00:00"), Dates.DateTime("2021-01-08T05:00:00"), Dates.DateTime("2021-01-08T06:00:00"), Dates.DateTime("2021-01-08T07:00:00"), Dates.DateTime("2021-01-08T08:00:00"), Dates.DateTime("2021-01-08T09:00:00")  …  Dates.DateTime("2021-01-08T14:00:00"), Dates.DateTime("2021-01-08T15:00:00"), Dates.DateTime("2021-01-08T16:00:00"), Dates.DateTime("2021-01-08T17:00:00"), Dates.DateTime("2021-01-08T18:00:00"), Dates.DateTime("2021-01-08T19:00:00"), Dates.DateTime("2021-01-08T20:00:00"), Dates.DateTime("2021-01-08T21:00:00"), Dates.DateTime("2021-01-08T22:00:00"), Dates.DateTime("2021-01-08T23:00:00")]

You'll notice that we may have up to 4 unique sets of :time keys among our 8 components. This is because we only expect keys to align within each train/predict and input/output combination, as described above.

Finally, since no missing values remain, we can apply disallowmissing to each component KeyedArray to narrow its element type to Float64.

ds = map(disallowmissing, ds)
KeyedDataset with:
  8 components
    (:train, :input, :prices) => 78x3x4 KeyedArray{Float64} with dimension time[1], id[5], lag
    (:train, :input, :load) => 78x2 KeyedArray{Float64} with dimension time[1], id[7]
    (:train, :input, :temp) => 78x3 KeyedArray{Float64} with dimension time[1], id[6]
    (:train, :output, :prices) => 123x3 KeyedArray{Float64} with dimension time[2], id[5]
    (:predict, :input, :prices) => 17x3x4 KeyedArray{Float64} with dimension time[3], id[5], lag
    (:predict, :input, :load) => 17x2 KeyedArray{Float64} with dimension time[3], id[7]
    (:predict, :input, :temp) => 17x3 KeyedArray{Float64} with dimension time[3], id[6]
    (:predict, :output, :prices) => 24x3 KeyedArray{Float64} with dimension time[4], id[5]
  7 constraints
    [1] (:train, :input, :_, :time) ∈ 78-element Vector{Dates.DateTime}
    [2] (:train, :output, :_, :time) ∈ 123-element Vector{Dates.DateTime}
    [3] (:predict, :input, :_, :time) ∈ 17-element Vector{Dates.DateTime}
    [4] (:predict, :output, :_, :time) ∈ 24-element Vector{Dates.DateTime}
    [5] (:__, :prices, :id) ∈ 3-element Vector{Symbol}
    [6] (:__, :temp, :id) ∈ 3-element Vector{Symbol}
    [7] (:__, :load, :id) ∈ 2-element Vector{Symbol}

Another common operation is to mutate the key values in batches. In this case, we'll say that we need to convert the :time keys to ZonedDateTimes.

ds = rekey(k -> ZonedDateTime.(k, tz"UTC"), ds, :time)
KeyedDataset with:
  8 components
    (:train, :input, :prices) => 78x3x4 KeyedArray{Float64} with dimension time[1], id[5], lag
    (:train, :input, :load) => 78x2 KeyedArray{Float64} with dimension time[1], id[7]
    (:train, :input, :temp) => 78x3 KeyedArray{Float64} with dimension time[1], id[6]
    (:train, :output, :prices) => 123x3 KeyedArray{Float64} with dimension time[2], id[5]
    (:predict, :input, :prices) => 17x3x4 KeyedArray{Float64} with dimension time[3], id[5], lag
    (:predict, :input, :load) => 17x2 KeyedArray{Float64} with dimension time[3], id[7]
    (:predict, :input, :temp) => 17x3 KeyedArray{Float64} with dimension time[3], id[6]
    (:predict, :output, :prices) => 24x3 KeyedArray{Float64} with dimension time[4], id[5]
  7 constraints
    [1] (:train, :input, :_, :time) ∈ 78-element Vector{TimeZones.ZonedDateTime}
    [2] (:train, :output, :_, :time) ∈ 123-element Vector{TimeZones.ZonedDateTime}
    [3] (:predict, :input, :_, :time) ∈ 17-element Vector{TimeZones.ZonedDateTime}
    [4] (:predict, :output, :_, :time) ∈ 24-element Vector{TimeZones.ZonedDateTime}
    [5] (:__, :prices, :id) ∈ 3-element Vector{Symbol}
    [6] (:__, :temp, :id) ∈ 3-element Vector{Symbol}
    [7] (:__, :load, :id) ∈ 2-element Vector{Symbol}

Okay, so now that all of our data manipulation is complete, we want to combine all our components into 4 simple 2-d matrices.

results = (
    X = hcat(
        flatten(ds[(:train, :input, :prices)], (:id, :lag) => :id),
        ds[(:train, :input, :temp)],
        ds[(:train, :input, :load)],
    ),
    y = ds[(:train, :output, :prices)],
    X̂ = hcat(
        flatten(ds[(:predict, :input, :prices)], (:id, :lag) => :id),
        ds[(:predict, :input, :temp)],
        ds[(:predict, :input, :load)],
    ),
    ŷ = ds[(:predict, :output, :prices)],
)
results.X
2-dimensional KeyedArray(NamedDimsArray(...)) with keys:
↓   time ∈ 78-element Vector{TimeZones.ZonedDateTime}
→   id ∈ 17-element Vector{Any}
And data, 78×17 Matrix{Float64}:
                                            …  (:z)       (:p)          (:q)
   ZonedDateTime(2021, 1, 1, 1, tz"UTC")        0.613292   0.767626      0.0821134
   ZonedDateTime(2021, 1, 1, 2, tz"UTC")        0.231773   0.920107      0.543866
   ZonedDateTime(2021, 1, 1, 6, tz"UTC")        0.870389   0.993471      0.540498
   ZonedDateTime(2021, 1, 1, 7, tz"UTC")        0.88743    0.279791      0.394165
   ZonedDateTime(2021, 1, 1, 8, tz"UTC")    …   0.62167    0.547112      0.712014
   ZonedDateTime(2021, 1, 1, 9, tz"UTC")        0.872754   0.214497      0.208461
   ⋮                                        ⋱   ⋮                       
   ZonedDateTime(2021, 1, 6, 8, tz"UTC")        0.944367   0.795146      0.664972
   ZonedDateTime(2021, 1, 6, 9, tz"UTC")        0.611067   0.592278      0.113658
   ZonedDateTime(2021, 1, 6, 12, tz"UTC")   …   0.790809   0.696242      0.560985
   ZonedDateTime(2021, 1, 6, 13, tz"UTC")       0.772407   0.0932628     0.620784
   ZonedDateTime(2021, 1, 6, 15, tz"UTC")       0.528693   0.000701836   0.949602
   ZonedDateTime(2021, 1, 6, 19, tz"UTC")       0.194426   0.280996      0.392469
   ZonedDateTime(2021, 1, 6, 21, tz"UTC")       0.462426   0.397736      0.308557
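The 17 columns of results.X can be accounted for by hand: flattening the 78×3×4 prices array over (:id, :lag) gives 3 × 4 = 12 columns, and concatenating the 3 temp and 2 load columns gives 12 + 3 + 2 = 17. The sketch below reproduces that shape arithmetic with plain arrays. It uses random stand-in data and placeholder id and lag labels, and its column-major pairing of ids and lags is an assumption, since the ordering AxisSets.flatten uses is not shown above.

```julia
# Random stand-ins with the same shapes as the train input components.
prices = rand(78, 3, 4)  # time × id × lag
temp = rand(78, 3)       # time × id
load = rand(78, 2)       # time × id

# Collapse the id and lag dimensions into a single column dimension.
flat_prices = reshape(prices, size(prices, 1), :)

# Concatenate all features along the column dimension.
X = hcat(flat_prices, temp, load)
@assert size(X) == (78, 17)

# With column-major reshaping, the id index varies fastest across columns.
ids, lags = [:a, :b, :d], 1:4  # placeholder labels for illustration
flat_keys = vec([(id, lag) for id in ids, lag in lags])
@assert length(flat_keys) == size(flat_prices, 2)
```

This also suggests why the combined :id keys are reported as a Vector{Any}: the flattened prices columns would be keyed by (id, lag) pairs, while the temp and load columns keep plain ids.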