Imputation
Impute.Imputor — Type
Imputor
An imputor stores information about imputing values in AbstractArrays and Tables.tables. New imputation methods are expected to subtype Imputor and, at minimum, implement the _impute!(data::AbstractArray, imp::<MyImputor>) method.
While fallback impute and impute! methods are provided to extend your _impute! methods to n-dimensional arrays and tables, you can always override these methods to change the behaviour as necessary.
Impute.impute! — Method
impute!(data::AbstractArray, imp) -> data
Just returns the data when the array doesn't contain missings.
Impute.impute! — Method
impute!(data::AbstractArray{Missing}, imp) -> data
Just returns the data when the array only contains missings.
Impute.impute! — Method
impute!(data::A, imp; dims=:, kwargs...) -> A
Impute the missing values in the array data using the imputor imp. Optionally, you can specify the dimension to impute along.
Arguments
- data::AbstractArray{Union{T, Missing}}: the data to be imputed along dimensions dims
- imp::Imputor: the Imputor method to use
Keyword Arguments
- dims=:: The dimension to impute along. :rows and :cols are also supported for matrices.
Returns
- AbstractArray{Union{T, Missing}}: the input data with values imputed
NOTES
- Matrices have a deprecated dims=2 special case as dims=: is a breaking change
- Mutation isn't guaranteed for all array types, hence we return the result
- eachslice is used internally which requires Julia 1.1
Example
julia> using Impute: Interpolate, impute!
julia> M = [1.0 2.0 missing missing 5.0; 1.1 2.2 3.3 missing 5.5]
2×5 Matrix{Union{Missing, Float64}}:
1.0 2.0 missing missing 5.0
1.1 2.2 3.3 missing 5.5
julia> impute!(M, Interpolate(); dims=1)
2×5 Matrix{Union{Missing, Float64}}:
1.0 2.0 3.0 4.0 5.0
1.1 2.2 3.3 4.4 5.5
julia> M
2×5 Matrix{Union{Missing, Float64}}:
1.0 2.0 3.0 4.0 5.0
1.1 2.2 3.3 4.4 5.5
Impute.impute! — Method
impute!(table, imp; cols=nothing) -> table
Imputes the data in a table by imputing the values 1 column at a time; if this is not the desired behaviour, custom imputor methods should overload this method.
Arguments
- imp::Imputor: the Imputor method to use
- table: the data to impute
Keyword Arguments
cols: The columns to impute along (default is to impute all columns)
Returns
- the input data with values imputed
Example
julia> using DataFrames; using Impute: Interpolate, impute
julia> df = DataFrame(:a => [1.0, 2.0, missing, missing, 5.0], :b => [1.1, 2.2, 3.3, missing, 5.5])
5×2 DataFrame
Row │ a b
│ Float64? Float64?
─────┼──────────────────────
1 │ 1.0 1.1
2 │ 2.0 2.2
3 │ missing 3.3
4 │ missing missing
5 │ 5.0 5.5
julia> impute(df, Interpolate())
5×2 DataFrame
Row │ a b
│ Float64? Float64?
─────┼────────────────────
1 │ 1.0 1.1
2 │ 2.0 2.2
3 │ 3.0 3.3
4 │ 4.0 4.4
5 │ 5.0 5.5
Impute.impute! — Method
impute!(data::T, imp; kwargs...) -> T where T <: AbstractVector{<:NamedTuple}
Special-cases rowtables, which are arrays, so that they fall back to the tables method.
Impute.impute — Method
impute(data::T, imp; kwargs...) -> T
Returns a new copy of the data with the missing data imputed by the imputor imp. For matrices and tables, data is imputed one variable/column at a time. If this is not the desired behaviour then you should overload this method or specify a different dims value.
Arguments
- data: the data to be imputed
- imp::Imputor: the Imputor method to use
Returns
- the input data with values imputed
Example
julia> using Impute: Interpolate, impute
julia> v = [1.0, 2.0, missing, missing, 5.0]
5-element Vector{Union{Missing, Float64}}:
1.0
2.0
missing
missing
5.0
julia> impute(v, Interpolate())
5-element Vector{Union{Missing, Float64}}:
1.0
2.0
3.0
4.0
5.0
Interpolation
Impute.interp — Function
Impute.interp(data; dims=1)
Performs linear interpolation between the nearest values in a vector. See Impute.Interpolate for details.
Example
julia> using DataFrames; using Impute: Impute
julia> df = DataFrame(:a => [1.0, 2.0, missing, missing, 5.0], :b => [1.1, 2.2, 3.3, missing, 5.5])
5×2 DataFrame
Row │ a b
│ Float64? Float64?
─────┼──────────────────────
1 │ 1.0 1.1
2 │ 2.0 2.2
3 │ missing 3.3
4 │ missing missing
5 │ 5.0 5.5
julia> Impute.interp(df)
5×2 DataFrame
Row │ a b
│ Float64? Float64?
─────┼────────────────────
1 │ 1.0 1.1
2 │ 2.0 2.2
3 │ 3.0 3.3
4 │ 4.0 4.4
5 │ 5.0 5.5
Impute.Interpolate — Type
Interpolate(; limit=nothing, r=nothing)
Performs linear interpolation between the nearest values in a vector. The current implementation is univariate, so each variable in a table or matrix will be handled independently.
Note: Missing values at the head or tail of the array cannot be interpolated if there are no existing values on both sides. As a result, this method does not guarantee that all missing values will be imputed.
Keyword Arguments
- limit::Union{UInt, Nothing}: Optionally limit the gap sizes that can be interpolated.
- r::Union{RoundingMode, Nothing}: Optionally specify a rounding mode. Avoids InexactErrors when interpolating over integers.
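The gap-filling behaviour described above can be illustrated with a minimal sketch in plain Julia. It is not the package's implementation: `interp_sketch` and its `limit` argument are hypothetical names that only mirror the documented keyword, and the rounding option `r` is left out. Missings at the head or tail are left untouched, in line with the note above.

```julia
# Hypothetical helper: linear interpolation over interior gaps of a vector
# that may contain `missing`. Gaps longer than `limit` are skipped.
function interp_sketch(v::AbstractVector; limit=nothing)
    out = copy(v)
    i = findfirst(!ismissing, out)       # index of the first observed value
    i === nothing && return out          # nothing to interpolate from
    while i < lastindex(out)
        if ismissing(out[i + 1])
            # Locate the next observed value after the gap.
            j = findnext(!ismissing, out, i + 1)
            j === nothing && break       # trailing missings stay missing
            gap = j - i - 1
            if limit === nothing || gap <= limit
                step = (out[j] - out[i]) / (j - i)
                for k in 1:gap
                    out[i + k] = out[i] + k * step
                end
            end
            i = j
        else
            i += 1
        end
    end
    return out
end

interp_sketch([1.0, 2.0, missing, missing, 5.0])                    # fills 3.0 and 4.0
interp_sketch([1.0, 2.0, missing, missing, missing, 6.0]; limit=2)  # gap of 3 is left as-is
```

Skipping a whole gap when it exceeds `limit`, rather than filling part of it, is an assumption chosen to agree with the matrix example in this entry.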
Example
julia> using Impute: Interpolate, impute
julia> M = [1.0 2.0 missing missing missing 6.0; 1.1 missing missing 4.4 5.5 6.6]
2×6 Matrix{Union{Missing, Float64}}:
1.0 2.0 missing missing missing 6.0
1.1 missing missing 4.4 5.5 6.6
julia> impute(M, Interpolate(); dims=:rows)
2×6 Matrix{Union{Missing, Float64}}:
1.0 2.0 3.0 4.0 5.0 6.0
1.1 2.2 3.3 4.4 5.5 6.6
julia> impute(M, Interpolate(; limit=2); dims=:rows)
2×6 Matrix{Union{Missing, Float64}}:
1.0 2.0 missing missing missing 6.0
1.1 2.2 3.3 4.4 5.5 6.6
K-Nearest Neighbors (KNN)
Impute.knn — Function
Impute.knn(; k=1, threshold=0.5, dist=Euclidean())
Imputation using the k-Nearest Neighbor algorithm.
Keyword Arguments
- k::Int: number of nearest neighbors
- dist::MinkowskiMetric: distance metric supported by NearestNeighbors.jl (Euclidean, Chebyshev, Minkowski and Cityblock)
- threshold::AbstractFloat: threshold for missing neighbors
Reference
- Troyanskaya, Olga, et al. "Missing value estimation methods for DNA microarrays." Bioinformatics 17.6 (2001): 520-525.
Example
julia> using Impute, Missings
julia> data = allowmissing(reshape(sin.(1:20), 5, 4)); data[[2, 3, 7, 9, 13, 19]] .= missing; data
5×4 Matrix{Union{Missing, Float64}}:
0.841471 -0.279415 -0.99999 -0.287903
missing missing -0.536573 -0.961397
missing 0.989358 missing -0.750987
-0.756802 missing 0.990607 missing
-0.958924 -0.544021 0.650288 0.912945
julia> result = Impute.knn(data; dims=:cols)
5×4 Matrix{Union{Missing, Float64}}:
0.841471 -0.279415 -0.99999 -0.287903
-0.756802 0.989358 -0.536573 -0.961397
-0.756802 0.989358 -0.536573 -0.750987
-0.756802 -0.544021 0.990607 0.912945
-0.958924 -0.544021 0.650288 0.912945
Impute.KNN — Type
KNN(; kwargs...)
Imputation using the k-Nearest Neighbor algorithm.
Keyword Arguments
- k::Int: number of nearest neighbors
- dist::MinkowskiMetric: distance metric supported by NearestNeighbors.jl (Euclidean, Chebyshev, Minkowski and Cityblock)
- threshold::AbstractFloat: threshold for missing neighbors
Reference
- Troyanskaya, Olga, et al. "Missing value estimation methods for DNA microarrays." Bioinformatics 17.6 (2001): 520-525.
Last Observation Carried Forward (LOCF)
Impute.locf — Function
Impute.locf(data; dims=1)
Iterates forwards through the data and fills missing data with the last existing observation. See Impute.LOCF for details.
Example
julia> using DataFrames; using Impute: Impute
julia> df = DataFrame(:a => [1.0, 2.0, missing, missing, 5.0], :b => [1.1, 2.2, 3.3, missing, 5.5])
5×2 DataFrame
Row │ a b
│ Float64? Float64?
─────┼──────────────────────
1 │ 1.0 1.1
2 │ 2.0 2.2
3 │ missing 3.3
4 │ missing missing
5 │ 5.0 5.5
julia> Impute.locf(df)
5×2 DataFrame
Row │ a b
│ Float64? Float64?
─────┼────────────────────
1 │ 1.0 1.1
2 │ 2.0 2.2
3 │ 2.0 3.3
4 │ 2.0 3.3
5 │ 5.0 5.5
Impute.LOCF — Type
LOCF(; limit=nothing)
Last observation carried forward (LOCF) iterates forwards through the data and fills missing data with the last existing observation. The current implementation is univariate, so each variable in a table or matrix will be handled independently.
See also:
Impute.NOCB: Next Observation Carried Backward
Note: Missing elements at the head of the array may not be imputed if there is no existing observation to carry forward. As a result, this method does not guarantee that all missing values will be imputed.
Keyword Arguments
- limit::Union{UInt, Nothing}: Optionally limits the number of consecutive missing values to replace.
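As a rough illustration of the forward-fill rule and the `limit` keyword, here is a minimal sketch in plain Julia. The function name `locf_sketch` is hypothetical and the code is not taken from the package; it only reproduces the behaviour described in this entry for a single vector.

```julia
# Hypothetical helper: carry the last observed value forward, replacing at
# most `limit` consecutive missings after each observation.
function locf_sketch(v::AbstractVector; limit=nothing)
    out = copy(v)
    last_obs = missing   # most recent observed value, if any
    run = 0              # consecutive missings seen since the last observation
    for i in eachindex(out)
        if ismissing(out[i])
            run += 1
            if !ismissing(last_obs) && (limit === nothing || run <= limit)
                out[i] = last_obs
            end
        else
            last_obs = out[i]
            run = 0
        end
    end
    return out
end

locf_sketch([1.0, 2.0, missing, missing, missing, 6.0])           # [1.0, 2.0, 2.0, 2.0, 2.0, 6.0]
locf_sketch([1.0, 2.0, missing, missing, missing, 6.0]; limit=2)  # [1.0, 2.0, 2.0, 2.0, missing, 6.0]
```

Running the same loop over `reverse(eachindex(out))` would give the next-observation-carried-backward variant.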
Example
julia> using Impute: LOCF, impute
julia> M = [1.0 2.0 missing missing missing 6.0; 1.1 missing missing 4.4 5.5 6.6]
2×6 Matrix{Union{Missing, Float64}}:
1.0 2.0 missing missing missing 6.0
1.1 missing missing 4.4 5.5 6.6
julia> impute(M, LOCF(); dims=:rows)
2×6 Matrix{Union{Missing, Float64}}:
1.0 2.0 2.0 2.0 2.0 6.0
1.1 1.1 1.1 4.4 5.5 6.6
julia> impute(M, LOCF(; limit=2); dims=:rows)
2×6 Matrix{Union{Missing, Float64}}:
1.0 2.0 2.0 2.0 missing 6.0
1.1 1.1 1.1 4.4 5.5 6.6
Next Observation Carried Backward (NOCB)
Impute.nocb — Function
Impute.nocb(data; dims=1)
Iterates backwards through the data and fills missing data with the next existing observation. See Impute.NOCB for details.
Example
julia> using DataFrames; using Impute: Impute
julia> df = DataFrame(:a => [1.0, 2.0, missing, missing, 5.0], :b => [1.1, 2.2, 3.3, missing, 5.5])
5×2 DataFrame
Row │ a b
│ Float64? Float64?
─────┼──────────────────────
1 │ 1.0 1.1
2 │ 2.0 2.2
3 │ missing 3.3
4 │ missing missing
5 │ 5.0 5.5
julia> Impute.nocb(df)
5×2 DataFrame
Row │ a b
│ Float64? Float64?
─────┼────────────────────
1 │ 1.0 1.1
2 │ 2.0 2.2
3 │ 5.0 3.3
4 │ 5.0 5.5
5 │ 5.0 5.5
Impute.NOCB — Type
NOCB(; limit=nothing)
Next observation carried backward (NOCB) iterates backwards through the data and fills missing data with the next existing observation.
See also:
Impute.LOCF: Last Observation Carried Forward
Note: Missing elements at the tail of the array may not be imputed if there is no existing observation to carry backward. As a result, this method does not guarantee that all missing values will be imputed.
Keyword Arguments
- limit::Union{UInt, Nothing}: Optionally limits the number of consecutive missing values to replace.
Example
julia> using Impute: NOCB, impute
julia> M = [1.0 2.0 missing missing missing 6.0; 1.1 missing missing 4.4 5.5 6.6]
2×6 Matrix{Union{Missing, Float64}}:
1.0 2.0 missing missing missing 6.0
1.1 missing missing 4.4 5.5 6.6
julia> impute(M, NOCB(); dims=:rows)
2×6 Matrix{Union{Missing, Float64}}:
1.0 2.0 6.0 6.0 6.0 6.0
1.1 4.4 4.4 4.4 5.5 6.6
julia> impute(M, NOCB(; limit=2); dims=:rows)
2×6 Matrix{Union{Missing, Float64}}:
1.0 2.0 missing 6.0 6.0 6.0
1.1 4.4 4.4 4.4 5.5 6.6
Replacement
Impute.replace — Function
Impute.replace(data; values)
Replace missings with one of the specified constant values, depending on the input type. If multiple values of the same type are provided then the first one will be used. If the input data is of a different type then no replacement will be performed.
Keyword Arguments
values::Tuple: A scalar or tuple of different values that should be used to replace missings. Typically, one value per type you're considering imputing for.
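The type-matching rule described above can be sketched in a few lines of plain Julia. This is only an illustration under stated assumptions, not the package's code: `replace_sketch` is a hypothetical name, it handles a single vector, and it matches a candidate value with `isa` against the non-missing element type.

```julia
# Hypothetical helper: replace missings with the first candidate value whose
# type matches the vector's non-missing element type.
function replace_sketch(v::AbstractVector, values::Tuple)
    T = nonmissingtype(eltype(v))
    for val in values
        # First match wins, mirroring "the first one will be used".
        val isa T && return coalesce.(v, val)
    end
    return copy(v)   # no candidate of a matching type: data is left unchanged
end

candidates = (NaN, -9999, "NULL")
replace_sketch([1.1, missing, 5.5], candidates)   # missing becomes NaN
replace_sketch([1, 2, missing], candidates)       # missing becomes -9999
replace_sketch(["v", missing], candidates)        # missing becomes "NULL"
replace_sketch([true, missing], candidates)       # no Bool candidate, unchanged
```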
Example
julia> using DataFrames, Impute
julia> df = DataFrame(
:a => [1.1, 2.2, missing, missing, 5.5],
:b => [1, 2, 3, missing, 5],
:c => ["v", "w", "x", "y", missing],
)
5×3 DataFrame
Row │ a b c
│ Float64? Int64? String?
─────┼─────────────────────────────
1 │ 1.1 1 v
2 │ 2.2 2 w
3 │ missing 3 x
4 │ missing missing y
5 │ 5.5 5 missing
julia> Impute.replace(df; values=(NaN, -9999, "NULL"))
5×3 DataFrame
Row │ a b c
│ Float64? Int64? String?
─────┼───────────────────────────
1 │ 1.1 1 v
2 │ 2.2 2 w
3 │ NaN 3 x
4 │ NaN -9999 y
5 │ 5.5 5 NULL
Impute.Replace — Type
Replace(; values)
Replace missings with one of the specified constant values, depending on the input type. If multiple values of the same type are provided then the first one will be used. If the input data is of a different type then no replacement will be performed.
Keyword Arguments
values::Tuple: A scalar or tuple of different values that should be used to replace missings. Typically, one value per type you're considering imputing for.
Example
julia> using Impute: Replace, impute
julia> M = [1.0 2.0 missing missing 5.0; 1.1 2.2 3.3 missing 5.5]
2×5 Matrix{Union{Missing, Float64}}:
1.0 2.0 missing missing 5.0
1.1 2.2 3.3 missing 5.5
julia> impute(M, Replace(; values=0.0); dims=2)
2×5 Matrix{Union{Missing, Float64}}:
1.0 2.0 0.0 0.0 5.0
1.1 2.2 3.3 0.0 5.5
Simple Random Sample (SRS)
Impute.srs — Function
Impute.srs(data; rng=Random.GLOBAL_RNG)
Simple Random Sampling (SRS) imputation is a method for imputing both continuous and categorical variables. Furthermore, it completes imputation while preserving the distributional properties of the variables (e.g., mean, standard deviation).
Example
julia> using DataFrames; using Random; using Impute: Impute
julia> df = DataFrame(:a => [1.0, 2.0, missing, missing, 5.0], :b => [1.1, 2.2, 3.3, missing, 5.5])
5×2 DataFrame
│ Row │ a │ b │
│ │ Float64? │ Float64? │
├─────┼──────────┼──────────┤
│ 1 │ 1.0 │ 1.1 │
│ 2 │ 2.0 │ 2.2 │
│ 3 │ missing │ 3.3 │
│ 4 │ missing │ missing │
│ 5 │ 5.0 │ 5.5 │
julia> Impute.srs(df; rng=MersenneTwister(1234))
5×2 DataFrame
│ Row │ a │ b │
│ │ Float64? │ Float64? │
├─────┼──────────┼──────────┤
│ 1 │ 1.0 │ 1.1 │
│ 2 │ 2.0 │ 2.2 │
│ 3 │ 1.0 │ 3.3 │
│ 4 │ 2.0 │ 3.3 │
│ 5 │ 5.0 │ 5.5 │
Impute.SRS — Type
SRS(; rng=Random.GLOBAL_RNG)
Simple Random Sampling (SRS) imputation is a method for imputing both continuous and categorical variables. Furthermore, it completes imputation while preserving the distributional properties of the variables (e.g., mean, standard deviation).
The basic idea is that for a given variable, x, with missing data, we randomly draw from the observed values of x to impute the missing elements. Since the random draws from x for imputation are done in proportion to the frequency distribution of the values in x, the univariate distributional properties are generally not impacted; this is true for both categorical and continuous data.
Keyword Arguments
rng::AbstractRNG: A random number generator to use for observation selection
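The basic idea can be sketched for a single vector as follows. This is an illustrative sketch, not the package's implementation: `srs_sketch` is a hypothetical name, and drawing with replacement via `rand(rng, observed)` is an assumption about how "randomly draw from the observed values" is realised.

```julia
using Random

# Hypothetical helper: fill each missing entry with a value drawn uniformly
# at random (with replacement) from the observed entries of the same vector.
function srs_sketch(v::AbstractVector; rng::AbstractRNG=Random.default_rng())
    out = copy(v)
    observed = collect(skipmissing(v))
    isempty(observed) && return out      # nothing to sample from
    for i in eachindex(out)
        if ismissing(out[i])
            out[i] = rand(rng, observed)
        end
    end
    return out
end

# Passing a seeded generator makes the draws reproducible for that generator.
srs_sketch([1.0, 2.0, missing, missing, 5.0]; rng=MersenneTwister(1234))
```

Because values that occur more often among the observations are more likely to be drawn, the imputed entries follow the empirical frequency distribution of the variable.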
Example
julia> using Random; using Impute: SRS, impute
julia> M = [1.0 2.0 missing missing 5.0; 1.1 2.2 3.3 missing 5.5]
2×5 Matrix{Union{Missing, Float64}}:
1.0 2.0 missing missing 5.0
1.1 2.2 3.3 missing 5.5
julia> impute(M, SRS(; rng=MersenneTwister(1234)); dims=:rows)
2×5 Matrix{Union{Missing, Float64}}:
1.0 2.0 1.0 2.0 5.0
1.1 2.2 3.3 3.3 5.5
Substitute
Impute.substitute — Function
Impute.substitute(data; statistic=nothing)
Impute.substitute(data; weights=nothing)
Substitute missing values with a summary statistic over the non-missing values.
Keyword Arguments
- statistic: A summary statistic function to be applied to the non-missing values. This function should return a value of the same type as the input data eltype. If this function isn't passed in then the defaultstats function is used to make a best guess.
- weights: A set of statistical weights to apply to the mean or median in defaultstats.
See Substitute for details on substitution rules defined in defaultstats.
Example
julia> using DataFrames, Impute
julia> df = DataFrame(
:a => [8.9, 2.2, missing, missing, 1.3, 6.2, 3.7, 4.8],
:b => [2, 6, 3, missing, 7, 1, 9, missing],
:c => [true, false, true, true, false, missing, false, true],
)
8×3 DataFrame
Row │ a b c
│ Float64? Int64? Bool?
─────┼─────────────────────────────
1 │ 8.9 2 true
2 │ 2.2 6 false
3 │ missing 3 true
4 │ missing missing true
5 │ 1.3 7 false
6 │ 6.2 1 missing
7 │ 3.7 9 false
8 │ 4.8 missing true
julia> Impute.substitute(df)
8×3 DataFrame
Row │ a b c
│ Float64? Int64? Bool?
─────┼─────────────────────────
1 │ 8.9 2 true
2 │ 2.2 6 false
3 │ 4.25 3 true
4 │ 4.25 4 true
5 │ 1.3 7 false
6 │ 6.2 1 true
7 │ 3.7 9 false
8 │ 4.8 4 true
Impute.Substitute — Type
Substitute(; statistic=Impute.defaultstats)
Substitute missing values with a summary statistic over the non-missing values.
Keyword Arguments
- statistic: A summary statistic function to be applied to the non-missing values. This function should return a value of the same type as the input data eltype. If this function isn't passed in then the Impute.defaultstats function is used to make a best guess.
Example
julia> using Statistics; using Impute: Substitute, impute
julia> M = [1.0 2.0 missing missing 5.0; 1.1 2.2 3.3 missing 5.5]
2×5 Matrix{Union{Missing, Float64}}:
1.0 2.0 missing missing 5.0
1.1 2.2 3.3 missing 5.5
julia> impute(M, Substitute(); dims=:rows)
2×5 Matrix{Union{Missing, Float64}}:
1.0 2.0 2.0 2.0 5.0
1.1 2.2 3.3 2.75 5.5
julia> impute(M, Substitute(; statistic=mean); dims=:rows)
2×5 Matrix{Union{Missing, Float64}}:
1.0 2.0 2.66667 2.66667 5.0
1.1 2.2 3.3 3.025 5.5
Impute.defaultstats — Function
defaultstats(data[, wv])
A set of default substitution rules using either median or mode based on the eltype of the input data. Specific rules are summarized as follows.
- Bool elements use mode
- Real elements use median
- Integer elements where nunique(data) / length(data) < 0.25 use mode (ratings, categorical codings, etc)
- Integer elements with mostly unique values use median
- !Number (non-numeric) elements use mode as the safest fallback
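The rules listed above can be read as a small dispatch on the element type. The sketch below is one possible reading in plain Julia, not the package's code: `mode_sketch` and `defaultstats_sketch` are hypothetical names, weights are ignored, the uniqueness ratio is computed over the non-missing values, and the integer median is rounded with `round(T, ...)`. The last two points are assumptions. At least one observed value is required.

```julia
using Statistics

# Hypothetical helper: most frequent value (ties go to the value seen first).
function mode_sketch(xs)
    counts = Dict{eltype(xs), Int}()
    for x in xs
        counts[x] = get(counts, x, 0) + 1
    end
    best, best_n = first(xs), 0
    for x in xs
        if counts[x] > best_n
            best, best_n = x, counts[x]
        end
    end
    return best
end

# Hypothetical helper: pick mode or median based on the element type.
function defaultstats_sketch(v::AbstractVector)
    xs = collect(skipmissing(v))
    T = eltype(xs)
    if T <: Bool                       # checked first because Bool <: Integer
        return mode_sketch(xs)
    elseif T <: Integer
        # Few distinct values suggests ratings or categorical codes.
        if length(unique(xs)) / length(xs) < 0.25
            return mode_sketch(xs)
        end
        return round(T, median(xs))    # assumed rounding back to the integer type
    elseif T <: Real
        return median(xs)
    else
        return mode_sketch(xs)         # non-numeric fallback
    end
end

defaultstats_sketch([8.9, 2.2, missing, missing, 1.3, 6.2, 3.7, 4.8])         # median
defaultstats_sketch([true, false, true, true, false, missing, false, true])   # mode
defaultstats_sketch(["a", "b", "b", missing])                                  # mode
```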
SVD
Impute.svd — Function
Impute.svd(; kwargs...)
Imputes the missing values in a matrix using an expectation maximization (EM) algorithm over low-rank SVD approximations.
Keyword Arguments
- init::Imputor: initialization method for missing values (default: Substitute())
- rank::Union{Int, Nothing}: rank of the SVD approximation (default: nothing, meaning start at 0 and increase)
- tol::Float64: convergence tolerance (default: 1e-10)
- maxiter::Int: maximum number of iterations if convergence is not achieved (default: 100)
- limits::Union{Tuple{Float64, Float64}, Nothing}: bound the possible approximation values (default: nothing)
- verbose::Bool: whether to display convergence progress (default: true)
References
- Troyanskaya, Olga, et al. "Missing value estimation methods for DNA microarrays." Bioinformatics 17.6 (2001): 520-525.
Example
julia> using Impute, Missings
julia> data = allowmissing(reshape(sin.(1:20), 5, 4)); data[[2, 3, 7, 9, 13, 19]] .= missing; data
5×4 Matrix{Union{Missing, Float64}}:
0.841471 -0.279415 -0.99999 -0.287903
missing missing -0.536573 -0.961397
missing 0.989358 missing -0.750987
-0.756802 missing 0.990607 missing
-0.958924 -0.544021 0.650288 0.912945
julia> result = Impute.svd(data; dims=:cols)
5×4 Matrix{Union{Missing, Float64}}:
0.841471 -0.279415 -0.99999 -0.287903
0.220258 0.555829 -0.536573 -0.961397
-0.372745 0.989358 0.533193 -0.750987
-0.756802 0.253309 0.990607 0.32315
-0.958924 -0.544021 0.650288 0.912945
Impute.SVD — Type
SVD(; kwargs...)
Imputes the missing values in a matrix using an expectation maximization (EM) algorithm over low-rank SVD approximations.
Keyword Arguments
- init::Imputor: initialization method for missing values (default: Substitute())
- rank::Union{Int, Nothing}: rank of the SVD approximation (default: nothing, meaning start at 0 and increase)
- tol::Float64: convergence tolerance (default: 1e-10)
- maxiter::Int: maximum number of iterations if convergence is not achieved (default: 100)
- limits::Union{Tuple{Float64, Float64}, Nothing}: bound the possible approximation values (default: nothing)
- verbose::Bool: whether to display convergence progress (default: true)
References
- Troyanskaya, Olga, et al. "Missing value estimation methods for DNA microarrays." Bioinformatics 17.6 (2001): 520-525.
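To make the iteration behind the SVD entries more concrete, here is a simplified sketch in plain Julia using only the standard library. It is not the package's implementation: `svd_impute_sketch` is a hypothetical name, the rank is fixed rather than increased automatically, missing entries are initialised with the column median (an assumption standing in for the `init` imputor), and `limits` and `verbose` are omitted. Every column is assumed to contain at least one observed value.

```julia
using LinearAlgebra, Statistics

# Hypothetical helper: alternate between a rank-`rank` SVD approximation and
# refilling the missing positions from it, until the fill values stabilise.
function svd_impute_sketch(X::AbstractMatrix; rank::Int=1, tol=1e-10, maxiter=100)
    mask = ismissing.(X)                       # positions to impute
    filled = Matrix{Float64}(undef, size(X))
    for j in axes(X, 2)
        col = X[:, j]
        # Assumed initialisation: column median of the observed values.
        filled[:, j] = coalesce.(col, median(skipmissing(col)))
    end
    for _ in 1:maxiter
        F = svd(filled)
        approx = F.U[:, 1:rank] * Diagonal(F.S[1:rank]) * F.Vt[1:rank, :]
        # Observed values are never overwritten; only missing positions move.
        delta = norm(approx[mask] .- filled[mask])
        filled[mask] = approx[mask]
        delta < tol && break
    end
    return filled
end

X = [1.0 2.0; 2.0 4.0; 3.0 missing; 4.0 8.0]
svd_impute_sketch(X; rank=1)
```

Keeping the observed entries fixed and updating only the masked ones is what makes each pass an EM-style step: the low-rank fit is re-estimated from the current completion, then the completion is refreshed from the fit.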