Imputation
Impute.Imputor
— TypeImputor
An imputor stores information about imputing values in AbstractArray
s and Tables.table
s. New imputation methods are expected to subtype Imputor
and, at minimum, implement the _impute!(data::AbstractArrays, imp::<MyImputor>)
method.
While fallback impute
and impute!
methods are provided to extend your _impute!
methods to n-dimensional arrays and tables, you can always override these methods to change the behaviour as necessary.
Impute.impute!
— Methodimpute!(data::AbstractArray, imp) -> data
Just returns the data
when the array doesn't contain missing
s
Impute.impute!
— Methodimpute!(data::AbstractArray{Missing}, imp) -> data
Just return the data
when the array only contains missing
s
Impute.impute!
— Methodimpute!(data::A, imp; dims=:, kwargs...) -> A
Impute the missing
values in the array data
using the imputor imp
. Optionally, you can specify the dimension to impute along.
Arguments
data::AbstractArray{Union{T, Missing}}
: the data to be impute along dimensionsdims
imp::Imputor
: the Imputor method to use
Keyword Arguments
dims=:
: The dimension to impute along.:rows
and:cols
are also supported for matrices.
Returns
AbstractArray{Union{T, Missing}}
: the inputdata
with values imputed
NOTES
- Matrices have a deprecated
dims=2
special case asdims=:
is a breaking change - Mutation isn't guaranteed for all array types, hence we return the result
eachslice
is used internally which requires Julia 1.1
Example
julia> using Impute: Interpolate, impute!
julia> M = [1.0 2.0 missing missing 5.0; 1.1 2.2 3.3 missing 5.5]
2×5 Matrix{Union{Missing, Float64}}:
1.0 2.0 missing missing 5.0
1.1 2.2 3.3 missing 5.5
julia> impute!(M, Interpolate(); dims=1)
2×5 Matrix{Union{Missing, Float64}}:
1.0 2.0 3.0 4.0 5.0
1.1 2.2 3.3 4.4 5.5
julia> M
2×5 Matrix{Union{Missing, Float64}}:
1.0 2.0 3.0 4.0 5.0
1.1 2.2 3.3 4.4 5.5
Impute.impute!
— Methodimpute!(table, imp; cols=nothing) -> table
Imputes the data in a table by imputing the values 1 column at a time; if this is not the desired behaviour custom imputor methods should overload this method.
Arguments
imp::Imputor
: the Imputor method to usetable
: the data to impute
Keyword Arguments
cols
: The columns to impute along (default is to impute all columns)
Returns
- the input
data
with values imputed
Example
julia> using DataFrames; using Impute: Interpolate, impute
julia> df = DataFrame(:a => [1.0, 2.0, missing, missing, 5.0], :b => [1.1, 2.2, 3.3, missing, 5.5])
5×2 DataFrame
Row │ a b
│ Float64? Float64?
─────┼──────────────────────
1 │ 1.0 1.1
2 │ 2.0 2.2
3 │ missing 3.3
4 │ missing missing
5 │ 5.0 5.5
julia> impute(df, Interpolate())
5×2 DataFrame
Row │ a b
│ Float64? Float64?
─────┼────────────────────
1 │ 1.0 1.1
2 │ 2.0 2.2
3 │ 3.0 3.3
4 │ 4.0 4.4
5 │ 5.0 5.5
Impute.impute!
— Methodimpute!(data::T, imp; kwargs...) -> T where T <: AbstractVector{<:NamedTuple}
Special case rowtables which are arrays, but we want to fallback to the tables method.
Impute.impute
— Methodimpute(data::T, imp; kwargs...) -> T
Returns a new copy of the data
with the missing data imputed by the imputor imp
. For matrices and tables, data is imputed one variable/column at a time. If this is not the desired behaviour then you should overload this method or specify a different dims
value.
Arguments
data
: the data to be imputeimp::Imputor
: the Imputor method to use
Returns
- the input
data
with values imputed
Example
julia> using Impute: Interpolate, impute
julia> v = [1.0, 2.0, missing, missing, 5.0]
5-element Vector{Union{Missing, Float64}}:
1.0
2.0
missing
missing
5.0
julia> impute(v, Interpolate())
5-element Vector{Union{Missing, Float64}}:
1.0
2.0
3.0
4.0
5.0
Interpolation
Impute.interp
— FunctionImpute.interp(data; dims=1)
Performs linear interpolation between the nearest values in an vector. See Impute.Interpolate
for details.
Example
julia> using DataFrames; using Impute: Impute
julia> df = DataFrame(:a => [1.0, 2.0, missing, missing, 5.0], :b => [1.1, 2.2, 3.3, missing, 5.5])
5×2 DataFrame
Row │ a b
│ Float64? Float64?
─────┼──────────────────────
1 │ 1.0 1.1
2 │ 2.0 2.2
3 │ missing 3.3
4 │ missing missing
5 │ 5.0 5.5
julia> Impute.interp(df)
5×2 DataFrame
Row │ a b
│ Float64? Float64?
─────┼────────────────────
1 │ 1.0 1.1
2 │ 2.0 2.2
3 │ 3.0 3.3
4 │ 4.0 4.4
5 │ 5.0 5.5
Impute.Interpolate
— TypeInterpolate(; limit=nothing, r=nothing)
Performs linear interpolation between the nearest values in an vector. The current implementation is univariate, so each variable in a table or matrix will be handled independently.
!!! Missing values at the head or tail of the array cannot be interpolated if there are no existing values on both sides. As a result, this method does not guarantee that all missing values will be imputed.
Keyword Arguments
limit::Union{UInt, Nothing}
: Optionally limit the gap sizes that can be interpolated.r::Union{RoundingMode, Nothing}
: Optionally specify a rounding mode. AvoidsInexactError
s when interpolating over integers.
Example
julia> using Impute: Interpolate, impute
julia> M = [1.0 2.0 missing missing missing 6.0; 1.1 missing missing 4.4 5.5 6.6]
2×6 Matrix{Union{Missing, Float64}}:
1.0 2.0 missing missing missing 6.0
1.1 missing missing 4.4 5.5 6.6
julia> impute(M, Interpolate(); dims=:rows)
2×6 Matrix{Union{Missing, Float64}}:
1.0 2.0 3.0 4.0 5.0 6.0
1.1 2.2 3.3 4.4 5.5 6.6
julia> impute(M, Interpolate(; limit=2); dims=:rows)
2×6 Matrix{Union{Missing, Float64}}:
1.0 2.0 missing missing missing 6.0
1.1 2.2 3.3 4.4 5.5 6.6
K-Nearest Neighbors (KNN)
Impute.knn
— FunctionImpute.knn(; k=1, threshold=0.5, dist=Euclidean())
Imputation using k-Nearest Neighbor algorithm.
Keyword Arguments
k::Int
: number of nearest neighborsdist::MinkowskiMetric
: distance metric suppports byNearestNeighbors.jl
(Euclidean, Chebyshev, Minkowski and Cityblock)threshold::AbsstractFloat
: thershold for missing neighbors
Reference
- Troyanskaya, Olga, et al. "Missing value estimation methods for DNA microarrays." Bioinformatics 17.6 (2001): 520-525.
Example
julia> using Impute, Missings
julia> data = allowmissing(reshape(sin.(1:20), 5, 4)); data[[2, 3, 7, 9, 13, 19]] .= missing; data
5×4 Matrix{Union{Missing, Float64}}:
0.841471 -0.279415 -0.99999 -0.287903
missing missing -0.536573 -0.961397
missing 0.989358 missing -0.750987
-0.756802 missing 0.990607 missing
-0.958924 -0.544021 0.650288 0.912945
julia> result = Impute.knn(data; dims=:cols)
5×4 Matrix{Union{Missing, Float64}}:
0.841471 -0.279415 -0.99999 -0.287903
-0.756802 0.989358 -0.536573 -0.961397
-0.756802 0.989358 -0.536573 -0.750987
-0.756802 -0.544021 0.990607 0.912945
-0.958924 -0.544021 0.650288 0.912945
Impute.KNN
— TypeKNN(; kwargs...)
Imputation using k-Nearest Neighbor algorithm.
Keyword Arguments
k::Int
: number of nearest neighborsdist::MinkowskiMetric
: distance metric suppports byNearestNeighbors.jl
(Euclidean, Chebyshev, Minkowski and Cityblock)threshold::AbstractFloat
: threshold for missing neighbors
Reference
- Troyanskaya, Olga, et al. "Missing value estimation methods for DNA microarrays." Bioinformatics 17.6 (2001): 520-525.
Last Observation Carried Forward (LOCF)
Impute.locf
— FunctionImpute.locf(data; dims=1)
Iterates forwards through the data
and fills missing data with the last existing observation. See Impute.LOCF
for details.
Example
julia> using DataFrames; using Impute: Impute
julia> df = DataFrame(:a => [1.0, 2.0, missing, missing, 5.0], :b => [1.1, 2.2, 3.3, missing, 5.5])
5×2 DataFrame
Row │ a b
│ Float64? Float64?
─────┼──────────────────────
1 │ 1.0 1.1
2 │ 2.0 2.2
3 │ missing 3.3
4 │ missing missing
5 │ 5.0 5.5
julia> Impute.locf(df)
5×2 DataFrame
Row │ a b
│ Float64? Float64?
─────┼────────────────────
1 │ 1.0 1.1
2 │ 2.0 2.2
3 │ 2.0 3.3
4 │ 2.0 3.3
5 │ 5.0 5.5
Impute.LOCF
— TypeLOCF(; limit=nothing)
Last observation carried forward (LOCF) iterates forwards through the data
and fills missing data with the last existing observation. The current implementation is univariate, so each variable in a table or matrix will be handled independently.
See also:
Impute.NOCB
: Next Observation Carried Backward
!!! Missing elements at the head of the array may not be imputed if there is no existing observation to carry forward. As a result, this method does not guarantee that all missing values will be imputed.
Keyword Arguments
limit::Union{UInt, Nothing}
: Optionally limits the amount of consecutive missing values to replace.
Example
julia> using Impute: LOCF, impute
julia> M = [1.0 2.0 missing missing missing 6.0; 1.1 missing missing 4.4 5.5 6.6]
2×6 Matrix{Union{Missing, Float64}}:
1.0 2.0 missing missing missing 6.0
1.1 missing missing 4.4 5.5 6.6
julia> impute(M, LOCF(); dims=:rows)
2×6 Matrix{Union{Missing, Float64}}:
1.0 2.0 2.0 2.0 2.0 6.0
1.1 1.1 1.1 4.4 5.5 6.6
julia> impute(M, LOCF(; limit=2); dims=:rows)
2×6 Matrix{Union{Missing, Float64}}:
1.0 2.0 2.0 2.0 missing 6.0
1.1 1.1 1.1 4.4 5.5 6.6
Next Observation Carried Backward (NOCB)
Impute.nocb
— FunctionImpute.nocb(data; dims=1)
Iterates backwards through the data
and fills missing data with the next existing observation. See Impute.NOCB
for details.
Example
julia> using DataFrames; using Impute: Impute
julia> df = DataFrame(:a => [1.0, 2.0, missing, missing, 5.0], :b => [1.1, 2.2, 3.3, missing, 5.5])
5×2 DataFrame
Row │ a b
│ Float64? Float64?
─────┼──────────────────────
1 │ 1.0 1.1
2 │ 2.0 2.2
3 │ missing 3.3
4 │ missing missing
5 │ 5.0 5.5
julia> Impute.nocb(df)
5×2 DataFrame
Row │ a b
│ Float64? Float64?
─────┼────────────────────
1 │ 1.0 1.1
2 │ 2.0 2.2
3 │ 5.0 3.3
4 │ 5.0 5.5
5 │ 5.0 5.5
Impute.NOCB
— TypeNOCB(; limit=nothing)
Next observation carried backward (NOCB) iterates backwards through the data
and fills missing data with the next existing observation.
See also:
Impute.LOCF
: Last Observation Carried Forward
!!! Missing elements at the tail of the array may not be imputed if there is no existing observation to carry backward. As a result, this method does not guarantee that all missing values will be imputed.
Keyword Arguments
limit::Union{UInt, Nothing}
: Optionally limits the amount of consecutive missing values to replace.
Example
julia> using Impute: NOCB, impute
julia> M = [1.0 2.0 missing missing missing 6.0; 1.1 missing missing 4.4 5.5 6.6]
2×6 Matrix{Union{Missing, Float64}}:
1.0 2.0 missing missing missing 6.0
1.1 missing missing 4.4 5.5 6.6
julia> impute(M, NOCB(); dims=:rows)
2×6 Matrix{Union{Missing, Float64}}:
1.0 2.0 6.0 6.0 6.0 6.0
1.1 4.4 4.4 4.4 5.5 6.6
julia> impute(M, NOCB(; limit=2); dims=:rows)
2×6 Matrix{Union{Missing, Float64}}:
1.0 2.0 missing 6.0 6.0 6.0
1.1 4.4 4.4 4.4 5.5 6.6
Replacement
Impute.replace
— FunctionImpute.replace(data; values)
Replace missing
s with one of the specified constant values, depending on the input type. If multiple values of the same type are provided then the first one will be used. If the input data is of a different type then the no replacement will be performed.
Keyword Arguments
values::Tuple
: A scalar or tuple of different values that should be used to replace missings. Typically, one value per type you're considering imputing for.
Example
julia> using DataFrames, Impute
julia> df = DataFrame(
:a => [1.1, 2.2, missing, missing, 5.5],
:b => [1, 2, 3, missing, 5],
:c => ["v", "w", "x", "y", missing],
)
5×3 DataFrame
Row │ a b c
│ Float64? Int64? String?
─────┼─────────────────────────────
1 │ 1.1 1 v
2 │ 2.2 2 w
3 │ missing 3 x
4 │ missing missing y
5 │ 5.5 5 missing
julia> Impute.replace(df; values=(NaN, -9999, "NULL"))
5×3 DataFrame
Row │ a b c
│ Float64? Int64? String?
─────┼───────────────────────────
1 │ 1.1 1 v
2 │ 2.2 2 w
3 │ NaN 3 x
4 │ NaN -9999 y
5 │ 5.5 5 NULL
Impute.Replace
— TypeReplace(; value)
Replace missing
s with one of the specified constant values, depending on the input type. If multiple values of the same type are provided then the first one will be used. If the input data is of a different type then the no replacement will be performed.
Keyword Arguments
values::Tuple
: A scalar or tuple of different values that should be used to replace missings. Typically, one value per type you're considering imputing for.
Example
julia> using Impute: Replace, impute
julia> M = [1.0 2.0 missing missing 5.0; 1.1 2.2 3.3 missing 5.5]
2×5 Matrix{Union{Missing, Float64}}:
1.0 2.0 missing missing 5.0
1.1 2.2 3.3 missing 5.5
julia> impute(M, Replace(; values=0.0); dims=2)
2×5 Matrix{Union{Missing, Float64}}:
1.0 2.0 0.0 0.0 5.0
1.1 2.2 3.3 0.0 5.5
Simple Random Sample (SRS)
Impute.srs
— FunctionImpute.srs(data; rng=Random.GLOBAL_RNG)
Simple Random Sampling (SRS) imputation is a method for imputing both continuous and categorical variables. Furthermore, it completes imputation while preserving the distributional properties of the variables (e.g., mean, standard deviation).
Example
julia> using DataFrames; using Random; using Impute: Impute
julia> df = DataFrame(:a => [1.0, 2.0, missing, missing, 5.0], :b => [1.1, 2.2, 3.3, missing, 5.5])
5×2 DataFrame
│ Row │ a │ b │
│ │ Float64? │ Float64? │
├─────┼──────────┼──────────┤
│ 1 │ 1.0 │ 1.1 │
│ 2 │ 2.0 │ 2.2 │
│ 3 │ missing │ 3.3 │
│ 4 │ missing │ missing │
│ 5 │ 5.0 │ 5.5 │
julia> Impute.srs(df; rng=MersenneTwister(1234))
5×2 DataFrame
│ Row │ a │ b │
│ │ Float64? │ Float64? │
├─────┼──────────┼──────────┤
│ 1 │ 1.0 │ 1.1 │
│ 2 │ 2.0 │ 2.2 │
│ 3 │ 1.0 │ 3.3 │
│ 4 │ 2.0 │ 3.3 │
│ 5 │ 5.0 │ 5.5 │
Impute.SRS
— TypeSRS(; rng=Random.GLOBAL_RNG)
Simple Random Sampling (SRS) imputation is a method for imputing both continuous and categorical variables. Furthermore, it completes imputation while preserving the distributional properties of the variables (e.g., mean, standard deviation).
The basic idea is that for a given variable, x
, with missing data, we randomly draw from the observed values of x
to impute the missing elements. Since the random draws from x
for imputation are done in proportion to the frequency distribution of the values in x
, the univariate distributional properties are generally not impacted; this is true for both categorical and continuous data.
Keyword Arguments
rng::AbstractRNG
: A random number generator to use for observation selection
Example
julia> using Random; using Impute: SRS, impute
julia> M = [1.0 2.0 missing missing 5.0; 1.1 2.2 3.3 missing 5.5]
2×5 Matrix{Union{Missing, Float64}}:
1.0 2.0 missing missing 5.0
1.1 2.2 3.3 missing 5.5
julia> impute(M, SRS(; rng=MersenneTwister(1234)); dims=:rows)
2×5 Matrix{Union{Missing, Float64}}:
1.0 2.0 1.0 2.0 5.0
1.1 2.2 3.3 3.3 5.5
Substitute
Impute.substitute
— FunctionImpute.substitute(data; statistic=nothing)
Impute.substitute(data; weights=nothing)
Substitute missing values with a summary statistic over the non-missing values.
Keyword Arguments
statistic
: A summary statistic function to be applied to the non-missing values. This function should return a value of the same type as the input dataeltype
. If this function isn't passed in then thedefaultstats
function is used to make a best guess.weights
: A set of statistical weights to apply to themean
ormedian
indefaultstats
.
See Substitute for details on substitution rules defined in defaultstats
.
Example
julia> using DataFrames, Impute
julia> df = DataFrame(
:a => [8.9, 2.2, missing, missing, 1.3, 6.2, 3.7, 4.8],
:b => [2, 6, 3, missing, 7, 1, 9, missing],
:c => [true, false, true, true, false, missing, false, true],
)
8×3 DataFrame
Row │ a b c
│ Float64? Int64? Bool?
─────┼─────────────────────────────
1 │ 8.9 2 true
2 │ 2.2 6 false
3 │ missing 3 true
4 │ missing missing true
5 │ 1.3 7 false
6 │ 6.2 1 missing
7 │ 3.7 9 false
8 │ 4.8 missing true
julia> Impute.substitute(df)
8×3 DataFrame
Row │ a b c
│ Float64? Int64? Bool?
─────┼─────────────────────────
1 │ 8.9 2 true
2 │ 2.2 6 false
3 │ 4.25 3 true
4 │ 4.25 4 true
5 │ 1.3 7 false
6 │ 6.2 1 true
7 │ 3.7 9 false
8 │ 4.8 4 true
Impute.Substitute
— TypeSubstitute(; statistic=Impute.defaultstats)
Substitute missing values with a summary statistic over the non-missing values.
Keyword Arguments
statistic
: A summary statistic function to be applied to the non-missing values. This function should return a value of the same type as the input dataeltype
. If this function isn't passed in then theImpute.defaultstats
function is used to make a best guess.
Example
julia> using Statistics; using Impute: Substitute, impute
julia> M = [1.0 2.0 missing missing 5.0; 1.1 2.2 3.3 missing 5.5]
2×5 Matrix{Union{Missing, Float64}}:
1.0 2.0 missing missing 5.0
1.1 2.2 3.3 missing 5.5
julia> impute(M, Substitute(); dims=:rows)
2×5 Matrix{Union{Missing, Float64}}:
1.0 2.0 2.0 2.0 5.0
1.1 2.2 3.3 2.75 5.5
julia> impute(M, Substitute(; statistic=mean); dims=:rows)
2×5 Matrix{Union{Missing, Float64}}:
1.0 2.0 2.66667 2.66667 5.0
1.1 2.2 3.3 3.025 5.5
Impute.defaultstats
— Functiondefaultstats(data[, wv])
A set of default substitution rules using either median
or mode
based on the eltype
of the input data
. Specific rules are summarized as follows.
Bool
elements usemode
Real
elements usemedian
Integer
elements wherenunique(data) / length(data) < 0.25
usemode
(ratings, categorical codings, etc)Integer
elements with mostly unique values usemedian
!Number
(non-numeric) elements usemode
as the safest fallback
SVD
Impute.svd
— FunctionImpute.svd(; kwargs...)
Imputes the missing values in a matrix using an expectation maximization (EM) algorithm over low-rank SVD approximations.
Keyword Arguments
init::Imputor
: initialization method for missing values (default: Substitute())rank::Union{Int, Nothing}
: rank of the SVD approximation (default: nothing meaning start and 0 and increase)tol::Float64
: convergence tolerance (default: 1e-10)maxiter::Int
: Maximum number of iterations if convergence is not achieved (default: 100)limits::Unoin{Tuple{Float64, Float64}, Nothing}
: Bound the possible approximation values (default: nothing)verbose::Bool
: Whether to display convergence progress (default: true)
References
- Troyanskaya, Olga, et al. "Missing value estimation methods for DNA microarrays." Bioinformatics 17.6 (2001): 520-525.
Example
julia> using Impute, Missings
julia> data = allowmissing(reshape(sin.(1:20), 5, 4)); data[[2, 3, 7, 9, 13, 19]] .= missing; data
5×4 Matrix{Union{Missing, Float64}}:
0.841471 -0.279415 -0.99999 -0.287903
missing missing -0.536573 -0.961397
missing 0.989358 missing -0.750987
-0.756802 missing 0.990607 missing
-0.958924 -0.544021 0.650288 0.912945
julia> result = Impute.svd(data; dims=:cols)
5×4 Matrix{Union{Missing, Float64}}:
0.841471 -0.279415 -0.99999 -0.287903
0.220258 0.555829 -0.536573 -0.961397
-0.372745 0.989358 0.533193 -0.750987
-0.756802 0.253309 0.990607 0.32315
-0.958924 -0.544021 0.650288 0.912945
Impute.SVD
— TypeSVD(; kwargs...)
Imputes the missing values in a matrix using an expectation maximization (EM) algorithm over low-rank SVD approximations.
Keyword Arguments
init::Imputor
: initialization method for missing values (default: Substitute())rank::Union{Int, Nothing}
: rank of the SVD approximation (default: nothing meaning start and 0 and increase)tol::Float64
: convergence tolerance (default: 1e-10)maxiter::Int
: Maximum number of iterations if convergence is not achieved (default: 100)limits::Union{Tuple{Float64, Float64}, Nothing}
: Bound the possible approximation values (default: nothing)verbose::Bool
: Whether to display convergence progress (default: true)
References
- Troyanskaya, Olga, et al. "Missing value estimation methods for DNA microarrays." Bioinformatics 17.6 (2001): 520-525.