{missRanger} is a multivariate imputation algorithm based on random forests. It is a fast alternative to the beautiful ‘MissForest’ algorithm of Stekhoven and Buehlmann (2011), and uses the {ranger} package (Wright and Ziegler 2017) to fit the random forests.
The algorithm iterates until the average out-of-bag (OOB) error of the forests stops improving. The missing values are filled by OOB predictions of the best iteration.
Using the formula interface, missRanger(data, . ~ 1) would impute all variables univariately, while missRanger(data, Species ~ Sepal.Width) would use Sepal.Width to impute Species.

library(missRanger)
set.seed(3)
iris_NA <- generateNA(iris, p = 0.1)
head(iris_NA)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 NA setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 <NA>
#> 5 NA 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 <NA>
imp <- missRanger(iris_NA, num.trees = 100)
#>
#> Variables to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> Variables used to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#>
#> iter 1
#>   |======================================================================| 100%
#> iter 2
#>   |======================================================================| 100%
#> iter 3
#>   |======================================================================| 100%
#> iter 4
#>   |======================================================================| 100%
head(imp)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.100000 3.5 1.4 0.2000000 setosa
#> 2 4.900000 3.0 1.4 0.1608667 setosa
#> 3 4.700000 3.2 1.3 0.2000000 setosa
#> 4 4.600000 3.1 1.5 0.2000000 setosa
#> 5 5.061255 3.6 1.4 0.2000000 setosa
#> 6 5.400000 3.9 1.7 0.4000000 setosa
It worked, but the new values appear overly exact. To avoid this, we can add predictive mean matching (PMM) to the OOB predictions:
imp <- missRanger(iris_NA, pmm.k = 5, num.trees = 100, verbose = 0)
head(imp)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.4 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
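To make the idea behind PMM concrete, here is a minimal base R sketch for a numeric variable. It is an illustration only, not missRanger's internal code: the helper name pmm_sketch and all toy numbers are hypothetical. For each prediction of a missing value, it looks up the k observed cases with the closest predictions and draws one of their actually observed values, so imputations can only take values that occur in the data.

```r
# Hypothetical helper illustrating predictive mean matching (numeric case only).
# pred_obs: predictions for rows where the variable is observed
# y_obs:    the observed values of those rows
# pred_mis: predictions for rows where the variable is missing
pmm_sketch <- function(pred_obs, y_obs, pred_mis, k = 5) {
  k <- min(k, length(y_obs))
  vapply(pred_mis, function(p) {
    # indices of the k observed rows whose predictions are closest to p
    nearest <- order(abs(pred_obs - p))[seq_len(k)]
    # draw one of these donors at random and return its observed value
    y_obs[nearest[sample.int(k, 1)]]
  }, FUN.VALUE = numeric(1))
}

set.seed(1)
y_obs    <- c(0.2, 0.2, 0.4, 1.3, 1.5, 2.1)        # toy observed values
pred_obs <- c(0.25, 0.22, 0.35, 1.28, 1.46, 2.00)  # toy predictions for them
pred_mis <- c(0.1608667, 1.4)                      # toy predictions for missing rows
res <- pmm_sketch(pred_obs, y_obs, pred_mis, k = 2)
res
```

Every element of res is one of the values in y_obs, which is why the PMM-imputed columns above no longer show values such as 0.1608667.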
missRanger() offers many options. How would we use one feature per split (mtry = 1) with 200 trees?
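One possible answer is sketched below. It assumes that num.trees and mtry are accepted by missRanger() and handed on to ranger(); the object name imp_mtry is chosen here for illustration only.

```r
library(missRanger)

set.seed(3)
iris_NA <- generateNA(iris, p = 0.1)

# 200 trees per forest, one candidate feature per split
# (assumption: both arguments are forwarded to ranger::ranger())
imp_mtry <- missRanger(iris_NA, num.trees = 200, mtry = 1, verbose = 0)
head(imp_mtry)
```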
Setting data_only = FALSE (or keep_forests = TRUE) returns a “missRanger” object. With keep_forests = TRUE, this allows for out-of-sample applications:
imp <- missRanger(
iris_NA, pmm.k = 5, num.trees = 100, keep_forests = TRUE, verbose = 0
)
imp
#> missRanger object. Extract imputed data via $data
#> - best iteration: 3
#> - best average OOB imputation error: 0.1468982
summary(imp)
#> missRanger object. Extract imputed data via $data
#> - best iteration: 3
#> - best average OOB imputation error: 0.1468982
#>
#> Sequence of OOB prediction errors:
#>
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> [1,] 1.0000000 1.1108502 0.39671941 0.18322253 0.06666667
#> [2,] 0.2224743 0.5371919 0.06000731 0.05568752 0.03703704
#> [3,] 0.1732113 0.4517314 0.02408501 0.05583381 0.02962963
#> [4,] 0.1796650 0.4715697 0.02106975 0.05502143 0.03703704
#>
#> Mean performance per iteration:
#> [1] 0.5514918 0.1824796 0.1468982 0.1528726
#>
#> First rows of imputed data:
#>
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
# Out-of-sample application
# saveRDS(imp, file = "imputation_model.rds")
# imp <- readRDS("imputation_model.rds")
predict(imp, head(iris_NA))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.1 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
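The "best iteration" reported by summary(imp) can be recomputed by hand from the printed OOB errors. The sketch below copies the matrix from the summary output above and assumes that the best iteration is simply the one with the lowest mean OOB error across variables; the object names oob, mean_err, and best are illustrative.

```r
# OOB errors per iteration (rows) and variable (columns),
# copied from the summary(imp) output above
oob <- rbind(
  c(1.0000000, 1.1108502, 0.39671941, 0.18322253, 0.06666667),
  c(0.2224743, 0.5371919, 0.06000731, 0.05568752, 0.03703704),
  c(0.1732113, 0.4517314, 0.02408501, 0.05583381, 0.02962963),
  c(0.1796650, 0.4715697, 0.02106975, 0.05502143, 0.03703704)
)
colnames(oob) <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")

mean_err <- rowMeans(oob)   # average OOB error per iteration
best <- which.min(mean_err) # iteration with the lowest average error
round(mean_err, 7)
best
```

Iteration 4 has a higher mean error than iteration 3, so the algorithm stops there and the imputations of iteration 3 are returned.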
By default, missRanger() uses all columns to impute all columns with missing values.
This can be modified by passing a formula: the left-hand side specifies the variables to be imputed, while the right-hand side lists the variables used for imputation.
# Impute all variables with all (default)
m <- missRanger(iris_NA, formula = . ~ ., pmm.k = 5, num.trees = 100, verbose = 0)
# Don't use Species for imputation
m <- missRanger(iris_NA, . ~ . - Species, pmm.k = 5, num.trees = 100, verbose = 0)
# Impute Sepal.Length by Species (or not?)
m <- missRanger(iris_NA, Sepal.Length ~ Species, pmm.k = 5, num.trees = 100)
#>
#> Variables to impute: Sepal.Length
#> Variables used to impute:
#>
#> iter 1
#>   |======================================================================| 100%
head(m)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 NA setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 <NA>
#> 5 6.2 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 <NA>
# Only univariate imputation was done! Why? Because Species contains missing values
# itself and needs to appear on the LHS as well:
m <- missRanger(iris_NA, Sepal.Length + Species ~ Species, pmm.k = 5, num.trees = 100)
#>
#> Variables to impute: Sepal.Length, Species
#> Variables used to impute: Species
#>
#> iter 1
#>   |======================================================================| 100%
head(m)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 NA setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 versicolor
#> 5 6.5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 versicolor
# Impute all variables univariately
m <- missRanger(iris_NA, . ~ 1, verbose = 0)
missRanger() fits a random forest per variable and iteration. Thus, imputation can take a long time. Some tweaks:
- num.trees = 100
- max.depth = 6
- min.node.size = 100
- sample.fraction = 0.2
- maxiter = 3
The first three items also help to greatly reduce the size of the models, which might become relevant in out-of-sample applications with keep_forests = TRUE.
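As a rough illustration, the tweaks can be combined in a single call. The sketch below makes two assumptions: the iteration cap is the maxiter argument of missRanger(), and the remaining arguments are passed on to ranger(). The node size is scaled down from 100 to 10 because iris has only 150 rows; imp_fast is an illustrative name.

```r
library(missRanger)

set.seed(3)
iris_NA <- generateNA(iris, p = 0.1)

imp_fast <- missRanger(
  iris_NA,
  num.trees = 100,        # fewer trees per forest
  max.depth = 6,          # shallow trees
  min.node.size = 10,     # larger leaves (scaled down for this small data set)
  sample.fraction = 0.2,  # each tree sees 20% of the rows
  maxiter = 3,            # at most three iterations (assumed argument name)
  verbose = 0
)
head(imp_fast)
```

Whether the loss in imputation quality is acceptable depends on the data, so it seems sensible to compare the OOB errors with and without these settings.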
case.weights: reducing the impact of rows with many missings

Using the case.weights argument, you can pass case weights to the imputation models. For instance, this makes it possible to reduce the contribution of rows with many missing values:
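A minimal sketch of this idea, assuming case.weights accepts one non-negative weight per row of the data: each row is weighted by its number of non-missing values, so sparse rows contribute less to the forests. The names non_miss and m_w, and the small settings num.trees = 20 and pmm.k = 3, are chosen here for illustration.

```r
library(missRanger)

set.seed(3)
iris_NA <- generateNA(iris, p = 0.1)

# Weight of a row = number of non-missing values in that row
non_miss <- rowSums(!is.na(iris_NA))
table(non_miss)

# Rows with many missing values get lower weight in the imputation models
m_w <- missRanger(
  iris_NA,
  num.trees = 20,
  pmm.k = 3,
  case.weights = non_miss,
  verbose = 0
)
head(m_w)
```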