mice                  package:mice                  R Documentation

_M_u_l_t_i_v_a_r_i_a_t_e _I_m_p_u_t_a_t_i_o_n _b_y _C_h_a_i_n_e_d _E_q_u_a_t_i_o_n_s

_D_e_s_c_r_i_p_t_i_o_n:

     Produces an object of class "mids", which stands for 'multiply
     imputed data set'.

_U_s_a_g_e:

     mice(data, m = 5, 
         imputationMethod = vector("character",length=ncol(data)), 
         predictorMatrix = (1 - diag(1, ncol(data))),
         visitSequence = (1:ncol(data))[apply(is.na(data),2,any)], 
         defaultImputationMethod=c("pmm","logreg","polyreg"),
         maxit = 5, 
         diagnostics = TRUE, 
         printFlag = TRUE,
         seed = NA)

_A_r_g_u_m_e_n_t_s:

    data: A data frame or a matrix containing the incomplete data.
          Missing values are coded as NA's.

       m: Number of multiple imputations. If omitted, m=5 is used.

imputationMethod: Can be either a string, or a vector of strings with
          length ncol(data), specifying the elementary imputation
          method to be used for each column in data. If specified as a
          single string, the given method will be used for all columns.
          The default imputation method (when no argument is specified)
          depends on the measurement level of the target column and is
          specified by the 'defaultImputationMethod' argument. Columns
          that need not be imputed have method '""'. See Details for
          more information.

predictorMatrix: A square matrix of size 'ncol(data)' containing 0/1
          data specifying the set of predictors to be used for each
          target column. Rows correspond to target variables (i.e.
          variables to be imputed), in the sequence as they appear in
          data. A value of '1' means that the column variable is used
          as a predictor for the target variable (in the rows). The
          diagonal of 'predictorMatrix' must be zero. The default for
          'predictorMatrix' is that all other columns are used as
          predictors (sometimes called massive imputation).

visitSequence: A vector of integers of arbitrary length, specifying the
          column indices of the visiting sequence. The visiting
          sequence is the column order that is used to impute the data
          during one iteration of the algorithm. A column may be
          visited more than once. All incomplete columns that are used
          as predictors should be visited, or else the function will
          stop with an error. The default sequence visits the
          incomplete columns from left to right.

defaultImputationMethod: A vector of three strings containing the
          default imputation methods for numerical columns, factor
          columns with 2 levels, and factor columns with more than two
          levels, respectively. If nothing is specified, the following
          defaults will be used: 'pmm', predictive mean matching
          (numeric data); 'logreg', logistic regression imputation
          (binary data, factor with 2 levels); 'polyreg', polytomous
          regression imputation (categorical data, factor with more
          than 2 levels).

   maxit: A scalar giving the number of iterations. The default is 5.

diagnostics: A Boolean flag. If 'TRUE', diagnostic information will be
          appended to the value of the function. If 'FALSE', only the
          imputed data are saved. The default is 'TRUE'.

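printFlag: A Boolean flag. If 'TRUE', progress information is printed
          to the console during imputation. The default is 'TRUE'.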

    seed: An integer between 0 and 1000 that is used by the set.seed
          function for offsetting the random number generator. Default
          is to leave the random number generator alone.
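
     The default values of 'predictorMatrix' and 'visitSequence' can be
     inspected outside the function. The sketch below evaluates the
     default expressions from the Usage section on a small, made-up
     data frame and then removes one predictor. The data and the row
     and column names are illustrative assumptions; the names are
     added only to make the printed matrix easier to read.

```r
# Toy data frame (hypothetical) with missing values in columns 2 and 3.
data <- data.frame(
  age = c(1, 2, 3, 1, 2),
  bmi = c(22.1, NA, 27.5, NA, 24.9),
  chl = c(187, 199, NA, 204, 218)
)

# Default predictorMatrix: every other column predicts each target (row).
predictorMatrix <- 1 - diag(1, ncol(data))
dimnames(predictorMatrix) <- list(names(data), names(data))

# Restrict the predictors of 'chl' to 'age' only.
predictorMatrix["chl", "bmi"] <- 0

# Default visitSequence: indices of the incomplete columns, left to right.
visitSequence <- (1:ncol(data))[apply(is.na(data), 2, any)]

print(predictorMatrix)
print(visitSequence)   # columns 2 and 3 ('bmi' and 'chl')
```

     Both objects could then be passed to the function through the
     arguments of the same name.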

_D_e_t_a_i_l_s:

     Generates multiple imputations for incomplete multivariate data by
     Gibbs Sampling. Missing data can occur anywhere in the data. The
     algorithm imputes an incomplete column (the target column) by
     generating oappropriate imputation values given other columns in
     the data. Each incomplete column must act as a target column, and
     has its own specific set of predictors. The default predictor set
     consists of all other columns in the data. For predictors that are
     incomplete themselves, the most recently generated imputations are
     used to complete the predictors prior to imputation of the target
     column. 
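
     The iteration described above can be sketched in base R. This is
     a simplified illustration under stated assumptions, not the
     package's own code: the data are simulated, every column is
     numeric, and each target is imputed by an ordinary linear
     regression prediction plus normal noise rather than by one of the
     built-in methods.

```r
set.seed(1)

# Hypothetical toy data: one complete and two incomplete numeric columns.
n <- 50
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)
x3 <- x1 - x2 + rnorm(n)
data <- data.frame(x1 = x1, x2 = x2, x3 = x3)
data$x2[sample(n, 10)] <- NA
data$x3[sample(n, 10)] <- NA

r <- !is.na(data)                    # TRUE where a value is observed
visit <- which(apply(!r, 2, any))    # incomplete columns, left to right

# Start by filling each missing entry with a randomly drawn observed value.
filled <- data
for (j in visit) {
  filled[!r[, j], j] <- sample(data[r[, j], j], sum(!r[, j]), replace = TRUE)
}

# Iterate: re-impute each target from the current values of the other columns.
for (it in 1:5) {
  for (j in visit) {
    target <- names(filled)[j]
    predictors <- setdiff(names(filled), target)
    # Fit on rows where the target is observed, using completed predictors.
    fit <- lm(reformulate(predictors, response = target),
              data = filled[r[, j], ])
    pred <- predict(fit, newdata = filled[!r[, j], ])
    # Add residual noise so imputations are draws, not fitted values.
    filled[!r[, j], j] <- pred + rnorm(length(pred), sd = summary(fit)$sigma)
  }
}
```

     Repeating the whole procedure m times from different starting
     draws would give m completed data sets.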

     A separate univariate imputation model can be specified for each
     column. The default imputation method depends on the measurement
     level of the target column. In addition to these, several other
     methods are provided. Users may also write their own imputation
     functions, and call these from within the algorithm. 

     In some cases, an imputation model may need transformed data in
     addition to the original data (e.g. log or quadratic transforms).
     In order to maintain consistency among different transformations
     of the same data, the function has a special built-in method using
     the '~' mechanism. This method can be used to ensure that a data
     transform always depends on the most recently generated
     imputations in the untransformed (active) column.  

     The data may contain categorical variables that are used in
     regressions on other variables. The algorithm creates dummy
     variables for the categories of these variables, and imputes these
     from the corresponding categorical variable.
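
     As an illustration of dummy coding, the sketch below expands a
     factor into 0/1 columns with 'model.matrix'. The variable 'edu'
     is made up, and the treatment coding shown is R's default for
     unordered factors; whether the function uses exactly this coding
     internally is an assumption.

```r
# Hypothetical categorical predictor with three levels.
d <- data.frame(edu = factor(c("low", "mid", "high", "mid"),
                             levels = c("low", "mid", "high")))

# Treatment-coded dummies: one 0/1 column per non-reference level.
# The first column of the model matrix (the intercept) is dropped.
dummies <- model.matrix(~ edu, data = d)[, -1, drop = FALSE]
print(dummies)
```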

     Built-in imputation methods are:

     _n_o_r_m Bayesian linear regression (Numeric)

     _p_m_m Predictive mean matching (Numeric)   

     _m_e_a_n Unconditional mean imputation (Numeric)

     _l_o_g_r_e_g Logistic regression (2 categories)        

     _l_o_g_r_e_g_2 Logistic regression (direct minimization)(2 categories)

     _p_o_l_y_r_e_g Polytomous logistic regression (>= 2 categories)

     _l_d_a Linear discriminant analysis (>= 2 categories)        

     _s_a_m_p_l_e Random sample from the observed values (Any)

     _S_p_e_c_i_a_l _m_e_t_h_o_d If the first character of the elementary method is
          a '~', then the string is interpreted as the formula argument
          in a call to 'model.frame(formula, data[!r[,j],])'. This
          provides a simple mechanism for specifying a large variety of
          dependencies among the variables. For example transformed
          versions of imputed variables, recodes, interactions, sum
          scores, and so on, that may themselves be needed in other
          parts of the algoritm, can be specified in this way. Note
          that the '~' mechanism works only on those entries which have
          missing values in the target column. The user should make
          sure that the combined observed and imputed parts of the
          target column make sense. One easy way to create consistency
          is by coding all entries in the target as 'NA', but for large
          data sets, this could be inefficient. Moreover, this will not
          work in S-Plus 4.5. Though not strictly needed, it is often
          useful to specify 'visitSequence' such that the column that
          is imputed by the '~' mechanism is visited each time after
          one of its predictors was visited. In that way, deterministic
          relation between columns will always be synchronized.
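
     The 'model.frame' call quoted above can be tried on its own. The
     sketch below is an illustration under assumptions, not the
     function's internal code: the data and the column 'bmi.sq' are
     made up, and 'r' is taken to be a logical matrix that is 'TRUE'
     where a value is observed.

```r
# Hypothetical data: 'bmi.sq' should always equal the square of 'bmi'.
data <- data.frame(bmi = c(20, 25, 30, 22),
                   bmi.sq = c(400, NA, NA, 484))
r <- !is.na(data)   # TRUE where observed (assumed meaning of 'r')
j <- 2              # target column handled by the '~' method

# Method string as it might be given for column j.
method <- "~I(bmi^2)"

# Evaluate the formula on the rows where the target is missing.
frame <- model.frame(as.formula(method), data[!r[, j], ])

# Fill only the missing entries of the target column.
data[!r[, j], j] <- as.numeric(frame[[1]])
print(data)
```

     Only rows 2 and 3 are recomputed here, which is why the observed
     entries of the target must already be consistent with the
     formula.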

     For example, for the j'th column, the 'impute.norm' function that
     implements the Bayesian linear regression method can be called by
     specifying the string "norm" as the j'th entry of
     'imputationMethod'.

     The user can write his or her own imputation function, say
     'impute.myfunc', and call it for all columns by specifying
     'imputationMethod="myfunc"', or for specific columns by specifying
     'imputationMethod=c("norm","myfunc",...)'.
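
     A user-written method might look like the sketch below. The
     argument list is an assumption, not taken from this page: 'y' is
     the target column, 'ry' is 'TRUE' where 'y' is observed, and 'x'
     holds the predictors. The function is assumed to return one
     imputed value per missing entry of 'y'.

```r
# Hypothetical user-written method; the (y, ry, x) interface is assumed.
impute.myfunc <- function(y, ry, x) {
  # Draw one value per missing entry from the observed values of y.
  # The predictors in x are ignored by this simple method.
  obs <- y[ry]
  obs[sample.int(length(obs), sum(!ry), replace = TRUE)]
}

# Stand-alone check on a toy vector.
set.seed(123)
y  <- c(4.1, NA, 5.3, NA, 6.0)
ry <- !is.na(y)
imp <- impute.myfunc(y, ry, x = NULL)
print(imp)
```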

     Side effects: Some elementary imputation methods require access
     to the nnet or MASS libraries of Venables & Ripley. Where needed,
     these libraries will be attached.

_V_a_l_u_e:

     An object of class mids, which stands for 'multiply imputed data
     set'. For  a description of the object, see the documentation on
     'mids'.

_A_u_t_h_o_r(_s):

     Stef van Buuren, Karin Oudshoorn, 2000

_R_e_f_e_r_e_n_c_e_s:

     Van Buuren, S. and Oudshoorn, C.G.M.. (1999). Flexible
     multivariate imputation by MICE. Report PG/VGZ/99.054, TNO
     Prevention and Health, Leiden. 

     Van Buuren, S. & Oudshoorn, C.G.M. (2000). Multivariate Imputation
     by Chained Equations:   MICE V1.0 User's manual. Report
     PG/VGZ/00.038, TNO Prevention and Health, Leiden. 

     Van Buuren, S., Boshuizen, H.C. and Knook, D.L. (1999). Multiple
     imputation of missing blood pressure covariates in survival
     analysis. Statistics in Medicine, 18, 681-694. 

     Brand, J.P.L. (1999). Development, implementation and evaluation
     of multiple imputation strategies for the statistical analysis of
     incomplete data sets. Dissertation, TNO Prevention and Health,
     Leiden and Erasmus University, Rotterdam.

_S_e_e _A_l_s_o:

     'complete', 'mids', 'lm.mids', 'set.seed'

_E_x_a_m_p_l_e_s:

     data(nhanes)
     imp <- mice(nhanes)     # do default multiple imputation on a numeric matrix
     imp
     imp$imputations$bmi     # and list the actual imputations 
     complete(imp)       # show the first completed data matrix
     lm.mids(chl~age+bmi+hyp, imp)   # repeated linear regression on imputed data

     data(nhanes2)
     # imputation on mixed data with a different method per column
     mice(nhanes2, imputationMethod=c("sample","pmm","logreg","norm"))

