--- title: "Matching Methods" author: "Noah Greifer" date: "`r Sys.Date()`" output: html_vignette: toc: true vignette: > %\VignetteIndexEntry{Matching Methods} %\VignetteEngine{knitr::rmarkdown_notangle} %\VignetteEncoding{UTF-8} bibliography: references.bib link-citations: true --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) options(width = 200) ``` ## Introduction `MatchIt` implements several matching methods with a variety of options. Though the help pages for the individual methods describes each method and how they can be used, this vignette provides a broad overview of the available matching methods and their associated options. The choice of matching method depends on the goals of the analysis (e.g., the estimand, whether low bias or high precision is important) and the unique qualities of each dataset to be analyzed, so there is no single optimal choice for any given analysis. A benefit of nonparametric preprocessing through matching is that a number of matching methods can be tried and their quality assessed without consulting the outcome, reducing the possibility of capitalizing on chance while allowing for the benefits of an exploratory analysis in the design phase [@ho2007]. This vignette describes each matching method available in `MatchIt` and the various options that are allowed with matching methods and the consequences of their use. For a brief introduction to the use of `MatchIt` functions, see `vignette("MatchIt")`. For details on how to assess and report covariate balance, see `vignette("assessing-balance")`. For details on how to estimate treatment effects and standard errors after matching, see `vignette("estimating-effects")`. ## Matching Matching as implemented in `MatchIt` is a form of *subset selection*, that is, the pruning and weighting of units to arrive at a (weighted) subset of the units from the original dataset. Ideally, and if done successfully, subset selection produces a new sample where the treatment is unassociated with the covariates so that a comparison of the outcomes treatment and control groups is not confounded by the measured and balanced covariates. Although statistical estimation methods like regression can also be used to remove confounding due to measured covariates, @ho2007 argue that fitting regression models in matched samples reduces the dependence of the validity of the estimated treatment effect on the correct specification of the model. Matching is nonparametric in the sense that the estimated weights and pruning of the sample are not direct functions of estimated model parameters but rather depend on the organization of discrete units in the sample; this is in contrast to propensity score weighting (also known as inverse probability weighting), where the weights come more directly from the estimated propensity score model and therefore are more sensitive to its correct specification. These advantages, as well as the intuitive understanding of matching by the public compared to regression or weighting, make it a robust and effective way to estimate treatment effects. It is important to note that this implementation of matching differs from the methods described by Abadie and Imbens [-@abadie2006; -@abadie2016] and implemented in the `Matching` R package and `teffects` routine in Stata. That form of matching is *matching imputation*, where the missing potential outcomes for each unit are imputed using the observed outcomes of paired units. This is a critical distinction because matching imputation is a specific estimation method with its own effect and standard error estimators, in contrast to subset selection, which is a preprocessing method that does not require specific estimators and is broadly compatible with other parametric and nonparametric analyses. The benefits of matching imputation are that its theoretical properties (i.e., the rate of convergence and asymptotic variance of the estimator) are well understood, it can be used in a straightforward way to estimate not just the average treatment effect in the treated (ATT) but also the average treatment effect in the population (ATE), and additional effective matching methods can be used in the imputation (e.g., kernel matching). The benefits of matching as nonparametric preprocessing are that it is far more flexible with respect to the types of effects that can be estimated because it does not involve any specific estimator, its empirical and finite-sample performance has been examined in depth and is generally well understood, and it aligns well with the design of experiments, which are more familiar to non-technical audiences. In addition to subset selection, matching often (though not always) involves a form of *stratification*, the assignment of units to pairs or strata containing multiple units. The distinction between subset selection and stratification is described by @zubizarreta2014, who separate them into two separate steps. In `MatchIt`, with almost all matching methods, subset selection is performed by stratification; for example, treated units are paired with control units, and unpaired units are then dropped from the matched sample. With some methods, subclasses are used to assign matching or stratification weights to individual units, which increase or decrease each unit's leverage in a subsequent analysis. There has been some debate about the importance of stratification after subset selection; while some authors have argued that, with some forms of matching, pair membership is incidental [@stuart2008; @schafer2008], others have argued that correctly incorporating pair membership into effect estimation can improve the quality of inferences [@austin2014a; @wan2019]. For methods that allow it, `MatchIt` includes stratum membership as an additional output of each matching specification. How these strata can be used is detailed in `vignette("estimating-effects")`. At the heart of `MatchIt` are three classes of methods: distance matching, stratum matching, and pure subset selection. *Distance matching* involves considering a focal group (usually the treated group) and selecting members of the non-focal group (i.e., the control group) to pair with each member of the focal group based on the *distance* between units, which can be computed in one of several ways. Members of either group that are not paired are dropped from the sample. Nearest neighbor matching (`method = "nearest"`), optimal pair matching (`method = "optimal"`), optimal full matching (`method = "full"`), generalized full matching (`method = "quick"`), and genetic matching (`method = "genetic"`) are the methods of distance matching implemented in `MatchIt`. Typically, only the average treatment in the treated (ATT) or average treatment in the control (ATC), if the control group is the focal group, can be estimated after distance matching in `MatchIt` (full matching is an exception, described later). *Stratum matching* involves creating strata based on unique values of the covariates and assigning units with those covariate values into those strata. Any units that are in strata that lack either treated or control units are then dropped from the sample. Strata can be formed using the raw covariates (`method = "exact"`), coarsened versions of the covariates (`method = "cem"`), or coarsened versions of the propensity score (`method = "subclass"`). When no units are discarded, either the ATT, ATC, or ATE can be estimated after stratum matching, though often some units are discarded, especially with exact and coarsened exact matching, making the estimand less clear. For use in estimating marginal treatment effects after exact matching, stratification weights are computed for the matched units first by computing a new "stratum propensity score" for each unit, which is the proportion of treated units in its stratum. The formulas for computing inverse probability weights from standard propensity scores are then applied to the new stratum propensity scores to form the new weights. Pure subset selection involves selecting a subset of units form the original sample without considering the distance between individual units or strata that units might fall into. Subsets are selected to optimize a criterion subject to constraint on balance and remaining sample size. Cardinality and profile matching (`method = "cardinality"`) are the methods of pure subset selection implemented in `MatchIt`. Both methods allow the user to specify the largest imbalance allowed in the resulting matched sample, and an optimization routine attempts to find the largest matched sample that satisfies those balance constraints. While cardinality matching does not target a specific estimand, profile matching can be used to target the ATT, ATC, or ATE. Below, we describe each of the matching methods implemented in `MatchIt`. ## Matching Methods ### Nearest Neighbor Matching (`method = "nearest"`) Nearest neighbor matching is also known as greedy matching. It involves running through the list of treated units and selecting the closest eligible control unit to be paired with each treated unit. It is greedy in the sense that each pairing occurs without reference to how other units will be or have been paired, and therefore does not aim to optimize any criterion. Nearest neighbor matching is the most common form of matching used [@thoemmes2011; @zakrison2018] and has been extensively studied through simulations. See `?method_nearest` for the documentation for `matchit()` with `method = "nearest"`. Nearest neighbor matching requires the specification of a distance measure to define which control unit is closest to each treated unit. The default and most common distance is the *propensity score difference*, which is the difference between the propensity scores of each treated and control unit [@stuart2010]. Another popular distance is the Mahalanobis distance, described in the section "Mahalanobis distance matching" below. The order in which the treated units are to be paired must also be specified and has the potential to change the quality of the matches [@austin2013b; @rubin1973]; this is specified by the `m.order` argument. With propensity score matching, the default is to go in descending order from the highest propensity score; doing so allows the units that would have the hardest time finding close matches to be matched first [@rubin1973]. Other orderings are possible, including random ordering, which can be tried multiple times until an adequate matched sample is found. When matching with replacement (i.e., where each control unit can be reused to be matched with any number of treated units), the matching order doesn't matter. When using a matching ratio greater than 1 (i.e., when more than 1 control units are requested to be matched to each treated unit), matching occurs in a cycle, where each treated unit is first paired with one control unit, and then each treated unit is paired with a second control unit, etc. Ties are broken deterministically based on the order of the units in the dataset to ensure that multiple runs of the same specification yield the same result (unless the matching order is requested to be random). Nearest neighbor matching is implemented in `MatchIt` using internal C++ code through `Rcpp`. When matching on a propensity score, this makes matching extremely fast, even for large datasets. Using a caliper on the propensity score (described below) makes it even faster. Run times may be a bit longer when matching on other distance measures (e.g., the Mahalanobis distance). In contrast to optimal pair matching (described below), nearest neighbor matching does not require computing the full distance matrix between units, which makes it more applicable to large datasets. ### Optimal Pair Matching (`method = "optimal"`) Optimal pair matching (often just called optimal matching) is very similar to nearest neighbor matching in that it attempts to pair each treated unit with one or more control units. Unlike nearest neighbor matching, however, it is "optimal" rather than greedy; it is optimal in the sense that it attempts to choose matches that collectively optimize an overall criterion [@hansen2006; @gu1993]. The criterion used is the sum of the absolute pair distances in the matched sample. See `?method_optimal` for the documentation for `matchit()` with `method = "optimal"`. Optimal pair matching in `MatchIt` depends on the `fullmatch()` function in the `optmatch` package [@hansen2006]. Like nearest neighbor matching, optimal pair matching requires the specification of a distance measure between units. Optimal pair matching can be thought of simply as an alternative to selecting the order of the matching for nearest neighbor matching. Optimal pair matching and nearest neighbor matching often yield the same or very similar matched samples; indeed, some research has indicated that optimal pair matching is not much better than nearest neighbor matching at yielding balanced matched samples [@austin2013b]. The `tol` argument in `fullmatch()` can be supplied to `matchit()` with `method = "optimal"`; this controls the numerical tolerance used to determine whether the optimal solution has been found. The default is fairly high and, for smaller problems, should be set much lower (e.g., by setting `tol = 1e-7`). ### Optimal Full Matching (`method = "full"`) Optimal full matching (often just called full matching) assigns every treated and control unit in the sample to one subclass each [@hansen2004; @stuart2008a]. Each subclass contains one treated unit and one or more control units or one control units and one or more treated units. It is optimal in the sense that the chosen number of subclasses and the assignment of units to subclasses minimize the sum of the absolute within-subclass distances in the matched sample. Weights are computed based on subclass membership, and these weights then function like propensity score weights and can be used to estimate a weighted treatment effect, ideally free of confounding by the measured covariates. See `?method_full` for the documentation for `matchit()` with `method = "full"`. Optimal full matching in `MatchIt` depends on the `fullmatch()` function in the `optmatch` package [@hansen2006]. Like the other distance matching methods, optimal full matching requires the specification of a distance measure between units. It can be seen a combination of distance matching and stratum matching: subclasses are formed with varying numbers of treated and control units, as with stratum matching, but the subclasses are formed based on minimizing within-pair distances and do not involve forming strata based on any specific variable, similar to distance matching. Unlike other distance matching methods, full matching can be used to estimate the ATE. Full matching can also be seen as a form of propensity score weighting that is less sensitive to the form of the propensity score model because the original propensity scores are used just to create the subclasses, not to form the weights directly [@austin2015a]. In addition, full matching does not have to rely on estimated propensity scores to form the subclasses and weights; other distance measures are allowed as well. Although full matching uses all available units, there is a loss in precision due to the weights. Units may be weighted in such a way that they contribute less to the sample than would unweighted units, so the effective sample size (ESS) of the full matching weighted sample may be lower than even that of 1:1 pair matching. Balance is often far better after full matching than it is with 1:k matching, making full matching a good option to consider especially when 1:k matching is not effective or when the ATE is the target estimand. The specification of the full matching optimization problem can be customized by supplying additional arguments that are passed to `optmatch::fullmatch()`, such as `min.controls`, `max.controls`, `mean.controls`, and `omit.fraction`. As with optimal pair matching, the numerical tolerance value can be set much lower than the default with small problems by setting, e.g., `tol = 1e-7`. ### Generalized Full Matching (`method = "quick"`) Generalized full matching is a variant of full matching that uses a special fast clustering algorithm to dramatically speed up the matching, even for large datasets [@savjeGeneralizedFullMatching2021]. Like with optimal full matching, generalized full matching assigns every unit to a subclass. What makes generalized full match "generalized" is that the user can customize the matching in a number of ways, such as by specifying an arbitrary minimum number of units from each treatment group or total number of units per subclass, or by allowing not all units from a treatment group to have to be matched. Generalized full matching minimizes the largest within-subclass distances in the matched sample, but it does so in a way that is not completely optimal (though the solution is often very close to the optimal solution). Matching weights are computed based on subclass membership, and these weights then function like propensity score weights and can be used to estimate a weighted treatment effect, ideally free of confounding by the measured covariates. See `?method_quick` for the documentation for `matchit()` with `method = "quick"`. Generalized full matching in `MatchIt` depends on the `quickmatch()` function in the `quickmatch` package [@savjeQuickmatchQuickGeneralized2018]. Generalized full matching includes different options for customization than optimal full matching. The user cannot supply their own distance matrix, but propensity scores and distance metrics that are computed from the supplied covariates (e.g., Mahalanobis distance) are allowed. Calipers can only be placed on the propensity score, if supplied. As with optimal full matching, generalized full matching can target the ATE. Matching performance tends to be similar between the two methods, but generalized full matching will be much quicker and can accommodate larger datasets, making it a good substitute. Generalized full matching is often faster than even nearest neighbor matching, especially for large datasets. ### Genetic Matching (`method = "genetic"`) Genetic matching is less a specific form of matching and more a way of specifying a distance measure for another form of matching. In practice, though, the form of matching used is nearest neighbor pair matching. Genetic matching uses a genetic algorithm, which is an optimization routine used for non-differentiable objective functions, to find scaling factors for each variable in a generalized Mahalanobis distance formula [@diamond2013]. The criterion optimized by the algorithm is one based on covariate balance. Once the scaling factors have been found, nearest neighbor matching is performed on the scaled generalized Mahalanobis distance. See `?method_genetic` for the documentation for `matchit()` with `method = "genetic"`. Genetic matching in `MatchIt` depends on the `GenMatch()` function in the `Matching` package [@sekhon2011] to perform the genetic search and uses the `Match()` function to perform the nearest neighbor match using the scaled generalized Mahalanobis distance. Genetic matching considers the generalized Mahalanobis distance between a treated unit $i$ and a control unit $j$ as $$\delta_{GMD}(\mathbf{x}_i,\mathbf{x}_j, \mathbf{W})=\sqrt{(\mathbf{x}_i - \mathbf{x}_j)'(\mathbf{S}^{-1/2})'\mathbf{W}(\mathbf{S}^{-1/2})(\mathbf{x}_i - \mathbf{x}_j)}$$ where $\mathbf{x}$ is a $p \times 1$ vector containing the value of each of the $p$ included covariates for that unit, $\mathbf{S}^{-1/2}$ is the Cholesky decomposition of the covariance matrix $\mathbf{S}$ of the covariates, and $\mathbf{W}$ is a diagonal matrix with scaling factors $w$ on the diagonal: $$ \mathbf{W}=\begin{bmatrix} w_1 & & & \\ & w_2 & & \\ & & \ddots &\\ & & & w_p \\ \end{bmatrix} $$ When $w_k=1$ for all covariates $k$, the computed distance is the standard Mahalanobis distance between units. Genetic matching estimates the optimal values of the $w_k$s, where a user-specified criterion is used to define what is optimal. The default is to maximize the smallest p-value among balance tests for the covariates in the matched sample (both Kolmogorov-Smirnov tests and t-tests for each covariate). In `MatchIt`, if a propensity score is specified, the default is to include the propensity score and the covariates in $\mathbf{x}$ and to optimize balance on the covariates. When `distance = "mahalanobis"` or the `mahvars` argument is specified, the propensity score is left out of $\mathbf{x}$. In all other respects, genetic matching functions just like nearest neighbor matching except that the matching itself is carried out by `Matching::Match()` instead of by `MatchIt`. When using `method = "genetic"` in `MatchIt`, additional arguments passed to `Matching::GenMatch()` to control the genetic search process should be specified; in particular, the `pop.size` argument should be increased from its default of 100 to a much higher value. Doing so will make the algorithm take more time to finish but will generally improve the quality of the resulting matches. Different functions can be supplied to be used as the objective in the optimization using the `fit.func` argument. ### Exact Matching (`method = "exact"`) Exact matching is a form of stratum matching that involves creating subclasses based on unique combinations of covariate values and assigning each unit into their corresponding subclass so that only units with identical covariate values are placed into the same subclass. Any units that are in subclasses lacking either treated or control units will be dropped. Exact matching is the most powerful matching method in that no functional form assumptions are required on either the treatment or outcome model for the method to remove confounding due to the measured covariates; the covariate distributions are exactly balanced. The problem with exact matching is that in general, few if any units will remain after matching, so the estimated effect will only generalize to a very limited population and can lack precision. Exact matching is particularly ineffective with continuous covariates, for which it might be that no two units have the same value, and with many covariates, for which it might be the case that no two units have the same combination of all covariates; this latter problem is known as the "curse of dimensionality". See `?method_exact` for the documentation for `matchit()` with `method = "exact"`. It is possible to use exact matching on some covariates and another form of matching on the rest. This makes it possible to have exact balance on some covariates (typically categorical) and approximate balance on others, thereby gaining the benefits of both exact matching and the other matching method used. To do so, the other matching method should be specified in the `method` argument to `matchit()` and the `exact` argument should be specified to contain the variables on which exact matching is to be done. ### Coarsened Exact Matching (`method = "cem"`) Coarsened exact matching (CEM) is a form of stratum matching that involves first coarsening the covariates by creating bins and then performing exact matching on the new coarsened versions of the covariates [@iacus2012]. The degree and method of coarsening can be controlled by the user to manage the trade-off between exact and approximate balancing. For example, coarsening a covariate to two bins will mean that units that differ greatly on the covariate might be placed into the same subclass, while coarsening a variable to five bins may require units to be dropped due to not finding matches. Like exact matching, CEM is susceptible to the curse of dimensionality, making it a less viable solution with many covariates, especially with few units. Dropping units can also change the target population of the estimated effect. See `?method_cem` for the documentation for `matchit()` with `method = "cem"`. CEM in `MatchIt` does not depend on any other package to perform the coarsening and matching, though it used to rely on the `cem` package. ### Subclassification (`method = "subclass"`) Propensity score subclassification can be thought of as a form of coarsened exact matching with the propensity score as the sole covariate to be coarsened and matched on. The bins are usually based on specified quantiles of the propensity score distribution either in the treated group, control group, or overall, depending on the desired estimand. Propensity score subclassification is an old and well-studied method, though it can perform poorly compared to other, more modern propensity score methods such as full matching and weighting [@austin2010]. See `?method_subclass` for the documentation for `matchit()` with `method = "subclass"`. The binning of the propensity scores is typically based on dividing the distribution of covariates into approximately equally sized bins. The user specifies the number of subclasses using the `subclass` argument and which group should be used to compute the boundaries of the bins using the `estimand` argument. Sometimes, subclasses can end up with no units from one of the treatment groups; by default, `matchit()` moves a unit from an adjacent subclass into the lacking one to ensure that each subclass has at least one unit from each treatment group. The minimum number of units required in each subclass can be chosen by the `min.n` argument to `matchit()`. If set to 0, an error will be thrown if any subclass lacks units from one of the treatment groups. Moving units from one subclass to another generally worsens the balance in the subclasses but can increase precision. The default number of subclasses is 6, which is arbitrary and should not be taken as a recommended value. Although early theory has recommended the use of 5 subclasses, in general there is an optimal number of subclasses that is typically much larger than 5 but that varies among datasets [@orihara2021]. Rather than trying to figure this out for oneself, one can use optimal full matching (i.e., with `method = "full"`) or generalized full matching (`method = "quick"`) to optimally create subclasses that optimize a within-subclass distance criterion. The output of propensity score subclassification includes the assigned subclasses and the subclassification weights. Effects can be estimated either within each subclass and then averaged across them, or a single marginal effect can be estimated using the subclassification weights. This latter method has been called marginal mean weighting through subclassification [MMWS; @hong2010] and fine stratification weighting [@desai2017]. It is also implemented in the `WeightIt` package. ### Cardinality and Profile Matching (`method = "cardinality"`) Cardinality and profile matching are pure subset selection methods that involve selecting a subset of the original sample without considering the distance between individual units or assigning units to pairs or subclasses. They can be thought of as a weighting method where the weights are restricted to be zero or one. Cardinality matching involves finding the largest sample that satisfies user-supplied balance constraints and constraints on the ratio of matched treated to matched control units [@zubizarretaMatchingBalancePairing2014]. It does not consider a specific estimand and can be a useful alternative to matching with a caliper for handling data with little overlap [@visconti2018]. Profile matching involves identifying a target distribution (e.g., the full sample for the ATE or the treated units for the ATT) and finding the largest subset of the treated and control groups that satisfy user-supplied balance constraints with respect to that target [@cohnProfileMatchingGeneralization2021]. See `?method_cardinality` for the documentation for using `matchit()` with `method = "cardinality"`, including which inputs are required to request either cardinality matching or profile matching. Subset selection is performed by solving a mixed integer programming optimization problem with linear constraints. The problem involves maximizing the size of the matched sample subject to constraints on balance and sample size. For cardinality matching, the balance constraints refer to the mean difference for each covariate between the matched treated and control groups, and the sample size constraints require the matched treated and control groups to be the same size (or differ by a user-supplied factor). For profile matching, the balance constraints refer to the mean difference for each covariate between each treatment group and the target distribution; for the ATE, this requires the mean of each covariate in each treatment group to be within a given tolerance of the mean of the covariate in the full sample, and for the ATT, this requires the mean of each covariate in the control group to be within a given tolerance of the mean of the covariate in the treated group, which is left intact. The balance tolerances are controlled by the `tols` and `std.tols` arguments. One can also create pairs in the matched sample by using the `mahvars` argument, which requests that optimal Mahalanobis matching be done after subset selection; doing so can add additional precision and robustness [@zubizarretaMatchingBalancePairing2014]. The optimization problem requires a special solver to solve. Currently, the available options in `MatchIt` are the HiGHS solver (through the `highs` package), the GLPK solver (through the `Rglpk` package), the SYMPHONY solver (through the `Rsymphony` package), and the Gurobi solver (through the `gurobi` package). The differences among the solvers are in performance; Gurobi is by far the best (fastest, least likely to fail to find a solution), but it is proprietary (though has a free trial and academic license) and is a bit more complicated to install. HiGHS is the default due to being open source, easily installed, and with performance comparable to Gurobi. The `designmatch` package also provides an implementation of cardinality matching with more options than `MatchIt` offers. ## Customizing the Matching Specification In addition to the specific matching method, other options are available for many of the matching methods to further customize the matching specification. These include different specifications of the distance measure, methods to perform alternate forms of matching in addition to the main method, prune units far from other units prior to matching, restrict possible matches, etc. Not all options are compatible with all matching methods. ### Specifying the propensity score or other distance measure (`distance`) The distance measure is used to define how close two units are. In nearest neighbor matching, this is used to choose the nearest control unit to each treated unit. In optimal matching, this is used in the criterion that is optimized. By default, the distance measure is the propensity score difference, and the argument supplied to `distance` corresponds to the method of estimating the propensity score. In `MatchIt`, propensity scores are often labeled as "distance" values, even though the propensity score itself is not a distance measure. This is to reflect that the propensity score is used in creating the distance value, but other scores could be used, such as prognostic scores for prognostic score matching [@hansen2008a]. The propensity score is more like a "position" value, in that it reflects the position of each unit in the matching space, and the difference between positions is the distance between them. If the the argument to `distance` is one of the allowed methods for estimating propensity scores (see `?distance` for these values) or is a numeric vector with one value per unit, the distance between units will be computed as the pairwise difference between propensity scores or the supplied values. Propensity scores are also used in propensity score subclassification and can optionally be used in genetic matching as a component of the generalized Mahalanobis distance. For exact, coarsened exact, and cardinality matching, the `distance` argument is ignored. The default `distance` argument is `"glm"`, which estimates propensity scores using logistic regression or another generalized linear model. The `link` and `distance.options` arguments can be supplied to further specify the options for the propensity score models, including whether to use the raw propensity score or a linearized version of it (e.g., the logit of a logistic regression propensity score, which has been commonly referred to and recommended in the propensity score literature [@austin2011a; @stuart2010]). Allowable options for the propensity score model include parametric and machine learning-based models, each of which have their strengths and limitations and may perform differently depending on the unique qualities of each dataset. We recommend multiple types of models be tried to find one that yields the best balance, as there is no way to make a single recommendation that will work for all cases. The `distance` argument can also be specified as a method of computing pairwise distances from the covariates directly (i.e., without estimating propensity scores). The options include `"mahalanobis"`, `"robust_mahalanobis"`, `"euclidean"`, and `"scaled_euclidean"`. These methods compute a distance metric for a treated unit $i$ and a control unit $j$ as $$\delta(\mathbf{x}_i,\mathbf{x}_j)=\sqrt{(\mathbf{x}_i - \mathbf{x}_j)'S^{-1}(\mathbf{x}_i - \mathbf{x}_j)}$$ where $\mathbf{x}$ is a $p \times 1$ vector containing the value of each of the $p$ included covariates for that unit, $S$ is a scaling matrix, and $S^{-1}$ is the (generalized) inverse of $S$. For Mahalanobis distance matching, $S$ is the pooled covariance matrix of the covariates [@rubinBiasReductionUsing1980]; for Euclidean distance matching, $S$ is the identity matrix (i.e., no scaling); and for scaled Euclidean distance matching, $S$ is the diagonal of the pooled covariance matrix (containing just the variances). The robust Mahalanobis distance is computed not on the covariates directly but rather on their ranks and uses a correction for ties (see @rosenbaumDesignObservationalStudies2010, ch 8). For creating close pairs, matching with these distance measures tends work better than propensity score matching because paired units will have close values on all of the covariates, whereas propensity score-paired units may be close on the propensity score but not on any of the covariates themselves. This feature was the basis of King and Nielsen's [-@king2019] warning against using propensity scores for matching. That said, they do not always outperform propensity score matching [@ripolloneImplicationsPropensityScore2018]. `distance` can also be supplied as a matrix of distance values between units. This makes it possible to use handcrafted distance matrices or distances created outside `MatchIt`. Only nearest neighbor, optimal pair, and optimal full matching allow this specification. The propensity score can have uses other than as the basis for matching. It can be used to define a region of common support, outside which units are dropped prior to matching; this is implemented by the `discard` option. It can also be used to define a caliper, the maximum distance two units can be before they are prohibited from being paired with each other; this is implemented by the `caliper` argument. To estimate or supply a propensity score for one of these purposes but not use it as the distance measure for matching (i.e., to perform Mahalanobis distance matching instead), the `mahvars` argument can be specified. These options are described below. ### Implementing common support restrictions (`discard`) The region of *common support* is the region of overlap between treatment groups. A common support restriction discards units that fall outside of the region of common support, preventing them from being matched to other units and included in the matched sample. This can reduce the potential for extrapolation and help the matching algorithms to avoid overly distant matches from occurring. In `MatchIt`, the `discard` option implements a common support restriction based on the propensity score. The argument can be supplied as `"treated"`, `"control"`, or `"both"`, which discards units in the corresponding group that fall outside the region of common support for the propensity score. The `reestimate` argument can be supplied to choose whether to re-estimate the propensity score in the remaining units. **If units from the treated group are discarded based on a common support restriction, the estimand no longer corresponds to the ATT.** ### Caliper matching (`caliper`) A *caliper* can be though of as a ring around each unit that limits to which other units that unit can be paired. Calipers are based on the propensity score or other covariates. Two units whose distance on a calipered covariate is larger than the caliper width for that covariate are not allowed to be matched to each other. Any units for which there are no available matches within the caliper are dropped from the matched sample. Calipers ensure paired units are close to each other on the calipered covariates, which can ensure good balance in the matched sample. Multiple variables can be supplied to `caliper` to enforce calipers on all of them simultaneously. Using calipers can be a good alternative to exact or coarsened exact matching to ensure only similar units are paired with each other. The `std.caliper` argument controls whether the provided calipers are in raw units or standard deviation units. When negative calipers are supplied, this forces units whose distance on the calipered covariate is *smaller* than the absolute caliper width for that covariate to be disallowed from being matched to each other. **If units from the treated group are left unmatched due to a caliper, the estimand no longer corresponds to the ATT.** ### Mahalanobis distance matching (`mahvars`) To perform Mahalanobis distance matching without the need to estimate or use a propensity score, the `distance` argument can be set to `"mahalanobis"`. If a propensity score is to be estimated or used for a different purpose, such as in a common support restriction or a caliper, but you still want to perform Mahalanobis distance matching, variables should be supplied to the `mahvars` argument. The propensity scores will be generated using the `distance` specification, and matching will occur not on the covariates supplied to the main formula of `matchit()` but rather on the covariates supplied to `mahvars`. To perform Mahalanobis distance matching within a propensity score caliper, for example, the `distance` argument should be set to the method of estimating the propensity score (e.g., `"glm"` for logistic regression), the `caliper` argument should be specified to the desired caliper width, and `mahvars` should be specified to perform Mahalanobis distance matching on the desired covariates within the caliper. `mahvars` has a special meaning for genetic matching and cardinality matching; see their respective help pages for details. ### Exact matching (`exact`) To perform exact matching on all supplied covariates, the `method` argument can be set to `"exact"`. To perform exact matching only on some covariates and some other form of matching within exact matching strata on other covariates, the `exact` argument can be used. Covariates supplied to the `exact` argument will be matched exactly, and the form of matching specified by `method` (e.g., `"nearest"` for nearest neighbor matching) will take place within each exact matching stratum. This can be a good way to gain some of the benefits of exact matching without completely succumbing to the curse of dimensionality. As with exact matching performed with `method = "exact"`, any units in strata lacking members of one of the treatment groups will be left unmatched. Note that although matching occurs within each exact matching stratum, propensity score estimation and computation of the Mahalanobis or other distance matrix occur in the full sample. **If units from the treated group are unmatched due to an exact matching restriction, the estimand no longer corresponds to the ATT.** ### Anti-exact matching (`antiexact`) Anti-exact matching adds a restriction such that a treated and control unit with same values of any of the specified anti-exact matching variables cannot be paired. This can be useful when finding comparison units outside of a unit's group, such as when matching units in one group to units in another when units within the same group might otherwise be close matches. See examples [here](https://stackoverflow.com/questions/66526115/propensity-score-matching-with-panel-data) and [here](https://stackoverflow.com/questions/61120201/avoiding-duplicates-from-propensity-score-matching?rq=1). A similar effect can be implemented by supplying negative caliper values. ### Matching with replacement (`replace`) Nearest neighbor matching and genetic matching have the option of matching with or without replacement, and this is controlled by the `replace` argument. Matching without replacement means that each control unit is matched to only one treated unit, while matching with replacement means that control units can be reused and matched to multiple treated units. Matching without replacement carries certain statistical benefits in that weights for each unit can be omitted or are more straightforward to include and dependence between units depends only on pair membership. However, it is not asymptotically consistent unless the propensity scores for all treated units are below .5 and there are many more control units than treated units [@savjeInconsistencyMatchingReplacement2022]. Special standard error estimators are sometimes required for estimating effects after matching with replacement [@austin2020a], and methods for accounting for uncertainty are not well understood for non-continuous outcomes. Matching with replacement will tend to yield better balance though, because the problem of "running out" of close control units to match to treated units is avoided, though the reuse of control units will decrease the effect sample size, thereby worsening precision [@austin2013b]. (This problem occurs in the Lalonde dataset used in `vignette("MatchIt")`, which is why nearest neighbor matching without replacement is not very effective there.) After matching with replacement, control units are assigned to more than one subclass, so the `get_matches()` function should be used instead of `match_data()` after matching with replacement if subclasses are to be used in follow-up analyses; see `vignette("estimating-effects")` for details. The `reuse.max` argument can also be used with `method = "nearest"` to control how many times each control unit can be reused as a match. Setting `reuse.max = 1` is equivalent to requiring matching without replacement (i.e., because each control can be used only once). Other values allow control units to be matched more than once, though only up to the specified number of times. Higher values will tend to improve balance at the cost of precision. ### $k$:1 matching (`ratio`) The most common form of matching, 1:1 matching, involves pairing one control unit with each treated unit. To perform $k$:1 matching (e.g., 2:1 or 3:1), which pairs (up to) $k$ control units with each treated unit, the `ratio` argument can be specified. Performing $k$:1 matching can preserve precision by preventing too many control units from being unmatched and dropped from the matched sample, though the gain in precision by increasing $k$ diminishes rapidly after 4 [@rosenbaum2020]. Importantly, for $k>1$, the matches after the first match will generally be worse than the first match in terms of closeness to the treated unit, so increasing $k$ can also worsen balance [@rassenOnetomanyPropensityScore2012]. @austin2010a found that 1:1 or 1:2 matching generally performed best in terms of mean squared error. In general, it makes sense to use higher values of $k$ while ensuring that balance is satisfactory. With nearest neighbor and optimal pair matching, variable $k$:1 matching, in which the number of controls matched to each treated unit varies, can also be used; this can have improved performance over "fixed" $k$:1 matching [@ming2000; @rassenOnetomanyPropensityScore2012]. See `?method_nearest` and `?method_optimal` for information on implementing variable $k$:1 matching. ### Matching order (`m.order`) For nearest neighbor matching (including genetic matching), units are matched in an order, and that order can affect the quality of individual matches and of the resulting matched sample. With `method = "nearest"`, the allowable options to `m.order` to control the matching order are `"largest"`, `"smallest"`, `"closest"`, `"farthest"`, `"random"`, and `"data"`. With `method = "genetic"`, all but `"closest"` and `"farthest"` can be used. Requesting `"largest"` means that treated units with the largest propensity scores, i.e., those least like the control units, will be matched first, which prevents them from having bad matches after all the close control units have been used up. `"smallest"` means that treated units with the smallest propensity scores are matched first. `"closest"` means that potential pairs with the smallest distance between units will be matched first, which ensures that the best possible matches are included in the matched sample but can yield poor matches for units whose best match is far from them; this makes it particularly useful when matching with a caliper. `"farthest"` means that closest pairs with the largest distance between them will be matched first, which ensures the hardest units to match are given the best chance to find matches. `"random"` matches in a random order, and `"data"` matches in order of the data. A propensity score is required for `"largest"` and `"smallest"` but not for the other options. @rubin1973 recommends using `"largest"` or `"random"`, though @austin2013b recommends against `"largest"` and instead favors `"closest"` or `"random"`. `"closest"` and `"smallest"` are best for prioritizing the best possible matches, while `"farthest"` and `"largest"` are best for preventing extreme pairwise distances between matched units. ## Choosing a Matching Method Choosing the best matching method for one's data depends on the unique characteristics of the dataset as well as the goals of the analysis. For example, because different matching methods can target different estimands, when certain estimands are desired, specific methods must be used. On the other hand, some methods may be more effective than others when retaining the target estimand is less important. Below we provide some guidance on choosing a matching method. Remember that multiple methods can (and should) be tried as long as the treatment effect is not estimated until a method has been settled on. The criteria on which a matching specification should be judged are balance and remaining (effective) sample size after matching. Assessing balance is described in `vignette("assessing-balance")`. A typical workflow is similar to that demonstrated in `vignette("MatchIt")`: try a matching method, and if it yields poor balance or an unacceptably low remaining sample size, try another, until a satisfactory specification has been found. It is important to assess balance broadly (i.e., beyond comparing the means of the covariates in the treated and control groups), and the search for a matching specification should not stop when a threshold is reached, but should attempt to come as close as possible to perfect balance [@ho2007]. Even if the first matching specification appears successful at reducing imbalance, there may be another specification that could reduce it even further, thereby increasing the robustness of the inference and the plausibility of an unbiased effect estimate. If the target of inference is the ATE, optimal or generalized full matching, subclassification, or profile matching can be used. If the target of inference is the ATT or ATC, any matching method may be used. When retaining the target estimand is not so important, additional options become available that involve discarding units in such a way that the original estimand is distorted. These include matching with a caliper, matching within a region of common support, cardinality matching, or exact or coarsened exact matching, perhaps on a subset of the covariates. Because exact and coarsened exact matching aim to balance the entire joint distribution of covariates, they are the most powerful methods. If it is possible to perform exact matching, this method should be used. If continuous covariates are present, coarsened exact matching can be tried. Care should be taken with retaining the target population and ensuring enough matched units remain; unless the control pool is much larger than the treated pool, it is likely some (or many) treated units will be discarded, thereby changing the estimand and possibly dramatically reducing precision. These methods are typically only available in the most optimistic of circumstances, but they should be used first when those circumstances arise. It may also be useful to combine exact or coarsened exact matching on some covariates with another form of matching on the others (i.e., by using the `exact` argument). When estimating the ATE, either subclassification, full matching, or profile matching can be used. Optimal and generalized full matching can be effective because they optimize a balance criterion, often leading to better balance. With full matching, it's also possible to exact match on some variables and match using the Mahalanobis distance, eliminating the need to estimate propensity scores. Profile matching also ensures good balance, but because units are only given weights of zero or one, a solution may not be feasible and many units may have to be discarded. For large datasets, neither optimal full matching nor profile matching may be possible, in which case generalized full matching and subclassification are faster solutions. When using subclassification, the number of subclasses should be varied. With large samples, higher numbers of subclasses tend to yield better performance; one should not immediately settle for the default (6) or the often-cited recommendation of 5 without trying several other numbers. The documentation for `cobalt::bal.compute()` contains an example of using balance to select the optimal number of subclasses. When estimating the ATT, a variety of methods can be tried. Genetic matching can perform well at achieving good balance because it directly optimizes covariate balance. With larger datasets, it may take a long time to reach a good solution (though that solution will tend to be good as well). Profile matching also will achieve good balance if a solution is feasible because balance is controlled by the user. Optimal pair matching and nearest neighbor matching without replacement tend to perform similarly to each other; nearest neighbor matching may be preferable for large datasets that cannot be handled by optimal matching. Nearest neighbor, optimal, and genetic matching allow some customizations like including covariates on which to exactly match, using the Mahalanobis distance instead of a propensity score difference, and performing $k$:1 matching with $k>1$. Nearest neighbor matching with replacement, full matching, and subclassification all involve weighting the control units with nonuniform weights, which often allows for improved balancing capabilities but can be accompanied by a loss in effective sample size, even when all units are retained. There is no reason not to try many of these methods, varying parameters here and there, in search of good balance and high remaining sample size. As previously mentioned, no single method can be recommended above all others because the optimal specification depends on the unique qualities of each dataset. When the target population is less important, for example, when engaging in treatment effect discovery or when when the sampled population is not of particular interest (e.g., it corresponds to an arbitrarily chosen hospital or school; see @mao2018 for these and other reasons why retaining the target population may not be important), other methods that do not retain the characteristics of the original sample become available. These include matching with a caliper (on the propensity score or on the covariates themselves), cardinality matching, and more restrictive forms of matching like exact and coarsened exact matching, either on all covariates or just a subset, that are prone to discard units from the sample in such a way that the target population is changed. @austin2013b and Austin and Stuart [-@austin2015c; -@austin2015a] have found that caliper matching can be a particularly effective modification to nearest neighbor matching for eliminating imbalance and reducing bias when the target population is less relevant, but when inference to a specific target population is desired, using calipers can induce bias due to incomplete matching [@rosenbaum1985; @wang2020]. Cardinality matching can be particularly effective in data with little overlap between the treatment groups [@visconti2018] and can perform better than caliper matching [@delosangelesresaDirectStableWeight2020]. It is important not to rely excessively on theoretical or simulation-based findings or specific recommendations when making choices about the best matching method to use. For example, although nearest neighbor matching without replacement balance covariates better than did subclassification with five or ten subclasses in Austin's [-@austin2009c] simulation, this does not imply it will be superior in all datasets. Likewise, though @rosenbaum1985a and @austin2011a both recommend using a caliper of .2 standard deviations of the logit of the propensity score, this does not imply that caliper will be optimal in all scenarios, and other widths should be tried, though it should be noted that tightening the caliper on the propensity score can sometimes degrade performance [@king2019]. For large datasets (i.e., in 10,000s to millions), some matching methods will be too slow to be used at scale. Instead, users should consider generalized full matching, subclassification, or coarsened exact matching, which are all very fast and designed to work with large datasets. Nearest neighbor matching on the propensity score has been optimized to run quickly for large datasets as well. ## Reporting the Matching Specification When reporting the results of a matching analysis, it is important to include the relevant details of the final matching specification and the process of arriving at it. Using `print()` on the `matchit` object synthesizes information on how the above arguments were used to provide a description of the matching specification. It is best to be as specific as possible to ensure the analysis is replicable and to allow audiences to assess its validity. Although citations recommending specific matching methods can be used to help justify a choice, the only sufficient justification is adequate balance and remaining sample size, regardless of published recommendations for specific methods. See `vignette("assessing-balance")` for instructions on how to assess and report the quality of a matching specification. After matching and estimating an effect, details of the effect estimation must be included as well; see `vignette("estimating-effects")` for instructions on how to perform and report on the analysis of a matched dataset. ## References