% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/PrInDT.R
\name{PrInDT}
\alias{PrInDT}
\title{The basic undersampling loop for classification}
\usage{
PrInDT(datain, classname, ctestv=NA, N, percl, percs=1, conf.level=0.95, thres=0.5,
       stratvers=0, strat=NA, seedl=TRUE)
}
\arguments{
\item{datain}{Input data frame with the class factor variable 'classname' and the\cr
influential variables, which must be factors or numerical variables (convert logical and character variables to factors)}

\item{classname}{Name of class variable (character)}

\item{ctestv}{Vector of character strings of forbidden split results;\cr
Example: ctestv <- rbind('variable1 == \{value1, value2\}','variable2 <= value3'), where
the character strings 'value1' and 'value2' are not allowed as results of a split on 'variable1' in a tree.\cr
For restrictions of the type 'variable <= xxx', all split results of the form 'variable <= yyy' with yyy <= xxx are excluded.\cr
Trees with split results specified in 'ctestv' are not accepted during optimization.\cr
A concrete example is: ctestv <- rbind('ETH == \{C2a, C1a\}','AGE <= 20') for the variables 'ETH' and 'AGE' and the values 'C2a', 'C1a', and '20'.\cr
If no restrictions exist, the default = NA is used.}

\item{N}{Number (> 2) of repetitions (integer)}

\item{percl}{Undersampling percentage of larger class (numerical, > 0 and <= 1)}

\item{percs}{Undersampling percentage of smaller class (numerical, > 0 and <= 1);\cr
default = 1}

\item{conf.level}{(1 - significance level) in function \code{ctree} (numerical, > 0 and <= 1);\cr
default = 0.95}

\item{thres}{Probability threshold for prediction of smaller class (numerical, >= 0 and < 1); default = 0.5}

\item{stratvers}{Version of stratification;\cr
= 0: none (default),\cr
= 1: stratification according to the percentages of the values of the factor variable 'strat',\cr
> 1: stratification with minimum number 'stratvers' of observations per value of 'strat'}

\item{strat}{Name of one (!) stratification variable for undersampling (character);\cr
default = NA (no stratification)}

\item{seedl}{Should the seed for random numbers be set (TRUE / FALSE)?\cr
default = TRUE}
}
\value{
\describe{
\item{tree1st}{best tree on full sample}
\item{tree2nd}{2nd-best tree on full sample}
\item{tree3rd}{3rd-best tree on full sample}
\item{treet1st}{best tree on test sample}
\item{treet2nd}{2nd-best tree on test sample}
\item{treet3rd}{3rd-best tree on test sample}
\item{ba1st}{accuracies: largeClass, smallClass, balanced of 'tree1st', both for full and test sample}
\item{ba2nd}{accuracies: largeClass, smallClass, balanced of 'tree2nd', both for full and test sample}
\item{ba3rd}{accuracies: largeClass, smallClass, balanced of 'tree3rd', both for full and test sample}
\item{baen}{accuracies: largeClass, smallClass, balanced of ensemble of all interpretable, 3 best acceptable, and all acceptable trees on full sample}
\item{bafull}{vector of balanced accuracies of all trees from undersampling}
\item{batest}{vector of test accuracies of all trees from undersampling}
\item{dataout}{transformed data set 'datain' for further analyses}
\item{treeAll}{tree based on all observations}
\item{baAll}{balanced accuracy of 'treeAll'}
\item{interpAll}{criterion of interpretability of 'treeAll' (TRUE / FALSE)}
\item{confAll}{confusion matrix of 'treeAll'}
}
}
\description{
The function PrInDT uses ctrees (conditional inference trees from the package "party") for optimal modeling of
the relationship between the two-class factor variable 'classname' and all other factor and numerical variables
in the data frame 'datain' by means of 'N' repetitions of undersampling. The optimization criterion is the balanced accuracy
on the full sample. Trees generated from undersampling are not accepted if they include
split results specified in the character strings of the vector 'ctestv'.\cr
The undersampling percentages are 'percl' for the larger class and 'percs' for the smaller class (default = 1).\cr
The probability threshold 'thres' for the prediction of the smaller class may be specified (default = 0.5).\cr
Undersampling may be stratified in two ways by the feature 'strat'.
}
\details{
For the optimization of the trees, we employ a method we call Sumping (Subsampling umbrella of
model parameters), a variant of Bumping (Bootstrap umbrella of model parameters) (Tibshirani
& Knight, 1999) which uses subsampling instead of bootstrapping. The aim of the
optimization is to identify conditional inference trees with maximum predictive power
on the full sample under interpretability restrictions.
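
The subsampling step underlying Sumping can be sketched in base R as follows. This is only an illustration
under stated assumptions and not the internal code of the package: the toy class vector 'y', the index vectors,
and the rounding of the subsample sizes are hypothetical choices made for this sketch.

\preformatted{
set.seed(1)                               # reproducible draw
y <- factor(c(rep("large", 100), rep("small", 20)))  # toy two-class variable
percl <- 0.1                              # share drawn from the larger class
percs <- 0.9                              # share drawn from the smaller class
idx_large <- which(y == "large")          # row indices of the larger class
idx_small <- which(y == "small")          # row indices of the smaller class
# draw without replacement from each class separately (rounding is an assumption)
sub <- c(sample(idx_large, round(percl * length(idx_large))),
         sample(idx_small, round(percs * length(idx_small))))
table(y[sub])                             # class counts in the subsample
}

In this toy setting, 10 of the 100 observations of the larger class and 18 of the 20 observations of the
smaller class are drawn. A tree would then be fitted on such a subsample and evaluated on the full sample.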

\strong{References} \cr
-- Tibshirani, R., Knight, K. 1999. Model Search and Inference By Bootstrap "bumping".
Journal of Computational and Graphical Statistics, Vol. 8, No. 4 (Dec., 1999), pp. 671-686 \cr
-- Weihs, C., Buschfeld, S. 2021a. Combining Prediction and Interpretation in Decision Trees (PrInDT) -
a Linguistic Example. arXiv:2103.02336

Standard output can be produced by means of \code{print(name)} or just \code{name} as well as \code{plot(name)}, where 'name' is the output object
of the function.\cr
The plot function produces a series of several plots. If you use the R GUI on Windows, you might want to specify \code{windows(record=TRUE)} before
\code{plot(name)} to record the whole series of plots. In RStudio this functionality is provided automatically.
}
\examples{
datastrat <- PrInDT::data_zero
data <- na.omit(datastrat) # cleaned full data: no NAs
# interpretation restrictions (split exclusions)
ctestv <- rbind('ETH == {C2a, C1a}','MLU == {1, 3}') # split exclusions
N <- 41  # no. of repetitions
conf.level <- 0.99 # 1 - significance level (mincriterion) in ctree
percl <- 0.08  # undersampling percentage of the larger class
percs <- 0.95 # undersampling percentage of the smaller class
# calls of PrInDT
out <- PrInDT(data,"real",ctestv,N,percl,percs,conf.level) # unstratified
out # print the best trees, the ensembles, and the tree based on all observations
plot(out)
out <- PrInDT(data,"real",ctestv,N,percl,percs,conf.level,stratvers=1,
              strat="SEX") # percentage stratification
out <- PrInDT(data,"real",ctestv,N,percl,percs,conf.level,stratvers=50,
              strat="SEX") # stratification with minimum no. of tokens
out <- PrInDT(data,"real",ctestv,N,percl,percs,conf.level,thres=0.4) # threshold = 0.4
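# Sketch of inspecting single result components of the last call.
# Assumption (not confirmed here): the returned object can be indexed like a list
# by the component names given in the Value section.
out$ba1st # accuracies of the best tree: largeClass, smallClass, balanced
summary(out$bafull) # spread of the balanced accuracies on the full sample over all trees
out$confAll # confusion matrix of the tree based on all observations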

}
