\name{model.diagnostics}
\alias{model.diagnostics}


\title{ Model Predictions and Diagnostics }
\description{
  Takes a model object and makes predictions, runs model diagnostics, and creates graphs and tables of the results.
}
\usage{
model.diagnostics(model.obj = NULL, qdata.trainfn = NULL, qdata.testfn = NULL, 
folder = NULL, MODELfn = NULL, response.name = NULL, unique.rowname = NULL,
 diagnostic.flag=NULL, seed = NULL, prediction.type=NULL, MODELpredfn = NULL, 
 na.action = NULL, v.fold = 10, device.type = NULL, DIAGNOSTICfn = NULL, 
 res=NULL, jpeg.res = 72, device.width = 7,  device.height = 7, units="in", 
 pointsize=12, cex=par()$cex, req.sens, req.spec, FPC, FNC, quantiles=NULL, 
 all=TRUE, subset = NULL, weights = NULL, mtry = NULL, controls = NULL, 
 xtrafo = NULL, ytrafo = NULL, scores = NULL, n.trees = NULL)
}

\arguments{

  \item{model.obj}{ \code{R} model object.  The model object to use for prediction.  The model object must be of type \code{"RF"} (random forest), \code{"QRF"} (quantile random forest), \code{"CF"} (conditional forest), or \code{"SGB"} (stochastic gradient boosting). }

  \item{qdata.trainfn}{String.  The name (full path or base name with path specified by \code{folder}) of the training data file used for building the model (file should include columns for both response and predictor variables).  The file must be a comma-delimited file \code{*.csv} with column headings. \code{qdata.trainfn} can also be an \code{R} dataframe. If predictions will be made (\code{predict = TRUE} or \code{map=TRUE}) the predictor column headers must match the names of the raster layer files, or a \code{rastLUT} must be provided to match predictor columns to the appropriate raster and band.  If \code{qdata.trainfn = NULL} (the default), a GUI interface prompts user to browse to the training data file.  }

  \item{qdata.testfn}{String.  The name (full path or base name with path specified by \code{folder}) of the independent data set for testing (validating) the model's predictions.  The file must be a comma-delimited file \code{".csv"} with column headings and the column headings must be the same as those in the training data file.  \code{qdata.testfn} can also be an \code{R} dataframe. If \code{qdata.testfn = NULL} (default), a GUI interface asks user if there is a test set available, then prompts user to browse to the test data file.  If no test set is desired (for example, cross-fold validation will be performed or, for RF models, Out-Of-Bag estimation), set \code{qdata.testfn = FALSE}. If no test set is given, and \code{qdata.testfn} is not set to \code{FALSE}, the GUI interface asks if a proportion of the data should be set aside as an independent test set.  If this is desired, the user will be prompted to specify the proportion to set aside as test data, and two new data files will be generated in the output folder.  The new file names will be the original data file name with \code{"_train"} and \code{"_test"} appended to the end of the file names.}

  \item{folder}{ String.  The folder used for all output from predictions and/or maps.  Do not add ending slash to path string.  If \code{folder = NULL} (default), a GUI interface prompts user to browse to a folder.  To use the working directory, specify \code{folder = getwd()}.}

  \item{MODELfn}{ String.  The file name to use to save the generated model object.  If \code{MODELfn = NULL} (the default), a default name is generated by pasting \code{model.type_response.type_response.name}. If the other output filenames are left unspecified, \code{MODELfn} will be used as the basic name to generate other output filenames. The filename can be the full path, or it can be the simple basename, in which case the output will be to the folder specified by \code{folder}.}

  \item{response.name}{ String.  The name of the response variable used to build the model. The \code{response.name} must be a column name from the training/test data files. If the \code{model.obj} was constructed in \code{ModelMap} with the \code{model.build()} function, then \code{model.diagnostics()} can extract the \code{response.name} from the \code{model.obj}. If the model was constructed outside of \code{ModelMap}, then you may need to specify the \code{response.name}. In particular, if a SGB model was constructed with the aid of Elith's code, it is necessary to specify the \code{response.name} argument, as all models constructed with this code are given a response name of \code{"y.data"}. If the \code{response.name} argument differs from the response name in the \code{model.obj}, the specified argument is given preference, and a warning is generated.}

  \item{unique.rowname}{ String.  The name of the unique identifier used to identify each row in the training data.  If \code{unique.rowname = NULL}, a GUI interface prompts user to select a variable from the list of column names from the training data file.  If \code{unique.rowname = FALSE}, a variable is generated of numbers from \code{1} to \code{nrow(qdata)} to index each row. }

  \item{diagnostic.flag}{ String.  The name of a column used to identify a subset of rows in the training data or test data to 
use for model diagnostics. This column must be either a logical vector (\code{TRUE} and \code{FALSE}) or a vector of zeros and ones (where \code{0=FALSE} and \code{1=TRUE}). If this argument is used, model diagnostics that depend on predicted and observed values will be calculated from a subset of the training or test data. These include the confusion matrix and threshold criteria for binary response models and the scatterplot for continuous response models. The output file of predicted and observed values will have an additional column, indicating which rows were used in the diagnostic calculations. Note that for cross validation, the entire training dataset will be used to create cross validation predictions, but only the predictions on the rows indicated by \code{diagnostic.flag} will be used for the diagnostics. }

  \item{seed}{ Integer.  The number used to initialize randomization to build RF or SGB models.  If you want to produce the same model later, use the same seed.  If \code{seed = NULL} (the default), a new seed is created each run. }

  \item{prediction.type}{ String. Prediction type.  \code{"TEST"}, \code{"CV"}, \code{"OOB"} or \code{"TRAIN"}.  If \code{prediction.type = "TEST"}, validation predictions will be made on the test set provided by \code{qdata.testfn}.  If \code{prediction.type = "CV"}, cross validation will be used on the training data provided by \code{qdata.trainfn}. If \code{model.obj} is a Random Forest model and \code{prediction.type = "OOB"}, the Out-of-Bag predictions will be calculated on the training data. If \code{model.obj} is a Stochastic Gradient Boosting model and \code{prediction.type = "TRAIN"}, the predictions will be calculated on the training data, but these predictions should be used with caution as this will lead to overoptimistic estimates of model quality. A \code{*.csv} file of the unique id, observed, and predicted values is generated and put in the specified (or default) folder.}

  \item{MODELpredfn}{ String.  Model validation.  A character string used to construct the output file names for the validation diagnostics, for example the prediction \code{*.csv} file, and the graphics \code{*.jpg}, \code{*.pdf} and \code{*.ps} files.  The filename can be the full path, or it can be the simple basename, in which case the output will be to the folder specified by \code{folder}. If \code{MODELpredfn = NULL} (the default), a default name is created by pasting \code{MODELfn} and \code{"_pred"}.}

  \item{na.action}{String.  Model validation.  Specifies the action to take if there are \code{NA} values in the predictor data or if there is a level or class of a categorical predictor variable in the validation test set, but not in the training data set.  By default, \code{model.diagnostics()} will use the same \code{na.action} as was given to \code{model.build}. There are 2 options: (1) \code{na.action = "na.omit"} where any data point with \code{NA} or any new levels for any of the factored predictors is removed from the data; (2) \code{na.action = "na.roughfix"} where a missing categorical predictor is replaced with the most common category, and a missing continuous predictor is replaced with the median. Note: data points with missing response values will always be omitted.  }

  \item{v.fold}{ Integer (or logical \code{FALSE}).  Model validation.  The number of cross validation folds to use when making validation predictions on the training data.  Only used if  \code{prediction.type = "CV"}.}

  \item{device.type}{ String or vector of strings.  Model validation.  One or more device types for graphical output from model validation diagnostics. 

Current choices:

\tabular{lllll}{
	  \tab \tab \tab \code{"default"} \tab default graphics device\cr
	  \tab \tab \tab \code{"jpeg"} \tab *.jpg files\cr
	  \tab \tab \tab \code{"none"} \tab no graphics device generated\cr	
	  \tab \tab \tab \code{"pdf"} \tab *.pdf files\cr
	  \tab \tab \tab \code{"png"} \tab *.png files\cr
	  \tab \tab \tab \code{"postscript"} \tab *.ps files\cr
	  \tab \tab \tab \code{"tiff"} \tab *.tif files }

 }

  \item{DIAGNOSTICfn}{ String.  Model validation.  Name used as base to create names for output files from model validation diagnostics.  The filename can be the full path, or it can be the simple basename, in which case the output will be to the folder specified by \code{folder}.  Defaults to \code{DIAGNOSTICfn = MODELfn} followed by the appropriate suffixes (e.g. \code{".csv"}, \code{".jpg"}). }

  \item{res}{ Integer.  Model validation.  Pixels per inch for jpeg, png, and tiff plots.  The default is 72dpi, good for on screen viewing. For printing, suggested setting is 300dpi. }

  \item{jpeg.res}{ Integer.  Model validation.  Deprecated. Ignored unless \code{res} is not provided. }

  \item{device.width}{ Integer.  Model validation.  The device width for diagnostic plots in inches. }

  \item{device.height}{ Integer.  Model validation.  The device height for diagnostic plots in inches. }

  \item{units}{ Model validation.  The units in which \code{device.height} and \code{device.width} are given. Can be \code{"px"} (pixels), \code{"in"} (inches, the default), \code{"cm"} or \code{"mm"}. }

  \item{pointsize}{ Integer.  Model validation.  The default pointsize of plotted text, interpreted as big points (1/72 inch) at \code{res} ppi. }

  \item{cex}{ Integer.  Model validation.  The cex for diagnostic plots. }

  \item{req.sens}{ Numeric.  Model validation.  The required sensitivity for threshold optimization for binary response model evaluation. }

  \item{req.spec}{ Numeric.  Model validation.  The required specificity for threshold optimization for binary response model evaluation. }

  \item{FPC}{ Numeric.  Model validation.  The False Positive Cost for threshold optimization for binary response model evaluation. }

  \item{FNC}{ Numeric.  Model validation.  The False Negative Cost for threshold optimization for binary response model evaluation. }

  \item{quantiles}{ Numeric Vector.  QRF models.  The quantiles to predict. A numeric vector with values between zero and one. If the model was built without specifying quantiles, quantile importance cannot be calculated, but \code{quantiles} can still be used to specify prediction quantiles. If the model was built with quantiles specified, then the model quantiles will be used for the importance graph.  If quantiles are not specified for model building or diagnostics, prediction quantiles will default to \code{quantiles=c(0.1,0.5,0.9)}. }

  \item{all}{ Logical.  QRF models. \code{all=TRUE} uses all observations for prediction. \code{all=FALSE} uses only a certain number of observations per node for prediction (set with argument obs). Unlike in the \code{quantregForest} package itself, the default in ModelMap is \code{all=TRUE}, to more closely parallel ordinary random forest models. }

  \item{subset}{CF models. NOT SUPPORTED. Only needed for \code{prediction.type="CV"} for CF models. An optional vector specifying a subset of observations to be used in the fitting process. Note: \code{subset} is not yet supported for cross validation diagnostics.}

  \item{weights}{CF models. NOT SUPPORTED. Only needed for \code{prediction.type="CV"} for CF models. An optional vector of weights to be used in the fitting process. Non-negative integer valued weights are allowed as well as non-negative real weights. Observations are sampled (with or without replacement) according to probabilities \code{weights/sum(weights)}. The fraction of observations to be sampled (without replacement) is computed based on the sum of the weights if all weights are integer-valued, and based on the number of weights greater than zero otherwise. Alternatively, \code{weights} can be a double matrix defining case weights for all \code{ncol(weights)} trees in the forest directly. This requires more storage but gives the user more control. Note: \code{weights} is not yet supported for cross validation diagnostics.}

 \item{mtry}{ Integer. Only needed for \code{prediction.type="CV"} for CF models (for RF and QRF models mtry will be determined from the model object).  Number of variables to try at each node of Random Forest trees. }

  \item{controls}{CF models. Only needed for \code{prediction.type="CV"} for CF models. An object of class \code{\link[party]{ForestControl-class}}, which can be obtained using cforest_control (and its convenience interfaces cforest_unbiased and cforest_classical). If \code{controls} is specified, then the stand alone arguments \code{mtry} and \code{ntree} are ignored and these parameters must be specified as part of the \code{controls} argument. If \code{controls} is not specified, \code{model.build} defaults to \code{cforest_unbiased(mtry=mtry, ntree=ntree)} with the values of \code{mtry} and \code{ntree} specified by the stand alone arguments.}

  \item{xtrafo}{CF models. Only needed for \code{prediction.type="CV"} for CF models. A function to be applied to all input variables. By default, the \code{\link[party]{ptrafo}} function is applied. }

  \item{ytrafo}{CF models. Only needed for \code{prediction.type="CV"} for CF models. A function to be applied to all response variables. By default, the \code{\link[party]{ptrafo}} function is applied.  }

  \item{scores}{CF models. NOT SUPPORTED. Only needed for \code{prediction.type="CV"} for CF models. An optional named list of scores to be attached to ordered factors. Note: \code{scores} is not yet supported for cross validation diagnostics.}

  \item{n.trees}{ Integer.  SGB models.  The number of stochastic gradient boosting trees for an SGB model. If \code{n.trees=NULL} (the default) the model creation code will increase the number of trees 100 at a time until OOB error rate stops improving. The \code{gbm} function \code{gbm.perf()} will be used to select from the total calculated trees, the best number of trees for model predictions, with argument \code{method="OOB"}. The \code{gbm} package warns that \code{OOB generally underestimates the optimal number of iterations although predictive performance is reasonably competitive.} }

}
\details{

\code{model.diagnostics()} takes a model object and makes predictions, runs model diagnostics, and creates graphs and tables of the results.

\code{model.diagnostics()} can be run in a traditional R command mode, where all arguments are specified in the function call.  However, it can also be used in a full push button mode, where you type in the simple command \code{model.diagnostics()}, and GUI pop up windows will ask questions about the type of model, the file locations of the data, etc.

When running \code{model.diagnostics()} on non-Windows platforms, file names and folders need to be specified in the argument list, but other pushbutton selections are handled by the \code{select.list()} function, which is platform independent. 

Diagnostic predictions are made by one of four methods, and a text file is generated consisting of three columns: observation ID, observed values and predicted values. If \code{prediction.type = "CV"}, an additional column indicates which cross-validation fold each observation fell into. If the model's response type is categorical, then in addition to a column giving the category predicted by majority vote, there are also columns for each possible response category giving the proportion of trees that predicted that category.

A variable importance graph is made. If \code{response.type = "categorical"}, category specific graphs are generated for variable importance. These show how much the model accuracy for each category is affected when the values of each predictor variable are randomly permuted.

The package \code{corrplot} is used to generate a plot of correlation between predictor variables. If there are highly correlated predictor variables, then the variable importances of \code{"RF"}, \code{"QRF"}, \code{"SGB"} and \code{"QSGB"} models need to be interpreted with care, and users may want to consider looking at the conditional variable importances available for \code{"CF"} models produced by the \code{party} package.

If \code{model.type = "RF"}, the OOB error is plotted as a function of number of trees in the model. If \code{response.type = "binary"} or \code{response.type = "categorical"}, category specific graphs are generated for OOB error as a function of number of trees.

If \code{response.type = "binary"}, a summary graph is made using the \code{PresenceAbsence} package, and \code{*.csv} spreadsheets are created of optimized thresholds by several methods with their associated error statistics and predicted prevalence.

If \code{response.type = "continuous"} a scatterplot of observed vs.  predicted is created with a simple linear regression line.  The graph is labeled with slope and intercept of this line as well as Pearson's and Spearman's correlation coefficients.
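
As a rough sketch only (this is not the package's internal code), the statistics shown on the scatterplot could be reproduced from any set of observed and predicted values along the following lines. The simulated data, the object names \code{obs}, \code{pred} and \code{fit}, and the choice to regress observed on predicted are all assumptions made for illustration:

\preformatted{
# Simulated observed and predicted values (illustration only)
set.seed(1)
pred <- runif(50, 0, 100)
obs  <- pred + rnorm(50, sd = 10)

# Simple linear regression of observed on predicted (assumed direction)
fit <- lm(obs ~ pred)
coef(fit)                            # intercept and slope of the line

# Pearson's and Spearman's correlation coefficients
cor(pred, obs, method = "pearson")
cor(pred, obs, method = "spearman")

# Scatterplot with the fitted regression line
plot(pred, obs, xlab = "predicted", ylab = "observed")
abline(fit)
}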

If \code{response.type = "categorical"}, a confusion matrix is generated that includes errors of omission and commission, as well as Kappa, Percent Correctly Classified (PCC) and the Multicategorical Area Under the Curve (MAUC) as defined by Hand and Till (2001) and calculated by the package \code{HandTill2001}.

}

\note{
Importance currently unavailable for QRF models.

If you are running cross validation diagnostics on a CF model, the model parameters will NOT automatically be passed to \code{model.diagnostics()}. For cross validation, it is the user's responsibility to be certain that the CF arguments are the same in \code{model.build()} and \code{model.diagnostics()}.

Also, for some CF model parameters (\code{subset}, \code{weights}, and \code{scores}) \code{ModelMap} only provides OOB and independent test set diagnostics, and does not support cross validation diagnostics.

}

\value{

The function will return a dataframe of the row ID, and the observed and predicted values. 

For Binary response models the predicted probability of presence is returned. 

For Categorical Response models the predicted category (by majority vote) is returned as well as a column for each category giving the probability of that category. If necessary, \code{\link{make.names}} is applied to the categories to create valid column names.

For Continuous response models the predicted value is returned. 

If \code{prediction.type = "CV"} the dataframe also includes a column indicating which cross-validation fold each datapoint was in. 

}
\references{ 
Breiman, L. (2001) Random Forests. Machine Learning, 45:5-32.

Elith, J., Leathwick, J. R. and Hastie, T. (2008). A working guide to boosted regression trees. Journal of Animal Ecology. 77:802-813.

Friedman, J.H. (2001). Greedy function approximation: a gradient boosting machine. Ann. Stat., 29(5):1189-1232.

Friedman, J.H. (2002). Stochastic gradient boosting. Comput. Stat. Data An., 38(4):367-378.

Hand, D. J., & Till, R. J. (2001). A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning, 45(2), 171-186.

Liaw, A. and  Wiener, M. (2002). Classification and Regression by randomForest. R News 2(3), 18--22.

Ridgeway, G. (1999). The state of boosting. Comp. Sci. Stat. 31:172-181.
 }

\author{ Elizabeth Freeman and Tracey Frescino }

\seealso{ \code{\link{get.test}}, \code{\link{model.build}}, \code{\link{model.mapmake}}}
\examples{
###########################################################################
############################# Run this set up code: #######################
###########################################################################

# set seed:
seed=38

# Define training and test files:

qdata.trainfn = system.file("extdata", "helpexamples","DATATRAIN.csv", package = "ModelMap")
qdata.testfn = system.file("extdata", "helpexamples","DATATEST.csv", package = "ModelMap")

# Define folder for all output:
folder=getwd()	

#identifier for individual training and test data points

unique.rowname="ID"


###########################################################################
############## Pick one of the following sets of definitions: #############
###########################################################################


########## Continuous Response, Continuous Predictors ############

#file name to store model:
MODELfn="RF_Bio_TC"				

#predictors:
predList=c("TCB","TCG","TCW")	

#define which predictors are categorical:
predFactor=FALSE	

# Response name and type:
response.name="BIO"
response.type="continuous"


########## Binary Response, Continuous Predictors ############

#file name to store model:
MODELfn="RF_CONIFTYP_TC"				

#predictors:
predList=c("TCB","TCG","TCW")		

#define which predictors are categorical:
predFactor=FALSE

# Response name and type:
response.name="CONIFTYP"

# This variable is 1 if a conifer or mixed conifer type is present, 
# otherwise 0.

response.type="binary"


########## Continuous Response, Categorical Predictors ############

# In this example, NLCD is a categorical predictor.
#
# You must decide what you want to happen if there are categories
# present in the data to be predicted (either the validation/test set
# or in the image file) that were not present in the original training data.
# Choices:
#       na.action =  "na.omit"
#                    Any validation datapoint or image pixel with a value for any
#                    categorical predictor not found in the training data will be
#                    returned as NA.
#       na.action =  "na.roughfix"
#                    Any validation datapoint or image pixel with a value for any
#                    categorical predictor not found in the training data will have
#                    the most common category for that predictor substituted,
#                    and then a prediction will be made.

# You must also let R know which of the predictors are categorical, in other
# words, which ones R needs to treat as factors.
# This vector must be a subset of the predictors given in predList

#file name to store model:
MODELfn="RF_BIO_TCandNLCD"			

#predictors:
predList=c("TCB","TCG","TCW","NLCD")

#define which predictors are categorical:
predFactor=c("NLCD")

# Response name and type:
response.name="BIO"
response.type="continuous"



###########################################################################
########################### build model: ##################################
###########################################################################


### create model ###

model.obj = model.build( model.type="RF",
                       qdata.trainfn=qdata.trainfn,
                       folder=folder,		
                       unique.rowname=unique.rowname,	
                       MODELfn=MODELfn,
                       predList=predList,
                       predFactor=predFactor,
                       response.name=response.name,
                       response.type=response.type,
                       seed=seed,
                       na.action="na.roughfix"
)

###########################################################################
#### Then run this code to make validation predictions and diagnostics: ###
###########################################################################


### for Out-of-Bag predictions ###

MODELpredfn<-paste(MODELfn,"_OOB",sep="")
PRED.OOB<-model.diagnostics( 	model.obj=model.obj,
				qdata.trainfn=qdata.trainfn,
                   		folder=folder,		
                  	 	unique.rowname=unique.rowname,
                	# Model Validation Arguments
                   		prediction.type="OOB",
                   		MODELpredfn=MODELpredfn,
                   		device.type=c("default","jpeg","pdf"),	
                   		na.action="na.roughfix"
)
PRED.OOB
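
### A quick look at the returned predictions (illustrative sketch) ###

# The lines below are an illustrative addition, not one of the package's own
# diagnostics. They assume the returned dataframe holds the row ID in the
# first column, followed by the observed and predicted values, as described
# in the Value section. Check the layout with head() before relying on
# column positions.

head(PRED.OOB)
dim(PRED.OOB)

# For a continuous response, a rough summary of agreement between the
# (assumed) observed and predicted columns. The guard skips the calculation
# if either column turns out not to be numeric.
if (is.numeric(PRED.OOB[[2]]) && is.numeric(PRED.OOB[[3]])) {
  cor(PRED.OOB[[2]], PRED.OOB[[3]], use="complete.obs")
}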

### for Cross-Validation predictions ###

#MODELpredfn<-paste(MODELfn,"_CV",sep="")
#PRED.CV<-model.diagnostics( 	model.obj=model.obj,
#                   		qdata.trainfn=qdata.trainfn,
#                   		folder=folder,		
#                   		unique.rowname=unique.rowname,
#                   		seed=seed,
#                	# Model Validation Arguments
#                   		prediction.type="CV",
#                   		MODELpredfn=MODELpredfn,
#                   		device.type=c("default","jpeg","pdf"),	
#                   		v.fold=10,
#                   		na.action="na.roughfix"
#)
#PRED.CV

### for Independent Test Set predictions ###

#MODELpredfn<-paste(MODELfn,"_TEST",sep="")
#PRED.TEST<-model.diagnostics( 	model.obj=model.obj,
#                   		qdata.testfn=qdata.testfn,
#                   		folder=folder,		
#                   		unique.rowname=unique.rowname,
#                	# Model Validation Arguments
#                   		prediction.type="TEST",
#                   		MODELpredfn=MODELpredfn,
#                   		device.type=c("default","jpeg","pdf"),	
#                   		na.action="na.roughfix"
#)
#PRED.TEST

}

\keyword{ models }

