--- title: "Reading and preparing the data" author: "Hakon K. Gjessing and Julia Romanowska" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{Reading and preparing the data} %\VignetteEncoding{UTF-8} --- ```{r setup,include=FALSE} knitr::opts_chunk$set( echo = TRUE ) library( Haplin, quietly = TRUE ) ``` # Reading the data into Haplin Haplin reads data in two formats: 1. Haplin's own text file format; 2. PED format (as generated by `plink 1.9`). Both types of data are read in through the use of `genDataRead` function: ```{r} dir.exmpl <- system.file( "extdata", package = "Haplin" ) exemplary.file1 <- paste0( dir.exmpl, "/HAPLIN.trialdata.txt" ) my.gen.data.haplin <- genDataRead( file.in = exemplary.file1, file.out = "trial_data1", dir.out = ".", format = "haplin", n.vars = 0 ) exemplary.file3 <- paste0( dir.exmpl, "/exmpl_data.ped" ) my.gen.data <- genDataRead( exemplary.file3, file.out = "ped_data", dir.out = ".", format = "ped" ) ``` The function reads in all the data in the file, creates `ff` objects to store the genetic information and `data.frame` to store covariate data (if any). These objects are saved in `.RData` and `.ffData` files, which can be later on easily uploaded to R (with `genDataLoad`) and re-used. _CAUTION:_ This can take a long time for large datasets (such as from GWAS analysis, e.g., reading in a 7 GB file will take ca.15 minutes), however, this needs to be run only once and then, the next time you need to use the data, use the `genDataLoad` function (see section [Re-using the data](#reuse), below). Be careful NOT TO DELETE the output files .ffData and .RData! The `genDataRead` function returns a list object with three elements: * _cov.data_ - a `data.frame` with covariate data (if available in the input file); * _gen.data_ - the genetic part of data; * _aux_ - additional meta-information, used internal. To see all the available arguments and usage examples, type: ```{r eval=FALSE} ?genDataRead example( genDataRead ) ``` (this works also with any other function) ## Reading covariate data The PED-formatted data has by default 6 first columns reserved for family information and covariates. However, if the user has covariate data that are in a separate file, this can be read together with the genotype data, using the same function `genDataRead`. The main file can be either in haplin or PED format. ```{r add_cov_data_read} add.cov.file <- paste0( dir.exmpl, "/add_cov_data2.dat" ) my.gen.data.haplin3 <- genDataRead( file.in = exemplary.file1, file.out = "trial_data3", dir.out = ".", format = "haplin", n.vars = 0, cov.file.in = add.cov.file ) my.gen.data.haplin3 add.cov.file2 <- paste0( dir.exmpl, "/add_cov_data.dat" ) my.gen.data2 <- genDataRead( exemplary.file3, file.out = "ped_data2", dir.out = ".", format = "ped", cov.file.in = add.cov.file2 ) my.gen.data2 ``` _NOTE:_ The file with the additional information should have a header with names of the data columns! # Accessing the information in the loaded data The object created by `genDataRead` includes a lot of information. We have created functions that will help the user to navigate it. ## Displaying and extracting phenotype information First of all, when you type in the name of the object, a short summary will be displayed: ```{r show_summary} my.gen.data ``` If you want to show and/or extract part of phenotype information, you can use the `showPheno` function: ```{r showPheno_demo} # by default - showing first 5 entries: showPheno( my.gen.data ) # getting all the info: head( showPheno( my.gen.data, n = "all" ), n = 20 ) showPheno( my.gen.data, from = 4, to = 15 ) # show information about females only: head( showPheno( my.gen.data, sex = 2, n = "all" ), n = 20 ) ``` The output can be saved to an object: ```{r showPheno_demo2} females.pheno <- showPheno( my.gen.data, sex = 2 ) head( females.pheno ) ``` With the functions `nindiv` and `nfam`, you can get the number of individuals or number of families in your data: ```{r nindiv_demo} nindiv( my.gen.data ) nfam( my.gen.data ) ``` Note that the `nfam` function assumes that your dataset is from a triad or dyad study, i.e., includes information on the child and at least one parent. ## Displaying and extracting genotype information The function `nsnps` will tell us how many markers/SNPs there is in the data. Be careful since per default it assumes that the data is triad data (i.e., mother, father, and child were genotyped), so if your data is from a case-control study, be sure to specify that with argument `design = "cc"`. ```{r nsnps} nsnps( my.gen.data ) ``` To get the SNP names use: ```{r showSNPnames} showSNPnames( my.gen.data ) # by default - showing only first 5 SNPs showSNPnames( my.gen.data, from = 12, to = 31 ) ``` To extract and/or show genotypes for specific individuals or markers, use the `showGen` function: ```{r showGen_demo} showGen( my.gen.data, markers = c( 10,15,121 ) ) # by default - showing first 5 entries showGen( my.gen.data, from = 31, to = 231 ) ``` As above, this output can be saved to an object: ```{r showGen_demo2} subset.genes <- showGen( my.gen.data, from = 31, to = 231, markers = c( 10,15,121 ) ) subset.genes ``` _CAUTION:_ Note that these functions work only with objects resulting from `genDataRead`, and not `genDataPreprocess`, as the preprocessing disturbs the coding of the data, thus making the output not easy to understand. # Preparing your data After loading the data, it is necessary to pre-process it to the internal format used by Haplin. This is done by evoking the command: ```{r eval=FALSE} my.prepared.gen.data <- genDataPreprocess( data.in = my.gen.data, map.file = "my_gen_data.map", design = "triad", file.out = "my_prepared_gen_data", dir.out = "." ) ``` _CAUTION:_ This action can be very time-consuming for large datasets (e.g., estimated time for ca.45,000 SNPs and 1,600 individuals, a PED file of ca.700MB, is ca.6 minutes on a 8-core CPU). However, this needs to be done only once and the output, stored in small files, can be used for the subsequent analysis repeatedly. (See also section [Choosing a subset of data](#subset) ) This will also create `.RData` and `.ffData` files, which take much less space than the input PED files. Be careful not to delete these files, as they can be re-used by simply loading into R (the `genDataLoad` function) right before Haplin analysis. _NOTE:_ The information about the `my.prepared.gen.data` object can be displayed by simply writing the name of the object. The output of the `genDataPreprocess` function (or `genDataLoad`) can then be used to run the analysis. # Choosing a subset of data {#subset} If you know that you want to focus your analysis on a certain region of the entire SNP set, or perhaps you're impatient and want to check out Haplin without waiting a long time for the preprocessing to finish, you can easily choose a subset of the data to pre-process and analyze. This can be done with the command: ```{r eval=FALSE} gen.data.subset <- genDataGetPart( data.in = my.gen.data, markers = c( 3:15,22 ), design = "triad", file.out = "my_gen_data_subset", dir.out = "." ) ``` This function allows you to specify the subset in various ways: * __markers__ - numeric vector with numbers indicating which markers to choose; * __indiv.ids__ - numeric vector giving IDs of individuals. _CAUTION:_ in a standard PED file, individual IDs are not unique, so this will select all individuals with given numbers; * __rows__ - numeric vector giving the positions - this will select only these rows; * __cc__ - one or more values to choose based on the case-control status ('cc' column); * __sex__ - one or more values to choose based on the 'sex' column; * __...__ - if any additional covariate data are available in `data.in, the user can choose based on values of these. If you give a combination of these parameters, the result will be the intersection of the subsets defined by each of the parameters alone. The subset is then available in the `gen.data.subset` object and written to `file.out` file(s). These can be loaded and re-used multiple times. ## Useful shortcut functions To make it easier to choose certain subsets of data, we created the following functions: * `getChildren` - extracts only the children's data from the triad-design data; * `getMothers` and `getFathers` - extracts only the fathers or mothers, respectively; * `getFullTriads` - extracts only the full families, if the input data contains also dyads. # Re-using the data {#reuse} _IMPORTANT:_ Remember that each time you start a new R session, you need to load the data into the memory with the command: ```{r eval=FALSE} my.prepared.gen.data <- genDataLoad( filename = "my_prepared_gen_data", dir.in = "." ) ``` This takes much less time than re-reading and running the data preparations!