Here are some key data-manipulation steps on a data-frame which is how we typically organize our data in R. After having read the data into R it will typically be a data-frame, if not we can force it to be a data-frame. The basic idea of the utility functions is to get a simple and easy to type way of making simple data-manipulation on a data-frame much like what is possible in SAS or STATA.
The functions, say, dcut, dfactor and so on are all functions that basically does what the base R cut, factor do, but are easier to use in the context of data-frames and have additional functionality.
Here we work on the melanoma data that is already read into R and is a data-frame.
The structure for all functions is
to use the function on y in a dataframe grouped by x if condition ifcond is valid. The basic functions are
Data processing * dsort * dreshape * dcut * drm, drename, ddrop, dkeep, dsubset * drelevel * dlag * dfactor, dnumeric
Data aggregation * dby, dby2 * dscalar, deval, daggregate * dmean, dsd, dsum, dquantile, dcor * dtable, dcount
Data summaries * dhead, dtail, * dsummary, * dprint, dlist, dlevels, dunique
A generic function daggregate, daggr, can be called with a function as the argument
without the grouping variable (x)
A useful feature is that y and x as well as the subset condition can be specified using regular-expressions or by wildcards (default). Here to illustrate this, we compute the means of certain variables.
First just oveall
now only when days>500
and now after sex but only when days>500
dmean(melanoma,thick+I(log(thick))~sex|I(days>500))
#> sex thick I(log(thick))
#> 1 0 242.9580 5.060086
#> 2 1 320.2429 5.353321
and finally after quartiles of days (via the dcut function)
dmean(melanoma,thick+I(log(thick))~I(dcut(days)))
#> I(dcut(days)) thick I(log(thick))
#> 1 [10,1.52e+03] 482.1731 5.799525
#> 2 (1.52e+03,2.00e+03] 208.5490 4.987652
#> 3 (2.00e+03,3.04e+03] 223.2941 4.974759
#> 4 (3.04e+03,5.56e+03] 250.1961 5.120129
or summary of all variables starting with “s” and that contains “a”
melanoma=drename(melanoma,tykkelse~thick)
names(melanoma)
#> [1] "no" "status" "days" "ulc" "tykkelse" "sex"
Deleting variables
or sas style
data(melanoma)
melanoma=ddrop(melanoma,~thick+sex)
names(melanoma)
#> [1] "no" "status" "days" "ulc"
alternatively we can also keep certain variables
data(melanoma)
melanoma=dkeep(melanoma,~thick+sex+status+days)
names(melanoma)
#> [1] "thick" "sex" "status" "days"
This can also be done with direct asignment
The dkeep function can also be used to re-ordering the variables in the data-frame
data(melanoma)
dstr(melanoma)
#> 'data.frame': 205 obs. of 6 variables:
#> $ no : int 789 13 97 16 21 469 685 7 932 944 ...
#> $ status: int 3 3 2 3 1 1 1 1 3 1 ...
#> $ days : int 10 30 35 99 185 204 210 232 232 279 ...
#> $ ulc : int 1 0 0 0 1 1 1 1 1 1 ...
#> $ thick : int 676 65 134 290 1208 484 516 1288 322 741 ...
#> $ sex : int 1 1 1 0 1 1 1 1 0 0 ...
#> Warning in structure(res, ngroupvar = 0, class = c("daggregate", class(res))): Calling 'structure(NULL, *)' is deprecated, as NULL cannot have attributes.
#> Consider 'structure(list(), *)' instead.
The data can in Rstudio be seen as a data-table but to list certain parts of the data in output window
dlist(melanoma)
#> no status days ulc thick sex
#> 1 789 3 10 1 676 1
#> 2 13 3 30 0 65 1
#> 3 97 2 35 0 134 1
#> 4 16 3 99 0 290 0
#> 5 21 1 185 1 1208 1
#> ---
#> 201 317 2 4492 1 706 1
#> 202 798 2 4668 0 612 0
#> 203 806 2 4688 0 48 0
#> 204 606 2 4926 0 226 0
#> 205 328 2 5565 0 290 0
dlist(melanoma, ~.|sex==1)
#> no status days ulc thick
#> 1 789 3 10 1 676
#> 2 13 3 30 0 65
#> 3 97 2 35 0 134
#> 5 21 1 185 1 1208
#> 6 469 1 204 1 484
#> ---
#> 191 445 2 3909 1 806
#> 195 415 2 4119 0 65
#> 197 175 2 4207 0 65
#> 198 493 2 4310 0 210
#> 201 317 2 4492 1 706
dlist(melanoma, ~ulc+days+thick+sex|sex==1)
#> ulc days thick sex
#> 1 1 10 676 1
#> 2 0 30 65 1
#> 3 0 35 134 1
#> 5 1 185 1208 1
#> 6 1 204 484 1
#> ---
#> 191 1 3909 806 1
#> 195 0 4119 65 1
#> 197 0 4207 65 1
#> 198 0 4310 210 1
#> 201 1 4492 706 1
Getting summaries
dsummary(melanoma)
#> no status days ulc thick
#> Min. : 2.0 Min. :1.00 Min. : 10 Min. :0.000 Min. : 10
#> 1st Qu.:222.0 1st Qu.:1.00 1st Qu.:1525 1st Qu.:0.000 1st Qu.: 97
#> Median :469.0 Median :2.00 Median :2005 Median :0.000 Median : 194
#> Mean :463.9 Mean :1.79 Mean :2153 Mean :0.439 Mean : 292
#> 3rd Qu.:731.0 3rd Qu.:2.00 3rd Qu.:3042 3rd Qu.:1.000 3rd Qu.: 356
#> Max. :992.0 Max. :3.00 Max. :5565 Max. :1.000 Max. :1742
#> sex
#> Min. :0.0000
#> 1st Qu.:0.0000
#> Median :0.0000
#> Mean :0.3854
#> 3rd Qu.:1.0000
#> Max. :1.0000
or for specfic variables
dsummary(melanoma,~thick+status+sex)
#> thick status sex
#> Min. : 10 Min. :1.00 Min. :0.0000
#> 1st Qu.: 97 1st Qu.:1.00 1st Qu.:0.0000
#> Median : 194 Median :2.00 Median :0.0000
#> Mean : 292 Mean :1.79 Mean :0.3854
#> 3rd Qu.: 356 3rd Qu.:2.00 3rd Qu.:1.0000
#> Max. :1742 Max. :3.00 Max. :1.0000
Summaries in different groups (sex)
dsummary(melanoma,thick+days+status~sex)
#> sex: 0
#> thick days status
#> Min. : 10.0 Min. : 99 Min. :1.000
#> 1st Qu.: 97.0 1st Qu.:1636 1st Qu.:2.000
#> Median : 162.0 Median :2059 Median :2.000
#> Mean : 248.6 Mean :2283 Mean :1.833
#> 3rd Qu.: 306.0 3rd Qu.:3131 3rd Qu.:2.000
#> Max. :1742.0 Max. :5565 Max. :3.000
#> ------------------------------------------------------------
#> sex: 1
#> thick days status
#> Min. : 16.0 Min. : 10 Min. :1.000
#> 1st Qu.: 105.0 1st Qu.:1052 1st Qu.:1.000
#> Median : 258.0 Median :1860 Median :2.000
#> Mean : 361.1 Mean :1946 Mean :1.722
#> 3rd Qu.: 484.0 3rd Qu.:2784 3rd Qu.:2.000
#> Max. :1466.0 Max. :4492 Max. :3.000
and only among those with thin-tumours or only females (sex==1)
dsummary(melanoma,thick+days+status~sex|thick<97)
#> sex: 0
#> thick days status
#> Min. :10.00 Min. : 355 Min. :1.000
#> 1st Qu.:32.00 1st Qu.:1762 1st Qu.:2.000
#> Median :64.00 Median :2227 Median :2.000
#> Mean :51.48 Mean :2425 Mean :2.034
#> 3rd Qu.:65.00 3rd Qu.:3185 3rd Qu.:2.000
#> Max. :81.00 Max. :4688 Max. :3.000
#> ------------------------------------------------------------
#> sex: 1
#> thick days status
#> Min. :16.00 Min. : 30 Min. :1.000
#> 1st Qu.:30.00 1st Qu.:1820 1st Qu.:2.000
#> Median :65.00 Median :2886 Median :2.000
#> Mean :55.75 Mean :2632 Mean :1.875
#> 3rd Qu.:81.00 3rd Qu.:3328 3rd Qu.:2.000
#> Max. :81.00 Max. :4207 Max. :3.000
dsummary(melanoma,thick+status~+1|sex==1)
#> thick status
#> Min. : 16.0 Min. :1.000
#> 1st Qu.: 105.0 1st Qu.:1.000
#> Median : 258.0 Median :2.000
#> Mean : 361.1 Mean :1.722
#> 3rd Qu.: 484.0 3rd Qu.:2.000
#> Max. :1466.0 Max. :3.000
or
dsummary(melanoma,~thick+status|sex==1)
#> thick status
#> Min. : 16.0 Min. :1.000
#> 1st Qu.: 105.0 1st Qu.:1.000
#> Median : 258.0 Median :2.000
#> Mean : 361.1 Mean :1.722
#> 3rd Qu.: 484.0 3rd Qu.:2.000
#> Max. :1466.0 Max. :3.000
To make more complex conditions need to use the I()
dsummary(melanoma,thick+days+status~sex|I(thick<97 & sex==1))
#> sex: 1
#> thick days status
#> Min. :16.00 Min. : 30 Min. :1.000
#> 1st Qu.:30.00 1st Qu.:1820 1st Qu.:2.000
#> Median :65.00 Median :2886 Median :2.000
#> Mean :55.75 Mean :2632 Mean :1.875
#> 3rd Qu.:81.00 3rd Qu.:3328 3rd Qu.:2.000
#> Max. :81.00 Max. :4207 Max. :3.000
Tables between variables
All bivariate tables
dtable(melanoma,~status+sex+ulc,level=2)
#>
#> status
#> sex 1 2 3
#> 0 28 91 7
#> 1 29 43 7
#>
#> status
#> ulc 1 2 3
#> 0 16 92 7
#> 1 41 42 7
#>
#> sex
#> ulc 0 1
#> 0 79 36
#> 1 47 43
All univariate tables
dtable(melanoma,~status+sex+ulc,level=1)
#>
#> status
#> 1 2 3
#> 57 134 14
#>
#> sex
#> 0 1
#> 126 79
#>
#> ulc
#> 0 1
#> 115 90
and with new variables
dtable(melanoma,~status+sex+ulc+dcut(days)+I(days>300),level=1)
#>
#> status
#> 1 2 3
#> 57 134 14
#>
#> sex
#> 0 1
#> 126 79
#>
#> ulc
#> 0 1
#> 115 90
#>
#> dcut(days)
#> [10,1.52e+03] (1.52e+03,2.00e+03] (2.00e+03,3.04e+03] (3.04e+03,5.56e+03]
#> 52 51 51 51
#>
#> I(days > 300)
#> FALSE TRUE
#> 11 194
To sort the data
data(melanoma)
mel= dsort(melanoma,~days)
dsort(melanoma) <- ~days
head(mel)
#> no status days ulc thick sex
#> 1 789 3 10 1 676 1
#> 2 13 3 30 0 65 1
#> 3 97 2 35 0 134 1
#> 4 16 3 99 0 290 0
#> 5 21 1 185 1 1208 1
#> 6 469 1 204 1 484 1
and to sort after multiple variables increasing and decreasing
To define a bunch of new covariates within a data-frame
data(melanoma)
melanoma= transform(melanoma, thick2=thick^2, lthick=log(thick) )
dhead(melanoma)
#> no status days ulc thick sex thick2 lthick
#> 1 789 3 10 1 676 1 456976 6.516193
#> 2 13 3 30 0 65 1 4225 4.174387
#> 3 97 2 35 0 134 1 17956 4.897840
#> 4 16 3 99 0 290 0 84100 5.669881
#> 5 21 1 185 1 1208 1 1459264 7.096721
#> 6 469 1 204 1 484 1 234256 6.182085
When the above definitions are done using a condition this can be achieved using the dtransform function that extends transform with a possible condition
On the melanoma data the variable thick gives the thickness of the melanom tumour. For some analyses we would like to make a factor depending on the thickness. This can be done in several different ways
New variable is named thickcat.0 by default.
To see levels of factors in data-frame
dlevels(melanoma)
#> thickcat.0 #levels=:4
#> [1] "[0,200]" "(200,500]" "(500,800]" "(800,2e+03]"
#> -----------------------------------------
Checking group sizes
dtable(melanoma,~thickcat.0)
#>
#> thickcat.0
#> [0,200] (200,500] (500,800] (800,2e+03]
#> 109 64 20 12
With adding to the data-frame directly
dcut(melanoma,breaks=c(0,200,500,800,2000)) <- gr.thick1~thick
dlevels(melanoma)
#> thickcat.0 #levels=:4
#> [1] "[0,200]" "(200,500]" "(500,800]" "(800,2e+03]"
#> -----------------------------------------
#> gr.thick1 #levels=:4
#> [1] "[0,200]" "(200,500]" "(500,800]" "(800,2e+03]"
#> -----------------------------------------
new variable is named thickcat.0 (after first cut-point), or to get quartiles with default names thick.cat.4
dcut(melanoma) <- ~ thick # new variable is thickcat.4
dlevels(melanoma)
#> thickcat.0 #levels=:4
#> [1] "[0,200]" "(200,500]" "(500,800]" "(800,2e+03]"
#> -----------------------------------------
#> gr.thick1 #levels=:4
#> [1] "[0,200]" "(200,500]" "(500,800]" "(800,2e+03]"
#> -----------------------------------------
#> thickcat.4 #levels=:4
#> [1] "[10,97]" "(97,194]" "(194,356]" "(356,1.74e+03]"
#> -----------------------------------------
or median groups, here starting again with the original data,
data(melanoma)
dcut(melanoma,breaks=2) <- ~ thick # new variable is thick.2
dlevels(melanoma)
#> thickcat.2 #levels=:2
#> [1] "[10,194]" "(194,1.74e+03]"
#> -----------------------------------------
to control new names
data(melanoma)
mela= dcut(melanoma,thickcat4+dayscat4~thick+days,breaks=4)
dlevels(mela)
#> thickcat4 #levels=:4
#> [1] "[10,97]" "(97,194]" "(194,356]" "(356,1.74e+03]"
#> -----------------------------------------
#> dayscat4 #levels=:4
#> [1] "[10,1.52e+03]" "(1.52e+03,2.00e+03]" "(2.00e+03,3.04e+03]"
#> [4] "(3.04e+03,5.56e+03]"
#> -----------------------------------------
or
data(melanoma)
dcut(melanoma,breaks=4) <- thickcat4+dayscat4~thick+days
dlevels(melanoma)
#> thickcat4 #levels=:4
#> [1] "[10,97]" "(97,194]" "(194,356]" "(356,1.74e+03]"
#> -----------------------------------------
#> dayscat4 #levels=:4
#> [1] "[10,1.52e+03]" "(1.52e+03,2.00e+03]" "(2.00e+03,3.04e+03]"
#> [4] "(3.04e+03,5.56e+03]"
#> -----------------------------------------
This can also be typed out more specifically
To see levels of covariates in data-frame
data(melanoma)
dcut(melanoma,breaks=4) <- thickcat4~thick
dlevels(melanoma)
#> thickcat4 #levels=:4
#> [1] "[10,97]" "(97,194]" "(194,356]" "(356,1.74e+03]"
#> -----------------------------------------
To relevel the factor
dtable(melanoma,~thickcat4)
#>
#> thickcat4
#> [10,97] (97,194] (194,356] (356,1.74e+03]
#> 56 53 45 51
melanoma = drelevel(melanoma,~thickcat4,ref="(194,356]")
dlevels(melanoma)
#> thickcat4 #levels=:4
#> [1] "[10,97]" "(97,194]" "(194,356]" "(356,1.74e+03]"
#> -----------------------------------------
#> thickcat4.(194,356] #levels=:4
#> [1] "(194,356]" "[10,97]" "(97,194]" "(356,1.74e+03]"
#> -----------------------------------------
or to take the third level in the list of levels, same as above,
melanoma = drelevel(melanoma,~thickcat4,ref=2)
dlevels(melanoma)
#> thickcat4 #levels=:4
#> [1] "[10,97]" "(97,194]" "(194,356]" "(356,1.74e+03]"
#> -----------------------------------------
#> thickcat4.(194,356] #levels=:4
#> [1] "(194,356]" "[10,97]" "(97,194]" "(356,1.74e+03]"
#> -----------------------------------------
#> thickcat4.2 #levels=:4
#> [1] "(97,194]" "[10,97]" "(194,356]" "(356,1.74e+03]"
#> -----------------------------------------
To combine levels of a factor (first combinining first 3 groups into one)
melanoma = drelevel(melanoma,~thickcat4,newlevels=1:3)
dlevels(melanoma)
#> thickcat4 #levels=:4
#> [1] "[10,97]" "(97,194]" "(194,356]" "(356,1.74e+03]"
#> -----------------------------------------
#> thickcat4.(194,356] #levels=:4
#> [1] "(194,356]" "[10,97]" "(97,194]" "(356,1.74e+03]"
#> -----------------------------------------
#> thickcat4.2 #levels=:4
#> [1] "(97,194]" "[10,97]" "(194,356]" "(356,1.74e+03]"
#> -----------------------------------------
#> thickcat4.1:3 #levels=:2
#> [1] "[10,97]-(194,356]" "(356,1.74e+03]"
#> -----------------------------------------
or to combine groups 1 and 2 into one group and 3 and 4 into another
dkeep(melanoma) <- ~thick+thickcat4
melanoma = drelevel(melanoma,gthick2~thickcat4,newlevels=list(1:2,3:4))
dlevels(melanoma)
#> thickcat4 #levels=:4
#> [1] "[10,97]" "(97,194]" "(194,356]" "(356,1.74e+03]"
#> -----------------------------------------
#> gthick2 #levels=:2
#> [1] "[10,97]-(97,194]" "(194,356]-(356,1.74e+03]"
#> -----------------------------------------
Changing order of factor levels
dfactor(melanoma,levels=c(3,1,2,4)) <- thickcat4.2~thickcat4
dlevel(melanoma,~ "thickcat4*")
#> thickcat4 #levels=:4
#> [1] "[10,97]" "(97,194]" "(194,356]" "(356,1.74e+03]"
#> -----------------------------------------
#> thickcat4.2 #levels=:4
#> [1] "(194,356]" "[10,97]" "(97,194]" "(356,1.74e+03]"
#> -----------------------------------------
dtable(melanoma,~thickcat4+thickcat4.2)
#>
#> thickcat4.2 (194,356] [10,97] (97,194] (356,1.74e+03]
#> thickcat4
#> [10,97] 0 56 0 0
#> (97,194] 0 0 53 0
#> (194,356] 45 0 0 0
#> (356,1.74e+03] 0 0 0 51
Combine levels but now control factor-level names
melanoma=drelevel(melanoma,gthick3~thickcat4,newlevels=list(group1.2=1:2,group3.4=3:4))
dlevels(melanoma)
#> thickcat4 #levels=:4
#> [1] "[10,97]" "(97,194]" "(194,356]" "(356,1.74e+03]"
#> -----------------------------------------
#> gthick2 #levels=:2
#> [1] "[10,97]-(97,194]" "(194,356]-(356,1.74e+03]"
#> -----------------------------------------
#> thickcat4.2 #levels=:4
#> [1] "(194,356]" "[10,97]" "(97,194]" "(356,1.74e+03]"
#> -----------------------------------------
#> gthick3 #levels=:2
#> [1] "group1.2" "group3.4"
#> -----------------------------------------
A numeric variable “status” with values 1,2,3 into a factor by
data(melanoma)
melanoma = dfactor(melanoma,~status, labels=c("malignant-melanoma","censoring","dead-other"))
melanoma = dfactor(melanoma,sexl~sex,labels=c("females","males"))
dtable(melanoma,~sexl+status.f)
#>
#> status.f malignant-melanoma censoring dead-other
#> sexl
#> females 28 91 7
#> males 29 43 7
A gender factor with values “M”, “F” can be converted into numerics by
melanoma = dnumeric(melanoma,~sexl)
dstr(melanoma,"sex*")
#> 'data.frame': 205 obs. of 3 variables:
#> $ sex : int 1 1 1 0 1 1 1 1 0 0 ...
#> $ sexl : Factor w/ 2 levels "females","males": 2 2 2 1 2 2 2 2 1 1 ...
#> $ sexl.n: num 2 2 2 1 2 2 2 2 1 1 ...
#> Warning in structure(res, ngroupvar = 0, class = c("daggregate", class(res))): Calling 'structure(NULL, *)' is deprecated, as NULL cannot have attributes.
#> Consider 'structure(list(), *)' instead.
dtable(melanoma,~'sex*',level=2)
#>
#> sex
#> sexl 0 1
#> females 126 0
#> males 0 79
#>
#> sex
#> sexl.n 0 1
#> 1 126 0
#> 2 0 79
#>
#> sexl
#> sexl.n females males
#> 1 126 0
#> 2 0 79
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: aarch64-apple-darwin24.2.0
#> Running under: macOS Sequoia 15.2
#>
#> Matrix products: default
#> BLAS: /Users/klaus/.asdf/installs/R/4.4.2/lib/R/lib/libRblas.dylib
#> LAPACK: /Users/klaus/.asdf/installs/R/4.4.2/lib/R/lib/libRlapack.dylib; LAPACK version 3.12.0
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: Europe/Copenhagen
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] mets_1.3.5 timereg_2.0.6 survival_3.8-3
#>
#> loaded via a namespace (and not attached):
#> [1] cli_3.6.3 knitr_1.49 rlang_1.1.4
#> [4] xfun_0.50 jsonlite_1.8.9 listenv_0.9.1
#> [7] future.apply_1.11.3 lava_1.8.1 htmltools_0.5.8.1
#> [10] sass_0.4.9 rmarkdown_2.29 grid_4.4.2
#> [13] evaluate_1.0.1 jquerylib_0.1.4 fastmap_1.2.0
#> [16] mvtnorm_1.3-2 yaml_2.3.10 lifecycle_1.0.4
#> [19] numDeriv_2016.8-1.1 compiler_4.4.2 codetools_0.2-20
#> [22] Rcpp_1.0.13-1 future_1.34.0 lattice_0.22-6
#> [25] digest_0.6.37 R6_2.5.1 parallelly_1.41.0
#> [28] parallel_4.4.2 splines_4.4.2 bslib_0.8.0
#> [31] Matrix_1.7-1 tools_4.4.2 globals_0.16.3
#> [34] cachem_1.1.0