This vignette explains how to use functions for “melting” wide data tables, i.e. converting to tall/long data tables. To clarify the discussion we introduce the following three terms:
measure.vars
in reshape2/data.table.id.vars
in reshape2/data.table.The nc
functions use data.table::melt
internally:
nc::capture_melt_single
is useful when you have a wide data table with lots of columns, all of the same type, that you would like to melt/reshape into a single result/output column. This uses melt
with a character vector for measure.vars
.nc::capture_melt_multiple
is useful when you have a wide data table with columns of possibly different types that you would like to melt/reshape into multiple result/output columns. This uses melt
with a list for measure.vars
.Both are useful when you want to use a regular expression to specify both (1) the set of input columns to reshape and (2) some information to extract from those column names.
Sometimes you want to melt a “wide” data table which has several distinct pieces of information encoded in each column name. One example is the familiar iris data, which have flower part and measurement dimension encoded in each of four column names:
head(iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
Those four reshape column names can be specified via a regex in nc::capture_melt_single
. The first argument is the input data table to reshape, and the subsequent arguments are interpreted as a pattern which is passed to nc::capture_first_vec
. Any input column names which match the specified regex will be passed as measure.vars
to melt
:
nc::capture_melt_single(
(iris.tall <-
iris,part=".*",
"[.]",
dim=".*",
value.name="cm"))
#> Species part dim cm
#> <fctr> <char> <char> <num>
#> 1: setosa Sepal Length 5.1
#> 2: setosa Sepal Length 4.9
#> 3: setosa Sepal Length 4.7
#> 4: setosa Sepal Length 4.6
#> 5: setosa Sepal Length 5.0
#> ---
#> 596: virginica Petal Width 2.3
#> 597: virginica Petal Width 1.9
#> 598: virginica Petal Width 2.0
#> 599: virginica Petal Width 2.3
#> 600: virginica Petal Width 1.8
Note the output above has one copy column (Species), two capture columns (part, dim), and one reshape column (cm). Internally the function joins the result of nc::capture_first_vec
(on column names) to the result of melt
(on the data).
The reshaped data can be plotted with different parts on rows and different dimensions on columns:
if(require(ggplot2)){
ggplot()+
theme_bw()+
theme(panel.spacing=grid::unit(0, "lines"))+
facet_grid(part ~ dim)+
geom_bar(aes(cm, fill=Species), data=iris.tall)
}#> Le chargement a nécessité le package : ggplot2
We could instead use capture_melt_multiple
to get multiple output columns. Like capture_melt_single
, the first argument of capture_melt_multiple
is the subject data table and the following arguments form a pattern which is matched to the input data column names. However the pattern must have at least two groups:
column
. This group is used to capture the part of the input column names which will be used for the output column names. The output will contain one column for each unique value captured in this group. nc::capture_melt_multiple(
(iris.part.cols <-
iris,column=".*",
"[.]",
dim=".*"))
#> Species dim Petal Sepal
#> <fctr> <char> <num> <num>
#> 1: setosa Length 1.4 5.1
#> 2: setosa Length 1.4 4.9
#> 3: setosa Length 1.3 4.7
#> 4: setosa Length 1.5 4.6
#> 5: setosa Length 1.4 5.0
#> ---
#> 296: virginica Width 2.3 3.0
#> 297: virginica Width 1.9 2.5
#> 298: virginica Width 2.0 3.0
#> 299: virginica Width 2.3 3.4
#> 300: virginica Width 1.8 3.0
Note that the reshaped table above contains one copy column (Species), one capture column (dim), and two reshape columns (Petal, Sepal). We can plot these data to see whether or not sepals are bigger than petals:
if(require(ggplot2)){
ggplot()+
theme_bw()+
theme(panel.spacing=grid::unit(0, "lines"))+
facet_grid(dim ~ Species)+
coord_equal()+
geom_abline(slope=1, intercept=0, color="grey")+
geom_point(aes(
Petal, Sepal),data=iris.part.cols)
}
It is clear from the plot above that sepals are indeed both longer and wider than petals, on each measured plant.
Another data set where it is useful to do column name pattern matching followed by melting is the World Health Organization data:
if(requireNamespace("tidyr")){
data(who, package="tidyr")
else{
} data.frame(id=1, new_sp_m5564=2, newrel_f65=3)
who <-
}#> Le chargement a nécessité le package : tidyr
names(who)
#> [1] "country" "iso2" "iso3" "year" "new_sp_m014"
#> [6] "new_sp_m1524" "new_sp_m2534" "new_sp_m3544" "new_sp_m4554" "new_sp_m5564"
#> [11] "new_sp_m65" "new_sp_f014" "new_sp_f1524" "new_sp_f2534" "new_sp_f3544"
#> [16] "new_sp_f4554" "new_sp_f5564" "new_sp_f65" "new_sn_m014" "new_sn_m1524"
#> [21] "new_sn_m2534" "new_sn_m3544" "new_sn_m4554" "new_sn_m5564" "new_sn_m65"
#> [26] "new_sn_f014" "new_sn_f1524" "new_sn_f2534" "new_sn_f3544" "new_sn_f4554"
#> [31] "new_sn_f5564" "new_sn_f65" "new_ep_m014" "new_ep_m1524" "new_ep_m2534"
#> [36] "new_ep_m3544" "new_ep_m4554" "new_ep_m5564" "new_ep_m65" "new_ep_f014"
#> [41] "new_ep_f1524" "new_ep_f2534" "new_ep_f3544" "new_ep_f4554" "new_ep_f5564"
#> [46] "new_ep_f65" "newrel_m014" "newrel_m1524" "newrel_m2534" "newrel_m3544"
#> [51] "newrel_m4554" "newrel_m5564" "newrel_m65" "newrel_f014" "newrel_f1524"
#> [56] "newrel_f2534" "newrel_f3544" "newrel_f4554" "newrel_f5564" "newrel_f65"
Each column which starts with new
has three distinct pieces of information encoded in its name: diagnosis type (e.g. sp or rel), gender (m or f), and age range (e.g. 5564 or 1524). We would like to use a regex to match these column names, then using the matching columns as measure.vars in a melt, then join the two results.
list(
new.diag.gender <-"new_?",
diagnosis=".*",
"_",
gender=".")
::capture_melt_single(who, new.diag.gender, ages=".*")
nc#> country iso2 iso3 year diagnosis gender ages
#> <char> <char> <char> <num> <char> <char> <char>
#> 1: Afghanistan AF AFG 1997 sp m 014
#> 2: Afghanistan AF AFG 1998 sp m 014
#> 3: Afghanistan AF AFG 1999 sp m 014
#> 4: Afghanistan AF AFG 2000 sp m 014
#> 5: Afghanistan AF AFG 2001 sp m 014
#> ---
#> 76042: Viet Nam VN VNM 2013 rel f 65
#> 76043: Wallis and Futuna Islands WF WLF 2013 rel f 65
#> 76044: Yemen YE YEM 2013 rel f 65
#> 76045: Zambia ZM ZMB 2013 rel f 65
#> 76046: Zimbabwe ZW ZWE 2013 rel f 65
#> value
#> <num>
#> 1: 0
#> 2: 30
#> 3: 8
#> 4: 52
#> 5: 129
#> ---
#> 76042: 3110
#> 76043: 2
#> 76044: 360
#> 76045: 669
#> 76046: 725
Note the output includes the new reshape column called value
by default, as in melt
. The input reshape column names which matched the specified pattern, and there is a new column for each group in that pattern. The following example shows how to rename the value
column and use numeric type conversion functions:
list(new.diag.gender, ages=list(
years.pattern <-min.years="0|[0-9]{2}", as.numeric,
max.years="[0-9]{0,2}", function(x)ifelse(x=="", Inf, as.numeric(x))))
nc::capture_melt_single(
(who.typed <-
who, years.pattern,value.name="count"))
#> country iso2 iso3 year diagnosis gender ages
#> <char> <char> <char> <num> <char> <char> <char>
#> 1: Afghanistan AF AFG 1997 sp m 014
#> 2: Afghanistan AF AFG 1998 sp m 014
#> 3: Afghanistan AF AFG 1999 sp m 014
#> 4: Afghanistan AF AFG 2000 sp m 014
#> 5: Afghanistan AF AFG 2001 sp m 014
#> ---
#> 76042: Viet Nam VN VNM 2013 rel f 65
#> 76043: Wallis and Futuna Islands WF WLF 2013 rel f 65
#> 76044: Yemen YE YEM 2013 rel f 65
#> 76045: Zambia ZM ZMB 2013 rel f 65
#> 76046: Zimbabwe ZW ZWE 2013 rel f 65
#> min.years max.years count
#> <num> <num> <num>
#> 1: 0 14 0
#> 2: 0 14 30
#> 3: 0 14 8
#> 4: 0 14 52
#> 5: 0 14 129
#> ---
#> 76042: 65 Inf 3110
#> 76043: 65 Inf 2
#> 76044: 65 Inf 360
#> 76045: 65 Inf 669
#> 76046: 65 Inf 725
str(who.typed)
#> Classes 'data.table' and 'data.frame': 76046 obs. of 10 variables:
#> $ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
#> $ iso2 : chr "AF" "AF" "AF" "AF" ...
#> $ iso3 : chr "AFG" "AFG" "AFG" "AFG" ...
#> $ year : num 1997 1998 1999 2000 2001 ...
#> $ diagnosis: chr "sp" "sp" "sp" "sp" ...
#> $ gender : chr "m" "m" "m" "m" ...
#> $ ages : chr "014" "014" "014" "014" ...
#> $ min.years: num 0 0 0 0 0 0 0 0 0 0 ...
#> $ max.years: num 14 14 14 14 14 14 14 14 14 14 ...
#> $ count : num 0 30 8 52 129 90 127 139 151 193 ...
#> - attr(*, ".internal.selfref")=<externalptr>
Note in the code/result above that non-character captured output columns can be obtained by specifying type conversion functions in the pattern.
Another example is exchange rate data from Eurostat. We first use read the entire data set into R:
system.file(
ert.gz <-"extdata", "ert_eff_ic_m.tsv.gz", package="nc", mustWork=TRUE)
data.table::fread(ert.gz, na.strings=":")
ert.all <-1:5, 1:5]
ert.all[#> exch_rt,unit,geo\\time 2020M07 2020M06 2020M05 2020M04
#> <char> <num> <num> <num> <num>
#> 1: NEER_EA19,I10,AT 100.00 100.00 100.0 100.00
#> 2: NEER_EA19,I10,AU 88.50 88.39 86.3 83.46
#> 3: NEER_EA19,I10,BE 100.00 100.00 100.0 100.00
#> 4: NEER_EA19,I10,BG 100.00 100.00 100.0 100.00
#> 5: NEER_EA19,I10,BR 38.49 39.94 38.0 40.45
We see that the first column has some CSV data which we can parse via:
ert.all[, 1]
ert.first <- c(sub("\\\\.*", "", names(ert.first)), ert.first[[1]])
csv.lines <- data.table::fread(text=paste(csv.lines, collapse="\n"))
ert.first.dt <- data.table::data.table(ert.first.dt, ert.all[,-1])
ert.wide <-1:5, 1:5]
ert.wide[#> exch_rt unit geo 2020M07 2020M06
#> <char> <char> <char> <num> <num>
#> 1: NEER_EA19 I10 AT 100.00 100.00
#> 2: NEER_EA19 I10 AU 88.50 88.39
#> 3: NEER_EA19 I10 BE 100.00 100.00
#> 4: NEER_EA19 I10 BG 100.00 100.00
#> 5: NEER_EA19 I10 BR 38.49 39.94
The wide data table can then be melted:
nc::capture_melt_single(
(ert.tall <-
ert.wide,year="[0-9]{4}", as.integer,
"M",
month="[0-9]{2}", as.integer))
#> exch_rt unit geo year month value
#> <char> <char> <char> <int> <int> <num>
#> 1: NEER_EA19 I10 AT 2020 7 100.00
#> 2: NEER_EA19 I10 AU 2020 7 88.50
#> 3: NEER_EA19 I10 BE 2020 7 100.00
#> 4: NEER_EA19 I10 BG 2020 7 100.00
#> 5: NEER_EA19 I10 BR 2020 7 38.49
#> ---
#> 2229: NEER_EA19 I10 BE 1994 1 99.07
#> 2230: NEER_EA19 I10 BG 1994 1 4808.49
#> 2231: NEER_EA19 I10 BR 1994 1 1225.11
#> 2232: NEER_EA19 I10 CA 1994 1 92.19
#> 2233: NEER_EA19 I10 CH 1994 1 83.43
After that we can create a time variable and plot via
:= data.table::as.IDate(
ert.tall[, month.IDate sprintf("%d-%d-15", year, month))]
if(require("ggplot2")){
ggplot()+
geom_hline(aes(
yintercept=value),
color="grey",
data=data.frame(value=100))+
geom_line(aes(
color=geo),
month.IDate, value, data=ert.tall[geo %in% c("CA", "US", "JP", "FR")])+
facet_grid(exch_rt ~ .)+
theme_bw()+
theme(panel.spacing=grid::unit(0, "lines"))
}
Another way to do it would be via
::capture_melt_single(ert.wide, month.POSIXct="[0-9].*", function(x){
ncas.POSIXct(strptime(paste0(x,"15"), "%YM%m%d"))
})#> exch_rt unit geo month.POSIXct value
#> <char> <char> <char> <num> <num>
#> 1: NEER_EA19 I10 AT 1594764000 100.00
#> 2: NEER_EA19 I10 AU 1594764000 88.50
#> 3: NEER_EA19 I10 BE 1594764000 100.00
#> 4: NEER_EA19 I10 BG 1594764000 100.00
#> 5: NEER_EA19 I10 BR 1594764000 38.49
#> ---
#> 2229: NEER_EA19 I10 BE 758588400 99.07
#> 2230: NEER_EA19 I10 BG 758588400 4808.49
#> 2231: NEER_EA19 I10 BR 758588400 1225.11
#> 2232: NEER_EA19 I10 CA 758588400 92.19
#> 2233: NEER_EA19 I10 CH 758588400 83.43
What if the input data set has “missing” input columns?
iris[, names(iris) != "Sepal.Length"]
iris.missing <-head(iris.missing)
#> Sepal.Width Petal.Length Petal.Width Species
#> 1 3.5 1.4 0.2 setosa
#> 2 3.0 1.4 0.2 setosa
#> 3 3.2 1.3 0.2 setosa
#> 4 3.1 1.5 0.2 setosa
#> 5 3.6 1.4 0.2 setosa
#> 6 3.9 1.7 0.4 setosa
In that case melting into multiple columns is an error by default:
list(column=".*", "[.]", dim=".*")
iris.pattern <-::capture_melt_multiple(iris.missing, iris.pattern)
nc#> Error in (function (subject.names, match.dt, no.match, fill = TRUE) : need dim=same count for each value, but have: Length=1 Width=2; please change pattern, edit input column names, or use fill=TRUE to output missing values
The error message explains that the number of input columns for each value of dim
must be the same, but there is one for Length
and two for Width
. To ignore the error and fill the output with missing values,
::capture_melt_multiple(iris.missing, iris.pattern, fill=TRUE)
nc#> Species dim Petal Sepal
#> <fctr> <char> <num> <num>
#> 1: setosa Length 1.4 NA
#> 2: setosa Length 1.4 NA
#> 3: setosa Length 1.3 NA
#> 4: setosa Length 1.5 NA
#> 5: setosa Length 1.4 NA
#> ---
#> 296: virginica Width 2.3 3.0
#> 297: virginica Width 1.9 2.5
#> 298: virginica Width 2.0 3.0
#> 299: virginica Width 2.3 3.4
#> 300: virginica Width 1.8 3.0
Note the missing values in the table above, which correspond to the missing input column in the original/wide data set.
Some real-world data sets can be reshaped into output columns with different types. An example data set from the PROVEDIt benchmark in criminology:
system.file(
peaks.csv <-"extdata", "RD12-0002_PP16HS_5sec_GM_F_1P.csv",
package="nc", mustWork=TRUE)
data.table::fread(peaks.csv)
peaks.wide <-print(data.table::data.table(
names=names(peaks.wide),
class=sapply(peaks.wide, class)),
topn=10)
#> names class
#> <char> <char>
#> 1: Sample File character
#> 2: Marker character
#> 3: Dye character
#> 4: Allele 1 character
#> 5: Size 1 numeric
#> 6: Height 1 integer
#> 7: Allele 2 character
#> 8: Size 2 numeric
#> 9: Height 2 integer
#> 10: Allele 3 character
#> ---
#> 294: Height 97 logical
#> 295: Allele 98 logical
#> 296: Size 98 logical
#> 297: Height 98 logical
#> 298: Allele 99 logical
#> 299: Size 99 logical
#> 300: Height 99 logical
#> 301: Allele 100 logical
#> 302: Size 100 logical
#> 303: Height 100 logical
There are 303 columns, with info for 100 peaks. Each peak has three features: Allele=character, Size=numeric, and Height=integer. The ending peaks are class logical because they are all missing. These data can be reshaped via
nc::capture_melt_multiple(
peaks.tall <-
peaks.wide,column=".*",
" ",
peak="[0-9]+", as.integer,
na.rm=TRUE)
#> Warning in melt.data.table(L[["data"]], measure.vars = L[["measure.vars"]], :
#> les colonnes 'measure.vars' [Allele 1, Allele 2, Allele 3, Allele 4, ...] ne
#> sont pas toutes du même type. Par ordre hiérarchique, la colonne de valeurs des
#> données transformée sera de type 'character'. Toutes les variables de mesure
#> qui ne sont pas de type 'character' seront également converties
#> automatiquement. Consultez DETAILS dans ?melt.data.table pour en savoir plus
#> sur la conversion automatique.
#> Warning in melt.data.table(L[["data"]], measure.vars = L[["measure.vars"]], :
#> les colonnes 'measure.vars' [Height 1, Height 2, Height 3, Height 4, ...] ne
#> sont pas toutes du même type. Par ordre hiérarchique, la colonne de valeurs des
#> données transformée sera de type 'integer'. Toutes les variables de mesure qui
#> ne sont pas de type 'integer' seront également converties automatiquement.
#> Consultez DETAILS dans ?melt.data.table pour en savoir plus sur la conversion
#> automatique.
#> Warning in melt.data.table(L[["data"]], measure.vars = L[["measure.vars"]], :
#> les colonnes 'measure.vars' [Size 1, Size 2, Size 3, Size 4, ...] ne sont pas
#> toutes du même type. Par ordre hiérarchique, la colonne de valeurs des données
#> transformée sera de type 'double'. Toutes les variables de mesure qui ne sont
#> pas de type 'double' seront également converties automatiquement. Consultez
#> DETAILS dans ?melt.data.table pour en savoir plus sur la conversion
#> automatique.
options(width=90)
old.opt <-print(peaks.tall)
#> Sample File Marker Dye peak Allele Height Size
#> <char> <char> <char> <int> <char> <int> <num>
#> 1: A02-RD12-0002-35-0.5PP16-001.5sec.fsa D3S1358 B 1 OL 5 94.68
#> 2: A02-RD12-0002-35-0.5PP16-001.5sec.fsa TH01 B 1 OL 3 148.69
#> 3: A02-RD12-0002-35-0.5PP16-001.5sec.fsa D21S11 B 1 OL 7 195.84
#> 4: A02-RD12-0002-35-0.5PP16-001.5sec.fsa D18S51 B 1 OL 4 285.51
#> 5: A02-RD12-0002-35-0.5PP16-001.5sec.fsa Penta E B 1 OL 3 369.36
#> ---
#> 438: A02-RD12-0002-35-0.5PP16-001.5sec.fsa FGA Y 25 OL 5 413.28
#> 439: A02-RD12-0002-35-0.5PP16-001.5sec.fsa Penta E B 26 OL 3 479.21
#> 440: A02-RD12-0002-35-0.5PP16-001.5sec.fsa FGA Y 26 OL 4 415.55
#> 441: A02-RD12-0002-35-0.5PP16-001.5sec.fsa FGA Y 27 OL 4 428.06
#> 442: A02-RD12-0002-35-0.5PP16-001.5sec.fsa FGA Y 28 OL 4 445.08
str(peaks.tall)
#> Classes 'data.table' and 'data.frame': 442 obs. of 7 variables:
#> $ Sample File: chr "A02-RD12-0002-35-0.5PP16-001.5sec.fsa" "A02-RD12-0002-35-0.5PP16-001.5sec.fsa" "A02-RD12-0002-35-0.5PP16-001.5sec.fsa" "A02-RD12-0002-35-0.5PP16-001.5sec.fsa" ...
#> $ Marker : chr "D3S1358" "TH01" "D21S11" "D18S51" ...
#> $ Dye : chr "B" "B" "B" "B" ...
#> $ peak : int 1 1 1 1 1 1 1 1 1 1 ...
#> $ Allele : chr "OL" "OL" "OL" "OL" ...
#> $ Height : int 5 3 7 4 3 4 3 197 45 4 ...
#> $ Size : num 94.7 148.7 195.8 285.5 369.4 ...
#> - attr(*, ".internal.selfref")=<externalptr>
options(old.opt)