The codelist package has an example code list and a data set that uses
codes from that code list. We will start by demonstrating how the
package works using this example code list. Let’s load the example code
list:
> library(codelist)
> data(objectcodes)
> objectcodes
   code                  label parent locale missing
1     A                   Toys   <NA>     EN       0
2     B                  Tools   <NA>     EN       0
3   A01             Teddy Bear      A     EN       0
4   A02                Toy Car      A     EN       0
5   A03                Marbles      A     EN       0
6   B01                 Hammer      B     EN       0
7   B02         Electric Drill      B     EN       0
8     A              Speelgoed   <NA>     NL       0
9     B            Gereedschap   <NA>     NL       0
10  A01              Teddybeer      A     NL       0
11  A02          Speelgoedauto      A     NL       0
12  A03               Knikkers      A     NL       0
13  B01                  Hamer      B     NL       0
14  B02            Boormachine      B     NL       0
15    X         Unknown object   <NA>     EN       1
16    X Onbekend type voorwerp   <NA>     NL       1
We see that the code list contains codes for encoding various types of objects. A code list contains at minimum a ‘code’ and a ‘label’ column. The ‘code’ column can be of any type; the ‘label’ column should be a character column. With the ‘parent’ column it is possible to define simple hierarchies. This column should contain codes from the ‘code’ column. A missing value indicates a top-level code. With the ‘locale’ column it is possible to have different versions of the ‘label’ and ‘description’ (here absent) columns. It can be used for different translations, as here, but could also be used for different versions of the labels and descriptions. The ‘missing’ column indicates whether or not the code should be treated as a missing value. This column should be interpretable as a logical column.
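To make the column requirements above concrete, a small code list can be written by hand as a plain data frame. The sketch below uses only base R. The object name mycodes and its contents are made up for this illustration, and the two checks are our own sanity checks, not the validation performed by the package.

```r
# A hand-made code list with the columns described above (hypothetical data)
mycodes <- data.frame(
  code    = c("A", "A01", "A02", "X"),
  label   = c("Toys", "Teddy Bear", "Toy Car", "Unknown object"),
  parent  = c(NA, "A", "A", NA),   # NA marks a top-level code
  locale  = "EN",
  missing = c(0, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Every non-missing parent should itself occur in the 'code' column
parents_ok <- all(is.na(mycodes$parent) | mycodes$parent %in% mycodes$code)

# The 'missing' column should be interpretable as a logical column
is_missing <- as.logical(mycodes$missing)
```

Here parents_ok is TRUE and is_missing flags only the last code, X.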
We will also load an example data set using the codes we loaded above:
> data(objectsales)
> objectsales |> head()
  product unitprice quantity totalprice
1     B01     70.65       67    4733.55
2     B01     76.93       76    5846.68
3     B01     43.49      100    4349.00
4     A03      3.08       26      80.08
5     A01     18.51       89    1647.39
6     A03      3.35       71     237.85
This is a data set containing the prices and sales of various
products. The ‘product’ column uses codes from the objectcodes code
list:
> objectsales$product |> head(10)
 [1] "B01" "B01" "B01" "A03" "A01" "A03" "A03" "B01" "A03" "A01"
One of the things we can do is convert the codes to their corresponding labels:
> to_labels(objectsales$product, objectcodes) |> head(10)
 [1] Hammer     Hammer     Hammer     Marbles    Teddy Bear Marbles
 [7] Marbles    Hammer     Marbles    Teddy Bear
Levels: Toys Tools Teddy Bear Toy Car Marbles Hammer Electric Drill
The to_labels function accepts a vector with codes and a code list for
this vector. It can get a bit tiresome to keep having to pass in the
code list. If it is missing, the function looks for a ‘codelist’
attribute on the vector:
> attr(objectsales$product, "codelist") <- objectcodes
> to_labels(objectsales$product) |> head(10)
 [1] Hammer     Hammer     Hammer     Marbles    Teddy Bear Marbles
 [7] Marbles    Hammer     Marbles    Teddy Bear
Levels: Toys Tools Teddy Bear Toy Car Marbles Hammer Electric Drill
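Conceptually, converting codes to labels is a lookup of each code in the code list. The base R sketch below shows that idea for a single locale. The helper lookup_labels and the small code list cl_en are hypothetical names introduced for this illustration; this is not how to_labels is implemented in the package.

```r
# Hypothetical helper: look up the label for each code (single locale only)
lookup_labels <- function(x, codelist) {
  # match() gives, for each element of x, its row number in the code list
  idx <- match(x, codelist$code)
  # Return a factor whose levels follow the order of the code list
  factor(codelist$label[idx], levels = codelist$label)
}

# Made-up code list for the illustration
cl_en <- data.frame(
  code  = c("A01", "A03", "B01"),
  label = c("Teddy Bear", "Marbles", "Hammer"),
  stringsAsFactors = FALSE
)

res <- lookup_labels(c("B01", "A03", "A01", "B01"), cl_en)
as.character(res)
# "Hammer" "Marbles" "Teddy Bear" "Hammer"
```

Codes that do not occur in the code list end up as NA in this sketch, because match returns NA for them.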
The codelist package also has a code type. Converting to a code object
adds the code class. This will result in some formatting and later on we
will see that this also ensures that we cannot assign invalid codes to
the vector:
> objectsales$product <- code(objectsales$product, objectcodes)
> objectsales$product |> head(10)
 [1] B01 B01 B01 A03 A01 A03 A03 B01 A03 A01
8 Codelist: A(=Toys) B(=Tools) A01(=Teddy Bear) A02(=Toy Car) ...X(=Unknown object)
> to_labels(objectsales$product) |> head(10)
 [1] Hammer     Hammer     Hammer     Marbles    Teddy Bear Marbles
 [7] Marbles    Hammer     Marbles    Teddy Bear
Levels: Toys Tools Teddy Bear Toy Car Marbles Hammer Electric Drill
For code objects there is also the labels method:
> labels(objectsales$product) |> head(10)
The labels method and the to_labels function can be used to get readable
output from various R functions:
> table(labels(objectsales$product), useNA = "ifany")
          Toys          Tools     Teddy Bear        Toy Car        Marbles
             0              0             29             14             16
        Hammer Electric Drill           <NA>
            30              2              9
> tapply(objectsales$unitprice, labels(objectsales$product), mean)
          Toys          Tools     Teddy Bear        Toy Car        Marbles
            NA             NA      19.761034      12.432857       2.480625
        Hammer Electric Drill
     45.303000     205.350000
> lm(unitprice ~ 0+labels(product), data = objectsales)
Call:
lm(formula = unitprice ~ 0 + labels(product), data = objectsales)

Coefficients:
    labels(product)Teddy Bear        labels(product)Toy Car
                       19.761                        12.433
       labels(product)Marbles         labels(product)Hammer
                        2.481                        45.303
labels(product)Electric Drill
                      205.350
By default codes that are considered missing are converted to NA when
converting to labels. This can be prevented by setting the missing
argument to FALSE:
> table(labels(objectsales$product, FALSE), useNA = "ifany")
          Toys          Tools     Teddy Bear        Toy Car        Marbles
             0              0             29             14             16
        Hammer Electric Drill Unknown object           <NA>
            30              2              5              4
The droplevels argument removes unused codes from the levels of the
generated factor vector:
> table(labels(objectsales$product, droplevels = TRUE), useNA = "ifany")
    Teddy Bear        Toy Car        Marbles         Hammer Electric Drill
            29             14             16             30              2
          <NA>
             9
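The effect of dropping unused levels can also be seen with a plain base R factor. The sketch below is only an analogy that uses base R's droplevels function on made-up data; it does not show how the package implements its droplevels argument.

```r
# A factor with levels that do not occur in the data (made-up example)
labs <- factor(c("Hammer", "Marbles", "Hammer", NA),
               levels = c("Toys", "Tools", "Marbles", "Hammer"))

# All levels are tabulated, including the unused ones with count 0
counts_all <- table(labs, useNA = "ifany")

# After dropping unused levels only 'Marbles' and 'Hammer' remain
counts_used <- table(droplevels(labs), useNA = "ifany")
```

In counts_all the levels Toys and Tools appear with a count of zero, while counts_used only contains the levels that actually occur, plus the NA category.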
Using the ‘locale’ column of the code list it is possible to specify
different versions of the labels and descriptions. This can be used to
specify different translations as in this example, but can also be used
to specify different versions, for example, long and short labels. By
default all methods will use the first locale in the code list as the
default locale; this is the locale returned by the cl_locale function:
> cl_locale(objectcodes)
[1] "EN"
Most methods also have a locale argument with which it is possible to
specify the preferred locale (the default is used when the preferred
locale is not present). For example:
> labels(objectsales$product, locale = "NL") |> head()
[1] Hamer     Hamer     Hamer     Knikkers  Teddybeer Knikkers
7 Levels: Speelgoed Gereedschap Teddybeer Speelgoedauto Knikkers ... Boormachine
It can become tedious having to specify the locale for each function
call. The cl_locale function will look at the CLLOCALE option, when
present, to get the preferred locale. Therefore, to set a default
preferred locale:
> op <- options(CLLOCALE = "NL")
> cl_locale(objectcodes)
[1] "NL"
> tapply(objectsales$unitprice, labels(objectsales$product), mean)
    Speelgoed   Gereedschap     Teddybeer Speelgoedauto      Knikkers
           NA            NA     19.761034     12.432857      2.480625
        Hamer   Boormachine
    45.303000    205.350000
> # Set the locale back to the original value (unset)
> options(op)
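The set-and-restore pattern used above relies on the fact that options returns the previous values of the options it changes. The base R sketch below shows the same pattern wrapped in a small helper. The function name with_locale is hypothetical and not part of the package; only the option name CLLOCALE is taken from the text above.

```r
# Hypothetical helper: evaluate an expression with CLLOCALE temporarily set
with_locale <- function(locale, expr) {
  # options() returns the old value(s), which we restore when the function exits
  op <- options(CLLOCALE = locale)
  on.exit(options(op), add = TRUE)
  # expr is evaluated lazily, so it is evaluated here, after the option is set
  expr
}

before <- getOption("CLLOCALE")
during <- with_locale("NL", getOption("CLLOCALE"))
after  <- getOption("CLLOCALE")
```

Here during is "NL", while after equals before, so the option is back at its original value (or unset) once the call has finished, even when the expression raises an error.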
Using the codes function it is possible to look up the codes based on a
set of labels. For example, below we look up the code for ‘Hammer’:
> codes("Hammer", objectcodes)
[1] "B01"
Alternatively, we can get the code list from the relevant variable
itself using the cl method, which returns the code list of the variable:
> codes("Hammer", cl(objectsales$product))
[1] "B01"
This could be used to make selections. For example, instead of
> subset(objectsales, product == "B02")
         product unitprice quantity totalprice
33 B02[Electri…]    284.85       52   14812.20
73 B02[Electri…]    125.85       73    9187.05
one can do
> subset(objectsales, product == codes("Electric Drill", cl(product)))
         product unitprice quantity totalprice
33 B02[Electri…]    284.85       52   14812.20
73 B02[Electri…]    125.85       73    9187.05
In general the latter is more readable and makes the intent of the code much clearer (unless one can assume that the people reading the code will know most of the product codes).
When comparing a code object to labels, it is also possible to use the
as.label function. This will add the class “label” to the character
vector. The comparison operator will then first call the codes function
on the label:
> subset(objectsales, product == as.label("Electric Drill"))
         product unitprice quantity totalprice
33 B02[Electri…]    284.85       52   14812.20
73 B02[Electri…]    125.85       73    9187.05
This only works for the equal-to and not-equal-to operators.
Selecting this way has an advantage over selecting records based on character vectors or factor vectors. For example we could also have done the following:
> subset(objectsales, labels(product) == "Electric Drill")
         product unitprice quantity totalprice
33 B02[Electri…]    284.85       52   14812.20
73 B02[Electri…]    125.85       73    9187.05
However, a small, difficult to spot, spelling mistake would have resulted in:
> subset(objectsales, labels(product) == "Electric drll")
[1] product    unitprice  quantity   totalprice
<0 rows> (or 0-length row.names)
And we could have believed that no electric drills were sold. The codes
function will also check if the provided labels are valid and, if not,
will generate an error (the try is there to make sure we don’t actually
throw an error).
> try({
+ subset(objectsales, product == codes("Electric drill", cl(product)))
+ })
Error in codes.default("Electric drill", cl(product)) :
  Labels not present in codelist in current locale.
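The difference between a silent empty result and a loud failure can be sketched in base R as a reverse lookup that refuses unknown labels. The helper lookup_codes and the code list cl_demo below are hypothetical and only illustrate the idea; the actual codes function also takes the locale into account, which this sketch ignores.

```r
# Hypothetical helper: look up codes for labels, failing on unknown labels
lookup_codes <- function(labels, codelist) {
  idx <- match(labels, codelist$label)
  if (anyNA(idx)) {
    # Report the offending labels instead of silently returning NA
    stop("Labels not present in code list: ",
         paste(labels[is.na(idx)], collapse = ", "))
  }
  codelist$code[idx]
}

# Made-up code list for the illustration
cl_demo <- data.frame(
  code  = c("B01", "B02"),
  label = c("Hammer", "Electric Drill"),
  stringsAsFactors = FALSE
)

ok  <- lookup_codes("Electric Drill", cl_demo)                        # "B02"
bad <- try(lookup_codes("Electric drill", cl_demo), silent = TRUE)    # error
```

The misspelled label results in an object of class "try-error" rather than an empty selection, so the mistake cannot go unnoticed.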
Since selecting on labels is a common operation, there is also the
in_labels function that will return a logical vector indicating whether
or not a code has a label in the given set:
> subset(objectsales, in_labels(product, "Electric Drill"))
         product unitprice quantity totalprice
33 B02[Electri…]    284.85       52   14812.20
73 B02[Electri…]    125.85       73    9187.05
This function will of course also generate an error in case of invalid labels:
> try({
+ subset(objectsales, in_labels(product, "Electric drill"))
+ })
Error in codes.default(labels, codelist) :
  Labels not present in codelist in current locale.
In the examples above we used the base function subset, but this will of
course also work within data.tables and the filter methods from dplyr.
When the vector with codes is transformed to a code object, it can of
course also be assigned to:
> objectsales$product[10] <- "A01"
> objectsales$product[1:10]
 [1] B01 B01 B01 A03 A01 A03 A03 B01 A03 A01
8 Codelist: A(=Toys) B(=Tools) A01(=Teddy Bear) A02(=Toy Car) ...X(=Unknown object)
Here the codes function can also be of use (again, an invalid label will
result in an error so this is a safe operation):
> objectsales$product[10] <- codes("Teddy Bear", objectcodes)
> objectsales$product[1:10]
 [1] B01 B01 B01 A03 A01 A03 A03 B01 A03 A01
8 Codelist: A(=Toys) B(=Tools) A01(=Teddy Bear) A02(=Toy Car) ...X(=Unknown object)
Another option is to use the as.label function which labels a character
vector as a label:
> objectsales$product[10] <- as.label("Electric Drill")
> objectsales$product[1:10]
 [1] B01 B01 B01 A03 A01 A03 A03 B01 A03 B02
8 Codelist: A(=Toys) B(=Tools) A01(=Teddy Bear) A02(=Toy Car) ...X(=Unknown object)
Each code can have a parent code. With this a simple hierarchy can be
defined. At the top of the hierarchy are the codes without a parent
(NA). This is level 0. Codes with a parent in level 0 are in level 1,
etc. Note that level 0 is a higher level than level 1. The example code
list objectcodes has two levels:
> cl_nlevels(objectcodes)
[1] 2
> cl_levels(objectcodes)
 [1] 0 0 1 1 1 1 1 0 0 1 1 1 1 1 0 0
These levels can be used to ‘cast’ the codes to a higher level:
> objectsales$group <- levelcast(objectsales$product, 0)
> head(objectsales)
        product unitprice quantity totalprice    group
1   B01[Hammer]     70.65       67    4733.55 B[Tools]
2   B01[Hammer]     76.93       76    5846.68 B[Tools]
3   B01[Hammer]     43.49      100    4349.00 B[Tools]
4  A03[Marbles]      3.08       26      80.08  A[Toys]
5 A01[Teddy B…]     18.51       89    1647.39  A[Toys]
6  A03[Marbles]      3.35       71     237.85  A[Toys]
This is useful, for example, to create aggregates at higher levels. Below we calculate the total number of toys and tools sold:
> aggregate(objectsales[c("quantity", "totalprice")],
+ objectsales[c("group")], sum)
        group quantity totalprice
1     A[Toys]     3274   43918.09
2    B[Tools]     1829  103011.65
3 X[Unknown…]      308   18184.42
Note that by default the code list of the vector returned by levelcast
will be modified to only contain the codes in the higher hierarchy (this
can be suppressed using the filter_codelist = FALSE argument):
> cl(objectsales$group)
   code                  label parent locale missing
1     A                   Toys   <NA>     EN   FALSE
2     B                  Tools   <NA>     EN   FALSE
8     A              Speelgoed   <NA>     NL   FALSE
9     B            Gereedschap   <NA>     NL   FALSE
15    X         Unknown object   <NA>     EN    TRUE
16    X Onbekend type voorwerp   <NA>     NL    TRUE
Also, when the data contains codes from different levels, trying to
cast to a level lower than that of some of the codes in the vector will
by default result in an error. This can be controlled with the
over_level argument.
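The level of a code and the cast to the top level both follow from walking up the ‘parent’ column. The base R sketch below illustrates this on a made-up hierarchy. The names hier, code_level and to_top are hypothetical; this is not the implementation behind cl_levels or levelcast, and it does not handle locales or the over_level behaviour.

```r
# Made-up hierarchy: NA in 'parent' marks a top-level (level 0) code
hier <- data.frame(
  code   = c("A", "B", "A01", "A02", "B01"),
  parent = c(NA, NA, "A", "A", "B"),
  stringsAsFactors = FALSE
)

# Level of a code: number of steps needed to reach a code without a parent
code_level <- function(code, hier) {
  level <- 0
  parent <- hier$parent[match(code, hier$code)]
  while (!is.na(parent)) {
    level <- level + 1
    parent <- hier$parent[match(parent, hier$code)]
  }
  level
}

# Cast a code to level 0 by following parents until there is none left
to_top <- function(code, hier) {
  parent <- hier$parent[match(code, hier$code)]
  while (!is.na(parent)) {
    code <- parent
    parent <- hier$parent[match(code, hier$code)]
  }
  code
}

lvls <- vapply(hier$code, code_level, numeric(1), hier = hier, USE.NAMES = FALSE)
tops <- vapply(c("B01", "A02", "A"), to_top, character(1), hier = hier,
               USE.NAMES = FALSE)
```

For this hierarchy lvls is 0 0 1 1 1 and tops is "B" "A" "A", which mirrors how the product codes above were grouped into Tools and Toys.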
Using a code vector also has the advantage that the codes assigned to it
will be validated against the code list, generating an error when one
tries to assign an invalid code:
> try({
+ objectsales$product[10] <- "Q"
+ })
Error in `[<-.code`(`*tmp*`, 10, value = "Q") :
  Invalid codes used in value.
This makes a code object safer to work with than, for example, a
character or numeric vector with codes (a factor vector will also
generate a warning for invalid factor levels).
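The behaviour of a factor mentioned in parentheses can be checked in base R. The sketch below uses a made-up factor and records whether the assignment of an invalid level raised a warning; note that the assignment itself still goes through and stores NA, whereas the code object above refuses the assignment with an error.

```r
f <- factor(c("a", "b", "c"))

# Record whether assigning a value that is not a level raises a warning
warned <- FALSE
withCallingHandlers(
  f[1] <- "z",
  warning = function(w) {
    warned <<- TRUE
    invokeRestart("muffleWarning")   # continue after noting the warning
  }
)

# The assignment was carried out, but the element is now NA
first_is_na <- is.na(f[1])
```

After running this, warned is TRUE and first_is_na is TRUE: the invalid value was only warned about and silently turned into a missing value.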
The codes function and the as.label function (which calls the codes
function) will also generate an error:
> try({
+ objectsales$product[10] <- as.label("Teddy bear")
+ })
Error in codes.default(value, codelist) :
  Labels not present in codelist in current locale.
Assigning NA will of course still work:
> objectsales$product[10] <- NA
A code object is safer to work with than a factor vector. For example:
> x <- factor(letters[1:3])
> y <- code(1:3, data.frame(code = 1:3, label = letters[1:3]))
Comparing on invalid codes works with a factor while it will generate
an error for code objects:
> try({ x == 4 })
[1] FALSE FALSE FALSE
> try({ y == 4 })
Error in Ops.code(y, 4) : Invalid codes used in RHS
The same holds when comparing on labels:
> try({ x == "foobar" })
[1] FALSE FALSE FALSE
A code cannot directly be compared on a label and will generate an error
even when the label is valid:
> try({ y == "a" })
Error in Ops.code(y, "a") :
  RHS not of the same class as the used codes of the LHS.
One should use either the codes or as.label function for that:
> try({ y == as.label("a") })
[1] TRUE FALSE FALSE
> try({ y == as.label("foobar") })
Error in codes.default(e2, cl(e1)) :
  Labels not present in codelist in current locale.