---
title: "Introduction to DemografixeR"
author: "Matthias Brenninkmeijer"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to DemografixeR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Introduction

Let's illustrate the usefulness of _DemografixeR_ with a simple example. Say we know the first names of a sample of customers, but useful information about gender, age or nationality is unavailable:

```{r plot_names, echo=FALSE, cache=FALSE}
df <- structure(list(X1 = "Maria", X2 = "Ben", X3 = "Claudia",
                     X4 = "Adam", X5 = "Hannah", X6 = "Robert"),
                class = "data.frame",
                row.names = "**Customers:**")
knitr::kable(df, col.names = NULL)
```

Names carry a strong sociocultural signal: their popularity varies across time and location, so naming conventions can be good predictors of other useful variables such as **gender**, **age** & **nationality**. Here's where _DemografixeR_ comes in:

> "_DemografixeR_ allows R users to connect directly to the (1) [_genderize.io API_](https://genderize.io), the (2) [_agify.io API_](https://agify.io/) and the (3) [_nationalize.io API_](https://nationalize.io/) to obtain the (1) **gender**, (2) **age** & (3) **nationality** of a name in a tidy format."

_DemografixeR_ deals with the hassle of API pagination, missing values, duplicated names, trimming whitespace and parsing the results into a tidy format, giving the user time to analyze the data instead of tidying it.
To do so, _DemografixeR_ is built on three main functions, which we will use to predict the key demographic variables for the previous sample of customers and so 'fix' the missing demographic information:

```{r package structure, echo=FALSE, cache=FALSE}
df <- structure(list(API = c("https://genderize.io",
                             "https://agify.io",
                             "https://nationalize.io"),
                     `R function` = c("`genderize(name)`",
                                      "`agify(name)`",
                                      "`nationalize(name)`"),
                     `Estimated variable` = c("Gender", "Age", "Nationality")),
                class = "data.frame",
                row.names = c(NA, -3L))
knitr::kable(df, row.names = FALSE)
```

They all work similarly and can be integrated into many workflows. Using the previous group of customers, we can obtain the following results:

```{r result, echo=FALSE, cache=FALSE}
df <- structure(list(X1 = c("Maria", "female", "21", "CY"),
                     X2 = c("Ben", "male", "48", "AU"),
                     X3 = c("Claudia", "female", "45", "CL"),
                     X4 = c("Adam", "male", "34", "PL"),
                     X5 = c("Hannah", "female", "27", "SL"),
                     X6 = c("Robert", "male", "59", "US")),
                row.names = c("**Customers:**",
                              "**Estimated gender:**",
                              "**Estimated age:**",
                              "**Estimated nationality:**"),
                class = "data.frame")
knitr::kable(df, col.names = NULL)
```

To see how to get to these results, read on!

## Get Started

### Setup

First, we need to load the package:

```{r setup, cache=FALSE}
library("DemografixeR")
```

### API credentials

The following step is optional; it is only necessary if you plan to estimate gender, age or nationality for more than 1000 different names a day. In that case, you need to obtain an API key from the following link:

* [_genderize.io store_](https://store.genderize.io)

To use the API key, save it once with the **`save_api_key(key)`** function and you're all set. All the functions will automatically retrieve the key once saved:

```{r save API function, eval=FALSE}
save_api_key(key = "__YOUR_API_KEY__")
```

Please be careful when dealing with secrets/tokens/credentials and do not share them publicly.
Still, if you want to check which API key you've saved, retrieve it with the **`get_api_key()`** function. To fully remove the saved key, use the **`remove_api_key()`** function.

### Gender

We start by **predicting the gender** of our customers. For this we use the **`genderize(name)`** function:

```{r genderize, eval=TRUE, echo=TRUE, cache=FALSE}
customers_names <- c("Maria", "Ben", "Claudia", "Adam", "Hannah", "Robert")

customers_predicted_gender <- genderize(name = customers_names)
customers_predicted_gender # Print results
```

We see that `genderize(name)` returns the estimated gender for each name as a `character` vector:

```{r genderize_class, cache=FALSE}
class(customers_predicted_gender)
```

It is also possible to obtain a detailed `data.frame` object with additional information by setting `simplify = FALSE`. _DemografixeR_ also works with 'pipes':

```{r genderize_dataframe_print, eval=FALSE}
library(magrittr) # provides the %>% pipe

gender_df <- genderize(name = customers_names, simplify = FALSE)

customers_names %>%
  genderize(simplify = FALSE) %>%
  knitr::kable(row.names = FALSE)
```

```{r genderize_dataframe_calculate, cache=FALSE, echo=FALSE}
df <- structure(list(name = c("Maria", "Ben", "Claudia",
                              "Adam", "Hannah", "Robert"),
                     type = c("gender", "gender", "gender",
                              "gender", "gender", "gender"),
                     gender = c("female", "male", "female",
                                "male", "female", "male"),
                     probability = c(0.98, 0.95, 0.98, 0.98, 0.97, 0.99),
                     count = c(334287L, 77991L, 118604L,
                               116396L, 13198L, 177418L)),
                row.names = c(5L, 2L, 3L, 1L, 4L, 6L),
                class = "data.frame")
knitr::kable(df, row.names = FALSE)
```

### Age

We continue with the **age** estimation of our customers.
As with the **`genderize(name)`** function, the `simplify` parameter also works with the **`agify(name)`** function to retrieve a `data.frame`:

```{r agify_print, cache=FALSE, eval=FALSE}
customers_predicted_age <- agify(name = customers_names, simplify = FALSE)

customers_names %>%
  agify(simplify = FALSE) %>%
  knitr::kable(row.names = FALSE)
```

```{r agify_calculate, cache=FALSE, echo=FALSE}
df <- structure(list(name = c("Maria", "Ben", "Claudia",
                              "Adam", "Hannah", "Robert"),
                     type = c("age", "age", "age", "age", "age", "age"),
                     age = c(21L, 48L, 45L, 34L, 27L, 59L),
                     count = c(517258L, 75632L, 110105L,
                               110754L, 12843L, 160915L)),
                row.names = c(5L, 2L, 3L, 1L, 4L, 6L),
                class = "data.frame")
knitr::kable(df, row.names = FALSE)
```

### Nationality

Last but not least, we finish with the **nationality** estimation. As with the **`genderize(name)`** and **`agify(name)`** functions, the `simplify` parameter also works with the **`nationalize(name)`** function to retrieve a `data.frame`:

```{r nationality_print, cache=FALSE, eval=FALSE}
customers_predicted_nationality <- nationalize(name = customers_names,
                                               simplify = FALSE)

customers_names %>%
  nationalize(simplify = FALSE) %>%
  knitr::kable(row.names = FALSE)
```

```{r nationality_calculate, cache=FALSE, echo=FALSE}
df <- structure(list(name = c("Maria", "Ben", "Claudia",
                              "Adam", "Hannah", "Robert"),
                     type = c("nationality", "nationality", "nationality",
                              "nationality", "nationality", "nationality"),
                     country_id = c("CY", "AU", "CL", "PL", "SL", "US"),
                     probability = c(0.0550797623186088, 0.0665534047056411,
                                     0.0559340459826196, 0.0905835974781003,
                                     0.267325397602591, 0.0909442129069588)),
                row.names = c(5L, 2L, 3L, 1L, 4L, 6L),
                class = "data.frame")
knitr::kable(df, row.names = FALSE)
```

### Other parameters

#### **`country_id`** parameter

Predictions are in many cases more accurate if the data is narrowed to a specific country.
Luckily, both the **`genderize(name)`** and **`agify(name)`** functions support passing a country code parameter (following the common ISO 3166-1 alpha-2 country code convention). For obvious reasons, the **`nationalize(name)`** function does not:

```{r country_id, cache=FALSE}
us_customers_predicted_gender <- genderize(name = customers_names,
                                           country_id = "US")
us_customers_predicted_gender

us_customers_predicted_age <- agify(name = customers_names,
                                    country_id = "US")
us_customers_predicted_age
```

To obtain a `data.frame` of all supported countries, use the **`supported_countries(type)`** function. Here are the first 5 countries:

```{r supported_countries_print, cache=FALSE, eval=FALSE}
supported_countries(type = "genderize") %>%
  head(5) %>%
  knitr::kable(row.names = FALSE)
```

```{r supported_countries_calculate, cache=FALSE, echo=FALSE}
df <- structure(list(country_id = c("AD", "AE", "AF", "AG", "AI"),
                     name = c("Andorra", "United Arab Emirates", "Afghanistan",
                              "Antigua and Barbuda", "Anguilla"),
                     total = c(29783L, 145847L, 23531L, 1723L, 1081L)),
                row.names = c(NA, 5L),
                class = "data.frame")
knitr::kable(df, row.names = FALSE)
```

In this case the `total` column reflects the number of observations the API has for each country.

The beauty of the `country_id` parameter is that it accepts either a single `character` string or a `character` vector with the same length as the `name` parameter. An example illustrates this better:

```{r country_id_multi_print, cache=FALSE, eval=FALSE}
agify(name = c("Hannah", "Ben"),
      country_id = c("US", "GB"),
      simplify = FALSE) %>%
  knitr::kable(row.names = FALSE)
```

```{r country_id_multi_calculate, cache=FALSE, echo=FALSE}
df <- structure(list(name = c("Hannah", "Ben"),
                     type = c("age", "age"),
                     age = c(54L, 38L),
                     count = c(67L, 1980L),
                     country_id = c("US", "GB")),
                row.names = 2:1,
                class = "data.frame")
knitr::kable(df, row.names = FALSE)
```

In the previous example we passed two names - Hannah & Ben - and two country codes - US & GB.
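
To make the pairing explicit, here is a minimal base R sketch (no API call is made) of how names and country codes line up element by element. The helper `pair_names_with_countries()` is hypothetical and not part of _DemografixeR_; it only illustrates the length rule described above, under the assumption that a single country code is reused for every name.

```r
# Hypothetical helper, NOT part of DemografixeR: illustrates how a
# `country_id` of length 1 or of the same length as `name` is paired
# with the names, element by element.
pair_names_with_countries <- function(name, country_id) {
  valid_length <- length(country_id) %in% c(1L, length(name))
  if (!valid_length) {
    stop("`country_id` must have length 1 or the same length as `name`.")
  }
  data.frame(
    name = name,
    # A single country code is repeated for every name
    country_id = rep_len(country_id, length(name)),
    stringsAsFactors = FALSE
  )
}

# One country code per name, as in the example above
pairs <- pair_names_with_countries(c("Hannah", "Ben"), c("US", "GB"))
pairs

# A single country code shared by all names
shared <- pair_names_with_countries(c("Maria", "Adam", "Robert"), "US")
shared
```
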
Thus, the functions are vectorized over both arguments, which is especially useful for workflows where we are using a `data.frame` with one variable containing names and another variable containing country codes.

#### **`meta`** parameter

All three functions have a `meta` parameter which, when set to `TRUE`, returns information about the API itself, such as:

* The number of names available in the current time window
* The number of names left in the current time window
* Seconds remaining until a new time window opens

Here's an example:

```{r meta_parameter_print, cache=FALSE, eval=FALSE}
genderize(name = "Hannah", simplify = FALSE, meta = TRUE) %>%
  knitr::kable(row.names = FALSE)
```

```{r meta_parameter_calculate, cache=FALSE, echo=FALSE}
df <- structure(list(name = "Hannah",
                     type = "gender",
                     gender = "female",
                     probability = 0.97,
                     count = 13198L,
                     api_rate_limit = 1000L,
                     api_rate_remaining = 977L,
                     api_rate_reset = 7218L,
                     api_request_timestamp = structure(1588715982,
                                                       class = c("POSIXct", "POSIXt"),
                                                       tzone = "GMT")),
                row.names = 1L,
                class = "data.frame")
knitr::kable(df, row.names = FALSE)
```

#### **`sliced`** parameter

The **`nationalize(name)`** function has the useful **`sliced`** parameter. Names can have multiple estimated nationalities, and the **`nationalize(name)`** function automatically ranks them by probability. This logical parameter 'slices' the result, keeping only the value with the highest probability so that there is a single estimate for each name (one country per name). It is set to `TRUE` by default. But you may wish to see all the potential countries a name can be associated with.
For this, simply set the parameter to `FALSE`:

```{r sliced_false_print, cache=FALSE, eval=FALSE}
nationalize(name = "Matthias", simplify = FALSE, sliced = FALSE) %>%
  knitr::kable(row.names = FALSE)
```

```{r sliced_false_calculate, cache=FALSE, echo=FALSE}
df <- structure(list(name = c("Matthias", "Matthias", "Matthias"),
                     type = c("nationality", "nationality", "nationality"),
                     country_id = c("DE", "AT", "CH"),
                     probability = c(0.41616382151, 0.265062479571606,
                                     0.1106921611552)),
                row.names = c(NA, 3L),
                class = "data.frame")
knitr::kable(df, row.names = FALSE)
```

In the last example you see that instead of returning a single country code, it returns multiple country codes with their associated probabilities.

## Customers example

Let's replicate the initial example with our group of customers. Voilà!

```{r customers_end_print, echo=TRUE, cache=FALSE, eval=FALSE}
library(dplyr)

df <- data.frame("Customers:" = c("Maria", "Ben", "Claudia",
                                  "Adam", "Hannah", "Robert"),
                 stringsAsFactors = FALSE,
                 check.names = FALSE)

df <- df %>%
  mutate(`Estimated gender:` = genderize(`Customers:`),
         `Estimated age:` = agify(`Customers:`),
         `Estimated nationality:` = nationalize(`Customers:`))

df %>%
  t() %>%
  knitr::kable(col.names = NULL)
```

```{r customers_end_calc, echo=FALSE, cache=FALSE}
df <- structure(
  c("Maria", "female", "21", "CY",
    "Ben", "male", "48", "AU",
    "Claudia", "female", "45", "CL",
    "Adam", "male", "34", "PL",
    "Hannah", "female", "27", "SL",
    "Robert", "male", "59", "US"),
  .Dim = c(4L, 6L),
  .Dimnames = list(c("Customers:", "Estimated gender:",
                     "Estimated age:", "Estimated nationality:"),
                   NULL))
knitr::kable(df, col.names = NULL)
```

## Further information

For more information access the package documentation at [https://matbmeijer.github.io/DemografixeR](https://matbmeijer.github.io/DemografixeR/).
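
As a closing sketch, the detailed `data.frame` output includes a `probability` column, which can be used to keep only the more confident predictions. The example below hard-codes the gender results shown earlier in this vignette so that no API call is needed. The cut-off `min_probability` and the column name `gender_confident` are assumptions chosen purely for illustration; they are not part of _DemografixeR_.

```r
# Values copied from the `genderize(simplify = FALSE)` table shown earlier,
# hard-coded here so that the example runs without calling the API.
gender_df <- data.frame(
  name = c("Maria", "Ben", "Claudia", "Adam", "Hannah", "Robert"),
  gender = c("female", "male", "female", "male", "female", "male"),
  probability = c(0.98, 0.95, 0.98, 0.98, 0.97, 0.99),
  stringsAsFactors = FALSE
)

# Hypothetical cut-off, chosen for illustration only
min_probability <- 0.97

# Keep the estimate when it reaches the cut-off, otherwise mark it as missing
gender_df$gender_confident <- ifelse(
  gender_df$probability >= min_probability,
  gender_df$gender,
  NA_character_
)

gender_df
```

Which cut-off is appropriate depends on the use case; a stricter value trades coverage for confidence.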