tidyfinance
Working with dates can be cumbersome, in particular when combining
data sets from different contexts. When designing the output tables for
the download_data()
function we faced a trade-off between
(i) preserving original column names that experienced users might know
and for which the original sources provide documentation, and (ii)
introducing consistent columns names across data sets where necessary.
We decided to consistently use the date
column with
<date>
column type across various data sets to
facilitate joins and avoid unnecessary conversions. In this vignette, we
walk you through some examples.
Let us start with the popular Fama-French factors that come in daily,
weekly, and monthly versions. After downloading the monthly data, you
can see that each row refers to a specific month of a year and the
date
column refers to the first day of a month.
factors_ff_3_monthly <- download_data("factors_ff_3_monthly")
#> No `start_date` or `end_date` provided. Returning the full data set.
factors_ff_3_monthly
#> # A tibble: 1,177 × 5
#> date mkt_excess smb hml risk_free
#> <date> <dbl> <dbl> <dbl> <dbl>
#> 1 1926-07-01 0.0296 -0.0256 -0.0243 0.0022
#> 2 1926-08-01 0.0264 -0.0117 0.0382 0.0025
#> 3 1926-09-01 0.0036 -0.014 0.0013 0.0023
#> 4 1926-10-01 -0.0324 -0.0009 0.007 0.0032
#> 5 1926-11-01 0.0253 -0.001 -0.0051 0.0031
#> 6 1926-12-01 0.0262 -0.0003 -0.0005 0.0028
#> 7 1927-01-01 -0.0006 -0.0037 0.0454 0.0025
#> 8 1927-02-01 0.0418 0.0004 0.0294 0.0026
#> 9 1927-03-01 0.0013 -0.0165 -0.0261 0.003
#> 10 1927-04-01 0.0046 0.003 0.0081 0.0025
#> # ℹ 1,167 more rows
Why did we use the first day of a month here? This convention has the big advantage that you can easily add or subtract months or compute date differences. For instance, in the following code chunk, we subtract 3 months from the original date and then compute the difference in months:
factors_ff_3_monthly |>
select(date) |>
mutate(date_lag3 = date %m-% months(3),
date_difference = interval(date_lag3, date) %/% months(1))
#> # A tibble: 1,177 × 3
#> date date_lag3 date_difference
#> <date> <date> <dbl>
#> 1 1926-07-01 1926-04-01 3
#> 2 1926-08-01 1926-05-01 3
#> 3 1926-09-01 1926-06-01 3
#> 4 1926-10-01 1926-07-01 3
#> 5 1926-11-01 1926-08-01 3
#> 6 1926-12-01 1926-09-01 3
#> 7 1927-01-01 1926-10-01 3
#> 8 1927-02-01 1926-11-01 3
#> 9 1927-03-01 1926-12-01 3
#> 10 1927-04-01 1927-01-01 3
#> # ℹ 1,167 more rows
As another example, the date
column is arguably straight
forward for daily data, such as daily Fama-French factors:
factors_ff_3_daily <- download_data("factors_ff_3_daily")
#> No `start_date` or `end_date` provided. Returning the full data set.
factors_ff_3_daily
#> # A tibble: 25,795 × 5
#> date mkt_excess smb hml risk_free
#> <date> <dbl> <dbl> <dbl> <dbl>
#> 1 1926-07-01 0.001 -0.0025 -0.0027 0.00009
#> 2 1926-07-02 0.0045 -0.0033 -0.0006 0.00009
#> 3 1926-07-06 0.0017 0.003 -0.0039 0.00009
#> 4 1926-07-07 0.0009 -0.0058 0.0002 0.00009
#> 5 1926-07-08 0.0021 -0.0038 0.0019 0.00009
#> 6 1926-07-09 -0.0071 0.0043 0.0057 0.00009
#> 7 1926-07-10 0.0062 -0.0053 -0.001 0.00009
#> 8 1926-07-12 0.0004 -0.0003 0.0064 0.00009
#> 9 1926-07-13 0.0048 -0.0028 -0.002 0.00009
#> 10 1926-07-14 0.0004 0.0007 -0.0043 0.00009
#> # ℹ 25,785 more rows
Each row refers to a trading date, so date
refers to the
actual date. So computing date differences to the last row might yield
an integer larger than 1.
factors_ff_3_daily |>
select(date) |>
mutate(date_difference = interval(lag(date), date) %/% days(1))
#> # A tibble: 25,795 × 2
#> date date_difference
#> <date> <dbl>
#> 1 1926-07-01 NA
#> 2 1926-07-02 1
#> 3 1926-07-06 4
#> 4 1926-07-07 1
#> 5 1926-07-08 1
#> 6 1926-07-09 1
#> 7 1926-07-10 1
#> 8 1926-07-12 2
#> 9 1926-07-13 1
#> 10 1926-07-14 1
#> # ℹ 25,785 more rows
Similarly, working with daily CRSP data is straight-forward:
However, enforcing this consistency poses a challenge for monthly
CRSP data. As CRSP calculates the monthly returns based on the last
available price for each month, we decided to use date
following our convention of beginning-of-month and use
calculation_date
as the date that CRSP uses to calculate
the returns. In fact, the CRSP 2.0 update introduced the column
mthcaldt
instead of date
, hence our name
calculation_date
. For most applications, you will not need
calculation_date
, so it serves primarily an informational
purpose.
crsp_monthly <- download_data("wrds_crsp_monthly")
crsp_monthly |>
select(permno, date, calculation_date)
A similar challenge arises with Compustat data. The
datadate
column of the original Compustat tables refers to
the date of the fiscal year end or quarter. For many cases, this date is
the December 31st of a fiscal year, but they can be any end of month
throughout a calendar year. To comply with our notion of consistency, we
introduce the column date
, which is simply the beginning of
the month of each datadate
:
compustat_annual <- download_data("wrds_compustat_annual")
compustat_annual |>
select(gvkey, date, datadate)
If you want to use year or month information in your application, you
can simply parse them from date
:
compustat_annual |>
mutate(year = year(date),
month = month(date)) |>
select(gvkey, date, datadate, year, month)
For completeness, you can do the same for quarterly Compustat data:
compustat_quarterly <- download_data("wrds_compustat_quarterly")
compustat_quarterly |>
select(gvkey, date, datadate) |>
mutate(year = year(date),
month = month(date),
quarter = quarter(date))
As a first example, let us combine the Fama-French factors with monthly CRSP returns. We can simply run:
crsp_monthly |>
left_join(factors_ff_3_monthly, join_by(date)) |>
select(permno, date, risk_free, mkt_excess, smb, hml)
Similarly, the same logic applies to daily CRSP returns:
crsp_daily |>
left_join(factors_ff_3_daily, join_by(date)) |>
select(permno, date, risk_free, mkt_excess, smb, hml)
Of course, be careful not to join data sets from different contexts, e.g., joining daily Fama-French data to monthly CRSP data. However, the worst that can happen is that you get meaningless or missing rows.
Now, let us move to the involved case of joining CRSP with Compustat data. To achieve that, we have to first load the CCM links and join them to the monthly CRSP data (the same logic applies to daily CRSP data).
ccm_links <- download_data("wrds_ccm_links")
crsp_monthly <- crsp_monthly |>
left_join(ccm_links, join_by(permno), relationship = "many-to-many") |>
filter(between(date, linkdt, linkenddt)) |>
select(-c(linkdt, linkenddt))
So if you want to join information from Compustat for the
same date, you just join by gvkey
and
date
:
If you want to lag the Compustat information by 6 months to
incorporate the fact that there is usually a considerable lag between
fiscal year ends and the release of accounting reports, you can just add
months to the date
column in Compustat:
crsp_monthly |>
left_join(
compustat_annual |>
mutate(date = date %m+% months(6)),
join_by(gvkey, date)
)
If you prefer to follow the Fama-French protocol for joining company fundamentals to stock returns (see our chapter on this procedure here), you should do: