%\VignetteEngine{knitr::knitr} %\VignetteIndexEntry{Buy 'Til You Die - A Walkthrough} \documentclass[10pt, letterpaper, onecolumn, oneside, final]{article} \usepackage{microtype} \usepackage[pdftex]{hyperref} \hypersetup{colorlinks, citecolor=black, linkcolor=black, urlcolor=blue} % define author and title % setlength{\parindent}{0pt} % setlength{\parskip}{1ex plus 0.2ex minus 0.2ex} \author{Daniel McCarthy, Edward Wadsworth} \title{Buy 'Til You Die - A Walkthrough} \date{November, 2014} \pagestyle{plain} %consider 'headings' \begin{document} <>= library(knitr) opts_chunk$set( concordance=TRUE ) @ \maketitle %generate the title %\tableofcontents %insert the table of contents \section{Version 2.4.3 Overview} This version patches the Pareto/NBD component of BTYD 2.4 using the fix that Theo Strinopoulos proposed \href{https://github.com/theofilos/BTYD}{here}. Everything below this section reads identically to the vignette of BTYD 2.4. The main difference is that all evaluated references to BTYD file paths, libraries, etc. now point to their patched counterparts. As of version 2.4.1 the patched BTYD functions have been modified as follows: \begin{itemize} \item Some functions defined in \texttt{R/pnbd.R} and in \texttt{R/bgnbd.R} (you can tell them by their names, which start with \texttt{pnbd} and \texttt{bgnbd} respectively) have been changed. \item Some of these functions now take an extra logical argument, \texttt{hardie}; if TRUE, this function (or a function it calls) uses Bruce Hardie's algorithm for estimating the real part of the Gaussian hypergeometric function (see MATLAB code on page 4 of \href{http://brucehardie.com/notes/008/pareto_nbd_MATLAB.pdf}{Fader et al. (2005)}); if FALSE, it makes use of the \href{https://cran.r-project.org/web/packages/hypergeo/index.html}{hypergeo} R package from CRAN. For the purposes of this vignette, this parameter is set globally as \texttt{allHardie = TRUE}. This mirrors the choice of the BTYD package authors, who use Bruce Hardie's algorithm everywhere. \item Instead of \texttt{base::optim}, the \texttt{pnbd} functions use \texttt{optimx::optimx} and they now allow you to pick your optimization method, using the \texttt{method} argument, which defaults to \texttt{L-BFGS-B} without any constraints, as in the original package. \item Some of the functions defined in \texttt{R/bgnbd.R} have been changed in order to implement the fix for the \texttt{NUM!} error problem proposed in \href{http://www.brucehardie.com/notes/027/bgnbd_num_error.pdf}{this note}. \end{itemize} As of version 2.4.2 the \texttt{hardie} argument defaults to TRUE with the exception of \texttt{bgnbd.generalParams} where it defaults to NULL. Version 2.4.3 adds a fix to the \texttt{B1B2} component of the function \texttt{pnbd.pmf.General} defined in \texttt{R/pnbd.R} for a bug discovered by Patrik Schilter. The fix now specifies correctly the third parameter of the B2 function described by equation (24) \href{https://www.brucehardie.com/notes/013/Pareto_NBD_interval_pmf_rev.pdf}{here} as $r + s + x + 1$. The previous, incorrect version added the $i$ component from the first term. \section{Introduction} The BTYD package contains models to capture non-contractual purchasing behavior of customers---or, more simply, models that tell the story of people buying until they die (become inactive as customers). 
The main models presented in the package are the Pareto/NBD, BG/NBD and BG/BB models, which describe the scenario in which the firm cannot observe the exact time at which a customer drops out. We will cover each in turn. If you are unfamiliar with these models, \href{http://brucehardie.com/papers/018/fader_et_al_mksc_05.pdf}{Fader, Hardie, and Lee (2005)} provides a description of the BG/NBD model, \href{http://www.brucehardie.com/notes/008/pareto_nbd_MATLAB.pdf}{Fader et al. (2005)} provides a description of the Pareto/NBD model and \href{http://www.brucehardie.com/papers/020/}{Fader et al. (2010)} provides a description of the BG/BB model. \section{Pareto/NBD} The Pareto/NBD model is used for non-contractual situations in which customers can make purchases at any time. Using four parameters, it describes the rate at which customers make purchases and the rate at which they drop out---allowing for heterogeneity in both regards, of course. We will walk through the Pareto/NBD functionality provided by the BTYD package using the CDNOW\footnote{Provided with the BTYD package and available at \href{http://www.brucehardie.com}{brucehardie.com}. For more details, see the documentation of the cdnowSummary data included in the package.} dataset. As shown by figure \ref{fig:pnbdCalibrationFit}, the Pareto/NBD model describes this dataset quite well.

<<>>=
library(knitr)
opts_chunk$set(comment="#")
library(BTYD)
# Set the hardie parameter value here, apply it everywhere
allHardie <- TRUE
data(cdnowSummary)
est.params <- cdnowSummary$est.params
cal.cbs <- cdnowSummary$cbs
pdf(file = 'pnbdCalibrationFit.pdf')
cal.fit <- pnbd.PlotFrequencyInCalibration(params = est.params,
                                           cal.cbs = cal.cbs,
                                           censor = 7,
                                           hardie = allHardie)
dev.off()
@

\begin{figure} \begin{center} \includegraphics[width=0.6\textwidth]{pnbdCalibrationFit} \caption{Calibration period fit of Pareto/NBD model to CDNOW dataset.}\label{fig:pnbdCalibrationFit} \end{center} \end{figure} \subsection{Data Preparation} The data required to estimate Pareto/NBD model parameters is surprisingly little. The customer-by-customer approach of the model is retained, but we need only three pieces of information for every person: how many transactions they made in the calibration period (frequency), the time of their last transaction (recency), and the total time for which they were observed. A customer-by-sufficient-statistic matrix, as used by the BTYD package, is simply a matrix with a row for every customer and a column for each of the above-mentioned statistics. You may find yourself with the data available as an event log. This is a data structure which contains a row for every transaction, with a customer identifier, date of purchase, and (optionally) the amount of the transaction. \texttt{dc.ReadLines} is a function to convert an event log in a comma-delimited file to a data frame in R---you could use another function such as \texttt{read.csv} or \texttt{read.table} if desired, but \texttt{dc.ReadLines} simplifies things by only reading the data you require and giving the output appropriate column names. In the example below, we create an event log from the file ``cdnowElog.csv'', which has customer IDs in the second column, dates in the third column and sales numbers in the fifth column.

<<>>=
cdnowElog <- system.file("data/cdnowElog.csv", package = "BTYD")
elog <- dc.ReadLines(cdnowElog, cust.idx = 2, date.idx = 3, sales.idx = 5)
elog[1:3,]
@

Note the formatting of the dates in the output above.
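As an aside, it can be worth a quick sanity check that the column indices passed to \texttt{dc.ReadLines} picked up the fields you intended. A minimal sketch, not evaluated as part of this vignette:

<<eval = FALSE>>=
# Basic checks on the imported event log (a sketch).
nrow(elog)                  # number of transaction records read
length(unique(elog$cust))   # number of distinct customers
head(elog$sales)            # sales figures should look like amounts, not dates
@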
\texttt{dc.Readlines} saves dates as characters, exactly as they appeared in the original comma-delimited file. For many of the data-conversion functions in BTYD, however, dates need to be compared to each other---and unless your years/months/days happen to be in the right order, you probably want them to be sorted chronologically and not alphabetically. Therefore, we convert the dates in the event log to R Date objects: <>= elog$date <- as.Date(elog$date, "%Y%m%d") elog[1:3,] @ Our event log now has dates in the right format, but a bit more cleaning needs to be done. Transaction-flow models, such as the Pareto/NBD, is concerned with interpurchase time. Since our timing information is only accurate to the day, we should merge all transactions that occurred on the same day. For this, we use \texttt{dc.MergeTransactionsOnSameDate}. This function returns an event log with only one transaction per customer per day, with the total sum of their spending for that day as the sales number. <>= elog <- dc.MergeTransactionsOnSameDate(elog) @ To validate that the model works, we need to divide the data up into a calibration period and a holdout period. This is relatively simple with either an event log or a customer-by-time matrix, which we are going to create soon. I am going to use 30 September 1997 as the cutoff date, as this point (39 weeks) divides the dataset in half. The reason for doing this split now will become evident when we are building a customer-by-sufficient-statistic matrix from the customer-by-time matrix---it requires a last transaction date, and we want to make sure that last transaction date is the last date in the calibration period and not in the total period. <>= end.of.cal.period <- as.Date("1997-09-30") elog.cal <- elog[which(elog$date <= end.of.cal.period), ] @ The final cleanup step is a very important one. In the calibration period, the Pareto/NBD model is generally concerned with \emph{repeat} transactions---that is, the first transaction is ignored. This is convenient for firms using the model in practice, since it is easier to keep track of all customers who have made at least one transaction (as opposed to trying to account for those who have not made any transactions at all). The one problem with simply getting rid of customers' first transactions is the following: We have to keep track of a ``time zero'' as a point of reference for recency and total time observed. For this reason, we use \texttt{dc.SplitUpElogForRepeatTrans}, which returns a filtered event log (\texttt{\$repeat.trans.elog}) as well as saving important information about each customer (\texttt{\$cust.data}). <>= split.data <- dc.SplitUpElogForRepeatTrans(elog.cal) clean.elog <- split.data$repeat.trans.elog @ The next step is to create a customer-by-time matrix. This is simply a matrix with a row for each customer and a column for each date. There are several different options for creating these matrices: \begin{itemize} \item[-] Frequency---each matrix entry will contain the number of transactions made by that customer on that day. Use \texttt{dc.CreateFreqCBT}. If you have already used \texttt{dc.MergeTransactionsOnSameDate}, this will simply be a reach customer-by-time matrix. \item[-] Reach---each matrix entry will contain a 1 if the customer made any transactions on that day, and 0 otherwise. Use \texttt{dc.CreateReachCBT}. \item[-] Spend---each matrix entry will contain the amount spent by that customer on that day. Use \texttt{dc.CreateSpendCBT}. 
You can set whether to use total spend for each day or average spend for each day by changing the \texttt{is.avg.spend} parameter. In most cases, leaving \texttt{is.avg.spend} as \texttt{FALSE} is appropriate. \end{itemize} <>= freq.cbt <- dc.CreateFreqCBT(clean.elog) freq.cbt[1:3,1:5] @ There are two things to note from the output above: \begin{enumerate} \item Customers 3, 4, and 5 appear to be missing, and in fact they are. They did not make any repeat transactions and were removed when we used \texttt{dc.SplitUpElogForRepeatTrans}. Do not worry though---we will get them back into the data when we use \texttt{dc.MergeCustomers} (soon). \item The columns start on January 8---this is because we removed all the first transactions (and nobody made 2 transactions withing the first week). \end{enumerate} We have a small problem now---since we have deleted all the first transactions, the frequency customer-by-time matrix does not have any of the customers who made zero repeat transactions. These customers are still important; in fact, in most datasets, more customers make zero repeat transactions than any other number. Solving the problem is reasonably simple: we create a customer-by-time matrix using all transactions, and then merge the filtered CBT with this total CBT (using data from the filtered CBT and customer IDs from the total CBT). <>= tot.cbt <- dc.CreateFreqCBT(elog) cal.cbt <- dc.MergeCustomers(tot.cbt, freq.cbt) @ From the calibration period customer-by-time matrix (and a bit of additional information we saved earlier), we can finally create the customer-by-sufficient-statistic matrix described earlier. The function we are going to use is \texttt{dc.BuildCBSFromCBTAndDates}, which requires a customer-by-time matrix, starting and ending dates for each customer, and the time of the end of the calibration period. It also requires a time period to use---in this case, we are choosing to use weeks. If we chose days instead, for example, the recency and total time observed columns in the customer-by-sufficient-statistic matrix would have gone up to 273 instead of 39, but it would be the same data in the end. This function could also be used for the holdout period by setting \texttt{cbt.is.during.cal.period} to \texttt{FALSE}. There are slight differences when this function is used for the holdout period---it requires different input dates (simply the start and end of the holdout period) and does not return a recency (which has little value in the holdout period). <>= birth.periods <- split.data$cust.data$birth.per last.dates <- split.data$cust.data$last.date cal.cbs.dates <- data.frame(birth.periods, last.dates, end.of.cal.period) cal.cbs <- dc.BuildCBSFromCBTAndDates(cal.cbt, cal.cbs.dates, per="week") @ You'll be glad to hear that, for the process described above, the package contains a single function to do everything for you: \texttt{dc.ElogToCbsCbt}. However, it is good to be aware of the process above, as you might want to make small changes for different situations---for example, you may not want to remove customers' initial transactions, or your data may be available as a customer-by-time matrix and not as an event log. For most standard situations, however, \texttt{dc.ElogToCbsCbt} will do. Reading about its parameters and output in the package documentation will help you to understand the function well enough to use it for most purposes. \subsection{Parameter Estimation} Now that we have the data in the correct format, we can estimate model parameters. 
To estimate parameters, we use \texttt{pnbd.EstimateParameters}, which requires a calibration period customer-by-sufficient-statistic matrix and (optionally) starting parameters. (1,1,1,1) is used as default starting parameters if none are provided. The function which is maximized is \texttt{pnbd.cbs.LL}, which returns the log-likelihood of a given set of parameters for a customer-by-sufficient-statistic matrix. <>= params <- pnbd.EstimateParameters(cal.cbs = cal.cbs, hardie = allHardie) round(params, digits = 3) LL <- pnbd.cbs.LL(params = params, cal.cbs = cal.cbs, hardie = allHardie) LL @ As with any optimization, we should not be satisfied with the first output we get. Let's run it a couple more times, with its own output as a starting point, to see if it converges: <<>>= p.matrix <- c(params, LL) for (i in 1:2){ params <- pnbd.EstimateParameters(cal.cbs = cal.cbs, par.start = params, hardie = allHardie) LL <- pnbd.cbs.LL(params = params, cal.cbs = cal.cbs, hardie = allHardie) p.matrix.row <- c(params, LL) p.matrix <- rbind(p.matrix, p.matrix.row) } colnames(p.matrix) <- c("r", "alpha", "s", "beta", "LL") rownames(p.matrix) <- 1:3 round(p.matrix, digits = 3) @ This estimation does converge. I am not going to demonstrate it here, but it is always a good idea to test the estimation function from several starting points. Now that we have the parameters, the BTYD package provides functions to interpret them. As we know, r and alpha describe the gamma mixing distribution of the NBD transaction process. We can see this gamma distribution in figure \ref{fig:pnbdTransactionHeterogeneity}, plotted using \texttt{pnbd.PlotTransactionRateHeterogeneity(params)}. We also know that s and beta describe the gamma mixing distribution of the Pareto (or gamma exponential) dropout process. We can see this gamma distribution in figure \ref{fig:pnbdDropoutHeterogeneity}, plotted using \texttt{pnbd.PlotDropoutRateHeterogeneity(params)}. From these, we can already tell something about our customers: they are more likely to have low values for their individual poisson transaction process parameters, and more likely to have low values for their individual exponential dropout process parameters. <>= pdf(file = 'pnbdTransactionHeterogeneity.pdf') pnbd.PlotTransactionRateHeterogeneity(params = params) dev.off() @ \begin{figure} \begin{center} \includegraphics[width=0.6\textwidth]{pnbdTransactionHeterogeneity} \caption{Transaction rate heterogeneity of estimated parameters.}\label{fig:pnbdTransactionHeterogeneity} \end{center} \end{figure} <>= pdf(file = 'pnbdDropoutHeterogeneity.pdf') pnbd.PlotDropoutRateHeterogeneity(params = params) dev.off() @ \begin{figure} \begin{center} \includegraphics[width=0.6\textwidth]{pnbdDropoutHeterogeneity} \caption{Dropout rate heterogeneity of estimated parameters.}\label{fig:pnbdDropoutHeterogeneity} \end{center} \end{figure} \subsection{Individual Level Estimations} Now that we have parameters for the population, we can make estimations for customers on the individual level. First, we can estimate the number of transactions we expect a newly acquired customer to make in a given time period. Let's say, for example, that we are interested in the number of repeat transactions a newly acquired customer will make in a time period of one year. Note that we use 52 weeks to represent one year, not 12 months, 365 days, or 1 year. This is because our parameters were estimated using weekly data. 
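For reference, the quantity returned by \texttt{pnbd.Expectation} in the next chunk is the Pareto/NBD expected number of repeat transactions in a period of length $t$ for a randomly chosen, newly acquired customer (see the Pareto/NBD note by Fader and Hardie cited in the references), namely, for $s \neq 1$,
\[
E[X(t)] = \frac{r\beta}{\alpha(s-1)}\left[1 - \left(\frac{\beta}{\beta + t}\right)^{s-1}\right],
\]
evaluated at the estimated $(r, \alpha, s, \beta)$ with $t = 52$ weeks.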
<<>>=
pnbd.Expectation(params = params, t = 52)
@

We can also obtain expected characteristics for a specific customer, conditional on their purchasing behavior during the calibration period. The first of these is \texttt{pnbd.ConditionalExpectedTransactions}, which gives the number of transactions we expect a customer to make in the holdout period. The second is \texttt{pnbd.PAlive}, which gives the probability that a customer is still alive at the end of the calibration period. As above, the time periods used depend on which time period was used to estimate the parameters.

<<>>=
cal.cbs["1516",]
x <- cal.cbs["1516", "x"]
t.x <- cal.cbs["1516", "t.x"]
T.cal <- cal.cbs["1516", "T.cal"]
pnbd.ConditionalExpectedTransactions(params, T.star = 52,
                                     x, t.x, T.cal,
                                     hardie = allHardie)
pnbd.PAlive(params, x, t.x, T.cal, hardie = allHardie)
@

There is one more point to note here---using the conditional expectation function, we can see the ``increasing frequency paradox'' in action:

<<>>=
# avoid overflow in LaTeX code block here:
cet <- "pnbd.ConditionalExpectedTransactions"
for (i in seq(10, 25, 5)){
  cond.expectation <- match.fun(cet)(params, T.star = 52,
                                     x = i, t.x = 20,
                                     T.cal = 39, hardie = allHardie)
  cat("x:", i, "\t Expectation:", cond.expectation, fill = TRUE)
}
@

\subsection{Plotting/ Goodness-of-fit} \label{subsec:pnbdPlotting} We would like to be able to do more than make inferences about individual customers. The BTYD package provides functions to plot expected customer behavior against actual customer behavior in both the calibration and holdout periods. The first such function is the obvious starting point: a comparison of actual and expected frequencies within the calibration period. This is figure \ref{fig:pnbdCalibrationFit}, which was generated using the following code:

<<>>=
pnbd.PlotFrequencyInCalibration(params = params,
                                cal.cbs = cal.cbs,
                                censor = 7,
                                hardie = allHardie)
@

This function obviously needs to be able to generate expected data (from estimated parameters) and requires the actual data (the calibration period customer-by-sufficient-statistic). It also requires another number, called the censor number. The histogram that is plotted is right-censored; after a certain number, all frequencies are binned together. The number provided as a censor number determines where the data is cut off. Unfortunately, the only thing we can tell from comparing calibration period frequencies is that the fit between our model and the data isn't awful. We need to verify that the fit of the model holds into the holdout period. Firstly, however, we are going to need to get information for the holdout period. \texttt{dc.ElogToCbsCbt} produces both a calibration period customer-by-sufficient-statistic matrix and a holdout period customer-by-sufficient-statistic matrix, which could be combined in order to find the number of transactions each customer made in the holdout period. However, since we did not use \texttt{dc.ElogToCbsCbt}, I am going to get the information directly from the event log. Note that I subtract the number of repeat transactions in the calibration period from the total number of transactions. We remove the initial transactions first as we are not concerned with them.
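The chunk below computes the holdout frequencies with an explicit loop over customers. The same quantity can be obtained more compactly with \texttt{table}; here is a sketch under the same assumptions (the filtered repeat-transaction event log and the \texttt{cal.cbs} built above), not evaluated here:

<<eval = FALSE>>=
# Vectorized alternative to the loop below: count all repeat transactions
# per customer, then subtract the calibration-period repeat frequency.
repeat.counts <- table(elog$cust)
x.star <- as.numeric(repeat.counts[rownames(cal.cbs)]) - cal.cbs[, "x"]
x.star[is.na(x.star)] <- 0   # customers with no repeat transactions at all
@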
<>= elog <- dc.SplitUpElogForRepeatTrans(elog)$repeat.trans.elog x.star <- rep(0, nrow(cal.cbs)) cal.cbs <- cbind(cal.cbs, x.star) elog.custs <- elog$cust for (i in 1:nrow(cal.cbs)){ current.cust <- rownames(cal.cbs)[i] tot.cust.trans <- length(which(elog.custs == current.cust)) cal.trans <- cal.cbs[i, "x"] cal.cbs[i, "x.star"] <- tot.cust.trans - cal.trans } round(cal.cbs[1:3,], digits = 3) @ Now we can see how well our model does in the holdout period. Figure \ref{fig:pnbdCondExpComp} shows the output produced by the code below. It divides the customers up into bins according to calibration period frequencies and plots actual and conditional expected holdout period frequencies for these bins. <>= T.star <- 39 # length of the holdout period censor <- 7 # This censor serves the same purpose described above x.star <- cal.cbs[,"x.star"] pdf(file = 'pnbdCondExpComp.pdf') comp <- pnbd.PlotFreqVsConditionalExpectedFrequency(params, T.star, cal.cbs, x.star, censor, hardie = allHardie) dev.off() rownames(comp) <- c("act", "exp", "bin") round(comp, digits = 3) @ \begin{figure} \begin{center} \includegraphics[width=0.6\textwidth]{pnbdCondExpComp} \caption{Actual vs. conditional expected transactions in the holdout period.}\label{fig:pnbdCondExpComp} \end{center} \end{figure} As you can see above, the graph also produces a matrix output. Most plotting functions in the BTYD package produce output like this. They are often worth looking at because they contain additional information not presented in the graph---the size of each bin in the graph. In this graph, for example, this information is important because the bin sizes show that the gap at zero means a lot more than the precision at 6 or 7 transactions. Despite this, this graph shows that the model fits the data very well in the holdout period. Aggregation by calibration period frequency is just one way to do it. BTYD also provides plotting functions which aggregate by several other measures. The other one I will demonstrate here is aggregation by time---how well does our model predict how many transactions will occur in each week? The first step, once again, is going to be to collect the data we need to compare the model to. The customer-by-time matrix has already collected the data for us by time period; so we'll use that to gather the total transactions per day. Then we convert the daily tracking data to weekly data. <<>>= tot.cbt <- dc.CreateFreqCBT(elog) d.track.data <- rep(0, 7 * 78) origin <- as.Date("1997-01-01") for (i in colnames(tot.cbt)){ date.index <- difftime(as.Date(i), origin) + 1 d.track.data[date.index] <- sum(tot.cbt[,i]) } w.track.data <- rep(0, 78) for (j in 1:78){ w.track.data[j] <- sum(d.track.data[(j*7-6):(j*7)]) } @ Now, we can make a plot comparing the actual number of transactions to the expected number of transactions on a weekly basis, as shown in figure \ref{fig:pnbdTrackingInc}. Note that we set \texttt{n.periods.final} to 78. This is to show that we are working with weekly data. If our tracking data was daily, we would use 546 here---the function would plot our daily tracking data against expected daily transactions, instead of plotting our weekly tracking data against expected weekly transactions. This concept may be a bit tricky, but is explained in the documentation for \texttt{pnbd.PlotTrackingInc}. The reason there are two numbers for the total period (\texttt{T.tot} and \texttt{n.periods.final}) is that your customer-by-sufficient-statistic matrix and your tracking data may be in different time periods. 
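To make that distinction concrete: the chunk below passes weekly tracking data, so \texttt{n.periods.final} is 78. Had we kept the daily vector \texttt{d.track.data} instead, only \texttt{n.periods.final} would change. A sketch, not evaluated here:

<<eval = FALSE>>=
# Daily variant of the tracking plot below (a sketch). T.tot stays at 78
# because the model parameters are in weeks; n.periods.final becomes the
# number of daily bins in the tracking data.
inc.tracking.daily <- pnbd.PlotTrackingInc(params = params, T.cal = T.cal,
                                           T.tot = 78,
                                           actual.inc.tracking.data = d.track.data,
                                           n.periods.final = 546)
@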
<>= T.cal <- cal.cbs[,"T.cal"] T.tot <- 78 n.periods.final <- 78 pdf(file = 'pnbdTrackingInc.pdf') inc.tracking <- pnbd.PlotTrackingInc(params = params, T.cal = T.cal, T.tot = T.tot, actual.inc.tracking.data = w.track.data, n.periods.final = n.periods.final) dev.off() round(inc.tracking[,20:25], digits = 3) @ \begin{figure} \begin{center} \includegraphics[width=0.6\textwidth]{pnbdTrackingInc} \caption{Actual vs. expected incremental purchasing behaviour.}\label{fig:pnbdTrackingInc} \end{center} \end{figure} Although figure \ref{fig:pnbdTrackingInc} shows that the model is definitely capturing the trend of customer purchases over time, it is very messy and may not convince skeptics. Furthermore, the matrix, of which a sample is shown, does not really convey much information since purchases can vary so much from one week to the next. For these reasons, we may need to smooth the data out by cummulating it over time, as shown in figure \ref{fig:pnbdTrackingCum}. <>= cum.tracking.data <- cumsum(w.track.data) pdf(file = 'pnbdTrackingCum.pdf') cum.tracking <- pnbd.PlotTrackingCum(params = params, T.cal = T.cal, T.tot = T.tot, actual.cu.tracking.data = cum.tracking.data, n.periods.final = n.periods.final) dev.off() round(cum.tracking[,20:25], digits = 3) @ \begin{figure} \begin{center} \includegraphics[width=0.6\textwidth]{pnbdTrackingCum} \caption{Actual vs. expected cumulative purchasing behaviour.}\label{fig:pnbdTrackingCum} \end{center} \end{figure} \section{BG/NBD} The BG/NBD model, like the Pareto/NBD model, is used for non-contractual situations in which customers can make purchases at any time. It describes the rate at which customers make purchases and the rate at which they drop out with four parameters---allowing for heterogeneity in both. We will walk through the BG/NBD functions provided by the BTYD package using the CDNOW\footnote{Provided with the BTYD package and available at \href{http://www.brucehardie.com}{brucehardie.com}. For more details, see the documentation of the cdnowSummary data included in the package.} dataset. As shown by figure \ref{fig:bgnbdCalibrationFit}, the BG/NBD model describes this dataset quite well. <>= data(cdnowSummary); est.params <- c(0.243, 4.414, 0.793, 2.426); cal.cbs <- cdnowSummary$cbs; pdf(file = 'bgnbdCalibrationFit.pdf') cal.fit <- bgnbd.PlotFrequencyInCalibration(est.params, cal.cbs, 7) dev.off() @ \begin{figure} \begin{center} \includegraphics[width=0.6\textwidth]{bgnbdCalibrationFit} \caption{Calibration period fit of BG/NBD model to CDNOW dataset.}\label{fig:bgnbdCalibrationFit} \end{center} \end{figure} \subsection{Data Preparation} The data required to estimate BG/NBD model parameters is surprisingly little. The customer-by-customer approach of the model is retained, but we need only three pieces of information for every person: how many transactions they made in the calibration period (frequency), the time of their last transaction (recency), and the total time for which they were observed. This is the same as what is needed for the Pareto/NBD model. Indeed, if you have read the data preparation section for the Pareto/NBD model, you can safely skip over this section and move to the section on Parameter Estimation. A customer-by-sufficient-statistic matrix, as used by the BTYD package, is simply a matrix with a row for every customer and a column for each of the above-mentioned statistics. You may find yourself with the data available as an event log. 
This is a data structure which contains a row for every transaction, with a customer identifier, date of purchase, and (optionally) the amount of the transaction. \texttt{dc.ReadLines} is a function to convert an event log in a comma-delimited file to an data frame in R---you could use another function such as \texttt{read.csv} or \texttt{read.table} if desired, but \texttt{dc.ReadLines} simplifies things by only reading the data you require and giving the output appropriate column names. In the example below, we create an event log from the file ``cdnowElog.csv'', which has customer IDs in the second column, dates in the third column and sales numbers in the fifth column. <>= cdnowElog <- system.file("data/cdnowElog.csv", package = "BTYD") elog <- dc.ReadLines(cdnowElog, cust.idx = 2, date.idx = 3, sales.idx = 5) elog[1:3,] @ Note the formatting of the dates in the output above. \texttt{dc.Readlines} saves dates as characters, exactly as they appeared in the original comma-delimited file. For many of the data-conversion functions in BTYD, however, dates need to be compared to each other---and unless your years/months/days happen to be in the right order, you probably want them to be sorted chronologically and not alphabetically. Therefore, we convert the dates in the event log to R Date objects: <>= elog$date <- as.Date(elog$date, "%Y%m%d"); elog[1:3,] @ Our event log now has dates in the right format, but a bit more cleaning needs to be done. Transaction-flow models, such as the BG/NBD, is concerned with inter-purchase time. Since our timing information is only accurate to the day, we should merge all transactions that occurred on the same day. For this, we use \texttt{dc.MergeTransactionsOnSameDate}. This function returns an event log with only one transaction per customer per day, with the total sum of their spending for that day as the sales number. <>= elog <- dc.MergeTransactionsOnSameDate(elog); @ To validate that the model works, we need to divide the data up into a calibration period and a holdout period. This is relatively simple with either an event log or a customer-by-time matrix, which we are going to create soon. I am going to use 30 September 1997 as the cutoff date, as this point (39 weeks) divides the dataset in half. The reason for doing this split now will become evident when we are building a customer-by-sufficient-statistic matrix from the customer-by-time matrix---it requires a last transaction date, and we want to make sure that last transaction date is the last date in the calibration period and not in the total period. <>= end.of.cal.period <- as.Date("1997-09-30") elog.cal <- elog[which(elog$date <= end.of.cal.period), ] @ The final cleanup step is a very important one. In the calibration period, the BG/NBD model is generally concerned with \emph{repeat} transactions---that is, the first transaction is ignored. This is convenient for firms using the model in practice, since it is easier to keep track of all customers who have made at least one transaction (as opposed to trying to account for those who have not made any transactions at all). The one problem with simply getting rid of customers' first transactions is the following: We have to keep track of a ``time zero'' as a point of reference for recency and total time observed. For this reason, we use \texttt{dc.SplitUpElogForRepeatTrans}, which returns a filtered event log (\texttt{\$repeat.trans.elog}) as well as saving important information about each customer (\texttt{\$cust.data}). 
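The chunk below stores this output in \texttt{split.data}. Its \texttt{\$cust.data} component records, among other things, each customer's first (``birth'') and last calibration period transaction dates, which we will need when building the customer-by-sufficient-statistic matrix. A quick way to inspect it (a sketch, not evaluated):

<<eval = FALSE>>=
# Per-customer information saved by dc.SplitUpElogForRepeatTrans
# (split.data is created in the next chunk).
head(split.data$cust.data)
@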
<>= split.data <- dc.SplitUpElogForRepeatTrans(elog.cal); clean.elog <- split.data$repeat.trans.elog; @ The next step is to create a customer-by-time matrix. This is simply a matrix with a row for each customer and a column for each date. There are several different options for creating these matrices: \begin{itemize} \item[-] Frequency---each matrix entry will contain the number of transactions made by that customer on that day. Use \texttt{dc.CreateFreqCBT}. If you have already used \texttt{dc.MergeTransactionsOnSameDate}, this will simply be a reach customer-by-time matrix. \item[-] Reach---each matrix entry will contain a 1 if the customer made any transactions on that day, and 0 otherwise. Use \texttt{dc.CreateReachCBT}. \item[-] Spend---each matrix entry will contain the amount spent by that customer on that day. Use \texttt{dc.CreateSpendCBT}. You can set whether to use total spend for each day or average spend for each day by changing the \texttt{is.avg.spend} parameter. In most cases, leaving \texttt{is.avg.spend} as \texttt{FALSE} is appropriate. \end{itemize} <>= freq.cbt <- dc.CreateFreqCBT(clean.elog); freq.cbt[1:3,1:5] @ There are two things to note from the output above: \begin{enumerate} \item Customers 3, 4, and 5 appear to be missing, and in fact they are. They did not make any repeat transactions and were removed when we used \texttt{dc.SplitUpElogForRepeatTrans}. Do not worry though---we will get them back into the data when we use \texttt{dc.MergeCustomers} (soon). \item The columns start on January 8---this is because we removed all the first transactions (and nobody made 2 transactions withing the first week). \end{enumerate} We have a small problem now---since we have deleted all the first transactions, the frequency customer-by-time matrix does not have any of the customers who made zero repeat transactions. These customers are still important; in fact, in most datasets, more customers make zero repeat transactions than any other number. Solving the problem is reasonably simple: we create a customer-by-time matrix using all transactions, and then merge the filtered CBT with this total CBT (using data from the filtered CBT and customer IDs from the total CBT). <>= tot.cbt <- dc.CreateFreqCBT(elog) cal.cbt <- dc.MergeCustomers(tot.cbt, freq.cbt) @ From the calibration period customer-by-time matrix (and a bit of additional information we saved earlier), we can finally create the customer-by-sufficient-statistic matrix described earlier. The function we are going to use is \texttt{dc.BuildCBSFromCBTAndDates}, which requires a customer-by-time matrix, starting and ending dates for each customer, and the time of the end of the calibration period. It also requires a time period to use---in this case, we are choosing to use weeks. If we chose days instead, for example, the recency and total time observed columns in the customer-by-sufficient-statistic matrix would have gone up to 273 instead of 39, but it would be the same data in the end. This function could also be used for the holdout period by setting \texttt{cbt.is.during.cal.period} to \texttt{FALSE}. There are slight differences when this function is used for the holdout period---it requires different input dates (simply the start and end of the holdout period) and does not return a recency (which has little value in the holdout period). 
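The next chunk builds the calibration period customer-by-sufficient-statistic matrix. For the holdout period, the call would look roughly as follows. This is only a sketch (not evaluated): \texttt{holdout.cbt} stands for a hypothetical customer-by-time matrix built from the holdout portion of the event log, and the exact form expected for the date argument should be checked against the \texttt{dc.BuildCBSFromCBTAndDates} documentation.

<<eval = FALSE>>=
# Holdout-period variant (a sketch). The dates are simply the start and
# end of the holdout period, and no recency is returned.
holdout.dates <- c(end.of.cal.period + 1, max(elog$date))
holdout.cbs <- dc.BuildCBSFromCBTAndDates(holdout.cbt, holdout.dates,
                                          per = "week",
                                          cbt.is.during.cal.period = FALSE)
@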
<<>>=
birth.periods <- split.data$cust.data$birth.per
last.dates <- split.data$cust.data$last.date
cal.cbs.dates <- data.frame(birth.periods, last.dates,
                            end.of.cal.period)
cal.cbs <- dc.BuildCBSFromCBTAndDates(cal.cbt, cal.cbs.dates, per="week")
@

You'll be glad to hear that, for the process described above, the package contains a single function to do everything for you: \texttt{dc.ElogToCbsCbt}. However, it is good to be aware of the process above, as you might want to make small changes for different situations---for example, you may not want to remove customers' initial transactions, or your data may be available as a customer-by-time matrix and not as an event log. For most standard situations, however, \texttt{dc.ElogToCbsCbt} will do. Reading about its parameters and output in the package documentation will help you to understand the function well enough to use it for most purposes. \subsection{Parameter Estimation} Now that we have the data in the correct format, we can estimate model parameters. To estimate parameters, we use \texttt{bgnbd.EstimateParameters}, which requires a calibration period customer-by-sufficient-statistic matrix and (optionally) starting parameters. (1,3,1,3) is used as the default starting parameters if none are provided, as these values tend to provide good convergence across data sets. The function which is maximized is \texttt{bgnbd.cbs.LL}, which returns the log-likelihood of a given set of parameters for a customer-by-sufficient-statistic matrix. Below, we start with relatively uninformative starting values:

<<>>=
params <- bgnbd.EstimateParameters(cal.cbs); params
LL <- bgnbd.cbs.LL(params, cal.cbs); LL
@

As with any optimization, we should not be satisfied with the first output we get. Let's run it a couple more times, with its own output as a starting point, to see if it converges:

<<>>=
p.matrix <- c(params, LL);
for (i in 1:2){
  params <- bgnbd.EstimateParameters(cal.cbs, params);
  LL <- bgnbd.cbs.LL(params, cal.cbs);
  p.matrix.row <- c(params, LL);
  p.matrix <- rbind(p.matrix, p.matrix.row);
}
colnames(p.matrix) <- c("r", "alpha", "a", "b", "LL");
rownames(p.matrix) <- 1:3;
p.matrix;
@

This estimation does converge. I am not going to demonstrate it here, but it is always a good idea to test the estimation function from several starting points. Now that we have the parameters, the BTYD package provides functions to interpret them. As we know, r and alpha describe the gamma mixing distribution of the NBD transaction process. We can see this gamma distribution in figure \ref{fig:bgnbdTransactionHeterogeneity}, plotted using \texttt{bgnbd.PlotTransactionRateHeterogeneity(params)}. We also know that a and b describe the beta mixing distribution of the beta geometric dropout probabilities. We can see this beta distribution in figure \ref{fig:bgnbdDropoutHeterogeneity}, plotted using \texttt{bgnbd.PlotDropoutRateHeterogeneity(params)}. From these, we can already tell something about our customers: they are more likely to have low values for their individual poisson transaction process parameters, and more likely to have low values for their individual geometric dropout probability parameters.
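A quick way to attach numbers to that interpretation is to look at the means of the two mixing distributions implied by the estimates: a gamma$(r, \alpha)$ distribution has mean $r/\alpha$, and a beta$(a, b)$ distribution has mean $a/(a+b)$. A sketch, not evaluated here:

<<eval = FALSE>>=
# Means of the BG/NBD mixing distributions (a sketch).
r <- params[1]; alpha <- params[2]; a <- params[3]; b <- params[4]
r / alpha    # mean of the gamma-distributed weekly transaction rates
a / (a + b)  # mean of the beta-distributed dropout probabilities
@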
<>= pdf(file = 'bgnbdTransactionHeterogeneity.pdf') bgnbd.PlotTransactionRateHeterogeneity(params) dev.off() @ \begin{figure} \begin{center} \includegraphics[width=0.6\textwidth]{bgnbdTransactionHeterogeneity} \caption{Transaction rate heterogeneity of estimated parameters.}\label{fig:bgnbdTransactionHeterogeneity} \end{center} \end{figure} <>= pdf(file = 'bgnbdDropoutHeterogeneity.pdf') bgnbd.PlotDropoutRateHeterogeneity(params) dev.off() @ \begin{figure} \begin{center} \includegraphics[width=0.6\textwidth]{bgnbdDropoutHeterogeneity} \caption{Dropout probability heterogeneity of estimated parameters.}\label{fig:bgnbdDropoutHeterogeneity} \end{center} \end{figure} \subsection{Individual Level Estimations} Now that we have parameters for the population, we can make estimations for customers on the individual level. First, we can estimate the number of transactions we expect a newly acquired customer to make in a given time period. Let's say, for example, that we are interested in the number of repeat transactions a newly acquired customer will make in a time period of one year. Note that we use 52 weeks to represent one year, not 12 months, 365 days, or 1 year. This is because our parameters were estimated using weekly data. <<>>= bgnbd.Expectation(params, t=52); @ We can also obtain expected characteristics for a specific customer, conditional on their purchasing behavior during the calibration period. The first of these is \texttt{bgnbd.ConditionalExpectedTransactions}, which gives the number of transactions we expect a customer to make in the holdout period. The second is \texttt{bgnbd.PAlive}, which gives the probability that a customer is still alive at the end of the calibration period. As above, the time periods used depend on which time period was used to estimate the parameters. <>= cal.cbs["1516",] x <- cal.cbs["1516", "x"] t.x <- cal.cbs["1516", "t.x"] T.cal <- cal.cbs["1516", "T.cal"] bgnbd.ConditionalExpectedTransactions(params, T.star = 52, x, t.x, T.cal) bgnbd.PAlive(params, x, t.x, T.cal) @ There is one more point to note here---using the conditional expectation function, we can see the ``increasing frequency paradox'' in action: <>= for (i in seq(10, 25, 5)){ cond.expectation <- bgnbd.ConditionalExpectedTransactions( params, T.star = 52, x = i, t.x = 20, T.cal = 39) cat ("x:",i,"\t Expectation:",cond.expectation, fill = TRUE) } @ \subsection{Plotting/ Goodness-of-fit} \label{subsec:bgnbdPlotting} We would like to be able to do more than make inferences about individual customers. The BTYD package provides functions to plot expected customer behavior against actual customer behaviour in the both the calibration and holdout periods. The first such function is the obvious starting point: a comparison of actual and expected frequencies within the calibration period. This is figure \ref{fig:bgnbdCalibrationFit}, which was generated using the following code: <>= bgnbd.PlotFrequencyInCalibration(params, cal.cbs, 7) @ This function obviously needs to be able to generate expected data (from estimated parameters) and requires the actual data (the calibration period customer-by-sufficient-statistic). It also requires another number, called the censor number. The histogram that is plotted is right-censored; after a certain number, all frequencies are binned together. The number provided as a censor number determines where the data is cut off. 
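In other words, with a censor number of 7, customers with 7 or more calibration period repeat transactions all land in a single ``7+'' bin. The censored actual frequencies can be reproduced by hand; a sketch, not evaluated here:

<<eval = FALSE>>=
# Right-censoring the actual calibration frequencies at 7 (a sketch).
censored.freq <- pmin(cal.cbs[, "x"], 7)
table(censored.freq)   # counts per bin: 0, 1, ..., 6, 7+
@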
Unfortunately, the only thing we can tell from comparing calibration period frequencies is that the fit between our model and the data isn't awful. We need to verify that the fit of the model holds into the holdout period. Firstly, however, we are going to need to get information for the holdout period. \texttt{dc.ElogToCbsCbt} produces both a calibration period customer-by-sufficient-statistic matrix and a holdout period customer-by-sufficient-statistic matrix, which could be combined in order to find the number of transactions each customer made in the holdout period. However, since we did not use \texttt{dc.ElogToCbsCbt}, I am going to get the information directly from the event log. Note that I subtract the number of repeat transactions in the calibration period from the total number of transactions. We remove the initial transactions first as we are not concerned with them.

<<>>=
elog <- dc.SplitUpElogForRepeatTrans(elog)$repeat.trans.elog;
x.star <- rep(0, nrow(cal.cbs));
cal.cbs <- cbind(cal.cbs, x.star);
elog.custs <- elog$cust;
for (i in 1:nrow(cal.cbs)){
  current.cust <- rownames(cal.cbs)[i]
  tot.cust.trans <- length(which(elog.custs == current.cust))
  cal.trans <- cal.cbs[i, "x"]
  cal.cbs[i, "x.star"] <- tot.cust.trans - cal.trans
}
cal.cbs[1:3,]
@

Now we can see how well our model does in the holdout period. Figure \ref{fig:bgnbdCondExpComp} shows the output produced by the code below. It divides the customers up into bins according to calibration period frequencies and plots actual and conditional expected holdout period frequencies for these bins.

<<>>=
T.star <- 39 # length of the holdout period
censor <- 7  # This censor serves the same purpose described above
x.star <- cal.cbs[,"x.star"]
pdf(file = 'bgnbdCondExpComp.pdf')
comp <- bgnbd.PlotFreqVsConditionalExpectedFrequency(params, T.star,
                                                     cal.cbs, x.star,
                                                     censor)
dev.off()
rownames(comp) <- c("act", "exp", "bin")
comp
@

\begin{figure} \begin{center} \includegraphics[width=0.6\textwidth]{bgnbdCondExpComp} \caption{Actual vs. conditional expected transactions in the holdout period.}\label{fig:bgnbdCondExpComp} \end{center} \end{figure} As you can see above, the graph also produces a matrix output. Most plotting functions in the BTYD package produce output like this. They are often worth looking at because they contain additional information not presented in the graph---the size of each bin in the graph. In this graph, for example, this information is important because the bin sizes show that the gap at zero means a lot more than the precision at 6 or 7 transactions. Despite this, this graph shows that the model fits the data very well in the holdout period. Aggregation by calibration period frequency is just one way to do it. BTYD also provides plotting functions which aggregate by several other measures. The other one I will demonstrate here is aggregation by time---how well does our model predict how many transactions will occur in each week? The first step, once again, is going to be to collect the data we need to compare the model to. The customer-by-time matrix has already collected the data for us by time period, so we'll use that to gather the total transactions per day. Then we convert the daily tracking data to weekly data.
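The aggregation loop in the chunk below can also be written in one line, since the 546 daily totals fill a $7 \times 78$ matrix whose column sums are the weekly totals. A sketch, not evaluated here:

<<eval = FALSE>>=
# Compact equivalent of the daily-to-weekly loop below (a sketch).
w.track.data <- colSums(matrix(d.track.data, nrow = 7, ncol = 78))
@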
<<>>= tot.cbt <- dc.CreateFreqCBT(elog) d.track.data <- rep(0, 7 * 78) origin <- as.Date("1997-01-01") for (i in colnames(tot.cbt)){ date.index <- difftime(as.Date(i), origin) + 1; d.track.data[date.index] <- sum(tot.cbt[,i]); } w.track.data <- rep(0, 78) for (j in 1:78){ w.track.data[j] <- sum(d.track.data[(j*7-6):(j*7)]) } @ Now, we can make a plot comparing the actual number of transactions to the expected number of transactions on a weekly basis, as shown in figure \ref{fig:bgnbdTrackingInc}. Note that we set \texttt{n.periods.final} to 78. This is to show that we are working with weekly data. If our tracking data was daily, we would use 546 here---the function would plot our daily tracking data against expected daily transactions, instead of plotting our weekly tracking data against expected weekly transactions. This concept may be a bit tricky, but is explained in the documentation for \texttt{bgnbd.PlotTrackingInc}. The reason there are two numbers for the total period (\texttt{T.tot} and \texttt{n.periods.final}) is that your customer-by-sufficient-statistic matrix and your tracking data may be in different time periods. <>= T.cal <- cal.cbs[,"T.cal"] T.tot <- 78 n.periods.final <- 78 pdf(file = 'bgnbdTrackingInc.pdf') inc.tracking <- bgnbd.PlotTrackingInc(params, T.cal, T.tot, w.track.data, n.periods.final, allHardie) dev.off() inc.tracking[,20:25] @ \begin{figure} \begin{center} \includegraphics[width=0.6\textwidth]{bgnbdTrackingInc} \caption{Actual vs. expected incremental purchasing behaviour.}\label{fig:bgnbdTrackingInc} \end{center} \end{figure} Although figure \ref{fig:bgnbdTrackingInc} shows that the model is definitely capturing the trend of customer purchases over time, it is very messy and may not convince skeptics. Furthermore, the matrix, of which a sample is shown, does not really convey much information since purchases can vary so much from one week to the next. For these reasons, we may need to smooth the data out by cumulating it over time, as shown in Figure \ref{fig:bgnbdTrackingCum}. <>= cum.tracking.data <- cumsum(w.track.data) pdf(file = 'bgnbdTrackingCum.pdf') cum.tracking <- bgnbd.PlotTrackingCum(params, T.cal, T.tot, cum.tracking.data, n.periods.final, allHardie) dev.off() cum.tracking[,20:25] @ \begin{figure} \begin{center} \includegraphics[width=0.6\textwidth]{bgnbdTrackingCum} \caption{Actual vs. expected cumulative purchasing behaviour.}\label{fig:bgnbdTrackingCum} \end{center} \end{figure} \section{BG/BB} The BG/BB model is also used for non-contractual settings. In many regards, it is very similar to the Pareto/NBD model---it also uses four parameters to describe a purchasing process and a dropout process. The difference between the models is that the BG/BB is used to describe situations in which customers have discrete transaction opportunities, rather than being able to make transactions at any time. For this section, we will be using donation data presented in Fader et. al. (2010). Figure \ref{fig:bgbbCalibrationFit} shows that this model also fits the right type of data well. 
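The chunk below works from \texttt{donationsSummary\$rf.matrix}, a recency-frequency summary of the donor base with one row per observed recency/frequency pattern and the number of customers exhibiting it; the data preparation subsection that follows shows how such a matrix is built. A quick peek, as a sketch (not evaluated; the column name \texttt{custs} for the customer counts is an assumption to be checked against the \texttt{dc.MakeRFmatrixCal} documentation):

<<eval = FALSE>>=
# Inspect the recency-frequency matrix used in the next chunk (a sketch).
data(donationsSummary)
head(donationsSummary$rf.matrix)
sum(donationsSummary$rf.matrix[, "custs"])   # total number of donors
@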
<>= data(donationsSummary) rf.matrix <- donationsSummary$rf.matrix params <- bgbb.EstimateParameters(rf.matrix) pdf(file = 'bgbbCalibrationFit.pdf') cal.fit <- bgbb.PlotFrequencyInCalibration(params, rf.matrix, 6) dev.off() @ \begin{figure} \begin{center} \includegraphics[width=0.6\textwidth]{bgbbCalibrationFit} \caption{Calibration period fit of BG/BB model to the donations dataset.}\label{fig:bgbbCalibrationFit} \end{center} \end{figure} \subsection{Data Preparation} Luckily, preparing data to be used by the BTYD package BG/BB functions is going to be very easy if you understood how to set up the data for the Pareto/NBD package. The BG/BB model uses exactly the same data as the Pareto/NBD model, but since it is discrete we can go one step further and create a recency-frequency matrix from our customer-by-sufficient-statistic matrix. We are able to do this because the data is discrete---a recency-frequency matrix consists of a row for every possible calibration period recency/frequency combination, and contains the total number of customers which had that particular combination of recency and frequency. While this is not strictly necessary, it greatly reduces the amount of space required to store the data and makes parameter estimation much faster---for the donation data, for example, there is a reduction from 11,104 rows (number of customers) down to 22. Since I don't have access to the original event log for the donation data, I am going to demonstrate the data preparation process with simulated data (included in the package): <>= simElog <- system.file("data/discreteSimElog.csv", package = "BTYD") elog <- dc.ReadLines(simElog, cust.idx = 1, date.idx = 2) elog[1:3,] elog$date <- as.Date(elog$date, "%Y-%m-%d") max(elog$date); min(elog$date); # let's make the calibration period end somewhere in-between T.cal <- as.Date("1977-01-01") simData <- dc.ElogToCbsCbt(elog, per="year", T.cal) cal.cbs <- simData$cal$cbs freq<- cal.cbs[,"x"] rec <- cal.cbs[,"t.x"] trans.opp <- 7 # transaction opportunities cal.rf.matrix <- dc.MakeRFmatrixCal(freq, rec, trans.opp) cal.rf.matrix[1:5,] @ \subsection{Parameter Estimation} Estimating BG/BB parameters is very similar to estimating Pareto/NBD parameters. The same rules apply---we use \texttt{bgbb.EstimateParameters}, which also uses (1,1,1,1) as default starting parameters. Once again, we should run the estimation several times with its own output as starting parameters, to make sure that our estimation converges. One should also test the estimation from different starting points, but I am not going to do that here. <<>>= data(donationsSummary); rf.matrix <- donationsSummary$rf.matrix params <- bgbb.EstimateParameters(rf.matrix); LL <- bgbb.rf.matrix.LL(params, rf.matrix); p.matrix <- c(params, LL); for (i in 1:2){ params <- bgbb.EstimateParameters(rf.matrix, params); LL <- bgbb.rf.matrix.LL(params, rf.matrix); p.matrix.row <- c(params, LL); p.matrix <- rbind(p.matrix, p.matrix.row); } colnames(p.matrix) <- c("alpha", "beta", "gamma", "delta", "LL"); rownames(p.matrix) <- 1:3; p.matrix; @ The parameter estimation converges very quickly. It is much easier, and faster, to estimate BG/BB parameters than it is to estimate Pareto/NBD parameters, because there are fewer calculations involved. We can interpret these parameters by plotting the mixing distributions. Alpha and beta describe the beta mixing distribution of the beta-Bernoulli transaction process. 
We can see the beta distribution with parameters alpha and beta in figure \ref{fig:bgbbTransactionHeterogeneity}, plotted using \texttt{bgbb.PlotTransactionRateHeterogeneity(params)}. Gamma and delta describe the beta mixing distribution of the beta-geometric dropout process. We can see the beta distribution with parameters gamma and delta in figure \ref{fig:bgbbDropoutHeterogeneity}, plotted using \texttt{bgbb.PlotDropoutRateHeterogeneity(params)}. The story told by these plots describes the type of customers most firms would want---their transaction parameters are more likely to be high, and their dropout parameters are more likely to be low.

<<>>=
pdf(file = 'bgbbTransactionHeterogeneity.pdf')
bgbb.PlotTransactionRateHeterogeneity(params)
dev.off()
@

\begin{figure} \begin{center} \includegraphics[width=0.6\textwidth]{bgbbTransactionHeterogeneity} \caption{Transaction rate heterogeneity of estimated parameters.}\label{fig:bgbbTransactionHeterogeneity} \end{center} \end{figure}

<<>>=
pdf(file = 'bgbbDropoutHeterogeneity.pdf')
bgbb.PlotDropoutRateHeterogeneity(params)
dev.off()
@

\begin{figure} \begin{center} \includegraphics[width=0.6\textwidth]{bgbbDropoutHeterogeneity} \caption{Dropout rate heterogeneity of estimated parameters.}\label{fig:bgbbDropoutHeterogeneity} \end{center} \end{figure} \subsection{Individual Level Estimations} We can estimate the number of transactions we expect a newly acquired customer to make in a given time period, just as with the Pareto/NBD model. For this example, let's say we are interested in estimating the number of repeat transactions we expect a newly acquired customer to make in a period of 10 years. The same rules that we used for Pareto/NBD functions also apply to BG/BB functions: since we used years to estimate the parameters, we stick to years to represent time periods.

<<>>=
bgbb.Expectation(params, n=10);
@

But we want to be able to say something about our existing customers, not just about a hypothetical customer to be acquired in the future. Once again, we use conditional expectations for a holdout period of 10 years. I am going to do this for 2 customers: A, who made 0 transactions in the calibration period; and B, who made 4 transactions in the calibration period, with the last transaction occurring in the 5th year.

<<>>=
# customer A
n.cal = 6
n.star = 10
x = 0
t.x = 0
bgbb.ConditionalExpectedTransactions(params, n.cal, n.star, x, t.x)
# customer B
x = 4
t.x = 5
bgbb.ConditionalExpectedTransactions(params, n.cal, n.star, x, t.x)
@

As expected, B's conditional expectation is much higher than A's. The point I am trying to make, however, is that there are 3464 A's in this dataset and only 284 B's---you should never ignore the zeroes in these models. \subsection{Plotting/ Goodness-of-fit} Figure \ref{fig:bgbbCalibrationFit} is the first plot to test the goodness-of-fit: a simple calibration period histogram.

<<>>=
bgbb.PlotFrequencyInCalibration(params, rf.matrix)
@

As with the equivalent Pareto/NBD plot, keep in mind that this plot is only useful for an initial verification that the fit of the BG/BB model is not terrible. The next step is to see how well the model performs in the holdout period. When we used \texttt{dc.ElogToCbsCbt} earlier, we ignored a lot of the data it generated. It is easy to get the holdout period frequencies from that data:

<<>>=
holdout.cbs <- simData$holdout$cbs
x.star <- holdout.cbs[,"x.star"]
@

At this point, we can switch from simulated data to the real discrete data (donation data) provided with the package.
The holdout period frequencies for the donations data is included in the package. Using this information, we can generate figure \ref{fig:bgbbCondExpComp}, which compares actual and conditional expected transactions in the holdout period. It bins customers according to calibration period frequencies. \clearpage <>= n.star <- 5 # length of the holdout period x.star <- donationsSummary$x.star pdf(file = 'bgbbCondExpComp.pdf') comp <- bgbb.PlotFreqVsConditionalExpectedFrequency(params, n.star, rf.matrix, x.star) dev.off() rownames(comp) <- c("act", "exp", "bin") comp @ \begin{figure} \begin{center} \includegraphics[width=0.6\textwidth]{bgbbCondExpComp} \caption{Actual vs. conditional expected transactions in the holdout period, binned by calibration period frequency.}\label{fig:bgbbCondExpComp} \end{center} \end{figure} Since the BG/BB model uses discrete data, we can also bin customers by recency. Figure \ref{fig:bgbbCondExpCompRec} shows the fit of holdout period frequencies, with customers binned in this manner. <>= pdf(file = 'bgbbCondExpCompRec.pdf') comp <- bgbb.PlotRecVsConditionalExpectedFrequency(params, n.star, rf.matrix, x.star) dev.off() rownames(comp) <- c("act", "exp", "bin") comp @ \begin{figure} \begin{center} \includegraphics[width=0.6\textwidth]{bgbbCondExpCompRec} \caption{Actual vs. conditional expected transactions in the holdout period, binned by calibration period recency.}\label{fig:bgbbCondExpCompRec} \end{center} \end{figure} Another plotting function provided by the BTYD package is the tracking function---binning the data according to which time period transactions occurred in. Since we do not have the granular data available, we once again cannot get the required data from the event log; however, the process will be exactly the same as described in section \ref{subsec:pnbdPlotting}. The annual transaction data is provided by the package. <>= inc.track.data <- donationsSummary$annual.trans n.cal <- 6 xtickmarks <- 1996:2006 pdf(file = 'bgbbTrackingInc.pdf') inc.tracking <- bgbb.PlotTrackingInc(params, rf.matrix, inc.track.data, xticklab = xtickmarks) dev.off() rownames(inc.tracking) <- c("act", "exp") inc.tracking @ \begin{figure} \begin{center} \includegraphics[width=0.6\textwidth]{bgbbTrackingInc} \caption{Actual vs. expected incremental purchasing behaviour.}\label{fig:bgbbTrackingInc} \end{center} \end{figure} Figure \ref{fig:bgbbTrackingInc} shows remarkable fit, but we can smooth it out for a cleaner graph by making it cumulative (as shown in figure \ref{fig:bgbbTrackingCum}). <>= cum.track.data <- cumsum(inc.track.data) pdf(file = 'bgbbTrackingCum.pdf') cum.tracking <- bgbb.PlotTrackingCum(params, rf.matrix, cum.track.data, xticklab = xtickmarks) dev.off() rownames(cum.tracking) <- c("act", "exp") cum.tracking @ \begin{figure} \begin{center} \includegraphics[width=0.6\textwidth]{bgbbTrackingCum} \caption{Actual vs. expected cumulative purchasing behaviour.}\label{fig:bgbbTrackingCum} \end{center} \end{figure} \section{Further analysis} The package functionality I highlighted above is just a starting point for working with these models. There are additional tools, such as functions for discounted expected residual transactions (the present value of the remaining transactions we expect a customer to make) and an implementation of the gamma-gamma spend model, which may come in useful for customer analysis. 
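As a pointer, here is roughly what those two tools look like in use. This is only a sketch (not evaluated): \texttt{d} is a per-period discount rate you must choose, \texttt{m.x.vector} and \texttt{x.vector} stand for hypothetical vectors of each customer's average transaction value and number of transactions, and the Pareto/NBD inputs are the \texttt{params}, \texttt{x}, \texttt{t.x} and \texttt{T.cal} from the Pareto/NBD section; see \texttt{?pnbd.DERT}, \texttt{?spend.EstimateParameters} and \texttt{?spend.expected.value} for the exact argument lists.

<<eval = FALSE>>=
# Discounted expected residual transactions for one customer (a sketch).
# d = 0.0027 is roughly a 15% annual rate expressed per week (an arbitrary choice).
pnbd.DERT(params, x, t.x, T.cal, d = 0.0027)

# Gamma-gamma spend model (a sketch): fit spend parameters from average
# transaction values and transaction counts, then compute a customer's
# expected average transaction value.
spend.params <- spend.EstimateParameters(m.x.vector, x.vector)
spend.expected.value(spend.params, m.x = 35, x = 3)
@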
Hopefully you now have an idea of how to start working with the BTYD package---from here, you should be able to use the package's additional functions, and may even want to implement some of your own. Enjoy!

\clearpage

\section*{References}
\setlength{\parindent}{0pt}
\setlength{\parskip}{1ex plus 0.2ex minus 0.2ex}

Fader, Peter S., and Bruce G.S. Hardie. ``A Note on Deriving the Pareto/NBD Model and Related Expressions.'' November 2005. Web.\\
\href{http://www.brucehardie.com/notes/008/}{\textless http://www.brucehardie.com/notes/008/\textgreater}

Fader, Peter S., Bruce G.S. Hardie, and Jen Shang. ``Customer-Base Analysis in a Discrete-Time Noncontractual Setting.'' \emph{Marketing Science}, \textbf{29(6)}, pp. 1086--1108. 2010. INFORMS.\\
\href{http://www.brucehardie.com/papers/020/}{\textless http://www.brucehardie.com/papers/020/\textgreater}

Fader, Peter S., Bruce G.S. Hardie, and Ka Lok Lee. ````Counting Your Customers'' the Easy Way: An Alternative to the Pareto/NBD Model.'' \emph{Marketing Science}, \textbf{24(2)}, pp. 275--284. 2005. INFORMS.\\
\href{http://brucehardie.com/papers/018/}{\textless http://brucehardie.com/papers/018/\textgreater}

\end{document}