The mathematics behind TD
The temporal difference (TD) model (Sutton & Barto, 1990) is an extension of the ideas underlying the RW model (Rescorla & Wagner, 1972). Most notably, the TD model abandons the construct of a “trial” in favor of time-based formulations. Also notable is the introduction of eligibility traces, which allow the model to bridge temporal gaps and address the credit assignment problem.
Implementation note: As of calmr version 0.6.2, stimulus representation in TD is based on complete serial compounds (i.e., time-specific stimulus elements entirely discriminable from each other), and the eligibility traces are of the replacing type.
General note: There are several descriptions of the TD model out there; however, all of the ones I found were opaque when it comes to implementation. Hence, the following description of the model focuses on implementation details.
1 - Maintaining stimulus representations
TD maintains stimulus traces as eligibility traces. The eligibility
of stimulus \(i\) at time \(t\), \(e_i^t\), is given by:
\[
\tag{Eq. 1}
e_i^t = e_i^{t-1} \sigma \gamma + x_i^t
\]
where \(\sigma\) and \(\gamma\) are decay and discount parameters,
respectively, and \(x_i^t\) is the
activation of stimulus \(i\) at time
\(t\) (1 or 0 for present and absent
stimuli, respectively).
Internally, \(e_i\) is represented as a vector of length \(d\), where \(d\) is the number of stimulus compounds (not in the general sense of the word compound, but in terms of complete serial compounds, or CSC). For example, a 2s stimulus in a model with a time resolution of 0.5s will have \(d = 4\), and the second entry in that vector represents the eligibility of the compound active during the second 0.5s of the stimulus presentation.
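To make the bookkeeping concrete, below is a minimal sketch of the Eq. 1 update for a single stimulus, written in Python with numpy rather than calmr's actual R internals; the parameter values are made up for illustration.

```python
import numpy as np

# Minimal sketch of Eq. 1 for a single 2s stimulus at 0.5s resolution,
# so d = 4 CSC elements. Not calmr's code; sigma and gamma are made up.
d = 4
sigma, gamma = 0.9, 0.95  # decay and discount parameters (assumed values)

e = np.zeros(d)           # eligibility trace vector for the stimulus
for t in range(6):        # six 0.5s time steps
    # x is the CSC activation vector: element k is 1 only during the
    # (k+1)-th 0.5s of the stimulus presentation, and 0 otherwise.
    x = np.zeros(d)
    if t < d:
        x[t] = 1.0
    e = e * sigma * gamma + x  # Eq. 1
    print(f"t={t}: e={np.round(e, 3)}")
```

Note that each element's trace peaks at 1 on the step its compound is active and then decays geometrically by \(\sigma \gamma\) on every subsequent step.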
Similarly, \(x_i^t\) denotes the specific compound of stimulus \(i\) active at time \(t\), and not the general activation of \(i\) at that time. For example, suppose two 2s stimuli, \(A\) and \(B\), are presented with an overlap of 1s, with \(A\)’s onset occurring first. Can you guess what stimulus compounds will be active at \(t = 2\) with a time resolution of 0.5s? You can check your answer with the sketch below.
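Here is a small sketch (illustrative Python, not calmr's internals) that lists the active CSC elements for each 0.5s step of that design; the indexing convention, with elements numbered by the time bin in which they are active, is an assumption.

```python
# Sketch of CSC activations for the A/B overlap example (illustrative).
# Assumed design: A onset at 0s, B onset at 1s, both 2s long, 0.5s steps.
res = 0.5
onsets = {"A": 0.0, "B": 1.0}
duration = 2.0
d = int(duration / res)  # 4 CSC elements per stimulus

for step in range(int(3.0 / res)):  # cover 0s to 3s
    t0 = step * res                 # this step spans [t0, t0 + res)
    active = []
    for stim, onset in onsets.items():
        k = int(round((t0 - onset) / res))  # index of the element active now
        if 0 <= k < d:
            active.append(f"{stim}{k + 1}")
    print(f"[{t0:.1f}s, {t0 + res:.1f}s): {', '.join(active) or 'none'}")
```

Under that convention, the step ending at 2s has \(A4\) and \(B2\) active, and from 2s onward only \(B\)’s remaining elements (\(B3\), then \(B4\)) are active.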
2 - Generating expectations
The TD model generates stimulus expectations based on the presented stimuli, not on the strength of eligibility traces. The expectation of stimulus \(j\) at time \(t\), \(V_j^t\), is given by:
\[
\tag{Eq. 2}
V_j^t = (w_j^t)' x^t = \sum_{i=1}^{K} w_{i,j}^t x_i^t
\]
where \(w_j^t\) is the vector of stimulus weights at time \(t\) pointing towards \(j\), \('\) denotes transposition, and \(w_{i,j}\) is an entry in a square matrix containing the association from \(i\) to \(j\). As with the eligibility traces above, the entries in that matrix are the weights of specific stimulus compounds.
Internally, \(w_j^t\) is constructed on a trial-by-trial, step-by-step basis, depending on the stimulus compounds active at the time.
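In code, Eq. 2 is just a dot product between the activation vector and the relevant column of the weight matrix. The sketch below (illustrative Python with made-up numbers, not calmr's implementation) computes \(V_j^t\) for every \(j\) at once, using the \(A\)/\(B\) example above with elements ordered \(A1 \dots A4, B1 \dots B4\).

```python
import numpy as np

# Sketch of Eq. 2 (illustrative; the weights are random, made-up numbers).
# Elements ordered [A1, A2, A3, A4, B1, B2, B3, B4], so K = 8.
K = 8
rng = np.random.default_rng(0)
w = rng.uniform(0.0, 0.2, size=(K, K))  # w[i, j]: association from i to j

x = np.zeros(K)
x[3] = 1.0  # A4 active
x[5] = 1.0  # B2 active

V = w.T @ x  # V[j] = sum over i of w[i, j] * x[i], i.e., Eq. 2 for all j
print(np.round(V, 3))
```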
3 - Learning associations
As its name suggests, the TD model updates associations based on a temporally discounted prediction of upcoming stimuli. This temporal difference error term is given by:
\[
\tag{Eq. 3}
\delta_j^t = \lambda_j^t + \gamma V_j^t - V_j^{t-1}
\]
where \(\lambda_j^t\) is the value of stimulus \(j\) at time \(t\), which also determines the asymptote for stimulus weights towards \(j\).
The temporal difference error term is used to update \(w\) via:
\[
\tag{Eq. 4}
w_{i,j}^{t+1} = w_{i,j}^t + \alpha_i \beta(x_j^t) \delta_j^t e_i^t
\]
where \(\alpha_i\) is a learning rate parameter for stimulus \(i\), and \(\beta(x_j)\) is a function that returns one of two learning rate parameters (\(\beta_{on}\) or \(\beta_{off}\)), depending on whether or not \(j\) is being presented at time \(t\).
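Putting the pieces together, one full time step, trace decay (Eq. 1), expectation (Eq. 2), error (Eq. 3), and weight update (Eq. 4), might look like the sketch below. This is illustrative Python rather than calmr's R code, and all parameter values are assumptions.

```python
import numpy as np

# One-time-step sketch of Eqs. 1-4 (illustrative, not calmr's code).
K = 8                         # CSC elements across all stimuli (A1..A4, B1..B4)
alpha = np.full(K, 0.1)       # per-element learning rates (assumed)
beta_on, beta_off = 0.3, 0.2  # learning rates for present/absent j (assumed)
sigma, gamma = 0.9, 0.95      # decay and discount parameters (assumed)

def td_step(x, lam, w, e, V_prev):
    """Advance the model by one time step.

    x: activation vector (x^t); lam: stimulus values (lambda^t).
    Returns the updated weights, traces, and expectations.
    """
    e = e * sigma * gamma + x                  # Eq. 1
    V = w.T @ x                                # Eq. 2
    delta = lam + gamma * V - V_prev           # Eq. 3
    beta = np.where(x > 0, beta_on, beta_off)  # beta(x_j^t)
    # Eq. 4, vectorized: the outer product applies alpha_i * e_i to the
    # rows and beta_j * delta_j to the columns of the update.
    w = w + (alpha * e)[:, None] * (beta * delta)[None, :]
    return w, e, V

# Example step: A4 and B2 active, with B carrying a nonzero lambda.
w, e, V_prev = np.zeros((K, K)), np.zeros(K), np.zeros(K)
x = np.zeros(K); x[3] = 1.0; x[5] = 1.0
lam = np.zeros(K); lam[5] = 1.0
w, e, V_prev = td_step(x, lam, w, e, V_prev)
print(np.round(w[3], 3))  # weights from A4 towards every element
```

Looping td_step over the steps of a trial (and resetting or carrying traces between trials, depending on the design) yields the trial-by-trial, step-by-step construction described above.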
4 - Generating responses
As with many associative learning models, the transformation between
stimulus expectations and responding is unspecified/left in the hands of
the user. The TD model does not return a response vector, but it
suffices to assume that responding is the identity function on the
expected stimulus values, as follows:
\[
\tag{Eq. 5}
r_j^t = V_j^t
\]
References
Rescorla, R. A., & Wagner, A. R. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. In A. H. Black & W. F. Prokasy (Eds.), Classical conditioning II: Current research and theory (pp. 64–99). Appleton-Century-Crofts.
Sutton, R. S., & Barto, A. G. (1990). Time-derivative models of
Pavlovian reinforcement. In M. Gabriel & J. W. Moore
(Eds.), Learning and computational neuroscience (pp. 497–537).
MIT Press.