This document provides a very quick introduction to the
R
code needed to use parametric bootstraps for typology
validation in sequence analysis. Readers interested in the methods and
the exact interpretation of the results are referred to:
You are kindly asked to cite the above reference if you use the methods presented in this document.
Warning!! To avoid lengthy computations (and overloading the CRAN server), we restricted the number of bootstraps to 50. We recommend using a higher value (i.e., 1000).
Let’s start by setting the seed for reproducible results.
Time for this code chunk to run: 0.1 seconds
For this example, we use the mvad
dataset. Let’s start
with the creation of the state sequence object.
## Loading the TraMineR library
library(TraMineR)
## Loading the data
data(mvad)
## State properties
mvad.alphabet <- c("employment", "FE", "HE", "joblessness", "school", "training")
mvad.lab <- c("employment", "further education", "higher education", "joblessness", "school", "training")
mvad.shortlab <- c("EM","FE","HE","JL","SC","TR")
## Creating the state sequence object
mvad.seq <- seqdef(mvad, 17:86, alphabet = mvad.alphabet, states = mvad.shortlab, labels = mvad.lab, xtstep = 6)
Time for this code chunk to run: 0.53 seconds
We will now create a typology using cluster analysis. Readers
interested in more detail are referred to the
WeightedCluster
library manual (also available as a
vignette), which goes into the details of the creation and computation
of cluster quality measures.
We start by computing dissimilarities with the seqdist
function using the Hamming distance. We then use Ward clustering to
create a typology of the trajectories. For this step, we recommend the
use of the fastcluster
library, which considerably speed up
the computations.
## Using fastcluster for hierarchical clustering
library(fastcluster)
## Distance computation
diss <- seqdist(mvad.seq, method="HAM")
## Hierarchical clustering
hc <- hclust(as.dist(diss), method="ward.D")
Time for this code chunk to run: 2.93 seconds
We can now compute several cluster quality indices using
as.clustrange
function from two to ten groups.
# Loading the WeightedCluster library
library(WeightedCluster)
# Computing cluster quality measures.
clustqual <- as.clustrange(hc, diss=diss, ncluster=10)
clustqual
## PBC HG HGSD ASW ASWw CH R2 CHsq R2sq HC
## cluster2 0.55 0.71 0.69 0.39 0.39 184.57 0.21 347.47 0.33 0.14
## cluster3 0.49 0.57 0.57 0.27 0.27 139.53 0.28 261.70 0.42 0.22
## cluster4 0.46 0.58 0.57 0.24 0.24 124.40 0.35 228.94 0.49 0.23
## cluster5 0.51 0.68 0.67 0.27 0.28 118.20 0.40 241.78 0.58 0.18
## cluster6 0.52 0.70 0.70 0.28 0.29 117.08 0.45 241.96 0.63 0.17
## cluster7 0.55 0.78 0.78 0.32 0.32 114.14 0.49 245.83 0.68 0.13
## cluster8 0.56 0.82 0.82 0.32 0.33 109.22 0.52 244.85 0.71 0.11
## cluster9 0.57 0.85 0.85 0.34 0.35 105.74 0.55 255.12 0.74 0.09
## cluster10 0.55 0.85 0.84 0.32 0.33 101.81 0.57 244.05 0.76 0.10
Time for this code chunk to run: 0.38 seconds
Parametric bootstrap aims to provide baseline values obtained by
clustering similar but non-clustered data (Studer 2021). This can be computed using the
seqnullcqi
function with the following parameters:
R
: number of bootstraps.model
: The null model (see table belowseqdist.args
: list of arguments passed to
seqdist
(should be identical to first call to
seqdist).hclust.method
: hierarchical clustering method (should
be identical to orginal clustering).kmedoid
: If TRUE
, use PAM (and the
wcKMedRange
function) instead of hierarchical
clustering.parallel
: If TRUE
, use parallel computing
to speed up the computations.The following R
code estimate expected values of the
cluster quality indices when clustering similar sequences that are not
clustered according to the "combined"
model, using the
Hamming distance and Ward hierarchical clustering. We set
parallel=TRUE
to use parallel computing. You can use
progressbar=TRUE
to show a progress bar and an estimation
of the computation remaining time (not meaningful here within a
document):
bcq.combined <- seqnullcqi(mvad.seq, clustqual, R=50, model="combined", seqdist.args=list(method="HAM"), hclust.method="ward.D", parallel = TRUE)
Time for this code chunk to run: 150.16 seconds
Once the parametric bootstrap is computed (may take a while…), the
results are stored in the bcq.combined
object. Printing the
object (just by writing its name), already provides several information,
the standardized cluster quality indices and the associated inconclusive
intervals. Here, 2, 9 and 10 groups stand out.
## Parametric bootstrap cluster analysis validation
## Sequence analysis null model: list(model = "combined")
## Number of bootstraps: 50
## Clustering method: hclust with ward.D
## Seqdist arguments: list(method = "HAM")
##
##
## PBC HG HGSD ASW
## cluster2 5.88 6.03 6.12 16.03
## cluster3 3.93 4.04 4.12 8.04
## cluster4 1.92 2.26 2.32 4.95
## cluster5 1.81 2.11 2.16 5.08
## cluster6 1.09 1.42 1.45 4.46
## cluster7 2.23 2.84 2.87 8.31
## cluster8 3.18 3.97 4 9.59
## cluster9 3.87 4.67 4.7 10.99
## cluster10 3.71 4.45 4.48 10.8
##
## Null Max-T 0.95 interval [-0.14; 2.26] [-0.06; 2.14] [-0.05; 2.15] [0.18; 2.16]
## ASWw CH R2 CHsq
## cluster2 15.94 23.07 19.58 25.86
## cluster3 7.99 20.81 16.89 21.1
## cluster4 4.9 20.7 16.14 18.14
## cluster5 5.05 21.57 16.17 19.5
## cluster6 4.42 23.17 16.64 18.74
## cluster7 8.24 26.57 18.37 22.16
## cluster8 9.48 30.66 20.68 26.37
## cluster9 10.88 35.08 23.05 33.16
## cluster10 10.65 37.34 24.11 33.98
##
## Null Max-T 0.95 interval [0.19; 2.16] [-1.06; 2.48] [-1.07; 2.41] [-0.98; 2.49]
## R2sq HC
## cluster2 19.32 -5.39
## cluster3 14.92 -4.48
## cluster4 12.42 -2.76
## cluster5 12.14 -2.74
## cluster6 11.2 -2.29
## cluster7 12.44 -4.55
## cluster8 14.11 -6.2
## cluster9 16.48 -7.11
## cluster10 16.87 -7.46
##
## Null Max-T 0.95 interval [-0.99; 2.36] [0.09; 2.52]
Time for this code chunk to run: 0.06 seconds
To get non-standardized values, use norm=FALSE
. Notice
that the ASW inconclusive intervals are well below the values
recommended by Kaufman and Rousseeuw (over 0.5).
## Parametric bootstrap cluster analysis validation
## Sequence analysis null model: list(model = "combined")
## Number of bootstraps: 50
## Clustering method: hclust with ward.D
## Seqdist arguments: list(method = "HAM")
##
##
## PBC HG HGSD ASW
## cluster2 0.55 0.71 0.69 0.39
## cluster3 0.49 0.57 0.57 0.27
## cluster4 0.46 0.58 0.57 0.24
## cluster5 0.51 0.68 0.67 0.27
## cluster6 0.52 0.7 0.7 0.28
## cluster7 0.55 0.78 0.78 0.32
## cluster8 0.56 0.82 0.82 0.32
## cluster9 0.57 0.85 0.85 0.34
## cluster10 0.55 0.85 0.84 0.32
##
## Null Max-T 0.95 interval [0.46; 0.56] [0.69; 0.78] [0.69; 0.78] [0.14; 0.21]
## ASWw CH R2
## cluster2 0.39 184.57 0.21
## cluster3 0.27 139.53 0.28
## cluster4 0.24 124.4 0.35
## cluster5 0.28 118.2 0.4
## cluster6 0.29 117.08 0.45
## cluster7 0.32 114.14 0.49
## cluster8 0.33 109.22 0.52
## cluster9 0.35 105.74 0.55
## cluster10 0.33 101.81 0.57
##
## Null Max-T 0.95 interval [0.15; 0.22] [38.84; 59.2] [0.31; 0.35]
## CHsq R2sq HC
## cluster2 347.47 0.33 0.14
## cluster3 261.7 0.42 0.22
## cluster4 228.94 0.49 0.23
## cluster5 241.78 0.58 0.18
## cluster6 241.96 0.63 0.17
## cluster7 245.83 0.68 0.13
## cluster8 244.85 0.71 0.11
## cluster9 255.12 0.74 0.09
## cluster10 244.05 0.76 0.1
##
## Null Max-T 0.95 interval [75.26; 103.53] [0.48; 0.54] [0.34; 0.45]
Time for this code chunk to run: 0.05 seconds
Several plots can then be used to inspect the results using the
plot
command and the type
argument. First, one
can look at the sequences generated by the null model by using
type="seqdplot"
.
Time for this code chunk to run: 1.29 seconds
The overall distribution of the CQI values can be plotted using
type="density"
. In this case, one also needs to specify the
CQI to be used. All tested number of groups are found to be significant.
Any CQI computed by as.clustrange()
can be used here. To
show the density of the average silhouette width ("ASW"
),
one can use:
Time for this code chunk to run: 0.36 seconds
By using type="line"
, we plot the obtained and
bootstrapped CQI values depending on the number of groups. Here
again
Time for this code chunk to run: 0.62 seconds
To use another null model, one needs to change the model
argument of the seqnullcqi
function. The randomized
sequencing keep the duration attached to each state, but randomizes the
ordering of the spells. It can be used to uncover sequencing structure
of the data.
bcq.seq <- seqnullcqi(mvad.seq, clustqual, R=50, model="sequencing", seqdist.args=list(method="HAM"), hclust.method="ward.D", parallel = TRUE)
Time for this code chunk to run: 151.07 seconds
We can then plot the results as before. Notice that solutions between 3 and 6 are below the critical line.
Time for this code chunk to run: 0.28 seconds
The randomized duration keeps the same ordering of the states, but randomizes the time spent in each spell. It can be used to uncover the duration-related structure of the data.
bcq.dur <- seqnullcqi(mvad.seq, clustqual, R=50, model="duration", seqdist.args=list(method="HAM"), hclust.method="ward.D", parallel = TRUE)
Time for this code chunk to run: 134.55 seconds
We can then plot the results as before. The solutions 3 and 4 groups solutions are below the “significance line”. Otherwise, the ranking of the solutions is the same.
Time for this code chunk to run: 0.27 seconds
The state independence null model generates sequence, position by position, independently of the previous state. This is a quite unrealistic assumption for longitudinal data, but a common one in statistical modeling.
bcq.stateindep <- seqnullcqi(mvad.seq, clustqual, R=50, model="stateindep", seqdist.args=list(method="HAM"), hclust.method="ward.D", parallel = TRUE)
Time for this code chunk to run: 134.9 seconds
Bootstrapped CQI values are extremely low compared to our clustering, meaning that we have a strong longitudinal structure (not surprising!).
Time for this code chunk to run: 0.29 seconds
The first-order Markov null model generates sequences using time-invariant transition rates. As a result, the generated sequences are often quite similar to the observed ones. This model can uncover structure stemming from time-dependent transition rates.
bcq.Markov <- seqnullcqi(mvad.seq, clustqual, R=50, model="Markov", seqdist.args=list(method="HAM"), hclust.method="ward.D", parallel = TRUE)
Time for this code chunk to run: 142.41 seconds
Time for this code chunk to run: 0.34 seconds
The various null models lead to the same conclusions and ranking of the solutions. Solutions between 3 and 6 groups were not always above the critical lines (in the sequencing null model for instance), and can be avoided. We generally saw good clustering quality for a clustering in 9 groups. The solution is shown below.
Time for this code chunk to run: 1.04 seconds