This document provides an introduction to the R
code to
create typologies of trajectories in large databases using the algorithm
provided by the WeightedCluster R
library (Studer 2013). It also presents methods to
evaluate the quality of the resulting clustering in large databases.
Readers interested by the methods and looking for more information on
the interpretation of the results are referred to:
You are kindly asked to cite the above reference if you use the methods presented in this document.
Warning!! To avoid lengthy computations (and overloading the CRAN server), we restricted the number of iterations and the sample size. We strongly recommend using much higher values.
We start by loading the required library and setting the seed.
Time for this code chunk to run: 1.38 seconds
We rely on the biofam
dataset to illustrate the use of
the WeightedCluster
package. This public dataset is
distributed with the TraMineR R
package and code family
trajectories of a subsample of the Swiss Household Panel (SHP Group 2024).
First, we need to prepare the data by creating a state sequence
object using the seqdef
command (Gabadinho et al. 2011). This object stores all
the information about our trajectories, including the data and
associated characteristics, such as state labels. During this step, we
further define proper state labels as the original data are coded using
numerical values. We then plot the sequences using, for instance, a
chronogram.
data(biofam) #load illustrative data
## Defining the new state labels
statelab <- c("Parent", "Left", "Married", "Left/Married", "Child",
"Left/Child", "Left/Married/Child", "Divorced")
## Creating the state sequence object,
biofam.seq <- seqdef(biofam[,10:25], alphabet=0:7, states=statelab)
seqdplot(biofam.seq, legend.prop=0.2)
Time for this code chunk to run: 0.19 seconds
CLARA for SA is available in the seqclararange
function.
It is used as follows:
bfclara <- seqclararange(biofam.seq, R = 50, sample.size = 100, kvals = 2:10,
seqdist.args = list(method = "HAM"), parallel=TRUE,
stability=TRUE)
Time for this code chunk to run: 56.59 seconds
The function requires us to specify the sequence to cluster (our
biofam.seq
sequence object), the number of iterations
(argument R
, here set to 50 to avoid long computation time,
larger values are recommended) and the subsample size (argument
sample.size
, here again set to the low value of 100 for
illustrative purpose). The number of groups in our typology is set using
the kvals
arguments, here between 2 and ten groups
solutions. We directly specify a range of values that will be considered
later on. Finally, we need to specify how to compute the distance
between sequences through the seqdist.args
argument as a
list
object. All the arguments specified here will be
directly passed to the seqdist
function. Therefore, any
distance measures available in seqdist
can be used here.
Finally, we set stability=TRUE
to estimate the stability of
the clustering among the subsamples.
Setting parallel=TRUE
, a default parallel back-end is
set up using the future framework (Bengtsson
2021). However, any parallel back-end previously defined with the
plan
function will be used when
parallel=FALSE
. The parallel protocol can then be adapted
to specific environments, for instance some High Performance Computing
(HPC) server relies on specific protocols (MPI,…). As implied by the
name, setting progressbar=TRUE
shows information (and
estimated computation time) on the progress of the computations.
The values of the medoid-based CQI are shown when the result is
printed, or can be plotted using the plot
command. When
plotting the CQI, standardizing the values makes it easier to identify
the best solution (Studer 2013).
## Avg dist PBM DB XB AMS ARI>0.8 JC>0.8 Best iter
## cluster2 5.18 1.13 0.89 0.47 0.50 20 27 46
## cluster3 4.29 0.73 1.09 0.61 0.48 21 17 18
## cluster4 3.69 0.56 0.93 0.53 0.53 24 17 12
## cluster5 3.21 0.47 0.79 0.46 0.58 16 7 36
## cluster6 2.94 0.54 0.84 0.42 0.55 11 3 50
## cluster7 2.71 0.55 0.92 0.68 0.55 4 1 9
## cluster8 2.59 0.45 0.88 0.52 0.55 4 1 21
## cluster9 2.49 0.34 0.98 0.62 0.53 1 1 3
## cluster10 2.35 0.31 1.09 0.78 0.52 1 1 3
Time for this code chunk to run: 0.2 seconds
Except the PBM index, all indices favor a five-cluster solution. The
resulting clustering is stored in the clustering
element of
the results. It can for instance be used to represent the sequences in
each cluster as follows.
The stability of the clustering can also be plotted using either the average value, or the number of recoveries of a similar solution.
Time for this code chunk to run: 0.07 seconds
Here again, the five-cluster solution shows the highest CQI values. However, the absolute number of recoveries is low for more than seven groups. A higher number of iterations is therefore recommended. This is not surprising as we used a low number of iterations.
Time for this code chunk to run: 0.06 seconds
The bootclustrange
function can be used to bootstrap the
cluster quality measures. It has the following arguments. The first
object is the clustering to be evaluated, as a
seqclararange
object, a data.frame
or a
vector. Second, the sequences that were used to create the typology. We
should further specify the distance measure to use (as before), the
number of bootstraps (R
argument), and the subsample size
(here 100). We would generally use higher values for this last two
arguments. Finally, the parallel
and
progressbar
arguments could be used as before. The
resulting object is then printed and the standardized values of the CQI
are plotted. The results lead to the same conclusion as for the
medoid-based CQI.
bCQI <- bootclustrange(bfclara, biofam.seq, seqdist.args = list(method = "HAM"), R = 50, sample.size = 100, parallel=TRUE)
bCQI
## PBC HG HGSD ASW ASWw CH R2 CHsq R2sq HC
## cluster2 0.45 0.57 0.55 0.31 0.32 20.72 0.17 37.21 0.27 0.21
## cluster3 0.55 0.69 0.67 0.34 0.36 23.17 0.32 45.78 0.48 0.16
## cluster4 0.58 0.76 0.74 0.36 0.39 21.89 0.41 45.61 0.59 0.13
## cluster5 0.59 0.81 0.79 0.39 0.42 21.47 0.47 46.60 0.66 0.11
## cluster6 0.62 0.87 0.86 0.40 0.43 20.91 0.53 49.45 0.72 0.08
## cluster7 0.59 0.87 0.86 0.38 0.42 20.28 0.57 49.00 0.76 0.08
## cluster8 0.60 0.90 0.88 0.38 0.43 19.01 0.59 47.86 0.78 0.07
## cluster9 0.56 0.87 0.86 0.34 0.40 17.67 0.61 42.74 0.79 0.09
## cluster10 0.54 0.89 0.88 0.33 0.40 17.46 0.63 43.96 0.81 0.08
Time for this code chunk to run: 22.93 seconds
Once a clustering in a given number of groups has been selected, we can plot the sequences by cluster to give a better interpretation.
Time for this code chunk to run: 0.23 seconds
By setting method="fuzzy"
in the
seqclararange
function, the fuzzy version of the algorithm
is used. It should be noted that the computations are generally longer
than for crisp clustering. The CQI values are printed and plotted using
the same commands.
bfclaraf <- seqclararange(biofam.seq, R = 50, sample.size = 100, kvals = 2:10, method="fuzzy",
seqdist.args = list(method = "HAM"), parallel=TRUE)
bfclaraf
## Avg dist PBM DB XB AMS ARI>0.8 JC>0.8 Best iter
## cluster2 4.36 0.71 1.20 0.44 0.62 NA NA 1
## cluster3 3.22 0.50 1.28 0.46 0.72 NA NA 17
## cluster4 2.62 0.33 1.31 0.37 0.73 NA NA 32
## cluster5 2.27 0.19 1.60 0.45 0.74 NA NA 18
## cluster6 2.02 0.17 1.49 0.40 0.75 NA NA 7
## cluster7 1.81 0.14 1.70 0.45 0.75 NA NA 15
## cluster8 1.67 0.13 1.66 0.42 0.74 NA NA 4
## cluster9 1.56 0.11 1.98 0.52 0.75 NA NA 4
## cluster10 1.46 0.09 1.96 0.49 0.76 NA NA 4
Time for this code chunk to run: 58.76 seconds
The XB and AMS index favor a four or six cluster solution. Several
plots are available to describe fuzzy clustering of trajectories, see
Studer (2018) and the
fuzzyseqplot
function. Here, we use sequence index plot
ordered by membership strength, with typical trajectories of each
cluster represented at the top, leaving aside trajectories with low
membership. In this kind of graphic, hybrid trajectories are represented
at the bottom.
fuzzyseqplot(biofam.seq, group=bfclaraf$clustering$cluster4, type="I", sortv="membership", membership.threashold=0.4)
Time for this code chunk to run: 2.99 seconds