This document provides an introduction to the R
code to
create typologies of trajectories using fuzzy and property-based
clustering in sequence analysis using the cluster
and
WeightedCluster R
library (Studer
2013). Readers interested by the methods and looking for more
information on the interpretation of the results are referred to:
You are kindly asked to cite the above reference if you use the methods presented in this document.
We start by loading the required library and setting the seed.
Time for this code chunk to run: 0.08 seconds
We rely on the biofam
dataset to illustrate the use of
the WeightedCluster
package. This public dataset is
distributed with the TraMineR R
package and code family
trajectories of a subsample of the Swiss Household Panel (SHP Group 2024).
First, we need to prepare the data by creating a state sequence
object using the seqdef
command (Gabadinho et al. 2011). This object stores all
the information about our trajectories, including the data and
associated characteristics, such as state labels. During this step, we
further define proper state labels as the original data are coded using
numerical values. We then plot the sequences using, for instance, a
chronogram.
data(biofam) #load illustrative data
## Defining the new state labels
statelab <- c("Parent", "Left", "Married", "Left/Married", "Child",
"Left/Child", "Left/Married/Child", "Divorced")
## Creating the state sequence object,
biofam.seq <- seqdef(biofam[,10:25], alphabet=0:7, states=statelab)
seqdplot(biofam.seq, legend.prop=0.2)
Time for this code chunk to run: 0.69 seconds
The clustering algorithms discussed here relies on a distance matrix.
We compute one using the "LCS"
distance.
Time for this code chunk to run: 1.37 seconds
In fuzzy clustering, the sequences belong to each cluster with an
estimated membership strength. This approach is particularly useful when
some sequences are thought to lie in between two (or more) types of
sequences (i.e., hybrid-type sequences) or when only weak structure is
found in the data. Here, we use the fanny
fuzzy clustering
algorithm Maechler et al. (2024), which is
available in the cluster R
library. It is used as
follows:
Time for this code chunk to run: 34.46 seconds
The fanny
function requires us to specify the distance
matrix (our diss
object), the number of groups (argument
k=5
), diss=TRUE
specifies that we directly
provided a distance matrix, and memb.exp=1.5
the fuzziness
parameters (values between 1.5 and 2 are often recommended).
Descriptive statistics of membership strength to each cluster provide useful information. The mean can be interpreted as a relative frequency of each cluster if sequences are weighted according to their membership strength.
## V1 V2 V3 V4
## Min. :0.008774 Min. :0.001817 Min. :0.007352 Min. :0.005009
## 1st Qu.:0.077276 1st Qu.:0.019741 1st Qu.:0.039166 1st Qu.:0.044493
## Median :0.140503 Median :0.053033 Median :0.055403 Median :0.091802
## Mean :0.205773 Mean :0.211530 Mean :0.210021 Mean :0.178295
## 3rd Qu.:0.286928 3rd Qu.:0.298238 3rd Qu.:0.210623 3rd Qu.:0.234374
## Max. :0.778559 Max. :0.956435 Max. :0.981889 Max. :0.820392
## V5
## Min. :0.002511
## 1st Qu.:0.034931
## Median :0.079874
## Mean :0.194380
## 3rd Qu.:0.212156
## Max. :0.895185
Time for this code chunk to run: 0.03 seconds
A graphical representation of the typology can be obtained by
weighting each sequences according to its membership strength (Studer 2018). Using this strategy, one can
represent distribution plots (see the seqplot
function of
the TraMineR
package). The fuzzyseqplot
function from the WeightedCluster
can be used to do so.
## Displaying the resulting clustering with membership threshold of 0.4
fuzzyseqplot(biofam.seq, group=fclust$membership, type="d")
Time for this code chunk to run: 0.29 seconds
Following a similar strategy, one can further order the sequences
according to the membership strength (argument
sortv="membership"
) in an index plot (argument
type="I"
) and remove sequences with low membership
strengths (argument membership.threashold =0.4
). In this
plot, each sequence is represented using a thin line. The sequences at
the top are the one with the highest membership strength. They therefore
describe the best the full cluster membership.
## Displaying the resulting clustering with membership threshold of 0.4
fuzzyseqplot(biofam.seq, group=fclust$membership, type="I", membership.threashold =0.4, sortv="membership")
Time for this code chunk to run: 1.76 seconds ## Regression models
Fuzzy clustering membership matrix can be directly included in subsequent regression as an independent variable. To use it as a dependent variable, you mus use Dirichlet or Beta regression (see the full article for more explanation).
Dirichlet regressions are available in the DirichletReg
package (Maier 2021). Here is a sample
regression with sex and birth year as independent variables.
## Loading required package: Formula
##Estimation of Dirichlet Regression
##Dependent variable formatting
fmember <- DR_data(fclust$membership)
## Estimation
bdirig <- DirichReg(fmember~sex+birthyr|1, data=biofam, model="alternative")
## Displaying results of Dirichlet regression.
summary(bdirig)
## Call:
## DirichReg(formula = fmember ~ sex + birthyr | 1, data = biofam, model =
## "alternative")
##
## Standardized Residuals:
## Min 1Q Median 3Q Max
## v1 -1.1019 -0.7502 -0.4404 0.2699 2.7339
## v2 -1.0543 -0.8090 -0.6264 0.5368 5.4646
## v3 -1.4116 -0.8212 -0.6344 0.0549 4.9510
## v4 -1.0227 -0.7999 -0.5514 0.1737 3.3375
## v5 -1.0592 -0.7934 -0.5392 0.0987 3.9312
##
## MEAN MODELS:
## ------------------------------------------------------------------
## Coefficients for variable no. 1: v1
## - variable omitted (reference category) -
## ------------------------------------------------------------------
## Coefficients for variable no. 2: v2
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -35.706127 5.156757 -6.924 4.39e-12 ***
## sexwoman 0.034515 0.054809 0.630 0.529
## birthyr 0.018218 0.002654 6.864 6.72e-12 ***
## ------------------------------------------------------------------
## Coefficients for variable no. 3: v3
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 31.472422 4.975613 6.325 2.53e-10 ***
## sexwoman -0.144001 0.054029 -2.665 0.00769 **
## birthyr -0.016240 0.002562 -6.338 2.32e-10 ***
## ------------------------------------------------------------------
## Coefficients for variable no. 4: v4
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.819906 4.962915 -2.180 0.0292 *
## sexwoman 0.061080 0.054263 1.126 0.2603
## birthyr 0.005476 0.002555 2.143 0.0321 *
## ------------------------------------------------------------------
## Coefficients for variable no. 5: v5
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -15.239307 4.929080 -3.092 0.001990 **
## sexwoman 0.190014 0.054565 3.482 0.000497 ***
## birthyr 0.007686 0.002537 3.029 0.002452 **
## ------------------------------------------------------------------
##
## PRECISION MODEL:
## ------------------------------------------------------------------
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.14238 0.01312 87.1 <2e-16 ***
## ------------------------------------------------------------------
## Significance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-likelihood: 7438 on 13 df (247 BFGS + 2 NR Iterations)
## AIC: -14850, BIC: -14777
## Number of Observations: 2000
## Links: Logit (Means) and Log (Precision)
## Parametrization: alternative
Time for this code chunk to run: 7.63 seconds
Beta regression is available in the betareg
package
(Zeileis et al. 2024). To estimate a beta
regression for the third fuzzy type, one can use.
library(betareg)
## Estimation of beta regression
breg1 <- betareg(fclust$membership[, 3]~sex+birthyr, data=biofam)
## Displaying results
summary(breg1)
##
## Call:
## betareg(formula = fclust$membership[, 3] ~ sex + birthyr, data = biofam)
##
## Quantile residuals:
## Min 1Q Median 3Q Max
## -1.8110 -0.6738 -0.3801 0.0635 2.7888
##
## Coefficients (mean model with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 44.527815 4.881684 9.121 < 2e-16 ***
## sexwoman -0.184664 0.052269 -3.533 0.000411 ***
## birthyr -0.023337 0.002514 -9.284 < 2e-16 ***
##
## Phi coefficients (precision model with identity link):
## Estimate Std. Error z value Pr(>|z|)
## (phi) 1.47522 0.04207 35.07 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Type of estimator: ML (maximum likelihood)
## Log-likelihood: 1163 on 4 Df
## Pseudo R-squared: 0.07331
## Number of iterations: 19 (BFGS) + 4 (Fisher scoring)
Time for this code chunk to run: 1.89 seconds
The aim of property-based clustering is to build a sequence typology by identifying well-defined clustering rules that are based on the most relevant properties of the analyzed object. Here, we present the algorithm discussed in (Studer 2018), an extension of the algorithms of Chavent, Lechevallier, and Briant (2008) and Piccarreta and Billari (2007).
Property-based clustering works in two steps. First, sequence properties are extracted. Second, the clustering is created based on the extracted properties. The following table presents the states sequences properties that can be extracted.
Name | Description |
---|---|
"state" |
The state in which an individual is found, at each time position \(t\). |
"spell.age" |
The age at the beginning of each spell of a given type. |
"spell.dur" |
The duration of each of the spells presented above. |
"duration" |
The total time spent in each state. |
"pattern" |
Count of the frequent subsequences of states in the DSS. |
"AFpattern" |
Age at the first occurrence of the above frequent subsequence. |
"transition" |
Count of the frequent subsequence of events in each sequence, where each transition is considered another event. |
"AFtransition" |
Age at the first occurrence of the above frequent subsequence. |
"Complexity" |
Complexity index, number of transitions, turbulence. |
Second, the clustering algorithm is run. In R
, the two
steps are regrouped in the same function seqpropclust
. The
first argument specifies the state sequence object. Then, the
diss
argument specifies the distance matrix,
maxcluster
the maximum number of clusters to consider, and
properties
the list of properties to consider using the
names in the above table.
## [>] Extracting 'state' properties...OK (16 properties extracted)
## [>] Extracting 'duration' properties...OK (9 properties extracted)
## [>] 25 properties extracted.
## Dissimilarity tree:
## Parameters: min.size=20, max.depth=5, R=1, pval=1
## Formula: seqdata ~ state.a15 + state.a16 + state.a17 + state.a18 + state.a19 + state.a20 + state.a21 + state.a22 + state.a23 + state.a24 + state.a25 + state.a26 + state.a27 + state.a28 + state.a29 + state.a30 + duration.Parent + duration.Left + duration.Married + `duration.Left/Married` + duration.Child + `duration.Left/Child` + `duration.Left/Married/Child` + duration.Divorced + `duration.*`
## Global R2: 0.48761
##
## Fitted tree:
##
## |-- Root (n: 2000 disc: 8.0991)
## |-> duration.Left 0.21436
## |-- <= 3 (n: 1335 disc: 6.8008)
## |-> duration.Left/Married/Child 0.22773
## |-- <= 3 (n: 860 disc: 6.0596)
## |-> state.a26 0.2375
## |-- [ Parent ] (n: 386 disc: 2.2151)[(Parent,16)] *
## |-- [ Left,Married,Left/Married,Child,Left/Child,Left/Married/Child,Divorced ] (n: 474 disc: 6.5793)
## |-> duration.Married 0.35945
## |-- <= 2 (n: 309 disc: 4.9034)[(Parent,9)-(Left/Married,7)] *
## |-- > 2 (n: 165 disc: 2.9239)[(Parent,8)-(Married,8)] *
## |-- > 3 (n: 475 disc: 3.7901)[(Parent,8)-(Left/Married,1)-(Left/Married/Child,7)] *
## |-- > 3 (n: 665 disc: 5.4841)[(Parent,6)-(Left,9)-(Left/Married,1)] *
Time for this code chunk to run: 1.59 seconds
If GraphViz is installed on the computer, one can use
seqtreedisplay
for a graphical representation of the
results (not presented here).
Time for this code chunk to run: 0 seconds
The cluster quality indices can be computed using the
as.clustrange()
function as for hierarchical clustering
algorithms.
## PBC HG HGSD ASW ASWw CH R2 CHsq R2sq HC
## cluster2 0.50 0.61 0.59 0.36 0.36 545.15 0.21 1108.58 0.36 0.19
## cluster3 0.54 0.69 0.67 0.35 0.35 518.98 0.34 1100.68 0.52 0.16
## cluster4 0.54 0.73 0.71 0.36 0.36 478.65 0.42 1008.63 0.60 0.15
## cluster5 0.58 0.80 0.78 0.41 0.41 474.63 0.49 1122.81 0.69 0.12
Time for this code chunk to run: 0.69 seconds
The clustering solution can then be accessed and plotted using the same procedure as for any cluster algorithms, e.g.:
Time for this code chunk to run: 0.21 seconds