Clustering of unknown subpopulations in admixture models
Xavier Milhaud
Source:vignettes/admixtureclustering.Rmd
The clustering of populations following admixture models is, for now, based on the K-sample test theory. Consider \(K\) samples. For \(i = 1, \ldots, K\), sample \(X^{(i)} = (X_1^{(i)}, \ldots, X_{n_i}^{(i)})\) follows \[L_i(x) = p_i F_i(x) + (1 - p_i) G_i(x), \qquad x \in \mathbb{R}.\]
We still use the IBM approach to perform pairwise hypothesis testing. The idea is to adapt the K-sample test procedure to obtain a data-driven method that clusters the \(K\) populations into \(N\) subgroups, each characterized by a common unknown mixture component. The advantage of such an approach is twofold:
- the number \(N\) of clusters is automatically chosen by the procedure;
- each subgroup is validated by the K-sample testing method, which enjoys theoretical guarantees.
This clustering technique thus makes it possible to cluster unobserved subpopulations rather than individuals. We call this algorithm K-sample 2-component mixture clustering (K2MC).
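To make the admixture model concrete, here is a minimal sketch (base R only; the parameter values are illustrative and unrelated to the case study below) that simulates one admixed sample whose unknown component \(F\) is a Gamma distribution and whose known component \(G\) is an Exponential distribution:

```r
set.seed(1)
n <- 1000
p <- 0.7  # weight of the unknown component (illustrative value)
# For each observation, decide whether it is drawn from the unknown component F
from_unknown <- rbinom(n, size = 1, prob = p)
x <- ifelse(from_unknown == 1,
            rgamma(n, shape = 16, rate = 4),  # unknown component F
            rexp(n, rate = 1/3.5))            # known component G
```

Each observation of `x` then follows the two-component admixture law \(L(x) = p F(x) + (1-p) G(x)\).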
Algorithm
We now detail the steps of the algorithm.
- Initialization: create the first cluster to be filled, i.e. \(c = 1\). By convention, \(S_0 = \emptyset\).
- Select \(\{x,y\} = {\rm argmin}\{d_n(i,j);\ i \neq j \in S \setminus \bigcup_{k=1}^{c} S_{k-1}\}\).
- Test \(H_0\) between \(x\) and \(y\):
  If \(H_0\) is not rejected, then \(S_1 = \{x,y\}\);
  Else \(S_1 = \{x\}\), \(S_{c+1} = \{y\}\) and then \(c = c+1\).
- While \(S \setminus \bigcup_{k=1}^{c} S_k \neq \emptyset\) do
  Select \(u = {\rm argmin}\{d_n(i,j);\ i \in S_c,\ j \in S \setminus \bigcup_{k=1}^{c} S_k\}\);
  Test \(H_0\): the simultaneous equality of all the \(f_j\), \(j \in S_c \cup \{u\}\):
  If \(H_0\) is not rejected, then put \(S_c = S_c \cup \{u\}\);
  Else \(S_{c+1} = \{u\}\) and \(c = c+1\).
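The loop above can be sketched as follows. This is a simplified stand-in for the internals of `admix_clustering`, not the package code: `d` is a hypothetical \(K \times K\) symmetric matrix of pairwise discrepancies, and `test_H0(members)` is a placeholder for the k-sample test, returning `TRUE` when \(H_0\) (equal unknown components) is not rejected.

```r
# Minimal sketch of the K2MC clustering loop (assumptions: 'd' is a symmetric
# discrepancy matrix with zero diagonal; 'test_H0' is a user-supplied test).
k2mc <- function(d, test_H0) {
  K <- nrow(d)
  remaining <- seq_len(K)
  clusters <- list()
  while (length(remaining) > 0) {
    if (length(remaining) == 1) {  # a lone population forms its own cluster
      clusters <- c(clusters, list(remaining))
      break
    }
    # Open a new cluster with the closest remaining pair
    sub <- d[remaining, remaining, drop = FALSE]
    idx <- which(sub == min(sub[upper.tri(sub)]), arr.ind = TRUE)[1, ]
    pair <- remaining[idx]
    current <- if (test_H0(pair)) pair else pair[1]
    remaining <- setdiff(remaining, current)
    # Grow the cluster while the closest remaining population passes the test
    while (length(remaining) > 0) {
      dist_to_cluster <- sapply(remaining, function(j) min(d[current, j]))
      u <- remaining[which.min(dist_to_cluster)]
      if (test_H0(c(current, u))) {
        current <- c(current, u)
        remaining <- setdiff(remaining, u)
      } else break
    }
    clusters <- c(clusters, list(current))
  }
  clusters
}
```

For instance, with three populations where only 1 and 2 are close, `k2mc` returns two clusters: `{1, 2}` and `{3}`.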
Applications
On \(\mathbb{R}^+\)
We present a case study with 5 populations to cluster on \(\mathbb{R}^+\), using Gamma-Exponential and Gamma-Gamma mixtures.
## Simulate data (chosen parameters indicate 3 clusters (populations (1,3), (2,5) and 4)!):
list.comp <- list(f1 = "gamma", g1 = "exp",
                  f2 = "gamma", g2 = "exp",
                  f3 = "gamma", g3 = "gamma",
                  f4 = "exp", g4 = "exp",
                  f5 = "gamma", g5 = "exp")
list.param <- list(f1 = list(shape = 16, rate = 4), g1 = list(rate = 1/3.5),
                   f2 = list(shape = 14, rate = 2), g2 = list(rate = 1/5),
                   f3 = list(shape = 16, rate = 4), g3 = list(shape = 12, rate = 2),
                   f4 = list(rate = 1/2), g4 = list(rate = 1/7),
                   f5 = list(shape = 14, rate = 2), g5 = list(rate = 1/6))
A.sim <- rsimmix(n = 6000, unknownComp_weight = 0.8, comp.dist = list(list.comp$f1, list.comp$g1),
                 comp.param = list(list.param$f1, list.param$g1))$mixt.data
B.sim <- rsimmix(n = 6000, unknownComp_weight = 0.5, comp.dist = list(list.comp$f2, list.comp$g2),
                 comp.param = list(list.param$f2, list.param$g2))$mixt.data
C.sim <- rsimmix(n = 6000, unknownComp_weight = 0.65, comp.dist = list(list.comp$f3, list.comp$g3),
                 comp.param = list(list.param$f3, list.param$g3))$mixt.data
D.sim <- rsimmix(n = 6000, unknownComp_weight = 0.75, comp.dist = list(list.comp$f4, list.comp$g4),
                 comp.param = list(list.param$f4, list.param$g4))$mixt.data
E.sim <- rsimmix(n = 6000, unknownComp_weight = 0.55, comp.dist = list(list.comp$f5, list.comp$g5),
                 comp.param = list(list.param$f5, list.param$g5))$mixt.data
## Look for the clusters:
list.comp <- list(f1 = NULL, g1 = "exp",
                  f2 = NULL, g2 = "exp",
                  f3 = NULL, g3 = "gamma",
                  f4 = NULL, g4 = "exp",
                  f5 = NULL, g5 = "exp")
list.param <- list(f1 = NULL, g1 = list(rate = 1/3.5),
                   f2 = NULL, g2 = list(rate = 1/5),
                   f3 = NULL, g3 = list(shape = 12, rate = 2),
                   f4 = NULL, g4 = list(rate = 1/7),
                   f5 = NULL, g5 = list(rate = 1/6))
clusters <- admix_clustering(samples = list(A.sim, B.sim, C.sim, D.sim, E.sim), n_sim_tab = 10,
                             comp.dist = list.comp, comp.param = list.param, parallel = FALSE, n_cpu = 2)
clusters
#> Call:
#> admix_clustering(samples = list(A.sim, B.sim, C.sim, D.sim, E.sim),
#> n_sim_tab = 10, comp.dist = list.comp, comp.param = list.param,
#> parallel = FALSE, n_cpu = 2)
#>
#> The number of populations/samples under study is 5.
#> The level of the underlying k-sample testing procedure is set to 5%.
#>
#> The number of detected clusters in these populations equals 3.
#> The p-values of the k-sample tests (indicating when to close a cluster, i.e. p-value < 0.05) equal: 0.375, 0, 0, 0.714.
#>
#> The list of clusters with populations belonging to them (in numeric format, i.e. inside c()) :
#>  Cluster #1: vector of populations c(1, 3)
#>  Cluster #2: vector of populations 4
#>  Cluster #3: vector of populations c(2, 5)
#>
#> The list of estimated weights for the unknown component distributions in each detected cluster
#> (in the same format and order as listed populations for clusters just above) :
#>  estimated weights of the unknown component distributions for cluster 1 : c(0.794051468493906, 0.659049291333871)
#>  estimated weights of the unknown component distributions for cluster 2 : numeric(0)
#>  estimated weights of the unknown component distributions for cluster 3 : c(0.540877664013853, 0.630816555971802)
#>
#> The matrix giving the distances between populations, used in the clustering procedure through the k-sample tests:
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.00000000 25.68231805 0.03460288 6.365796 56.04598388
#> [2,] 25.68231805 0.00000000 18.56015255 4.106711 0.08503132
#> [3,] 0.03460288 18.56015255 0.00000000 64.039494 22.12674903
#> [4,] 6.36579558 4.10671077 64.03949391 0.000000 3.12620181
#> [5,] 56.04598388 0.08503132 22.12674903 3.126202 0.00000000
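The distance matrix explains the clustering order: the smallest off-diagonal entry identifies the first pair to be merged. As a check, we can rebuild the matrix from the (rounded) values printed above and locate that pair; `d` below is just a hand-copied version of the printed output, not a slot extracted from the clustering result.

```r
# Distance matrix copied (rounded) from the output above
d <- matrix(c( 0.000, 25.682,  0.035,  6.366, 56.046,
              25.682,  0.000, 18.560,  4.107,  0.085,
               0.035, 18.560,  0.000, 64.039, 22.127,
               6.366,  4.107, 64.039,  0.000,  3.126,
              56.046,  0.085, 22.127,  3.126,  0.000),
            nrow = 5, byrow = TRUE)
d[lower.tri(d, diag = TRUE)] <- NA  # keep each pair once, ignore the diagonal
closest <- which(d == min(d, na.rm = TRUE), arr.ind = TRUE)
closest  # row 1, col 3: populations 1 and 3 are merged first
```

This matches cluster #1 = (1, 3) in the output: populations 1 and 3 have by far the smallest discrepancy (0.035), while population 4 stays far from every other one.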