Clustering of K populations following admixture models

Creates clusters of the unknown components of K populations following admixture models. The procedure is based on the K-sample test using the Inversion - Best Matching (IBM) approach; see 'Details' below for further information.

Usage

admix_clustering(
  samples = NULL,
  n_sim_tab = 100,
  comp.dist = NULL,
  comp.param = NULL,
  tabul.dist = NULL,
  conf.level = 0.95,
  parallel = FALSE,
  n_cpu = 2
)

Arguments

samples: A list of the K observed samples to be clustered, all following admixture distributions.
n_sim_tab: Number of simulated Gaussian processes used in the tabulation of the inner convergence distribution in the IBM approach.
comp.dist: A list with 2*K elements corresponding to the component distributions (specified with R native names for these distributions) involved in the K admixture models. Elements, grouped by 2, refer to the unknown and known components of each admixture model. If there are unknown elements, they must be specified as 'NULL' objects. For instance, 'comp.dist' could be specified as follows (with K = 3): list(f1 = NULL, g1 = 'norm', f2 = NULL, g2 = 'norm', f3 = NULL, g3 = 'norm').
comp.param: A list with 2*K elements corresponding to the parameters of the component distributions, each element being a list itself. The names used in this list must correspond to the native R argument names for these distributions. Elements, grouped by 2, refer to the parameters of unknown and known components of each admixture model. If there are unknown elements, they must be specified as 'NULL' objects. For instance, 'comp.param' could be specified as follows (with K = 3): list(f1 = NULL, g1 = list(mean=0,sd=1), f2 = NULL, g2 = list(mean=3,sd=1.1), f3 = NULL, g3 = list(mean=-2,sd=0.6)).
tabul.dist: A list of the tabulated distributions of the stochastic integral for each cluster previously detected. Only useful when comparing the clusters detected at different confidence levels.
conf.level: The confidence level of the K-sample test used in the clustering procedure.
parallel: (defaults to FALSE) Boolean indicating whether computations are performed in parallel (speeds up the tabulation).
n_cpu: (defaults to 2) Number of cores used when parallelizing.
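
The 'comp.dist' and 'comp.param' lists described above can also be built programmatically when K is large. The following is a minimal base-R sketch, assuming every unknown component is left as NULL and that the element names f1, g1, ..., fK, gK follow the convention shown in the arguments; the helper vectors `known_dist` and `known_param` are hypothetical names introduced only for this illustration.

```r
# Number of populations (K = 3, as in the argument descriptions above).
K <- 3

# Known components: R native distribution names and their parameters
# (hypothetical helper objects, not part of the package).
known_dist  <- c("norm", "norm", "norm")
known_param <- list(list(mean = 0,  sd = 1),
                    list(mean = 3,  sd = 1.1),
                    list(mean = -2, sd = 0.6))

# Element names f1, g1, f2, g2, ..., fK, gK.
elt_names <- paste0(c("f", "g"), rep(seq_len(K), each = 2))

# vector("list", n) creates a list of n NULL elements, so the unknown
# components (odd positions) are already NULL.
comp.dist  <- vector("list", 2 * K)
comp.param <- vector("list", 2 * K)
names(comp.dist)  <- elt_names
names(comp.param) <- elt_names

# Fill the even positions with the known components.
known_idx <- seq(2, 2 * K, by = 2)
comp.dist[known_idx]  <- known_dist
comp.param[known_idx] <- known_param

str(comp.dist)
```

Preallocating with vector("list", 2 * K) is deliberate: assigning NULL to an existing list element with `$<-` or `[[<-` removes that element in R, which would silently shorten the list below 2*K elements.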

Value

A list with eight elements: 1) the number of populations under consideration; 2) the number of detected clusters; 3) the list of p-values for each test performed; 4) the cluster affiliation for each population; 5) the chosen confidence level of the statistical tests; 6) the cluster components; 7) the estimated weights of the unknown component distributions inside each cluster (recall that estimated weights are consistent only under the null hypothesis); 8) the matrix of pairwise discrepancies among all populations.

Details

See the paper at the following HAL weblink: https://hal.archives-ouvertes.fr/hal-03201760
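
As a reminder of what an admixture sample looks like, here is a minimal base-R sketch. It assumes the usual two-component form, in which each observation comes from the unknown component with probability p and from the known component otherwise. The parameter values mirror population 1 of the Examples (weight 0.8, unknown gamma, known exponential); this is only an illustration and is not claimed to reproduce how the package's own simulation function works internally.

```r
set.seed(1)

n <- 2600   # sample size
p <- 0.8    # weight of the unknown component

# Latent labels: TRUE when the observation comes from the unknown component.
from_unknown <- rbinom(n, size = 1, prob = p) == 1

# Draw each observation from the component selected by its label.
x <- numeric(n)
x[from_unknown]  <- rgamma(sum(from_unknown),  shape = 16, rate = 4)
x[!from_unknown] <- rexp(sum(!from_unknown), rate = 1 / 3.5)

# Empirical proportion drawn from the unknown component (close to p).
mean(from_unknown)
```

In practice only `x` is observed: the labels, the weight p and the unknown component are not, and the clustering procedure groups populations whose unknown components are judged equal.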

Author

Xavier Milhaud xavier.milhaud.research@gmail.com

Examples

# \donttest{
## Simulate data (the chosen parameters imply 2 clusters: populations (1,3) and (2,4)):
list.comp <- list(f1 = "gamma", g1 = "exp",
                  f2 = "gamma", g2 = "exp",
                  f3 = "gamma", g3 = "gamma",
                  f4 = "gamma", g4 = "exp")
list.param <- list(f1 = list(shape = 16, rate = 4), g1 = list(rate = 1/3.5),
                   f2 = list(shape = 14, rate = 2), g2 = list(rate = 1/5),
                   f3 = list(shape = 16, rate = 4), g3 = list(shape = 12, rate = 2),
                   f4 = list(shape = 14, rate = 2), g4 = list(rate = 1/7))
A.sim <- rsimmix(n=2600, unknownComp_weight=0.8, comp.dist = list(list.comp$f1,list.comp$g1),
                 comp.param = list(list.param$f1, list.param$g1))$mixt.data
B.sim <- rsimmix(n=3000, unknownComp_weight=0.7, comp.dist = list(list.comp$f2,list.comp$g2),
                 comp.param = list(list.param$f2, list.param$g2))$mixt.data
C.sim <- rsimmix(n=3500, unknownComp_weight=0.6, comp.dist = list(list.comp$f3,list.comp$g3),
                 comp.param = list(list.param$f3, list.param$g3))$mixt.data
D.sim <- rsimmix(n=4800, unknownComp_weight=0.5, comp.dist = list(list.comp$f4,list.comp$g4),
                 comp.param = list(list.param$f4, list.param$g4))$mixt.data
## Look for the clusters:
list.comp <- list(f1 = NULL, g1 = "exp",
                  f2 = NULL, g2 = "exp",
                  f3 = NULL, g3 = "gamma",
                  f4 = NULL, g4 = "exp")
list.param <- list(f1 = NULL, g1 = list(rate = 1/3.5),
                   f2 = NULL, g2 = list(rate = 1/5),
                   f3 = NULL, g3 = list(shape = 12, rate = 2),
                   f4 = NULL, g4 = list(rate = 1/7))
clusters <- admix_clustering(samples = list(A.sim,B.sim,C.sim,D.sim), n_sim_tab = 8,
                             comp.dist=list.comp, comp.param=list.param, conf.level = 0.95,
                             parallel = FALSE, n_cpu = 2)
#> Warning: In 'IBM_estimProp': optimization algorithm was changed (in 'optim') from 'Nelder-Mead' to 'BFGS' to avoid the solution to explose.
#> Warning: In 'IBM_estimProp': optimization algorithm was changed (in 'optim') from 'Nelder-Mead' to 'BFGS' to avoid the solution to explose.
#> Warning: In 'IBM_estimProp': optimization algorithm was changed (in 'optim') from 'Nelder-Mead' to 'BFGS' to avoid the solution to explose.
#> Warning: In 'IBM_estimProp': optimization algorithm was changed (in 'optim') from 'Nelder-Mead' to 'BFGS' to avoid the solution to explose.
#> 
clusters
#> Call:
#> admix_clustering(samples = list(A.sim, B.sim, C.sim, D.sim), 
#>     n_sim_tab = 8, comp.dist = list.comp, comp.param = list.param, 
#>     conf.level = 0.95, parallel = FALSE, n_cpu = 2)
#> 
#> The number of populations/samples under study is 4.
#> The level of the underlying k-sample testing procedure is set to 5%.
#> 
#> The number of detected clusters in these populations equals 2.
#> The p-values of the k-sample tests (showing when to close the clusters (i.e. p-value < 0.05) equal: 1, 0, 0.833.
#> 
#> The list of clusters with populations belonging to them (in numeric format, i.e. inside c()) :
#>    - Cluster #1: vector of populations c(2, 4)
#>   - Cluster #2: vector of populations c(1, 3)
#> 
#> The list of estimated weights for the unknown component distributions in each detected cluster
#>       (in the same format and order as listed populations for clusters just above) :
#>    - estimated weights of the unknown component distributions for cluster  1 :  c(0.673500025546141, 0.504929412020502)
#>   - estimated weights of the unknown component distributions for cluster  2 :  c(0.803192448969367, 0.606087809193469)
#> 
#> The matrix giving the distances between populations, used in the clustering procedure through the k-sample tests:
#>             [,1]        [,2]       [,3]        [,4]
#> [1,]  0.00000000 12.85530202 0.04038172 25.60197996
#> [2,] 12.85530202  0.00000000 8.29069783  0.01455913
#> [3,]  0.04038172  8.29069783 0.00000000  6.02351512
#> [4,] 25.60197996  0.01455913 6.02351512  0.00000000
# }
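
To help read the discrepancy matrix printed above, the following base-R sketch re-enters the printed values by hand and locates the closest pair of populations. It is only a reading aid under the assumption that smaller discrepancies indicate more similar unknown components; it is not the package's clustering algorithm, which relies on k-sample tests.

```r
# Pairwise discrepancies copied from the printed output above.
d <- matrix(c( 0.00000000, 12.85530202, 0.04038172, 25.60197996,
              12.85530202,  0.00000000, 8.29069783,  0.01455913,
               0.04038172,  8.29069783, 0.00000000,  6.02351512,
              25.60197996,  0.01455913, 6.02351512,  0.00000000),
            nrow = 4, byrow = TRUE)

# Ignore the diagonal (each population is at distance 0 from itself).
off_diag <- d
diag(off_diag) <- Inf

# Row/column indices of the smallest off-diagonal entry.
closest <- which(off_diag == min(off_diag), arr.ind = TRUE)[1, ]
sort(unname(closest))
```

The closest pair is populations 2 and 4, which matches Cluster #1 in the printed result; the next smallest discrepancy, between populations 1 and 3, matches Cluster #2.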