Equality test of unknown component distributions in K admixture models, with IBM approach

Test hypothesis on the unknown component of K (K > 1) admixture models using Inversion - Best Matching method. K-samples test of the unknown component distribution in admixture models using Inversion - Best Matching (IBM) method. Recall that we have K populations following admixture models, each one with probability density functions (pdf) l_k = p_k*f_k + (1-p_k)*g_k, where g_k is the known pdf and l_k corresponds to the observed sample. Perform the following hypothesis test: H0 : f_1 = ... = f_K against H1 : f_i differs from f_j (i different from j, and i,j in 1,...,K).

Usage

IBM_k_samples_test(
  samples = NULL,
  sim_U = NULL,
  n_sim_tab = 100,
  min_size = NULL,
  comp.dist = NULL,
  comp.param = NULL,
  conf.level = 0.95,
  tune.penalty = TRUE,
  parallel = FALSE,
  n_cpu = 2
)

Arguments

samples: A list of the K samples to be studied, all following admixture distributions.
sim_U: (default to NULL) Random draws of the inner convergence part of the contrast as defined in the IBM approach (see 'Details' below).
n_sim_tab: Number of simulated gaussian processes when tabulating the inner convergence distribution in the IBM approach.
min_size: (default to NULL) Useful to provide the minimal size among all samples (needed to take into account the correction factor for the variance-covariance assessment). Automatically calculated if NULL.
comp.dist: A list with 2*K elements corresponding to the component distributions (specified with R native names for these distributions) involved in the K admixture models. Elements, grouped by 2, refer to the unknown and known components of each admixture model, If there are unknown elements, they must be specified as 'NULL' objects. For instance, 'comp.dist' could be specified as follows with K = 3: list(f1 = NULL, g1 = 'norm', f2 = NULL, g2 = 'norm', f3 = NULL, g3 = 'rnorm').
comp.param: A list with 2*K elements corresponding to the parameters of the component distributions, each element being a list itself. The names used in this list must correspond to the native R argument names for these distributions. Elements, grouped by 2, refer to the parameters of unknown and known components of each admixture model. If there are unknown elements, they must be specified as 'NULL' objects. For instance, 'comp.param' could be specified as follows (with K = 3): list(f1 = NULL, g1 = list(mean=0,sd=1), f2 = NULL, g2 = list(mean=3,sd=1.1), f3 = NULL, g3 = list(mean=-2,sd=0.6)).
conf.level: The confidence level of the K-sample test.
tune.penalty: A boolean that allows to choose between a classical penalty term or an optimized penalty embedding some tuning parameters (automatically optimized). Optimized penalty is particularly useful for low sample size.
parallel: (default to FALSE) Boolean indicating whether parallel computations are performed.
n_cpu: (default to 2) Number of cores used when parallelizing.

Value

A list of ten elements, containing: 1) the rejection decision; 2) the test p-value; 3) the terms involved in the test statistic; 4) the test statistic value; 5) the selected rank (number of terms involved in the test statistic); 6) the value of the penalized test statistic; 7) a boolean indicating whether the applied penalty rule is that under the null H0; 8) the sorted contrast values; 9) the 95th-quantile of the contrast distribution; 10) the final terms of the statistic; and 11) the contrast matrix.

Details

See the paper at the following HAL weblink: https://hal.archives-ouvertes.fr/hal-03201760

Author

Xavier Milhaud xavier.milhaud.research@gmail.com

Examples

# \donttest{
####### Under the null hypothesis H0 (with K=3 populations):
## Specify the parameters of the mixture models for simulation:
list.comp <- list(f1 = "norm", g1 = "norm",
                  f2 = "norm", g2 = "norm",
                  f3 = "norm", g3 = "norm")
list.param <- list(f1 = list(mean = 0, sd = 1), g1 = list(mean = 2, sd = 0.7),
                   f2 = list(mean = 0, sd = 1), g2 = list(mean = 4, sd = 1.1),
                   f3 = list(mean = 0, sd = 1), g3 = list(mean = -3, sd = 0.8))
## Simulate the data:
sim1 <- rsimmix(n = 1000, unknownComp_weight = 0.8, comp.dist = list(list.comp$f1,list.comp$g1),
                comp.param = list(list.param$f1, list.param$g1))$mixt.data
sim2 <- rsimmix(n= 1300, unknownComp_weight = 0.6, comp.dist = list(list.comp$f2,list.comp$g2),
                comp.param = list(list.param$f2, list.param$g2))$mixt.data
sim3 <- rsimmix(n = 1100, unknownComp_weight = 0.7, comp.dist = list(list.comp$f3,list.comp$g3),
                comp.param = list(list.param$f3, list.param$g3))$mixt.data
## Back to the context of admixture models, where one mixture component is unknown:
list.comp <- list(f1 = NULL, g1 = "norm",
                  f2 = NULL, g2 = "norm",
                  f3 = NULL, g3 = "norm")
list.param <- list(f1 = NULL, g1 = list(mean = 2, sd = 0.7),
                   f2 = NULL, g2 = list(mean = 4, sd = 1.1),
                   f3 = NULL, g3 = list(mean = -3, sd = 0.8))
## Perform the 3-samples test:
IBM_k_samples_test(samples = list(sim1,sim2,sim3), sim_U= NULL, n_sim_tab = 20, min_size = NULL,
                   comp.dist = list.comp, comp.param = list.param, conf.level = 0.95,
                   tune.penalty = FALSE, parallel = FALSE, n_cpu = 2)
#> $confidence_level
#> [1] 0.95
#> 
#> $rejection_rule
#>   95% 
#> FALSE 
#> 
#> $p_value
#> [1] 0.2105263
#> 
#> $stat_name
#> [1] "d_2-3"
#> 
#> $test.stat
#> [1] 0.1201393
#> 
#> $stat_rank
#> [1] 1
#> 
#> $penalized_stat
#> [1]  -407.3  -814.5 -1221.7
#> 
#> $penalty_H0
#> [1] NA
#> 
#> $gamma
#> [1] NA
#> 
#> $tuning_constant
#> [1] NA
#> 
#> $ordered_contrasts
#> [1] 0.1201393 0.1494601 0.1637697
#> 
#> $quantiles
#> [1] 0.1924684
#> 
#> $sim_U
#>  [1] 0.05254498 0.05555592 0.14475051 0.06645817 0.06806092 0.08078858
#>  [7] 0.20854261 0.08672619 0.10221740 0.05579581 0.05931733 0.13072461
#> [13] 0.03181730 0.06787795 0.07049473 0.04018007 0.19068242 0.07359908
#> [19] 0.06079014
#> 
#> $stat_terms
#> [1] "d_2-3"                 "d_2-3 + d_1-2"         "d_2-3 + d_1-2 + d_1-3"
#> 
#> $contr_mat
#>      [,1]      [,2]      [,3]
#> [1,]   NA 0.1494601 0.1637697
#> [2,]   NA        NA 0.1201393
#> [3,]   NA        NA        NA
#> 

####### Now under the alternative H1:
list.comp <- list(f1 = "norm", g1 = "norm",
                  f2 = "norm", g2 = "norm",
                  f3 = "norm", g3 = "norm")
list.param <- list(f1 = list(mean = 0, sd = 1), g1 = list(mean = 2, sd = 0.7),
                   f2 = list(mean = 0, sd = 1), g2 = list(mean = 4, sd = 1.1),
                   f3 = list(mean = 2, sd = 0.7), g3 = list(mean = 3, sd = 0.8))
sim1 <- rsimmix(n = 3000, unknownComp_weight = 0.8, comp.dist = list(list.comp$f1,list.comp$g1),
                comp.param = list(list.param$f1, list.param$g1))$mixt.data
sim2 <- rsimmix(n= 3300, unknownComp_weight = 0.6, comp.dist = list(list.comp$f2,list.comp$g2),
                comp.param = list(list.param$f2, list.param$g2))$mixt.data
sim3 <- rsimmix(n = 3100, unknownComp_weight = 0.7, comp.dist = list(list.comp$f3,list.comp$g3),
                comp.param = list(list.param$f3, list.param$g3))$mixt.data
list.comp <- list(f1 = NULL, g1 = "norm",
                  f2 = NULL, g2 = "norm",
                  f3 = NULL, g3 = "norm")
list.param <- list(f1 = NULL, g1 = list(mean = 2, sd = 0.7),
                   f2 = NULL, g2 = list(mean = 4, sd = 1.1),
                   f3 = NULL, g3 = list(mean = 3, sd = 0.8))
IBM_k_samples_test(samples = list(sim1,sim2,sim3), sim_U= NULL, n_sim_tab = 20, min_size = NULL,
                   comp.dist = list.comp, comp.param = list.param, conf.level = 0.95,
                   tune.penalty = FALSE, parallel = FALSE, n_cpu = 2)
#> Warning: In 'IBM_estimProp': optimization algorithm was changed (in 'optim') from 'Nelder-Mead' to 'BFGS' to avoid the solution to explose.
#> Warning: In 'IBM_estimProp': optimization algorithm was changed (in 'optim') from 'Nelder-Mead' to 'BFGS' to avoid the solution to explose.
#> $confidence_level
#> [1] 0.95
#> 
#> $rejection_rule
#>   95% 
#> FALSE 
#> 
#> $p_value
#> [1] 0.95
#> 
#> $stat_name
#> [1] "d_1-2"
#> 
#> $test.stat
#> [1] 0.05252448
#> 
#> $stat_rank
#> [1] 1
#> 
#> $penalized_stat
#> [1] -1059.4 -2111.3 -3113.9
#> 
#> $penalty_H0
#> [1] NA
#> 
#> $gamma
#> [1] NA
#> 
#> $tuning_constant
#> [1] NA
#> 
#> $ordered_contrasts
#> [1]  0.05252448  7.66933917 56.79952153
#> 
#> $quantiles
#> [1] 0.1740053
#> 
#> $sim_U
#>  [1] 0.17143429 0.08327106 0.15787435 0.08710890 0.10564245 0.09969122
#>  [7] 0.13675376 0.15361418 0.13528019 0.15593465 0.11983569 0.05364253
#> [13] 0.09970056 0.04656998 0.15821645 0.22285547 0.15037667 0.12301412
#> [19] 0.15859610 0.08241896
#> 
#> $stat_terms
#> [1] "d_1-2"                 "d_1-2 + d_1-3"         "d_1-2 + d_1-3 + d_2-3"
#> 
#> $contr_mat
#>      [,1]       [,2]      [,3]
#> [1,]   NA 0.05252448  7.669339
#> [2,]   NA         NA 56.799522
#> [3,]   NA         NA        NA
#> 
# }