Two-sample test of the unknown component distribution in admixture models using the Inversion - Best Matching (IBM) method. Recall that we have two admixture models with respective probability density functions (pdf) l1 = p1 f1 + (1-p1) g1 and l2 = p2 f2 + (1-p2) g2, where g1 and g2 are known pdfs and l1 and l2 are observed. The function performs the following hypothesis test: H0 : f1 = f2 versus H1 : f1 differs from f2.
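To make the model above concrete, here is a minimal base-R sketch of drawing data from an admixture l = p f + (1-p) g with Gaussian components. The helper name r_admix is hypothetical and is not part of the package; the weights and parameters simply mirror those used in the Examples section.

```r
# Hypothetical helper (not a package function): draw n observations from
# l = p * f + (1 - p) * g, with f = N(f_mean, f_sd) and g = N(g_mean, g_sd).
r_admix <- function(n, p, f_mean, f_sd, g_mean, g_sd) {
  # TRUE when the observation comes from the unknown component f
  from_f <- rbinom(n, size = 1, prob = p) == 1
  x <- numeric(n)
  x[from_f]  <- rnorm(sum(from_f),  mean = f_mean, sd = f_sd)
  x[!from_f] <- rnorm(sum(!from_f), mean = g_mean, sd = g_sd)
  x
}

set.seed(1)
# Same unknown component f1 = f2 = N(1, 1), i.e. data generated under H0
x1 <- r_admix(1500, p = 0.6, f_mean = 1, f_sd = 1, g_mean = 2, g_sd = 0.7)
x2 <- r_admix(1400, p = 0.5, f_mean = 1, f_sd = 1, g_mean = 3, g_sd = 1.2)
```

In practice only x1, x2 and the known components g1 and g2 would be available to the test; the weights p1, p2 and the component f are what remain unknown.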

IBM_test_H0(
  sample1,
  sample2,
  known.p = NULL,
  comp.dist = NULL,
  comp.param = NULL,
  sim_U = NULL,
  min_size = NULL,
  parallel = FALSE,
  n_cpu = 4
)

Arguments

sample1

Observations of the first sample under study.

sample2

Observations of the second sample under study.

known.p

(defaults to NULL) Numeric vector with two elements, the known (true) mixture weights.

comp.dist

A list with four elements corresponding to the component distributions (specified with R native names for these distributions) involved in the two admixture models. The first two elements refer to the unknown and known components of the first admixture model, and the last two to those of the second admixture model. Unknown elements must be specified as 'NULL' objects. For instance, 'comp.dist' could be specified as follows: list(f1=NULL, g1='norm', f2=NULL, g2='norm').

comp.param

A list with four elements corresponding to the parameters of the component distributions, each element being a list itself. The names used in this list must correspond to the native R argument names for these distributions. The first two elements refer to the parameters of the unknown and known components of the first admixture model, and the last two to those of the second admixture model. Unknown elements must be specified as 'NULL' objects. For instance, 'comp.param' could be specified as follows: list(f1=NULL, g1=list(mean=0,sd=1), f2=NULL, g2=list(mean=3,sd=1.1)).

sim_U

Random draws of the inner convergence part of the contrast as defined in the IBM approach (see 'Details' below).

min_size

(defaults to NULL) Minimal size among all samples, to be provided in the k-sample case. It is not needed otherwise and can be left as NULL.

parallel

(defaults to FALSE) Boolean indicating whether computations are performed in parallel (speeds up the tabulation).

n_cpu

(defaults to 4, as in the usage above) Number of cores used when parallelizing.
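As an illustration of how the 'comp.dist' names and 'comp.param' argument names line up with R native functions, the following base-R sketch evaluates the known component densities from such lists. The helper known_density is hypothetical and only shows the naming convention; it is not claimed to be how the package resolves these arguments internally.

```r
# Unknown components are given as NULL; known ones by their R native name
comp.dist  <- list(f1 = NULL, g1 = "norm", f2 = NULL, g2 = "norm")
comp.param <- list(f1 = NULL, g1 = list(mean = 0, sd = 1),
                   f2 = NULL, g2 = list(mean = 3, sd = 1.1))

# Hypothetical helper: evaluate a known component density at x by prefixing
# the distribution name with "d" (e.g. "norm" -> dnorm) and passing the
# parameters under their native argument names.
known_density <- function(x, dist, param) {
  do.call(paste0("d", dist), c(list(x), param))
}

d_g1 <- known_density(0, comp.dist$g1, comp.param$g1)  # dnorm(0, mean = 0, sd = 1)
d_g2 <- known_density(3, comp.dist$g2, comp.param$g2)  # dnorm(3, mean = 3, sd = 1.1)
```

A misspelled parameter name (for example 'sigma' instead of 'sd' for "norm") would make such a call fail, which is why the names must match the native R argument names.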

Value

A list of four elements containing: 1) the test statistic value; 2) the rejection decision; 3) the p-value of the test; and 4) the estimated weights of the unknown component for each of the two admixture models.
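The sketch below shows one way to read such a result. The object res is a hand-built stand-in (not an actual function call) whose element names and values are copied from the printed output in the Examples section; interpreting a FALSE decision as "H0 not rejected" is an assumption based on the description of the second element as the rejection decision.

```r
# Stand-in for the returned list, mirroring the printed example output
res <- list(
  test.stat = 0.2385836,
  decision  = c("95%" = FALSE),
  p_val     = 0.6,
  weights   = c(0.6442925, 0.5341394)
)

# Assumed reading: TRUE means H0 (f1 = f2) is rejected at the named level
rejected <- isTRUE(unname(res$decision))
verdict  <- if (rejected) "reject H0" else "do not reject H0"

# Estimated weights of the unknown component in each admixture model
p1_hat <- res$weights[1]
p2_hat <- res$weights[2]
```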

Details

See the paper presenting the IBM approach at the following HAL weblink: https://hal.archives-ouvertes.fr/hal-03201760

Author

Xavier Milhaud xavier.milhaud.research@gmail.com

Examples

####### Under the null hypothesis H0 :
## Simulate data:
list.comp <- list(f1 = "norm", g1 = "norm",
                  f2 = "norm", g2 = "norm")
list.param <- list(f1 = list(mean = 1, sd = 1), g1 = list(mean = 2, sd = 0.7),
                   f2 = list(mean = 1, sd = 1), g2 = list(mean = 3, sd = 1.2))
X.sim <- rsimmix(n = 1500, unknownComp_weight = 0.6,
                 comp.dist = list(list.comp$f1, list.comp$g1),
                 comp.param = list(list.param$f1, list.param$g1))$mixt.data
Y.sim <- rsimmix(n = 1400, unknownComp_weight = 0.5,
                 comp.dist = list(list.comp$f2, list.comp$g2),
                 comp.param = list(list.param$f2, list.param$g2))$mixt.data
## Tabulate the inner convergence part of the contrast distribution:
list.comp <- list(f1 = NULL, g1 = "norm",
                  f2 = NULL, g2 = "norm")
list.param <- list(f1 = NULL, g1 = list(mean = 2, sd = 0.7),
                   f2 = NULL, g2 = list(mean = 3, sd = 1.2))
U <- IBM_tabul_stochasticInteg(n.sim = 8, n.varCovMat = 100,
                               sample1 = X.sim, sample2 = Y.sim, min_size = NULL,
                               comp.dist = list.comp, comp.param = list.param,
                               parallel = TRUE, n_cpu = 2)
## Simulate new data that will allow to perform the test:
list.comp <- list(f1 = "norm", g1 = "norm",
                  f2 = "norm", g2 = "norm")
list.param <- list(f1 = list(mean = 1, sd = 1), g1 = list(mean = 2, sd = 0.7),
                   f2 = list(mean = 1, sd = 1), g2 = list(mean = 3, sd = 1.2))
X.sim <- rsimmix(n = 1500, unknownComp_weight = 0.6,
                 comp.dist = list(list.comp$f1, list.comp$g1),
                 comp.param = list(list.param$f1, list.param$g1))$mixt.data
Y.sim <- rsimmix(n = 1400, unknownComp_weight = 0.5,
                 comp.dist = list(list.comp$f2, list.comp$g2),
                 comp.param = list(list.param$f2, list.param$g2))$mixt.data
## Perform the test, declaring the unknown components as NULL:
list.comp <- list(f1 = NULL, g1 = "norm",
                  f2 = NULL, g2 = "norm")
list.param <- list(f1 = NULL, g1 = list(mean = 2, sd = 0.7),
                   f2 = NULL, g2 = list(mean = 3, sd = 1.2))
IBM_test_H0(sample1 = X.sim, sample2 = Y.sim, known.p = NULL,
            comp.dist = list.comp, comp.param = list.param,
            sim_U = U[["U_sim"]], min_size = NULL,
            parallel = TRUE, n_cpu = 2)
#> $test.stat
#> [1] 0.2385836
#>
#> $decision
#>   95%
#> FALSE
#>
#> $p_val
#> [1] 0.6
#>
#> $weights
#> [1] 0.6442925 0.5341394
#>