KMERF
class hyppo.independence.KMERF(forest='regressor', ntrees=500, **kwargs)
Kernel Mean Embedding Random Forest (KMERF) test statistic and p-value.
The KMERF test statistic is a kernel method for testing independence that uses a random-forest-induced similarity matrix as its input. It has been shown to achieve especially high gains in finite-sample testing power in high-dimensional settings [1].
A more detailed description of KMERF can be found in [1]. It is computed using the following steps:
Let x and y be (n,p) and (n,1) samples of random variables X and Y.
Run random forest with $m$ trees. Independent bootstrap samples of size $n_b \le n$ are drawn to build each tree; each tree structure within the forest is denoted $\phi_w \in \mathbf{P}$, $w \in \{1, \ldots, m\}$, and $\phi_w(x_i)$ denotes the partition assigned to $x_i$.
Calculate the proximity kernel:
$$K^x_{ij} = \frac{1}{m} \sum_{w=1}^{m} I(\phi_w(x_i) = \phi_w(x_j))$$
where $I(\cdot)$ is the indicator function, so that $K^x_{ij}$ records how often two observations lie in the same partition.
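Compute the induced kernel correlation: Let
$$L^x_{ij} = \begin{cases} K^x_{ij} - \frac{1}{n-2} \sum_{t=1}^{n} K^x_{it} - \frac{1}{n-2} \sum_{s=1}^{n} K^x_{sj} + \frac{1}{(n-1)(n-2)} \sum_{s,t=1}^{n} K^x_{st} & \text{when } i \neq j \\ 0 & \text{otherwise} \end{cases}$$
Then let $K^y$ be the Euclidean distance induced kernel, and similarly compute $L^y$ from $K^y$. The unbiased kernel correlation equals
$$\mathrm{KMERF}_n(x, y) = \frac{1}{n(n-3)} \mathrm{tr}(L^x L^y)$$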
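The steps above can be sketched in a few lines of NumPy and scikit-learn. This is a minimal illustration of the formulas as written here, not hyppo's implementation, which may differ (for example in normalization). The function names are hypothetical, and the choice of $K^y$ as the largest pairwise distance minus each pairwise distance is an assumption about what "Euclidean distance induced kernel" means.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def proximity_kernel(forest, x):
    """Fraction of trees in which each pair of samples lands in the same leaf."""
    leaves = forest.apply(x)  # shape (n, m): leaf index of every sample in every tree
    # Compare leaf indices pairwise, then average the matches over the m trees.
    return (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)


def center_unbiased(k):
    """Centering from the definition of L: zero diagonal, corrected row/column means."""
    n = k.shape[0]
    row = k.sum(axis=1, keepdims=True) / (n - 2)   # (1/(n-2)) * sum_t K_it
    col = k.sum(axis=0, keepdims=True) / (n - 2)   # (1/(n-2)) * sum_s K_sj
    total = k.sum() / ((n - 1) * (n - 2))          # (1/((n-1)(n-2))) * sum_st K_st
    centered = k - row - col + total
    np.fill_diagonal(centered, 0.0)
    return centered


def kmerf_statistic_sketch(x, y, ntrees=50, random_state=0):
    """tr(Lx Ly) / (n (n - 3)) for x of shape (n, p) and y of shape (n, 1)."""
    n = x.shape[0]
    forest = RandomForestRegressor(n_estimators=ntrees, random_state=random_state)
    forest.fit(x, y.ravel())
    kx = proximity_kernel(forest, x)
    dy = np.abs(y - y.T)   # Euclidean distance between scalar responses
    ky = dy.max() - dy     # assumed distance-induced kernel
    lx, ly = center_unbiased(kx), center_unbiased(ky)
    return np.trace(lx @ ly) / (n * (n - 3))


rng = np.random.default_rng(0)
x = rng.normal(size=(60, 3))
y = x[:, :1] + 0.1 * rng.normal(size=(60, 1))
print(kmerf_statistic_sketch(x, y))
```

The pairwise leaf comparison allocates an (n, n, m) boolean array, which is fine for small inputs but would need a per-tree loop for large n or many trees.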
The p-value returned is calculated using a permutation test via hyppo.tools.perm_test.
Parameters
forest ("regressor", "classifier", default: "regressor") -- Type of forest used when running the independence test. If the y input in test is categorical, use the "classifier" keyword.
ntrees (int, default: 500) -- The number of trees used in the random forest.
**kwargs -- Additional arguments used for the forest (see sklearn.ensemble.RandomForestClassifier or sklearn.ensemble.RandomForestRegressor).
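To make the roles of forest, ntrees, and **kwargs concrete, here is a small sketch of how such arguments could be turned into a scikit-learn estimator. The helper make_forest and the FOREST_TYPES table are hypothetical names used only for illustration; hyppo's internals may be organized differently.

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Hypothetical lookup from the `forest` keyword to a scikit-learn estimator class.
FOREST_TYPES = {
    "classifier": RandomForestClassifier,
    "regressor": RandomForestRegressor,
}


def make_forest(forest="regressor", ntrees=500, **kwargs):
    """Build the forest; extra keyword arguments go straight to scikit-learn."""
    if forest not in FOREST_TYPES:
        raise ValueError(f"forest must be one of {sorted(FOREST_TYPES)}")
    return FOREST_TYPES[forest](n_estimators=ntrees, **kwargs)


# Categorical y: use a classifier and limit tree depth through **kwargs.
clf = make_forest("classifier", ntrees=100, max_depth=5)
print(type(clf).__name__, clf.n_estimators, clf.max_depth)
```

Any keyword accepted by the chosen scikit-learn class (such as max_depth or random_state) can be forwarded in this way.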
Methods Summary
KMERF.statistic(x, y) | Helper function that calculates the KMERF test statistic.
KMERF.test(x, y, reps=1000, workers=1) | Calculates the KMERF test statistic and p-value.
KMERF.statistic(x, y)
Helper function that calculates the KMERF test statistic.
KMERF.test(x, y, reps=1000, workers=1)
Calculates the KMERF test statistic and p-value.
Parameters
x, y (ndarray) -- Input data matrices. x and y must have the same number of samples. That is, the shapes must be (n, p) and (n, 1), where n is the number of samples and p is the number of dimensions.
reps (int, default: 1000) -- The number of replications used to estimate the null distribution in the permutation test that computes the p-value.
workers (int, default: 1) -- The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the process.
Returns
Examples
>>> import numpy as np
>>> from hyppo.independence import KMERF
>>> x = np.arange(100)
>>> y = x
>>> stat, pvalue = KMERF().test(x, y)[:2]
>>> '%.1f, %.3f' % (stat, pvalue)
'1.0, 0.001'
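The role of reps in the permutation test can be illustrated with a generic sketch. The function permutation_pvalue and the stand-in statistic abs_pearson are hypothetical and are not part of hyppo; the add-one correction in the p-value is a common convention and is assumed here rather than taken from hyppo.tools.perm_test.

```python
import numpy as np


def abs_pearson(x, y):
    """Stand-in test statistic (absolute Pearson correlation), used only for illustration."""
    return abs(np.corrcoef(x, y)[0, 1])


def permutation_pvalue(stat_fn, x, y, reps=1000, seed=0):
    """Estimate a p-value by recomputing the statistic on shuffled copies of y."""
    rng = np.random.default_rng(seed)
    observed = stat_fn(x, y)
    # Shuffling y breaks any dependence on x, which simulates the null hypothesis.
    null = np.array([stat_fn(x, y[rng.permutation(len(y))]) for _ in range(reps)])
    # Add one to numerator and denominator so the estimate is never exactly zero.
    pvalue = (1 + np.sum(null >= observed)) / (1 + reps)
    return observed, pvalue


x = np.arange(100.0)
y = x.copy()
stat, pvalue = permutation_pvalue(abs_pearson, x, y, reps=1000)
print('%.1f, %.3f' % (stat, pvalue))
```

Under this convention the smallest attainable p-value with reps=1000 is 1/1001, or about 0.001, so more replications are needed to resolve smaller p-values.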