KMERF

class hyppo.independence.KMERF(forest='regressor', ntrees=500, **kwargs)

Kernel Mean Embedding Random Forest (KMERF) test statistic and p-value.

KMERF is a kernel-based independence test that uses a random-forest-induced similarity matrix as its kernel. It has been shown to have especially high finite-sample testing power in high-dimensional settings [1].

A more detailed description of KMERF can be found in [1]. The statistic is computed using the following steps:

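Let $x$ and $y$ be $(n, p)$ and $(n, 1)$ samples of random variables $X$ and $Y$.

  • Run random forest with $m$ trees. Independent bootstrap samples of size $n_b \leq n$ are drawn to build a tree each time; each tree structure within the forest is denoted as $\phi_w \in \mathbf{P}$, $w \in \{1, \ldots, m\}$; $\phi_w(x_i)$ denotes the partition assigned to $x_i$.

  • Calculate the proximity kernel:

    $$\mathbf{K}^{\mathbf{x}}_{ij} = \frac{1}{m} \sum_{w=1}^{m} I(\phi_w(x_i) = \phi_w(x_j))$$

    where $I(\cdot)$ is the indicator function for how often two observations lie in the same partition.

  • Compute the induced kernel correlation: Let

    $$\mathbf{L}^{\mathbf{x}}_{ij} = \begin{cases} \mathbf{K}^{\mathbf{x}}_{ij} - \frac{1}{n-2} \sum_{t=1}^{n} \mathbf{K}^{\mathbf{x}}_{it} - \frac{1}{n-2} \sum_{s=1}^{n} \mathbf{K}^{\mathbf{x}}_{sj} + \frac{1}{(n-1)(n-2)} \sum_{s,t=1}^{n} \mathbf{K}^{\mathbf{x}}_{st} & \text{when } i \neq j \\ 0 & \text{otherwise} \end{cases}$$

  • Then let $\mathbf{K}^{\mathbf{y}}$ be the Euclidean distance induced kernel, and similarly compute $\mathbf{L}^{\mathbf{y}}$ from $\mathbf{K}^{\mathbf{y}}$. The unbiased kernel correlation equals

    $$\mathrm{KMERF}_n(\mathbf{x}, \mathbf{y}) = \frac{1}{n(n-3)} \mathrm{tr}\left(\mathbf{L}^{\mathbf{x}} \mathbf{L}^{\mathbf{y}}\right)$$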
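The steps above can be sketched in NumPy. This is a minimal illustration of the formulas as written, not hyppo's implementation. It assumes the leaf assignments are already available as an (n, m) integer array (for example, what a fitted scikit-learn forest's `apply(x)` returns), and it assumes one common choice of distance-induced kernel, `max(D) - D`. The function names are hypothetical.

```python
import numpy as np

def proximity_kernel(leaves):
    """K[i, j] = fraction of trees in which samples i and j share a leaf."""
    leaves = np.asarray(leaves)
    same_leaf = leaves[:, None, :] == leaves[None, :, :]  # shape (n, n, m)
    return same_leaf.mean(axis=2)

def distance_kernel(y):
    """Assumed distance-induced kernel: K = max(D) - D, D Euclidean."""
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    D = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=2)
    return D.max() - D

def center_unbiased(K):
    """Centering from the formula for L, with the diagonal set to zero."""
    n = K.shape[0]
    row = K.sum(axis=1, keepdims=True) / (n - 2)
    col = K.sum(axis=0, keepdims=True) / (n - 2)
    total = K.sum() / ((n - 1) * (n - 2))
    L = K - row - col + total
    np.fill_diagonal(L, 0.0)
    return L

def kmerf_statistic(Kx, Ky):
    """tr(Lx Ly) / (n (n - 3)); requires n > 3."""
    n = Kx.shape[0]
    return np.trace(center_unbiased(Kx) @ center_unbiased(Ky)) / (n * (n - 3))

# Toy input: 4 samples, 2 trees; leaves[i, w] is the leaf of sample i in tree w.
leaves = np.array([[0, 0], [0, 1], [1, 1], [1, 2]])
y = np.array([0.0, 1.0, 2.0, 3.0])
Kx = proximity_kernel(leaves)
Ky = distance_kernel(y)
stat = kmerf_statistic(Kx, Ky)
```

Samples 0 and 1 share a leaf in one of the two trees, so `Kx[0, 1]` is 0.5, while every diagonal entry of `Kx` is 1 because a sample always shares a leaf with itself.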

The returned p-value is calculated with a permutation test, using hyppo.tools.perm_test.
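To make the permutation step concrete, here is a minimal sketch of a permutation p-value. It assumes the common `(1 + count) / (1 + reps)` estimator and uses a toy statistic (absolute Pearson correlation) in place of KMERF to stay short; `perm_pvalue` and `abs_corr` are hypothetical names, and hyppo.tools.perm_test may differ in detail.

```python
import numpy as np

def perm_pvalue(stat_fn, x, y, reps=1000, seed=0):
    """Permutation p-value for a statistic where larger means more dependent."""
    rng = np.random.default_rng(seed)
    observed = stat_fn(x, y)
    # Shuffling y breaks any pairing with x, which simulates the null hypothesis.
    null = np.array([stat_fn(x, rng.permutation(y)) for _ in range(reps)])
    # Adding one to numerator and denominator keeps the p-value above zero.
    pvalue = (1 + np.sum(null >= observed)) / (1 + reps)
    return observed, pvalue

def abs_corr(x, y):
    """Toy stand-in statistic, used here only for illustration."""
    return abs(np.corrcoef(x, y)[0, 1])

x = np.arange(100, dtype=float)
stat, pvalue = perm_pvalue(abs_corr, x, x, reps=1000)
```

Under this estimator the smallest attainable p-value with 1000 replications is 1/1001, or roughly 0.001.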

Parameters
  • forest ("regressor", "classifier", default: "regressor") -- Type of forest used when running the independence test. If the y input to test is categorical, use the "classifier" keyword.

  • ntrees (int, default: 500) -- The number of trees used in the random forest.

  • **kwargs -- Additional arguments used for the forest (see sklearn.ensemble.RandomForestClassifier or sklearn.ensemble.RandomForestRegressor)

Methods Summary

KMERF.statistic(x, y) -- Helper function that calculates the KMERF test statistic.

KMERF.test(x, y[, reps, workers]) -- Calculates the KMERF test statistic and p-value.


KMERF.statistic(x, y)

Helper function that calculates the KMERF test statistic.

Parameters

x,y (ndarray) -- Input data matrices. x and y must have the same number of samples. That is, the shapes must be (n, p) and (n, 1) where n is the number of samples and p is the number of dimensions.

Returns

stat (float) -- The computed KMERF statistic.

KMERF.test(x, y, reps=1000, workers=1)

Calculates the KMERF test statistic and p-value.

Parameters
  • x,y (ndarray) -- Input data matrices. x and y must have the same number of samples. That is, the shapes must be (n, p) and (n, 1) where n is the number of samples and p is the number of dimensions.

  • reps (int, default: 1000) -- The number of permutation replications used to estimate the null distribution when computing the p-value.

  • workers (int, default: 1) -- The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the process.

Returns

  • stat (float) -- The computed KMERF statistic.

  • pvalue (float) -- The computed KMERF p-value.

  • kmerf_dict (dict) -- Contains additional returns under the following keys:

    • feat_importance (ndarray) -- An array containing the importance of each dimension.

Examples

>>> import numpy as np
>>> from hyppo.independence import KMERF
>>> x = np.arange(100)
>>> y = x
>>> '%.1f, %.3f' % KMERF().test(x, y)[:2]
'1.0, 0.001'
