phik package

Submodules

phik.betainc module

Project: PhiK - correlation analyzer library

Created: 2018/09/05

Description:

Implementation of incomplete beta function

Authors:

KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands

Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.

phik.betainc.contfractbeta(a: float, b: float, x: float, ITMAX: int = 5000, EPS: float = 1e-07) → float

Continued fraction form of the incomplete Beta function.

Code translated from: Numerical Recipes in C.

Example kindly taken from blog: https://malishoaib.wordpress.com/2014/04/15/the-beautiful-beta-functions-in-raw-python/

Parameters:
  • a (float) – parameter a of the beta function, a > 0

  • b (float) – parameter b of the beta function, b > 0

  • x (float) – argument, with 0 <= x <= 1

  • ITMAX (int) – max number of iterations, default is 5000.

  • EPS (float) – epsilon precision parameter, default is 1e-7.

Returns:

continued fraction form

Return type:

float

phik.betainc.incompbeta(a: float, b: float, x: float) → float

Evaluation of incomplete beta function.

Code translated from: Numerical Recipes in C.

Here a, b > 0 and 0 <= x <= 1. This function requires contfractbeta(a, b, x, ITMAX=200).

Example kindly taken from blog: https://malishoaib.wordpress.com/2014/04/15/the-beautiful-beta-functions-in-raw-python/

Parameters:
  • a (float) – parameter a of the beta function, a > 0

  • b (float) – parameter b of the beta function, b > 0

  • x (float) – argument, with 0 <= x <= 1

Returns:

incomplete beta function

Return type:

float

phik.betainc.log_incompbeta(a: float, b: float, x: float) → Tuple[float, float]

Evaluation of logarithm of incomplete beta function

Logarithm of incomplete beta function is implemented to ensure sufficient precision for values very close to zero and one.

Code translated from: Numerical Recipes in C.

Here a, b > 0 and 0 <= x <= 1. This function requires contfractbeta(a, b, x, ITMAX=200).

Example kindly taken from blog: https://malishoaib.wordpress.com/2014/04/15/the-beautiful-beta-functions-in-raw-python/

Parameters:
  • a (float) – parameter a of the beta function, a > 0

  • b (float) – parameter b of the beta function, b > 0

  • x (float) – argument, with 0 <= x <= 1

Returns:

tuple of log(incb) and log(1-incb)

Return type:

tuple
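
Example, a minimal sketch (assuming these routines implement the regularized incomplete beta function, so scipy.special.betainc can serve as a cross-check):

    import numpy as np
    from scipy.special import betainc as scipy_betainc
    from phik.betainc import incompbeta, log_incompbeta

    a, b, x = 2.0, 5.0, 0.3
    p = incompbeta(a, b, x)                 # incomplete beta via phik
    logp, log1mp = log_incompbeta(a, b, x)  # logarithmic variant

    print(p, scipy_betainc(a, b, x))  # the two values should agree closely
    print(np.exp(logp))               # exp(log(incb)) should reproduce p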

phik.binning module

Project: PhiK - correlation analyzer library

Created: 2018/09/06

Description:

A set of rebinning functions, to help rebin two lists into a 2d histogram.

Authors:

KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands

Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.

phik.binning.auto_bin_data(df: DataFrame, interval_cols: list | None = None, bins: int | list | ndarray | dict = 10, quantile: bool = False, dropna: bool = True, verbose: bool = True) → DataFrame

Index the input DataFrame with automatic bin_edges and interval columns.

Parameters:
  • df (pd.DataFrame) – input data

  • interval_cols (list) – column names of columns with interval variables.

  • bins – number of bins, or a list of bin edges (same for all columns), or a dictionary where per column the bins are specified (default=10). E.g.: bins = {'mileage': 5, 'driver_age': [18,25,35,45,55,65,125]}

  • quantile – when bins is an integer, uniform bins (False) or bins based on quantiles (True)

  • dropna (bool) – remove NaN values with True

  • verbose (bool) – if False, do not print all interval columns that are guessed

Returns:

binned DataFrame

phik.binning.bin_array(arr: ndarray | list, bin_edges: ndarray | list) → Tuple[ndarray, list]

Index the data given the bin_edges.

Underflow and overflow values are indicated.

Parameters:
  • arr – array like object with input data

  • bin_edges – list with bin edges.

Returns:

indexed data

phik.binning.bin_data(data: DataFrame, cols: list | ndarray | tuple = (), bins: int | list | ndarray | dict = 10, quantile: bool = False, retbins: bool = False)

Index the input DataFrame given the bin_edges for the columns specified in cols.

Parameters:
  • data (DataFrame) – input data

  • cols (list) – list of columns with numeric data which needs to be indexed

  • bins – number of bins, or a list of bin edges (same for all columns), or a dictionary where per column the bins are specified (default=10). E.g.: bins = {'mileage': 5, 'driver_age': [18,25,35,45,55,65,125]}

  • quantile – when bins is an integer, uniform bins (False) or bins based on quantiles (True)

Returns:

rebinned DataFrame

Return type:

pandas.DataFrame
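
Example, a minimal sketch (column names are illustrative):

    import numpy as np
    import pandas as pd
    from phik.binning import bin_data

    df = pd.DataFrame({'age': np.random.randint(18, 80, size=500),
                       'city': np.random.choice(['A', 'B'], size=500)})
    df_binned = bin_data(df, cols=['age'], bins=5)  # index 'age' into 5 uniform bins
    print(df_binned.head())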

phik.binning.bin_edges(arr: ndarray | list | Series, nbins: int, quantile: bool = False) → ndarray

Create uniform or quantile bin-edges for the input array.

Parameters:
  • arr – array like object with input data

  • nbins (int) – the number of bins

  • quantile (bool) – uniform bins (False) or bins based on quantiles (True)

Returns:

array with bin edges
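
Together with bin_array this forms the low-level binning pipeline; a minimal sketch:

    import numpy as np
    from phik.binning import bin_array, bin_edges

    arr = np.random.normal(size=1000)
    edges = bin_edges(arr, nbins=5, quantile=True)  # quantile-based bin edges
    indexed, labels = bin_array(arr, edges)         # index the data; under/overflow flagged
    print(edges, labels)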

phik.binning.create_correlation_overview_table(vals: List[Tuple[str, str, float]]) → DataFrame

Create overview table of phik/significance data.

Parameters:

vals (list) – list holding tuples of data for each variable pair formatted as ('var1', 'var2', value)

Returns:

symmetric table with phik/significances of all variable pairs

Return type:

pandas.DataFrame

phik.binning.hist2d(df: DataFrame, interval_cols: list | ndarray | None = None, bins: int | float | list | ndarray | dict = 10, quantile: bool = False, dropna: bool = True, drop_underflow: bool = True, drop_overflow: bool = True, retbins: bool = False, verbose: bool = True) → DataFrame | Tuple[DataFrame, dict]

Return a binned 2d DataFrame built from two columns of the input DataFrame

Parameters:
  • df – input data. DataFrame must contain exactly two columns

  • interval_cols – columns with interval variables which need to be binned

  • bins – number of bins, or a list of bin edges (same for all columns), or a dictionary where per column the bins are specified (default=10). E.g.: bins = {'mileage': 5, 'driver_age': [18,25,35,45,55,65,125]}

  • quantile (bool) – when the number of bins is specified, use uniform binning (False) or quantile binning (True)

  • dropna (bool) – remove NaN values with True

  • drop_underflow (bool) – do not take into account records in underflow bin when True (relevant when binning a numeric variable)

  • drop_overflow (bool) – do not take into account records in overflow bin when True (relevant when binning a numeric variable)

  • verbose (bool) – if False, do not print all interval columns that are guessed

Returns:

histogram DataFrame
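
Example, a minimal sketch for one numeric and one categorical column (names are illustrative):

    import numpy as np
    import pandas as pd
    from phik.binning import hist2d

    df = pd.DataFrame({'age': np.random.randint(18, 80, size=500),
                       'city': np.random.choice(['A', 'B', 'C'], size=500)})
    h2d = hist2d(df, interval_cols=['age'], bins=5)  # 5 uniform bins for 'age'
    print(h2d)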

phik.binning.hist2d_from_array(x: Series | list | ndarray, y: Series | list | ndarray, **kwargs) → DataFrame | Tuple[DataFrame, dict]

Return a binned 2d DataFrame built from two input arrays

Parameters:
  • x – input data. First array-like.

  • y – input data. Second array-like.

  • interval_cols – columns with interval variables which need to be binned

  • bins – number of bins, or a list of bin edges (same for all columns), or a dictionary where per column the bins are specified (default=10). E.g.: bins = {'mileage': 5, 'driver_age': [18,25,35,45,55,65,125]}

  • quantile (bool) – when the number of bins is specified, use uniform binning (False) or quantile binning (True)

  • dropna (bool) – remove NaN values with True

  • drop_underflow (bool) – do not take into account records in underflow bin when True (relevant when binning a numeric variable)

  • drop_overflow (bool) – do not take into account records in overflow bin when True (relevant when binning a numeric variable)

Returns:

histogram DataFrame

phik.binning.hist2d_from_rebinned_df(data_binned: DataFrame, dropna: bool = True, drop_underflow: bool = True, drop_overflow: bool = True) → DataFrame

Return a binned 2d DataFrame built from two columns of the rebinned input DataFrame

Parameters:
  • data_binned – input data. DataFrame must contain exactly two columns

  • dropna (bool) – remove NaN values with True

  • drop_underflow (bool) – do not take into account records in underflow bin when True (relevant when binning a numeric variable)

  • drop_overflow (bool) – do not take into account records in overflow bin when True (relevant when binning a numeric variable)

Returns:

histogram DataFrame

phik.bivariate module

Project: PhiK - correlation analyzer library

Created: 2019/11/23

Description:

Convert Pearson correlation value into a chi2 value of a contingency test matrix of a bivariate gaussian, and vice-versa. Calculation uses scipy’s mvn library.

Authors:

KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands

Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.

phik.bivariate.bivariate_normal_theory(rho: float, nx: int = -1, ny: int = -1, n: int = 1, sx: ndarray | None = None, sy: ndarray | None = None) → ndarray

Return binned pdf of bivariate normal distribution.

This function returns a “perfect” binned bivariate normal distribution.

Parameters:
  • rho (float) – tilt parameter

  • nx (int) – number of uniform bins on x-axis. alternative to sx.

  • ny (int) – number of uniform bins on y-axis. alternative to sy.

  • sx (np.ndarray) – bin edges array of x-axis. default is None.

  • sy (np.ndarray) – bin edges array of y-axis. default is None.

  • n (int) – number of entries. default is one.

Returns:

np.ndarray of binned bivariate normal pdf

phik.bivariate.chi2_from_phik(rho: float, n: int, subtract_from_chi2: float = 0, corr0: list | None = None, scale: float | None = None, sx: ndarray | None = None, sy: ndarray | None = None, pedestal: float = 0, nx: int = -1, ny: int = -1) → float

Calculate chi2-value of bivariate gauss having correlation value rho

Calculate no-noise chi2 value of bivar gauss with correlation rho, with respect to bivariate gauss without any correlation.

Parameters:
  • rho (float) – tilt parameter

  • n (int) – number of records

  • subtract_from_chi2 (float) – value subtracted from chi2 calculation. default is 0.

  • corr0 (list) – mvn_array result for rho=0. Default is None.

  • scale (float) – scale is multiplied with the chi2 if set.

  • sx (np.ndarray) – bin edges array of x-axis. default is None.

  • sy (np.ndarray) – bin edges array of y-axis. default is None.

  • pedestal (float) – pedestal is added to the chi2 if set.

  • nx (int) – number of uniform bins on x-axis. alternative to sx.

  • ny (int) – number of uniform bins on y-axis. alternative to sy.

Returns float:

chi2 value

phik.bivariate.phik_from_chi2(chi2: float, n: int, nx: int, ny: int, sx: ndarray | None = None, sy: ndarray | None = None, pedestal: float = 0) → float

Correlation coefficient of bivariate gaussian derived from chi2-value

The chi2-value is converted into the correlation coefficient of a bivariate gauss with correlation value rho, assuming the given binning and number of records. The correlation coefficient value is between 0 and 1.

Bivariate gaussian’s range is set to [-5,5] by construction.

Parameters:
  • chi2 (float) – input chi2 value

  • n (int) – number of records

  • nx (int) – number of uniform bins on x-axis. alternative to sx.

  • ny (int) – number of uniform bins on y-axis. alternative to sy.

  • sx (np.ndarray) – bin edges array of x-axis. default is None.

  • sy (np.ndarray) – bin edges array of y-axis. default is None.

  • pedestal (float) – pedestal is added to the chi2 if set.

Returns float:

correlation coefficient
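
The two conversions act as each other's inverse; a minimal round-trip sketch with 10x10 uniform bins and 1000 records:

    from phik.bivariate import chi2_from_phik, phik_from_chi2

    n, nx, ny = 1000, 10, 10
    chi2 = chi2_from_phik(rho=0.5, n=n, nx=nx, ny=ny)  # no-noise chi2 for rho=0.5
    rho = phik_from_chi2(chi2, n, nx, ny)              # should come out close to 0.5
    print(chi2, rho)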

phik.data_quality module

Project: PhiK - correlation analyzer library

Created: 2018/12/28

Description:

A set of functions to check for data quality issues in input data.

Authors:

KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands

Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.

phik.data_quality.dq_check_hist2d(hist2d: ndarray) → bool

Basic data quality checks for a contingency table

The following checks are done:

  1. There must be at least two bins in both the x and y direction.

  2. If the number of bins in the x and/or y direction is larger than 100, a warning is printed.

Parameters:

hist2d – contingency table

Returns:

bool passed_check

phik.data_quality.dq_check_nunique_values(df: DataFrame, interval_cols: list, dropna: bool = True) → Tuple[DataFrame, list]

Basic data quality checks per column in a DataFrame.

The following checks are done:

1. For all non-interval variables, if the number of unique values per variable is larger than 100, a warning is printed. When the number of unique values is large, the variable is likely to be an interval variable. Calculation of phik will be slow(ish) for pairs of variables where one (or both) have many different values (i.e. many bins).

2. For all interval variables, the number of unique values must be at least two. If the number of unique values is zero (i.e. all NaN) the column is removed. If the number of unique values is one, it is not possible to automatically create a binning for this variable (as min and max are the same). The variable is therefore dropped, irrespective of whether dropna is True or False.

3. For all non-interval variables, the number of unique values must be at least either a) 1 if dropna=False (NaN is now also considered a valid category), or b) 2 if dropna=True

The function returns a DataFrame where all columns with invalid data are removed. Also the list of interval_cols is updated and returned.

Parameters:
  • df (pd.DataFrame) – input data

  • interval_cols (list) – column names of columns with interval variables.

  • dropna (bool) – remove NaN values when True

Returns:

cleaned data, updated list of interval columns
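
Example, a minimal sketch (the 'constant' column is dropped because no binning can be constructed from a single unique value):

    import numpy as np
    import pandas as pd
    from phik.data_quality import dq_check_nunique_values

    df = pd.DataFrame({'x': np.random.normal(size=100),
                       'constant': 1.0,
                       'cat': np.random.choice(['a', 'b'], size=100)})
    df_clean, interval_cols = dq_check_nunique_values(df, interval_cols=['x', 'constant'])
    print(df_clean.columns.tolist(), interval_cols)  # 'constant' removed from both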

phik.definitions module

Project: PhiK - correlation analyzer library

Created: 2018/09/05

Description:

Definitions used throughout the phik package

Authors:

KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands

Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.

phik.entry_points module

Project: PhiK - correlation analyzer library

Created: 2018/11/13

Description:

Collection of phik entry points

Authors:

KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands

Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.

phik.entry_points.phik_trial()

Run Phi_K tests.

We will keep this here until we’ve completed the switch to pytest or nose and tox. We could also keep it, but I don’t like the fact that packages etc. are hard-coded. Gotta come up with a better solution.

phik.outliers module

Project: PhiK - correlation analyzer library

Created: 2018/09/05

Description:

Functions for calculating the statistical significance of outliers in a contingency table.

Authors:

KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands

Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.

phik.outliers.get_exact_poisson_uncertainty(x: float, nsigmas: float = 1) → float

Calculate the uncertainty on x using an exact poisson confidence interval. The width of the confidence interval can be specified using the number of sigmas. The default number of sigmas is set to 1, resulting in an error that is approximated by the standard poisson error sqrt(x).

Exact poisson uncertainty is described here: https://ms.mcmaster.ca/peter/s743/poissonalpha.html https://www.statsdirect.com/help/rates/poisson_rate_ci.htm https://www.ncbi.nlm.nih.gov/pubmed/2296988

Parameters:

x (float) – value

Return x_err:

the uncertainty on x (1 sigma)

Return type:

float

phik.outliers.get_independent_frequency_estimates(values: ndarray, CI_method: str = 'poisson') → Tuple[ndarray, ndarray]

Calculation of expected frequencies, based on the ABCD-method, i.e. independent frequency estimates.

Parameters:
  • values – The contingency table. The table contains the observed number of occurrences in each category.

  • CI_method (string) – method to be used for uncertainty calculation. poisson: normal poisson error. exact_poisson: error calculated from the asymmetric exact poisson interval

Returns exp, experr:

expected frequencies, error on the expected frequencies

phik.outliers.get_outlier_significances(obs: ndarray, exp: ndarray, experr: ndarray) → Tuple[ndarray, ndarray]

Evaluation of significance of observation

Evaluation of the significance of the difference between the observed number of occurrences and the expected number of occurrences, taking into account the uncertainty on the expected number of occurrences. When the uncertainty is not zero, the Linnemann method is used to calculate the p-values.

Parameters:
  • obs – observed numbers

  • exp – expected numbers

  • experr – uncertainty on the expected numbers

Returns:

pvalues, zvalues

phik.outliers.get_poisson_uncertainty(x: float) → float

Calculate the uncertainty on x using standard poisson error. In case x=0 the error=1 is assigned.

Parameters:

x (float) – value

Return x_err:

the uncertainty on x (1 sigma)

Return type:

float

phik.outliers.get_uncertainty(x: float, CI_method: str = 'poisson') → float

Calculate the uncertainty on a random number x taken from the poisson distribution

The uncertainty on x is calculated using either the standard poisson error (poisson) or the asymmetric exact poisson interval (exact_poisson). https://www.ncbi.nlm.nih.gov/pubmed/2296988 #FIXME: check ref

Parameters:
  • x (float) – value, must be greater than or equal to zero

  • CI_method (string) – method to be used for uncertainty calculation. poisson: normal poisson error. exact_poisson: error calculated from the asymmetric exact poisson interval

Return x_err:

the uncertainty on x (1 sigma)

phik.outliers.log_poisson_obs_mid_p(nobs: int, nexp: float, nexperr: float) → Tuple[float, float]

Calculate the logarithm of the p-value for measuring nobs observations given the expected value.

The Lancaster mid-P correction is applied to take into account the effects of discrete statistics. If the uncertainty on the expected value is known the Linnemann method is used for the p-value calculation. Otherwise the Poisson distribution is used to estimate the p-value.

Parameters:
  • nobs (int) – observed count

  • nexp (float) – expected number

  • nexperr (float) – uncertainty on the expected number

Returns:

tuple of log(p) and log(1-p)

Return type:

tuple

phik.outliers.log_poisson_obs_p(nobs: int, nexp: float, nexperr: float) → Tuple[float, float]

Calculate logarithm of p-value for nobs observations given the expected value and its uncertainty using the Linnemann method.

If the uncertainty on the expected value is known the Linnemann method is used. Otherwise the Poisson distribution is used to estimate the p-value.

Measures of Significance in HEP and Astrophysics Authors: J. T. Linnemann http://arxiv.org/abs/physics/0312059

Code inspired by: https://root.cern.ch/doc/master/NumberCountingUtils_8cxx_source.html#l00086

Three fixes are added for:

  • nobs = 0, when - by construction - p should be 1.

  • uncertainty of zero, for which Linnemann’s function does not work, but one can simply revert to regular Poisson.

  • when nexp=0, betainc always returns 1. Here we set nexp = nexperr.

Parameters:
  • nobs (int) – observed count

  • nexp (float) – expected number

  • nexperr (float) – uncertainty on the expected number

Returns:

tuple of log(p) and log(1-p)

Return type:

tuple

phik.outliers.outlier_significance_from_array(x: ndarray | list | Series, y: ndarray | list | Series, num_vars: list | None = None, bins: int | list | ndarray | dict = 10, quantile: bool = False, ndecimals: int = 1, CI_method: str = 'poisson', dropna: bool = True, drop_underflow: bool = True, drop_overflow: bool = True, verbose: bool = True) → DataFrame

Calculate the significance matrix of excesses or deficits of input x and input y. x and y can contain interval, ordinal or categorical data. Use the num_vars variable to indicate whether x and/or y contain interval data.

Parameters:
  • x (list) – array-like input

  • y (list) – array-like input

  • num_vars (list) – list of variables which are numeric and need to be binned, either ['x'], ['y'], or ['x','y']

  • bins – number of bins, or a list of bin edges (same for all columns), or a dictionary where per column the bins are specified (default=10). E.g.: bins = {'mileage': 5, 'driver_age': [18,25,35,45,55,65,125]}

  • quantile (bool) – when the number of bins is specified, use uniform binning (False) or quantile binning (True)

  • ndecimals – number of decimals to use in labels of binned interval variables to specify bin edges (default=1)

  • CI_method (string) – method to be used for uncertainty calculation. poisson: normal poisson error. exact_poisson: error calculated from the asymmetric exact poisson interval

  • dropna (bool) – remove NaN values with True

  • drop_underflow (bool) – do not take into account records in underflow bin when True (relevant when binning a numeric variable)

  • drop_overflow (bool) – do not take into account records in overflow bin when True (relevant when binning a numeric variable)

  • verbose (bool) – if False, do not print all interval columns that are guessed

Returns:

outlier significance matrix (pd.DataFrame)
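
Example, a minimal sketch for a numeric/categorical pair:

    import numpy as np
    from phik.outliers import outlier_significance_from_array

    x = np.random.normal(size=1000)
    y = np.random.choice(['low', 'mid', 'high'], size=1000)
    zmatrix = outlier_significance_from_array(x, y, num_vars=['x'], bins=5)
    print(zmatrix)  # one Z-value per (x-bin, y-category) cell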

phik.outliers.outlier_significance_from_binned_array(x: ndarray | list | Series, y: ndarray | list | Series, CI_method: str = 'poisson', dropna: bool = True, drop_underflow: bool = True, drop_overflow: bool = True) → DataFrame

Calculate the significance matrix of excesses or deficits of input x and input y. x and y can contain binned interval, ordinal or categorical data.

Parameters:
  • x (list) – array-like input

  • y (list) – array-like input

  • CI_method (string) – method to be used for uncertainty calculation. poisson: normal poisson error. exact_poisson: error calculated from the asymmetric exact poisson interval

  • dropna (bool) – remove NaN values with True

  • drop_underflow (bool) – do not take into account records in underflow bin when True (relevant when binning a numeric variable)

  • drop_overflow (bool) – do not take into account records in overflow bin when True (relevant when binning a numeric variable)

Returns:

outlier significance matrix (pd.DataFrame)

phik.outliers.outlier_significance_matrices(df: DataFrame, interval_cols: list | None = None, CI_method: str = 'poisson', ndecimals: int = 1, bins=10, quantile: bool = False, combinations: list | tuple = (), dropna: bool = True, drop_underflow: bool = True, drop_overflow: bool = True, retbins: bool = False, verbose: bool = True)

Calculate the significance matrix of excesses or deficits for all possible combinations of variables, or for those combinations specified using combinations

Parameters:
  • df – input data

  • interval_cols – columns with interval variables which need to be binned

  • CI_method (string) – method to be used for uncertainty calculation. poisson: normal poisson error. exact_poisson: error calculated from the asymmetric exact poisson interval

  • ndecimals – number of decimals to use in labels of binned interval variables to specify bin edges (default=1)

  • bins – number of bins, or a list of bin edges (same for all columns), or a dictionary where per column the bins are specified (default=10). E.g.: bins = {'mileage': 5, 'driver_age': [18,25,35,45,55,65,125]}

  • quantile (bool) – when the number of bins is specified, use uniform binning (False) or quantile binning (True)

  • combinations – in case you do not want to calculate an outlier significance matrix for all permutations of the available variables, you can specify a list of the required permutations here, in the format [(var1, var2), (var2, var4), etc]

  • dropna (bool) – remove NaN values with True

  • drop_underflow (bool) – do not take into account records in underflow bin when True (relevant when binning a numeric variable)

  • drop_overflow (bool) – do not take into account records in overflow bin when True (relevant when binning a numeric variable)

  • retbins (bool) – if true, function also returns dict with bin_edges of rebinned variables.

  • verbose (bool) – if False, do not print all interval columns that are guessed

Returns:

dictionary with outlier significance matrices (pd.DataFrame)

phik.outliers.outlier_significance_matrices_from_rebinned_df(data_binned: DataFrame, binning_dict=None, CI_method='poisson', ndecimals=1, combinations: list | tuple = (), dropna=True, drop_underflow=True, drop_overflow=True)

Calculate the significance matrix of excesses or deficits for all possible combinations of variables, or for those combinations specified using combinations. This function can also be used instead of outlier_significance_matrices in case all variables are either categorical or ordinal, so no binning is required.

Parameters:
  • data_binned – input data. Interval variables need to be binned.

  • binning_dict (dict) – dictionary with bin edges for each binned interval variable. When no bin_edges are provided values are used as bin label. Otherwise, bin labels are constructed based on provided bin edge information.

  • CI_method (string) – method to be used for uncertainty calculation. poisson: normal poisson error. exact_poisson: error calculated from the asymmetric exact poisson interval

  • ndecimals – number of decimals to use in labels of binned interval variables to specify bin edges (default=1)

  • combinations – in case you do not want to calculate an outlier significance matrix for all permutations of the available variables, you can specify a list of the required permutations here, in the format [(var1, var2), (var2, var4), etc]

  • dropna (bool) – remove NaN values with True

  • drop_underflow (bool) – do not take into account records in underflow bin when True (relevant when binning a numeric variable)

  • drop_overflow (bool) – do not take into account records in overflow bin when True (relevant when binning a numeric variable)

Returns:

dictionary with outlier significance matrices (pd.DataFrame)

phik.outliers.outlier_significance_matrix(df: DataFrame, interval_cols: list | None = None, CI_method: str = 'poisson', ndecimals: int = 1, bins=10, quantile: bool = False, dropna: bool = True, drop_underflow: bool = True, drop_overflow: bool = True, retbins: bool = False, verbose: bool = True)

Calculate the significance matrix of excesses or deficits

Parameters:
  • df – input data. DataFrame must contain exactly two columns

  • interval_cols – columns with interval variables which need to be binned

  • CI_method (string) – method to be used for uncertainty calculation. poisson: normal poisson error. exact_poisson: error calculated from the asymmetric exact poisson interval

  • bins – number of bins, or a list of bin edges (same for all columns), or a dictionary where per column the bins are specified (default=10). E.g.: bins = {'mileage': 5, 'driver_age': [18,25,35,45,55,65,125]}

  • ndecimals – number of decimals to use in labels of binned interval variables to specify bin edges (default=1)

  • quantile (bool) – when the number of bins is specified, use uniform binning (False) or quantile binning (True)

  • dropna (bool) – remove NaN values with True

  • drop_underflow (bool) – do not take into account records in underflow bin when True (relevant when binning a numeric variable)

  • drop_overflow (bool) – do not take into account records in overflow bin when True (relevant when binning a numeric variable)

  • retbins (bool) – if true, function also returns dict with bin_edges of rebinned variables.

  • verbose (bool) – if False, do not print all interval columns that are guessed

Returns:

outlier significance matrix (pd.DataFrame)

phik.outliers.outlier_significance_matrix_from_hist2d(data: ndarray, CI_method: str = 'poisson') → Tuple[ndarray, ndarray]

Calculate the significance matrix of excesses or deficits in a contingency table

Parameters:
  • data – numpy array contingency table

  • CI_method (string) – method to be used for uncertainty calculation. poisson: normal poisson error. exact_poisson: error calculated from the asymmetric exact poisson interval

Returns:

p-value matrix, outlier significance matrix

phik.outliers.outlier_significance_matrix_from_rebinned_df(data_binned: DataFrame, binning_dict: dict, CI_method: str = 'poisson', ndecimals: int = 1, dropna: bool = True, drop_underflow: bool = True, drop_overflow: bool = True) → DataFrame

Calculate the significance matrix of excesses or deficits

Parameters:
  • data_binned – input data. DataFrame must contain exactly two columns

  • binning_dict (dict) – dictionary with bin edges for each binned interval variable. When no bin_edges are provided values are used as bin label. Otherwise, bin labels are constructed based on provided bin edge information.

  • CI_method (string) – method to be used for uncertainty calculation. poisson: normal poisson error. exact_poisson: error calculated from the asymmetric exact poisson interval

  • ndecimals – number of decimals to use in labels of binned interval variables to specify bin edges (default=1)

  • dropna (bool) – remove NaN values with True

  • drop_underflow (bool) – do not take into account records in underflow bin when True (relevant when binning a numeric variable)

  • drop_overflow (bool) – do not take into account records in overflow bin when True (relevant when binning a numeric variable)

Returns:

outlier significance matrix (pd.DataFrame)

phik.outliers.poisson_obs_mid_p(nobs: int, nexp: float, nexperr: float) → float

Calculate the p-value for measuring nobs observations given the expected value.

The Lancaster mid-P correction is applied to take into account the effects of discrete statistics. If the uncertainty on the expected value is known the Linnemann method is used for the p-value calculation. Otherwise the Poisson distribution is used to estimate the p-value.

Parameters:
  • nobs (int) – observed count

  • nexp (float) – expected number

  • nexperr (float) – uncertainty on the expected number

Returns:

mid p-value

Return type:

float

phik.outliers.poisson_obs_mid_z(nobs: int, nexp: float, nexperr: float) → float

Calculate the Z-value for measuring nobs observations given the expected value.

The Z-value expresses the number of sigmas the observed value deviates from the expected value, and is based on the p-value calculation. The Lancaster mid-P correction is applied to take into account the effects of low statistics. If the uncertainty on the expected value is known the Linnemann method is used for the p-value calculation. Otherwise the Poisson distribution is used to estimate the p-value.

Parameters:
  • nobs (int) – observed count

  • nexp (float) – expected number

  • nexperr (float) – uncertainty on the expected number

Returns:

Z-value

Return type:

float

phik.outliers.poisson_obs_p(nobs: int, nexp: float, nexperr: float) → float

Calculate p-value for nobs observations given the expected value and its uncertainty using the Linnemann method.

If the uncertainty on the expected value is known the Linnemann method is used. Otherwise the Poisson distribution is used to estimate the p-value.

Measures of Significance in HEP and Astrophysics Authors: J. T. Linnemann http://arxiv.org/abs/physics/0312059

Code inspired by: https://root.cern.ch/doc/master/NumberCountingUtils_8cxx_source.html#l00086

Three fixes are added for:

  • nobs = 0, when - by construction - p should be 1.

  • uncertainty of zero, for which Linnemann’s function does not work, but one can simply revert to regular Poisson.

  • when nexp=0, betainc always returns 1. Here we set nexp = nexperr.

Parameters:
  • nobs (int) – observed count

  • nexp (float) – expected number

  • nexperr (float) – uncertainty on the expected number

Returns:

p-value

Return type:

float

phik.outliers.poisson_obs_z(nobs: int, nexp: float, nexperr: float) → float

Calculate the Z-value for measuring nobs observations given the expected value.

The Z-value expresses the number of sigmas the observed value deviates from the expected value, and is based on the p-value calculation. If the uncertainty on the expected value is known the Linnemann method is used. Otherwise the Poisson distribution is used to estimate the p-value.

Parameters:
  • nobs (int) – observed count

  • nexp (float) – expected number

  • nexperr (float) – uncertainty on the expected number

Returns:

Z-value

Return type:

float
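
Example, a minimal sketch of the p- and Z-value helpers for a single cell, with 15 observed counts where 10.0 +/- 1.5 are expected:

    from phik.outliers import poisson_obs_p, poisson_obs_z

    p = poisson_obs_p(nobs=15, nexp=10.0, nexperr=1.5)
    z = poisson_obs_z(nobs=15, nexp=10.0, nexperr=1.5)
    print(p, z)  # a small p (large z) would signal a significant excess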

phik.phik module

Project: PhiK - correlation analyzer library

Created: 2018/09/05

Description:

Functions for the Phik correlation calculation

Authors:

KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands

Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.

phik.phik.global_phik_array(df: DataFrame, interval_cols: list | None = None, bins: int | list | ndarray | dict = 10, quantile: bool = False, noise_correction: bool = True, dropna: bool = True, drop_underflow: bool = True, drop_overflow: bool = True, verbose: bool = True, njobs: int = -1) → Tuple[ndarray, ndarray]

Global correlation values of variables, obtained from the PhiK correlation matrix.

A global correlation value is a simple approximation of how well one feature can be modeled in terms of all others.

The global correlation coefficient is a number between zero and one, obtained from the PhiK correlation matrix, that gives the highest possible correlation between a variable and the linear combination of all other variables. See PhiK paper or for original definition: https://inspirehep.net/literature/101965

Global PhiK uses two important simplifications / approximations:

  • The variables are assumed to belong to a multinormal distribution, which is typically not the case.

  • The correlation should be a Pearson correlation matrix, allowing for negative values, which is not the case with PhiK correlations (which are positive by construction).

To correct for these, the Global PhiK values are artificially capped between 0 and 1.

Still, the Global PhiK values are useful, quick, simple estimates that are interesting in an exploratory study. For a solid, trustworthy estimate be sure to use a classifier or regression model instead.

Parameters:
  • df (pd.DataFrame) – input data

  • interval_cols (list) – column names of columns with interval variables.

  • bins – number of bins, or a list of bin edges (same for all columns), or a dictionary where per column the bins are specified (default=10). E.g.: bins = {'mileage': 5, 'driver_age': [18,25,35,45,55,65,125]}

  • quantile – when bins is an integer, uniform bins (False) or bins based on quantiles (True)

  • noise_correction (bool) – apply noise correction in phik calculation

  • dropna (bool) – remove NaN values with True

  • drop_underflow (bool) – do not take into account records in underflow bin when True (relevant when binning a numeric variable)

  • drop_overflow (bool) – do not take into account records in overflow bin when True (relevant when binning a numeric variable)

  • verbose (bool) – if False, do not print all interval columns that are guessed

  • njobs (int) – number of parallel jobs used for calculation of global phik. default is -1. 1 uses no parallel jobs.

Returns:

global correlations array
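
Example, a minimal sketch; the returned tuple is assumed here to hold the global correlation values and the matching variable labels:

    import numpy as np
    import pandas as pd
    from phik.phik import global_phik_array

    df = pd.DataFrame({'age': np.random.randint(18, 80, size=500),
                       'income': np.random.normal(50, 10, size=500),
                       'city': np.random.choice(['A', 'B'], size=500)})
    global_corr, labels = global_phik_array(df, interval_cols=['age', 'income'])
    print(global_corr, labels)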

phik.phik.global_phik_from_rebinned_df(data_binned: DataFrame, noise_correction: bool = True, dropna: bool = True, drop_underflow: bool = True, drop_overflow: bool = True, njobs: int = -1) → Tuple[ndarray, ndarray]

Global correlation values of variables, obtained from the PhiK correlation matrix.

A global correlation value is a simple approximation of how well one feature can be modeled in terms of all others.

The global correlation coefficient is a number between zero and one, obtained from the PhiK correlation matrix, that gives the highest possible correlation between a variable and the linear combination of all other variables. See PhiK paper or for original definition: https://inspirehep.net/literature/101965

Global PhiK uses two important simplifications / approximations:

  • The variables are assumed to belong to a multinormal distribution, which is typically not the case.

  • The correlation should be a Pearson correlation matrix, allowing for negative values, which is not the case with PhiK correlations (which are positive by construction).

To correct for these, the Global PhiK values are artificially capped between 0 and 1.

Still, the Global PhiK values are useful, quick, simple estimates that are interesting in an exploratory study. For a solid, trustworthy estimate be sure to use a classifier or regression model instead.

Parameters:
  • data_binned (pd.DataFrame) – rebinned input data

  • noise_correction (bool) – apply noise correction in phik calculation

  • dropna (bool) – remove NaN values with True

  • drop_underflow (bool) – do not take into account records in underflow bin when True (relevant when binning a numeric variable)

  • drop_overflow (bool) – do not take into account records in overflow bin when True (relevant when binning a numeric variable)

  • njobs (int) – number of parallel jobs used for calculation of phik. default is -1. 1 uses no parallel jobs.

Returns:

global correlations array

phik.phik.phik_from_array(x: ndarray | Series, y: ndarray | Series, num_vars: str | list | None = None, bins: int | dict | list | ndarray = 10, quantile: bool = False, noise_correction: bool = True, dropna: bool = True, drop_underflow: bool = True, drop_overflow: bool = True) → float

Correlation coefficient of bivariate gaussian derived from chi2-value

The chi2-value is converted into the correlation coefficient of a bivariate gauss with correlation value rho, assuming the given binning and number of records. The correlation coefficient value is between 0 and 1.

Bivariate gaussian’s range is set to [-5,5] by construction.

Parameters:
  • x – array-like input

  • y – array-like input

  • num_vars – list of numeric variables which need to be binned, e.g. ['x'] or ['x','y']

  • bins – number of bins, or a list of bin edges (same for all columns), or a dictionary where per column the bins are specified (default=10). E.g.: bins = {'mileage': 5, 'driver_age': [18,25,35,45,55,65,125]}

  • quantile – when bins is an integer, uniform bins (False) or bins based on quantiles (True)

  • noise_correction (bool) – apply noise correction in phik calculation

  • dropna (bool) – remove NaN values with True

  • drop_underflow (bool) – do not take into account records in underflow bin when True (relevant when binning a numeric variable)

  • drop_overflow (bool) – do not take into account records in overflow bin when True (relevant when binning a numeric variable)

Returns:

phik correlation coefficient
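
Example, a minimal sketch for two correlated numeric arrays:

    import numpy as np
    from phik.phik import phik_from_array

    x = np.random.normal(size=1000)
    y = x + np.random.normal(size=1000)  # y is correlated with x by construction
    print(phik_from_array(x, y, num_vars=['x', 'y']))  # expect a value well above 0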

phik.phik.phik_from_binned_array(x: ndarray | Series, y: ndarray | Series, noise_correction: bool = True, dropna: bool = True, drop_underflow: bool = True, drop_overflow: bool = True) → float

Correlation coefficient of bivariate gaussian derived from chi2-value

The chi2-value is converted into the correlation coefficient of a bivariate gauss with correlation value rho, assuming the given binning and number of records. The correlation coefficient value is between 0 and 1.

Bivariate gaussian’s range is set to [-5,5] by construction.

Parameters:
  • x – array-like input. Interval variables need to be binned beforehand.

  • y – array-like input. Interval variables need to be binned beforehand.

  • noise_correction (bool) – apply noise correction in phik calculation

  • dropna (bool) – remove NaN values with True

  • drop_underflow (bool) – do not take into account records in underflow bin when True (relevant when binning a numeric variable)

  • drop_overflow (bool) – do not take into account records in overflow bin when True (relevant when binning a numeric variable)

Returns:

phik correlation coefficient

phik.phik.phik_from_hist2d(observed: ndarray, noise_correction: bool = True, expected: ndarray | None = None) → float

Correlation coefficient of bivariate gaussian derived from chi2-value

The chi2-value is converted into the correlation coefficient of a bivariate gauss with correlation value rho, assuming the given binning and number of records. The correlation coefficient value is between 0 and 1.

Bivariate gaussian’s range is set to [-5,5] by construction.

Parameters:
  • observed – 2d-array observed values

  • noise_correction (bool) – apply noise correction in phik calculation

  • expected – 2d-array expected values. Optional, default is None, otherwise evaluated automatically.

Returns float:

correlation coefficient phik

phik.phik.phik_from_rebinned_df(data_binned: DataFrame, noise_correction: bool = True, dropna: bool = True, drop_underflow: bool = True, drop_overflow: bool = True, njobs: int = -1) → DataFrame

Correlation matrix of bivariate gaussian derived from chi2-value

Each chi2-value is converted into the correlation coefficient of a bivariate gauss with correlation value rho, assuming the given binning and number of records. The correlation coefficient value is between 0 and 1.

Bivariate gaussian’s range is set to [-5,5] by construction.

Parameters:
  • data_binned (pd.DataFrame) – input data where interval variables have been binned

  • noise_correction (bool) – apply noise correction in phik calculation

  • dropna (bool) – remove NaN values with True

  • drop_underflow (bool) – do not take into account records in underflow bin when True (relevant when binning a numeric variable)

  • drop_overflow (bool) – do not take into account records in overflow bin when True (relevant when binning a numeric variable)

  • njobs (int) – number of parallel jobs used for calculation of phik. default is -1. 1 uses no parallel jobs.

Returns:

phik correlation matrix

phik.phik.phik_matrix(df: DataFrame, interval_cols: list | None = None, bins: int | list | ndarray | dict = 10, quantile: bool = False, noise_correction: bool = True, dropna: bool = True, drop_underflow: bool = True, drop_overflow: bool = True, verbose: bool = True, njobs: int = -1) → DataFrame

Correlation matrix of bivariate gaussian derived from chi2-value

Each chi2-value is converted into the correlation coefficient of a bivariate gauss with correlation value rho, assuming the given binning and number of records. The correlation coefficient value is between 0 and 1.

Bivariate gaussian’s range is set to [-5,5] by construction.

Parameters:
  • df (pd.DataFrame) – input data

  • interval_cols (list) – column names of columns with interval variables.

  • bins – number of bins, or a list of bin edges (same for all columns), or a dictionary where per column the bins are specified (default=10). E.g.: bins = {'mileage': 5, 'driver_age': [18,25,35,45,55,65,125]}

  • quantile – when bins is an integer, uniform bins (False) or bins based on quantiles (True)

  • noise_correction (bool) – apply noise correction in phik calculation

  • dropna (bool) – remove NaN values with True

  • drop_underflow (bool) – do not take into account records in underflow bin when True (relevant when binning a numeric variable)

  • drop_overflow (bool) – do not take into account records in overflow bin when True (relevant when binning a numeric variable)

  • verbose (bool) – if False, do not print all interval columns that are guessed

  • njobs (int) – number of parallel jobs used for calculation of phik. default is -1. 1 uses no parallel jobs.

Returns:

phik correlation matrix
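
Example, a minimal sketch. Note that importing phik also registers a pandas DataFrame accessor, so the accessor form shown in the comment is equivalent:

    import numpy as np
    import pandas as pd
    import phik  # noqa: F401  (import registers the DataFrame accessor)
    from phik.phik import phik_matrix

    df = pd.DataFrame({'age': np.random.randint(18, 80, size=1000),
                       'city': np.random.choice(['A', 'B', 'C'], size=1000)})
    print(phik_matrix(df, interval_cols=['age']))
    # equivalent accessor form: df.phik_matrix(interval_cols=['age'])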

phik.phik.phik_observed_vs_expected_from_rebinned_df(obs_binned: DataFrame, exp_binned: DataFrame, noise_correction: bool = True, dropna: bool = True, drop_underflow: bool = True, drop_overflow: bool = True, njobs: int = -1) → DataFrame

PhiK correlation matrix of comparing observed with expected dataset

Each chi2-value is converted into the correlation coefficient of a bivariate gauss with correlation value rho, assuming the given binning and number of records. The correlation coefficient value is between 0 and 1.

Parameters:
  • obs_binned (pd.DataFrame) – observed input data where interval variables have been binned

  • exp_binned (pd.DataFrame) – expected input data where interval variables have been binned

  • noise_correction (bool) – apply noise correction in phik calculation

  • dropna (bool) – remove NaN values with True

  • drop_underflow (bool) – do not take into account records in underflow bin when True (relevant when binning a numeric variable)

  • drop_overflow (bool) – do not take into account records in overflow bin when True (relevant when binning a numeric variable)

  • njobs (int) – number of parallel jobs used for calculation of phik. default is -1. 1 uses no parallel jobs.

Returns:

phik correlation matrix

phik.phik.spark_phik_matrix_from_hist2d_dict(spark_context, hist_dict: dict)

Correlation matrix of bivariate gaussian using spark parallelization over variable-pair 2d histograms

See the spark notebook phik_tutorial_spark.ipynb for an example.

Each input histogram is converted into a correlation coefficient of a bivariate gauss with correlation value rho, assuming the given binning and number of records. The correlation coefficient value is between 0 and 1.

Parameters:
  • spark_context – spark context

  • hist_dict – dict of 2d numpy grids with value-counts. keys are histogram names.

Returns:

phik correlation matrix

phik.report module

Project: PhiK - correlation analyzer library

Created: 2018/09/06

Description:

Functions to create nice correlation overview and matrix plots

Authors:

KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands

Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.

phik.report.correlation_report(data: DataFrame, interval_cols: list | None = None, bins=10, quantile: bool = False, do_outliers: bool = True, pdf_file_name: str = '', significance_threshold: float = 3, correlation_threshold: float = 0.5, noise_correction: bool = True, store_each_plot: bool = False, lambda_significance: str = 'log-likelihood', simulation_method: str = 'multinominal', nsim_chi2: int = 1000, significance_method: str = 'asymptotic', CI_method: str = 'poisson', verbose: bool = True, plot_phik_matrix_kws: dict = {}, plot_global_phik_kws: dict = {}, plot_significance_matrix_kws: dict = {}, plot_outlier_significance_kws: dict = {}) → Tuple[DataFrame, DataFrame, Dict[str, DataFrame], Dict[str, str]]

Create a correlation report for the given dataset.

The following quantities are calculated:

  • The phik correlation matrix

  • The significance matrix

  • The outlier significances measured in pairs of variables. (optional)

Parameters:
  • data – input dataframe

  • interval_cols – list of columns names of columns containing interval data

  • bins – number of bins, or a list of bin edges (same for all columns), or a dictionary where per column the bins are specified (default=10). E.g.: bins = {'mileage': 5, 'driver_age': [18,25,35,45,55,65,125]}

  • quantile – when bins is an integer, uniform bins (False) or bins based on quantiles (True)

  • do_outliers – Evaluate outlier significances of variable pairs (when True)

  • pdf_file_name – file name of the pdf where the results are stored

  • store_each_plot – store each plot in folder derived from pdf_file_name. If true, single pdf is no longer stored. Default is false.

  • significance_threshold – evaluate outlier significance for all variable pairs with a significance of uncorrelation higher than this threshold

  • correlation_threshold – evaluate outlier significance for all variable pairs with a phik correlation higher than this threshold

  • noise_correction – Apply noise correction in phik calculation

  • lambda_significance – test statistic used in significance calculation. Options: [pearson, log-likelihood]

  • simulation_method – sampling method used in significance calculation. Options: [multinominal, row_product_multinominal, col_product_multinominal, hypergeometric]

  • nsim_chi2 – number of simulated datasets in significance calculation.

  • significance_method – method for significance calculation. Options: [asymptotic, MC, hybrid]

  • CI_method – method for uncertainty calculation for outlier significance calculation. Options: [poisson, exact_poisson]

  • verbose (bool) – if False, do not print all interval columns that are guessed

  • plot_phik_matrix_kws (dict) – kwargs passed to plot_correlation_matrix() to plot the phik matrix. updates the default plotting values.

  • plot_global_phik_kws (dict) – kwargs passed to plot_correlation_matrix() to plot the global-phik vector. updates the default plotting values.

  • plot_significance_matrix_kws (dict) – kwargs passed to plot_correlation_matrix() to plot significance matrix. updates the default plotting values.

  • plot_outlier_significance_kws (dict) – kwargs passed to plot_correlation_matrix() to plot the outlier significances. updates the default plotting values.

Returns:

phik_matrix (pd.DataFrame), global_phik (np.array), significance_matrix (pd.DataFrame), outliers_overview (dictionary), output_files (dictionary)
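
Example, a minimal sketch that writes the full report to a pdf; the returned objects are left packed, as their layout is described above:

    import numpy as np
    import pandas as pd
    from phik.report import correlation_report

    df = pd.DataFrame({'age': np.random.randint(18, 80, size=500),
                       'city': np.random.choice(['A', 'B'], size=500)})
    results = correlation_report(df, interval_cols=['age'],
                                 pdf_file_name='correlation_report.pdf')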

phik.report.plot_correlation_matrix(matrix_colors: ndarray, x_labels: list, y_labels: list, pdf_file_name: str = '', title: str = 'correlation', vmin: float = -1, vmax: float = 1, color_map: str = 'RdYlGn', x_label: str = '', y_label: str = '', top: int = 20, matrix_numbers: ndarray | None = None, print_both_numbers: bool = True, figsize: tuple = (7, 5), usetex: bool = False, identity_layout: bool = True, fontsize_factor: float = 1) → None

Create and plot correlation matrix.

Copied with permission from the eskapade package (pip install eskapade)

Parameters:
  • matrix_colors – input correlation matrix

  • x_labels (list) – Labels for histogram x-axis bins

  • y_labels (list) – Labels for histogram y-axis bins

  • pdf_file_name (str) – if set, will store the plot in a pdf file

  • title (str) – if set, title of the plot

  • vmin (float) – minimum value of color legend (default is -1)

  • vmax (float) – maximum value of color legend (default is +1)

  • x_label (str) – Label for histogram x-axis

  • y_label (str) – Label for histogram y-axis

  • color_map (str) – color map passed to matplotlib pcolormesh. (default is ‘RdYlGn’)

  • top (int) – only print the top 20 characters of x-labels and y-labels. (default is 20)

  • matrix_numbers – input matrix used for plotting numbers. (default is matrix_colors)

  • identity_layout – Plot diagonal from right top to bottom left (True) or bottom left to top right (False)

phik.report.plot_hist_and_func(data: list | ndarray | Series, func: Callable, funcparams, xbins=False, labels=None, xlabel='', ylabel='', title='', xlimit=None, alpha=1)

Create a histogram of the provided data and overlay with a function.

Parameters:
  • data (list) – data

  • func (function) – function of the type f(x, a, b, c) where parameters a, b, c are optional

  • funcparams (list) – parameter values to be given to the function, to be specified as [a, b, c]

  • xbins – specify binning of histogram, either by giving the number of bins or a list of bin edges

  • labels – labels of histogram and function to be used in the legend

  • xlabel – figure xlabel

  • ylabel – figure ylabel

  • title – figure title

  • xlimit – x limits figure

  • alpha – alpha histogram

phik.resources module

Project: PhiK - correlation analyzer library

Created: 2018/11/13

Description:

Collection of helper functions to get fixtures, i.e. for test data and notebooks. These are mostly used by the (integration) tests and example notebooks.

Authors:

KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands

Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.

phik.resources.fixture(name: str) → str

Return the full path filename of a fixture data set.

Parameters:

name (str) – The name of the fixture.

Returns:

The full path filename of the fixture data set.

Return type:

str

Raises:

FileNotFoundError – If the fixture cannot be found.

phik.resources.notebook(name: str) → str

Return the full path filename of a tutorial notebook.

Parameters:

name (str) – The name of the notebook.

Returns:

The full path filename of the notebook.

Return type:

str

Raises:

FileNotFoundError – If the notebook cannot be found.

phik.significance module

Project: PhiK - correlation analyzer library

Created: 2018/09/05

Description:

Functions for doing the significance evaluation of a hypothesis test of variable independence using a contingency table.

Authors:

KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands

Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.

phik.significance.fit_test_statistic_distribution(chi2s: list | ndarray, nbins: int = 50) → Tuple[float, float, float, float]

Fit the hybrid chi2-distribution to the data to find f.

Perform a binned likelihood fit to the data to find the optimal value for the fraction f in h(x|f) = N * (f * chi2(x, ndof) + (1-f) * gauss(x, ndof, sqrt(ndof))). The parameter ndof is fixed in the fit using ndof = mean(x). The total number of datapoints N is also fixed.

Parameters:
  • chi2s (list) – input data - a list of chi2 values

  • nbins (int) – in order to fit the data a histogram is created with nbins number of bins

Returns:

f, ndof, sigma (width of gauss), bw (bin width)

phik.significance.hfunc(x: float, N: float, f: float, k: float, sigma: float) → float

Definition of the combined probability density function h(x)

h(x|f) = N * (f * chi2(x, k) + (1-f) * gauss(x, k, sigma))

Parameters:
  • x (float) – x

  • N (float) – normalisation

  • f (float) – fraction [0,1]

  • k (float) – ndof of chi2 function and mean of gauss

  • sigma (float) – width of gauss

Returns:

h(x|f)
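
For illustration, the density written out with scipy distributions (a sketch, not the package's internal implementation):

    import numpy as np
    from scipy import stats

    def hfunc_sketch(x, N, f, k, sigma):
        # h(x|f) = N * (f * chi2(x, k) + (1 - f) * gauss(x, k, sigma))
        return N * (f * stats.chi2.pdf(x, k)
                    + (1.0 - f) * stats.norm.pdf(x, loc=k, scale=sigma))

    print(hfunc_sketch(x=20.0, N=1.0, f=0.8, k=20.0, sigma=np.sqrt(20.0)))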

phik.significance.significance_from_array(x: ndarray | Series, y: ndarray | Series, num_vars=None, bins: int | list | ndarray | dict = 10, quantile: bool = False, lambda_: str = 'log-likelihood', nsim: int = 1000, significance_method: str = 'hybrid', simulation_method: str = 'multinominal', dropna: bool = True, drop_underflow: bool = True, drop_overflow: bool = True, njobs: int = -1) → Tuple[float, float]

Calculate the significance of correlation

Calculate the significance of correlation for two variables which can be of interval, ordinal or categorical type. Interval variables will be binned.

Parameters:
  • x – array-like input

  • y – array-like input

  • num_vars – list of numeric variables which need to be binned, e.g. ['x'] or ['x','y']

  • bins – number of bins, or a list of bin edges (same for all columns), or a dictionary where per column the bins are specified (default=10). E.g.: bins = {'mileage': 5, 'driver_age': [18,25,35,45,55,65,125]}

  • quantile – when bins is an integer, uniform bins (False) or bins based on quantiles (True)

  • lambda (str) – test statistic. Available options are [pearson, log-likelihood]

  • nsim (int) – number of simulated datasets

  • simulation_method (str) – simulation method. Options: [multinominal, row_product_multinominal, col_product_multinominal, hypergeometric].

  • significance_method (str) – method for significance calculation. Options: [asymptotic, MC, hybrid]

  • dropna (bool) – remove NaN values with True

  • drop_underflow (bool) – do not take into account records in underflow bin when True (relevant when binning a numeric variable)

  • drop_overflow (bool) – do not take into account records in overflow bin when True (relevant when binning a numeric variable)

  • njobs (int) – number of parallel jobs used for simulation. default is -1. 1 uses no parallel jobs.

Returns:

p-value, significance
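
Example, a minimal sketch for a numeric/categorical pair:

    import numpy as np
    from phik.significance import significance_from_array

    x = np.random.normal(size=1000)
    y = np.random.choice(['a', 'b', 'c'], size=1000)
    p_value, significance = significance_from_array(x, y, num_vars=['x'])
    print(p_value, significance)  # for independent x and y, expect low significance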

phik.significance.significance_from_binned_array(x: ndarray | Series, y: ndarray | Series, lambda_: str = 'log-likelihood', significance_method: str = 'hybrid', nsim: int = 1000, simulation_method: str = 'multinominal', dropna: bool = True, drop_underflow: bool = True, drop_overflow: bool = True, njobs: int = -1) → Tuple[float, float]

Calculate the significance of correlation

Calculate the significance of correlation for two variables which can be of interval, ordinal or categorical type. Interval variables need to be binned.

Parameters:
  • x – array-like input

  • y – array-like input

  • lambda (str) – test statistic. Available options are [pearson, log-likelihood]

  • simulation_method (str) – simulation method. Options: [multinominal, row_product_multinominal, col_product_multinominal, hypergeometric].

  • nsim (int) – number of simulated datasets

  • significance_method (str) – method for the significance calculation. Options: [asymptotic, MC, hybrid]

  • dropna (bool) – remove NaN values with True

  • drop_underflow (bool) – do not take into account records in underflow bin when True (relevant when binning a numeric variable)

  • drop_overflow (bool) – do not take into account records in overflow bin when True (relevant when binning a numeric variable)

  • njobs (int) – number of parallel jobs used for simulation. default is -1. 1 uses no parallel jobs.

Returns:

p-value, significance

phik.significance.significance_from_chi2_MC(chi2: float, values: ndarray, nsim: int = 1000, lambda_: str = 'log-likelihood', simulation_method: str = 'multinominal', chi2s=None, njobs: int = -1) Tuple[float, float]

Convert a chi2 into significance using knowledge about the shape of the chi2 distribution of simulated data

Calculate significance based on simulation (MC method), using a simple percentile.

Parameters:
  • chi2 (float) – chi2 value

  • values – contingency table of observed counts, used to simulate the chi2 values

  • nsim (int) – number of simulated datasets

  • lambda_ (str) – test statistic. Available options are [pearson, log-likelihood]

  • simulation_method (str) – simulation method. Options: [multinominal, row_product_multinominal, col_product_multinominal, hypergeometric].

  • chi2s (list) – provide your own chi2s values (optional)

  • njobs (int) – number of parallel jobs used for simulation. default is -1. 1 uses no parallel jobs.

Returns:

pvalue, significance

phik.significance.significance_from_chi2_asymptotic(values: ndarray, chi2: float) Tuple[float, float]

Convert a chi2 into significance using knowledge about the number of degrees of freedom

Conversion is done using the asymptotic approximation.

Parameters:
  • values – contingency table of observed counts, used to determine the number of degrees of freedom

  • chi2 (float) – chi2 value

Returns:

p_value, significance

phik.significance.significance_from_chi2_hybrid(chi2: float, values: ndarray, nsim: int = 1000, lambda_: str = 'log-likelihood', simulation_method: str = 'multinominal', chi2s=None, njobs: int = -1) Tuple[float, float]

Convert a chi2 into significance using a hybrid method

This method combines the asymptotic method with the MC method, but applies several corrections:

  • use effective number of degrees of freedom instead of number of degrees of freedom. The effective number of degrees of freedom is measured as mean(chi2s), with chi2s a list of simulated chi2 values.

  • for low statistics data sets, with on average less than 4 data points per bin, the distribution of chi2-values is better described by h(x|f) than by the usual chi2-distribution. Use h(x|f) to convert the chi2 value to the p-value and significance.

h(x|f) = N * (f * chi2(x, ndof) + (1-f) * gauss(x, ndof, sqrt(ndof)))

Parameters:
  • chi2 (float) – chi2 value

  • chi2s (list) – provide your own chi2s values (optional)

  • values – contingency table of observed counts, used to simulate the chi2 values

  • njobs (int) – number of parallel jobs used for simulation. default is -1. 1 uses no parallel jobs.

Returns:

p_value, significance

phik.significance.significance_from_chi2_ndof(chi2: float, ndof: float) Tuple[float, float]

Convert a chi2 into significance using knowledge about the number of degrees of freedom

Conversion is done using asymptotic approximation.

Parameters:
  • chi2 (float) – chi2 value

  • ndof (float) – number of degrees of freedom

Returns:

p_value, significance
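
Conceptually, the asymptotic conversion amounts to the sketch below (not the library's own code); the library additionally provides z_from_logp in phik.statistics to do the same conversion in log-space, retaining precision for extreme significances.

    from scipy import stats

    def chi2_to_significance(chi2_value: float, ndof: float):
        """Illustrative asymptotic chi2 -> (p-value, Z) conversion."""
        p_value = stats.chi2.sf(chi2_value, df=ndof)  # right-tail probability of the chi2 distribution
        z_value = stats.norm.isf(p_value)             # one-sided Z-value
        return p_value, z_value

    print(chi2_to_significance(35.0, 20))  # roughly p ≈ 0.02, Z ≈ 2.05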

phik.significance.significance_from_hist2d(values: ndarray, nsim: int = 1000, lambda_: str = 'log-likelihood', simulation_method: str = 'multinominal', significance_method: str = 'hybrid', njobs: int = -1) Tuple[float, float]

Calculate the significance of correlation of two variables based on the contingency table

Parameters:
  • values – contingency table

  • nsim (int) – number of simulations

  • lambda_ (str) – test statistic. Available options are [pearson, log-likelihood]

  • simulation_method (str) – simulation method. Options: [multinominal, row_product_multinominal, col_product_multinominal, hypergeometric].

  • significance_method (str) – method for the significance calculation. Options: [asymptotic, MC, hybrid]

  • njobs (int) – number of parallel jobs used for simulation. default is -1. 1 uses no parallel jobs.

Returns:

pvalue, significance
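
Illustrative usage on a small, made-up contingency table:

    import numpy as np
    from phik.significance import significance_from_hist2d

    table = np.array([[20, 10],
                      [5, 25]])
    p_value, z = significance_from_hist2d(table, nsim=100)
    print(p_value, z)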

phik.significance.significance_from_rebinned_df(data_binned: DataFrame, lambda_: str = 'log-likelihood', simulation_method: str = 'multinominal', nsim: int = 1000, significance_method: str = 'hybrid', dropna: bool = True, drop_underflow: bool = True, drop_overflow: bool = True, njobs: int = -1) DataFrame

Calculate significance of correlation of all variable combinations in the DataFrame

Parameters:
  • data_binned – input binned DataFrame

  • nsim (int) – number of simulations

  • lambda_ (str) – test statistic. Available options are [pearson, log-likelihood]

  • simulation_method (str) – simulation method. Options: [multinominal, row_product_multinominal, col_product_multinominal, hypergeometric].

  • significance_method (str) – method for the significance calculation. Options: [asymptotic, MC, hybrid]

  • dropna (bool) – remove NaN values with True

  • drop_underflow (bool) – do not take into account records in underflow bin when True (relevant when binning a numeric variable)

  • drop_overflow (bool) – do not take into account records in overflow bin when True (relevant when binning a numeric variable)

  • njobs (int) – number of parallel jobs used for simulation. default is -1.

Returns:

significance matrix

phik.significance.significance_matrix(df: DataFrame, interval_cols: list | None = None, lambda_: str = 'log-likelihood', simulation_method: str = 'multinominal', nsim: int = 1000, significance_method: str = 'hybrid', bins: int | list | ndarray | dict = 10, dropna: bool = True, drop_underflow: bool = True, drop_overflow: bool = True, verbose: bool = True, njobs: int = -1) DataFrame

Calculate significance of correlation of all variable combinations in the dataframe

Parameters:
  • df (pd.DataFrame) – input data

  • interval_cols (list) – column names of columns with interval variables.

  • nsim (int) – number of simulations

  • lambda_ (str) – test statistic. Available options are [pearson, log-likelihood]

  • simulation_method (str) – simulation method. Options: [multinominal, row_product_multinominal, col_product_multinominal, hypergeometric].

  • significance_method (str) – method for the significance calculation. Options: [asymptotic, MC, hybrid]

  • bins – number of bins, or a list of bin edges (same for all columns), or a dictionary where per column the bins are specified. (default=10) E.g.: bins = {‘mileage’:5, ‘driver_age’:[18,25,35,45,55,65,125]}

  • dropna (bool) – remove NaN values with True

  • drop_underflow (bool) – do not take into account records in underflow bin when True (relevant when binning a numeric variable)

  • drop_overflow (bool) – do not take into account records in overflow bin when True (relevant when binning a numeric variable)

  • verbose (bool) – if False, do not print all interval columns that are guessed

  • njobs (int) – number of parallel jobs used for simulation. default is -1. 1 uses no parallel jobs.

Returns:

significance matrix
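
Illustrative usage on a small synthetic DataFrame with one interval and one categorical column:

    import numpy as np
    import pandas as pd
    from phik.significance import significance_matrix

    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        'age': rng.normal(40, 10, size=500),             # interval variable
        'group': rng.choice(['a', 'b', 'c'], size=500),  # categorical variable
    })
    sm = significance_matrix(df, interval_cols=['age'], nsim=100)
    print(sm)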

phik.simulation module

Project: PhiK - correlation analyzer library

Created: 2018/09/05

Description:

Helper functions to simulate 2D datasets

Authors:

KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands

Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.

phik.simulation.sim_2d_data(hist: ndarray, ndata: int = 0) ndarray

Simulate a 2 dimensional dataset given a 2 dimensional pdf

Parameters:
  • hist (array-like) – contingency table, which contains the observed number of occurrences in each category. This table is used as probability density function.

  • ndata (int) – number of records to simulate

Returns:

simulated data
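
Conceptually, this keeps only the total number of records fixed; a rough numpy equivalent (not the library's implementation):

    import numpy as np

    hist = np.array([[10, 20],
                     [30, 40]])

    probs = hist.flatten() / hist.sum()   # normalise the contingency table to a pdf
    ndata = int(hist.sum())               # draw as many records as the original table
    sim = np.random.multinomial(ndata, probs).reshape(hist.shape)
    print(sim)                            # simulated table with the same grand total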

phik.simulation.sim_2d_data_patefield(data: ndarray, seed: int | None = None) ndarray

Simulate a two dimensional dataset with fixed row and column totals.

Simulation algorithm by Patefield: W. M. Patefield, Applied Statistics 30, 91 (1981). Python implementation inspired by the C version at https://people.sc.fsu.edu/~jburkardt/c_src/asa159/asa159.html

Parameters:
  • data – contingency table, which contains the observed number of occurrences in each category. This table is used as probability density function.

  • seed – optional seed for the simulation, primarily for testing purposes.

Returns:

simulated data

phik.simulation.sim_2d_product_multinominal(data: ndarray, axis: int) ndarray

Simulate 2 dimensional data with either row or column totals fixed.

Parameters:
  • data – contingency table, which contains the observed number of occurrences in each category. This table is used as probability density function.

  • axis – fix row totals (0) or column totals (1).

Returns:

simulated data

phik.simulation.sim_chi2_distribution(values: ndarray, nsim: int = 1000, lambda_: str = 'log-likelihood', simulation_method: str = 'multinominal', alt_hypothesis: bool = False, njobs: int = -1) list

Simulate 2D data and calculate the chi-square statistic for each simulated dataset.

Parameters:
  • values – The contingency table. The table contains the observed number of occurrences in each category

  • nsim (int) – number of simulations (optional, default=1000)

  • simulation_method (str) – sampling method. Options: [multinominal, hypergeometric, row_product_multinominal, col_product_multinominal]

  • lambda_ (str) – test statistic. Available options are [pearson, log-likelihood].

  • alt_hypothesis (bool) – if True, simulate the observed values directly, rather than their dependent frequency estimates.

  • njobs (int) – number of parallel jobs used for simulation. default is -1. 1 uses no parallel jobs.

Returns chi2s:

list of chi2 values for each simulated dataset

phik.simulation.sim_data(data: ndarray, method: str = 'multinominal') ndarray

Simulate a 2 dimensional dataset given a 2 dimensional pdf

Several simulation methods are provided:

  • multinominal: Only the total number of records is fixed.

  • row_product_multinominal: The row totals are fixed in the sampling.

  • col_product_multinominal: The column totals are fixed in the sampling.

  • hypergeometric: Both the row and column totals are fixed in the sampling. Note that this type of sampling is only available when row and column totals are integers.

Parameters:
  • data – contingency table

  • method (str) – sampling method. Options: [multinominal, hypergeometric, row_product_multinominal, col_product_multinominal]

Returns:

simulated data
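
Illustrative usage; note that the method strings use the library's own spelling, 'multinominal':

    import numpy as np
    from phik.simulation import sim_data

    table = np.array([[25, 10],
                      [15, 50]])
    sim = sim_data(table, method='multinominal')  # only the grand total is fixed
    print(sim)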

phik.statistics module

Project: PhiK - correlation coefficient package

Created: 2018/09/05

Description:

Statistics helper functions, for the calculation of phik and significance of a contingency table.

Authors:

KPMG Advanced Analytics & Big Data team, Amstelveen, The Netherlands

Redistribution and use in source and binary forms, with or without modification, are permitted according to the terms listed in the file LICENSE.

phik.statistics.estimate_ndof(chi2values: list | ndarray) float

Estimation of the effective number of degrees of freedom.

A good approximation of the effective ndof (endof) is the average of the chi2 values. Alternatively, a fit to the chi2 distribution can be made. Both values are returned.

Parameters:

chi2values (list) – list of chi2 values

Returns:

endof0, endof

phik.statistics.estimate_simple_ndof(observed: ndarray) int

Simple estimation of the effective number of degrees of freedom.

This equals the nominal calculation for ndof minus the number of empty bins in the expected contingency table.

Parameters:

observed – numpy array of observed cell counts

Returns:

endof

phik.statistics.get_chi2_using_dependent_frequency_estimates(vals: ndarray, lambda_: str = 'log-likelihood') float

Chi-square test of independence of variables in a contingency table.

The expected frequencies are based on the marginal sums of the table, i.e. dependent frequency estimates.

Parameters:

vals – The contingency table. The table contains the observed number of occurrences in each category

Returns test_statistic:

the test statistic value

phik.statistics.get_dependent_frequency_estimates(vals: ndarray) ndarray

Calculation of dependent expected frequencies.

Calculation is based on the marginal sums of the table, i.e. dependent frequency estimates.

Parameters:

vals – The contingency table. The table contains the observed number of occurrences in each category

Returns exp:

expected frequencies
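
These are the standard expected frequencies under independence; a minimal sketch of the calculation (not the library's own code):

    import numpy as np

    def expected_frequencies(vals: np.ndarray) -> np.ndarray:
        """Outer product of the marginal sums, divided by the grand total."""
        row_totals = vals.sum(axis=1, keepdims=True)  # column vector of row sums
        col_totals = vals.sum(axis=0, keepdims=True)  # row vector of column sums
        return row_totals * col_totals / vals.sum()

    obs = np.array([[20, 10],
                    [5, 25]])
    print(expected_frequencies(obs))  # [[12.5 17.5], [12.5 17.5]]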

phik.statistics.get_pearson_chi_square(observed: ndarray, expected: ndarray | None = None, normalize: bool = True) float

Calculate the Pearson chi square between an observed and an expected 2d contingency matrix

Parameters:
  • observed – The observed contingency table. The table contains the observed number of occurrences in each cell.

  • expected – The expected contingency table. The table contains the expected number of occurrences in each cell.

  • normalize (bool) – normalize expected frequencies, default is True.

Returns:

the pearson chi2 value
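
The statistic itself is the familiar sum of (O - E)^2 / E over all cells; a minimal sketch, leaving out the normalisation of the expected table that the normalize flag controls:

    import numpy as np

    def pearson_chi2(observed: np.ndarray, expected: np.ndarray) -> float:
        """Sum of (O - E)^2 / E over cells with non-zero expectation."""
        mask = expected > 0
        return float(np.sum((observed[mask] - expected[mask]) ** 2 / expected[mask]))

    obs = np.array([[20, 10], [5, 25]])
    exp = np.array([[12.5, 17.5], [12.5, 17.5]])  # expected counts under independence
    print(pearson_chi2(obs, exp))                 # ≈ 15.4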

phik.statistics.theoretical_ndof(observed: ndarray) int

Simple estimation of the effective number of degrees of freedom.

This equals the nominal calculation for ndof minus the number of empty bins in the expected contingency table.

Parameters:

observed – numpy array of observed cell counts

Returns:

theoretical ndof

phik.statistics.z_from_logp(logp: float, flip_sign: bool = False) float

Convert logarithm of p-value into one-sided Z-value

Parameters:
  • logp (float) – logarithm of p-value, should not be greater than 0

  • flip_sign (bool) – flip sign of Z-value, e.g. use for input log(1-p). Default is False.

Returns:

statistical significance Z-value

Return type:

float
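
Illustrative usage; working in log-space keeps the conversion usable even when the p-value itself would underflow a regular float:

    import numpy as np
    from phik.statistics import z_from_logp

    # one-sided Z for p = 0.001
    print(z_from_logp(np.log(1e-3)))  # ≈ 3.09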

phik.version module

Module contents

phik.global_phik_array(df: DataFrame, interval_cols: list | None = None, bins: int | list | ndarray | dict = 10, quantile: bool = False, noise_correction: bool = True, dropna: bool = True, drop_underflow: bool = True, drop_overflow: bool = True, verbose: bool = True, njobs: int = -1) Tuple[ndarray, ndarray]

Global correlation values of variables, obtained from the PhiK correlation matrix.

A global correlation value is a simple approximation of how well one feature can be modeled in terms of all others.

The global correlation coefficient is a number between zero and one, obtained from the PhiK correlation matrix, that gives the highest possible correlation between a variable and the linear combination of all other variables. See PhiK paper or for original definition: https://inspirehep.net/literature/101965

Global PhiK uses two important simplifications / approximations:

  • The variables are assumed to belong to a multinormal distribution, which is typically not the case.

  • The correlation should be a Pearson correlation matrix, allowing for negative values, which is not the case with PhiK correlations (which are positive by construction).

To correct for these, the Global PhiK values are artificially capped between 0 and 1.

Still, the Global PhiK values are useful, quick, simple estimates that are interesting in an exploratory study. For a solid, trustworthy estimate be sure to use a classifier or regression model instead.

Parameters:
  • df (pd.DataFrame) – input data

  • interval_cols (list) – column names of columns with interval variables.

  • bins – number of bins, or a list of bin edges (same for all columns), or a dictionary where per column the bins are specified (default=10). E.g.: bins = {‘mileage’:5, ‘driver_age’:[18,25,35,45,55,65,125]}

  • quantile – when bins is an integer, uniform bins (False) or bins based on quantiles (True)

  • noise_correction (bool) – apply noise correction in phik calculation

  • dropna (bool) – remove NaN values with True

  • drop_underflow (bool) – do not take into account records in underflow bin when True (relevant when binning a numeric variable)

  • drop_overflow (bool) – do not take into account records in overflow bin when True (relevant when binning a numeric variable)

  • verbose (bool) – if False, do not print all interval columns that are guessed

  • njobs (int) – number of parallel jobs used for calc of global phik. default is -1. 1 uses no parallel jobs.

Returns:

global correlations array
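
Illustrative usage; the function returns a tuple of two arrays, unpacked here into the global correlation values and, as an assumption, the matching column labels:

    import numpy as np
    import pandas as pd
    from phik import global_phik_array

    rng = np.random.default_rng(1)
    df = pd.DataFrame({
        'x': rng.normal(size=500),
        'y': rng.normal(size=500),
        'cat': rng.choice(['a', 'b'], size=500),
    })
    # second array assumed to hold the column labels matching the values
    global_corr, labels = global_phik_array(df, interval_cols=['x', 'y'])
    print(global_corr, labels)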

phik.outlier_significance_from_array(x: ndarray | list | Series, y: ndarray | list | Series, num_vars: list | None = None, bins: int | list | ndarray | dict = 10, quantile: bool = False, ndecimals: int = 1, CI_method: str = 'poisson', dropna: bool = True, drop_underflow: bool = True, drop_overflow: bool = True, verbose: bool = True) DataFrame

Calculate the significance matrix of excesses or deficits of input x and input y. x and y can contain interval, ordinal or categorical data. Use the num_vars variable to indicate whether x and/or y contain interval data.

Parameters:
  • x (list) – array-like input

  • y (list) – array-like input

  • num_vars (list) – list of variables which are numeric and need to be binned, either [‘x’], [‘y’], or [‘x’,’y’]

  • bins – number of bins, or a list of bin edges (same for all columns), or a dictionary where per column the bins are specified. (default=10) E.g.: bins = {‘mileage’:5, ‘driver_age’:[18,25,35,45,55,65,125]}

  • quantile (bool) – when the number of bins is specified, use uniform binning (False) or quantile binning (True)

  • ndecimals – number of decimals to use in labels of binned interval variables to specify bin edges (default=1)

  • CI_method (string) – method to be used for uncertainty calculation. poisson: normal poisson error. exact_poisson: error calculated from the asymmetric exact poisson interval

  • dropna (bool) – remove NaN values with True

  • drop_underflow (bool) – do not take into account records in underflow bin when True (relevant when binning a numeric variable)

  • drop_overflow (bool) – do not take into account records in overflow bin when True (relevant when binning a numeric variable)

  • verbose (bool) – if False, do not print all interval columns that are guessed

Returns:

outlier significance matrix (pd.DataFrame)

phik.outlier_significance_matrices(df: DataFrame, interval_cols: list | None = None, CI_method: str = 'poisson', ndecimals: int = 1, bins=10, quantile: bool = False, combinations: list | tuple = (), dropna: bool = True, drop_underflow: bool = True, drop_overflow: bool = True, retbins: bool = False, verbose: bool = True)

Calculate the significance matrix of excesses or deficits for all possible combinations of variables, or for those combinations specified using combinations

Parameters:
  • df – input data

  • interval_cols – columns with interval variables which need to be binned

  • CI_method (string) – method to be used for uncertainty calculation. poisson: normal poisson error. exact_poisson: error calculated from the asymmetric exact poisson interval

  • ndecimals – number of decimals to use in labels of binned interval variables to specify bin edges (default=1)

  • bins – number of bins, or a list of bin edges (same for all columns), or a dictionary where per column the bins are specified. (default=10) E.g.: bins = {‘mileage’:5, ‘driver_age’:[18,25,35,45,55,65,125]}

  • quantile (bool) – when the number of bins is specified, use uniform binning (False) or quantile binning (True)

  • combinations – in case you do not want to calculate an outlier significance matrix for all permutations of the available variables, you can specify a list of the required permutations here, in the format [(var1, var2), (var2, var4), etc]

  • dropna (bool) – remove NaN values with True

  • drop_underflow (bool) – do not take into account records in underflow bin when True (relevant when binning a numeric variable)

  • drop_overflow (bool) – do not take into account records in overflow bin when True (relevant when binning a numeric variable)

  • retbins (bool) – if true, function also returns dict with bin_edges of rebinned variables.

  • verbose (bool) – if False, do not print all interval columns that are guessed

Returns:

dictionary with outlier significance matrices (pd.DataFrame)

phik.outlier_significance_matrix(df: DataFrame, interval_cols: list | None = None, CI_method: str = 'poisson', ndecimals: int = 1, bins=10, quantile: bool = False, dropna: bool = True, drop_underflow: bool = True, drop_overflow: bool = True, retbins: bool = False, verbose: bool = True)

Calculate the significance matrix of excesses or deficits

Parameters:
  • df – input data. DataFrame must contain exactly two columns

  • interval_cols – columns with interval variables which need to be binned

  • CI_method (string) – method to be used for uncertainty calculation. poisson: normal poisson error. exact_poisson: error calculated from the asymmetric exact poisson interval

  • bins – number of bins, or a list of bin edges (same for all columns), or a dictionary where per column the bins are specified. (default=10) E.g.: bins = {‘mileage’:5, ‘driver_age’:[18,25,35,45,55,65,125]}

  • ndecimals – number of decimals to use in labels of binned interval variables to specify bin edges (default=1)

  • quantile (bool) – when the number of bins is specified, use uniform binning (False) or quantile binning (True)

  • dropna (bool) – remove NaN values with True

  • drop_underflow (bool) – do not take into account records in underflow bin when True (relevant when binning a numeric variable)

  • drop_overflow (bool) – do not take into account records in overflow bin when True (relevant when binning a numeric variable)

  • retbins (bool) – if true, function also returns dict with bin_edges of rebinned variables.

  • verbose (bool) – if False, do not print all interval columns that are guessed

Returns:

outlier significance matrix (pd.DataFrame)
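
Illustrative usage on a DataFrame with exactly two categorical columns:

    import numpy as np
    import pandas as pd
    from phik import outlier_significance_matrix

    rng = np.random.default_rng(2)
    df = pd.DataFrame({
        'color': rng.choice(['red', 'green', 'blue'], size=400),
        'size': rng.choice(['S', 'M', 'L'], size=400),
    })
    om = outlier_significance_matrix(df)  # one cell per (color, size) combination
    print(om)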

phik.phik_from_array(x: ndarray | Series, y: ndarray | Series, num_vars: str | list | None = None, bins: int | dict | list | ndarray = 10, quantile: bool = False, noise_correction: bool = True, dropna: bool = True, drop_underflow: bool = True, drop_overflow: bool = True) float

Correlation coefficient of bivariate gaussian derived from chi2-value

The chi2-value is converted into the correlation coefficient of a bivariate gaussian with correlation value rho, assuming the given binning and number of records. The correlation coefficient value is between 0 and 1.

Bivariate gaussian’s range is set to [-5,5] by construction.

Parameters:
  • x – array-like input

  • y – array-like input

  • num_vars – list of numeric variables which need to be binned, e.g. [‘x’] or [‘x’,’y’]

  • bins – number of bins, or a list of bin edges (same for all columns), or a dictionary where per column the bins are specified (default=10). E.g.: bins = {‘mileage’:5, ‘driver_age’:[18,25,35,45,55,65,125]}

  • quantile – when bins is an integer, uniform bins (False) or bins based on quantiles (True)

  • noise_correction (bool) – apply noise correction in phik calculation

  • dropna (bool) – remove NaN values with True

  • drop_underflow (bool) – do not take into account records in underflow bin when True (relevant when binning a numeric variable)

  • drop_overflow (bool) – do not take into account records in overflow bin when True (relevant when binning a numeric variable)

Returns:

phik correlation coefficient
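
Illustrative usage on two correlated interval variables:

    import numpy as np
    from phik import phik_from_array

    rng = np.random.default_rng(3)
    x = rng.normal(size=1000)
    y = 0.8 * x + rng.normal(scale=0.6, size=1000)

    coeff = phik_from_array(x, y, num_vars=['x', 'y'], bins=10)
    print(coeff)  # a value between 0 and 1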

phik.phik_matrix(df: DataFrame, interval_cols: list | None = None, bins: int | list | ndarray | dict = 10, quantile: bool = False, noise_correction: bool = True, dropna: bool = True, drop_underflow: bool = True, drop_overflow: bool = True, verbose: bool = True, njobs: int = -1) DataFrame

Correlation matrix of bivariate gaussian derived from chi2-value

The chi2-value is converted into the correlation coefficient of a bivariate gaussian with correlation value rho, assuming the given binning and number of records. The correlation coefficient value is between 0 and 1.

Bivariate gaussian’s range is set to [-5,5] by construction.

Parameters:
  • df (pd.DataFrame) – input data

  • interval_cols (list) – column names of columns with interval variables.

  • bins – number of bins, or a list of bin edges (same for all columns), or a dictionary where per column the bins are specified (default=10). E.g.: bins = {‘mileage’:5, ‘driver_age’:[18,25,35,45,55,65,125]}

  • quantile – when bins is an integer, uniform bins (False) or bins based on quantiles (True)

  • noise_correction (bool) – apply noise correction in phik calculation

  • dropna (bool) – remove NaN values with True

  • drop_underflow (bool) – do not take into account records in underflow bin when True (relevant when binning a numeric variable)

  • drop_overflow (bool) – do not take into account records in overflow bin when True (relevant when binning a numeric variable)

  • verbose (bool) – if False, do not print all interval columns that are guessed

  • njobs (int) – number of parallel jobs used for calculation of phik. default is -1. 1 uses no parallel jobs.

Returns:

phik correlation matrix
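
Illustrative usage on a small synthetic DataFrame mixing interval and categorical columns:

    import numpy as np
    import pandas as pd
    from phik import phik_matrix

    rng = np.random.default_rng(4)
    df = pd.DataFrame({
        'mileage': rng.exponential(50000, size=1000),                   # interval
        'driver_age': rng.integers(18, 90, size=1000),                  # interval
        'car_color': rng.choice(['black', 'white', 'red'], size=1000),  # categorical
    })
    corr = phik_matrix(df, interval_cols=['mileage', 'driver_age'], bins=10)
    print(corr)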

phik.significance_from_array(x: ndarray | Series, y: ndarray | Series, num_vars=None, bins: int | list | ndarray | dict = 10, quantile: bool = False, lambda_: str = 'log-likelihood', nsim: int = 1000, significance_method: str = 'hybrid', simulation_method: str = 'multinominal', dropna: bool = True, drop_underflow: bool = True, drop_overflow: bool = True, njobs: int = -1) Tuple[float, float]

Calculate the significance of correlation

Calculate the significance of correlation for two variables which can be of interval, ordinal or categorical type. Interval variables will be binned.

Parameters:
  • x – array-like input

  • y – array-like input

  • num_vars – list of numeric variables which need to be binned, e.g. [‘x’] or [‘x’,’y’]

  • bins – number of bins, or a list of bin edges (same for all columns), or a dictionary where per column the bins are specified. (default=10) E.g.: bins = {‘mileage’:5, ‘driver_age’:[18,25,35,45,55,65,125]}

  • quantile – when bins is an integer, uniform bins (False) or bins based on quantiles (True)

  • lambda_ (str) – test statistic. Available options are [pearson, log-likelihood]

  • nsim (int) – number of simulated datasets

  • simulation_method (str) – simulation method. Options: [multinominal, row_product_multinominal, col_product_multinominal, hypergeometric].

  • significance_method (str) – method for the significance calculation. Options: [asymptotic, MC, hybrid]

  • dropna (bool) – remove NaN values with True

  • drop_underflow (bool) – do not take into account records in underflow bin when True (relevant when binning a numeric variable)

  • drop_overflow (bool) – do not take into account records in overflow bin when True (relevant when binning a numeric variable)

  • njobs (int) – number of parallel jobs used for simulation. default is -1. 1 uses no parallel jobs.

Returns:

p-value, significance

phik.significance_matrix(df: DataFrame, interval_cols: list | None = None, lambda_: str = 'log-likelihood', simulation_method: str = 'multinominal', nsim: int = 1000, significance_method: str = 'hybrid', bins: int | list | ndarray | dict = 10, dropna: bool = True, drop_underflow: bool = True, drop_overflow: bool = True, verbose: bool = True, njobs: int = -1) DataFrame

Calculate significance of correlation of all variable combinations in the dataframe

Parameters:
  • df (pd.DataFrame) – input data

  • interval_cols (list) – column names of columns with interval variables.

  • nsim (int) – number of simulations

  • lambda_ (str) – test statistic. Available options are [pearson, log-likelihood]

  • simulation_method (str) – simulation method. Options: [multinominal, row_product_multinominal, col_product_multinominal, hypergeometric].

  • significance_method (str) – method for the significance calculation. Options: [asymptotic, MC, hybrid]

  • bins – number of bins, or a list of bin edges (same for all columns), or a dictionary where per column the bins are specified. (default=10) E.g.: bins = {‘mileage’:5, ‘driver_age’:[18,25,35,45,55,65,125]}

  • dropna (bool) – remove NaN values with True

  • drop_underflow (bool) – do not take into account records in underflow bin when True (relevant when binning a numeric variable)

  • drop_overflow (bool) – do not take into account records in overflow bin when True (relevant when binning a numeric variable)

  • verbose (bool) – if False, do not print all interval columns that are guessed

  • njobs (int) – number of parallel jobs used for simulation. default is -1. 1 uses no parallel jobs.

Returns:

significance matrix