Causal Learning (CL)

The Causal Learning (CL) module is a core component of the ETIA framework, designed to automate the discovery of causal relationships in complex, high-dimensional datasets. It is responsible for learning a causal graph from the features selected by the Automated Feature Selection (AFS) module. This causal graph captures the directed dependencies between variables, facilitating further tasks such as causal reasoning and prediction.

The CL module optimizes the entire causal discovery pipeline by exploring a configuration space of algorithms and hyperparameters. It searches for the causal model that best fits the available data, with the aim of producing relationships that are accurate and interpretable. Because it supports a wide variety of causal discovery algorithms, independence tests, and scoring functions, CL can be adapted to different data types (continuous, mixed, categorical) and to different assumptions about the underlying system (e.g., causal sufficiency or latent confounders).

Core Objectives

The main goals of the CL module include:

  • Learning an accurate causal graph from the selected features.

  • Optimizing the causal discovery process by searching over various algorithms and configurations.

  • Supporting different data types, including continuous, categorical, and mixed variables.

  • Handling both causally sufficient and insufficient systems (i.e., with or without latent confounders).

  • Allowing flexible integration with downstream reasoning and visualization tasks.

How CL Works

The CL module operates in three stages:

  1. Causal Configuration Generator (CG): The generator explores the configuration space of causal discovery algorithms, independence tests, and scoring functions. It selects appropriate configurations based on the characteristics of the input data, including the type (continuous, mixed, or categorical) and any assumptions regarding causal sufficiency.

  2. Causal Discovery: Once the best configuration is selected, the CL module applies the causal discovery algorithm to the data. The output is a causal graph that captures the directed dependencies between variables. This graph can be further analyzed to identify key causal relationships, intervention points, or adjustment sets.

  3. Causal Evaluation: The discovered causal graphs are evaluated using scoring functions to assess their fit to the data. The evaluation considers the accuracy of the learned structure in representing the true causal relationships.
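The first stage can be pictured as a filter over a catalogue of algorithms and tests. The sketch below is illustrative only: `CATALOGUE`, `TESTS_BY_DATA_TYPE`, and `generate_configurations` are hypothetical names, not part of ETIA, and they cover only a handful of the algorithms and tests documented on this page.

```python
from itertools import product

# Hypothetical catalogue: does each algorithm assume causal sufficiency?
# The values follow the algorithm descriptions on this page.
CATALOGUE = {
    "PC": {"causal_sufficiency": True},
    "CPC": {"causal_sufficiency": True},
    "FCI": {"causal_sufficiency": False},
    "RFCI": {"causal_sufficiency": False},
}

# Hypothetical mapping from data type to applicable independence tests.
TESTS_BY_DATA_TYPE = {
    "continuous": ["FisherZ"],
    "mixed": ["cg_lrt", "dg_lrt"],
    "categorical": ["chisquare", "gsquare"],
}


def generate_configurations(data_type, causal_sufficiency):
    """Return every (algorithm, ci_test) pair compatible with the inputs."""
    algorithms = [
        name
        for name, props in CATALOGUE.items()
        if props["causal_sufficiency"] == causal_sufficiency
    ]
    tests = TESTS_BY_DATA_TYPE[data_type]
    return [{"algorithm": a, "ci_test": t} for a, t in product(algorithms, tests)]


configs = generate_configurations("mixed", causal_sufficiency=False)
for config in configs:
    print(config)
```

Each generated configuration would then be run in the discovery stage and compared in the evaluation stage; how ETIA actually ranks configurations is not shown here.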

Available Algorithms

The CL module offers a variety of causal discovery algorithms, each suited for different data types and assumptions. These algorithms are listed below:

| Algorithm | Data Type | Description |
| --- | --- | --- |
| PC | Continuous, Mixed, Categorical | A constraint-based algorithm that uses conditional independence tests to learn the causal structure. Assumes causal sufficiency. |
| CPC | Continuous, Mixed, Categorical | A conservative variant of the PC algorithm that improves stability by handling non-faithful distributions. Assumes causal sufficiency. |
| FGES | Continuous, Mixed, Categorical | A score-based algorithm suitable for high-dimensional data. Assumes causal sufficiency. Uses scoring functions such as SEM BIC Score, BDeu, Discrete BIC, CG-BIC, and DG-BIC. |
| FCI | Continuous, Mixed, Categorical | A constraint-based algorithm that accounts for latent confounders. Uses conditional independence tests to infer the causal structure. |
| FCI-Max | Continuous, Mixed, Categorical | An extension of the FCI algorithm that maximizes certain criteria for improved causal discovery. Allows for latent variables. |
| RFCI | Continuous, Mixed, Categorical | A faster variant of the FCI algorithm (Really Fast Causal Inference) that trades slightly relaxed constraints for speed. Allows for latent variables. |
| GFCI | Continuous, Mixed, Categorical | A hybrid algorithm combining constraint-based and score-based methods. Allows for latent confounders and uses various conditional independence tests. |
| CFCI | Continuous, Mixed, Categorical | Combines features of the FCI and RFCI algorithms to enhance causal discovery in the presence of latent variables. |
| sVAR-FCI | Continuous, Mixed, Categorical (Time Series) | A time-series variant of the FCI algorithm that accounts for temporal dependencies. |
| svargFCI | Continuous, Mixed, Categorical (Time Series) | An extension of sVAR-FCI that incorporates scoring functions such as SEM BIC Score, BDeu, Discrete BIC, CG-BIC, and DG-BIC for time-series data. |
| PCMCI | Continuous, Mixed, Categorical (Time Series) | A time-series causal discovery algorithm that assumes causal sufficiency. Uses conditional independence tests such as ParCor, RobustParCor, GPDC, CMIknn, ParCorrWLS, Gsquared, CMIsymb, and RegressionCI. |
| PCMCI+ | Continuous, Mixed, Categorical (Time Series) | An extension of PCMCI that also handles contemporaneous (same-time-step) links. Uses the same conditional independence tests as PCMCI. |
| LPCMCI | Continuous, Mixed, Categorical (Time Series) | A latent-variable variant of PCMCI that accounts for unobserved confounders. Uses the same conditional independence tests as PCMCI. |
| SAM | Continuous, Mixed | A neural network-based causal discovery algorithm that does not assume causal sufficiency. Parameters include learning rate, regularization, hidden neurons, training/testing epochs, batch size, and loss type. |
| NOTEARS | Continuous, Mixed, Categorical | An optimization-based algorithm that learns causal structures using least squares and L1-regularization. Assumes causal sufficiency. |

Available Independence Tests

The CL module supports a range of conditional independence tests, enabling flexibility in testing relationships between variables across different data types:

| Test Name | Data Type | Description |
| --- | --- | --- |
| FisherZ | Continuous | Fisher's Z test of zero partial correlation; a widely used test for continuous data with linear Gaussian relationships. |
| CG-LRT | Mixed | Conditional Gaussian Likelihood Ratio Test for mixed data (continuous and categorical). |
| DG-LRT | Mixed | Degenerate Gaussian Likelihood Ratio Test for mixed data (continuous and categorical). |
| Chi-Square | Categorical | Test for independence in categorical data. |
| G-Square | Categorical | Another test for independence in categorical data, based on the G-statistic. |
| ParCor | Continuous | Test based on partial correlation. |
| RobustParCor | Continuous | A robust version of the partial correlation test, less sensitive to outliers. |
| GPDC | Continuous | Gaussian Process regression combined with a Distance Correlation test on the residuals. |
| CMIknn | Continuous | Conditional Mutual Information test using nearest neighbors. |
| ParCorrWLS | Continuous | Partial Correlation with Weighted Least Squares. |
| Gsquared | Mixed | G-squared test adapted for mixed data types. |
| CMIsymb | Mixed | Conditional Mutual Information test for symbolic (discrete-valued) data. |
| RegressionCI | Mixed | Regression-based Conditional Independence test. |
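To make the idea of a conditional independence test concrete, here is a minimal standard-library sketch of the textbook Fisher Z test with at most one conditioning variable. It is not ETIA's implementation; `pearson` and `fisher_z_pvalue` are names chosen for this example, and the data are simulated from a chain X → Z → Y.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_z_pvalue(x, y, z=None):
    """Two-sided p-value for H0: X independent of Y (given Z, if provided)."""
    n = len(x)
    r = pearson(x, y)
    k = 0  # size of the conditioning set
    if z is not None:
        # First-order partial correlation of X and Y given Z.
        r_xz, r_yz = pearson(x, z), pearson(y, z)
        r = (r - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))
        k = 1
    fisher_z = 0.5 * math.log((1 + r) / (1 - r))
    stat = math.sqrt(n - k - 3) * abs(fisher_z)
    # 2 * (1 - Phi(stat)) equals erfc(stat / sqrt(2)).
    return math.erfc(stat / math.sqrt(2))


rng = random.Random(0)
n = 1000
x = [rng.gauss(0, 1) for _ in range(n)]
z = [a + rng.gauss(0, 1) for a in x]  # X -> Z
y = [b + rng.gauss(0, 1) for b in z]  # Z -> Y

p_marginal = fisher_z_pvalue(x, y)        # X and Y are dependent
p_conditional = fisher_z_pvalue(x, y, z)  # X and Y are independent given Z
print(p_marginal, p_conditional)
```

A constraint-based algorithm such as PC would remove the X–Y edge when the conditional p-value exceeds its significance threshold. The sketch does not guard against a sample correlation of exactly ±1.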

Available Scoring Functions

To evaluate the causal graphs, the CL module includes several scoring functions, allowing flexibility in selecting the most appropriate metric for the data:

| Score Name | Data Type | Description |
| --- | --- | --- |
| SEM BIC Score | Continuous | Bayesian Information Criterion for Structural Equation Models. Suitable for continuous data. |
| BDeu | Categorical | Bayesian Dirichlet equivalent uniform score for categorical data. |
| Discrete BIC | Categorical | Bayesian Information Criterion for discrete data models. |
| CG-BIC | Mixed | Conditional Gaussian BIC score for mixed data models (continuous and categorical). |
| DG-BIC | Mixed | Degenerate Gaussian BIC score for mixed data models (continuous and categorical). |
| GFCI Score | Mixed | Scoring function used by the GFCI algorithm to evaluate causal structures. |
| svargFCI Score | Mixed | Scoring function for svargFCI; can use SEM BIC Score, BDeu, Discrete BIC, CG-BIC, or DG-BIC. |
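As an illustration of how a BIC-style score compares candidate structures, the sketch below scores a single node with and without one candidate parent under a linear Gaussian model. It uses one common convention (BIC = -2 log L + k ln n, lower is better); libraries differ in sign and penalty scaling, so this is not necessarily the formula behind SEM BIC Score. The function name `gaussian_bic` is hypothetical.

```python
import math
import random


def gaussian_bic(y, x=None):
    """BIC (lower is better) of a Gaussian model for y, optionally regressed on x."""
    n = len(y)
    my = sum(y) / n
    if x is None:
        residuals = [b - my for b in y]
        k = 2  # mean and variance
    else:
        mx = sum(x) / n
        slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
            (a - mx) ** 2 for a in x
        )
        intercept = my - slope * mx
        residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
        k = 3  # intercept, slope, and variance
    sigma2 = sum(e * e for e in residuals) / n  # ML estimate of the residual variance
    log_lik = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
    return -2 * log_lik + k * math.log(n)


rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(500)]
y = [2 * a + rng.gauss(0, 1) for a in x]  # Y truly depends on X

bic_without_parent = gaussian_bic(y)
bic_with_parent = gaussian_bic(y, x)
print(bic_without_parent, bic_with_parent)
```

Because Y is generated from X, the model that includes X as a parent gets the lower BIC: the gain in likelihood far outweighs the penalty for one extra parameter.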

Algorithm Parameters

Each algorithm may have specific parameters that can be tuned to optimize performance based on the dataset and requirements. Below are the parameters for each available algorithm:

PC Algorithm Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| ci_test | string | Type of conditional independence test to use. Options: FisherZ, cg_lrt, dg_lrt, chisquare, gsquare. |
| stable | boolean | Whether to use the stable version of the PC algorithm. |

CPC Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| ci_test | string | Type of conditional independence test to use. Options: FisherZ, cg_lrt, dg_lrt, chisquare, gsquare. |
| stable | boolean | Whether to use the stable version of the CPC algorithm. |

FGES Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| score | string | Scoring function to use. Options: sem_bic_score, bdeu, discrete_bic, cg_bic, dg_bic. |

FCI Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| ci_test | string | Type of conditional independence test to use. Options: FisherZ, cg_lrt, dg_lrt, chisquare, gsquare. |
| stable | boolean | Whether to use a stable version of the FCI algorithm. |

FCI-Max Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| ci_test | string | Type of conditional independence test to use. Options: FisherZ, cg_lrt, dg_lrt, chisquare, gsquare. |
| stable | boolean | Whether to use a stable version of the FCI-Max algorithm. |

RFCI Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| ci_test | string | Type of conditional independence test to use. Options: FisherZ, cg_lrt, dg_lrt, chisquare, gsquare. |
| stable | boolean | Whether to use a stable version of the RFCI algorithm. |

GFCI Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| ci_test | string | Type of conditional independence test to use. Options: FisherZ, cg_lrt, dg_lrt, chisquare, gsquare. |
| stable | boolean | Whether to use a stable version of the GFCI algorithm. |
| score | string | Additional scoring functions (optional): sem_bic_score, bdeu, discrete_bic, cg_bic, dg_bic. |

CFCI Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| ci_test | string | Type of conditional independence test to use. Options: FisherZ, cg_lrt, dg_lrt, chisquare, gsquare. |
| stable | boolean | Whether to use a stable version of the CFCI algorithm. |

sVAR-FCI Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| ci_test | string | Type of conditional independence test to use. Options: FisherZ, cg_lrt, dg_lrt, chisquare, gsquare. |
| stable | boolean | Whether to use a stable version of the sVAR-FCI algorithm. |
| time_series | boolean | Indicates if the data is a time series. |

svargFCI Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| ci_test | string | Type of conditional independence test to use. Options: FisherZ, cg_lrt, dg_lrt, chisquare, gsquare. |
| stable | boolean | Whether to use a stable version of the svargFCI algorithm. |
| score | string | Additional scoring functions: sem_bic_score, bdeu, discrete_bic, cg_bic, dg_bic. |
| time_series | boolean | Indicates if the data is a time series. |

PCMCI Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| ci_test | string | Type of conditional independence test to use. Options: ParCor, RobustParCor, GPDC, CMIknn, ParCorrWLS, Gsquared, CMIsymb, RegressionCI. |

PCMCI+ Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| ci_test | string | Type of conditional independence test to use. Options: ParCor, RobustParCor, GPDC, CMIknn, ParCorrWLS, Gsquared, CMIsymb, RegressionCI. |

LPCMCI Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| ci_test | string | Type of conditional independence test to use. Options: ParCor, RobustParCor, GPDC, CMIknn, ParCorrWLS, Gsquared, CMIsymb, RegressionCI. |

SAM Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| lr | float | Learning rate. Options: 0.001, 0.01, 0.1. |
| dlr | float | Decay learning rate. Options: 0.0001, 0.001, 0.01. |
| lambda1 | float | Regularization parameter 1. Options: 1, 10, 100. |
| lambda2 | float | Regularization parameter 2. Options: 0.0001, 0.001, 0.01. |
| nh | int | Number of hidden neurons. Options: 10, 20, 50. |
| dnh | int | Decay hidden neurons. Options: 100, 200, 300. |
| train_epochs | int | Number of training epochs. Options: 1000, 3000, 5000. |
| test_epochs | int | Number of testing epochs. Options: 500, 1000, 1500. |
| batch_size | int | Batch size. Options: 50, 100, 200. |
| losstype | string | Type of loss function. Options: fgan, gan, mse. |

NOTEARS Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| max_iter | int | Maximum number of iterations. Options: 100, 500, 1000. |
| h_tol | float | Tolerance for convergence. Options: 1e-7, 1e-5, 1e-3. |
| threshold | float | Threshold for edge inclusion. Options: 0.0, 0.5, 0.8. |
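The option lists in these parameter tables define a grid of candidate configurations per algorithm. The sketch below expands the NOTEARS options into every combination. The dictionary layout and the `expand_grid` helper are assumptions made for illustration; this page does not state how ETIA stores or enumerates its grids.

```python
from itertools import product

# Option lists copied from the NOTEARS parameter table.
# The dict layout itself is an assumption, not ETIA's configuration format.
NOTEARS_GRID = {
    "max_iter": [100, 500, 1000],
    "h_tol": [1e-7, 1e-5, 1e-3],
    "threshold": [0.0, 0.5, 0.8],
}


def expand_grid(grid):
    """Return one dict per combination of parameter values."""
    names = list(grid)
    return [
        dict(zip(names, values))
        for values in product(*(grid[name] for name in names))
    ]


notears_configs = expand_grid(NOTEARS_GRID)
print(len(notears_configs))  # 3 * 3 * 3 = 27
print(notears_configs[0])
```

The same expansion applied to SAM, which lists ten parameters with three options each, would give 3^10 = 59,049 combinations, which shows how quickly the configuration space grows.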

Key Details

  • Latent Variables Supported: Indicates whether an algorithm supports latent (unobserved) variables or does not support them (causal sufficiency assumed).

  • Tests/Scores Used:

      - Conditional Independence Tests (`ci_test`): Methods like FisherZ, CG-LRT, DG-LRT, Chi-Square, G-Square, ParCor, RobustParCor, GPDC, CMIknn, ParCorrWLS, Gsquared, CMIsymb, RegressionCI.
      - Scores (`score`): Metrics like SEM BIC Score, BDeu, Discrete BIC, CG-BIC, DG-BIC, GFCI Score, svargFCI Score.
      - Additional Parameters: Algorithms like SAM and NOTEARS have specific parameters relevant to their optimization and learning processes.

  • Data Type:

      - Continuous: Numeric data without discrete categories.
      - Mixed: Combination of continuous and categorical data.
      - Categorical: Data with discrete categories.
      - Time Series: Data that includes temporal dependencies.

Notes

  • Assumptions:

      - Causal Sufficiency: If set to False, the algorithm accounts for potential latent variables.
      - Assume Faithfulness: Indicates whether the algorithm assumes the faithfulness condition holds, impacting its ability to recover the true causal graph.

CL Output

The output of the CL module includes:

  • A causal graph representing the learned structure between variables.

  • The best-performing causal discovery configuration, including the selected algorithm, independence test, and scoring function.

By providing an optimized causal discovery pipeline, the CL module ensures that the causal relationships discovered are both accurate and interpretable, facilitating further analysis and reasoning.

Main Class

The main entry point for using the CL module is the CausalLearner class. This class allows users to configure and run the causal discovery process, selecting from a variety of algorithms, tests, and scoring functions. The causal graphs generated can then be passed on for downstream reasoning or visualization tasks.

class CausalLearner(dataset_input: str | Dataset | None = None, configurations: Configurations | None = None, verbose: bool = False, n_jobs: int | None = None, random_seed: int | None = None)

Bases: object

CausalLearner class for automated causal discovery.

Parameters:
  • dataset_input (str or Dataset) – Either a file path to the dataset or a Dataset instance containing the data.

  • configurations (Configurations, optional) – A Configurations object containing experiment configurations. If None, default configurations are used.

  • verbose (bool, optional) – If True, prints detailed logs. Default is False.

  • n_jobs (int, optional) – Number of jobs for parallel processing. Default is the number of CPU cores.

  • random_seed (int, optional) – Seed for random number generator to ensure reproducibility. Default is None.

learn_model()

Runs the causal discovery process using the OCT algorithm.

Returns:

A tuple containing:

  • opt_conf: The optimal configuration found.

  • matrix_mec_graph: The Markov equivalence class (MEC) graph matrix.

  • matrix_graph: The graph matrix.

  • run_time: The runtime of the CDHPO process.

  • library_results: Results from the causal discovery libraries.

Return type:

tuple
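A caller might unpack the result of learn_model() as sketched below. This assumes the tuple elements come back in the order listed above, which the page implies but does not state. Because the import path of CausalLearner is not shown on this page, the sketch uses a stand-in class; `summarize_run` and `_StubLearner` are hypothetical names, and the stub's return values are placeholders rather than real output.

```python
def summarize_run(learner):
    """Call learn_model() and label the five documented return values."""
    # Assumed order: the order in which the return values are documented.
    (opt_conf, matrix_mec_graph, matrix_graph,
     run_time, library_results) = learner.learn_model()
    return {
        "opt_conf": opt_conf,
        "matrix_mec_graph": matrix_mec_graph,
        "matrix_graph": matrix_graph,
        "run_time": run_time,
        "library_results": library_results,
    }


class _StubLearner:
    """Stand-in so the sketch runs without ETIA; NOT the real CausalLearner."""

    def learn_model(self):
        placeholder_graph = [[0, 1], [0, 0]]
        return ({"name": "pc"}, placeholder_graph, placeholder_graph, 0.1, {})


summary = summarize_run(_StubLearner())
print(summary["opt_conf"])
```

With ETIA installed, the same helper could be given a real instance, for example one constructed from a dataset file path as described in the constructor parameters.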

print_results(opt_conf=None)

Prints the results of the causal discovery process.

Parameters:

opt_conf (dict, optional) – The optimal configuration to print. If None, uses self.opt_conf.

set_dataset(dataset)

Sets the dataset for the causal learner.

Parameters:

dataset (Dataset) – The Dataset object to set.

Raises:

TypeError – If dataset is not of type Dataset.

set_configurations(configurations)

Sets the configurations for the causal learner.

Parameters:

configurations (Configurations) – The Configurations object to set.

Raises:

TypeError – If configurations is not of type Configurations.

save_progress(path=None)

Saves the progress of the experiment to a file.

Parameters:

path (str, optional) – The file path to save the progress to. If None, saves to ‘Experiment.pkl’ in results_folder.

static load_progress(path)

Loads the progress of the experiment from a file.

Parameters:

path (str) – The file path to load the progress from.

Returns:

The loaded CausalLearner object.

Return type:

CausalLearner

add_configurations_from_file(filename)

Adds additional configurations to the experiment from a JSON file.

Parameters:

filename (str) – The filename of the JSON file containing configurations.

update_learnt_model()

Updates the learnt model with the new configurations.

get_best_model_between_algorithms(algorithms)

Gets the best model between specified algorithms.

Parameters:

algorithms (list) – A list of algorithm names to consider.

Returns:

The best configuration among the specified algorithms.

Return type:

dict

get_best_model_between_family(causal_sufficiency=None, assume_faithfulness=None, is_output_mec=None, accepts_missing_values=None)

Gets the best model within a family of algorithms based on specified criteria.

Parameters:
  • causal_sufficiency (bool, optional) – Filter algorithms by their causal sufficiency assumption. False selects algorithms that admit latent variables.

  • assume_faithfulness (bool, optional) – Filter algorithms based on faithfulness assumption.

  • is_output_mec (bool, optional) – Filter algorithms that output MEC graphs.

  • accepts_missing_values (bool, optional) – Filter algorithms that accept missing values.

Returns:

The best configuration among the filtered algorithms.

Return type:

dict
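The family-based selection can be thought of as filtering algorithms on optional criteria. The sketch below assumes that a criterion left as None is ignored, which the "optional" annotations suggest but the page does not spell out. `ALGORITHM_PROPERTIES` and `filter_family` are hypothetical, and the property values are illustrative rather than taken from ETIA's metadata.

```python
# Hypothetical per-algorithm properties, for illustration only.
ALGORITHM_PROPERTIES = {
    "PC": {"causal_sufficiency": True, "is_output_mec": True},
    "NOTEARS": {"causal_sufficiency": True, "is_output_mec": False},
    "FCI": {"causal_sufficiency": False, "is_output_mec": True},
    "RFCI": {"causal_sufficiency": False, "is_output_mec": True},
}


def filter_family(properties, **criteria):
    """Return algorithm names matching every criterion that is not None."""
    active = {key: value for key, value in criteria.items() if value is not None}
    return [
        name
        for name, props in properties.items()
        if all(props.get(key) == value for key, value in active.items())
    ]


latent_capable = filter_family(ALGORITHM_PROPERTIES, causal_sufficiency=False)
sufficient_mec = filter_family(
    ALGORITHM_PROPERTIES, causal_sufficiency=True, is_output_mec=True
)
print(latent_capable)
print(sufficient_mec)
```

The best configuration would then be chosen among the configurations belonging to the surviving algorithms; that ranking step is not sketched here.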
