Causal Learning (CL)

The Causal Learning (CL) module is a core component of the ETIA framework, designed to automate the discovery of causal relationships in complex, high-dimensional datasets. It is responsible for learning a causal graph from the features selected by the Automated Feature Selection (AFS) module. This causal graph captures the directed dependencies between variables, facilitating further tasks such as causal reasoning and prediction.

The CL module optimizes the causal discovery pipeline by searching a configuration space of algorithms and hyperparameters for the causal model that best fits the available data. Because it supports a wide variety of causal discovery algorithms, independence tests, and scoring functions, CL can be adapted to different data types (continuous, mixed, categorical) and to different assumptions about the underlying system (e.g., causal sufficiency versus latent confounders).

Core Objectives

The main goals of the CL module include:

Learning an accurate causal graph from the selected features.
Optimizing the causal discovery process by searching over various algorithms and configurations.
Supporting different data types, including continuous, categorical, and mixed variables.
Handling both causally sufficient and insufficient systems (i.e., with or without latent confounders).
Allowing flexible integration with downstream reasoning and visualization tasks.

How CL Works

The CL module operates in three stages:

Causal Configuration Generator (CG): The generator explores the configuration space of causal discovery algorithms, independence tests, and scoring functions. It selects appropriate configurations based on the characteristics of the input data, including the type (continuous, mixed, or categorical) and any assumptions regarding causal sufficiency.
Causal Discovery: The CL module applies the causal discovery algorithm of each candidate configuration to the data. The output is a causal graph that captures the directed dependencies between variables. This graph can be further analyzed to identify key causal relationships, intervention points, or adjustment sets.
Causal Evaluation: The discovered causal graphs are evaluated using scoring functions to assess their fit to the data, and the best-performing configuration is selected. Because the true causal relationships are generally unknown, this fit serves as a proxy for the accuracy of the learned structure.
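
The first stage can be pictured as filtering a configuration space by data type and by the causal sufficiency assumption. The following is a minimal sketch of that idea, not the module's actual generator: the `ALGORITHMS` and `CI_TESTS` dictionaries and the `generate_configurations` function are hypothetical names, and only a handful of the algorithms and tests listed in this section are included.

```python
from itertools import product

# Hypothetical, simplified metadata for a few algorithms described in this section.
# The real CL configuration space is larger and may be encoded differently.
ALGORITHMS = {
    "pc": {"causal_sufficiency": True},
    "cpc": {"causal_sufficiency": True},
    "fci": {"causal_sufficiency": False},
    "rfci": {"causal_sufficiency": False},
}

# Conditional independence tests grouped by the data type they are listed under.
CI_TESTS = {
    "continuous": ["FisherZ"],
    "mixed": ["cg_lrt", "dg_lrt"],
    "categorical": ["chisquare", "gsquare"],
}


def generate_configurations(data_type, causal_sufficiency):
    """Enumerate candidate configurations compatible with the data and assumptions."""
    if data_type not in CI_TESTS:
        raise ValueError(f"unknown data type: {data_type!r}")
    configs = []
    for name, meta in ALGORITHMS.items():
        # Keep only algorithms whose sufficiency assumption matches the request.
        if meta["causal_sufficiency"] != causal_sufficiency:
            continue
        # Combine every compatible test with both settings of the `stable` flag.
        for ci_test, stable in product(CI_TESTS[data_type], [True, False]):
            configs.append({"algorithm": name, "ci_test": ci_test, "stable": stable})
    return configs


configs = generate_configurations("continuous", causal_sufficiency=False)
for config in configs:
    print(config)
```

Each generated configuration would then be run and scored in the second and third stages.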

Available Algorithms

The CL module offers a variety of causal discovery algorithms, each suited for different data types and assumptions. These algorithms are listed below:

Algorithm	Data Type	Description
PC	Continuous, Mixed, Categorical	A constraint-based algorithm that uses conditional independence tests to learn the causal structure. Assumes causal sufficiency. Supports continuous, mixed, and categorical data.
CPC	Continuous, Mixed, Categorical	A variant of the PC algorithm that improves stability by handling non-faithful distributions. Assumes causal sufficiency. Supports continuous, mixed, and categorical data.
FGES	Continuous, Mixed, Categorical	A score-based algorithm (Fast Greedy Equivalence Search) that assumes causal sufficiency. Suitable for high-dimensional data. Uses scoring functions such as SEM BIC Score, BDeu, Discrete BIC, CG BIC, and DG BIC.
FCI	Continuous, Mixed, Categorical	A constraint-based algorithm that accounts for latent confounders. Uses conditional independence tests to infer the causal structure. Supports continuous, mixed, and categorical data.
FCI-Max	Continuous, Mixed, Categorical	A variant of the FCI algorithm that orients colliders using a maximum p-value criterion over the candidate conditioning sets. Allows for latent variables. Supports continuous, mixed, and categorical data.
RFCI	Continuous, Mixed, Categorical	Really Fast Causal Inference, a faster variant of the FCI algorithm that performs fewer conditional independence tests at the cost of a slightly less informative output graph. Allows for latent variables. Supports continuous, mixed, and categorical data.
GFCI	Continuous, Mixed, Categorical	A hybrid algorithm combining constraint-based and score-based methods. Allows for latent confounders and utilizes various conditional independence tests. Supports continuous, mixed, and categorical data.
CFCI	Continuous, Mixed, Categorical	Conservative FCI, which applies the conservative collider-orientation rule of CPC within FCI to make orientations more robust in the presence of latent variables. Supports continuous, mixed, and categorical data.
sVAR-FCI	Continuous, Mixed, Categorical (Time Series)	A time-series variant of the FCI algorithm that accounts for temporal dependencies. Supports time series data along with continuous, mixed, and categorical data.
svargFCI	Continuous, Mixed, Categorical (Time Series)	An extension of sVAR-FCI that incorporates additional scoring functions like SEM BIC Score, BDeu, Discrete BIC, CG BIC, and DG BIC for enhanced causal discovery in time-series data.
PCMCI	Continuous, Mixed, Categorical (Time Series)	A time-series causal discovery algorithm that assumes causal sufficiency (LPCMCI is the latent-variable variant). Uses conditional independence tests such as ParCor, RobustParCor, GPDC, CMIknn, ParCorrWLS, Gsquared, CMIsymb, and RegressionCI.
PCMCI+	Continuous, Mixed, Categorical (Time Series)	An extension of PCMCI that also discovers contemporaneous (same-time-step) causal links in addition to lagged ones. Uses the same set of conditional independence tests as PCMCI.
LPCMCI	Continuous, Mixed, Categorical (Time Series)	A latent-variable variant of PCMCI that accounts for unobserved confounders. Utilizes conditional mutual information tests and correlation-based methods similar to PCMCI.
SAM	Continuous, Mixed	A neural network-based causal discovery algorithm that does not assume causal sufficiency. Includes parameters like learning rate, regularization, hidden neurons, training/testing epochs, batch size, and loss type.
NOTEARS	Continuous, Mixed, Categorical	An optimization-based algorithm that learns causal structures using least squares and L1-regularization. Assumes causal sufficiency. Supports continuous, mixed, and categorical data.

Available Independence Tests

The CL module supports a range of conditional independence tests, enabling flexibility in testing relationships between variables across different data types:

Test Name	Data Type	Description
FisherZ	Continuous	A widely used test for continuous data.
CG-LRT	Mixed	Conditional Gaussian Likelihood Ratio Test for mixed data (continuous and categorical).
DG-LRT	Mixed	Degenerate Gaussian Likelihood Ratio Test for mixed data (continuous and discrete).
Chi-Square	Categorical	Test for independence in categorical data.
G-Square	Categorical	Another test for independence in categorical data, based on the G-statistic.
ParCor	Continuous	Test based on partial correlation.
RobustParCor	Continuous	A robust version of the partial correlation test, less sensitive to outliers.
GPDC	Continuous	Gaussian Process regression combined with a Distance Correlation test on the residuals.
CMIknn	Continuous	Conditional Mutual Information test using nearest neighbors.
ParCorrWLS	Continuous	Partial Correlation with Weighted Least Squares.
Gsquared	Categorical	G-squared conditional independence test for categorical (discrete) time-series data.
CMIsymb	Categorical	Conditional Mutual Information test for symbolic (discrete or categorical) data.
RegressionCI	Mixed	Regression-based Conditional Independence test.
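
To make the role of a conditional independence test concrete, here is a minimal sketch of a Fisher Z test for continuous data with at most one conditioning variable. It is a textbook formulation written with the standard library only, not the implementation used by the CL module; the helper names `pearson` and `fisher_z_pvalue` and the simulated chain X → Z → Y are assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist


def pearson(x, y):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_z_pvalue(x, y, z=None):
    """Two-sided p-value for 'X independent of Y', optionally given one variable Z."""
    n = len(x)
    r = pearson(x, y)
    k = 0  # size of the conditioning set
    if z is not None:
        # First-order partial correlation of X and Y given Z.
        r_xz, r_yz = pearson(x, z), pearson(y, z)
        r = (r - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))
        k = 1
    r = max(min(r, 0.999999), -0.999999)  # guard against log of zero
    z_stat = math.sqrt(n - k - 3) * 0.5 * math.log((1 + r) / (1 - r))
    return 2 * (1 - NormalDist().cdf(abs(z_stat)))


# Simulated chain X -> Z -> Y: X and Y are dependent, but independent given Z.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(2000)]
z = [a + rng.gauss(0, 1) for a in x]
y = [b + rng.gauss(0, 1) for b in z]

p_marginal = fisher_z_pvalue(x, y)
p_conditional = fisher_z_pvalue(x, y, z)
print("p-value, X vs Y:", p_marginal)
print("p-value, X vs Y given Z:", p_conditional)
```

A constraint-based algorithm such as PC would remove the X–Y edge when the conditional p-value exceeds its significance level.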

Available Scoring Functions

To evaluate the causal graphs, the CL module includes several scoring functions, allowing flexibility in selecting the most appropriate metric for the data:

Score Name	Data Type	Description
SEM BIC Score	Continuous	Bayesian Information Criterion for Structural Equation Models. Suitable for continuous data.
BDeu	Categorical	Bayesian Dirichlet equivalent uniform score for categorical data.
Discrete BIC	Categorical	Bayesian Information Criterion for discrete data models.
CG-BIC	Mixed	BIC score for mixed data models (continuous and categorical).
DG-BIC	Mixed	BIC score for Degenerate Gaussian models of mixed data.
GFCI Score	Mixed	Scoring function used by the GFCI algorithm to evaluate causal structures.
svargFCI Score	Mixed	Enhanced scoring function for svargFCI with additional metrics like SEM BIC Score, BDeu, Discrete BIC, CG BIC, and DG BIC.
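
As an illustration of how a BIC-type score compares candidate structures, the sketch below scores a linear Gaussian model for a variable with and without a single parent. It uses the generic form BIC = n·ln(RSS/n) + k·ln(n), where lower is better; the sign and scaling conventions of the scores listed above may differ, and `gaussian_bic` is a hypothetical helper rather than part of the CL module.

```python
import math
import random


def gaussian_bic(y, x=None):
    """BIC (lower is better) of a linear Gaussian model for y with zero or one parent x."""
    n = len(y)
    my = sum(y) / n
    if x is None:
        residuals = [v - my for v in y]
        k = 2  # intercept and noise variance
    else:
        mx = sum(x) / n
        slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
        intercept = my - slope * mx
        residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
        k = 3  # intercept, slope, and noise variance
    rss = sum(e * e for e in residuals)
    # Gaussian log-likelihood term plus the complexity penalty, constants dropped.
    return n * math.log(rss / n) + k * math.log(n)


rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(500)]
y = [2.0 * a + rng.gauss(0, 1) for a in x]

bic_without_parent = gaussian_bic(y)
bic_with_parent = gaussian_bic(y, x)
print("BIC without parent:", bic_without_parent)
print("BIC with parent X: ", bic_with_parent)
```

Here the model that includes the true parent should obtain the lower (better) score, which is the kind of comparison a score-based search performs repeatedly.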

Algorithm Parameters

Each algorithm may have specific parameters that can be tuned to optimize performance based on the dataset and requirements. Below are the parameters for each available algorithm:

PC Algorithm Parameters

Parameter	Type	Description
ci_test	string	Type of conditional independence test to use. Options: FisherZ, cg_lrt, dg_lrt, chisquare, gsquare.
stable	boolean	Whether to use the stable version of the PC algorithm.

CPC Parameters

Parameter	Type	Description
ci_test	string	Type of conditional independence test to use. Options: FisherZ, cg_lrt, dg_lrt, chisquare, gsquare.
stable	boolean	Whether to use the stable version of the CPC algorithm.

FGES Parameters

Parameter	Type	Description
score	string	Scoring function to use. Options: sem_bic_score, bdeu, discrete_bic, cg_bic, dg_bic.

FCI Parameters

Parameter	Type	Description
ci_test	string	Type of conditional independence test to use. Options: FisherZ, cg_lrt, dg_lrt, chisquare, gsquare.
stable	boolean	Whether to use a stable version of the FCI algorithm.

FCI-Max Parameters

Parameter	Type	Description
ci_test	string	Type of conditional independence test to use. Options: FisherZ, cg_lrt, dg_lrt, chisquare, gsquare.
stable	boolean	Whether to use a stable version of the FCI-Max algorithm.

RFCI Parameters

Parameter	Type	Description
ci_test	string	Type of conditional independence test to use. Options: FisherZ, cg_lrt, dg_lrt, chisquare, gsquare.
stable	boolean	Whether to use a stable version of the RFCI algorithm.

GFCI Parameters

Parameter	Type	Description
ci_test	string	Type of conditional independence test to use. Options: FisherZ, cg_lrt, dg_lrt, chisquare, gsquare.
stable	boolean	Whether to use a stable version of the GFCI algorithm.
score	string	Scoring function to use (optional). Options: sem_bic_score, bdeu, discrete_bic, cg_bic, dg_bic.

CFCI Parameters

Parameter	Type	Description
ci_test	string	Type of conditional independence test to use. Options: FisherZ, cg_lrt, dg_lrt, chisquare, gsquare.
stable	boolean	Whether to use a stable version of the CFCI algorithm.

sVAR-FCI Parameters

Parameter	Type	Description
ci_test	string	Type of conditional independence test to use. Options: FisherZ, cg_lrt, dg_lrt, chisquare, gsquare.
stable	boolean	Whether to use a stable version of the sVAR-FCI algorithm.
time_series	boolean	Indicates if the data is a time series.

svargFCI Parameters

Parameter	Type	Description
ci_test	string	Type of conditional independence test to use. Options: FisherZ, cg_lrt, dg_lrt, chisquare, gsquare.
stable	boolean	Whether to use a stable version of the svargFCI algorithm.
score	string	Scoring function to use. Options: sem_bic_score, bdeu, discrete_bic, cg_bic, dg_bic.
time_series	boolean	Indicates if the data is a time series.

PCMCI Parameters

Parameter	Type	Description
ci_test	string	Type of conditional independence test to use. Options: ParCor, RobustParCor, GPDC, CMIknn, ParCorrWLS, Gsquared, CMIsymb, RegressionCI.

PCMCI+ Parameters

Parameter	Type	Description
ci_test	string	Type of conditional independence test to use. Options: ParCor, RobustParCor, GPDC, CMIknn, ParCorrWLS, Gsquared, CMIsymb, RegressionCI.

LPCMCI Parameters

Parameter	Type	Description
ci_test	string	Type of conditional independence test to use. Options: ParCor, RobustParCor, GPDC, CMIknn, ParCorrWLS, Gsquared, CMIsymb, RegressionCI.

SAM Parameters

Parameter	Type	Description
lr	float	Learning rate. Options: 0.001, 0.01, 0.1.
dlr	float	Learning rate of the discriminator network. Options: 0.0001, 0.001, 0.01.
lambda1	float	Regularization parameter 1. Options: 1, 10, 100.
lambda2	float	Regularization parameter 2. Options: 0.0001, 0.001, 0.01.
nh	int	Number of hidden neurons. Options: 10, 20, 50.
dnh	int	Number of hidden neurons in the discriminator network. Options: 100, 200, 300.
train_epochs	int	Number of training epochs. Options: 1000, 3000, 5000.
test_epochs	int	Number of testing epochs. Options: 500, 1000, 1500.
batch_size	int	Batch size. Options: 50, 100, 200.
losstype	string	Type of loss function. Options: fgan, gan, mse.
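
The option lists above define a hyperparameter grid, and it can help to see how large that grid is. The sketch below encodes the SAM options from the table and counts or samples configurations; `SAM_GRID`, `grid_size`, `iter_grid`, and `sample_configuration` are hypothetical names, and whether CL enumerates the grid exhaustively or samples from it is not stated here.

```python
import random
from itertools import product
from math import prod

# Option lists copied from the SAM parameter table.
SAM_GRID = {
    "lr": [0.001, 0.01, 0.1],
    "dlr": [0.0001, 0.001, 0.01],
    "lambda1": [1, 10, 100],
    "lambda2": [0.0001, 0.001, 0.01],
    "nh": [10, 20, 50],
    "dnh": [100, 200, 300],
    "train_epochs": [1000, 3000, 5000],
    "test_epochs": [500, 1000, 1500],
    "batch_size": [50, 100, 200],
    "losstype": ["fgan", "gan", "mse"],
}


def grid_size(grid):
    """Number of distinct configurations in a full grid."""
    return prod(len(options) for options in grid.values())


def iter_grid(grid):
    """Lazily yield every configuration of the grid as a dict."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


def sample_configuration(grid, rng):
    """Draw one configuration uniformly at random from the grid."""
    return {key: rng.choice(options) for key, options in grid.items()}


print("full SAM grid size:", grid_size(SAM_GRID))  # 3 ** 10 = 59049
print("random configuration:", sample_configuration(SAM_GRID, random.Random(0)))
```

With ten parameters of three options each, the full SAM grid has 59,049 configurations, so restricting the options or sampling is usually advisable for neural methods with long training times.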

NOTEARS Parameters

Parameter	Type	Description
max_iter	int	Maximum number of iterations. Options: 100, 500, 1000.
h_tol	float	Tolerance on the acyclicity constraint, used as the convergence criterion. Options: 1e-7, 1e-5, 1e-3.
threshold	float	Threshold for edge inclusion. Options: 0.0, 0.5, 0.8.
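
The `h_tol` and `threshold` parameters are easier to interpret with the NOTEARS acyclicity function in view: h(W) = tr(exp(W ∘ W)) − d, which is zero exactly when the weighted adjacency matrix W describes a DAG. The sketch below evaluates h with a truncated Taylor series in pure Python and applies an edge threshold. The helper names are hypothetical, and the strict inequality used for thresholding is an assumption, not necessarily what the wrapped implementation does.

```python
import math


def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def acyclicity(w, terms=30):
    """h(W) = trace(exp(W * W)) - d, with an elementwise square and a Taylor series."""
    d = len(w)
    squared = [[w[i][j] ** 2 for j in range(d)] for i in range(d)]
    power = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    trace = float(d)  # contribution of the identity term
    for m in range(1, terms):
        # After this step, `power` holds squared^m / m!.
        power = [[v / m for v in row] for row in matmul(power, squared)]
        trace += sum(power[i][i] for i in range(d))
    return trace - d


def apply_threshold(w, threshold):
    """Keep an edge i -> j only if |w[i][j]| exceeds the threshold (assumed rule)."""
    return [[1 if abs(v) > threshold else 0 for v in row] for row in w]


dag = [[0.0, 0.8], [0.0, 0.0]]    # single edge 0 -> 1
cycle = [[0.0, 1.0], [1.0, 0.0]]  # 0 -> 1 and 1 -> 0

print("h(dag)   =", acyclicity(dag))
print("h(cycle) =", acyclicity(cycle))
print("thresholded:", apply_threshold([[0.0, 0.3], [0.9, 0.0]], 0.5))
```

Optimization stops once h(W) falls below `h_tol`, and small remaining weights are then pruned with `threshold` to obtain the final graph.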

Key Details

Latent Variables Supported:
- ✓: Supports latent (unobserved) variables.
- ✕: Does not support latent variables (causal sufficiency assumed).

Tests/Scores Used:
- Conditional Independence Tests (`ci_test`): Methods like FisherZ, CG-LRT, DG-LRT, Chi-Square, G-Square, ParCor, RobustParCor, GPDC, CMIknn, ParCorrWLS, Gsquared, CMIsymb, RegressionCI.
- Scores (`score`): Metrics like SEM BIC Score, BDeu, Discrete BIC, CG-BIC, DG-BIC, GFCI Score, svargFCI Score.
- Additional Parameters: Algorithms like SAM and NOTEARS have specific parameters relevant to their optimization and learning processes.

Data Type:
- Continuous: Numeric data without discrete categories.
- Mixed: Combination of continuous and categorical data.
- Categorical: Data with discrete categories.
- Time Series: Data that includes temporal dependencies.

Notes

Assumptions:
- Causal Sufficiency: If set to False, the algorithm accounts for potential latent variables.
- Assume Faithfulness: Indicates whether the algorithm assumes the faithfulness condition holds, impacting its ability to recover the true causal graph.

CL Output

The output of the CL module includes:

A causal graph representing the learned structure between variables.
The best-performing causal discovery configuration, including the selected algorithm, independence test, and scoring function.

By providing an optimized causal discovery pipeline, the CL module aims to produce causal relationships that are both accurate and interpretable, facilitating further analysis and reasoning.

Main Class

The main entry point for using the CL module is the CausalLearner class. This class allows users to configure and run the causal discovery process, selecting from a variety of algorithms, tests, and scoring functions. The causal graphs generated can then be passed on for downstream reasoning or visualization tasks.

Bases: object

CausalLearner class for automated causal discovery.

Parameters:

dataset_input (str or Dataset) – Either a file path to the dataset or a Dataset instance containing the data.
configurations (Configurations, optional) – A Configurations object containing experiment configurations. If None, default configurations are used.
verbose (bool, optional) – If True, prints detailed logs. Default is False.
n_jobs (int, optional) – Number of jobs for parallel processing. Default is the number of CPU cores.
random_seed (int, optional) – Seed for random number generator to ensure reproducibility. Default is None.

learn_model()[source]: Runs the causal discovery process.

print_results(opt_conf=None)[source]: Prints the results of the causal discovery process.

set_dataset(dataset)[source]: Sets the dataset for the causal learner.

set_configurations(configurations)[source]: Sets the configurations for the causal learner.

save_progress(path=None)[source]: Saves the progress of the experiment to a file.

load_progress(path)[source]: Loads the progress of the experiment from a file.

add_configurations_from_file(filename)[source]: Adds additional configurations to the experiment from a JSON file.

update_learnt_model()[source]: Updates the learnt model with new configurations.

get_best_model_between_algorithms(algorithms)[source]: Gets the best model between specified algorithms.

get_best_model_between_family(**kwargs)[source]: Gets the best model within a family of algorithms based on specified criteria.

learn_model()

Runs the causal discovery process using the OCT (Out-of-sample Causal Tuning) algorithm.

Returns:

A tuple containing:

opt_conf: The optimal configuration found.
matrix_mec_graph: The MEC (Markov equivalence class) graph matrix.
matrix_graph: The graph matrix.
run_time: The runtime of the CDHPO (causal discovery hyperparameter optimization) process.
library_results: Results from the causal discovery libraries.

Return type:

tuple
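
A typical call sequence, using only the constructor parameters and methods documented on this page, might look like the sketch below. The import path `ETIA.CausalLearning` is an assumption and may differ in your installation; `run_causal_discovery`, the chosen argument values, and the path arguments are hypothetical.

```python
try:
    # Assumed import path; adjust it to match your ETIA installation.
    from ETIA.CausalLearning import CausalLearner
except ImportError:
    CausalLearner = None


def run_causal_discovery(dataset_path, progress_path=None):
    """Run CL on a dataset file and return the optimal configuration and graphs."""
    if CausalLearner is None:
        raise RuntimeError("ETIA is not installed or the import path differs")

    learner = CausalLearner(
        dataset_input=dataset_path,  # file path or Dataset instance
        verbose=True,
        n_jobs=2,
        random_seed=42,              # fixed seed for reproducibility
    )

    # learn_model() returns a tuple of five elements in this documented order.
    opt_conf, matrix_mec_graph, matrix_graph, run_time, library_results = learner.learn_model()

    learner.print_results()               # prints the stored optimal configuration
    learner.save_progress(progress_path)  # None falls back to the default location
    return opt_conf, matrix_mec_graph, matrix_graph
```

The function raises a `RuntimeError` instead of failing at import time, so the surrounding script can still be loaded when ETIA is unavailable.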

print_results(opt_conf=None)

Prints the results of the causal discovery process.

Parameters: opt_conf (dict, optional) – The optimal configuration to print. If None, uses self.opt_conf.

set_dataset(dataset)

Sets the dataset for the causal learner.

Parameters: dataset (Dataset) – The Dataset object to set.
Raises: TypeError – If dataset is not of type Dataset.

set_configurations(configurations)

Sets the configurations for the causal learner.

Parameters: configurations (Configurations) – The Configurations object to set.
Raises: TypeError – If configurations is not of type Configurations.

save_progress(path=None)

Saves the progress of the experiment to a file.

Parameters: path (str, optional) – The file path to save the progress to. If None, saves to 'Experiment.pkl' in results_folder.

static load_progress(path)

Loads the progress of the experiment from a file.

Parameters: path (str) – The file path to load the progress from.
Returns: The loaded CausalLearner object.
Return type: CausalLearner

add_configurations_from_file(filename)

Adds additional configurations to the experiment from a JSON file.

Parameters: filename (str) – The filename of the JSON file containing configurations.

update_learnt_model()

Updates the learnt model with the new configurations.

get_best_model_between_algorithms(algorithms)

Gets the best model between specified algorithms.

Parameters: algorithms (list) – A list of algorithm names to consider.
Returns: The best configuration among the specified algorithms.
Return type: dict

get_best_model_between_family(causal_sufficiency=None, assume_faithfulness=None, is_output_mec=None, accepts_missing_values=None)

Gets the best model within a family of algorithms based on specified criteria. A criterion left as None is not used for filtering.

Parameters:

causal_sufficiency (bool, optional) – Filter algorithms by whether they assume causal sufficiency; False selects algorithms that admit latent variables.
assume_faithfulness (bool, optional) – Filter algorithms by whether they assume faithfulness.
is_output_mec (bool, optional) – Filter algorithms by whether they output MEC graphs.
accepts_missing_values (bool, optional) – Filter algorithms by whether they accept missing values.

Returns:

The best configuration among the filtered algorithms.

Return type:

dict
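
The family-based selection can be understood as filtering evaluated configurations by their algorithm properties and then taking the best one. The following sketch mirrors that behaviour under stated assumptions: the `results` records, their `flags` and `score` fields, the rule that higher scores are better, and the function `best_model_between_family` are all hypothetical, and the scores are made-up numbers for illustration only.

```python
def best_model_between_family(results, **criteria):
    """Return the best-scoring entry whose flags match every non-None criterion."""
    # Criteria left as None are ignored, matching the optional keyword arguments.
    active = {key: value for key, value in criteria.items() if value is not None}
    candidates = [
        entry for entry in results
        if all(entry["flags"].get(key) == value for key, value in active.items())
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda entry: entry["score"])


# Made-up evaluation results; higher score is assumed to be better.
results = [
    {"name": "pc", "flags": {"causal_sufficiency": True, "is_output_mec": True}, "score": 0.71},
    {"name": "fci", "flags": {"causal_sufficiency": False, "is_output_mec": True}, "score": 0.64},
    {"name": "notears", "flags": {"causal_sufficiency": True, "is_output_mec": False}, "score": 0.58},
    {"name": "rfci", "flags": {"causal_sufficiency": False, "is_output_mec": True}, "score": 0.69},
]

best_latent = best_model_between_family(results, causal_sufficiency=False)
best_overall = best_model_between_family(results)
print("best among latent-variable algorithms:", best_latent["name"])
print("best overall:", best_overall["name"])
```

Passing several criteria at once narrows the family further, for example `causal_sufficiency=True, is_output_mec=False`.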

Helper Classes

Contents: