Causal Learning (CL)
The Causal Learning (CL) module is a core component of the ETIA framework, designed to automate the discovery of causal relationships in complex, high-dimensional datasets. It is responsible for learning a causal graph from the features selected by the Automated Feature Selection (AFS) module. This causal graph captures the directed dependencies between variables, facilitating further tasks such as causal reasoning and prediction.
The CL module optimizes the causal discovery pipeline by searching a configuration space of algorithms and hyperparameters for the causal model that best fits the available data. Because it supports a wide range of causal discovery algorithms, independence tests, and scoring functions, CL can be adapted to different data types (continuous, mixed, categorical) and to different assumptions about the underlying system (e.g., causal sufficiency or the presence of latent confounders).
Core Objectives
The main goals of the CL module include:
- Learning an accurate causal graph from the selected features.
- Optimizing the causal discovery process by searching over various algorithms and configurations.
- Supporting different data types, including continuous, categorical, and mixed variables.
- Handling both causally sufficient and insufficient systems (i.e., with or without latent confounders).
- Allowing flexible integration with downstream reasoning and visualization tasks.
How CL Works
The CL module operates in three stages:
- Causal Configuration Generator (CG): The generator explores the configuration space of causal discovery algorithms, independence tests, and scoring functions. It selects appropriate configurations based on the characteristics of the input data, including the type (continuous, mixed, or categorical) and any assumptions regarding causal sufficiency.
- Causal Discovery: Once the best configuration is selected, the CL module applies the causal discovery algorithm to the data. The output is a causal graph that captures the directed dependencies between variables. This graph can be further analyzed to identify key causal relationships, intervention points, or adjustment sets.
- Causal Evaluation: The discovered causal graphs are evaluated using scoring functions to assess their fit to the data. The evaluation considers the accuracy of the learned structure in representing the true causal relationships.
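The three stages amount to a search loop over candidate configurations. The following is a minimal sketch of that loop, not ETIA's implementation: `select_best_configuration`, `run`, and `evaluate` are hypothetical names, graphs are represented as plain sets of edges, and the sketch assumes that a higher score is better.

```python
from typing import Callable, Iterable, Optional, Tuple


def select_best_configuration(
    configurations: Iterable[dict],
    run: Callable[[dict], object],
    evaluate: Callable[[object], float],
) -> Optional[Tuple[float, dict, object]]:
    """Return (score, configuration, graph) for the best-scoring configuration."""
    best = None
    for conf in configurations:
        graph = run(conf)        # causal discovery with this configuration
        score = evaluate(graph)  # causal evaluation of the resulting graph
        if best is None or score > best[0]:
            best = (score, conf, graph)
    return best


# Toy stand-ins so the sketch runs: a "graph" is a set of directed edges and
# the toy score counts edges shared with a made-up reference set.
reference_edges = {("A", "B"), ("B", "C")}
toy_outputs = {
    "PC": {("A", "B")},
    "FCI": {("A", "B"), ("B", "C")},
}
configs = [{"algorithm": "PC"}, {"algorithm": "FCI"}]

best = select_best_configuration(
    configs,
    run=lambda conf: toy_outputs[conf["algorithm"]],
    evaluate=lambda graph: len(graph & reference_edges),
)
print(best[1])
```

In a real run no reference graph is available, so the evaluation step has to rely on how well each candidate graph fits the data; the toy score above only makes the loop observable.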
Available Algorithms
The CL module offers a variety of causal discovery algorithms, each suited for different data types and assumptions. These algorithms are listed below:
| Algorithm | Data Type | Description |
|---|---|---|
| PC | Continuous, Mixed, Categorical | A constraint-based algorithm that uses conditional independence tests to learn the causal structure. Assumes causal sufficiency. Supports continuous, mixed, and categorical data. |
| CPC | Continuous, Mixed, Categorical | A variant of the PC algorithm that improves stability by handling non-faithful distributions. Assumes causal sufficiency. Supports continuous, mixed, and categorical data. |
| FGES | Continuous, Mixed, Categorical | A score-based algorithm that assumes causal sufficiency. Suitable for high-dimensional data. Utilizes various scoring functions like SEM BIC Score, BDeu, Discrete BIC, CG-BIC, and DG-BIC. |
| FCI | Continuous, Mixed, Categorical | A constraint-based algorithm that accounts for latent confounders. Uses conditional independence tests to infer the causal structure. Supports continuous, mixed, and categorical data. |
| FCI-Max | Continuous, Mixed, Categorical | An extension of the FCI algorithm that maximizes certain criteria for improved causal discovery. Assumes the presence of latent variables. Supports continuous, mixed, and categorical data. |
| RFCI | Continuous, Mixed, Categorical | A relaxed version of the FCI algorithm that offers faster performance with slightly relaxed constraints. Assumes the presence of latent variables. Supports continuous, mixed, and categorical data. |
| GFCI | Continuous, Mixed, Categorical | A hybrid algorithm combining constraint-based and score-based methods. Allows for latent confounders and utilizes various conditional independence tests. Supports continuous, mixed, and categorical data. |
| CFCI | Continuous, Mixed, Categorical | Combines features of the FCI and RFCI algorithms to enhance causal discovery in the presence of latent variables. Supports continuous, mixed, and categorical data. |
| sVAR-FCI | Continuous, Mixed, Categorical (Time Series) | A time-series variant of the FCI algorithm that accounts for temporal dependencies. Supports time series data along with continuous, mixed, and categorical data. |
| svargFCI | Continuous, Mixed, Categorical (Time Series) | An extension of sVAR-FCI that incorporates additional scoring functions like SEM BIC Score, BDeu, Discrete BIC, CG-BIC, and DG-BIC for enhanced causal discovery in time-series data. |
| PCMCI | Continuous, Mixed, Categorical (Time Series) | |
| PCMCI+ | Continuous, Mixed, Categorical (Time Series) | An enhanced version of PCMCI with improved handling of time lags and dependencies. Utilizes the same set of conditional mutual information tests and correlation-based methods as PCMCI. |
| LPCMCI | Continuous, Mixed, Categorical (Time Series) | A latent-variable variant of PCMCI that accounts for unobserved confounders. Utilizes conditional mutual information tests and correlation-based methods similar to PCMCI. |
| SAM | Continuous, Mixed | A neural network-based causal discovery algorithm that does not assume causal sufficiency. Includes parameters like learning rate, regularization, hidden neurons, training/testing epochs, batch size, and loss type. |
| NOTEARS | Continuous, Mixed, Categorical | An optimization-based algorithm that learns causal structures using least squares and L1-regularization. Assumes causal sufficiency. Supports continuous, mixed, and categorical data. |
Available Independence Tests
The CL module supports a range of conditional independence tests, enabling flexibility in testing relationships between variables across different data types:
| Test Name | Data Type | Description |
|---|---|---|
| FisherZ | Continuous | A widely used test for continuous data, based on the Fisher z-transform of the partial correlation. |
| CG-LRT | Mixed | Conditional Gaussian Likelihood Ratio Test for mixed data (continuous and categorical). |
| DG-LRT | Mixed | Degenerate Gaussian Likelihood Ratio Test for mixed data (continuous and categorical). |
| Chi-Square | Categorical | Test for independence in categorical data. |
| G-Square | Categorical | Another test for independence in categorical data, based on the G-statistic. |
| ParCor | Continuous | Test based on partial correlation. |
| RobustParCor | Continuous | A robust version of the partial correlation test, less sensitive to outliers. |
| GPDC | Continuous | Test that combines Gaussian process regression with distance correlation. |
| CMIknn | Continuous | Conditional Mutual Information test using nearest neighbors. |
| ParCorrWLS | Continuous | Partial Correlation with Weighted Least Squares. |
| Gsquared | Mixed | G-squared test adapted for mixed data types. |
| CMIsymb | Mixed | Conditional Mutual Information test for symbolic (discrete-valued) data. |
| RegressionCI | Mixed | Regression-based Conditional Independence test. |
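As a worked illustration of how a conditional independence test produces a decision, the sketch below implements the textbook Fisher z-test for a single conditioning variable using only the standard library. It is not the implementation ETIA calls; the function names and the correlation values are made up for the example.

```python
import math


def partial_correlation(r_xy: float, r_xz: float, r_yz: float) -> float:
    """First-order partial correlation of X and Y given a single variable Z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))


def fisher_z_pvalue(r: float, n: int, cond_set_size: int) -> float:
    """Two-sided p-value for the null hypothesis that the (partial) correlation is zero."""
    z = 0.5 * math.log((1 + r) / (1 - r))             # Fisher z-transform
    stat = math.sqrt(n - cond_set_size - 3) * abs(z)  # approximately N(0, 1) under the null
    return math.erfc(stat / math.sqrt(2))             # equals 2 * (1 - Phi(stat))


# X and Y are correlated only through Z: r_xy equals r_xz * r_yz.
r_given_z = partial_correlation(r_xy=0.48, r_xz=0.8, r_yz=0.6)
p_conditional = fisher_z_pvalue(r_given_z, n=500, cond_set_size=1)
p_marginal = fisher_z_pvalue(0.48, n=500, cond_set_size=0)
```

Here `p_marginal` is tiny (X and Y look dependent), while `p_conditional` is close to 1 (they are independent given Z). A constraint-based algorithm such as PC would use the second result to remove the X-Y edge.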
Available Scoring Functions
To evaluate the causal graphs, the CL module includes several scoring functions, allowing flexibility in selecting the most appropriate metric for the data:
| Score Name | Data Type | Description |
|---|---|---|
| SEM BIC Score | Continuous | Bayesian Information Criterion for Structural Equation Models. Suitable for continuous data. |
| BDeu | Categorical | Bayesian Dirichlet equivalent uniform score for categorical data. |
| Discrete BIC | Categorical | Bayesian Information Criterion for discrete data models. |
| CG-BIC | Mixed | BIC score for conditional Gaussian models of mixed data (continuous and categorical). |
| DG-BIC | Mixed | BIC score for degenerate Gaussian models of mixed data. |
| GFCI Score | Mixed | Scoring function used by the GFCI algorithm to evaluate causal structures. |
| svargFCI Score | Mixed | Enhanced scoring function for svargFCI with additional metrics like SEM BIC Score, BDeu, Discrete BIC, CG-BIC, and DG-BIC. |
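To make the idea of a BIC-style score concrete, the sketch below compares a linear-Gaussian model of `y` with and without `x` as a parent, using the generic definition BIC = k ln(n) - 2 ln(L). This is an illustration only: sign conventions and penalty weights differ between libraries, and the function names and data are invented for the example.

```python
import math


def gaussian_bic(residuals, n_params):
    """Generic BIC = k*ln(n) - 2*ln(L) for a Gaussian model, given its residuals."""
    n = len(residuals)
    sigma2 = sum(e * e for e in residuals) / n  # ML estimate of the noise variance
    log_lik = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
    return n_params * math.log(n) - 2 * log_lik


def residuals_without_parent(y):
    """Residuals of the model y = mean + noise (no incoming edge)."""
    mean_y = sum(y) / len(y)
    return [v - mean_y for v in y]


def residuals_with_parent(x, y):
    """Residuals of the least-squares fit y = a + b*x (edge x -> y)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((u - mean_x) ** 2 for u in x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return [v - (a + b * u) for u, v in zip(x, y)]


x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [0.1, 2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.9]  # roughly y = 2x

bic_no_edge = gaussian_bic(residuals_without_parent(y), n_params=2)  # mean, variance
bic_edge = gaussian_bic(residuals_with_parent(x, y), n_params=3)     # intercept, slope, variance
```

With this convention a lower BIC is better, and `bic_edge` is far below `bic_no_edge`: the extra parameter is more than paid for by the improved fit.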
Algorithm Parameters
Each algorithm may have specific parameters that can be tuned to optimize performance based on the dataset and requirements. Below are the parameters for each available algorithm:
PC Parameters
| Parameter | Type | Description |
|---|---|---|
| ci_test | string | Type of conditional independence test to use. Options: FisherZ, cg_lrt, dg_lrt, chisquare, gsquare. |
| stable | boolean | Whether to use the stable version of the PC algorithm. |
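The sketch below shows how a configuration generator might expand these two parameters into concrete PC configurations for a given data type. The grouping of tests by data type follows the independence-test table in this section; the dictionary layout and the function name `pc_configurations` are hypothetical and do not reflect ETIA's internal format.

```python
from itertools import product
from typing import Dict, List

# CI tests accepted by PC, grouped by the data type they target.
PC_CI_TESTS: Dict[str, List[str]] = {
    "continuous": ["FisherZ"],
    "mixed": ["cg_lrt", "dg_lrt"],
    "categorical": ["chisquare", "gsquare"],
}


def pc_configurations(data_type: str) -> List[dict]:
    """Enumerate PC configurations compatible with the given data type."""
    tests = PC_CI_TESTS[data_type]
    return [
        {"algorithm": "PC", "ci_test": test, "stable": stable}
        for test, stable in product(tests, [True, False])
    ]


mixed_configs = pc_configurations("mixed")
for conf in mixed_configs:
    print(conf)
```

For mixed data this yields four candidates (two tests, each with and without the stable variant), which the search can then run and compare.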
CPC Parameters
| Parameter | Type | Description |
|---|---|---|
| ci_test | string | Type of conditional independence test to use. Options: FisherZ, cg_lrt, dg_lrt, chisquare, gsquare. |
| stable | boolean | Whether to use the stable version of the CPC algorithm. |
FGES Parameters
| Parameter | Type | Description |
|---|---|---|
| score | string | Scoring function to use. Options: sem_bic_score, bdeu, discrete_bic, cg_bic, dg_bic. |
FCI Parameters
| Parameter | Type | Description |
|---|---|---|
| ci_test | string | Type of conditional independence test to use. Options: FisherZ, cg_lrt, dg_lrt, chisquare, gsquare. |
| stable | boolean | Whether to use a stable version of the FCI algorithm. |
FCI-Max Parameters
| Parameter | Type | Description |
|---|---|---|
| ci_test | string | Type of conditional independence test to use. Options: FisherZ, cg_lrt, dg_lrt, chisquare, gsquare. |
| stable | boolean | Whether to use a stable version of the FCI-Max algorithm. |
RFCI Parameters
| Parameter | Type | Description |
|---|---|---|
| ci_test | string | Type of conditional independence test to use. Options: FisherZ, cg_lrt, dg_lrt, chisquare, gsquare. |
| stable | boolean | Whether to use a stable version of the RFCI algorithm. |
GFCI Parameters
| Parameter | Type | Description |
|---|---|---|
| ci_test | string | Type of conditional independence test to use. Options: FisherZ, cg_lrt, dg_lrt, chisquare, gsquare. |
| stable | boolean | Whether to use a stable version of the GFCI algorithm. |
| score | string | Additional scoring functions (optional): sem_bic_score, bdeu, discrete_bic, cg_bic, dg_bic. |
CFCI Parameters
| Parameter | Type | Description |
|---|---|---|
| ci_test | string | Type of conditional independence test to use. Options: FisherZ, cg_lrt, dg_lrt, chisquare, gsquare. |
| stable | boolean | Whether to use a stable version of the CFCI algorithm. |
sVAR-FCI Parameters
| Parameter | Type | Description |
|---|---|---|
| ci_test | string | Type of conditional independence test to use. Options: FisherZ, cg_lrt, dg_lrt, chisquare, gsquare. |
| stable | boolean | Whether to use a stable version of the sVAR-FCI algorithm. |
| time_series | boolean | Indicates if the data is a time series. |
svargFCI Parameters
| Parameter | Type | Description |
|---|---|---|
| ci_test | string | Type of conditional independence test to use. Options: FisherZ, cg_lrt, dg_lrt, chisquare, gsquare. |
| stable | boolean | Whether to use a stable version of the svargFCI algorithm. |
| score | string | Additional scoring functions: sem_bic_score, bdeu, discrete_bic, cg_bic, dg_bic. |
| time_series | boolean | Indicates if the data is a time series. |
PCMCI Parameters
| Parameter | Type | Description |
|---|---|---|
| ci_test | string | Type of conditional independence test to use. Options: ParCor, RobustParCor, GPDC, CMIknn, ParCorrWLS, Gsquared, CMIsymb, RegressionCI. |
PCMCI+ Parameters
| Parameter | Type | Description |
|---|---|---|
| ci_test | string | Type of conditional independence test to use. Options: ParCor, RobustParCor, GPDC, CMIknn, ParCorrWLS, Gsquared, CMIsymb, RegressionCI. |
LPCMCI Parameters
| Parameter | Type | Description |
|---|---|---|
| ci_test | string | Type of conditional independence test to use. Options: ParCor, RobustParCor, GPDC, CMIknn, ParCorrWLS, Gsquared, CMIsymb, RegressionCI. |
SAM Parameters
| Parameter | Type | Description |
|---|---|---|
| lr | float | Learning rate. Options: 0.001, 0.01, 0.1. |
| dlr | float | Decay learning rate. Options: 0.0001, 0.001, 0.01. |
| lambda1 | float | Regularization parameter 1. Options: 1, 10, 100. |
| lambda2 | float | Regularization parameter 2. Options: 0.0001, 0.001, 0.01. |
| nh | int | Number of hidden neurons. Options: 10, 20, 50. |
| dnh | int | Decay hidden neurons. Options: 100, 200, 300. |
| train_epochs | int | Number of training epochs. Options: 1000, 3000, 5000. |
| test_epochs | int | Number of testing epochs. Options: 500, 1000, 1500. |
| batch_size | int | Batch size. Options: 50, 100, 200. |
| losstype | string | Type of loss function. Options: fgan, gan, mse. |
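Ten parameters with three options each span a large grid. The sketch below counts the combinations and lazily enumerates the first few; it only illustrates the size of the space and makes no claim about how ETIA itself traverses it. `SAM_GRID` and `iter_grid` are names chosen for this example.

```python
from itertools import islice, product
from math import prod

# Option lists copied from the SAM parameter table.
SAM_GRID = {
    "lr": [0.001, 0.01, 0.1],
    "dlr": [0.0001, 0.001, 0.01],
    "lambda1": [1, 10, 100],
    "lambda2": [0.0001, 0.001, 0.01],
    "nh": [10, 20, 50],
    "dnh": [100, 200, 300],
    "train_epochs": [1000, 3000, 5000],
    "test_epochs": [500, 1000, 1500],
    "batch_size": [50, 100, 200],
    "losstype": ["fgan", "gan", "mse"],
}

grid_size = prod(len(values) for values in SAM_GRID.values())


def iter_grid(grid):
    """Lazily yield one dict per point in the Cartesian product of the options."""
    names = list(grid)
    for combo in product(*(grid[name] for name in names)):
        yield dict(zip(names, combo))


first_three = list(islice(iter_grid(SAM_GRID), 3))
print(grid_size)  # 59049 (3 ** 10)
```

Since each SAM configuration involves training a neural network, an exhaustive sweep over all 59,049 points would be expensive; restricting or sampling the grid is a natural design choice.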
NOTEARS Parameters
| Parameter | Type | Description |
|---|---|---|
| max_iter | int | Maximum number of iterations. Options: 100, 500, 1000. |
| h_tol | float | Tolerance for convergence. Options: 1e-7, 1e-5, 1e-3. |
| threshold | float | Threshold for edge inclusion. Options: 0.0, 0.5, 0.8. |
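The `threshold` parameter controls which entries of the learned weighted adjacency matrix are kept as edges. The sketch below applies the three listed thresholds to a made-up weight matrix. It assumes that an edge is kept when its absolute weight is strictly greater than the threshold; whether ETIA uses a strict or non-strict comparison is not stated here.

```python
def threshold_edges(weights, threshold):
    """Return directed edges (i, j) whose absolute weight exceeds the threshold."""
    return [
        (i, j)
        for i, row in enumerate(weights)
        for j, w in enumerate(row)
        if i != j and abs(w) > threshold
    ]


# Hypothetical weighted adjacency matrix: W[i][j] is the weight of edge i -> j.
W = [
    [0.0, 0.9, 0.05],
    [0.0, 0.0, -0.6],
    [0.0, 0.0, 0.0],
]

edges_loose = threshold_edges(W, threshold=0.0)   # [(0, 1), (0, 2), (1, 2)]
edges_mid = threshold_edges(W, threshold=0.5)     # [(0, 1), (1, 2)]
edges_strict = threshold_edges(W, threshold=0.8)  # [(0, 1)]
```

A higher threshold yields a sparser graph: the weak edge 0 -> 2 disappears first, then the moderate edge 1 -> 2.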
Key Details
Latent Variables Supported:
- ✓: Supports latent (unobserved) variables.
- ✕: Does not support latent variables (causal sufficiency assumed).

Tests/Scores Used:
- Conditional Independence Tests (`ci_test`): Methods like FisherZ, CG-LRT, DG-LRT, Chi-Square, G-Square, ParCor, RobustParCor, GPDC, CMIknn, ParCorrWLS, Gsquared, CMIsymb, RegressionCI.
- Scores (`score`): Metrics like SEM BIC Score, BDeu, Discrete BIC, CG-BIC, DG-BIC, GFCI Score, svargFCI Score.
- Additional Parameters: Algorithms like SAM and NOTEARS have specific parameters relevant to their optimization and learning processes.

Data Type:
- Continuous: Numeric data without discrete categories.
- Mixed: Combination of continuous and categorical data.
- Categorical: Data with discrete categories.
- Time Series: Data that includes temporal dependencies.
Notes
Assumptions:
- Causal Sufficiency: If set to False, the algorithm accounts for potential latent variables.
- Assume Faithfulness: Indicates whether the algorithm assumes that the faithfulness condition holds, which affects its ability to recover the true causal graph.
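These assumption flags can be read as filters over the set of algorithms. The sketch below shows one way such filtering could work, where a value of `None` means "do not filter on this flag". The catalogue is a hypothetical subset whose flags follow the algorithm descriptions in this section; `filter_family` is not an ETIA function.

```python
from typing import Dict, List, Optional

# Hypothetical catalogue: True means the algorithm assumes causal sufficiency.
ALGORITHMS: Dict[str, Dict[str, bool]] = {
    "PC": {"causal_sufficiency": True},
    "CPC": {"causal_sufficiency": True},
    "NOTEARS": {"causal_sufficiency": True},
    "FCI": {"causal_sufficiency": False},
    "RFCI": {"causal_sufficiency": False},
    "GFCI": {"causal_sufficiency": False},
}


def filter_family(catalogue: Dict[str, Dict[str, bool]], **criteria: Optional[bool]) -> List[str]:
    """Keep algorithms whose flags match every criterion that is not None."""
    return [
        name
        for name, flags in catalogue.items()
        if all(value is None or flags.get(key) == value for key, value in criteria.items())
    ]


latent_aware = filter_family(ALGORITHMS, causal_sufficiency=False)
unfiltered = filter_family(ALGORITHMS, causal_sufficiency=None)
print(latent_aware)  # ['FCI', 'RFCI', 'GFCI']
```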
CL Output
The output of the CL module includes:
- A causal graph representing the learned structure between variables.
- The best-performing causal discovery configuration, including the selected algorithm, independence test, and scoring function.
By optimizing over the whole causal discovery pipeline, the CL module aims to produce causal relationships that are both accurate and interpretable, facilitating further analysis and reasoning.
Main Class
The main entry point for using the CL module is the CausalLearner class. This class allows users to configure and run the causal discovery process, selecting from a variety of algorithms, tests, and scoring functions. The causal graphs generated can then be passed on for downstream reasoning or visualization tasks.
- class CausalLearner(dataset_input: str | Dataset | None = None, configurations: Configurations | None = None, verbose: bool = False, n_jobs: int | None = None, random_seed: int | None = None)
Bases: object
CausalLearner class for automated causal discovery.
- Parameters:
dataset_input (str or Dataset) – Either a file path to the dataset or a Dataset instance containing the data.
configurations (Configurations, optional) – A Configurations object containing experiment configurations. If None, default configurations are used.
verbose (bool, optional) – If True, prints detailed logs. Default is False.
n_jobs (int, optional) – Number of jobs for parallel processing. Default is the number of CPU cores.
random_seed (int, optional) – Seed for random number generator to ensure reproducibility. Default is None.
- learn_model()
Runs the causal discovery process using the OCT algorithm.
- Returns:
A tuple containing:
opt_conf: The optimal configuration found.
matrix_mec_graph: The MEC graph matrix.
matrix_graph: The graph matrix.
run_time: The runtime of the CDHPO process.
library_results: Results from the causal discovery libraries.
- Return type:
tuple
- print_results(opt_conf=None)
Prints the results of the causal discovery process.
- Parameters:
opt_conf (dict, optional) – The optimal configuration to print. If None, uses self.opt_conf.
- set_dataset(dataset)
Sets the dataset for the causal learner.
- Parameters:
dataset (Dataset) – The Dataset object to set.
- Raises:
TypeError – If dataset is not of type Dataset.
- set_configurations(configurations)
Sets the configurations for the causal learner.
- Parameters:
configurations (Configurations) – The Configurations object to set.
- Raises:
TypeError – If configurations is not of type Configurations.
- save_progress(path=None)
Saves the progress of the experiment to a file.
- Parameters:
path (str, optional) – The file path to save the progress to. If None, saves to ‘Experiment.pkl’ in results_folder.
- static load_progress(path)
Loads the progress of the experiment from a file.
- Parameters:
path (str) – The file path to load the progress from.
- Returns:
The loaded CausalLearner object.
- Return type:
CausalLearner
- add_configurations_from_file(filename)
Adds additional configurations to the experiment from a JSON file.
- Parameters:
filename (str) – The filename of the JSON file containing configurations.
- get_best_model_between_algorithms(algorithms)
Gets the best model between specified algorithms.
- Parameters:
algorithms (list) – A list of algorithm names to consider.
- Returns:
The best configuration among the specified algorithms.
- Return type:
dict
- get_best_model_between_family(causal_sufficiency=None, assume_faithfulness=None, is_output_mec=None, accepts_missing_values=None)
Gets the best model within a family of algorithms based on specified criteria.
- Parameters:
causal_sufficiency (bool, optional) – Filter algorithms by whether they assume causal sufficiency; False selects algorithms that admit latent variables.
assume_faithfulness (bool, optional) – Filter algorithms based on faithfulness assumption.
is_output_mec (bool, optional) – Filter algorithms that output MEC graphs.
accepts_missing_values (bool, optional) – Filter algorithms that accept missing values.
- Returns:
The best configuration among the filtered algorithms.
- Return type:
dict
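The following usage sketch ties the documented methods together. It relies only on the constructor arguments and return values listed above; the import path `ETIA.CausalLearning` is an assumption inferred from the package names in this documentation, and the wrapper function `run_causal_learning` is hypothetical. The import happens inside the function so that the sketch can be loaded without ETIA installed.

```python
def run_causal_learning(dataset_path: str, seed: int = 0):
    """Run causal discovery on a dataset file; return the best configuration and graph."""
    # Assumed import path; adjust it if CausalLearner lives elsewhere in your install.
    from ETIA.CausalLearning import CausalLearner

    learner = CausalLearner(
        dataset_input=dataset_path,  # a file path or a Dataset instance
        verbose=True,
        n_jobs=1,
        random_seed=seed,            # fixed seed for reproducibility
    )
    # learn_model() is documented to return these five values.
    opt_conf, matrix_mec_graph, matrix_graph, run_time, library_results = learner.learn_model()
    learner.print_results(opt_conf)
    learner.save_progress()          # documented default: 'Experiment.pkl' in results_folder
    return opt_conf, matrix_graph
```

Calling `run_causal_learning("data.csv")` requires ETIA and a dataset at that (hypothetical) path. A saved run can later be restored with `CausalLearner.load_progress(path)`, and the search can be narrowed afterwards with `get_best_model_between_algorithms` or `get_best_model_between_family`.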
Helper Classes
Contents:
- ETIA.CausalLearning.algorithms.causalnex_algorithm package
- ETIA.CausalLearning.algorithms.cdt_algorithms package
- ETIA.CausalLearning.algorithms package
- ETIA.CausalLearning.algorithms.tetrad_algorithm package
- ETIA.CausalLearning.algorithms.tigramite_algorithm package
- ETIA.CausalLearning.CDHPO.OCT package
- ETIA.CausalLearning.CDHPO package
- ETIA.CausalLearning.configurations package
  - Submodules
    - ETIA.CausalLearning.configurations.causal_configs module
    - ETIA.CausalLearning.configurations.ci_tests module
    - ETIA.CausalLearning.configurations.class_causal_configurator module
    - ETIA.CausalLearning.configurations.configurations module
    - ETIA.CausalLearning.configurations.default_parameters module
    - ETIA.CausalLearning.configurations.parameters module
    - ETIA.CausalLearning.configurations.scores module
- ETIA.CausalLearning.model_validation_protocols.kfold package
- ETIA.CausalLearning.model_validation_protocols package
- ETIA.CausalLearning.regressors package
- ETIA.CausalLearning.utils.oct_functions package
  - Submodules
    - ETIA.CausalLearning.utils.oct_functions.function_run_tetrads_algorithms module
    - ETIA.CausalLearning.utils.oct_functions.function_tetrad_algs module
    - ETIA.CausalLearning.utils.oct_functions.functions_data_graph_utils module
    - ETIA.CausalLearning.utils.oct_functions.functions_prepData module
    - ETIA.CausalLearning.utils.oct_functions.functions_tetrad module
    - ETIA.CausalLearning.utils.oct_functions.functions_tetrad_utils module
- ETIA.CausalLearning.utils package