Automated Feature Selection (AFS)

The Automated Feature Selection (AFS) module automates the selection of relevant features from large, high-dimensional datasets. Its primary purpose is to identify the Markov boundary (Mb) of a target variable: a minimal set of variables that, once conditioned on, renders the target independent of all remaining variables. Restricting subsequent causal modeling or prediction tasks to these variables significantly reduces their complexity. By automatically selecting features and configuring prediction models, AFS streamlines data analysis and helps researchers build robust, interpretable models efficiently.

AFS operates within the broader ETIA framework, which is designed for automated causal discovery and reasoning. It serves as the first step in a pipeline where dimensionality reduction improves the efficiency of downstream causal learning and predictive modeling. Because it handles continuous, categorical, and mixed data, AFS adapts to many problem domains. Its flexible architecture and integration with different algorithms make it usable by both non-experts and experienced researchers.

Core Objectives

The core objectives of AFS include:

  • Identifying the Markov boundary of the target variable(s).

  • Selecting and configuring predictive models to assess feature relevance.

  • Optimizing predictive performance while keeping the selected feature set minimal.

  • Handling large datasets efficiently through parallel processing.

How AFS Works

AFS employs a two-stage process:

  1. Predictive Configuration Generator (CG): This module generates multiple configurations, each combining a feature selection algorithm with a predictive algorithm. It uses a predefined search space of hyperparameters tailored to each dataset and target. Feature selection algorithms such as FBED (Forward-Backward selection with Early Dropping) and SES (Statistically Equivalent Signatures) are configured and applied to identify the features most relevant to the target or, in the case of SES, sets of features that are statistically equivalent.

  2. Predictive Configuration Evaluator (CE): The CE assesses each generated configuration using cross-validation (5-fold by default). Predictive performance is measured with metrics such as the area under the receiver operating characteristic curve (AUROC) for classification tasks or the coefficient of determination (R²) for regression tasks. The best-performing configuration is then selected and applied to the full dataset, returning the final set of selected features along with the optimal predictive model.
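The two stages above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not ETIA's implementation: `k_fold_indices`, `select_best_config`, and the `evaluate` callback are hypothetical names, and the toy scorer stands in for a real feature selector and model scored by AUROC or R².

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and split them into k (train, test) pairs."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def select_best_config(configs, n_samples, evaluate, k=5, seed=0):
    """Score every configuration on the same folds and return the best one.

    `evaluate(config, train, test)` is a hypothetical user-supplied callback
    that fits on `train` and returns a higher-is-better score on `test`.
    """
    splits = k_fold_indices(n_samples, k, seed)
    mean_scores = [
        mean(evaluate(config, train, test) for train, test in splits)
        for config in configs
    ]
    best = max(range(len(configs)), key=mean_scores.__getitem__)
    return configs[best], mean_scores[best]


# Toy scorer (assumption): pretends alpha=0.05 with k=3 generalizes best.
def toy_evaluate(config, train, test):
    return 0.9 - abs(config["alpha"] - 0.05) - 0.01 * abs(config["k"] - 3)


configs = [{"alpha": a, "k": k} for a in (0.05, 0.01) for k in (3, 5)]
best_config, best_score = select_best_config(
    configs, n_samples=100, evaluate=toy_evaluate
)
print(best_config, round(best_score, 3))
```

Every configuration is scored on identical folds so that the mean scores are directly comparable; in the process described above, the winning configuration would then be applied to the full dataset.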

AFS Output

The output of AFS includes:

  • A set of selected features for each target, which estimates that target's Markov boundary.

  • The best-performing predictive model.

  • An evaluation of the model’s predictive performance.

  • The reduced dataset.

AFS ensures that the selected features are not only statistically relevant but also optimized for prediction, improving both the efficiency and accuracy of subsequent analysis.

Available Algorithms

The AFS module includes several feature selection and prediction algorithms. The tables below summarize the available algorithms and the default values searched for each hyperparameter:

Feature Selection Algorithms

Algorithm           Hyperparameters     Default Values
FBED                alpha               [0.05, 0.01]
                    k                   [3, 5]
                    ind_test_name       ['testIndFisher']
SES                 alpha               [0.05, 0.01]
                    k                   [3, 5]
                    ind_test_name       ['testIndFisher']

Predictive Algorithms

Algorithm           Hyperparameters     Default Values
Random Forest       n_estimators        [50, 100]
                    min_samples_leaf    [0.1]
                    max_features        ['sqrt']
Linear Regression   None
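To make the size of the default search space concrete, the values listed above can be expanded into individual configurations. The sketch below is illustrative only: the dictionary layout, the `expand` and `all_configs` helpers, and the assumption that every feature selection setting is paired with every predictive setting are not taken from the AFS source.

```python
from itertools import product

# Search spaces copied from the tables above.
FS_SPACE = {
    "FBED": {"alpha": [0.05, 0.01], "k": [3, 5], "ind_test_name": ["testIndFisher"]},
    "SES": {"alpha": [0.05, 0.01], "k": [3, 5], "ind_test_name": ["testIndFisher"]},
}
PRED_SPACE = {
    "Random Forest": {
        "n_estimators": [50, 100],
        "min_samples_leaf": [0.1],
        "max_features": ["sqrt"],
    },
    "Linear Regression": {},  # no hyperparameters
}


def expand(space):
    """Expand {algorithm: {param: [values]}} into a list of concrete settings."""
    settings = []
    for name, grid in space.items():
        keys = list(grid)
        for values in product(*(grid[key] for key in keys)):
            settings.append({"name": name, **dict(zip(keys, values))})
    return settings


def all_configs():
    """Pair every feature selection setting with every predictive setting (assumed)."""
    return [
        {"fs": fs, "model": model}
        for fs, model in product(expand(FS_SPACE), expand(PRED_SPACE))
    ]


configs = all_configs()
print(len(configs))  # 8 feature selection settings x 3 model settings = 24
```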

Main Class

The main entry point of the AFS module is the AFS class. It provides methods to configure and run feature selection and to manage the results, integrating preprocessing, feature selection, and predictive modeling in a single workflow.

class AFS(depth: int = 1, verbose: bool = False, num_processors: int | None = None, oos_protocol: Dict[str, Any] | None = None, random_seed: int | None = None)

Bases: object

Automated Feature Selection (AFS) class.

Parameters:
  • depth (int, optional) – The depth of the feature selection process. Default is 1.

  • verbose (bool, optional) – If True, prints detailed logs. Default is False.

  • num_processors (int, optional) – Number of processors to use for parallel processing. Default is the number of CPU cores.

  • oos_protocol (dict, optional) – A dictionary specifying the out-of-sample protocol. Default is a 5-fold cross-validation.

  • random_seed (int, optional) – Seed for random number generator to ensure reproducibility. Default is None.

run_AFS(data: str | DataFrame | ndarray, target_features: List[str] | Dict[str, str], pred_configs: List[Dict[str, Any]] | float | None = None, dataset_name: str = 'dataset') -> Dict[str, Any]

Runs the AFS process on the provided data and target features.

Parameters:
  • data (str or pd.DataFrame or np.ndarray) – The dataset to use. Can be a filename (str), a pandas DataFrame, or a NumPy array.

  • target_features (Union[Dict[str, str], List[str]]) – A dictionary mapping feature names to their types, or a list of feature names (in which case the types are inferred).

  • pred_configs (Union[List[Dict[str, Any]], float], optional) –

    • If list, it is a list of predictive configurations provided by the user.

    • If float (between 0 and 1), it indicates the fraction of default configurations to sample and run.

    • If None, all default configurations are used.

  • dataset_name (str, optional) – The name of the dataset (used for saving intermediate files). Default is 'dataset'.
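The three accepted forms of pred_configs can be illustrated with a small dispatcher. This is a sketch of the documented behaviour, not the library's code: `resolve_pred_configs` is a hypothetical helper, and the rounding rule and the minimum of one sampled configuration are assumptions.

```python
import random


def resolve_pred_configs(pred_configs, default_configs, seed=None):
    """Return the list of configurations to run for a given `pred_configs` value."""
    if pred_configs is None:
        # None: run every default configuration.
        return list(default_configs)
    if isinstance(pred_configs, list):
        # List: the user supplied the configurations directly.
        return pred_configs
    if isinstance(pred_configs, float) and 0.0 < pred_configs <= 1.0:
        # Float: sample that fraction of the defaults (rounding is an assumption).
        n = max(1, round(pred_configs * len(default_configs)))
        return random.Random(seed).sample(default_configs, n)
    raise ValueError("pred_configs must be None, a list, or a float in (0, 1]")


defaults = [{"id": i} for i in range(24)]  # placeholder default configurations
print(len(resolve_pred_configs(None, defaults)))           # 24
print(len(resolve_pred_configs(0.25, defaults, seed=0)))   # 6
print(resolve_pred_configs([{"id": 99}], defaults))        # [{'id': 99}]
```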

Returns:

A dictionary containing:

  • 'original_data': The original dataset.

  • 'reduced_data': The dataset with only the selected features and target features.

  • 'best_config': The configuration that led to the best feature selection.

  • 'selected_features': The selected features for each target.

Return type:

dict

Examples

To run feature selection on a dataset:

    >>> afs = AFS()
    >>> result = afs.run_AFS(data="data.csv", target_features=["feature1", "feature2"])
    >>> print(result["selected_features"])
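The relationship between 'selected_features' and 'reduced_data' in the returned dictionary can be illustrated without calling AFS. The sketch below uses mock data and makes two assumptions that the documentation does not confirm: that 'selected_features' maps each target name to a list of feature names, and that rows can be represented as plain dictionaries instead of a DataFrame. `reduce_rows` is a hypothetical helper.

```python
def reduce_rows(rows, selected_features, target_features):
    """Keep only the target columns and the features selected for them."""
    keep = list(target_features)
    for target in target_features:
        for feature in selected_features.get(target, []):
            if feature not in keep:
                keep.append(feature)
    return [{name: row[name] for name in keep} for row in rows]


# Mock data and a mock result; no AFS call is made here.
rows = [
    {"feature1": 1.0, "x1": 0.2, "x2": 5, "x3": 7},
    {"feature1": 0.0, "x1": 0.9, "x2": 3, "x3": 8},
]
result = {
    "original_data": rows,
    "selected_features": {"feature1": ["x1", "x3"]},  # assumed layout
    "best_config": {"fs": "FBED", "alpha": 0.05},     # illustrative layout only
}
result["reduced_data"] = reduce_rows(rows, result["selected_features"], ["feature1"])
print(result["reduced_data"][0])  # {'feature1': 1.0, 'x1': 0.2, 'x3': 7}
```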

recursive_fs_for_target(data: DataFrame, target_feature: str, target_type: str, pred_configs: List[Dict[str, Any]], dataset_name: str, depth: int, visited_features: set | None = None) -> Dict[str, Any]

Recursively runs feature selection for a specific target feature up to the specified depth.
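The role of depth and visited_features can be sketched as follows. This is an assumed reading of the description above, not the method's actual body: `recursive_select` and the `select` callback are hypothetical, and a lookup table stands in for a real feature selection run.

```python
def recursive_select(target, select, depth, visited=None):
    """Collect selected features for `target` and, recursively, for each of them.

    `select(feature)` is a hypothetical callback returning the features chosen
    for `feature`; `depth=1` stops after the target's own selection.
    """
    if visited is None:
        visited = set()
    if depth < 1 or target in visited:
        return {}
    visited.add(target)  # never run selection twice for the same feature
    selected = select(target)
    results = {target: selected}
    for feature in selected:
        results.update(recursive_select(feature, select, depth - 1, visited))
    return results


# Toy neighbourhoods standing in for a real feature selector.
neighbours = {"Y": ["A", "B"], "A": ["Y", "C"], "B": ["Y"], "C": ["A"]}
print(recursive_select("Y", neighbours.__getitem__, depth=1))
# {'Y': ['A', 'B']}
print(recursive_select("Y", neighbours.__getitem__, depth=2))
# {'Y': ['A', 'B'], 'A': ['Y', 'C'], 'B': ['Y']}
```

With depth=1 only the target's own features are selected; each additional level also runs selection for the features found at the previous level, and the visited set prevents cycles such as Y → A → Y.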

run_fs_for_config(data: DataFrame, target_feature: str, target_type: str, config: Dict[str, Any], dataset_name: str, train_inds: List[ndarray], test_inds: List[ndarray], feature_columns: List[str]) -> Tuple[List[float], List[Tuple[ndarray, ndarray, Dict[str, Any], Any, Preprocessor | None]], DataFrame]

Runs the feature selection process for a specific configuration.

bootstrap_bias_correction(fold_predictions: List[Tuple[ndarray, ndarray]], target_type: str, B: int = 1000, conf_interval: float = 0.95) -> float

Applies bootstrap bias correction to the fold predictions.
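The general idea behind bootstrap bias correction can be sketched as follows: picking the best of several configurations by its cross-validated score makes that score optimistic, so the selection step is repeated on bootstrap resamples of the pooled out-of-sample predictions and the winner is scored on the samples left out of each resample. The code below is an illustration of that idea under stated assumptions, not the method documented above: `bootstrap_corrected_score`, `accuracy`, and `flip` are hypothetical names, the input is a dictionary of predictions per configuration (which differs from the fold_predictions signature), accuracy replaces AUROC or R² for brevity, and no confidence interval is computed.

```python
import random
from statistics import mean


def accuracy(y_true, y_pred, indices):
    """Fraction of correct predictions over the given sample indices."""
    return sum(y_true[i] == y_pred[i] for i in indices) / len(indices)


def bootstrap_corrected_score(y_true, preds_by_config, B=1000, seed=0):
    """Average out-of-bag score of the configuration that wins in-bag."""
    n = len(y_true)
    rng = random.Random(seed)
    out_of_bag_scores = []
    for _ in range(B):
        in_bag = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        drawn = set(in_bag)
        out_of_bag = [i for i in range(n) if i not in drawn]
        if not out_of_bag:
            continue  # rare: every sample was drawn, nothing left to score on
        winner = max(
            preds_by_config,
            key=lambda name: accuracy(y_true, preds_by_config[name], in_bag),
        )
        out_of_bag_scores.append(
            accuracy(y_true, preds_by_config[winner], out_of_bag)
        )
    return mean(out_of_bag_scores)


def flip(labels, positions):
    """Return a copy of binary `labels` with the given positions inverted."""
    return [1 - y if i in positions else y for i, y in enumerate(labels)]


# Mock pooled out-of-sample predictions for two configurations.
y_true = [0, 1] * 10
preds = {
    "config_a": flip(y_true, {0, 7, 13}),
    "config_b": flip(y_true, {2, 5, 9, 11, 18}),
}
naive = max(accuracy(y_true, p, range(len(y_true))) for p in preds.values())
corrected = bootstrap_corrected_score(y_true, preds, B=200, seed=0)
print(naive, round(corrected, 3))
```

Because selection and scoring use disjoint samples within each resample, the averaged score is not inflated by the act of choosing the winner.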

Helper Classes

Below is a list of available classes in the AFS module:

Each class is responsible for different aspects of the feature selection and prediction pipeline, ensuring flexibility and modularity in the system.