ML pipeline

Dataset construction

The training dataset is built by build_capacitance_dataset from a Latin hypercube sample of five physical parameters (log concentration, valence, temperature, Stern thickness, Stern permittivity). For each sample, a full GCS (Gouy-Chapman-Stern) sweep is run across 81 electrode potentials in [-0.4, 0.4] V, and the results are collected in a tidy long-format DataFrame.
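As an illustration of the sampling step, here is a minimal pure-Python sketch of a Latin hypercube over the five parameters. The bounds, the dictionary keys, and the helper names (latin_hypercube, sample_parameters) are assumptions made for this example and are not taken from SamplingBounds or build_capacitance_dataset; in practice a library sampler such as scipy.stats.qmc.LatinHypercube could play the same role.

```python
import random


def latin_hypercube(n_samples, n_dims, rng):
    """Return n_samples points in [0, 1)^n_dims, one point per stratum on each axis."""
    columns = []
    for _ in range(n_dims):
        strata = list(range(n_samples))
        rng.shuffle(strata)  # independent permutation of the strata for each axis
        columns.append([(s + rng.random()) / n_samples for s in strata])
    # Transpose the per-axis columns into one row per sample.
    return [list(row) for row in zip(*columns)]


# Hypothetical bounds, chosen only for illustration.
BOUNDS = {
    "log10_concentration": (-3.0, 0.0),      # log10 of mol/L
    "temperature": (283.0, 343.0),           # K
    "stern_thickness": (2.0e-10, 6.0e-10),   # m
    "stern_permittivity": (4.0, 12.0),       # relative permittivity
}
VALENCES = (1, 2)


def sample_parameters(n_samples, seed=0):
    rng = random.Random(seed)
    unit = latin_hypercube(n_samples, len(BOUNDS) + 1, rng)
    samples = []
    for row in unit:
        params = {}
        # The first four unit coordinates are scaled linearly onto their bounds.
        for u, (name, (lo, hi)) in zip(row, BOUNDS.items()):
            params[name] = lo + u * (hi - lo)
        # The last unit coordinate is binned onto the discrete valence set.
        index = min(int(row[-1] * len(VALENCES)), len(VALENCES) - 1)
        params["valence"] = VALENCES[index]
        # Concentration is recovered from its log10 value.
        params["concentration"] = 10.0 ** params["log10_concentration"]
        samples.append(params)
    return samples


samples = sample_parameters(8)
```

Because the valence axis is stratified like every other axis, an even number of samples yields equal counts of each valence in this sketch.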

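  • Concentration is sampled in log10(c) because C_d ∝ √c: the response varies over orders of magnitude in c, and a linear grid would under-sample dilute solutions.
  • Valence is drawn from the discrete set {1, 2} rather than a continuous range, because ionic valence is an integer.
  • Splits are made by sweep, not by row (split_by_sample), so all rows of a sweep stay together and no physical parameter set appears in more than one of the train, validation, and test splits.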
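The sweep-level split can be sketched as follows. This is a pure-Python illustration, not the implementation of split_by_sample: the sample_id column name, the 70/15/15 fractions, and the evenly spaced potential grid are assumptions made for the example.

```python
import random


def split_ids(sample_ids, fractions=(0.7, 0.15, 0.15), seed=0):
    """Shuffle the unique sweep ids and partition them into train/val/test sets."""
    unique = sorted(set(sample_ids))
    random.Random(seed).shuffle(unique)
    n_train = round(fractions[0] * len(unique))
    n_val = round(fractions[1] * len(unique))
    train = set(unique[:n_train])
    val = set(unique[n_train:n_train + n_val])
    test = set(unique[n_train + n_val:])  # whatever remains goes to test
    return train, val, test


# Toy long-format table: 20 sweeps x 81 potentials, one dict per row.
rows = [
    {"sample_id": sid, "potential": -0.4 + 0.01 * k}
    for sid in range(20)
    for k in range(81)
]

train_ids, val_ids, test_ids = split_ids([r["sample_id"] for r in rows])

# Rows follow their sweep id, so a sweep is never divided between splits.
train_rows = [r for r in rows if r["sample_id"] in train_ids]
val_rows = [r for r in rows if r["sample_id"] in val_ids]
test_rows = [r for r in rows if r["sample_id"] in test_ids]
```

A row-level shuffle would instead scatter the 81 points of one sweep across all three splits, letting the model interpolate along a curve it has already partly seen.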

Surrogate architecture

The surrogate is an MLP with

  • input dim 6 (5 physical features + electrode potential),
  • (128, 128, 64) hidden widths by default,
  • SiLU activations,
  • optional BatchNorm and dropout,
  • Kaiming initialisation and a linear scalar output head.

See MLPConfig and CapacitanceMLP.
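To give a sense of the model size, the short calculation below counts the trainable weights and biases for the default widths. It assumes every linear layer has a bias term and excludes the optional BatchNorm parameters; mlp_parameter_count is a helper written for this example, not part of the package.

```python
def mlp_parameter_count(input_dim, hidden_dims, output_dim=1):
    """Count weights plus biases for each fully connected layer of an MLP."""
    widths = [input_dim, *hidden_dims, output_dim]
    per_layer = [
        fan_in * fan_out + fan_out  # weight matrix plus bias vector
        for fan_in, fan_out in zip(widths, widths[1:])
    ]
    return per_layer, sum(per_layer)


per_layer, total = mlp_parameter_count(6, (128, 128, 64))
print(per_layer)  # [896, 16512, 8256, 65]
print(total)      # 25729
```

Under these assumptions the default surrogate has roughly 26 thousand parameters, most of them in the second hidden layer.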

Training loop

train_model runs AdamW with cosine annealing, gradient norm clipping, and early stopping. Scalers are fitted on the training split only, preventing data leakage. An optional MLflow run logs every hyperparameter and per-epoch metric.

from edl_ml.data import SamplingBounds, build_capacitance_dataset, split_by_sample
from edl_ml.ml import MLPConfig, TrainConfig, build_loaders, train_model

# Sample 1000 parameter sets and run a GCS sweep for each one.
df = build_capacitance_dataset(SamplingBounds(), n_samples=1000)
# Split by sweep so that no sweep is shared between splits.
train, val, test = split_by_sample(df)
loaders = build_loaders(train, val, test, batch_size=256)
# Train the surrogate, log the run to MLflow, and save a checkpoint.
report = train_model(
    loaders,
    MLPConfig(input_dim=6, hidden_dims=(128, 128, 64), dropout=0.05),
    TrainConfig(max_epochs=200, learning_rate=1e-3,
                mlflow_experiment="edl-ml"),
    checkpoint_path="data/models/model.pt",
)
print(report.test_metrics)   # {'mse': ..., 'rmse': ..., 'mae': ..., 'r2': ..., 'mape': ...}
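Two ingredients of the training loop, the cosine-annealed learning rate and the early-stopping rule, can be illustrated without any deep-learning dependency. The sketch below is not the code inside train_model; the patience and min_delta parameters and the toy validation curve are assumptions made for the example. The schedule is the standard closed-form cosine annealing, decaying from the base rate to a minimum over max_epochs.

```python
import math


def cosine_lr(epoch, max_epochs, base_lr, min_lr=0.0):
    """Cosine-annealed learning rate: base_lr at epoch 0, min_lr at max_epochs."""
    progress = min(epoch, max_epochs) / max_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


class EarlyStopping:
    """Stop once the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=20, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = math.inf
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Toy validation curve: improves for five epochs, then plateaus at a worse value.
val_losses = [1.0 / (e + 1) for e in range(5)] + [0.25] * 50

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    lr = cosine_lr(epoch, max_epochs=200, base_lr=1e-3)  # rate used this epoch
    if stopper.step(loss):
        stopped_at = epoch
        break

print(stopped_at)  # 7
```

The best loss (0.2) is reached at epoch 4, and the three non-improving epochs that follow exhaust the patience, so the loop stops at epoch 7, long before the 200-epoch budget.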

Hyperparameter optimisation

run_optuna_study wraps an Optuna TPE study over width, depth, activation, BatchNorm, dropout, learning rate, and weight decay. Pass a persistent storage URL (for example sqlite:///edl.db) to resume an interrupted study.

Interpretability

Two complementary methods ship with the package: