Data API
edl_ml.data.features
Feature definitions and sampling utilities for the EDL dataset.
The ML surrogate is trained to reproduce the full differential capacitance
curve C_dl(E) produced by the Gouy-Chapman-Stern solver. Each training
sample is parameterised by five physical variables drawn from a Latin
hypercube inside SamplingBounds. Concentration is sampled log-uniformly
because the diffuse-layer capacitance scales with :math:`\sqrt{c}` through
the Debye length, so sampling on a linear scale would leave the dilute
decades sparsely covered.
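The coverage argument can be checked with a few lines of arithmetic. The sketch below assumes the default 1 mM to 1 M range quoted for SamplingBounds and compares the expected share of samples falling in the most dilute decade under linear and log-uniform sampling. The function names are illustrative and are not part of edl_ml.

```python
import math

C_MIN, C_MAX = 1e-3, 1.0  # mol/L, default concentration range quoted for SamplingBounds


def fraction_below_linear(c: float) -> float:
    """Expected fraction of samples below c when c itself is drawn uniformly."""
    return (c - C_MIN) / (C_MAX - C_MIN)


def fraction_below_log(c: float) -> float:
    """Expected fraction of samples below c when log10(c) is drawn uniformly."""
    span = math.log10(C_MAX) - math.log10(C_MIN)
    return (math.log10(c) - math.log10(C_MIN)) / span


# Share of the training set expected between 1 mM and 10 mM.
linear_share = fraction_below_linear(1e-2)
log_share = fraction_below_log(1e-2)
print(f"linear: {linear_share:.3%}, log-uniform: {log_share:.3%}")
```

Under linear sampling less than 1% of the samples land in the 1 mM to 10 mM decade, whereas log-uniform sampling gives each of the three decades an equal third.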
FEATURE_COLUMNS module-attribute
FEATURE_COLUMNS: Final[tuple[str, ...]] = ('log10_concentration_mol_l', 'valence', 'temperature_k', 'stern_thickness_ang', 'stern_permittivity')
Ordered feature names stored in every dataset row.
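Because the order is fixed, a raw feature vector can be labelled by zipping it with FEATURE_COLUMNS. The snippet below is a minimal sketch: the tuple is copied from the attribute above, while `label_features` and the example values are hypothetical and only serve as illustration.

```python
from typing import Final

# Copied from the module attribute documented above.
FEATURE_COLUMNS: Final[tuple[str, ...]] = (
    "log10_concentration_mol_l",
    "valence",
    "temperature_k",
    "stern_thickness_ang",
    "stern_permittivity",
)


def label_features(vector: list[float]) -> dict[str, float]:
    """Hypothetical helper: pair each value with its column name."""
    if len(vector) != len(FEATURE_COLUMNS):
        raise ValueError(f"expected {len(FEATURE_COLUMNS)} values, got {len(vector)}")
    return dict(zip(FEATURE_COLUMNS, vector))


row = label_features([-2.0, 1.0, 298.15, 3.0, 10.0])
# The first feature is stored as log10(c); undo the transform to recover mol/L.
concentration_mol_l = 10.0 ** row["log10_concentration_mol_l"]
print(row["temperature_k"], concentration_mol_l)
```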
TARGET_COLUMN module-attribute
Target variable name (differential capacitance).
SamplingBounds dataclass
Inclusive sampling bounds for the five input features.
Defaults bracket physically reasonable aqueous electrochemistry: 1 mM–1 M symmetric electrolyte, z=1 or 2, 283–343 K, Stern thickness 2.5–6 Å, Stern permittivity 5–15.
Source code in src/edl_ml/data/features.py
latin_hypercube_samples
latin_hypercube_samples(bounds: SamplingBounds, n_samples: int, seed: int | None = 0) -> NDArray[np.float64]
Generate a Latin hypercube sample of feature vectors.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `bounds` | `SamplingBounds` | Sampling bounds object. | required |
| `n_samples` | `int` | Number of samples to draw. | required |
| `seed` | `int \| None` | Random seed for reproducibility. | `0` |
Returns:
| Type | Description |
|---|---|
| ndarray of shape ``(n_samples, 5)`` | Columns in the order given by :data: |
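
The stratification idea behind a Latin hypercube can be sketched in plain Python. This is not the package's implementation (which returns a NumPy array and may rely on a library sampler); it is a dependency-free illustration under stated assumptions. The ranges are the defaults quoted for SamplingBounds, `latin_hypercube_sketch` is a hypothetical name, and valence is left continuous because this section does not say how the package discretises it.

```python
import random

# (lower, upper) per feature, from the default ranges quoted for SamplingBounds.
# Concentration is expressed as log10(mol/L): 1 mM..1 M maps to -3..0.
DEFAULT_RANGES = [(-3.0, 0.0), (1.0, 2.0), (283.0, 343.0), (2.5, 6.0), (5.0, 15.0)]


def latin_hypercube_sketch(ranges, n_samples, seed=0):
    """Return n_samples rows; each column visits every one of n_samples strata once."""
    rng = random.Random(seed)
    columns = []
    for lower, upper in ranges:
        strata = list(range(n_samples))
        rng.shuffle(strata)  # independent stratum order for each dimension
        # Jitter inside each stratum, then map [0, 1) onto [lower, upper).
        unit = [(s + rng.random()) / n_samples for s in strata]
        columns.append([lower + u * (upper - lower) for u in unit])
    # Transpose from per-feature columns to per-sample rows.
    return [list(row) for row in zip(*columns)]


samples = latin_hypercube_sketch(DEFAULT_RANGES, n_samples=8, seed=0)
print(len(samples), len(samples[0]))
```

Sorting any single column shows exactly one value in each of the eight equal-width bins, which is the property that distinguishes a Latin hypercube from plain random sampling.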
edl_ml.data.generate
High-throughput dataset generation driven by the Gouy-Chapman-Stern solver.
SweepResult dataclass
Outputs of a single Gouy-Chapman-Stern sweep over electrode potential.
Attributes:
| Name | Type | Description |
|---|---|---|
| `features` | `NDArray[float64]` | Length-5 feature vector, matching :data: |
| `potentials_v` | `NDArray[float64]` | Electrode potentials, V. |
| `capacitance_f_m2` | `NDArray[float64]` | Total differential capacitance at each potential, F/m². |
| `surface_charge_c_m2` | `NDArray[float64]` | Diffuse-layer surface charge, C/m². |
Source code in src/edl_ml/data/generate.py
build_capacitance_dataset
build_capacitance_dataset(bounds: SamplingBounds, n_samples: int, *, seed: int | None = 0, parallel: bool = True, max_workers: int | None = None) -> pd.DataFrame
Build a tidy dataset of capacitance values for ML training.
The returned DataFrame is long-format: every row represents one
(features, electrode_potential) pair with the corresponding total
differential capacitance. This layout is convenient for scikit-learn and
torch dataset consumers.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `bounds` | `SamplingBounds` | Sampling bounds object. | required |
| `n_samples` | `int` | Number of Latin hypercube samples. | required |
| `seed` | `int \| None` | Random seed. | `0` |
| `parallel` | `bool` | Whether to run sweeps in a process pool. | `True` |
| `max_workers` | `int \| None` | Process pool size. | `None` |
Returns:
| Type | Description |
|---|---|
| DataFrame with columns: | |
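
The long-format layout described above can be illustrated without pandas. In this sketch the capacitance numbers are placeholders rather than solver output, and the column names `sample_id`, `potential_v` and `capacitance_f_m2` are assumptions (the latter two borrowed from the SweepResult attribute names), since the actual column list is not shown in this section.

```python
FEATURE_COLUMNS = (
    "log10_concentration_mol_l",
    "valence",
    "temperature_k",
    "stern_thickness_ang",
    "stern_permittivity",
)

# Two toy sweeps over a shared potential grid; values are placeholders only.
potentials_v = [-0.2, 0.0, 0.2]
sweeps = [
    {"features": [-2.0, 1.0, 298.15, 3.0, 10.0], "capacitance_f_m2": [0.21, 0.18, 0.21]},
    {"features": [-1.0, 2.0, 313.15, 4.0, 8.0], "capacitance_f_m2": [0.17, 0.16, 0.17]},
]


def to_long_rows(sweeps, potentials_v):
    """Emit one row per (features, electrode potential) pair."""
    rows = []
    for sample_id, sweep in enumerate(sweeps):
        named = dict(zip(FEATURE_COLUMNS, sweep["features"]))
        for potential, capacitance in zip(potentials_v, sweep["capacitance_f_m2"]):
            # Feature values repeat on every row that belongs to the same sweep.
            rows.append({
                "sample_id": sample_id,
                **named,
                "potential_v": potential,
                "capacitance_f_m2": capacitance,
            })
    return rows


rows = to_long_rows(sweeps, potentials_v)
print(len(rows), sorted(rows[0]))
```

A list of dictionaries like `rows` can be passed directly to `pandas.DataFrame` to obtain a tabular frame.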
run_single_sweep
Run the GCS solver for one feature vector over an electrode potential grid.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `feature_vector` | `NDArray[float64]` | Five-element array matching :data: | required |
| `potentials_v` | `NDArray[float64]` | Electrode potentials to sweep, V. | required |
Returns:
| Type | Description |
|---|---|
| `SweepResult` | |
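
To give a feel for the magnitudes a sweep reports, the sketch below evaluates the textbook Gouy-Chapman-Stern capacitance at the potential of zero charge, where the Stern and diffuse layers act as capacitors in series and the diffuse-layer term reduces to permittivity divided by Debye length. It is not the package's solver and does not cover the potential dependence. The bulk relative permittivity of 78.5 (water near 25 °C) and the function names are assumptions introduced here.

```python
import math

# Physical constants in SI units (CODATA 2018 values).
E_CHARGE = 1.602176634e-19   # C
K_BOLTZMANN = 1.380649e-23   # J/K
N_AVOGADRO = 6.02214076e23   # 1/mol
EPS0 = 8.8541878128e-12      # F/m

BULK_PERMITTIVITY = 78.5  # assumption: water near 25 °C, not an edl_ml feature


def debye_length_m(conc_mol_l, valence, temperature_k):
    """Debye length of a symmetric z:z electrolyte, in metres."""
    n_ions = conc_mol_l * 1000.0 * N_AVOGADRO  # ions per m^3 of each species
    return math.sqrt(
        EPS0 * BULK_PERMITTIVITY * K_BOLTZMANN * temperature_k
        / (2.0 * n_ions * (valence * E_CHARGE) ** 2)
    )


def gcs_capacitance_at_pzc(features):
    """Series combination of Stern and diffuse-layer capacitance, F/m^2."""
    log10_c, valence, temperature_k, stern_ang, stern_eps = features
    debye = debye_length_m(10.0 ** log10_c, valence, temperature_k)
    c_diffuse = EPS0 * BULK_PERMITTIVITY / debye
    c_stern = EPS0 * stern_eps / (stern_ang * 1e-10)  # angstrom -> metre
    return 1.0 / (1.0 / c_stern + 1.0 / c_diffuse)


# 0.1 M 1:1 electrolyte at 298.15 K, 3 angstrom Stern layer with permittivity 10.
features = [-1.0, 1.0, 298.15, 3.0, 10.0]
debye = debye_length_m(0.1, 1.0, 298.15)
total = gcs_capacitance_at_pzc(features)
print(f"Debye length: {debye * 1e9:.2f} nm, C at PZC: {total:.3f} F/m^2")
```

For these inputs the Debye length is a little under 1 nm, and the total capacitance is smaller than either layer's own capacitance, as expected for capacitors in series.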
save_dataset
Save a dataset to a parquet file, creating parent directories as needed.
load_dataset
split_by_sample
split_by_sample(df: DataFrame, val_fraction: float = 0.15, test_fraction: float = 0.15, seed: int | None = 0) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]
Split the dataset so that every sweep is entirely in one split.
Splitting at the sweep level (rather than the row level) prevents information leakage between train and test capacitance curves that share the same physical parameters.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | Output of :func: | required |
| `val_fraction` | `float` | Fractions in (0, 1). Their sum must be strictly below 1. | `0.15` |
| `test_fraction` | `float` | Fractions in (0, 1). Their sum must be strictly below 1. | `0.15` |
| `seed` | `int \| None` | RNG seed. | `0` |
Returns:
| Type | Description |
|---|---|
| `tuple` | |
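
Sweep-level splitting amounts to shuffling the unique sweep identifiers and assigning each identifier, with all of its rows, to exactly one split. The sketch below shows that idea in plain Python. The real function operates on a DataFrame; the `sample_id`-style identifiers, the helper name and the rounding rule for split sizes are assumptions made for illustration.

```python
import random


def split_ids_by_sample(sample_ids, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Partition the unique sweep ids into train, validation and test sets."""
    if not (0.0 < val_fraction < 1.0 and 0.0 < test_fraction < 1.0):
        raise ValueError("fractions must lie in (0, 1)")
    if val_fraction + test_fraction >= 1.0:
        raise ValueError("val_fraction + test_fraction must be strictly below 1")
    unique_ids = sorted(set(sample_ids))
    random.Random(seed).shuffle(unique_ids)
    n_val = round(len(unique_ids) * val_fraction)    # assumed rounding rule
    n_test = round(len(unique_ids) * test_fraction)  # assumed rounding rule
    val = set(unique_ids[:n_val])
    test = set(unique_ids[n_val:n_val + n_test])
    train = set(unique_ids[n_val + n_test:])
    return train, val, test


# 20 sweeps with 3 potentials each, in long format (one id per row).
row_ids = [sweep for sweep in range(20) for _ in range(3)]
train, val, test = split_ids_by_sample(row_ids)
# Row-level selection follows from id membership, so a sweep is never divided.
train_rows = [i for i in row_ids if i in train]
print(len(train), len(val), len(test), len(train_rows))
```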
summarise_dataset
Return simple summary statistics for logging.
Returns:
| Type | Description |
|---|---|
| `dict` | Keys: |