
Symbolic Regression in Scientific Machine Learning: From Data Noise to Governing Equations

Published: April 2026 | Author: ActarusLab Research | Topic: Symbolic Regression · SciML

Introduction

In recent years, machine learning has achieved strong predictive performance across a broad range of scientific domains. Yet a fundamental limitation has become progressively apparent: the inability of most modern architectures to produce representations that are simultaneously accurate and interpretable. A model that yields low test error but cannot be analytically examined offers little scientific value when the goal is understanding, not merely prediction.

At ActarusLab, our research centres on this epistemological gap. We investigate whether it is possible, starting from noisy and high-dimensional datasets, to recover compact, interpretable mathematical structures that faithfully describe the governing dynamics of the underlying system.

The interpretability deficit in black-box models

Deep neural networks and ensemble methods have demonstrated remarkable capacity to approximate complex functions. However, the functional form learned by such architectures is encoded implicitly in millions of parameters that admit no direct physical interpretation. In practice this means:

Predictions can be accurate without revealing the mechanism that produces them.
Generalisation beyond the training distribution is difficult to assess formally, because there is no closed-form expression to analyse.
Regulatory validation — in pharmaceutics, finance, or energy — requires explicit justification that black-box models cannot easily provide.

This is not a secondary concern. In domains where interpretation constitutes part of scientific validation — quantitative finance, computational chemistry, fluid dynamics — the inability to explain a model is equivalent to an incomplete proof.

Symbolic regression as a principled alternative

Symbolic regression (SR) differs fundamentally from parametric regression. Rather than fitting the coefficients of a predefined functional form, SR searches the space of mathematical expressions directly, treating both the structure and the parameters of the model as free variables. The objective is to identify an expression of the form:

y = f(x₁, x₂, …, xₙ)

where f is constructed from primitive operations (addition, multiplication, exponentiation, trigonometric functions, etc.) and where the search is guided by a trade-off between predictive accuracy and symbolic complexity: the candidates of interest are those on the Pareto front, for which accuracy cannot be improved without increasing complexity. This is sometimes described as an algorithmic instantiation of Occam's razor.

The output is not a learned weight vector, but an explicit, human-readable mathematical expression that can be inspected, differentiated, integrated, and falsified.
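To make the idea of searching over structure as well as parameters concrete, the following is a minimal sketch, not a production SR algorithm. It assumes a tiny hand-written library of candidate structures (the `CANDIDATES` list, the complexity scores, and the synthetic target 0.83x² are all hypothetical choices made for illustration), fits the single coefficient of each candidate by closed-form least squares, and ranks the candidates by error.

```python
import math
import random

# Hypothetical candidate library: (name, complexity, basis g) for models y = c * g(x).
# Real SR tools build such expressions dynamically instead of enumerating a fixed list.
CANDIDATES = [
    ("c*x",      2, lambda x: x),
    ("c*x^2",    3, lambda x: x * x),
    ("c*sin(x)", 3, lambda x: math.sin(x)),
    ("c*exp(x)", 3, lambda x: math.exp(x)),
]

def fit_coefficient(g, xs, ys):
    """Closed-form least-squares estimate of c in y = c * g(x)."""
    num = sum(g(x) * y for x, y in zip(xs, ys))
    den = sum(g(x) ** 2 for x in xs)
    return num / den

def mse(g, c, xs, ys):
    """Mean squared error of the fitted candidate on the data."""
    return sum((c * g(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def search(xs, ys):
    """Fit every candidate structure and return the results sorted by error."""
    results = []
    for name, complexity, g in CANDIDATES:
        c = fit_coefficient(g, xs, ys)
        results.append((mse(g, c, xs, ys), complexity, name, c))
    return sorted(results)

# Synthetic noisy data from an assumed ground truth y = 0.83 * x^2.
rng = random.Random(0)
xs = [i / 10 for i in range(1, 31)]
ys = [0.83 * x * x + rng.gauss(0, 0.05) for x in xs]

best_error, best_complexity, best_name, best_c = search(xs, ys)[0]
print(best_name, round(best_c, 3))
```

The point of the sketch is the output type: the search returns a named expression and a coefficient that can be read directly, not a weight vector.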

High-fidelity simulation as a controlled data source

In many scientific applications, empirical data are scarce, noisy, or confounded by measurement artefacts. A robust SR pipeline therefore typically begins with the construction of a high-fidelity simulation environment that maps the relevant state space of the system under study. This serves two purposes:

It allows systematic exploration of parameter regimes that may be experimentally inaccessible.
It provides a ground-truth reference against which recovered symbolic expressions can be validated rigorously.

At ActarusLab we employ simulation frameworks such as QuTiP for quantum dynamics, Monte Carlo samplers for stochastic systems, and custom CFD integrators for continuum problems, generating datasets that can span hundreds of thousands of distinct system states.
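The two purposes listed above can be sketched with a toy stand-in for a simulator. Everything here is an assumption made for illustration: the `simulate` function is a made-up closed form and not one of the frameworks named above, and the sampling ranges and noise level are arbitrary. The sketch samples states, records both the noisy observation and the noise-free reference, and then compares candidate expressions against that reference.

```python
import math
import random

def simulate(x, t):
    """Hypothetical stand-in for a high-fidelity simulator (assumed closed form)."""
    return 0.83 * x**2 + 1.2 * math.exp(-0.4 * t)

def generate_dataset(n, noise_sd, seed=0):
    """Sample n states and return rows of (x, t, noisy observation, noise-free value)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(-2.0, 2.0)
        t = rng.uniform(0.0, 10.0)
        y_true = simulate(x, t)
        y_obs = y_true + rng.gauss(0.0, noise_sd)
        rows.append((x, t, y_obs, y_true))
    return rows

def max_abs_error(candidate, rows):
    """Worst-case deviation of a candidate expression from the noise-free reference."""
    return max(abs(candidate(x, t) - y_true) for x, t, _, y_true in rows)

rows = generate_dataset(n=1000, noise_sd=0.05)

# Two hypothetical outputs of a symbolic search: one complete, one missing the decay term.
recovered = lambda x, t: 0.83 * x**2 + 1.2 * math.exp(-0.4 * t)
truncated = lambda x, t: 0.83 * x**2

print(max_abs_error(recovered, rows), max_abs_error(truncated, rows))
```

Because the reference values are noise-free, a structurally incomplete candidate is exposed by its deviation from the ground truth even when its error on the noisy observations looks acceptable.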

The discovery pipeline: three operational stages

1. Data generation and preprocessing: Raw data from simulations or experimental sources are cleaned, normalised, and stratified. Feature engineering is kept minimal to avoid injecting domain bias into the symbolic search.

2. Symbolic search: SR tools are applied to scan the space of candidate mathematical expressions, for example PySR, which performs an evolutionary search over expression trees, or SINDy, which selects a sparse set of terms from a predefined candidate library. The search is parallelised across expression complexity levels and evaluated against held-out data.

3. Pareto selection and validation: Candidate expressions on the complexity–accuracy Pareto front are subjected to out-of-fold validation on unseen data partitions. Only expressions that demonstrate stability across folds and distributional shifts are retained.
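The selection step in stage 3 can be sketched as follows. This is a minimal illustration under stated assumptions: the candidate names, complexity scores, error values, and the stability tolerance are placeholder numbers invented for the example, not results, and the stability check shown (spread of per-fold errors) is only one possible criterion.

```python
def pareto_front(candidates):
    """Keep candidates that no other candidate dominates.

    A candidate is dominated if another one is no worse in both complexity
    and error, and strictly better in at least one of them.
    """
    front = []
    for c in candidates:
        dominated = any(
            o["complexity"] <= c["complexity"]
            and o["error"] <= c["error"]
            and (o["complexity"] < c["complexity"] or o["error"] < c["error"])
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c["complexity"])

def stable_across_folds(fold_errors, tolerance):
    """Assumed stability criterion: per-fold errors stay within a fixed spread."""
    return max(fold_errors) - min(fold_errors) <= tolerance

# Placeholder candidates with illustrative (made-up) complexity and error values.
candidates = [
    {"expr": "a*x",                 "complexity": 3,  "error": 0.40},
    {"expr": "a*x^2",               "complexity": 5,  "error": 0.05},
    {"expr": "a*x^2 + b",           "complexity": 7,  "error": 0.05},
    {"expr": "a*x^2 + b*exp(-c*t)", "complexity": 11, "error": 0.003},
    {"expr": "a*x^3 + b*sin(t)",    "complexity": 11, "error": 0.20},
]

front = pareto_front(candidates)
front_exprs = [c["expr"] for c in front]
print(front_exprs)
```

In this sketch `a*x^2 + b` is discarded because `a*x^2` reaches the same error with lower complexity, which is the sense in which the front encodes the accuracy and complexity trade-off. Each surviving expression would then be checked with something like `stable_across_folds` before being retained.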

A concrete illustration

As a simplified illustration, a symbolic search applied to a physical dynamical system might recover an expression of the form:

y = 0.83x² + 1.2e^(−0.4t)

The significance of such a result lies not in its formal complexity, but in what it enables:

The expression can be analytically differentiated to identify critical points.
Its asymptotic behaviour is immediately readable from the exponential decay term.
It can be deployed as a surrogate model without any computational graph or runtime dependency.
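These three properties can be checked directly on the illustrative expression. The sketch below uses only the Python standard library. The partial derivatives are worked out by hand (∂y/∂x = 1.66x and ∂y/∂t = −0.48e^(−0.4t)) and verified against a central finite difference; the function names and the test points are arbitrary choices for illustration.

```python
import math

def y(x, t):
    """The illustrative expression, deployed as a plain function."""
    return 0.83 * x**2 + 1.2 * math.exp(-0.4 * t)

def dy_dx(x, t):
    """Hand-derived partial derivative with respect to x: 2 * 0.83 * x."""
    return 1.66 * x

def dy_dt(x, t):
    """Hand-derived partial derivative with respect to t: 1.2 * (-0.4) * e^(-0.4t)."""
    return -0.48 * math.exp(-0.4 * t)

def central_diff(f, x, t, wrt, h=1e-6):
    """Numerical derivative used only to cross-check the analytic forms."""
    if wrt == "x":
        return (f(x + h, t) - f(x - h, t)) / (2 * h)
    return (f(x, t + h) - f(x, t - h)) / (2 * h)

# Critical point in x: dy/dx vanishes at x = 0 for every t.
print(dy_dx(0.0, 3.0))

# Asymptotic behaviour: for large t the decay term vanishes and y approaches 0.83 * x^2.
print(y(2.0, 100.0), 0.83 * 2.0**2)

# Cross-check of the analytic derivatives at an arbitrary point.
print(dy_dx(1.5, 2.0), central_diff(y, 1.5, 2.0, "x"))
print(dy_dt(1.5, 2.0), central_diff(y, 1.5, 2.0, "t"))
```

Since dy/dt is negative for every t, the expression decreases monotonically in t towards 0.83x², which is the asymptote read off from the decay term.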

Interpretability as a scientific constraint, not a feature

It is important to resist the framing of interpretability as an optional property — a desirable but secondary characteristic that can be traded against predictive performance. In scientific contexts, a model that cannot be examined is difficult to validate, and a model that cannot be validated is not science. Interpretability is therefore a hard constraint in our methodology, not a post-hoc consideration.

This implies a deliberate willingness to accept marginally lower predictive performance in exchange for symbolic transparency. The resulting models are verifiable, portable, and compatible with the formal language of scientific publication and regulatory compliance.

Conclusion

Symbolic regression offers a principled route from raw data to mathematical structure. It does not replace scientific reasoning; it automates one of its most laborious components — the enumeration and evaluation of candidate functional forms — allowing researchers to focus on interpretation, validation, and application. In this sense, it provides a genuine bridge between the complexity of data and the clarity of governing equations.