The University of Manchester

Abstract
Inter-atomic potentials are surrogate models that replace expensive quantum-mechanical (QM) calculations with fast-to-evaluate functions that depend, directly or indirectly, on the positions of atoms and return energies and forces. This way, we can simulate millions of atoms on picosecond-to-nanosecond timescales, something beyond the reach of QM methods such as density functional theory (DFT).
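As a deliberately simple illustration of such a fast-to-evaluate function, the sketch below implements a toy Lennard-Jones pair potential returning an energy and per-atom forces from positions. The functional form and the parameters epsilon and sigma are our illustrative choices, not a model from this work.

```python
import numpy as np

def lj_energy_forces(positions, epsilon=1.0, sigma=1.0):
    """Toy Lennard-Jones potential: energy and forces from positions.

    positions: (N, 3) array of atomic coordinates.
    Returns (energy, forces), with forces shaped (N, 3).
    """
    positions = np.asarray(positions, dtype=float)
    n = len(positions)
    energy = 0.0
    forces = np.zeros_like(positions)
    for i in range(n):
        for j in range(i + 1, n):
            rij = positions[i] - positions[j]
            r = np.linalg.norm(rij)
            sr6 = (sigma / r) ** 6
            # Pair energy: 4*eps*[(sigma/r)^12 - (sigma/r)^6]
            energy += 4.0 * epsilon * (sr6**2 - sr6)
            # Force on atom i is -dE/dr along the unit vector rij/r
            dE_dr = 4.0 * epsilon * (-12.0 * sr6**2 + 6.0 * sr6) / r
            fij = -dE_dr * rij / r
            forces[i] += fij
            forces[j] -= fij
    return energy, forces
```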
There is constant demand for highly accurate surrogate models that can handle many diverse configurations, and the most straightforward way to develop them is to employ an empirical framework. However, such models have to be complex, i.e. "flexible" enough, and are therefore particularly susceptible to insufficient information in the training set.
Model complexity is not the only challenge. The learning problem cannot be easily formulated on a domain with a comprehensive metric (e.g. $\mathbb{R}^{3\times N}$, where $N$ is the number of atoms). Instead, the map to energies and forces is defined (implicitly) on collections of atomic positions of varying lengths. Additionally, the formulation of potentials is facilitated by descriptors of chemical environments: functions that represent a collection of atoms as a vector, tensor or set (the simplest being pairwise distances). As such, we have two consecutive and non-trivial relationships rather than one: the first defined by our choice of a particular model and descriptor, the second the result of solving a regression problem. All this renders the exploration of the domain particularly difficult. In other words, how can we tell whether we have the right data to train a model? The associated challenges are unavoidable regardless of the method we intend to use. We need a systematic approach to the generation of training data, rather than relying on our intuition and ability to navigate high-dimensional spaces that represent physical systems in an abstract way.
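A minimal sketch of the two consecutive maps described above, using the simplest descriptor mentioned in the text (pairwise distances). The cutoff, histogram encoding and function names are illustrative assumptions, not the descriptors used in this work.

```python
import numpy as np

def pairwise_distances(positions, cutoff=5.0):
    """Simplest descriptor: inter-atomic distances below a cutoff.

    Maps a variable-size collection of positions (N, 3) to a sorted
    vector; its length still depends on N, which is why a fixed-size
    encoding (here: a histogram) is layered on top.
    """
    positions = np.asarray(positions, dtype=float)
    n = len(positions)
    dists = [np.linalg.norm(positions[i] - positions[j])
             for i in range(n) for j in range(i + 1, n)]
    return np.sort([d for d in dists if d < cutoff])

def descriptor(positions, cutoff=5.0, bins=16):
    """First map: positions -> fixed-length feature vector."""
    d = pairwise_distances(positions, cutoff)
    hist, _ = np.histogram(d, bins=bins, range=(0.0, cutoff))
    return hist.astype(float)

def energy(positions, w):
    """Second map: features -> energy, here a linear model whose
    weights w come from solving a regression problem."""
    return descriptor(positions) @ w
```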
We argue that classical concepts of statistical planning and design of experiments, many of which were developed when supercomputers could not match the speed of current smartphones, are in many ways a better solution to the problems outlined above than popular approaches based, for example, on active learning. The methods discussed allow us to assess the informativeness of the data (how much we can improve the model by adding or swapping a training example) and to verify whether training is feasible with the current set before obtaining any reference energies and forces. Hence, we can avoid unnecessary evaluations and the extra complexity associated with constant re-training.
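One classical criterion that makes "informativeness before labelling" concrete is D-optimal design: for a model linear in its features, the gain in $\log\det(X^{\top}X)$ from adding a candidate configuration depends only on its descriptor, not on its (yet unknown) reference energy. A minimal sketch, assuming hypothetical names and a small ridge term for numerical stability; this is one standard instance of the statistical-planning concepts referred to above, not the specific method of this work.

```python
import numpy as np

def d_optimal_gain(X_train, x_candidate, ridge=1e-8):
    """Gain in log det(X^T X) from adding one candidate row.

    By the matrix determinant lemma,
        det(A + x x^T) = det(A) * (1 + x^T A^{-1} x),
    so the gain, log(1 + x^T A^{-1} x), needs only descriptors --
    no reference energies or forces are evaluated.
    """
    A = X_train.T @ X_train + ridge * np.eye(X_train.shape[1])
    leverage = x_candidate @ np.linalg.solve(A, x_candidate)
    return np.log1p(leverage)

# Rank unlabeled candidates and pick the most informative one:
# best = max(candidates, key=lambda x: d_optimal_gain(X_train, x))
```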