Do Two AI Scientists Agree?

Xinghong Fu (fxh@mit.edu), Ziming Liu, Max Tegmark
Department of Physics, Institute of Artificial Intelligence and Fundamental Interactions, Massachusetts Institute of Technology, Cambridge, USA
(April 3, 2025)
Abstract

When two AI models are trained on the same scientific task, do they learn the same theory or two different theories? Throughout the history of science, we have witnessed the rise and fall of theories driven by experimental validation or falsification: many theories may co-exist when experimental data is lacking, but the space of surviving theories becomes more constrained as more experimental data becomes available. We show that the same story is true for AI scientists. As more systems are provided in the training data, AI scientists tend to converge in the theories they learn, although sometimes they form distinct groups corresponding to different theories. To mechanistically interpret what theories AI scientists learn and to quantify their agreement, we propose MASS, Hamiltonian-Lagrangian neural networks as AI scientists, trained on standard problems in physics, and we aggregate training results across many seeds that simulate different configurations of AI scientists. Our key findings include: 1) when trained on textbook problems in classical mechanics, AI scientists prefer either a complete Hamiltonian or Lagrangian description; 2) when extended to non-standard physical problems, the Lagrangian description generalizes, suggesting that Lagrangian dynamics remains the single accurate family of descriptions in a rich theory space. We also observe strong seed dependence of the training dynamics and final learned weights, which controls the rise and fall of relevant theories. Besides interpretability, MASS unifies and generalizes beyond Lagrangian neural networks and Hamiltonian neural networks, providing a new tool for learning dynamical systems. We release our code at https://github.com/shinfxh/ai-scientists.


I Introduction

Figure 1: The evolution of AI scientists. Different AI scientists learning from data within the same physical system, even the simple pendulum, arrive at different results. Theories that fail to support the current data are marked wrong. Surviving AI scientists are exposed to more complex systems, such as the double pendulum. AI scientists modify their theories to model the new data. Ultimately, what will the remaining AI scientists learn?

Throughout human history, our collective curiosity has driven scientific progress. From Archimedes’ principle of buoyancy, to Galileo’s systematic study of motion, to Newton’s formulation of classical mechanics, and finally to Einstein’s revolutionary theories of relativity, these luminaries meticulously analyzed observations and experiments to develop robust hypotheses that explained known phenomena and predicted new ones. Over the centuries, as technology has advanced, so too has our ability to refine experiments, test theories against increasingly precise datasets, and update our frameworks accordingly. Some hypotheses have eventually fallen out of favor, while others have evolved into more nuanced theories capable of describing phenomena at previously uncharted scales [1].

Today, in the twenty-first century, we are witnessing the emergence of a new paradigm. Machine learning (ML) and data-driven methods have already begun to supplant traditional statistical tools in fields as diverse as particle physics [2], astronomy [3], materials science [4] and quantum chemistry [5]. A natural next step is to contemplate a future in which ML methods shift from mere assistive instruments to becoming full-fledged “AI scientists,” capable of formulating hypotheses, designing experiments, and interpreting results with minimal human intervention. Pioneering endeavors have already produced end-to-end AI platforms that can discover physical laws from raw experimental data [6, 7], and molecular structures from protein sequences [8]. Recent improvements in architectures [9] that can absorb and process large amounts of data have fueled the development of large language models (LLMs) [10, 11, 12, 13, 14, 15]. These LLMs have already started becoming the backbone of fully automated AI research scientists [16].

As these AI scientists begin to operate autonomously, it is worth asking: What scientific theories will they propose? History shows that different human researchers, such as Newton and Leibniz, can arrive at complementary yet distinct formulations of the same phenomenon (e.g., calculus). Analogously, modern ML systems vary in architecture, initialization schemes, and training paradigms [17], leading to the possibility that independently trained AI scientists might converge on different theoretical formulations or complementary perspectives. Moreover, as AI scientists venture into analyzing larger and more intricate datasets—ranging from high-dimensional cosmological surveys to complex molecular dynamics simulations—their learned representations and theories may evolve in unexpected ways [18].

This paper does not seek to predict precisely how AI will transform science in the decades ahead. Instead, we offer a set of controlled experiments to investigate whether and how multiple AI scientists, trained under varying conditions, converge or diverge in their scientific theories. By exploring synthetic datasets, we aim to shed light on how the complexity of the data, the choice of model architectures, and the selection of training methods may influence not only what these AI systems learn but also how their internal representations and resulting theories develop over time [19].

In doing so, we hope to provide a window into the kind of questions that will shape future inquiries into the role of AI in science: will AI scientists unify disparate theories or fragment into multiple, equally valid viewpoints? Will their theories be understandable to humans, or will interpretability become a bigger challenge? The experimental framework and preliminary results presented herein serve as a stepping stone for these discussions, highlighting the potential—and potential pitfalls—of our emerging AI Scientists.

To list, our contributions in this paper are:

  1. We propose a new architecture, MASS (Multiple AI Scalar Scientists), to allow a single neural network to learn diverse theories across multiple physical systems.

  2. We train MASS across datasets including the simple pendulum, the Kepler problem, and synthetic potentials.

  3. We analyze significant activations in MASS and distill the theories learned by MASS.

Using MASS as a proxy for an AI scientist, our findings suggest that

  1. An AI scientist can learn many diverse explanations of the same physical phenomenon.

  2. When they encounter more complex systems, successful AI scientists modify their existing theories to suit new observations.

  3. AI scientists tend to learn similar theories, evaluated in terms of the similarity of the networks’ internal activations. These theories also closely resemble either the Hamiltonian or the Lagrangian description.

  4. The recovered theories resemble Hamiltonian dynamics initially, then shift closer to a Lagrangian formulation as the complexity of the systems increases. This suggests that the Lagrangian formulation remains the only correct theory within a rich theory space.

II Related Work

Scientists aim to recover equations from observation. So do AI scientists. Given a set of data of some physical system, we aim to uncover the underlying “truth” in terms of physical equations. Efforts to tackle this problem have been a mix of discrete methods such as combinatorial optimization, including genetic programming [20], and continuous ones centered on symbolic regression [6]. The underlying assumption is that the final expression contains only a few terms, which motivates methods based on sparse linear regression [21]. Physical priors were introduced [22] to improve the ability of symbolic regression techniques to discover known physical equations. In this paper, we propose a method to discover underlying physical laws with minimal physical priors, making use of the principle of stationary action and learning a single scalar function. These two properties are shared by the Hamiltonian Neural Network (HNN) [23] and the Lagrangian Neural Network (LNN) [24].

Inspired by the Hamiltonian formulation of classical mechanics, the HNN breaks down the task of learning the equations of motion of a physical system into first learning a scalar function, the Hamiltonian $H$, then obtaining $(\dot{q},\dot{p})$ using Hamilton’s canonical equations:

$$\dot{q}=\frac{\partial H}{\partial p},\quad\dot{p}=-\frac{\partial H}{\partial q}, \tag{1}$$

where $q, p$ are the canonical position and momentum. However, in some cases the expression for these canonical coordinates is not trivial to write down. The LNN solves this problem by learning the scalar function as the Lagrangian $\mathcal{L}$ instead and taking derivatives according to the Euler-Lagrange equation:

$$\frac{d}{dt}\left(\frac{\partial\mathcal{L}}{\partial\dot{q}}\right)-\frac{\partial\mathcal{L}}{\partial q}=0. \tag{2}$$

This avoids the need for an explicit expression of canonical momentum, making LNNs advantageous for certain physical systems.
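As a minimal sketch of Equations 1 and 2 (not the HNN or LNN implementation), the snippet below applies both to a harmonic oscillator with unit mass and unit spring constant, using central finite differences in place of automatic differentiation. The functions `hamiltonian` and `lagrangian`, the helper names, and the step size `eps` are illustrative assumptions.

```python
def hamiltonian(q, p):
    # H = p^2/2 + q^2/2 for unit mass and unit spring constant (assumed example)
    return 0.5 * p * p + 0.5 * q * q

def lagrangian(q, v):
    # L = v^2/2 - q^2/2 for the same system
    return 0.5 * v * v - 0.5 * q * q

def partial(f, args, i, eps=1e-4):
    # Central finite difference of f with respect to argument i
    hi, lo = list(args), list(args)
    hi[i] += eps
    lo[i] -= eps
    return (f(*hi) - f(*lo)) / (2 * eps)

def second_partial(f, args, i, j, eps=1e-4):
    # Central difference of df/d(arg i) with respect to argument j
    hi, lo = list(args), list(args)
    hi[j] += eps
    lo[j] -= eps
    return (partial(f, hi, i, eps) - partial(f, lo, i, eps)) / (2 * eps)

def hamilton_rhs(q, p):
    # Equation 1: q_dot = dH/dp, p_dot = -dH/dq
    return partial(hamiltonian, (q, p), 1), -partial(hamiltonian, (q, p), 0)

def euler_lagrange_accel(q, v):
    # Equation 2 expanded with the chain rule in one dimension:
    # L_vv * a + L_vq * v - L_q = 0  =>  a = (L_q - L_vq * v) / L_vv
    L_q = partial(lagrangian, (q, v), 0)
    L_vq = second_partial(lagrangian, (q, v), 1, 0)
    L_vv = second_partial(lagrangian, (q, v), 1, 1)
    return (L_q - L_vq * v) / L_vv

q_dot, p_dot = hamilton_rhs(0.3, -0.7)
accel = euler_lagrange_accel(0.3, -0.7)
print(q_dot, p_dot, accel)
```

For this system the momentum equals the velocity, so both formulations should yield the same acceleration, approximately the negative of the position. Note that the Lagrangian route needs a division by the second derivative `L_vv`, which the Hamiltonian route avoids.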

Since the introduction of these works, there have been significant leaps in improving the efficiency of training [25, 26], as well as in applying these networks to problems in domains such as rigid-body dynamics [27], particle interactions [28], video prediction [29] and generative modeling [30]. However, in many of these works the underlying equations of motion (Equations 1 and 2) are baked into the model architecture, and the model consequently learns the theory governed by the corresponding equation. Instead, we ask the following question: when the model is given the freedom to learn multiple theories, what will it learn?

In this work, our proposed model, Multiple AI Scalar Scientists (MASS), is a generalized framework that includes both the LNN and the HNN as special cases. MASS is similarly inspired by the principle of stationary action. Like LNNs and HNNs, we aim to learn a free-form scalar function from data. However, unlike LNNs and HNNs, which have hard-coded equations of motion, we equip MASS with the ability to also learn the equations of motion. For a physical system described by generalized coordinates $q$ and velocities $\dot{q}$, one can learn a scalar function (akin to a Lagrangian or Hamiltonian) that governs the system’s evolution, such that the path obeys the principle of stationary action.

The architecture design of MASS allows it to learn a rich space of theories, defined by the coefficients on each term in the governing equation learned by MASS. Similar to [24], our experiments are done in generalized coordinates. Through a series of controlled experiments on an ensemble of MASS scientists in these coordinates, we will probe the underlying theories learned.

III MASS: The AI scientist

Figure 2: The MASS (Multi-physics AI Scalar Scientist) network.

To emulate how human scientists operate, the core idea behind MASS is to embed within a single neural network the capacity to learn and unify information from multiple physical systems. Rather than fitting individual models for each system, MASS aims to internalize a shared framework that captures fundamental patterns across all datasets. Specifically, it achieves this by learning a scalar function, analogous to a Lagrangian or Hamiltonian, whose derivatives encode the system-specific dynamics. As illustrated in Figure 2, MASS adopts the following workflow:

  1. Data ingestion: MASS receives observational data (e.g., trajectories, states, or energy values) from diverse physical systems such as pendulums, orbital problems, or other synthetic potentials.

  2. Hypothesis formation: A separate neural network for each system learns a single scalar function that encapsulates the system-specific dynamics.

  3. Theory evaluation: A final layer shared across all systems differentiates the learned scalar function with respect to the system coordinates (position, momentum, and/or velocity). From these derivatives, MASS infers a system’s governing equations. Sharing this layer enforces the consistency of an overarching theory across multiple systems.

  4. Refinement and generalization: The outputs of the model, in this case the time derivatives of the input, are then evaluated against the ground-truth training data to compute errors. The error is summed across all systems, then backpropagation optimizes a single theory that is simultaneously consistent with multi-physics observations.

By iterating through these steps, MASS strives to discover a single, scalar function for each system and a shared final layer forming a generalized theory across systems. Together, the scalar function and the weights in the final layer (i.e. how MASS takes its derivatives) form the theory that it learns.

IV Method

We denote by $\mathcal{M}$ a single MASS scientist network. $\mathcal{M}$ learns from $n$ different physical systems. Examples of systems include the spring-mass system, the gravitational system, and quantum mechanical systems. Each system obeys some underlying physical law, be it the inverse-square law of gravitational attraction or Schrödinger’s equation. For simplicity and as a proof of concept, we restrict our systems to classical mechanics below.

Data ingestion: The input variables for system $j$ to $\mathcal{M}$ are expressed in $d$ dimensions as $\mathbf{x}_j, \mathbf{y}_j \in \mathbb{R}^d$, where $\mathbf{x}_j$ and $\mathbf{y}_j$ are the generalized coordinates and their time derivatives, respectively. For a simple pendulum, we can either express $(\mathbf{x}_j, \mathbf{y}_j) := (\theta, \dot{\theta})$ as a one-dimensional problem, or use $\mathbf{x}_j := (x, y)$ and $\mathbf{y}_j := (\dot{x}, \dot{y})$ to express the problem in two dimensions with Cartesian coordinates.

Hypothesis formation: This block consists of $n$ separate neural networks, each learning a separate potential function $S_j$ for its system $j$. We denote this forward pass as

$$S_j = f_j(\mathbf{x}_j, \mathbf{y}_j). \tag{3}$$

In this paper, we focus on MLPs, which suffice for learning $S$.

Theory evaluation: The shared derivatives layer computes the derivatives, up to second order, of $S_j$ with respect to the input variables $\mathbf{x}_j, \mathbf{y}_j$. Note that given $d$-dimensional inputs, i.e. $\mathbf{x}_j, \mathbf{y}_j \in \mathbb{R}^d$, the first-order derivatives $S_{\mathbf{x}}, S_{\mathbf{y}} \in \mathbb{R}^d$ are column vectors, while the second-order derivatives (and their inverses) are Hessian matrices, i.e. $S_{\mathbf{xx}}, S_{\mathbf{yy}}, S_{\mathbf{xy}}, S_{\mathbf{xx}}^{-1}, S_{\mathbf{yy}}^{-1}, S_{\mathbf{xy}}^{-1} \in \mathbb{R}^{d \times d}$.
To allow the network to learn a diverse set of theories, we compute all products of up to three of these terms such that the final result is a vector in $\mathbb{R}^d$ that can predict the time derivatives $\dot{\mathbf{x}}_j, \dot{\mathbf{y}}_j \in \mathbb{R}^d$. In particular, let the set of $\mathbb{R}^d$ vectors be $\mathcal{V} = \{\mathbf{x}, \mathbf{y}, S_{\mathbf{x}}, S_{\mathbf{y}}\}$ and the set of $\mathbb{R}^{d \times d}$ matrices be $\mathcal{A} = \{S_{\mathbf{xx}}, S_{\mathbf{yy}}, S_{\mathbf{xy}}, S_{\mathbf{xx}}^{-1}, S_{\mathbf{yy}}^{-1}, S_{\mathbf{xy}}^{-1}\}$.
There are three types of terms that can potentially predict $\dot{\mathbf{x}}_j, \dot{\mathbf{y}}_j$:

  1. $\mathbf{v} \in \mathcal{V}$

  2. $\mathbf{A}\mathbf{v}$, where $\mathbf{A} \in \mathcal{A}$ and $\mathbf{v} \in \mathcal{V}$

  3. $\mathbf{A}_1 \mathbf{A}_2 \mathbf{v}$, where $\mathbf{A}_1, \mathbf{A}_2 \in \mathcal{A}$ and $\mathbf{v} \in \mathcal{V}$.

In our implementation, there are a total of $T = 172$ different terms across the three types described above, and we explicitly compute them using

$$\mathbf{t}_j = D(f_j(\mathbf{x}_j, \mathbf{y}_j)), \tag{4}$$

where $D$ is the derivative layer and $\mathbf{t} \in \mathbb{R}^{T \times d}$ gives the terms that can potentially enter into the final equation.
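The count of 172 can be checked by enumerating the three term types symbolically. The sketch below assumes that type-3 terms are ordered pairs of matrices with repeats allowed, and that no products are simplified or de-duplicated (for example, a matrix multiplied by its own inverse is kept). Under that assumption the count matches the stated total, which suggests, but does not confirm, that the implementation enumerates terms the same way. The string labels are illustrative only.

```python
from itertools import product

vectors = ["x", "y", "S_x", "S_y"]                 # the set V
matrices = ["S_xx", "S_yy", "S_xy",
            "S_xx^-1", "S_yy^-1", "S_xy^-1"]       # the set A

type1 = [(v,) for v in vectors]                    # v
type2 = list(product(matrices, vectors))           # A v
type3 = list(product(matrices, matrices, vectors)) # A1 A2 v

terms = type1 + type2 + type3
print(len(type1), len(type2), len(type3), len(terms))
```

The three types contribute 4, 24, and 144 terms respectively.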

In the final layer, the network learns a linear combination of these $\mathbb{R}^d$ vectors to predict the time derivatives of the inputs. Note that since we are using generalized position and momentum, $\dot{\mathbf{x}} = \mathbf{y}$ trivially up to a constant factor. The remainder of the paper focuses on investigating the set of activations in the final layer that predicts $\dot{\mathbf{y}}$. We denote this final layer as $L_f$, and the output prediction of $\dot{\mathbf{y}}$ is given by

$$\hat{\dot{\mathbf{y}}}_j = L_f(\mathbf{t}_j) = L_f(D(f_j(\mathbf{x}_j, \mathbf{y}_j))). \tag{5}$$
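To illustrate how a linear final layer over these terms can contain the LNN and HNN as special cases, the sketch below works in one dimension with analytic derivatives of two hand-picked scalar functions for a unit harmonic oscillator. The weight assignments are our reading of how Equations 1 and 2 map onto the term library (expanding Equation 2 gives $\dot{y} = S_{yy}^{-1} S_x - S_{yy}^{-1} S_{xy}\, y$, while Equation 1 gives $\dot{y} = -S_x$ when $y$ is the momentum). They are not weights reported by the authors, and all function names are hypothetical.

```python
def derivatives_lagrangian(x, y):
    # Analytic derivatives of S = y^2/2 - x^2/2 (harmonic-oscillator Lagrangian)
    return {"S_x": -x, "S_y": y, "S_xx": -1.0, "S_yy": 1.0, "S_xy": 0.0}

def derivatives_hamiltonian(x, y):
    # Analytic derivatives of S = y^2/2 + x^2/2 (Hamiltonian, y read as momentum)
    return {"S_x": x, "S_y": y, "S_xx": 1.0, "S_yy": 1.0, "S_xy": 0.0}

def final_layer(terms, weights):
    # Linear combination of candidate terms; terms without a weight contribute zero
    return sum(w * terms[name] for name, w in weights.items())

x, y = 0.3, -0.7

# Lagrangian-like theory: one type-2 term and one type-3 term
d = derivatives_lagrangian(x, y)
lagrangian_terms = {
    "S_yy^-1 S_x": d["S_x"] / d["S_yy"],
    "S_yy^-1 S_xy y": d["S_xy"] * y / d["S_yy"],
}
y_dot_lagrangian = final_layer(
    lagrangian_terms, {"S_yy^-1 S_x": 1.0, "S_yy^-1 S_xy y": -1.0}
)

# Hamiltonian-like theory: a single type-1 term
d = derivatives_hamiltonian(x, y)
y_dot_hamiltonian = final_layer({"S_x": d["S_x"]}, {"S_x": -1.0})

print(y_dot_lagrangian, y_dot_hamiltonian)
```

Both sparse weight patterns predict the same acceleration for this system, even though they pair different scalar functions with different terms.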

Refinement and generalization: For a specific system $j$, we predict $\hat{\dot{\mathbf{y}}}_j$ and then compute the MSE loss against the ground-truth data. We then sum the losses across all systems that $\mathcal{M}$ is learning from and backpropagate the accumulated gradients. After convergence, the model develops a theory that is consistent across its multiple physical systems. The optimization objective is written as

$$\min_{\theta} \sum_{j=1}^{n} \mathbb{E}_{(\mathbf{X}, \mathbf{Y})} \left\| \dot{\mathbf{Y}}_j - \hat{\dot{\mathbf{Y}}}_j \right\|_2^2, \tag{6}$$

where $\mathbf{Y}_j \in \mathbb{R}^{N \times d}$ is the concatenation of $N$ samples in system $j$, and the expectation is taken over the samples $\mathbf{X}, \mathbf{Y}$ drawn i.i.d. for each system.
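The objective in Equation 6 can be sketched in plain Python as a sum of per-system mean squared errors. The system names and numbers below are placeholder values chosen for illustration, not data or results from the paper, and the empirical mean over samples stands in for the expectation.

```python
def mse(pred, target):
    # Mean over samples of the squared Euclidean norm of the residual
    total = 0.0
    for p, t in zip(pred, target):
        total += sum((pi - ti) ** 2 for pi, ti in zip(p, t))
    return total / len(pred)

def total_loss(predictions, targets):
    # Sum of the per-system losses, as in Equation 6
    return sum(mse(predictions[name], targets[name]) for name in targets)

# Each system holds N = 2 samples of a d = 1 target y_dot (placeholder values)
targets = {"spring": [[-0.3], [0.5]], "pendulum": [[-0.25], [0.45]]}
predictions = {"spring": [[-0.2], [0.5]], "pendulum": [[-0.25], [0.45]]}

loss = total_loss(predictions, targets)
print(loss)
```

Because the per-system losses are summed before the gradient step, the shared final layer receives gradient contributions from every system at once.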

We find that optimization over $\theta$, the parameterization of $\mathcal{M}$, is highly unstable (as observed in [24]), due to the computation of derivatives and inverses in the matrices $\mathcal{A}$. The experimental procedure and hyperparameter settings are described in more detail in Appendix A, but some key design choices help achieve stable training:

  • Computing the pseudo-inverse with a regularized stabilization term. Instead of computing $\mathrm{inv}(S_{\mathbf{xx}})$, we compute $\mathrm{pinv}(S_{\mathbf{xx}} + b)$, where $b$ is penalized as a regularization term in training.

  • AdamW [31] optimizer with cosine learning rate scheduling [32] and warm restarts.

  • Augmenting inputs to include second-order terms of $\mathbf{x}, \mathbf{y}$.
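For reference, a cosine learning-rate schedule with warm restarts can be sketched as below. This uses the standard cosine annealing formula with a fixed restart period. The period and learning-rate bounds are illustrative assumptions, not the settings used in the paper.

```python
import math

def cosine_warm_restart_lr(step, period, lr_max, lr_min=0.0):
    # Position within the current cycle; the schedule restarts every `period` steps
    t_cur = step % period
    cosine = math.cos(math.pi * t_cur / period)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + cosine)

# Assumed values: three cycles of 100 steps with a peak learning rate of 1e-3
schedule = [cosine_warm_restart_lr(s, period=100, lr_max=1e-3) for s in range(300)]
print(schedule[0], schedule[50], schedule[99], schedule[100])
```

The learning rate decays along a half cosine within each cycle and jumps back to its peak at each restart.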

V Experiments

V.1 Single scientist: Correlated theories

“It might be that to describe the universe, we have to employ different theories in different situations. Each theory may have its own version of reality, but according to model-dependent realism, that is acceptable so long as the theories agree in their predictions whenever they overlap, that is, whenever they can both be applied.”
— Stephen Hawking & Leonard Mlodinow, The Grand Design (2010)

Figure 3: Training results for MASS on the simple harmonic oscillator, with panels (a) Training dynamics and (b) Simulated trajectory. (a) MASS (seed 0) trains to an MSE loss of $3 \times 10^{-4}$ over 10,000 steps with a batch size of 512 at each step. The number of significant weights, calculated as the number of weights in the final layer that account for the first 99% of the total norm, decreases with the loss. (b) The recreated motion of a single oscillator accurately captures the frequency and amplitude of the motion.

The central message of Hawking’s statement is that multiple theoretical frameworks can provide equally valid descriptions of physical phenomena, as long as their predictions agree with experiment. A typical classroom demonstration of this principle is the undamped spring-mass system. One can appeal to Newton’s laws of motion, where the governing equation is

$$m\ddot{x} = -kx,$$

or switch to a Hamiltonian formulation in which energy functions and conservation laws offer an alternative viewpoint.

Machine learning models, on the other hand, tend to be overparameterized, often giving them considerable flexibility in fitting data, even for relatively simple physical systems [33, 34]. An intriguing question arises: If we train a single “AI scientist” network on a simple harmonic oscillator, what sort of theoretical representation will it learn, and how will it compare to the standard Newtonian or Hamiltonian descriptions?

To investigate this, we trained MASS on simulated data from the undamped spring-mass system. Figure 3 shows the training results. In particular, we observe that training on the simple harmonic oscillator is not a difficult task for MASS, which converges to an MSE loss of $3 \times 10^{-4}$. We are interested in understanding how the model learns and simplifies its theory under the addition of $L_1$ and $L_2$ regularization to the final layer. To that end, we track the number of significant weights across training steps, calculated as the number of weights in the final layer that account for the first 99% of the total norm of the final-layer weight vector. Observe that this number also decreases with the total number of training steps, but it plateaus at a rather large value of 42. This means that about 42 weight terms have significant magnitude, not at all close to a simple theory of $\dot{y} = -x$!
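One way to compute such a count is sketched below. It assumes the "total norm" is the L1 norm (sum of absolute values) and that weights are accumulated from largest to smallest magnitude. The text does not specify either choice, and the example weights are made up.

```python
def count_significant(weights, fraction=0.99):
    # Sort magnitudes from largest to smallest and accumulate until the
    # running sum reaches `fraction` of the total L1 norm (assumed norm)
    magnitudes = sorted((abs(w) for w in weights), reverse=True)
    total = sum(magnitudes)
    running = 0.0
    for count, m in enumerate(magnitudes, start=1):
        running += m
        if running >= fraction * total:
            return count
    return len(magnitudes)

# Made-up weight vector: four large entries and two negligible ones
example = [4.0, -3.0, 2.0, 0.95, 0.03, 0.02]
n_significant = count_significant(example)
print(n_significant)
```

With this definition, a sparse theory such as $\dot{y} = -x$ would yield a count of one, which is the baseline against which the observed plateau should be read.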

Using MASS, we can also simulate a trajectory of the oscillator rather easily, and Figure 3(b) shows the consistency of the predictions given by MASS.
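A trajectory can be rolled out by integrating the predicted time derivatives. The sketch below uses a fourth-order Runge-Kutta integrator, which is an assumption since the integrator is not stated here, and substitutes the ground-truth dynamics $\dot{x} = y$, $\dot{y} = -x$ for a trained model's prediction.

```python
import math

def rhs(x, y):
    # Ground-truth oscillator dynamics; a trained model's prediction of
    # y_dot would replace the second component
    return y, -x

def rk4_step(x, y, dt):
    # One classical fourth-order Runge-Kutta step
    k1x, k1y = rhs(x, y)
    k2x, k2y = rhs(x + 0.5 * dt * k1x, y + 0.5 * dt * k1y)
    k3x, k3y = rhs(x + 0.5 * dt * k2x, y + 0.5 * dt * k2y)
    k4x, k4y = rhs(x + dt * k3x, y + dt * k3y)
    x_new = x + dt * (k1x + 2 * k2x + 2 * k3x + k4x) / 6
    y_new = y + dt * (k1y + 2 * k2y + 2 * k3y + k4y) / 6
    return x_new, y_new

def simulate(x0, y0, dt, steps):
    # Roll the state forward and record every intermediate point
    trajectory = [(x0, y0)]
    x, y = x0, y0
    for _ in range(steps):
        x, y = rk4_step(x, y, dt)
        trajectory.append((x, y))
    return trajectory

dt = 0.01
trajectory = simulate(1.0, 0.0, dt=dt, steps=1000)
# Compare against the analytic solution x(t) = cos(t)
max_error = max(abs(x - math.cos(i * dt)) for i, (x, _) in enumerate(trajectory))
print(max_error)
```

Comparing the rollout with the analytic solution separates integration error from model error when a learned prediction is substituted into `rhs`.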

Figure 4: Contour plots of (a) the learnt scalar function $S$, compared with (b) the Hamiltonian $x^2 + y^2$. MASS can, in general, learn functions that resemble yet differ from conventional physical priors.

Figure 4 depicts the learned scalar function $S$ over phase space, compared to a canonical Hamiltonian function $H$. A single MASS scientist is able to recover the expression for the sum of potential and kinetic energy. However, we already see some differences between these two expressions. Notably, our learnt $S$ is concave rather than convex as in $H$. $S$ also takes on negative values, which are not typically allowed in expressions of energy. In this example and also in general, the contour can look skewed or translated, or even resemble an entirely different conic section. This goes to show the richness of the theory space offered to MASS.

While the weights $w_i$ in the final layer offer a glimpse as to which terms are important, they are not the full story, because the final prediction of $\dot{y}$ is the sum

$$\hat{\dot{y}} = \sum_{i=1}^{T} w_i d_i,$$

where $d_i$ is the $i$-th derivative term, such as $S_x, S_{xy}, \dots$

Figure 5: (a) Weights and activations: weights in the final layer (blue) and the mean activation norms (red). The top 5 terms in mean activation magnitude are $S_{yy}^{-1}S_{yy}^{-1}x,\; S_{xy}^{-1}S_{yy}x,\; S_{yy}^{-1}S_{xx}x,\; S_{xx}x,\; S_{yy}^{-1}S_{xy}^{-1}x$. (b) Correlation map of significant activations, keeping only indices $i$ contributing to the first cumulative 99% of $\sum_i \mathbb{E}[a_i]$, plotted after hierarchical clustering. Most terms are strongly correlated.

We hence compute the activation vector $a_i = w_i d_i$ over a sample batch of 512 data points. In Figure 5, we compare the mean norm of the activations $\mathbb{E}(a_i)$ (expectation taken over the 512 data points) with the weights $w_i$. In general, non-zero weights correspond to non-zero activation norms, but the relative order of the term magnitudes is not necessarily preserved. In particular, inverse Hessian terms such as $S_{xx}^{-1}$ are large when the second derivatives of $S$ are small.
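
The comparison between weights and activations can be sketched as follows. This is a minimal illustration, not the released implementation: the derivative terms `d` and the weights `w` are random placeholders, and taking the mean of $|a_i|$ as the "mean norm" is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

B, T = 512, 172                      # batch size and number of final-layer terms, as in the text
d = rng.normal(size=(B, T))          # placeholder derivative terms d_i (random stand-ins)
w = rng.normal(size=T)               # placeholder final-layer weights w_i
w[rng.random(T) < 0.8] = 0.0         # pretend most weights have been driven to zero

a = d * w                            # activations a_i = w_i * d_i, shape (B, T)
mean_act = np.abs(a).mean(axis=0)    # E[|a_i|] over the batch, shape (T,)

# The two rankings need not agree, because d_i rescales each weight.
top_by_activation = np.argsort(-mean_act)[:5]
top_by_weight = np.argsort(-np.abs(w))[:5]
```

Zero weights always give zero activations here, while the ordering of the non-zero terms depends on the typical size of each $d_i$.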

The largest magnitude activations that contribute to the final prediction of $\dot{y}$ are, in descending order: $S_{xx}x,\; S_{xy}^{-1}S_{yy}x,\; S_{yy}^{-1}S_{yy}^{-1}x,\; S_{yy}^{-1}S_{xx}x,\; S_{yy}^{-1}x,\; \dots$.
When sorting instead by the norm of the weights, the top 5 terms are $S_{xx}x,\; S_{xy}^{-1}S_{yy}x,\; S_{yy}^{-1}S_{yy}^{-1}x,\; S_{yy}^{-1}S_{xx}x,\; S_{yy}^{-1}x$. The similarity of these terms is a strong indicator of the important terms contributing to the final theory learnt by MASS.

In the next step, we filter down to the significant terms. Following the convention in Figure 3, we keep only those terms that contribute to the first 99% of the total magnitude. That is, with terms sorted by decreasing $\mathbb{E}[|a_i|]$, $j$ is the number of significant terms if it is the smallest index such that $\sum_{i=1}^{j}\mathbb{E}[|a_i|] > 0.99\sum_{i=1}^{T}\mathbb{E}[|a_i|]$. On these remaining significant activations, we compute the correlations shown in the heatmap in Figure 5(b), sorted according to similarity in correlations by hierarchical clustering [35]. There are three distinct clusters, consisting of terms that are linear matrix products of the vectors $y$, $S_x$ and $x$ respectively, from the upper left to lower right.
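
A minimal sketch of this filtering step is given below. The helper `significant_terms` and the toy activations are hypothetical, and the reordering by hierarchical clustering is omitted; only the 99% rule and the correlation map are illustrated.

```python
import numpy as np

def significant_terms(a, frac=0.99):
    """Indices of the largest terms whose cumulative mean |a_i| first exceeds `frac` of the total."""
    mean_act = np.abs(a).mean(axis=0)
    order = np.argsort(-mean_act)                       # descending magnitude
    cum = np.cumsum(mean_act[order])
    # first position where the cumulative sum is strictly above frac * total
    j = int(np.searchsorted(cum, frac * cum[-1], side="right")) + 1
    return order[:min(j, len(order))]

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=512)
# toy activations: three terms proportional to x and one negligible noise term
a = np.stack([2.0 * x, -0.5 * x, 1.0 * x, 1e-4 * rng.normal(size=512)], axis=1)

idx = significant_terms(a)
corr = np.corrcoef(a[:, idx], rowvar=False)             # pairwise correlations of the kept terms
```

In this toy case the noise term is dropped and the three surviving terms are perfectly correlated or anti-correlated, which is the kind of block structure a correlation heatmap would show.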

The existence of multiple terms stands in contrast to a trivial theory in which the only significant activation is $-x$, which gives a perfect prediction of $\dot{y}$. The Hamiltonian expression would construct $S = \frac{1}{2}x^2 + \frac{1}{2}y^2$ and predict $\dot{y} = -S_x$. The reason behind the multi-expressivity of the network is that most second-order terms are constant when the scalar function $S$ is at most second order. For instance, one can easily conceive of a network that learns $S = x^2 + y^2 + xy$, where the Hessian matrices $S_{xx}, S_{yy}, S_{xy}$ and their inverses become constant factors. The invariance in learning these products gives rise to the mix of expressions we see. Nonetheless, these terms turn out to be extremely correlated and they mainly represent only one theory. In the next section, we will discuss how the significant terms evolve, which terms survive and which die, when the AI scientist is exposed to more complex physical systems.
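
As a worked illustration of this degeneracy (an example added here, using the quadratic $S$ mentioned above and treating the one-dimensional Hessian blocks as scalars), the derivatives and a few of the final-layer terms evaluate to:

```latex
\begin{align*}
S &= x^2 + y^2 + xy, \qquad S_x = 2x + y, \qquad S_y = x + 2y,\\
S_{xx} &= 2, \qquad S_{yy} = 2, \qquad S_{xy} = 1,\\
S_{xx}\,x &= 2x, \qquad S_{xy}^{-1}S_{yy}\,x = 2x, \qquad S_{yy}^{-1}S_{xx}\,x = x,\\
S_{yy}^{-1}\,x &= \tfrac{1}{2}x, \qquad S_{yy}^{-1}S_{yy}^{-1}\,x = \tfrac{1}{4}x.
\end{align*}
```

Each of these terms is a constant multiple of $x$, so any choice of weights $w_i$ for which the weighted sum of these constants equals $-1$ reproduces $\dot{y} = -x$ for the simple harmonic oscillator.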

The main takeaways from this section are:

  1. A single AI scientist can very effectively learn a single simple system (Figure 3), and it learns to filter its theory as training progresses.

  2. The underlying theory resembles some familiar physical function (Figure 4).

  3. When given a large capacity, a single AI scientist tends to learn many seemingly separate theories (Figure 5(a)).

  4. However, many of these theories are strongly correlated (Figure 5(b)).

V.2 Multiple systems: Sparsification and diversification

“In the beginning of the twentieth century, it became apparent that the motion of the planet Mercury was not exactly right. This caused a lot of trouble and was not explained until it was shown by Einstein that Newton’s Laws were slightly off and that they had to be modified. ”
— Richard Feynman, The Character of Physical Law (1967) [36]

The simple harmonic oscillator is perhaps too simple for a machine learning model to fit: it just has to fit $-x$. We now extend our experiments to investigate what happens when an AI scientist that starts out by observing just a single system encounters more complex physical systems. Following the training paradigm in Section III, MASS learns a separate scalar function for each system, while sharing the same final layer. We aggregate the loss across all the systems in one step of training. The specific systems of interest here are:

  1. Simple harmonic oscillator

  2. Simple pendulum

  3. Kepler (Gravitational potential)

  4. Relativistic harmonic oscillator

Figure 6: MASS trained on increasingly complex systems. The dashed lines indicate the different phases of training. Starting from the simple harmonic oscillator, the system is exposed to the simple pendulum, gravitational potential and relativistic harmonic oscillator at the $10000^{th}$, $20000^{th}$ and $30000^{th}$ step respectively. Loss is aggregated over all systems MASS is exposed to at each step of training.

Figure 6 shows the training results as we introduce each system one after another at intervals of 10000 steps, in the aforementioned order, i.e. a single training phase lasts for 10000 steps. This order roughly reflects the increasing complexity of the systems as perceived by a human scientist. We observe that as more systems are introduced, existing theories either survive or falter, depending on the random seed controlling the initialization of the MASS network. For example, seed 80 fails at the simple pendulum, seed 96 fails at the gravitational potential, and seed 569 fails at the relativistic harmonic oscillator. This suggests that although they survived previous tasks, they probably discovered “wrong” theories that only overfit to previous tasks. It is also interesting to note that while some MASS scientists fail initially, they can start learning accurate representations when tasked with a more complex system. An intuition for this late start is that more complex systems impose a larger number of constraints on MASS, which helps its convergence.
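
The phased schedule can be sketched in a few lines. This is an illustrative sketch only: the names `SYSTEMS`, `active_systems` and `aggregated_loss` are hypothetical, steps are assumed to be counted from zero, and the per-system losses are placeholders for the MSE of each system.

```python
SYSTEMS = ["sho", "pendulum", "kepler", "relativistic_sho"]
PHASE_LENGTH = 10_000  # one new system is introduced every 10000 steps

def active_systems(step):
    """Systems the scientist has been exposed to at a given training step."""
    n_seen = min(step // PHASE_LENGTH + 1, len(SYSTEMS))
    return SYSTEMS[:n_seen]

def aggregated_loss(per_system_loss, step):
    """Sum of the per-system losses over all systems seen so far."""
    return sum(per_system_loss[name] for name in active_systems(step))

# Placeholder losses, only to show how the aggregate grows with the phase.
losses = {"sho": 0.1, "pendulum": 0.2, "kepler": 0.3, "relativistic_sho": 0.4}
loss_phase_2 = aggregated_loss(losses, step=15_000)  # sums the SHO and pendulum losses
```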

The aggregated behaviors across seeds for each system will be further discussed in Section V.3. In the remainder of this section, we analyze a single MASS scientist and its surviving terms. Specifically, we use seed 52.

As in Section V.1, we again analyze the activations, shown in Figure 7.

Figure 7: (Seed 52) Pairwise correlations learnt by a single AI scientist trained on increasingly complex systems, moving down. From the top, the rows correspond to activations at steps 10000, 20000, 30000 and 40000 respectively. Correlation maps are plotted after filtering for significant terms, which contribute the first $99\%$ of the total magnitude of the activation vector, and after hierarchical clustering. The number of significant terms, shown by the number of distinct squares, decreases as the number of systems increases. Note that many of these terms are strongly correlated (either positively or negatively).

In general, we make the following observations:

  1. As the number of systems increases, the number of distinct terms learned decreases.

  2. As the number of systems increases, the theories become more diverse.

The first result follows from Figure 7: the number of significant terms, counted by the number of squares in each correlation map, decreases from 20 to 6 for the SHO, from 12 to 7 for the pendulum and from 10 to 5 for the gravitational problem. This shows that fewer terms are needed to simultaneously explain all the systems than to explain just a smaller subset of them. The second result is observed from the increasing occurrence of non-correlated terms trending towards the bottom right of Figure 7. We also find that when tasked with explaining an ensemble of systems, MASS uses almost the same terms! To see this, note that the last row of Figure 7 comprises essentially the same 6 to 7 terms, used for explaining all 4 systems. These terms correspond to

$$
\begin{aligned}
&S_{yy}^{-1}x,\; S_{xx}S_{xx}x,\; S_{xx}^{-1}S_{xx}^{-1}x,\\
&S_{yy}S_{yy}x,\; S_{xy}S_{xx}^{-1}x,\\
&S_{xx}^{-1}S_{yy}x,\; S_{xx}^{-1}S_{xy}x,\\
&S_{yy}S_{xx}^{-1}x,\; S_{xx}^{-1}S_{yy}x.
\end{aligned}
$$

The large dependence on $x$ is hypothesized to survive from the initial explanation of the simple harmonic oscillator, in which MASS begins by learning terms that are constant multiples of $x$, and from these terms it develops a theory for the new systems. We verified in a separate set of experiments that permuting the order of the systems, so that more difficult systems come first, results in more emphasis on terms more closely related to $S_x$ and $y$.

We state succinctly the main conclusion from this section: as an AI scientist encounters more systems, the number of distinct terms decreases.

V.3 Multiple scientists: Mixture of theories

“For a brief period at the beginning of 1926, it looked as though there were, suddenly, two self-contained but quite distinct systems of explanation extant: matrix mechanics and wave mechanics. But Schrödinger himself soon demonstrated their complete equivalence.”
— Max Born, Nobel Prize Lecture (1954)

When multiple scientists work on the same problem independently, some arrive at theories that seem vastly different, but it later becomes obvious that they are two sides of the same coin (think of Newton’s and Leibniz’s descriptions of calculus). Such differences in theory, reconciled later, arise even more often in today’s advances in machine learning [37, 38, 39, 40]. In other instances, theories remain different from each other even though they agree with the same experimental results, very much like the Hamiltonian and Lagrangian scalar function descriptions of classical mechanics.

In this section, we investigate the relations between the theories learnt by different MASS scientists (which we will represent by different initial seeds) studying the same system.

The exact weights and values of each activation differ substantially between scientists. Depending on the initialization, which terms matter most changes drastically (refer to Figure 13 and more in Appendix B). While the magnitudes of the individual terms vary, the set of significant terms chosen by each scientist remains largely the same. We illustrate the relative magnitude of each activation term in Figure 8. Observe that there are clear vertical lines in each panel, indicating the terms on which it is possible to learn a theory describing the system.

Figure 8: Activation strengths in 50 MASS scientists studying various physical systems separately. Darkness represents stronger activations. The distinct vertical lines indicate the terms on which it is possible, under the MASS framework, to learn a theory of the underlying system.

Nonetheless, the large variations in activation magnitudes and weights indicate that while the theories learned by MASS all lie within the dark lines in Figure 8, it might very well be the case that each scientist learns something different. Examining the scalar functions $S$ learnt by individual AI scientists (refer to Figure 16 in Appendix C), it is difficult to tell the underlying similarities and differences. Are these AI scientists all learning something entirely different? We will now show that this is not the case.

Consider the activations on the final layer of MASS, which have shape $B \times T$ on a batch of $B$ samples when the final layer has $T$ terms. Specifically in our case, we have $B = 512$, $T = 172$. We conduct dimensionality reduction by PCA. It turns out that in the majority of seeds, the first principal component already explains more than 90% of the variance. Projecting onto this first principal component gives a $B \times 1$ set of activations, and we observe that these activation values are distributionally equivalent to a uniform distribution (see Figure 14).
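
A minimal sketch of this reduction, using an SVD-based PCA in NumPy, is shown below. The helper `first_pc` and the toy activation matrix (every term a multiple of $x$ plus small noise) are assumptions made for illustration; they are not taken from the released code.

```python
import numpy as np

def first_pc(acts):
    """Project a (B, T) activation matrix onto its first principal component.

    Returns the (B,) scores and the fraction of variance explained.
    """
    centered = acts - acts.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    explained = s[0] ** 2 / np.sum(s ** 2)
    return u[:, 0] * s[0], explained

rng = np.random.default_rng(0)
B, T = 512, 172
x = rng.uniform(-1, 1, size=B)
# toy activations: each of the T terms is a multiple of x, plus small noise
acts = np.outer(x, rng.normal(size=T)) + 0.01 * rng.normal(size=(B, T))

scores, explained = first_pc(acts)
```

When the terms are this strongly correlated, a single component carries almost all of the variance, and the scores inherit the distribution of the underlying input, up to sign and scale.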

(a) SHO
(b) Pendulum
(c) Gravitational
(d) Relativistic
Figure 9: Correlations of the first principal component in 50 MASS scientists studying various physical systems separately. The majority of correlations are high, with the exception of correlations close to $-1$, which represent a parity flip. 96.4%, 74.8%, 93.7% and 87.5% of seeds have their first PCA component explaining more than 80% of the variance for systems (a), (b), (c) and (d) respectively.

Such observations are corroborated across multi-scientist set-ups when run on the relativistic spring-mass system and the simple pendulum, as given in Figures 15(b) and 15(a) respectively.

Computing the correlations between the $B \times 1$ activations shows that each scientist is in fact strongly correlated with all others (see Figure 9). Note that correlations close to $-1$ denote parity flips, which are, surprisingly, only rarely learnt.
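
The cross-scientist comparison can be sketched as follows, under strong simplifying assumptions: each toy "scientist" is a random reweighting of terms proportional to a shared input, and `first_pc_scores` is a hypothetical helper. In this toy setting the sign of a principal component is arbitrary, so only the absolute correlation is meaningful.

```python
import numpy as np

def first_pc_scores(acts):
    """Scores of a (B, T) activation matrix along its first principal component."""
    centered = acts - acts.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, 0] * s[0]

rng = np.random.default_rng(1)
B, T, n_scientists = 512, 172, 5
x = rng.uniform(-1, 1, size=B)            # shared evaluation batch

pcs = []
for _ in range(n_scientists):
    coeffs = rng.normal(size=T)           # each toy scientist weights the terms differently
    acts = np.outer(x, coeffs) + 0.01 * rng.normal(size=(B, T))
    pcs.append(first_pc_scores(acts))

corr = np.corrcoef(np.stack(pcs))         # (n_scientists, n_scientists) correlation map
```

Entries of `corr` near $+1$ indicate agreement, and entries near $-1$ indicate the same component with the opposite sign.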

These results allow us to conclude that multiple scientists learn the same underlying theory when trained on the same physical system. In fact, this already gives the answer to our very first research question: two AI scientists do agree!

V.4 Exploring the unknown: Lagrangian is all you need

“I think that the prize is recognizing, in part, the fact that understanding the deep problems of things like mind is not going to come forth in some simple way like Newtonian physics. ”
— John Hopfield, Nobel Prize Interview (2024)

Figure 10: Average number of significant terms and number of correct scientists as we increase the number of systems. Starting from the SHO, we include the pendulum, gravitational and relativistic harmonic problems at systems 2, 3, and 4 respectively, followed by two synthetic potentials (see Table 1). The solid blue line (% Correct) gives the percentage of seeds for which the converged loss after the $n$-th training phase is less than $5 \times 10^{-3}$. The dashed blue line gives the percentage for which the converged loss is less than $5 \times 10^{-3}$ for all training phases up to and including the $n$-th one, i.e. MASS scientists that have always been correct. Results are parallelized over 1000 training seeds.

In the remainder of this section, we extend the analysis to a fully general case: multiple MASS scientists trained on multiple physical systems. Again, we train in the manner of Section IV, continuously exposing MASS to increasingly difficult systems and summing the errors across the systems.

Simultaneously, we present an extension of our MASS framework to unseen physical systems. Thus far we have been replicating the results of known problems: the simple harmonic oscillator, the simple pendulum, the gravitational potential and the relativistic oscillator. The original motivation for training MASS on these systems was that they were already well-studied, giving us a decent baseline to benchmark the performance of MASS against. However, a natural next question in the direction of scientific progress is what happens when we extend our current framework to systems yet to be discovered. Moreover, these four canonical systems lie well within the capabilities of MASS. The learnt theories are not very diverse (see Figure 9) and some terms in the final layer are almost never used (see Figure 8). Theoretically, this can be attributed to the fact that one-dimensional systems yield scalar functions that typically do not involve the cross-terms $S_{xy}$. For example, even the most complex system that MASS has been exposed to, the relativistic harmonic oscillator, can be expressed with a scalar function following that of a Lagrangian

$$
S = \mathcal{L} = \sqrt{1 - y^{2}} - \frac{1}{2}x^{2}
$$

for which $S_{xy} = 0$.

To extend our studies to unseen physical systems and also fully utilize the capacity of the MASS network, we introduce synthetic systems. We list the modifications in Table 1 by describing the kinetic energy $T$ and potential energy $V$ of each system. In particular, we introduce two additional synthetic systems which serve as extensions of the relativistic harmonic oscillator with a more complex potential energy term.

Table 1: Summary of the six 1D systems used in this work. For each system, we show the usual kinetic energy $T(x,y)$ and potential energy $V(x,y)$. The total energy is given by $T + V$. In this paper’s convention, $\dot{x} = y$. The synthetic systems $\alpha, \beta$ are designed such that their first-order Taylor expansions match the relativistic harmonic oscillator, up to addition and scaling by a constant. Note that in the relativistic cases (systems 4 to 6), the Lagrangian is not simply $T - V$; rather, the kinetic energy term appears as $\gamma^{-1}$.

| System | $T(x,y)$ | $V(x,y)$ |
|---|---|---|
| (1) Classical | $\tfrac{1}{2}\,y^{2}$ | $\tfrac{1}{2}\,x^{2}$ |
| (2) Pendulum | $\tfrac{1}{2}\,y^{2}$ | $1 - \cos(x)$ |
| (3) Kepler | $\tfrac{1}{2}\,y^{2}$ | $-\dfrac{1}{\lvert x \rvert}$ |
| (4) Relativistic | $\gamma$ | $\tfrac{1}{2}\,x^{2}\,\bigl(1 - y^{2}\bigr)^{3/2}$ |
| (5) $\alpha$ (Synthetic) | $\gamma$ | $\tfrac{1}{2}x^{2} \cdot \tfrac{1}{\gamma}$ |
| (6) $\beta$ (Synthetic) | $\gamma$ | $\tfrac{1}{2}x^{2} \cdot \cos(y)$ |
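
For readers who want to experiment with these systems, the entries of Table 1 can be written down as plain functions. This is a sketch under stated assumptions: $\gamma$ is taken to be the Lorentz factor $1/\sqrt{1 - y^{2}}$ with $c = 1$ (it is not defined in this section), and the names `SYSTEMS`, `gamma` and `total_energy` are hypothetical.

```python
import math

def gamma(y):
    """Lorentz factor with c = 1 (assumed definition; requires |y| < 1)."""
    return 1.0 / math.sqrt(1.0 - y * y)

# name -> (kinetic energy T(x, y), potential energy V(x, y)), following Table 1
SYSTEMS = {
    "classical":    (lambda x, y: 0.5 * y**2, lambda x, y: 0.5 * x**2),
    "pendulum":     (lambda x, y: 0.5 * y**2, lambda x, y: 1.0 - math.cos(x)),
    "kepler":       (lambda x, y: 0.5 * y**2, lambda x, y: -1.0 / abs(x)),
    "relativistic": (lambda x, y: gamma(y),   lambda x, y: 0.5 * x**2 * (1.0 - y**2) ** 1.5),
    "alpha":        (lambda x, y: gamma(y),   lambda x, y: 0.5 * x**2 / gamma(y)),
    "beta":         (lambda x, y: gamma(y),   lambda x, y: 0.5 * x**2 * math.cos(y)),
}

def total_energy(name, x, y):
    """Total energy T + V of the named system at phase-space point (x, y)."""
    T, V = SYSTEMS[name]
    return T(x, y) + V(x, y)
```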

Our key results are presented in Figure 10, where we count the number of correct MASS scientists, defined as the number of seeds where the evaluation loss on the converged model, computed as the maximum MSE across all seen physical systems, is less than $5 \times 10^{-3}$. We also count the number of significant terms, defined as the number of terms in the final layer (out of $T = 172$ terms) needed to reach 95% of the total norm. These values are aggregated at the end of each training phase. Recall that in a single training phase, a MASS scientist is exposed to a new system and trained on the sum of the losses. Typically, a phase lasts for 10000 steps.

As we increase the number of systems, the number of MASS scientists that have been consistently correct decreases (dashed blue line of Figure 10), where to be consistently correct at phase $n$ is to have a low converged loss for all phases up to and including the $n$-th phase. This is intuitive, since the consistently correct MASS scientists at the end of the $n$-th phase are always a subset of those at the end of phase $n-1$. What is less intuitive is the solid blue line: the number of correct scientists can increase with the number of systems. This is analogous to seed 506 in Figure 6, where a MASS scientist can fail at a less complex system but, when exposed to more systems, learns the overarching underlying theory and succeeds. Such revivals of scientist networks highlight the importance of augmenting physical neural networks with more difficult tasks in order for them to work on simpler ones.
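
The two curves can be computed from a table of converged losses as sketched below. The array `max_mse` is made-up toy data (four seeds, three phases), and `correctness_curves` is a hypothetical helper; only the threshold $5 \times 10^{-3}$ comes from the text.

```python
import numpy as np

THRESHOLD = 5e-3

def correctness_curves(max_mse):
    """max_mse: (n_seeds, n_phases), the maximum MSE over seen systems at the end of each phase.

    Returns (% correct, % consistently correct) for each phase.
    """
    correct = max_mse < THRESHOLD                             # solid line: correct at phase n
    always = np.logical_and.accumulate(correct, axis=1)       # dashed line: correct at all phases <= n
    return 100 * correct.mean(axis=0), 100 * always.mean(axis=0)

max_mse = np.array([
    [1e-4, 1e-4, 1e-4],   # correct throughout
    [1e-1, 1e-4, 1e-4],   # fails at first, revives once a second system is added
    [1e-1, 1e-4, 1e-1],   # revives, then fails again
    [1e-4, 1e-4, 1e-1],   # correct until the third system
])
pct_correct, pct_always = correctness_curves(max_mse)
```

In this toy table the consistently correct fraction can only shrink, while the plain correct fraction rises from the first to the second phase because two seeds revive.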

The number of significant terms also shows a consistently decreasing trend. This cements the results of Figure 9 but is still surprising! To describe each system independently, the MASS scientist relies on rather different sets of weights, as in Figure 8. Rather than learning separate terms to describe separate systems, i.e. learning the union of the terms for each theory, MASS instead learns the intersection of the terms, exemplifying the purpose of the shared final layer.

After training on 6 systems, the number of significant terms is still more than 6. A six-term theory is neat, but nowhere near the simplicity of Equations 1 and 2. In the remainder of this section, we show that we can easily distill the underlying theory, and that this underlying theory is in fact a Lagrangian.

(a) Fraction of each theory
(b) $R^{2}$ of linear fitting
Figure 11: MASS switches from learning a Hamiltonian theory to a Lagrangian one. (a) The fraction of MASS scientists that learn $c_1$ and $c_2$ to be of opposite signs (Lagrangian) vs the same sign (Hamiltonian). (b) $R^{2}$ score of linear fitting of activations to those derived from the Lagrangian vs Hamiltonian potential. The error bars show the standard deviation of the $R^{2}$ score.

V.4.1 Simple Problems: Hamiltonian is all you need?

Recall that in the Hamiltonian formulation, we learn $S$ to be $\mathcal{H} = T + V$, where $\mathcal{H}$ is the Hamiltonian, $T$ is the kinetic energy and $V$ is the potential energy. In the Lagrangian formulation, we learn $S$ to be $\mathcal{L} = T - V$. The sign flip here is crucial.

Given data coordinates $x_i, y_i$ and weights $\theta$ of the MASS scientist, we can compute the scalar function $S(x_i, y_i)$. We can also pre-compute the kinetic and potential energy terms $T_i$ and $V_i$, then linearly fit $S$ with $c_1 T + c_2 V$. We say that a MASS scientist has learnt a Lagrangian theory if $c_1$ and $c_2$ have opposite signs, and a Hamiltonian theory if $c_1$ and $c_2$ have the same sign.
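
A minimal sketch of this sign test is given below, using ordinary least squares. The helper `classify_theory` is hypothetical, the constant offset in the fit is an added assumption (a scalar function is only defined up to an additive constant), and the two toy "learned" functions are hand-made rather than outputs of MASS.

```python
import numpy as np

def classify_theory(S, T, V):
    """Fit S ~ c1*T + c2*V + c0 by least squares and label by the signs of c1 and c2."""
    A = np.stack([T, V, np.ones_like(T)], axis=1)
    coef, *_ = np.linalg.lstsq(A, S, rcond=None)
    c1, c2 = coef[0], coef[1]
    label = "Lagrangian" if c1 * c2 < 0 else "Hamiltonian"
    return label, c1, c2

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=512)
y = rng.uniform(-1, 1, size=512)
T, V = 0.5 * y**2, 0.5 * x**2          # simple harmonic oscillator, system (1) of Table 1

S_lagrangian_like = 3.0 * (T - V) + 0.7   # toy scalar function, rescaled and shifted
S_hamiltonian_like = 2.0 * (T + V)

label_l, c1_l, c2_l = classify_theory(S_lagrangian_like, T, V)
label_h, c1_h, c2_h = classify_theory(S_hamiltonian_like, T, V)
```

Because only the relative sign of $c_1$ and $c_2$ is used, the label is insensitive to an overall rescaling of $S$.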

Alongside the discrete counting method described above, we can also directly fit a batch of activations against the Lagrangian and Hamiltonian activations, which we can compute from an analytic expression. This fit should not be expected to be perfect, since MASS can learn a simple variant of a clean theory and still give similar accuracy. For example, learning $S$ and $S + x$ may end up being effectively the same, since first derivative terms change by a constant while second derivative terms are unchanged. Despite the imperfections of linear fitting, the aggregate trend of the mean $R^{2}$ across many samples can tell us a bit about the relation to each theory.
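
The $R^{2}$ score of such a linear fit can be computed as in the following sketch. The helper `r2_linear_fit` and both toy signals are assumptions for illustration; in practice the two inputs would be a learned activation and its analytic counterpart evaluated on the same batch.

```python
import numpy as np

def r2_linear_fit(pred, target):
    """R^2 of the best affine fit target ~ a * pred + b."""
    A = np.stack([pred, np.ones_like(pred)], axis=1)
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    return 1.0 - resid.var() / target.var()

rng = np.random.default_rng(0)
analytic = rng.normal(size=512)                               # stand-in for an analytic activation
learned_close = 1.7 * analytic + 0.3 + 0.01 * rng.normal(size=512)
learned_unrelated = rng.normal(size=512)

r2_close = r2_linear_fit(learned_close, analytic)
r2_unrelated = r2_linear_fit(learned_unrelated, analytic)
```

An affine variant of the analytic signal scores close to 1, which is why a shifted or rescaled theory still registers as a match, while an unrelated signal scores close to 0.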

Figure 11 summarizes these results and shows the evolution of theories learned across a number of systems. When trained on just the simple harmonic oscillator or the pendulum, MASS learns an almost complete Hamiltonian description (with more than 90% of the scientists agreeing). In this simple setting, there exists some sparse choice among the $T = 172$ derivative terms that, under Hamilton’s equations (Equation 1), gives low loss, and MASS tends towards this. The learned scalar functions themselves also display strong correlation.

V.4.2 Complex Problems: Lagrangian is all you need

The story changes when we extend beyond the simple pendulum to more complex problems. On these systems (3 to 6 in Table 1), MASS switches to a Lagrangian theory. One reason for this, as discussed in [24], is that the Lagrangian can be applied directly in generalized coordinates while the Hamiltonian requires canonical coordinates. As our data is presented in generalized coordinates, the MASS architecture supports calculations done in this coordinate system, following those of the Lagrangian formulation. What is surprising here is that the correlation with the Lagrangian scalar function itself also increases, suggesting that on an aggregate level, AI scientists tend towards this singular family of descriptions of physical systems: the Lagrangian description!

The results of Figure 11 show a bias toward the Lagrangian formulation, but they do not prove that the calculations faithfully follow those of the Lagrangian. Nor should we expect them to: given the capacity of MASS, why would it follow some “nice” theory? But it turns out that it almost exactly does! We show this with a constrained optimization.

In the Lagrangian formulation, the prediction of $\dot{y}$ is given by [24]

$$\dot{y} = S_{yy}^{-1}\,(S_{x} - S_{xy}\,y).$$

The activations in the final layer of MASS will hence be concentrated on the terms $S_{yy}^{-1}S_{x}$ and $S_{yy}^{-1}S_{xy}y$. However, the multi-expressivity of our network allows many other terms to be linearly related to these two.
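As a quick sanity check of the formula above, consider a worked example under assumptions that are ours rather than the paper's: a harmonic oscillator with unit mass and unit spring constant, with $x$ the position and $y$ the velocity, so that a Lagrangian-type scalar function is $S = \tfrac{1}{2}y^{2} - \tfrac{1}{2}x^{2}$.

```latex
\begin{aligned}
S(x,y) &= \tfrac{1}{2}y^{2} - \tfrac{1}{2}x^{2}
  && \text{(assumed: unit mass and spring constant)}\\
S_{x} &= -x, \qquad S_{yy} = 1, \qquad S_{xy} = 0,\\
\dot{y} &= S_{yy}^{-1}\,(S_{x} - S_{xy}\,y) = -x .
\end{aligned}
```

In this example the cross term vanishes, so only the first of the two terms, $S_{yy}^{-1}S_{x}$, contributes to the predicted acceleration.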

Table 2: Constrained optimization on the objective of Equation 7. The goal is to reduce the activations for predicting $\dot{y}$ to either one or two terms. High $R^{2}$ values for the Lagrangian indicate that the learned network recovers the same functional dependence as the analytical Lagrangian, just embedded in higher dimensions.
System                       $R^{2}_{\mathcal{L}}$    $R^{2}_{\mathcal{H}}$
(1D) Relativistic            0.9999                   0.9995
(1D) $\alpha$ (Synthetic)    0.9835                   0.8205
(1D) $\beta$ (Synthetic)     0.9306                   0.8734
(2D) Double pendulum         0.9712                   0.7317

Given data coordinates $x_i, y_i$ and the weights $\theta$ of the MASS scientist $\mathcal{M}$, we can compute the scalar function $S(x_i, y_i)$ and from it obtain the two terms representative of the Lagrangian theory. We call these $u_i = S_{yy}^{-1}S_{x}$ and $v_i = S_{yy}^{-1}S_{xy}y$; both can be computed easily with Jacobian-vector products (JVP). We can also obtain the activations $\mathbf{a}_i$ of the final layer with a forward pass of $(x_i, y_i)$ through $\mathcal{M}_{\theta}$. We then solve the constrained optimization problem

$$
\begin{aligned}
\min\quad & \mathbb{E}\left[(\hat{u}_i - u_i)^{2} + (\hat{v}_i - v_i)^{2}\right] && (7)\\
\text{s.t.}\quad & \hat{u}_i = \sum_{j=1}^{T} c_j a_{ij} && (8)\\
& \hat{v}_i = \sum_{j=1}^{T} d_j a_{ij} && (9)\\
& c_j + d_j = 1 \quad \forall j \in [1, T] && (10)
\end{aligned}
$$

where $(\hat{u}_i, \hat{v}_i)$ is a transformation from the 172-term MASS activation space to the 2-term Lagrangian activation space, and constraint (10) restricts the transformation to one that exactly uses all the weights in the activations $\mathbf{a}_i$ (each activation is fully split between $\hat{u}_i$ and $\hat{v}_i$), i.e. no cheating by avoiding some activations completely and overusing others. A technical note: in the first four systems of Table 1, we always get the trivial solution $\hat{v}_i = 0$, $c_j = 1$, since $v_i = 0$ (due to the cross term $S_{xy} = 0$). Of interest is what happens in the synthetic systems, where the cross terms are nonzero and MASS is forced to learn something non-trivial in both $u$ and $v$.
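A minimal sketch of how the problem in Equations 7 to 10 could be solved numerically, assuming the expectation is replaced by a sample mean and the activations are stacked into a matrix `A` of shape `(n, T)`. Substituting $d_j = 1 - c_j$ removes the constraint and leaves an ordinary least-squares problem in $c$. The function names, the NumPy implementation and the synthetic data are illustrative assumptions, not the authors' code.

```python
import numpy as np

def fit_two_term_transform(A, u, v):
    """Least-squares solution of Eq. 7 under c_j + d_j = 1 (hypothetical helper).

    A: (n, T) final-layer activations; u, v: (n,) target Lagrangian terms.
    """
    ones = np.ones(A.shape[1])
    # With d = 1 - c:  A d - v = (A 1 - v) - A c, so both residuals are linear in c.
    stacked_A = np.vstack([A, A])
    stacked_rhs = np.concatenate([u, A @ ones - v])
    c, *_ = np.linalg.lstsq(stacked_A, stacked_rhs, rcond=None)
    return c, 1.0 - c

def r2_score(target, pred):
    """Coefficient of determination; negative when worse than predicting the mean."""
    ss_res = np.sum((target - pred) ** 2)
    ss_tot = np.sum((target - np.mean(target)) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in data (not MASS activations): targets built from a known c.
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 6))
c_true = rng.uniform(size=6)
u, v = A @ c_true, A @ (1.0 - c_true)

c, d = fit_two_term_transform(A, u, v)
r2_u, r2_v = r2_score(u, A @ c), r2_score(v, A @ d)
```

On real activations the fit would be evaluated on a held-out set, where a poor choice of terms can produce a negative $R^{2}$.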

We average the $R^{2}$ scores of this constrained fit across the correct scientists and summarize the results in Table 2. For comparison, the Hamiltonian fit in Table 2 uses the single term $-S_{x}$. Consistent with previous observations, MASS can almost be directly transformed into a Lagrangian theory, with $R^{2}$ values above 0.9. If we instead pick any two random terms from the $T=172$ available terms, or even the two terms with the highest activation magnitudes, the constrained optimization typically fails, observed as a negative $R^{2}$ score on the holdout test set.

Such strong correlations to the Lagrangian raise a broader question: can we find a third description of classical mechanics? At least with MASS working in the rich theory space of $T=172$ terms, the answer appears to be no! The Lagrangian is all you need.

V.5 Extensions to high dimensions

Figure 12: Trajectories of the double pendulum solved by MASS to an MSE of $5\times 10^{-3}$. The ODE is solved with RK4 integration with a time step of $dt=0.05$.

While the previous sections mainly dealt with one-dimensional problems, i.e. $x\in\mathbb{R}$, most physical problems in nature are higher dimensional. In this section, we study one classic example: the chaotic double pendulum. The two degrees of freedom are the angles of the two pendulums. Our results show that MASS can be effectively extended to higher dimensions.

Following the same training scheme as in Section IV, we reproduce the analytically correct trajectory of a double pendulum in Figure 12, calling the MASS solver for $\dot{\mathbf{y}}$ at each step and using RK4 integration [41, 42].

Not only do we achieve rather accurate predictions of the angles, but the energy discrepancy is also only 0.4% of the total energy per 100 steps. This is comparable to the results of the Lagrangian neural network [24]. Even without building the Lagrangian and the Euler-Lagrange equations directly into the architecture to enforce energy conservation, MASS learns to reproduce it.
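The rollout procedure described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `accel_fn` is a hypothetical stand-in for the trained MASS solver that returns $\dot{\mathbf{y}}$ from $(\mathbf{x}, \mathbf{y})$, and we substitute an analytic harmonic oscillator so that the snippet runs on its own. Only the RK4 scheme and the time step $dt = 0.05$ come from the text.

```python
import numpy as np

def rk4_step(f, state, dt):
    """One classical fourth-order Runge-Kutta step for d(state)/dt = f(state)."""
    k1 = f(state)
    k2 = f(state + 0.5 * dt * k1)
    k3 = f(state + 0.5 * dt * k2)
    k4 = f(state + dt * k3)
    return state + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def rollout(accel_fn, x0, y0, dt=0.05, n_steps=100):
    """Integrate x' = y, y' = accel_fn(x, y), querying accel_fn at every RK4 stage."""
    d = len(x0)

    def f(state):
        x, y = state[:d], state[d:]
        return np.concatenate([y, accel_fn(x, y)])

    state = np.concatenate([x0, y0]).astype(float)
    trajectory = [state]
    for _ in range(n_steps):
        state = rk4_step(f, state, dt)
        trajectory.append(state)
    return np.array(trajectory)

def sho_accel(x, y):
    # Stand-in for the learned model (assumption): unit harmonic oscillator.
    return -x

traj = rollout(sho_accel, np.array([1.0]), np.array([0.0]), dt=0.05, n_steps=100)
energy = 0.5 * (traj[:, 0] ** 2 + traj[:, 1] ** 2)
rel_drift = abs(energy[-1] - energy[0]) / energy[0]  # relative energy drift over 100 steps
```

The same `rollout` applies to a two-dimensional system such as the double pendulum by passing length-2 arrays for `x0` and `y0` and a learned acceleration function.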

We also observe, consistent with our expectations, that the learnt theories resemble a Lagrangian, with the results further included in Table 2.

We present more solution trajectories for the spherical pendulum and the multi-body gravitational problem in Appendix D. We are not claiming that MASS is a state-of-the-art method for solving physical systems, especially since tuning MASS for efficiency and accuracy on higher-dimensional problems is outside the scope of this project. In fact, computing the Hessian matrix and its inverse incurs an $O(d^{2})$ dependence on the dimensionality $d$ of the problem, so directly applying the current solver to problems of extremely high dimensionality would be very expensive. Nonetheless, the applicability of MASS to the double pendulum problem at a sufficient level of accuracy demonstrates its potential for future exploration, and drives home the spirit of this paper: to build AI scientists that are simple and interpretable, yet generally applicable to complex physical systems.

VI Discussions

So do two AI scientists agree? The short answer is yes. But it comes with some caveats.

Looking back, we question the relationship between our results in Figure 9 and those in Figure 11. In the former, we observe a strong correlation between the theories described by each MASS scientist. In Figure 11, by contrast, we see an indication that scientists can learn different theories. In combination, this suggests that a number of theories lie on the boundary between a “Hamiltonian”-like contour and a “Lagrangian”-like contour. We did not perform rigorous symbolic regression on the learned scalar functions. Given the vast number of terms that can be learned by MASS, we believe the results of Figure 11 and Table 2 tell a much richer story about the underlying theories. We performed a thorough numerical analysis of the trained systems, counting the number of Hamiltonian and Lagrangian theories and measuring correlations of the activations, to conclude that the Lagrangian theory generalizes.

To answer the original research question, we chose to use different seeds as proxies for different AI scientists. Although the seed only affects the initialization of the MASS network, we already see strikingly different training behavior (Figure 6). Our initial experiments on varying model width and depth suggest that the extent of agreement increases with model capacity. Preliminary tests of other architectures, using convolutions and attention instead of pure feed-forward layers, proved much less stable in training, primarily due to the low dimensionality of our data.

Looking ahead, while most of our experiments were conducted on one-dimensional physical systems because of the extensive parallelization over many seeds and systems, preliminary results (Figure 12) show that they can be readily extended to higher dimensions.

MASS offers a tradeoff between inductive bias (through including physical priors such as the Euler-Lagrange equations) and training efficiency. When calculating many of the $T=172$ terms, particularly the Hessian matrix inverses, training slows and becomes unstable, which we resolved only with strong regularization and initialization techniques. Nonetheless, these additional terms should not be seen as irrelevant. One should not expect the Euler-Lagrange equations to be the end-all for physics-based machine learning, and certainly not for physics itself.

VII Future Work

There are several low-hanging fruits to extend our work in this paper. We list several below:

  1. Coordinate choice. Our experiments are done in generalized coordinates, which prevent the Hamiltonian expression from achieving low loss, while the Lagrangian remains a perfect description. Hence, results such as Figure 11 are not extremely surprising. But what happens if we allow MASS to work in arbitrary coordinates? This can be done by allowing MASS to learn a coordinate transform (through a simple MLP) and then take derivatives in the transformed coordinates [43]. In these coordinates, will MASS still prefer the Lagrangian expression?

  2. Loss function. We can modify the loss function to encourage the learning or unlearning of specific theories. In particular, the measure of Hamiltonicity [44] quantifies how “Hamiltonian”-like a theory is. How does including this as a loss term bias MASS into learning different theories?

  3. Model architecture. We varied AI scientists only through their random initializations. What happens if we modify the architecture completely? Will AI scientists still agree?

  4. High dimensions. We show results up to six dimensions in Figure 19. But many physical problems are even higher dimensional. How do we efficiently extend our model to solve those problems?

VIII Conclusion

In this paper, we have developed a novel architecture and training scheme, MASS, and rigorously investigated the evolution of theories studied by MASS across multiple physical systems. Through our experiments, we show that AI scientists, when modeled as high-capacity neural networks, often learn multiple equivalent expressions of the same theory. As we expose our AI scientists to new and more complex systems, some of these theories prove inconsistent with the previously unseen systems, while others successfully generalize to more difficult problems. Even among the surviving theories, the underlying theory changes as the number of systems increases, from resembling a Hamiltonian to resembling a Lagrangian.

We hope that MASS will not just be an interesting story of Hamiltonian vs. Lagrangian, but will also lay the groundwork for building models that are more interpretable and capable. Then, we will revisit the question: do two AI scientists agree?

Acknowledgement Z.L. and M.T. are supported by IAIFI through NSF grant PHY-2019786. Z.L. is supported by the Google PhD Fellowship.

References

  • Kuhn [1970] T. Kuhn, The Structure of Scientific Revolutions, 2nd ed. (University of Chicago Press, 1970).
  • Baldi et al. [2014] P. Baldi, P. Sadowski, and D. Whiteson, Searching for exotic particles in high-energy physics with deep learning, Nature Communications 5, 4308 (2014).
  • Ball and Brunner [2010] N. M. Ball and R. J. Brunner, Data mining and machine learning in astronomy, International Journal of Modern Physics D 19, 1049 (2010).
  • Ramprasad et al. [2017] R. Ramprasad, R. Batra, G. Pilania, A. Mannodi-Kanakkithodi, and C. Kim, Machine learning in materials informatics: recent applications and prospects, npj Computational Materials 3, 1 (2017).
  • Pfau et al. [2020] D. Pfau, J. S. Spencer, A. G. D. G. Matthews, and W. M. C. Foulkes, Ab initio solution of the many-electron schrödinger equation with deep neural networks, Phys. Rev. Res. 2, 033429 (2020).
  • Schmidt and Lipson [2009] M. Schmidt and H. Lipson, Distilling free-form natural laws from experimental data, Science 324, 81 (2009).
  • Cranmer et al. [2020] M. Cranmer et al., Discovering symbolic models from deep learning with inductive biases, in Advances in Neural Information Processing Systems (NeurIPS), Vol. 33 (2020) pp. 17429–17442.
  • Jumper et al. [2021] J. Jumper et al., Highly accurate protein structure prediction with AlphaFold, Nature 596, 583 (2021).
  • Vaswani et al. [2023] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, Attention is all you need (2023), arXiv:1706.03762 [cs.CL] .
  • Devlin et al. [2019] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding (2019), arXiv:1810.04805 [cs.CL] .
  • Brown et al. [2020] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, Language models are few-shot learners (2020), arXiv:2005.14165 [cs.CL] .
  • Touvron et al. [2023] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, Llama: Open and efficient foundation language models (2023), arXiv:2302.13971 [cs.CL] .
  • Jiang et al. [2024] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. de las Casas, E. B. Hanna, F. Bressand, G. Lengyel, G. Bour, G. Lample, L. R. Lavaud, L. Saulnier, M.-A. Lachaux, P. Stock, S. Subramanian, S. Yang, S. Antoniak, T. L. Scao, T. Gervet, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed, Mixtral of experts (2024), arXiv:2401.04088 [cs.LG] .
  • Gemma-Team [2024] Gemma-Team, Gemma: Open models based on gemini research and technology (2024), arXiv:2403.08295 [cs.CL] .
  • DeepSeek-AI [2025] DeepSeek-AI, Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning (2025), arXiv:2501.12948 [cs.CL] .
  • Lu et al. [2024] C. Lu, C. Lu, R. T. Lange, J. Foerster, J. Clune, and D. Ha, The ai scientist: Towards fully automated open-ended scientific discovery (2024), arXiv:2408.06292 [cs.AI] .
  • LeCun et al. [2015] Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature 521, 436 (2015).
  • Carleo et al. [2019] G. Carleo et al., Machine learning and the physical sciences, Reviews of Modern Physics 91, 045002 (2019).
  • Rudin [2019] C. Rudin, Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead, Nature Machine Intelligence 1, 206 (2019).
  • Koza [1994] J. R. Koza, Genetic programming as a means for programming computers by natural selection, Statistics and Computing 4, 87 (1994).
  • Brunton et al. [2016] S. L. Brunton, J. L. Proctor, and J. N. Kutz, Discovering governing equations from data by sparse identification of nonlinear dynamical systems, Proceedings of the National Academy of Sciences 113, 3932–3937 (2016).
  • Udrescu and Tegmark [2020] S.-M. Udrescu and M. Tegmark, Ai feynman: a physics-inspired method for symbolic regression (2020), arXiv:1905.11481 [physics.comp-ph] .
  • Greydanus et al. [2019] S. Greydanus, M. Dzamba, and J. Yosinski, Hamiltonian neural networks (2019), arXiv:1906.01563 [cs.NE] .
  • Cranmer et al. [2020] M. Cranmer, S. Greydanus, S. Hoyer, P. Battaglia, D. Spergel, and S. Ho, Lagrangian neural networks (2020), arXiv:2003.04630 [cs.LG] .
  • Xiao et al. [2024] S. Xiao, J. Zhang, and Y. Tang, Generalized lagrangian neural networks (2024), arXiv:2401.03728 [math.DS] .
  • Finzi et al. [2020] M. Finzi, K. A. Wang, and A. G. Wilson, Simplifying hamiltonian and lagrangian neural networks via explicit constraints (2020), arXiv:2010.13581 [cs.LG] .
  • Bhattoo et al. [2022] R. Bhattoo, S. Ranu, and N. M. A. Krishnan, Learning articulated rigid body dynamics with lagrangian graph neural network (2022), arXiv:2209.11588 [cs.LG] .
  • Bhattoo et al. [2023] R. Bhattoo, S. Ranu, and N. M. A. Krishnan, Learning the dynamics of particle-based systems with lagrangian graph neural networks, Machine Learning: Science and Technology 4, 015003 (2023).
  • Allen-Blanchette et al. [2020] C. Allen-Blanchette, S. Veer, A. Majumdar, and N. E. Leonard, Lagnetvip: A lagrangian neural network for video prediction (2020), arXiv:2010.12932 [cs.LG] .
  • Toth et al. [2020] P. Toth, D. J. Rezende, A. Jaegle, S. Racanière, A. Botev, and I. Higgins, Hamiltonian generative networks (2020), arXiv:1909.13789 [cs.LG] .
  • Loshchilov and Hutter [2019] I. Loshchilov and F. Hutter, Decoupled weight decay regularization (2019), arXiv:1711.05101 [cs.LG] .
  • Loshchilov and Hutter [2017] I. Loshchilov and F. Hutter, Sgdr: Stochastic gradient descent with warm restarts (2017), arXiv:1608.03983 [cs.LG] .
  • Zhang et al. [2017] C. Zhang, S. Bengio, Y. Singer, and Y. LeCun, Rethinking generalization in deep learning, in Proceedings of the 34th International Conference on Machine Learning, Vol. 70 (2017).
  • Allen-Zhu et al. [2019] Z. Allen-Zhu, Y. Li, and Y. Liang, Learning and generalization in overparameterized neural networks, going beyond two layers, Advances in Neural Information Processing Systems  (2019).
  • Sneath and Sokal [1973] P. H. A. Sneath and R. R. Sokal, Numerical Taxonomy: The Principles and Practice of Numerical Classification (W. H. Freeman, San Francisco, 1973).
  • Feynman [1967] R. P. Feynman, The Character of Physical Law (MIT Press, Cambridge, MA, 1967).
  • Sohl-Dickstein et al. [2015] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, Deep unsupervised learning using nonequilibrium thermodynamics, in International Conference on Machine Learning (ICML) (2015).
  • Du and Mordatch [2019] Y. Du and I. Mordatch, Implicit generation and generalization in energy-based models, in Advances in Neural Information Processing Systems (NeurIPS) (2019).
  • Song and Ermon [2019] Y. Song and S. Ermon, Generative modeling by estimating gradients of the data distribution, in Advances in Neural Information Processing Systems (NeurIPS) (2019).
  • Song et al. [2021] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, Score-based generative modeling through stochastic differential equations, in International Conference on Learning Representations (ICLR) (2021).
  • Runge [1895] C. Runge, Über die numerische auflösung von differentialgleichungen, Mathematische Annalen 46, 167 (1895).
  • Kutta [1901] W. Kutta, Beitrag zur näherungsweisen integration totaler differentialgleichungen, Zeitschrift für Mathematik und Physik 46, 435 (1901).
  • Chen et al. [2021] Y. Chen, T. Matsubara, and T. Yaguchi, Neural symplectic form: Learning hamiltonian equations on general coordinate systems, in Advances in Neural Information Processing Systems, Vol. 34, edited by M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan (Curran Associates, Inc., 2021) pp. 16659–16670.
  • Liu and Tegmark [2022] Z. Liu and M. Tegmark, Machine learning hidden symmetries, Physical Review Letters 128, 10.1103/physrevlett.128.180201 (2022).
  • He et al. [2015] K. He, X. Zhang, S. Ren, and J. Sun, Delving deep into rectifiers: Surpassing human-level performance on imagenet classification (2015), arXiv:1502.01852 [cs.CV] .
  • Glorot and Bengio [2010] X. Glorot and Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, Vol. 9, edited by Y. W. Teh and M. Titterington (PMLR, Chia Laguna Resort, Sardinia, Italy, 2010) pp. 249–256.

Appendix A Training methods

Table 3: Hyperparameter settings for training MASS.
Parameter              Value
MLP hidden layers      4
MLP width              20
Batch size             512
Steps (per phase)      10000
Linear warmup steps    100
Learning rate          $5\times 10^{-4}$
Weight decay           0.01
$\beta_{1}$            0.7
$\beta_{2}$            0.8
EMA                    0.99
$\lambda_{b}$          0.5
$\lambda_{1}$          0.1
$\lambda_{2}$          0.01

As stated in Section IV, training MASS is extremely unstable. The “ground truth” scalar potentials in many of the systems we study (Table 1) induce Hessian matrices that are identically zero, making their inverses hard to compute. To remedy this, we introduce a regularization parameter $b_j$ for each $S_j$, computing $\mathrm{pinv}(\mathbf{A}+b)$ instead of $\mathrm{inv}(\mathbf{A})$, and minimize the loss with an additional $\lambda_b \|b\|_1$ term for each system. We consistently use $\lambda_b = 0.5$ in our experiments. In addition, we augment the initialization. Under a typical initialization scheme, such as Kaiming initialization [45] or Xavier initialization [46], the second derivatives are very small, leading to the same problem of exploding inverses. Instead of symbolically optimizing the variances at each layer [24], we simply augment the input of the MLP to take not just $(\mathbf{x},\mathbf{y})$ but the entirety of $(\mathbf{x},\mathbf{y},\dot{\mathbf{x}},\dot{\mathbf{y}},\mathbf{xy})$. Together, these allow stable training of MASS networks at learning rates as high as $1\times 10^{-2}$.
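The effect of the shifted pseudo-inverse can be illustrated with a small sketch. The text writes $\mathrm{pinv}(\mathbf{A}+b)$ without specifying how $b$ enters for matrix-valued $\mathbf{A}$; here we assume, purely for illustration, that it is added on the diagonal. The helper name `regularized_inverse` is hypothetical.

```python
import numpy as np

def regularized_inverse(hessian, b):
    """Return pinv(A + b * I). Adding b on the diagonal is our assumption."""
    d = hessian.shape[0]
    return np.linalg.pinv(hessian + b * np.eye(d))

# A Hessian that is identically zero has no ordinary inverse.
zero_hessian = np.zeros((2, 2))
inv_shifted = regularized_inverse(zero_hessian, b=0.1)    # finite: 10 * identity
inv_unshifted = regularized_inverse(zero_hessian, b=0.0)  # pinv of zeros is zeros
```

With a nonzero shift the result stays finite, and the $\lambda_b \|b\|_1$ penalty then discourages the shift from growing larger than needed.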

To encourage sparsification of terms, we introduce a regularization term on the final-layer weights and activations. Note that regularizing the weights alone is not enough, since MASS can simply cheat by increasing the magnitude of $S$ and increasing the activations. Let the final-layer weights be $\mathbf{w}$ and the activations be $\mathbf{a}_j$ for system $j$; then, system-wise, we include the regularization term $\frac{\lambda_1}{T}\|\mathbf{w}\|_1 + \frac{\lambda_2}{T}\|\mathbf{a}_j\|_1$. We use $\lambda_1 = 0.1$, $\lambda_2 = 0.01$ in our experiments.
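The regularization term above can be written in a few lines. In this sketch we assume that both $\mathbf{w}$ and $\mathbf{a}_j$ are vectors of length $T$ (for instance, activations for a single sample or averaged over a batch); the text does not specify how a batch is reduced, and the function name is hypothetical.

```python
import numpy as np

def sparsity_penalty(w, a_j, lam1=0.1, lam2=0.01):
    """(lam1 / T) * ||w||_1 + (lam2 / T) * ||a_j||_1 for length-T vectors (assumed shapes)."""
    T = w.shape[0]
    return (lam1 / T) * np.sum(np.abs(w)) + (lam2 / T) * np.sum(np.abs(a_j))

# Toy values with T = 4 (illustrative only; MASS uses T = 172).
w = np.array([1.0, -1.0, 0.0, 2.0])
a_j = np.array([0.5, 0.0, -1.5, 2.0])
penalty = sparsity_penalty(w, a_j)
```

Because the activation norm enters the penalty as well, shrinking the weights while inflating the activations no longer reduces it for free.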

We report the hyperparameter settings in Table 3.

For the higher-dimensional problems, we used larger MLP widths ranging from 40 to 100 and up to 6 hidden layers. Correspondingly, the learning rate varied between $5\times 10^{-4}$ and $5\times 10^{-3}$.

Appendix B Activations on different 1D systems

In the following set of visualizations (Figures 13 to 15(b)), we support our claim that while the exact terms learnt by MASS differ, the underlying theories, described by the histograms, are mostly identical. We find that the first PCA component corresponds mainly to the ground-truth acceleration for the simple 1D systems (1 to 4), but not necessarily for more complex systems. It is of future interest to investigate what this direction means and what the theories that have a low correlation to this direction actually represent.

Figure 13: Mean absolute activations of multiple scientists when trained on the same system: simple harmonic oscillator. Exact activation magnitudes differ, and many terms are activated in general.

Figure 14: Mean absolute activations of multiple scientists when trained on the same system: spring-mass.

Figure 15: Mean absolute activations for multiple MASS scientists when trained on the same single system, either (a) the relativistic harmonic oscillator or (b) the simple pendulum.

Appendix C Visualizing learned scalar functions

Figure 16: Learned scalar functions $S$ for MASS scientist $i.j$, where $i\in\{1,2,3\}$ corresponds to SHO (top), simple pendulum (middle) and gravitational (bottom), and $j$ represents the seed index. Panels: (a) 1.1, (b) 1.2, (c) 1.3, (d) 2.1, (e) 2.2, (f) 2.3, (g) 3.1, (h) 3.2, (i) 3.3.

Figure 16 provides some additional visualizations of the learned scalar functions $S$. Note the various shapes: elliptical, parabolic, hyperbolic, and degenerate (where the level curves are nearly straight lines). In general, the shape closely resembles a conic section, in large part due to the nature of these problems: kinetic and potential energy terms are typically second order in the generalized coordinates. Even for the gravitational problem, where we have a $-\frac{1}{|x|}$ potential, the learnt scalar functions still resemble a conic section!

In general, we observe that while the learned scalar functions look different, the differences lie in simple parity swaps (positive to negative, elliptical to hyperbolic), and the learned theories are in fact similar, in line with our discussion in the main paper. The number of level curves that are nearly straight lines indicates that many theories lie on the border between a “Hamiltonian” and a “Lagrangian” contour.

Appendix D Higher dimensional problems

Refer to caption
Figure 17: Comparison of MASS and analytical solution to the spherical pendulum, solved with initial conditions (θ,ϕ)=(1,0.1),(θ˙,ϕ˙)=(0,1)formulae-sequence𝜃italic-ϕ10.1˙𝜃˙italic-ϕ01(\theta,\phi)=(1,0.1),(\dot{\theta},\dot{\phi})=(0,1)( italic_θ , italic_ϕ ) = ( 1 , 0.1 ) , ( over˙ start_ARG italic_θ end_ARG , over˙ start_ARG italic_ϕ end_ARG ) = ( 0 , 1 ).
Figure 18: A four-dimensional problem. Two-body solution from MASS compared to the analytic solution, shown as (a) trajectories and (b) coordinates. The problem is posed in Cartesian coordinates, with the initial conditions $(x_1,y_1,x_2,y_2)=(-1,0,1,0)$, $(\dot{x}_1,\dot{y}_1,\dot{x}_2,\dot{y}_2)=(0,0.5,0,-0.25)$.
Figure 19: A six-dimensional problem. Three-body solution from MASS compared to the analytic solution, shown as (a) trajectories and (b) coordinates. The problem is posed in Cartesian coordinates, with the initial conditions $(x_1,y_1,x_2,y_2,x_3,y_3)=(-1,0.25,0,0,1,-0.25)$, $(\dot{x}_1,\dot{y}_1,\dot{x}_2,\dot{y}_2,\dot{x}_3,\dot{y}_3)=(0.45,0.43,-1,-0.9,0.44,0.43)$. This is a slight perturbation from the stable figure-8 solution.

MASS can also be applied to higher-dimensional problems beyond the double pendulum in Figure 12. In particular, we demonstrate that MASS can reliably solve for periodic orbits.

A natural extension of the simple pendulum to two dimensions is the spherical pendulum, parameterized by the two degrees of freedom $\theta$ and $\phi$. We present a typical solution that oscillates about the equilibrium conical solution given by $\dot{\phi}=\sqrt{\frac{g}{L\cos\theta}}$. Here $\dot{\theta}$ oscillates nearly harmonically, while $\dot{\phi}$ oscillates about a constant drift equal to the initial angular velocity.

The exact equations of motion are given by

$$
\begin{aligned}
\ddot{\theta} &= \sin\theta\cos\theta\,\dot{\phi}^{2} - \frac{g}{L}\sin\theta \\
\ddot{\phi} &= -\frac{2\dot{\theta}\dot{\phi}\cos\theta}{\sin\theta}
\end{aligned}
$$

with the energy of the system given by

$$
E=\underbrace{\left[\frac{1}{2}mL^{2}\dot{\theta}^{2}+\frac{1}{2}mL^{2}\sin^{2}\theta\,\dot{\phi}^{2}\right]}_{T}+\underbrace{\left[-mgL\cos\theta\right]}_{V}
$$

and we set all physical constants to 1 for the purpose of this experiment.
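The equations of motion and the energy above translate directly into code. The following is a minimal sketch of the analytic reference system only (not of MASS itself); the function names and the state ordering $(\theta,\phi,\dot{\theta},\dot{\phi})$ are assumptions made for illustration.

```python
import numpy as np

def spherical_pendulum_rhs(state, g=1.0, L=1.0):
    """Time derivative of state = (theta, phi, theta_dot, phi_dot).

    Not defined at sin(theta) = 0, where the phi equation is singular."""
    theta, phi, theta_dot, phi_dot = state
    theta_ddot = np.sin(theta) * np.cos(theta) * phi_dot**2 - (g / L) * np.sin(theta)
    phi_ddot = -2.0 * theta_dot * phi_dot * np.cos(theta) / np.sin(theta)
    return np.array([theta_dot, phi_dot, theta_ddot, phi_ddot])

def spherical_pendulum_energy(state, m=1.0, g=1.0, L=1.0):
    """Total energy E = T + V with the zero of V at the pivot height."""
    theta, _, theta_dot, phi_dot = state
    T = 0.5 * m * L**2 * (theta_dot**2 + np.sin(theta)**2 * phi_dot**2)
    V = -m * g * L * np.cos(theta)
    return T + V

# Conical solution at theta = 1: theta_dot = 0 and phi_dot = sqrt(g / (L cos(theta))).
conical = np.array([1.0, 0.1, 0.0, np.sqrt(1.0 / np.cos(1.0))])
conical_rhs = spherical_pendulum_rhs(conical)
```

On the conical solution both accelerations vanish, so `conical_rhs[2]` and `conical_rhs[3]` are zero up to rounding, which is a quick consistency check between the equations of motion and the equilibrium condition quoted earlier.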

Another problem we can tackle with MASS is the $n$-body problem, which involves $n$ masses interacting through gravitational forces. The equations of motion for the $n$-body problem are given by

$$
m_i\ddot{\mathbf{r}}_i=\sum_{j\neq i}G\frac{m_i m_j}{|\mathbf{r}_j-\mathbf{r}_i|^{3}}(\mathbf{r}_j-\mathbf{r}_i),\quad i=1,2,\ldots,n,
$$

where $m_i$ is the mass of the $i$-th body, $\mathbf{r}_i$ is its position vector, and $G$ is the gravitational constant. As with all previous problems, we set all physical constants to 1.
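For concreteness, the right-hand side of these equations can be evaluated in vectorized form as sketched below. This computes the analytic accelerations used as a reference, not the MASS prediction; the function name `nbody_accelerations` and the array layout are assumptions.

```python
import numpy as np

def nbody_accelerations(positions, masses, G=1.0):
    """Gravitational accelerations for n bodies.

    positions: array of shape (n, d); masses: array of shape (n,).
    Returns an (n, d) array with a_i = sum_{j != i} G m_j (r_j - r_i) / |r_j - r_i|^3."""
    positions = np.asarray(positions, dtype=float)
    masses = np.asarray(masses, dtype=float)
    diff = positions[None, :, :] - positions[:, None, :]  # diff[i, j] = r_j - r_i
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)  # removes the j == i self-interaction term
    return G * np.einsum("j,ijk->ik", masses, diff / dist[..., None] ** 3)

# Two unit masses at the positions used in Figure 18.
acc = nbody_accelerations([[-1.0, 0.0], [1.0, 0.0]], [1.0, 1.0])
```

For two unit masses separated by a distance of 2, each body accelerates toward the other with magnitude $1/4$, and the mass-weighted accelerations sum to zero, reflecting conservation of total momentum.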

For the two-body problem in Cartesian coordinates, represented by $\mathbf{x}=(x_1,y_1,x_2,y_2)$, we report the comparison between the analytic and MASS results in Figure 18, from which we observe accurate learning of the behavior, including the drift of the two bodies as well as their orbits about the common center of mass. Note that this problem can effectively be reduced to two dimensions with a coordinate transform using the reduced mass, but MASS is nonetheless able to learn the higher-dimensional general representation in Cartesian coordinates.
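For reference, the standard reduction mentioned here can be written out as follows. The symbols $\mathbf{R}$, $\mathbf{r}$, and $\mu$ are introduced only for this illustration and do not appear elsewhere in the paper.

```latex
% Center-of-mass and relative coordinates for n = 2
\mathbf{R} = \frac{m_1\mathbf{r}_1 + m_2\mathbf{r}_2}{m_1 + m_2},
\qquad
\mathbf{r} = \mathbf{r}_2 - \mathbf{r}_1,
\qquad
\mu = \frac{m_1 m_2}{m_1 + m_2}.

% Adding the two equations of motion gives uniform drift of the center of mass:
(m_1 + m_2)\,\ddot{\mathbf{R}} = 0.

% Subtracting them (after dividing by m_1 and m_2) gives a one-body problem:
\mu\,\ddot{\mathbf{r}} = -\,G\,\frac{m_1 m_2}{|\mathbf{r}|^{3}}\,\mathbf{r}.
```

The first equation accounts for the drift seen in Figure 18, and the second for the orbit about the common center of mass, which is confined to a plane and therefore two-dimensional.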

The two-body problem is comparatively easy. The three-body problem, by contrast, is known to be chaotic, yet MASS can be applied to it directly in six dimensions; the result is shown in Figure 19. The initial conditions are chosen as a deviation from the known stable figure-8 solution, and the result shows that MASS can accurately capture the interaction between all three bodies.

For all the systems presented, we integrate the ODE with a fourth-order Runge-Kutta solver. Combined with the accuracy of MASS, the integrator closely conserves the energy of the systems. Again, we do not claim that MASS is a state-of-the-art solver for physical systems. Many of these toy examples are not solved to the best precision, and some are shown only near equilibrium states, where the behavior of the system is regular. A persistent problem is the training stability of MASS, which is accentuated in irregular regimes. Nonetheless, the ability of MASS to adapt to higher-dimensional problems without much change in architecture or hyperparameters is a promising sign for building general and interpretable AI physics scientists in the future.
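A minimal sketch of classical fourth-order Runge-Kutta integration with an energy-drift check is given below. It is an illustration under stated assumptions rather than the released implementation: the names `rk4_step` and `integrate`, the step size, and the use of the analytic SHO right-hand side in place of a learned MASS model are all choices made for this example.

```python
import numpy as np

def rk4_step(rhs, state, dt):
    """One classical fourth-order Runge-Kutta step for d(state)/dt = rhs(state)."""
    k1 = rhs(state)
    k2 = rhs(state + 0.5 * dt * k1)
    k3 = rhs(state + 0.5 * dt * k2)
    k4 = rhs(state + dt * k3)
    return state + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def integrate(rhs, state0, dt, n_steps):
    """Roll out a trajectory of n_steps RK4 steps, including the initial state."""
    traj = np.empty((n_steps + 1, len(state0)))
    traj[0] = state0
    for k in range(n_steps):
        traj[k + 1] = rk4_step(rhs, traj[k], dt)
    return traj

def sho_rhs(state):
    """Analytic SHO with unit constants; a learned model would be substituted here."""
    x, v = state
    return np.array([v, -x])

traj = integrate(sho_rhs, np.array([1.0, 0.0]), dt=0.01, n_steps=1000)
energy = 0.5 * traj[:, 1] ** 2 + 0.5 * traj[:, 0] ** 2
max_rel_drift = np.max(np.abs(energy - energy[0])) / abs(energy[0])
```

With an exact right-hand side, `max_rel_drift` isolates the error contributed by the integrator alone; with a learned right-hand side, the same quantity would measure the combined effect of model error and integration error.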