Do Two AI Scientists Agree?
Abstract
When two AI models are trained on the same scientific task, do they learn the same theory or two different theories? Throughout the history of science, we have witnessed the rise and fall of theories driven by experimental validation or falsification: many theories may co-exist when experimental data is lacking, but the space of surviving theories becomes more constrained as more experimental data becomes available. We show the same story is true for AI scientists. As more systems are provided in the training data, AI scientists tend to converge in the theories they learn, although sometimes they form distinct groups corresponding to different theories. To mechanistically interpret what theories AI scientists learn and quantify their agreement, we propose MASS, Hamiltonian-Lagrangian neural networks as AI Scientists, trained on standard problems in physics, aggregating training results across many seeds that simulate the different configurations of AI scientists. Our key findings include: 1) when trained on textbook problems in classical mechanics, AI scientists prefer either a complete Hamiltonian or Lagrangian description; 2) when extended to non-standard physical problems, the Lagrangian description generalizes, suggesting that Lagrangian dynamics remain the single accurate family of descriptions in a rich theory space. We also observe strong seed dependence of the training dynamics and final learned weights, controlling the rise and fall of relevant theories. Besides interpretability, MASS unifies and generalizes beyond the Lagrangian neural networks and the Hamiltonian neural networks, providing a new tool for learning dynamical systems. We release our code at https://github.com/shinfxh/ai-scientists.
I Introduction

Throughout human history, our collective curiosity has driven scientific progress. From Archimedes’ principle of buoyancy, to Galileo’s systematic study of motion, to Newton’s formulation of classical mechanics, and finally to Einstein’s revolutionary theories of relativity, these luminaries meticulously analyzed observations and experiments to develop robust hypotheses that explained known phenomena and predicted new ones. Over the centuries, as technology has advanced, so too has our ability to refine experiments, test theories against increasingly precise datasets, and update our frameworks accordingly. Some hypotheses have eventually fallen out of favor, while others have evolved into more nuanced theories capable of describing phenomena at previously uncharted scales [1].
Today, in the twenty-first century, we are witnessing the emergence of a new paradigm. Machine learning (ML) and data-driven methods have already begun to supplant traditional statistical tools in fields as diverse as particle physics [2], astronomy [3], materials science [4] and quantum chemistry [5]. A natural next step is to contemplate a future in which ML methods shift from mere assistive instruments to becoming full-fledged “AI scientists,” capable of formulating hypotheses, designing experiments, and interpreting results with minimal human intervention. Pioneering endeavors have already produced end-to-end AI platforms that can discover physical laws from raw experimental data [6, 7], and molecular structures from protein sequences [8]. The recent improvements in architectures [9] with capabilities to absorb and process a large amount of data have fueled the development of large-language models [10, 11, 12, 13, 14, 15]. These LLMs have already started becoming the backbone of fully-automated AI research scientists [16].
As these AI scientists begin to operate autonomously, it is worth asking: What scientific theories will they propose? History shows that different human researchers, such as Newton and Leibniz, can arrive at complementary yet distinct formulations of the same phenomenon (e.g., calculus). Analogously, modern ML systems vary in architecture, initialization schemes, and training paradigms [17], leading to the possibility that independently trained AI scientists might converge on different theoretical formulations or complementary perspectives. Moreover, as AI scientists venture into analyzing larger and more intricate datasets—ranging from high-dimensional cosmological surveys to complex molecular dynamics simulations—their learned representations and theories may evolve in unexpected ways [18].
This paper does not seek to predict precisely how AI will transform science in the decades ahead. Instead, we offer a set of controlled experiments to investigate whether and how multiple AI scientists, trained under varying conditions, converge or diverge in their scientific theories. By exploring synthetic datasets, we aim to shed light on how the complexity of the data, the choice of model architectures, and the selection of training methods may influence not only what these AI systems learn but also how their internal representations and resulting theories develop over time [19].
In doing so, we hope to provide a window into the kind of questions that will shape future inquiries into the role of AI in science: will AI scientists unify disparate theories or fragment into multiple, equally valid viewpoints? Will their theories be understandable to humans, or will interpretability become a bigger challenge? The experimental framework and preliminary results presented herein serve as a stepping stone for these discussions, highlighting the potential—and potential pitfalls—of our emerging AI Scientists.
To list, our contributions in this paper are:
1. We propose a new architecture, MASS (Multiple AI Scalar Scientists), to allow a single neural network to learn diverse theories across multiple physical systems.
2. We train MASS across datasets including the simple pendulum, the Kepler problem, and synthetic potentials.
3. We analyze significant activations in MASS and distill the theories learned by MASS.
Using MASS as a proxy for an AI scientist, our findings suggest that
1. An AI scientist can learn many diverse explanations of the same physical phenomenon.
2. Encountering more complex systems, successful AI scientists modify their existing theories to suit new observations.
3. AI scientists tend to learn similar theories, evaluated in terms of similarity of the networks’ internal activations. These theories also closely resemble either the Hamiltonian or Lagrangian description.
4. The recovered theories resemble Hamiltonian dynamics initially, then shift closer to a Lagrangian formulation as the complexity of the systems increases. This suggests that the Lagrangian formulation remains the only correct theory within a rich theory space.
II Related Work
Scientists aim to recover equations from observation. So do AI scientists. Given a set of data of some physical system, we aim to uncover the underlying “truth” in terms of physical equations. Efforts to tackle this problem have been a mix of discrete methods such as combinatorial optimization, making use of genetic programming [20], and continuous ones centering around symbolic regression [6]. The underlying assumption is that there are only a few terms in the final expression, inspiring methods of sparse linear regression [21]. Physical priors were introduced [22] to improve the ability of symbolic regression techniques to discover known physical equations. In this paper, we propose a method to discover underlying physical laws with minimal physical priors, making use of the principle of stationary action and learning a single scalar function. These two properties are shared by the Hamiltonian Neural Network (HNN) [23] and the Lagrangian Neural Network (LNN) [24].
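Inspired by the Hamiltonian formulation of classical mechanics, the HNN breaks down the task of learning the equations of motion of a physical system into first learning a scalar function, the Hamiltonian $\mathcal{H}(q, p)$, then obtaining $(\dot{q}, \dot{p})$ using Hamilton’s canonical equations:
$$\dot{q} = \frac{\partial \mathcal{H}}{\partial p}, \qquad \dot{p} = -\frac{\partial \mathcal{H}}{\partial q} \tag{1}$$
where $(q, p)$ are the canonical position and momentum. However, in some cases the expression for these canonical coordinates is not trivial to write down. The LNN solves this problem by learning the scalar function as the Lagrangian $\mathcal{L}(q, \dot{q})$ instead and taking derivatives according to the Euler-Lagrange equations:
$$\frac{d}{dt}\frac{\partial \mathcal{L}}{\partial \dot{q}} = \frac{\partial \mathcal{L}}{\partial q} \tag{2}$$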
This avoids the need for an explicit expression of canonical momentum, making LNNs advantageous for certain physical systems.
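The two routes from a scalar function to equations of motion can be sketched numerically. The snippet below is a minimal illustration, not the HNN or LNN implementation: it assumes a unit-mass, unit-stiffness harmonic oscillator, replaces automatic differentiation with central finite differences, and all function names (`grad`, `hessian`, `hamilton_rhs`, `euler_lagrange_accel`) are hypothetical.

```python
import numpy as np

def grad(f, z, eps=1e-5):
    """Central-difference gradient of a scalar function f at point z."""
    z = np.asarray(z, dtype=float)
    g = np.zeros_like(z)
    for i in range(z.size):
        dz = np.zeros_like(z)
        dz[i] = eps
        g[i] = (f(z + dz) - f(z - dz)) / (2 * eps)
    return g

def hessian(f, z, eps=1e-4):
    """Central-difference Hessian; entry [i, j] approximates d2f / dz_i dz_j."""
    z = np.asarray(z, dtype=float)
    H = np.zeros((z.size, z.size))
    for j in range(z.size):
        dz = np.zeros_like(z)
        dz[j] = eps
        H[:, j] = (grad(f, z + dz) - grad(f, z - dz)) / (2 * eps)
    return H

def hamiltonian(z):
    q, p = z                      # assumed toy system: H = p^2/2 + q^2/2
    return 0.5 * p**2 + 0.5 * q**2

def lagrangian(z):
    q, v = z                      # assumed toy system: L = v^2/2 - q^2/2
    return 0.5 * v**2 - 0.5 * q**2

def hamilton_rhs(z):
    """Hamilton's equations: (q_dot, p_dot) = (dH/dp, -dH/dq)."""
    dH = grad(hamiltonian, z)
    return np.array([dH[1], -dH[0]])

def euler_lagrange_accel(z):
    """Euler-Lagrange equation solved for the acceleration (one dimension)."""
    q, v = z
    dL = grad(lagrangian, z)
    HL = hessian(lagrangian, z)
    return (dL[0] - HL[1, 0] * v) / HL[1, 1]

state = np.array([0.3, -0.7])     # (q, p) or (q, q_dot); identical for unit mass
q_dot, p_dot = hamilton_rhs(state)
q_ddot = euler_lagrange_accel(state)
```

Both routes recover the same acceleration, $-q$, for this toy system; the Lagrangian route needs a Hessian inverse, which is where the numerical fragility of such models tends to come from.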
Since the introduction of these works, there have been significant leaps in improving the efficiency of training [25, 26], as well as in applying these networks to problems in domains such as rigid-body dynamics [27], particle interactions [28], video prediction [29] and generative modeling [30]. However, in many of these works the underlying equations of motion (Equations 1 and 2) are baked into the model architecture, and as a result the model learns the corresponding theory governed by this equation. Instead, we ask the following question: when the model is given the freedom to learn multiple theories, what will it learn?
In this work, our proposed model, Multiple AI Scalar Scientists (MASS), is a generalized framework that includes both LNN and HNN as special cases. MASS is similarly inspired by the principle of stationary action. Like LNNs and HNNs, we aim to learn a free-form scalar function from data. However, unlike LNNs and HNNs, which have hard-coded equations of motion, we equip MASS with the ability to also learn the equations of motion. For a physical system described by generalized coordinates $q$ and velocities $\dot{q}$, one can learn a scalar function (akin to a Lagrangian or Hamiltonian) that governs the system’s evolution, such that the path obeys the principle of stationary action.
The architecture design of MASS allows it to learn a rich space of theories, defined by the coefficients on each term in the governing equation learned by MASS. Similar to [24], our experiments are done in generalized coordinates. Through a series of controlled experiments on an ensemble of MASS scientists in these coordinates, we will probe the underlying theories learned.
III MASS: The AI scientist

To emulate how human scientists operate, the core idea behind MASS is to embed within a single neural network the capacity to learn and unify information from multiple physical systems. Rather than fitting individual models for each system, MASS aims to internalize a shared framework that captures fundamental patterns across all datasets. Specifically, it achieves this by learning a scalar function, analogous to a Lagrangian or Hamiltonian, whose derivatives encode the system-specific dynamics. As illustrated in Figure 2, MASS adopts the following workflow:
1. Data ingestion: MASS receives observational data (e.g., trajectories, states, or energy values) from diverse physical systems such as pendulums, orbital problems, or other synthetic potentials.
2. Hypothesis formation: A separate neural network for each system learns a single scalar function that encapsulates the system-specific dynamics.
3. Theory evaluation: A final layer shared across all systems differentiates the learned scalar function with respect to the system coordinates (position, momentum, and/or velocity). From these derivatives, MASS infers a system’s governing equations. Sharing this layer enforces the consistency of an overarching theory across multiple systems.
4. Refinement and generalization: The outputs of the model, in this case the time derivatives of the input, are then evaluated against the ground truth training data to compute errors. The error is summed across all systems, then backpropagation optimizes a single theory that is simultaneously consistent with multi-physics observations.
By iterating through these steps, MASS strives to discover a single, scalar function for each system and a shared final layer forming a generalized theory across systems. Together, the scalar function and the weights in the final layer (i.e. how MASS takes its derivatives) form the theory that it learns.
IV Method
We denote by $\mathcal{M}$ a single MASS scientist network. $\mathcal{M}$ learns from $N$ different physical systems. Examples of systems include the spring-mass system, the gravitational system, and quantum mechanical systems. Each system obeys some underlying physical law, be it the inverse square law of gravitational attraction or Schrödinger’s equation. For simplicity and as a proof of concept, we constrain our systems to classical mechanics below.
Data ingestion: The input variables for system $i$ to $\mathcal{M}$ are the generalized coordinates in $d$ dimensions, expressed as $(x, y) \in \mathbb{R}^{2d}$, where $x$ and $y$ are the generalized coordinates and their time derivative, respectively. For a simple pendulum, we can either use the angle, $x = \theta$, to express it as a one-dimensional problem, or use the Cartesian components of the position and of its time derivative to express the problem in two dimensions.
Hypothesis formulation: This block consists of $N$ separate neural networks, each learning a separate potential function $S_i$ for its system $i$. We denote this forward pass as
$$S_i = f_{\theta_i}(x, y) \tag{3}$$
In this paper, we focus on MLPs, which suffice for learning $S_i$.
Theory evaluation: The shared derivatives layer computes the derivatives, up to second order, of the scalar function $S_i$ with respect to the input variables $(x, y)$. Note that given $d$-dimensional inputs, i.e. $x, y \in \mathbb{R}^{d}$, the single-variable derivatives are column vectors in $\mathbb{R}^{d}$, while the second-order derivatives (and their inverses) are Hessian matrices in $\mathbb{R}^{d \times d}$. To allow the network to learn a diverse set of theories, we compute all products of terms, up to three terms in the product, such that the final result is a vector that predicts the time derivatives of the inputs. In particular, let the set of vectors be $\mathcal{V}$ and the set of matrices be $\mathcal{A}$. There are three types of terms that can potentially predict these time derivatives:
1. $v$, where $v \in \mathcal{V}$
2. $Av$, where $A \in \mathcal{A}$ and $v \in \mathcal{V}$
3. $ABv$, where $A, B \in \mathcal{A}$ and $v \in \mathcal{V}$.
In our implementation, we explicitly compute all terms of the three types described above using
$$\{t_k\}_{k=1}^{K} = \mathcal{D}(S_i; x, y) \tag{4}$$
where $\mathcal{D}$ is the derivative layer and $\{t_k\}$ gives the $K$ terms that can potentially enter into the final equation.
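The enumeration of single vectors, matrix-vector products, and matrix-matrix-vector products can be sketched as follows. This is a minimal illustration under stated assumptions: the particular vectors and matrices in the example (gradients and Hessians of a toy quadratic scalar function) and the helper name `candidate_terms` are hypothetical, and the actual sets used by MASS may differ.

```python
import itertools
import numpy as np

def candidate_terms(vectors, matrices):
    """Return {name: vector} for every term of the form v, A v, and A B v."""
    terms = dict(vectors)                                   # type 1: v
    for (a_name, A), (v_name, v) in itertools.product(
            matrices.items(), vectors.items()):
        terms[f"{a_name} {v_name}"] = A @ v                 # type 2: A v
    for (a_name, A), (b_name, B), (v_name, v) in itertools.product(
            matrices.items(), matrices.items(), vectors.items()):
        terms[f"{a_name} {b_name} {v_name}"] = A @ B @ v    # type 3: A B v
    return terms

# Toy example (assumed): S = x^T K x / 2 + y^T M y / 2 in d = 2 dimensions.
K = np.diag([1.0, 3.0])
M = np.diag([2.0, 2.0])
x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
vectors = {"S_x": K @ x, "S_y": M @ y}
matrices = {"S_xx": K, "S_yy": M, "S_yy^-1": np.linalg.inv(M)}
terms = candidate_terms(vectors, matrices)
```

With 2 vectors and 3 matrices this toy setup yields 2 + 6 + 18 = 26 candidate terms; the count grows quadratically in the number of matrices, which is why a sparsity-inducing final layer matters.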
In the final layer, the network learns a linear combination of these vectors to predict the time derivatives of the inputs. Note that since we are using generalized position and momentum, $\dot{x} = y$ trivially, up to a constant factor. The remainder of the paper focuses on investigating the set of activations in the final layer that predicts $\dot{y}$. We denote the weights of this final layer as $w = (w_1, \dots, w_K)$, and the output prediction of the network will be given by
$$\hat{\dot{y}} = \sum_{k=1}^{K} w_k\, t_k \tag{5}$$
where $t_k$ is the $k$-th candidate term computed by the derivative layer.
Refinement and generalization: For a specific system $i$, we predict $\hat{\dot{y}}$ and then compute the MSE loss against the ground truth data. We then sum the losses across all $N$ systems that the network $\mathcal{M}$ is learning over, and backpropagate on the accumulated gradients. After convergence, the model develops a theory that is consistent across its multiple physical systems. The optimization objective is written as
$$\min_{\theta}\; \sum_{i=1}^{N} \mathbb{E}_{(X, Y) \sim \text{system } i} \left\| \mathcal{M}_{\theta}(Z_i) - \dot{Y} \right\|^2 \tag{6}$$
where $Z_i = (X, Y)$ is the concatenation of samples in system $i$, and the expectation is taken over the samples X, Y drawn i.i.d. for each system.
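The summed objective with a shared final layer can be sketched as below. This is an illustration only: it assumes the candidate terms for each system have already been computed as arrays of shape `(batch, K, d)`, and the names `predict` and `total_loss` are hypothetical rather than taken from the released code.

```python
import numpy as np

def predict(terms, w):
    """terms: (batch, K, d) candidate terms; w: (K,) shared final-layer weights."""
    return np.einsum("k,bkd->bd", w, terms)

def total_loss(terms_per_system, targets_per_system, w):
    """Sum of per-system mean squared errors, all using the same weights w."""
    return sum(np.mean((predict(t, w) - y) ** 2)
               for t, y in zip(terms_per_system, targets_per_system))

# Synthetic check: two "systems" whose targets come from the same weights.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
terms_per_system = [rng.normal(size=(8, 3, 2)) for _ in range(2)]
targets_per_system = [predict(t, w_true) for t in terms_per_system]
loss_at_truth = total_loss(terms_per_system, targets_per_system, w_true)
loss_at_zero = total_loss(terms_per_system, targets_per_system, np.zeros(3))
```

Because `w` appears in every summand, a weight vector that fits one system but not another is penalized, which is the mechanism that pushes the final layer toward a theory shared across systems.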
We find that optimization over the parameters of the network is highly unstable (as observed in [24]), due to the computation of derivatives and inverses in the Hessian matrices. The experimental procedure and hyperparameter settings are detailed in Appendix A, but some key design choices help achieve stable training:
• Computing the pseudo-inverse with a regularized stabilization term, which is penalized as a regularization term in training, instead of computing the plain matrix inverse.
• Augmenting inputs to include second-order terms of x, y.
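One common way to stabilize a matrix inverse is Tikhonov damping. The sketch below shows that technique as an assumption; it is not necessarily the stabilization used in MASS, and the function name `damped_pinv` and the value of `eps` are hypothetical.

```python
import numpy as np

def damped_pinv(A, eps=1e-3):
    """Tikhonov-damped pseudo-inverse (A^T A + eps I)^-1 A^T."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + eps * np.eye(n), A.T)

# Well-conditioned matrix: the damped result approaches the true inverse.
well_conditioned = np.diag([2.0, 4.0])
near_inverse = damped_pinv(well_conditioned, eps=1e-8)

# Singular matrix: the plain inverse does not exist, the damped one stays finite.
singular = np.array([[1.0, 0.0], [0.0, 0.0]])
stabilized = damped_pinv(singular, eps=1e-3)
```

The damping bounds the largest entry of the result by roughly $1/(2\sqrt{\epsilon})$, so flat directions of the scalar function no longer produce unbounded activations.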
V Experiments
V.1 Single scientist: Correlated theories
“It might be that to describe the universe, we have to employ different theories in different situations. Each theory may have its own version of reality, but according to model-dependent realism, that is acceptable so long as the theories agree in their predictions whenever they overlap, that is, whenever they can both be applied.”
— Stephen Hawking & Leonard Mlodinow, The Grand Design (2010)


The central message of Hawking’s statement is that multiple theoretical frameworks can provide equally valid descriptions of physical phenomena, as long as their predictions agree with experiment. A typical classroom demonstration of this principle is the undamped spring-mass system. One can appeal to Newton’s laws of motion, where the governing equation is
$$m\ddot{x} = -kx,$$
or switch to a Hamiltonian formulation in which energy functions and conservation laws offer an alternative viewpoint.
Machine learning models, on the other hand, tend to be overparameterized, often giving them considerable flexibility in fitting data, even for relatively simple physical systems [33, 34]. An intriguing question arises: If we train a single “AI scientist” network on a simple harmonic oscillator, what sort of theoretical representation will it learn, and how will it compare to the standard Newtonian or Hamiltonian descriptions?
To investigate this, we trained MASS on simulated data from the undamped spring-mass system. Figure 3 shows the training results. In particular, we observe that training on the simple harmonic oscillator is not a difficult task for MASS, which converges to a low MSE loss. We are interested in understanding how the model learns and simplifies its theory when regularization is added to the final layer. To that end, we track the number of significant weights across training steps, calculated as the number of weights in the final layer that account for the first 99% of the total norm of the final layer weight vector. Observe that this number also decreases with the total number of training steps, but it plateaus at a rather large value of 42. This means that many weight terms have significant magnitude, which is not at all close to a simple single-term theory!
Using MASS, we can also simulate a trajectory of the oscillator rather easily, and Figure 3 shows the consistency of the predictions given by MASS.


Figure 4 depicts the learned scalar function $S$ over phase space, compared to a canonical Hamiltonian function $\mathcal{H}$. A single MASS scientist is able to recover the sum of potential and kinetic energy expression. However, we already see some differences between these two expressions. Notably, our learnt $S$ is convex rather than concave as in $\mathcal{H}$. $S$ also takes on negative values, which are not typically allowed in expressions of energy. In this example and also in general, the contour can also look skewed, translated or even resemble an entirely different conic section. This goes to show the richness of the theory space offered to MASS.
While the weights in the final layer offer a glimpse as to which terms are important, they are not the full story. The final prediction of $\dot{y}$ is the sum
$$\hat{\dot{y}} = \sum_{k} w_k\, t_k$$
where $w_k$ is the final-layer weight on the $k$-th derivative term $t_k$, so each term contributes in proportion to both its weight and its own magnitude.
We hence compute the activation vector $a_k = w_k t_k$ over a sample batch of 512 data points. In Figure 5, we compare the mean norm of the activations (expectation taken over the 512 data points) with the weight magnitudes $|w_k|$. In general, non-zero weights correspond to non-zero activation norms, but the relative order of the magnitudes of the terms is not necessarily preserved. In particular, inverses of Hessian matrices are large when the second derivatives of the scalar function are small.
The largest magnitude activations that contribute to the final prediction of are in descending order: . When sorting instead by the norm of the weights, the top 5 terms are . The similarity of these terms is a strong indicator of the important terms contributing to the final theory learnt by MASS.
In the next step, we filter down to the significant terms. Following the convention in Figure 3, we keep only those terms that contribute to the first 99% of the total magnitude. That is, $k$ is the number of significant terms if the $k$ largest activations form the smallest set whose magnitudes sum to at least 99% of the total. On these remaining significant activations, we compute the correlations shown in the heatmap in Figure 5, sorted according to similarity in correlations by hierarchical clustering [35]. There are three distinct clusters, running from the upper left to the lower right, consisting of terms that are linear matrix products of three different vectors, respectively.
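The 99% cutoff can be sketched as a cumulative-sum count. This is a minimal illustration that assumes the magnitudes are summed directly (an L1-style total); the function name `num_significant` is hypothetical.

```python
import numpy as np

def num_significant(magnitudes, fraction=0.99):
    """Smallest k such that the k largest magnitudes reach `fraction` of the total."""
    m = np.sort(np.abs(np.asarray(magnitudes, dtype=float)))[::-1]  # descending
    cum = np.cumsum(m)
    # First index whose cumulative sum meets the target, converted to a count.
    return int(np.searchsorted(cum, fraction * cum[-1]) + 1)

k_example = num_significant([5.0, 3.0, 1.0, 0.05, 0.02])
k_half = num_significant([1.0, 1.0, 1.0, 1.0], fraction=0.5)
```

In the first example the three largest magnitudes sum to 9.0 out of 9.07, which already exceeds 99% of the total, so only three terms count as significant.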
The existence of multiple terms stands in contrast to a trivial theory in which a single significant activation gives a perfect prediction of $\dot{y}$. The Hamiltonian expression would construct the Hamiltonian $\mathcal{H}$ and predict $\dot{y} = -\partial \mathcal{H} / \partial x$. The reason behind the multi-expressivity of the network is that most second-order terms are constant when the scalar function is at most second order. For instance, one can easily conceive of a network that learns a quadratic scalar function, for which Hessian matrices and their inverses become constant factors. The invariance in learning these products gives rise to the mix of expressions we see. Nonetheless, these terms turn out to be extremely correlated and they mainly represent only one theory. In the next section, we will discuss how the significant terms evolve, which terms survive and which die, when the AI scientist is exposed to more complex physical systems.
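As a worked illustration of this degeneracy, consider a one-dimensional quadratic scalar function. This is a sketch with hypothetical constants $a$ and $b$, not the function actually learned by the network:

```latex
\begin{align*}
S(x, y) &= \tfrac{1}{2} a x^2 + \tfrac{1}{2} b y^2, \\
\partial_x S &= a x, \quad \partial_y S = b y, \quad
\partial_x^2 S = a, \quad \partial_y^2 S = b, \quad \partial_x \partial_y S = 0, \\
(\partial_x^2 S)^{-1}\, \partial_x S &= x, \qquad
(\partial_y^2 S)^{-1}\, \partial_x S = \tfrac{a}{b}\, x, \qquad
\partial_y^2 S\, (\partial_x^2 S)^{-1}\, \partial_x S = b\, x .
\end{align*}
```

Every product built on $\partial_x S$ is a constant multiple of $x$, so any weighted combination of these terms whose coefficients add up to the right constant reproduces the same linear prediction for $\dot{y}$. The terms are distinct as expressions but perfectly correlated as activations.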
The main takeaways from this section are:
1. A single AI scientist can very effectively learn a single simple system (Figure 3), and it learns to filter its theory as training progresses.
2. The underlying theory resembles some familiar physical function (Figure 4).
3. When given large capacity, a single AI scientist tends to learn many seemingly separate theories (Figure 5(a)).
4. However, many of these theories are strongly correlated (Figure 5(b)).
V.2 Multiple systems: Sparsification and diversification
“In the beginning of the twentieth century, it became apparent that the motion of the planet Mercury was not exactly right. This caused a lot of trouble and was not explained until it was shown by Einstein that Newton’s Laws were slightly off and that they had to be modified. ”
— Richard Feynman, The Character of Physical Law (1967) [36]
The simple harmonic oscillator is perhaps too simple for a machine learning model to fit. You see, it just has to fit a linear relation between $\dot{y}$ and $x$. We now extend our experiments to investigate what happens when an AI scientist, having started out by observing just a single system, encounters more complex physical systems. Following the training paradigm in Section III, MASS learns a separate scalar function for each system, while sharing the same final layer. We aggregate the loss across all the systems in one step of training. The specific systems of interest here are:
1. Simple harmonic oscillator
2. Simple pendulum
3. Kepler (Gravitational potential)
4. Relativistic harmonic oscillator

Figure 6 shows the training results as we introduce each system one after another at intervals of 10000 steps, in the aforementioned order, i.e. a single training phase lasts for 10000 steps. This specific order reflects, at a high level, the increasing complexity of the systems to a human scientist. We observe that as more systems are introduced, existing theories either survive or falter, depending on the random seed controlling the initialization of the MASS network. For example, seed 80 fails at the simple pendulum, seed 96 fails at the gravitational potential, and seed 569 fails at the relativistic potential. This means that although they survived previous tasks, they probably discovered “wrong” theories that only overfit to previous tasks. It is also interesting to note that while some MASS scientists fail initially, they can start learning accurate representations when tasked with a more complex system. The intuition for this late start is that more complex systems impose a larger number of constraints on MASS, which helps its convergence.
The aggregated behaviors across seeds at each system will be further discussed in Section V.3. In the remainder of this section, we analyze a single MASS scientist, seed 52, and its surviving terms.

In general, we make the following observations:
1. As the number of systems increases, the number of distinct terms learned decreases.
2. As the number of systems increases, the theories become more diverse.
The first result follows from Figure 7: the number of significant terms, counted by the number of squares in each correlation map, decreases from 20 to 6 for the SHO, 12 to 7 for the pendulum and 10 to 5 for the gravitational problem. This shows that fewer terms are able to explain all the systems simultaneously than to explain just a smaller subset of them. The second result is observed from the increasing occurrence of non-correlated terms trending towards the bottom right of Figure 7. We also find that when tasked with explaining an ensemble of systems, MASS uses almost the same terms! To see this, note that the last row of Figure 7 comprises essentially the same 6 to 7 terms used for explaining all 4 systems. These terms correspond to
The large dependence on is hypothesized to survive from the initial explanation of the simple harmonic oscillator, in which MASS begins by learning terms that are constant products of , and from these terms it develops a theory for the new systems. We verified in a separate set of experiments, that permuting the order of the systems, starting with more difficult systems first, results in more emphasis on terms more related with and .
We state succinctly the main conclusion from this section: as an AI scientist encounters more systems, the number of distinct terms decreases.
V.3 Multiple scientists: Mixture of theories
“For a brief period at the beginning of 1926, it looked as though there were, suddenly, two self-contained but quite distinct systems of explanation extant: matrix mechanics and wave mechanics. But Schrödinger himself soon demonstrated their complete equivalence.”
— Max Born, Nobel Prize Lecture (1954)
When multiple scientists work on the same problem independently, some arrive at theories that seem vastly different but later turn out to be two sides of the same coin (think of Newton’s and Leibniz’s descriptions of calculus). Differences in theory that are reconciled later happen even more often in today’s advances in machine learning [37, 38, 39, 40]. In other instances, theories remain different from each other even though they agree with the same experimental results, very much like the Hamiltonian and Lagrangian scalar function descriptions of classical mechanics.
In this section, we investigate the relations between the theories learnt by different MASS scientists (which we will represent by different initial seeds) studying the same system.
The exact weights and values of each activation differ a lot between different scientists. Depending on the initialization, the exact terms which matter change drastically (refer to Figure 13 and more in Appendix B). While the magnitudes of the individual terms vary, the set of significant terms chosen by each scientist remains nearly identical. We illustrate the relative magnitude of each activation term in Figure 8. Observe that there are clear lines along this strip, indicating the terms with which it is possible to learn a theory describing the system.




Nonetheless, the large variations in activation magnitudes and weights indicate that while the theories learned by MASS all lie within the dark lines in Figure 8, it might very well be the case that each scientist learns something different. Examining the scalar functions learnt by individual AI scientists (refer to Figure 16 in Appendix C), it is difficult to tell the underlying similarities and differences. Are these AI scientists all learning something entirely different? We will now show that this is not the case.
Consider the activations on the final layer of MASS, which have shape $(B, K)$ on a batch of $B$ samples when the final layer has $K$ terms. We conduct dimensionality reduction by PCA. It turns out that in the majority of seeds, the first principal component already explains more than 90% of the variance. Projecting onto this first principal component gives a set of $B$ scalar activations, and we observe that these activation values are distributionally equivalent to a uniform distribution (see Figure 14).
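The reduction to a first principal component can be sketched with a singular value decomposition. The data below is synthetic and only stands in for real final-layer activations; the helper name `first_pc` and the sizes (512 samples, 5 terms) are assumptions made for the example.

```python
import numpy as np

def first_pc(acts):
    """acts: (batch, K). Return (scores on PC1, fraction of variance explained)."""
    centered = acts - acts.mean(axis=0, keepdims=True)
    _, s, Vt = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)        # variance fraction per component
    return centered @ Vt[0], explained[0]

# Synthetic activations: one hidden uniform factor plus small noise.
rng = np.random.default_rng(0)
t = rng.uniform(-1.0, 1.0, size=512)
direction = rng.normal(size=5)
acts = np.outer(t, direction) + 0.01 * rng.normal(size=(512, 5))
scores, ratio = first_pc(acts)
```

When a single latent factor drives all activations, as in this toy data, the first component captures nearly all of the variance and its scores track that factor up to sign and scale.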




Such observations are corroborated across multi-scientists set-ups when run on the relativistic spring-mass and the simple pendulum, as given in Figures 15(b) and 15(a).
Computing the correlations between the activations shows that each scientist is in fact strongly correlated with all others (see Figure 9). Note that correlations close to $-1$ denote parity flips, which are, surprisingly, only rarely learnt.
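A pairwise comparison of this kind can be sketched with a correlation matrix over the scientists' reduced activations, evaluated on the same samples. The three synthetic "scientists" below are assumptions chosen to show agreement, rescaling, and a parity flip; `scientist_correlations` is a hypothetical helper.

```python
import numpy as np

def scientist_correlations(pc1_scores):
    """pc1_scores: (n_scientists, batch), all evaluated on the same samples."""
    return np.corrcoef(pc1_scores)

rng = np.random.default_rng(1)
s = rng.uniform(-1.0, 1.0, size=512)
pc1 = np.stack([
    s,                                        # reference scientist
    2.0 * s + 0.01 * rng.normal(size=512),    # same theory, rescaled and noisy
    -s,                                       # parity-flipped theory
])
corr = scientist_correlations(pc1)
```

Rescaling leaves the correlation near $+1$, while a sign flip gives $-1$, so the absolute value of the correlation is the quantity that measures agreement between theories.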
These results allow us to conclude that multiple scientists learn the same underlying theory when trained on the same physical system. In fact, this already gives the answer to our very first research question: two AI scientists do agree!
V.4 Exploring the unknown: Lagrangian is all you need
“I think that the prize is recognizing, in part, the fact that understanding the deep problems of things like mind is not going to come forth in some simple way like Newtonian physics. ”
— John Hopfield, Nobel Prize Interview (2024)

In the remainder of this section, we extend the analysis to a fully general case: multiple MASS scientists trained on multiple physical systems. Again, we train in the manner of Section IV, continuously exposing MASS to increasingly difficult systems and summing the errors across each system.
Simultaneously, we present an extension of our MASS framework to unseen physical systems. Thus far we have been replicating the results of known problems: the simple harmonic oscillator, the simple pendulum, the gravitational potential and the relativistic oscillator. The original motivation for training MASS on these systems was that they were already well-studied, giving us a decent baseline to benchmark the performance of MASS against. However, a natural next question in the direction of scientific progress is what happens when we extend our current framework to systems yet to be discovered. At the same time, these four canonical systems lie well within the capabilities of MASS. The learnt theories are not very diverse (see Figure 9) and some terms in the final layer are almost never used (see Figure 8). Theoretically, this can be attributed to the fact that one-dimensional systems yield potential functions that typically do not involve cross-terms between $x$ and $y$. For example, even the most complex system that MASS has been exposed to, the relativistic harmonic oscillator, can be expressed with a potential function following that of a Lagrangian
$$\mathcal{L} = -mc^2\sqrt{1 - y^2/c^2} - \tfrac{1}{2}kx^2,$$
where $y = \dot{x}$ is the velocity, for which $\partial^2 \mathcal{L} / \partial x\, \partial y = 0$.
To extend our studies to unseen physical systems and also fully utilize the capacity of the MASS network, we introduce synthetic systems. We list the modifications in Table 1 by describing the kinetic energy and potential energy of each system. In particular, we introduce two additional synthetic systems which serve as extensions of the relativistic harmonic oscillator with a more complex potential energy term.
System | Kinetic energy | Potential energy |
---|---|---|
(1) Classical | ||
(2) Pendulum | ||
(3) Kepler | ||
(4) Relativistic | ||
(5) (Synthetic) | ||
(6) (Synthetic) |
Our key results are presented in Figure 10, where we count the number of correct MASS scientists, defined as the number of seeds where the evaluation loss on the converged model, computed as the maximum MSE across all seen physical systems, falls below a fixed threshold. We also count the number of significant terms, defined as the number of terms in the final layer needed to reach 95% of the total norm. These values are aggregated at the end of each training phase. Recall that in a single training phase, a MASS scientist is exposed to a new system and trained on the sum of the losses. Typically, a phase lasts for 10000 steps.
As we increase the number of systems, the number of MASS scientists that have been consistently correct decreases (dashed blue line of Figure 10), where to be consistently correct at phase $n$ is to have a low converged loss for all phases up to and including the $n$-th phase. This is intuitive, since the set of consistently correct MASS scientists at the end of the $n$-th phase is always a subset of that at the end of phase $n-1$. What is not very intuitive is the solid blue line: the number of correct scientists can increase with the number of systems. This is analogous to seed 506 in Figure 6, where a MASS scientist can fail at a less complex system but, when exposed to more systems, learns the overarching underlying theory and succeeds. Such revivals of scientist networks highlight the importance of augmenting physical neural networks with more difficult tasks in order for them to work on simpler ones.
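The two counts can be sketched from a table of per-seed, per-phase evaluation losses. The loss values and the threshold of `1e-3` below are made up for illustration, and `count_correct` is a hypothetical helper.

```python
import numpy as np

def count_correct(losses, threshold):
    """losses: (n_seeds, n_phases) evaluation loss at the end of each phase.

    Returns (correct, consistent): per-phase counts of seeds below the
    threshold at that phase, and of seeds below it at every phase so far.
    """
    correct = losses < threshold
    consistent = np.logical_and.accumulate(correct, axis=1)
    return correct.sum(axis=0), consistent.sum(axis=0)

losses = np.array([
    [1e-5, 1e-5, 1e-5],   # correct throughout
    [1e-1, 1e-5, 1e-5],   # fails first, then revives
    [1e-5, 1e-1, 1e-5],   # fails in the middle, then revives
])
correct, consistent = count_correct(losses, threshold=1e-3)
```

Here `correct` is `[2, 2, 3]` while `consistent` is `[2, 1, 1]`: the consistent count can only shrink, whereas the per-phase count rises when previously failing seeds recover.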
The number of significant terms also shows a consistently decreasing trend. This cements the results of Figure 9 but is still surprising! To describe each system independently, the MASS scientist relies on rather different sets of weights, as in Figure 8. Rather than learning separate terms to describe separate systems, i.e. learning the union of the terms for each theory, MASS instead learns the intersection of the terms, exemplifying the purpose of the shared final layer.
After training on 6 systems, the number of significant terms is still more than 6. A six-term theory is neat, but nowhere near the simplicity of Equations 1 and 2. In the remainder of this section, we show that we can easily distill the underlying theory, and that this underlying theory is in fact a Lagrangian.


V.4.1 Simple Problems: Hamiltonian is all you need?
Recall that in the Hamiltonian formulation, we learn the scalar function to be $H = T + V$, where $H$ is the Hamiltonian, $T$ is the kinetic energy, and $V$ is the potential energy. In the Lagrangian formulation, we learn the scalar function to be $L = T - V$. The sign flip here is crucial.
Given the data coordinates and the weights of a MASS scientist, we can compute the learned scalar function. We can also pre-compute the kinetic and potential energy terms $T$ and $V$, then fit the scalar function as a linear combination of $T$ and $V$. We say that a MASS scientist has learnt a Lagrangian theory if the fitted coefficients of $T$ and $V$ have opposite signs, and a Hamiltonian theory if they have the same sign.
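A minimal sketch of this sign test is given below. The function name `classify_theory`, the coefficient names `a` and `b`, and the inclusion of a constant offset in the fit are assumptions for illustration; the released code may organize this step differently.

```python
import numpy as np

def classify_theory(S, T, V):
    """Least-squares fit S ~ a*T + b*V + c, labelled by the signs of a and b.

    S, T, V are 1D arrays evaluated on the same batch of data points.
    The offset c is an assumption: a scalar function is only defined up to
    an additive constant, so we let the fit absorb it.
    Degenerate fits (a or b close to zero) are not treated specially.
    """
    S, T, V = (np.asarray(x, dtype=float) for x in (S, T, V))
    X = np.column_stack([T, V, np.ones_like(T)])
    (a, b, c), *_ = np.linalg.lstsq(X, S, rcond=None)
    label = "Lagrangian" if a * b < 0 else "Hamiltonian"
    return label, a, b
```

Because only the relative sign of the two coefficients matters, a scalar function equal to $-2(T - V)$ is still labelled Lagrangian, and an overall rescaling of $T + V$ is still labelled Hamiltonian.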
Alongside the discrete counting method described above, we can also directly fit a batch of activations against the Lagrangian and Hamiltonian activations, which we can compute from an analytic expression. This fit should not be expected to be perfect, since MASS can learn a simple variant of a clean theory that gives similar accuracy. For example, two scalar functions that differ by a term linear in the coordinates may end up being effectively the same, since first-derivative terms change by a constant while second-derivative terms are unchanged. Despite the imperfections of linear fitting, the aggregate trend of the mean across many samples can tell us a bit about the relation to each theory.
Figure 11 summarizes these results and shows the evolution of theories learned across a number of systems. When trained on just the simple harmonic oscillator or the pendulum, MASS learns an almost complete Hamiltonian description (with more than 90% of the scientists agreeing). In this simple setting, there exists some sparse choice among the derivative terms that gives low loss under Hamilton’s equations (Equation 1), and MASS tends towards this. The learned scalar functions themselves also display strong correlation.
V.4.2 Complex Problems: Lagrangian is all you need
The story changes when we extend beyond the simple pendulum to more complex problems. On these systems (3 to 6 in Table 1), MASS switches to a Lagrangian theory. One reason for this, as discussed in [24], is that the Lagrangian can be applied directly in generalized coordinates while the Hamiltonian requires canonical coordinates. As our data is presented in generalized coordinates, the MASS architecture supports calculations done in this coordinate system, following the Lagrangian formulation. What is surprising here is that the correlation to the Lagrangian scalar function itself also increases, suggesting that on an aggregate level, AI scientists tend towards this singular family of descriptions of physical systems: the Lagrangian description!
The results of Figure 11 show a bias toward the Lagrangian formulation, but not definitive proof that the calculations faithfully follow those of the Lagrangian. Of course, we should not expect that: given the capacity imbued to MASS, why would it follow some “nice” theory? But it turns out that it almost exactly does! We will show this with a method of constrained optimization.
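In the Lagrangian formulation, the prediction of the acceleration $\ddot{q}$ is given by the Euler-Lagrange equation in the form used in [24],
$$\ddot{q} = \left(\nabla_{\dot q}\nabla_{\dot q}^{\top} L\right)^{-1}\left[\nabla_{q} L - \left(\nabla_{q}\nabla_{\dot q}^{\top} L\right)\dot{q}\right].$$
The activations in the final layer of MASS will hence be concentrated on the terms $\left(\nabla_{\dot q}\nabla_{\dot q}^{\top} L\right)^{-1}\nabla_{q} L$ and $\left(\nabla_{\dot q}\nabla_{\dot q}^{\top} L\right)^{-1}\left(\nabla_{q}\nabla_{\dot q}^{\top} L\right)\dot{q}$. However, the multi-expressivity of our network allows for many terms to be linearly related to these two terms.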
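To make the two Lagrangian terms concrete, the sketch below evaluates them for a one-dimensional Lagrangian using central finite differences. This is an illustration under stated assumptions and not the MASS implementation: the function name `euler_lagrange_terms` and the step size `h` are hypothetical, and finite differences stand in for the automatic differentiation one would use in practice.

```python
def euler_lagrange_terms(L, q, qd, h=1e-3):
    """Return (acceleration, term1, term2) for a 1D Lagrangian L(q, qd).

    term1 = (d2L/dqd2)^-1 * dL/dq
    term2 = (d2L/dqd2)^-1 * (d2L/dq dqd) * qd
    acceleration = term1 - term2  (Euler-Lagrange equation solved for qdd)
    Derivatives are approximated with central finite differences of step h.
    """
    dL_dq = (L(q + h, qd) - L(q - h, qd)) / (2 * h)
    d2L_dqd2 = (L(q, qd + h) - 2 * L(q, qd) + L(q, qd - h)) / h**2
    d2L_dq_dqd = (L(q + h, qd + h) - L(q + h, qd - h)
                  - L(q - h, qd + h) + L(q - h, qd - h)) / (4 * h**2)
    term1 = dL_dq / d2L_dqd2
    term2 = d2L_dq_dqd * qd / d2L_dqd2
    return term1 - term2, term1, term2
```

For a separable Lagrangian such as the harmonic oscillator, $L = \tfrac{1}{2}\dot{q}^2 - \tfrac{1}{2}q^2$, the mixed derivative vanishes and only the first term contributes. A Lagrangian that couples $q$ and $\dot{q}$, for instance $L = \tfrac{1}{2}\dot{q}^2(1 + q^2)$, makes both terms non-zero.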
System | ||
---|---|---|
(1D) Relativistic | 0.9999 | 0.9995 |
(1D) (Synthetic) | 0.9835 | 0.8205 |
(1D) (Synthetic) | 0.9306 | 0.8734 |
(2D) Double pendulum | 0.9712 | 0.7317 |
We set up the constrained optimization problem as follows. Given the data coordinates and the weights of a MASS scientist, we can compute the scalar function and from it obtain the two terms representative of the Lagrangian theory, which we refer to as the Lagrangian activations. Both can be easily computed with Jacobian-vector products (JVPs). We can also obtain the activations of the final layer with a forward pass of the data through the network. Then, we solve the constrained optimization problem
(7) | ||||
(8) | ||||
(9) | ||||
(10) |
where the optimization variable is a transformation from the 172-term MASS activation space to the 2-term Lagrangian activation space, and constraint (10) restricts the transformation to one which uses exactly all the weights in the activations, i.e. no cheating by avoiding some activations completely and overusing others. A technical note: in the first four systems of Table 1, we always get a trivial solution, since the cross term (the mixed derivative of the scalar function with respect to coordinates and velocities) vanishes. Of interest is what happens in the synthetic systems, where the cross terms are not zero and MASS is forced to learn something non-trivial in both Lagrangian terms.
We summarize these results in Table 2, where the scores of this constrained optimization fit are averaged across the correct scientists. Consistent with previous observations, MASS can almost be directly transformed into a Lagrangian theory, with values above 0.9. If we try to pick any two random terms from the available terms, or even the two terms with the highest activation magnitudes, the constrained optimization will typically fail, observed as a negative score on the holdout test set.
Such strong correlations to the Lagrangian raise a broader question: can we find a third description of classical mechanics? At least with MASS working in the rich theory space of 172 terms, the answer appears to be no! The Lagrangian is all you need.
V.5 Extensions to high dimensions

While in the previous sections we have mainly worked with one-dimensional problems, most physical problems in nature are higher dimensional. In this section, we study one classic example of that: the chaotic double pendulum problem. The two degrees of freedom are the angles of the two pendulums. Our results show that MASS can be effectively extended to higher dimensions.
Following an identical training scheme as in Section IV, we reproduce the analytically correct trajectory of a double pendulum in Figure 12, calling the MASS solver for the acceleration at each step and using RK4 integration [41, 42].
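The integration loop can be sketched as below. Here `accel` is a hypothetical stand-in for the trained MASS solver (any callable mapping positions and velocities to accelerations), and the function names are our own.

```python
import numpy as np

def rk4_step(accel, q, qd, dt):
    """One classical fourth-order Runge-Kutta step for q'' = accel(q, q')."""
    k1q, k1v = qd, accel(q, qd)
    k2q, k2v = qd + 0.5 * dt * k1v, accel(q + 0.5 * dt * k1q, qd + 0.5 * dt * k1v)
    k3q, k3v = qd + 0.5 * dt * k2v, accel(q + 0.5 * dt * k2q, qd + 0.5 * dt * k2v)
    k4q, k4v = qd + dt * k3v, accel(q + dt * k3q, qd + dt * k3v)
    q_new = q + dt / 6.0 * (k1q + 2 * k2q + 2 * k3q + k4q)
    qd_new = qd + dt / 6.0 * (k1v + 2 * k2v + 2 * k3v + k4v)
    return q_new, qd_new

def rollout(accel, q0, qd0, dt, n_steps):
    """Integrate forward, querying the acceleration model at every stage."""
    q, qd = np.asarray(q0, dtype=float), np.asarray(qd0, dtype=float)
    trajectory = [(q, qd)]
    for _ in range(n_steps):
        q, qd = rk4_step(accel, q, qd, dt)
        trajectory.append((q, qd))
    return trajectory
```

With an exact acceleration such as `accel = lambda q, qd: -q`, the rollout tracks the harmonic oscillator closely over a full period, so any residual energy drift in a learned rollout can be attributed mostly to the model and not to the integrator.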
Not only can we achieve rather accurate prediction of the angles, the energy discrepancy is only at 0.4% of the total energy per 100 steps. This is comparable to the results of the Lagrangian neural network [24]. Even without introducing the Lagrangian and Euler-Lagrange equations directly into the architecture to enforce energy conservation, MASS learns to reproduce it.
We also observe, consistent with our expectations, that the learnt theories resemble a Lagrangian, with the results further included in Table 2.
We present more results of solution trajectories for the spherical pendulum and the multi-body gravitational problem in Appendix D. We are not claiming that MASS is a state-of-the-art method for solving physical systems, especially since it is out of the scope of this project to tune MASS for efficiency and accuracy on higher dimensional problems. In fact, the computation of the Hessian matrix and its inverse incurs a cost that grows polynomially with the dimensionality of the problem, so directly applying the current solver to problems of extremely high dimensionality would be very expensive. Nonetheless, the evident applicability of MASS to solving the double pendulum problem at a sufficient level demonstrates its potential for future exploration, and drives home the spirit of this paper: to build AI scientists that are both simple and interpretable, and also generally applicable to complex physical systems.
VI Discussions
So do two AI scientists agree? The short answer is yes. But it comes with some caveats.
Looking back, we question the relationship between our results in Figure 9 and those in Figure 11. In the former, we observe a strong correlation between the theories described by each MASS scientist. In Figure 11, by contrast, we see an indication that scientists can learn different theories. In combination, this suggests that a number of theories lie on the boundary between a “Hamiltonian”-like contour and a “Lagrangian”-like contour. We did not perform rigorous symbolic regression on the learned scalar functions. Given the vast number of terms that can be learned by MASS, we believe the results of Figure 11 and Table 2 tell a much richer story about the underlying theories. We performed a thorough numerical analysis of the trained systems, counting the number of Hamiltonian and Lagrangian theories and measuring correlations of the activations, to conclude that the Lagrangian theory generalizes.
To answer the original research question, we chose to use different seeds as proxies for different AI scientists. Although the seed only affects the initialization of the MASS network, we already see strikingly different training behavior (Figure 6). Our initial experiments on varying model width and depth suggest that the extent of agreement increases with model capacity. Preliminary testing on varying architectures, using convolutions and attention instead of pure feed-forward layers, proves much less stable in training, primarily due to the low dimensionality of our data.
Looking ahead, while most of our results, due to the extensive parallelization over many seeds and systems, were conducted on one-dimensional physical systems, preliminary results (Figure 12) show that these can be readily extended to higher dimensions.
MASS offers a tradeoff between inductive bias (through including physical priors such as the Euler-Lagrange equations) and training efficiency. When calculating many of the terms, particularly the Hessian matrix inverses, training slows and becomes unstable, which was solved only with strong regularization and initialization techniques. Nonetheless, these additional terms should not be seen as irrelevant. One should not expect the Euler-Lagrange equations to be the end-all for physics-based machine learning, and certainly not for physics itself.
VII Future Work
There are several low-hanging fruits to extend our work in this paper. We list several below:
1. Coordinate choice. Our experiments, done in generalized coordinates, prevent the Hamiltonian expression from achieving low loss, while the Lagrangian remains a perfect description. Hence, results such as Figure 11 are not extremely surprising. But what happens if we allow MASS to work in arbitrary coordinates? This can be done by allowing MASS to learn a coordinate transform (through a simple MLP) and then taking derivatives in the transformed coordinates [43]. In these coordinates, will MASS still prefer the Lagrangian expression?
2. Loss function. We can modify the loss function to encourage the learning or un-learning of specific theories. In particular, the measure of Hamiltonicity [44] quantifies how “Hamiltonian”-like the theory is. How does including this as a loss term bias MASS into learning different theories?
3. Model architecture. Our choice of variation of AI scientists was across the random initializations. What happens if we modify the architecture completely? Will AI scientists still agree?
4. High dimensions. We show results up to six dimensions in Figure 19. But many physical problems are even higher dimensional. How do we efficiently extend our model to solve those problems?
VIII Conclusion
In this paper, we have developed a novel architecture and training scheme, MASS, and rigorously investigated the evolution of theories studied by MASS across multiple physical systems. Through our experiments, we show that AI scientists, when modeled as high-capacity neural networks, often learn multiple equivalent expressions of the same theory. As we expose our AI scientists to new and more complex systems, some of these theories prove inconsistent with these previously unseen systems, while others successfully generalize to more difficult problems. Even within these surviving theories, the underlying theories change as the number of systems increases, starting from resembling a Hamiltonian and ending up resembling a Lagrangian.
We hope that MASS will not just be an interesting story of Hamiltonian vs. Lagrangian, but will also lay the groundwork for building models that are more interpretable and capable. Then, we will revisit the question: do two AI scientists agree?
Acknowledgement Z.L. and M.T. are supported by IAIFI through NSF grant PHY-2019786. Z.L. is supported by the Google PhD Fellowship.
References
- Kuhn [1970] T. Kuhn, The Structure of Scientific Revolutions, 2nd ed. (University of Chicago Press, 1970).
- Baldi et al. [2014] P. Baldi, P. Sadowski, and D. Whiteson, Searching for exotic particles in high-energy physics with deep learning, Nature Communications 5, 4308 (2014).
- Ball and Brunner [2010] N. M. Ball and R. J. Brunner, Data mining and machine learning in astronomy, International Journal of Modern Physics D 19, 1049 (2010).
- Ramprasad et al. [2017] R. Ramprasad, R. Batra, G. Pilania, A. Mannodi-Kanakkithodi, and C. Kim, Machine learning in materials informatics: recent applications and prospects, npj Computational Materials 3, 1 (2017).
- Pfau et al. [2020] D. Pfau, J. S. Spencer, A. G. D. G. Matthews, and W. M. C. Foulkes, Ab initio solution of the many-electron schrödinger equation with deep neural networks, Phys. Rev. Res. 2, 033429 (2020).
- Schmidt and Lipson [2009] M. Schmidt and H. Lipson, Distilling free-form natural laws from experimental data, Science 324, 81 (2009).
- Cranmer et al. [2020] M. Cranmer et al., Discovering symbolic models from deep learning with inductive biases, in Advances in Neural Information Processing Systems (NeurIPS), Vol. 33 (2020) pp. 17429–17442.
- Jumper et al. [2021] J. Jumper et al., Highly accurate protein structure prediction with AlphaFold, Nature 596, 583 (2021).
- Vaswani et al. [2023] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, Attention is all you need (2023), arXiv:1706.03762 [cs.CL] .
- Devlin et al. [2019] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding (2019), arXiv:1810.04805 [cs.CL] .
- Brown et al. [2020] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, Language models are few-shot learners (2020), arXiv:2005.14165 [cs.CL] .
- Touvron et al. [2023] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, Llama: Open and efficient foundation language models (2023), arXiv:2302.13971 [cs.CL] .
- Jiang et al. [2024] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. de las Casas, E. B. Hanna, F. Bressand, G. Lengyel, G. Bour, G. Lample, L. R. Lavaud, L. Saulnier, M.-A. Lachaux, P. Stock, S. Subramanian, S. Yang, S. Antoniak, T. L. Scao, T. Gervet, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed, Mixtral of experts (2024), arXiv:2401.04088 [cs.LG] .
- Gemma-Team [2024] Gemma-Team, Gemma: Open models based on gemini research and technology (2024), arXiv:2403.08295 [cs.CL] .
- DeepSeek-AI [2025] DeepSeek-AI, Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning (2025), arXiv:2501.12948 [cs.CL] .
- Lu et al. [2024] C. Lu, C. Lu, R. T. Lange, J. Foerster, J. Clune, and D. Ha, The ai scientist: Towards fully automated open-ended scientific discovery (2024), arXiv:2408.06292 [cs.AI] .
- LeCun et al. [2015] Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature 521, 436 (2015).
- Carleo et al. [2019] G. Carleo et al., Machine learning and the physical sciences, Reviews of Modern Physics 91, 045002 (2019).
- Rudin [2019] C. Rudin, Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead, Nature Machine Intelligence 1, 206 (2019).
- Koza [1994] J. R. Koza, Genetic programming as a means for programming computers by natural selection, Statistics and Computing 4, 87 (1994).
- Brunton et al. [2016] S. L. Brunton, J. L. Proctor, and J. N. Kutz, Discovering governing equations from data by sparse identification of nonlinear dynamical systems, Proceedings of the National Academy of Sciences 113, 3932–3937 (2016).
- Udrescu and Tegmark [2020] S.-M. Udrescu and M. Tegmark, Ai feynman: a physics-inspired method for symbolic regression (2020), arXiv:1905.11481 [physics.comp-ph] .
- Greydanus et al. [2019] S. Greydanus, M. Dzamba, and J. Yosinski, Hamiltonian neural networks (2019), arXiv:1906.01563 [cs.NE] .
- Cranmer et al. [2020] M. Cranmer, S. Greydanus, S. Hoyer, P. Battaglia, D. Spergel, and S. Ho, Lagrangian neural networks (2020), arXiv:2003.04630 [cs.LG] .
- Xiao et al. [2024] S. Xiao, J. Zhang, and Y. Tang, Generalized lagrangian neural networks (2024), arXiv:2401.03728 [math.DS] .
- Finzi et al. [2020] M. Finzi, K. A. Wang, and A. G. Wilson, Simplifying hamiltonian and lagrangian neural networks via explicit constraints (2020), arXiv:2010.13581 [cs.LG] .
- Bhattoo et al. [2022] R. Bhattoo, S. Ranu, and N. M. A. Krishnan, Learning articulated rigid body dynamics with lagrangian graph neural network (2022), arXiv:2209.11588 [cs.LG] .
- Bhattoo et al. [2023] R. Bhattoo, S. Ranu, and N. M. A. Krishnan, Learning the dynamics of particle-based systems with lagrangian graph neural networks, Machine Learning: Science and Technology 4, 015003 (2023).
- Allen-Blanchette et al. [2020] C. Allen-Blanchette, S. Veer, A. Majumdar, and N. E. Leonard, Lagnetvip: A lagrangian neural network for video prediction (2020), arXiv:2010.12932 [cs.LG] .
- Toth et al. [2020] P. Toth, D. J. Rezende, A. Jaegle, S. Racanière, A. Botev, and I. Higgins, Hamiltonian generative networks (2020), arXiv:1909.13789 [cs.LG] .
- Loshchilov and Hutter [2019] I. Loshchilov and F. Hutter, Decoupled weight decay regularization (2019), arXiv:1711.05101 [cs.LG] .
- Loshchilov and Hutter [2017] I. Loshchilov and F. Hutter, Sgdr: Stochastic gradient descent with warm restarts (2017), arXiv:1608.03983 [cs.LG] .
- Zhang et al. [2017] C. Zhang, S. Bengio, Y. Singer, and Y. LeCun, Rethinking generalization in deep learning, in Proceedings of the 34th International Conference on Machine Learning, Vol. 70 (2017).
- Allen-Zhu et al. [2019] Z. Allen-Zhu, Y. Li, and Y. Liang, Learning and generalization in overparameterized neural networks, going beyond two layers, Advances in Neural Information Processing Systems (2019).
- Sneath and Sokal [1973] P. H. A. Sneath and R. R. Sokal, Numerical Taxonomy: The Principles and Practice of Numerical Classification (W. H. Freeman, San Francisco, 1973).
- Feynman [1967] R. P. Feynman, The Character of Physical Law (MIT Press, Cambridge, MA, 1967).
- Sohl-Dickstein et al. [2015] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, Deep unsupervised learning using nonequilibrium thermodynamics, in International Conference on Machine Learning (ICML) (2015).
- Du and Mordatch [2019] Y. Du and I. Mordatch, Implicit generation and generalization in energy-based models, in Advances in Neural Information Processing Systems (NeurIPS) (2019).
- Song and Ermon [2019] Y. Song and S. Ermon, Generative modeling by estimating gradients of the data distribution, in Advances in Neural Information Processing Systems (NeurIPS) (2019).
- Song et al. [2021] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, Score-based generative modeling through stochastic differential equations, in International Conference on Learning Representations (ICLR) (2021).
- Runge [1895] C. Runge, Über die numerische auflösung von differentialgleichungen, Mathematische Annalen 46, 167 (1895).
- Kutta [1901] W. Kutta, Beitrag zur näherungsweisen integration totaler differentialgleichungen, Zeitschrift für Mathematik und Physik 46, 435 (1901).
- Chen et al. [2021] Y. Chen, T. Matsubara, and T. Yaguchi, Neural symplectic form: Learning hamiltonian equations on general coordinate systems, in Advances in Neural Information Processing Systems, Vol. 34, edited by M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan (Curran Associates, Inc., 2021) pp. 16659–16670.
- Liu and Tegmark [2022] Z. Liu and M. Tegmark, Machine learning hidden symmetries, Physical Review Letters 128, 10.1103/physrevlett.128.180201 (2022).
- He et al. [2015] K. He, X. Zhang, S. Ren, and J. Sun, Delving deep into rectifiers: Surpassing human-level performance on imagenet classification (2015), arXiv:1502.01852 [cs.CV] .
- Glorot and Bengio [2010] X. Glorot and Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, Vol. 9, edited by Y. W. Teh and M. Titterington (PMLR, Chia Laguna Resort, Sardinia, Italy, 2010) pp. 249–256.
Appendix A Training methods
Parameter | Value |
---|---|
MLP hidden layers | 4 |
MLP width | 20 |
Batch size | 512 |
Steps (per phase) | 10000 |
Linear warmup steps | 100 |
Learning rate | |
Weight decay | 0.01 |
0.7 | |
0.8 | |
EMA | 0.99 |
0.5 | |
0.1 | |
0.01 |
As stated in Section IV, training MASS is extremely unstable. The “ground truth” scalar potentials in many of the systems (Table 1) we are studying induce Hessian matrices that are identically zero, making their inverses hard to compute. To remedy this, we introduce a regularization parameter into each Hessian inverse, and minimize the loss with the inclusion of an additional term for each system. We use a single fixed value of this parameter in all our experiments. In addition, we augment the initializations. Under a typical initialization scheme, such as Kaiming initialization [45] or Xavier initialization [46], the second derivatives are very small, leading to the same problem of exploding inverses. Instead of symbolically optimizing for the variances at each layer [24], we simply augment the input of the MLP with additional input features. Together, these allow for stable training of MASS networks even at high learning rates.
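One common way to realise such a regularization, assumed here purely for illustration, is a diagonal (Tikhonov) shift of the Hessian before inversion. The function name `regularized_inverse` and the parameter `eps` are hypothetical, and the value used in the check below is not taken from our experiments.

```python
import numpy as np

def regularized_inverse(hessian, eps):
    """Invert (H + eps * I) instead of H.

    Assumption: a diagonal shift is used as the regularizer. It keeps the
    inverse finite even when H is identically zero.
    """
    H = np.atleast_2d(np.asarray(hessian, dtype=float))
    return np.linalg.inv(H + eps * np.eye(H.shape[0]))
```

With a zero Hessian the regularized inverse is simply the identity scaled by `1 / eps`, and for a well-conditioned Hessian a small `eps` changes the inverse only slightly.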
To encourage sparsification of terms, we introduce a regularization term on the final layer weights and activations. Note that regularizing the weights alone is not enough, since MASS can simply cheat by increasing the magnitude of and increasing the activations. Let the final layer weights be and the activations be for system , then system-wise we include the regularization term . We use in our experiments.
We report the hyperparameter settings in Table 3.
For the higher dimensional problems, we used larger MLP width ranging from 40 to 100 and up to 6 hidden layers. Correspondingly, the learning rate varied between to .
Appendix B Activations on different 1D systems
In the following set of visualizations (Figures 13 to 15(b)), we support our claims that while the exact terms learnt by MASS differ, the underlying theories, described by the histograms, are mostly identical. We find that this first PCA component corresponds mainly to the ground truth acceleration for simple 1D systems (1 to 4), but not necessarily for more complex systems. It is of future interest to investigate what this direction means and what the theories that have a low correlation to this direction actually represent.




Appendix C Visualizing learned scalar functions









Below we provide some additional visualizations of the learned scalar functions. Note the various shapes: elliptical, parabolic, hyperbolic, and degenerate (where the level curves are nearly straight lines). In general, the shape closely resembles a conic section, in large part due to the nature of these problems: kinetic and potential energy terms are typically second order in the generalized coordinates. Even for the gravitational problem, where we have a $1/r$ potential, the learnt scalar functions still resemble a conic section!
In general, we observe that while the learned scalar functions look different, the differences lie in simple parity swaps (positive to negative, elliptical to hyperbolic), and the learned theories are in fact similar, in line with our discussions in the main paper. The number of curves that are nearly straight lines indicates that many theories lie on the border of a “Hamiltonian” or “Lagrangian” contour.
Appendix D Higher dimensional problems





We can also apply MASS to solve higher dimensional problems, beyond the double pendulum in Figure 12. In particular, we demonstrate the ability of MASS to solve for periodic orbits reliably.
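Another natural extension of the simple pendulum to two dimensions is the spherical pendulum, parameterized by two degrees of freedom, the polar angle $\theta$ and the azimuthal angle $\phi$. We present a typical solution displaying oscillation about an equilibrium conical solution. $\theta$ oscillates in near-harmonic motion while $\phi$ oscillates about a constant drift given by the initial angular velocity.
The exact equations of motion, with $\theta$ measured from the downward vertical, are given by
$$\ddot{\theta} = \sin\theta\cos\theta\,\dot{\phi}^{2} - \frac{g}{l}\sin\theta, \qquad \ddot{\phi} = -2\,\dot{\theta}\,\dot{\phi}\cot\theta,$$
with the energy of the system given by
$$E = \frac{1}{2} m l^{2}\left(\dot{\theta}^{2} + \sin^{2}\theta\,\dot{\phi}^{2}\right) - m g l\cos\theta,$$
and we set all physical constants to 1 for the purpose of this experiment.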
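As a quick numerical check of the spherical-pendulum dynamics, the sketch below writes out the standard accelerations and energy. It assumes that $\theta$ is measured from the downward vertical and that $m = g = l = 1$; the function names are our own, and this convention may differ from the one used to generate the figures.

```python
import math

def spherical_pendulum_accel(theta, theta_dot, phi_dot):
    """Accelerations (theta_ddot, phi_ddot) with m = g = l = 1.

    Assumption: theta is the polar angle measured from the downward vertical.
    """
    s, c = math.sin(theta), math.cos(theta)
    theta_ddot = s * c * phi_dot**2 - s
    phi_ddot = -2.0 * theta_dot * phi_dot * c / s
    return theta_ddot, phi_ddot

def spherical_pendulum_energy(theta, theta_dot, phi_dot):
    """Total energy with m = g = l = 1 (kinetic plus potential)."""
    return 0.5 * (theta_dot**2 + math.sin(theta)**2 * phi_dot**2) - math.cos(theta)
```

Under these assumptions the conical equilibrium satisfies $\dot{\phi}^{2}\cos\theta = 1$: at $\theta = \pi/3$ and $\dot{\phi} = \sqrt{2}$ both accelerations vanish, so the pendulum traces a horizontal circle.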
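Another problem we can tackle with MASS is the $N$-body problem. The $N$-body problem involves $N$ interacting masses under the influence of gravitational forces. The equations of motion for the $N$-body problem are given by:
$$m_i\,\ddot{\mathbf{r}}_i = \sum_{j \neq i} \frac{G\, m_i m_j \left(\mathbf{r}_j - \mathbf{r}_i\right)}{\left\lvert \mathbf{r}_j - \mathbf{r}_i \right\rvert^{3}},$$
where $m_i$ is the mass of the $i$-th body, $\mathbf{r}_i$ is its position vector, and $G$ is the gravitational constant. As with all previous problems, we set all physical constants to 1.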
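The ground-truth accelerations for the gravitational $N$-body problem can be computed with a short vectorized routine. This is a sketch of the standard Newtonian formula with $G = 1$ by default; the function name and array layout are our own choices, not those of the released code.

```python
import numpy as np

def nbody_accelerations(positions, masses, G=1.0):
    """Newtonian accelerations a_i = G * sum_{j != i} m_j (r_j - r_i) / |r_j - r_i|^3.

    positions: array of shape (N, d); masses: array of shape (N,).
    """
    r = np.asarray(positions, dtype=float)
    m = np.asarray(masses, dtype=float)
    diff = r[None, :, :] - r[:, None, :]      # diff[i, j] = r_j - r_i
    dist = np.linalg.norm(diff, axis=-1)      # pairwise distances, shape (N, N)
    np.fill_diagonal(dist, np.inf)            # removes the self-interaction term
    weights = m[None, :, None] / dist[:, :, None] ** 3
    return G * np.sum(weights * diff, axis=1)
```

Since the pairwise forces are equal and opposite, the mass-weighted sum of the accelerations vanishes, which gives a simple sanity check on any trajectory data generated this way.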
For the two-body problem in Cartesian coordinates, represented by , we report the comparisons between the analytic and MASS results in Figure 18, from which we can observe an accurate learning of the behavior including the drift of the two bodies as well as their orbits about the common center of mass. Note that this problem can effectively be reduced to two dimensions with a coordinate transform using the reduced mass, but nonetheless MASS is able to learn the higher dimensional general representation in Cartesian coordinates.
The two-body problem is not so difficult. But what about the three-body problem? This is known to be chaotic. It turns out that we can solve this too! We can use MASS on this problem directly in 6 dimensions, and the result is shown in Figure 19. The initial conditions are chosen to be a deviation from the known stable figure-8 solution, and the result shows that MASS can capture the interaction between all three bodies accurately.
For all the systems presented, we use a fourth-order Runge-Kutta integration solver for the ODE. Together with the accuracy of the MASS solver, the integration solver conserves the energy of the systems well. Again, we are not claiming that MASS is the state-of-the-art solver for physical systems. In fact, many of these toy examples are not solved to the best precision, and some are only exhibited near equilibrium states, where the behavior of the system is regular. Moreover, a persistent problem is the stability of training of MASS, which is accentuated in irregular regimes. Nonetheless, the ability of MASS to be adapted to higher dimensional problems without much change in architecture and hyperparameters is a promising sign for building general and interpretable AI physics scientists in the future.