"There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools."
-Leo Breiman, like 24 years ago
Machine learning isn't the native language of biology, the author just realized that there's more than one approach to modeling. I'm a statistician working in an ML role and most of the issues I run into (from a modeling perspective) are the reverse of what this article describes - people trying to use ML for the precise things inferential statistics and mechanistic models are designed for. Not that the distinction is that clear to begin with.
"For example, the Lotka-Volterra model accurately captures predator-prey dynamics using systems of differential equations."
This is incorrect. The validation of the L-V predator/prey model was considered to be the population dynamics of the Snow Shoe Hare and Canada Lynx as seen in Hudson Bay Company records. The data actually models the fashion cycles in Europe, showing prices and demand from Europe drove the efforts of the Company and the trappers. This is in the standard texts from at least the mid 90s AFAIK.
This person seems to work in a field (exercise / athletics) with an abundance of data, low stakes outcomes, reasonably well established biomarkers, etc. in other words, a field perfectly suited for a top down outcome driven analysis.
IMO the post is merely stating: "man, everyone should be doing this!" Without realizing that (1) everyone is doing this, and (2) it doesn't seem like it because many (most?) fields in biology don't work in the top down approach being suggested. Determining mechanism and function is vital in biology because in a lot of cases there just isn't the data to perform a fuzzy outcome driven analysis.
The problem with this machine-learned “predictive biology” framework is that it doesn’t have any prescription for what to do when your predictions fail. Just collect more data! What kind of data? As the author notes, the configuration space of biology is effectively infinite so it matters a great deal what you measure and how you measure it. If you don’t think about this (or your model can’t help you think about it) you’re unlikely to observe the conditions where your predictions are incorrect. That’s why other modeling approaches care about tedious things like physics and causality. They let you constrain the model to conditions you’ve observed and hypothesize what missing, unobserved factors might be influencing your system.
It’s also a bit arrogant in presuming that no other approaches to modeling cells cared about “prediction”. Of course, systems and mathematical biologists care about making accurate predictions, they just also care about other things like understanding molecular interactions *because that lets you make better predictions*
Not to be cynical but this seems like an attempt to export benchmark culture from ML into bio. I think that blindly maximizing test set accuracy is likely to lead down a lot dead end paths. I say this as someone actively doing ML for bio research.
Also predictions in biology take months or years to validate, so they lack the fast feedback loop of the vision and NLP world where the feedback is almost instant.
Combine this with the fact that In vivo data in biology is extremely limited, and we see copying the NLP and vision playbook into biology is challenging
This. Many of the predictions we're talking about are potentially years in the making, involve expensive data collection to validate, suffer from a lot of stochastic noise, etc.
Honestly even if a prediction comes an experiment, and they know exactly how the experiment was done, it takes month to years to follow up and verify.
Generative AI is basically going to flood the field with more predictions, but with little explanation of how, and doing nothing to alleviate the downstream verification process.
That's a lot of words, including a sentence that in which the author almost compares himself with Galileo. The proof is in the pudding no? What did you predict with it?
The author claims that "machine learning methods better describe many biological systems than traditional mathematical formulations", but I see very little concrete evidence in the article to support it.
Biological systems can be described via diff equations, e.g. neural cells can be analyzed with hodgkin-huxley type models and this can lead to bottom-up theories of biological neural networks. ML is used to approximate other more complex processes but that doesn't mean that it s impossible
This is an inaccurate statement. Geocentrism makes identical predictions to heliocentrism, but clearly the two models offer differing explanations of the dynamics of the solar system.
From an engineering perspective, yes, predictions are all that you care about. From a scientific perspective, the end goal is the simplest and most general set of explanations possible.
Explanations are also useful, because people often find them interesting.
Some things are valuable, because they keep us alive and healthy in the short term. Some things are valuable, because we find them interesting, enjoyable, or something like that. And some things are indirectly valuable, because they enable other things that are more directly valuable.
IMHO, this article makes grand claims but doesn't substantiate them.
In what way is ML-based biology any different from the myriad statistics-based mechanistic models that systems or computational biology has employed for 50 years to model biological mechanisms and processes? Does the author claim that theory-less parameterless ML models like those in deep NNs are superior because theory-based explicitly parameterized models are doomed to fail? If so, then some specific examples / illustrations would go a long way toward making your case.
I generally enjoyed the article. Maybe it's because the classical functional categorization/cataloging approaches in molecular biology are rarely sufficient to explain experimental data unless you are an expert and know all the exceptions and special cases. So the Predictive Biology approach seems a promising path, particularly since a lot of data for ML training is available.
That said, the formulation "machine learning is the native language of biology" seems odd.
Look, we're all going to sit around cringing until someone says it; machine learning is explicitly the natural language of computers. In nature, neurons are not arranging themselves into neat unsigned 8-bit integers to quantize themselves for recollection. They're also networked by synapses and reactive biology, not feedforward algorithms scanning static, hereditary weights.
This whole thing feels like the author is familiar with one set of abstractions but not the other. It's very reminiscent of the (intensely fallible) Chomsky logic that leads to insane extrapolations about what biology is or isn't. Machine learning is a model, and all models are wrong.
"There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools."
-Leo Breiman, "Statistical Modeling: The Two Cultures," about 24 years ago
Machine learning isn't the native language of biology, the author just realized that there's more than one approach to modeling. I'm a statistician working in an ML role and most of the issues I run into (from a modeling perspective) are the reverse of what this article describes - people trying to use ML for the precise things inferential statistics and mechanistic models are designed for. Not that the distinction is that clear to begin with.
This is largely my feeling as well.
In the third paragraph the authors state:
"For example, the Lotka-Volterra model accurately captures predator-prey dynamics using systems of differential equations."
This is incorrect. The canonical validation of the Lotka-Volterra predator-prey model was long taken to be the population dynamics of the snowshoe hare and Canada lynx as seen in Hudson's Bay Company records. But that data actually reflects fashion cycles in Europe: prices and demand from Europe drove the efforts of the Company and its trappers. This has been in the standard texts since at least the mid 90s, AFAIK.
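For anyone who hasn't seen it, the model being argued about is just a pair of coupled ODEs. Here's a minimal forward-Euler sketch; the parameter values are illustrative defaults, not fitted to the Hudson's Bay records or any real dataset:

```python
# Lotka-Volterra predator-prey model:
#   dx/dt = alpha*x - beta*x*y    (prey)
#   dy/dt = delta*x*y - gamma*y   (predator)
def lotka_volterra(x0, y0, alpha=1.1, beta=0.4, delta=0.1, gamma=0.4,
                   dt=0.001, steps=50000):
    """Integrate with forward Euler; returns prey and predator trajectories."""
    x, y = x0, y0
    xs, ys = [x], [y]
    for _ in range(steps):
        dx = (alpha * x - beta * x * y) * dt
        dy = (delta * x * y - gamma * y) * dt
        x, y = x + dx, y + dy
        xs.append(x)
        ys.append(y)
    return xs, ys

prey, pred = lotka_volterra(10.0, 10.0)
# The two populations cycle around an equilibrium rather than settling down.
```

Forward Euler slowly inflates the oscillation amplitude, so for anything serious you'd use an adaptive integrator, but the qualitative boom-bust cycling is the point here.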
This person seems to work in a field (exercise / athletics) with an abundance of data, low stakes outcomes, reasonably well established biomarkers, etc. in other words, a field perfectly suited for a top down outcome driven analysis.
IMO the post is merely stating "man, everyone should be doing this!" without realizing that (1) everyone is doing this, and (2) it doesn't seem like it because many (most?) fields in biology don't lend themselves to the top-down approach being suggested. Determining mechanism and function is vital in biology because in a lot of cases there just isn't enough data to perform a fuzzy outcome-driven analysis.
The problem with this machine-learned “predictive biology” framework is that it doesn’t have any prescription for what to do when your predictions fail. Just collect more data! What kind of data? As the author notes, the configuration space of biology is effectively infinite so it matters a great deal what you measure and how you measure it. If you don’t think about this (or your model can’t help you think about it) you’re unlikely to observe the conditions where your predictions are incorrect. That’s why other modeling approaches care about tedious things like physics and causality. They let you constrain the model to conditions you’ve observed and hypothesize what missing, unobserved factors might be influencing your system.
It’s also a bit arrogant to presume that no other approaches to modeling cells cared about “prediction”. Of course systems and mathematical biologists care about making accurate predictions; they just also care about other things, like understanding molecular interactions, *because that lets you make better predictions*.
Not to be cynical, but this seems like an attempt to export benchmark culture from ML into bio. I think that blindly maximizing test-set accuracy is likely to lead down a lot of dead-end paths. I say this as someone actively doing ML for bio research.
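To make the worry concrete, here's a toy simulation (purely illustrative, no real data or models involved): if you select among many candidate models by their score on one fixed test set, the winner looks good even when the labels are pure noise and no model has any real signal.

```python
import random

random.seed(0)
n_test, n_models = 50, 1000

# Pure-noise "ground truth" labels: there is nothing to learn.
labels = [random.randint(0, 1) for _ in range(n_test)]

# Each "model" is just a random guesser; pick the best by test accuracy.
best_acc = 0.0
for _ in range(n_models):
    preds = [random.randint(0, 1) for _ in range(n_test)]
    acc = sum(p == y for p, y in zip(preds, labels)) / n_test
    best_acc = max(best_acc, acc)

# Any single random model scores ~0.5 in expectation, but the best of
# 1000 on a 50-example test set will score well above chance.
```

The inflation shrinks with a larger test set or a held-out confirmation set, which is exactly the kind of safeguard that gets skipped when the benchmark number is the goal.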
Also, predictions in biology take months or years to validate, so they lack the fast feedback loop of the vision and NLP worlds, where feedback is almost instant.
Combine this with the fact that in vivo data in biology is extremely limited, and we can see that copying the NLP and vision playbook into biology is challenging.
This. Many of the predictions we're talking about are potentially years in the making, involve expensive data collection to validate, suffer from a lot of stochastic noise, etc.
Honestly, even if a prediction comes from an experiment, and they know exactly how the experiment was done, it takes months to years to follow up and verify.
Generative AI is basically going to flood the field with more predictions, but with little explanation of how, while doing nothing to alleviate the downstream verification process.
That's a lot of words, including a sentence in which the author almost compares himself to Galileo. The proof is in the pudding, no? What did you predict with it?
The author claims that "machine learning methods better describe many biological systems than traditional mathematical formulations", but I see very little concrete evidence in the article to support it.
Biological systems can be described via differential equations; e.g., neural cells can be analyzed with Hodgkin-Huxley-type models, and this can lead to bottom-up theories of biological neural networks. ML is used to approximate other, more complex processes, but that doesn't mean it's impossible.
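As a concrete (if heavily simplified) illustration: the FitzHugh-Nagumo model, a classic two-variable reduction of Hodgkin-Huxley-style excitability, reproduces repetitive spiking in a few lines. Parameter values below are the common textbook defaults, not fitted to any particular neuron:

```python
def fitzhugh_nagumo(I=0.5, a=0.7, b=0.8, tau=12.5, dt=0.01, steps=20000):
    """Forward-Euler integration of the FitzHugh-Nagumo equations:
        dv/dt = v - v**3/3 - w + I    (fast membrane-potential-like variable)
        dw/dt = (v + a - b*w) / tau   (slow recovery variable)
    Returns the trace of v over time.
    """
    v, w = -1.0, 1.0
    vs = []
    for _ in range(steps):
        dv = (v - v**3 / 3 - w + I) * dt
        dw = ((v + a - b * w) / tau) * dt
        v, w = v + dv, w + dw
        vs.append(v)
    return vs

trace = fitzhugh_nagumo()
# With sustained input current I, the fast variable spikes repeatedly,
# crossing zero on each cycle instead of settling to a fixed point.
```

The full Hodgkin-Huxley model adds gating variables and fitted rate functions, but the mechanistic structure, not a learned function approximator, is what makes the bottom-up story possible.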
Science isn't about making predictions primarily, it's about explanations.
Explanations in turn are tools whose only purpose is to make predictions.
This is an inaccurate statement. Geocentrism makes identical predictions to heliocentrism, but clearly the two models offer differing explanations of the dynamics of the solar system.
From an engineering perspective, yes, predictions are all that you care about. From a scientific perspective, the end goal is the simplest and most general set of explanations possible.
In fact, geocentric models made better predictions than early heliocentric ones because epicycles allowed a better fit to the data.
Explanations are also useful, because people often find them interesting.
Some things are valuable, because they keep us alive and healthy in the short term. Some things are valuable, because we find them interesting, enjoyable, or something like that. And some things are indirectly valuable, because they enable other things that are more directly valuable.
IMHO, this article makes grand claims but doesn't substantiate them.
In what way is ML-based biology any different from the myriad statistics-based mechanistic models that systems or computational biology has employed for 50 years to model biological mechanisms and processes? Does the author claim that theory-free, black-box ML models like deep NNs are superior because theory-based, explicitly parameterized models are doomed to fail? If so, then some specific examples / illustrations would go a long way toward making the case.
I generally enjoyed the article. Maybe it's because the classical functional categorization/cataloging approaches in molecular biology are rarely sufficient to explain experimental data unless you are an expert and know all the exceptions and special cases. So the Predictive Biology approach seems a promising path, particularly since a lot of data for ML training is available.
That said, the formulation "machine learning is the native language of biology" seems odd.
Look, we're all going to sit around cringing until someone says it: machine learning is really the native language of computers. In nature, neurons are not arranging themselves into neat unsigned 8-bit integers to quantize themselves for recollection. They're also networked by synapses and reactive biology, not feedforward algorithms scanning static, hereditary weights.
This whole thing feels like the author is familiar with one set of abstractions but not the other. It's very reminiscent of the (intensely fallible) Chomsky logic that leads to insane extrapolations about what biology is or isn't. Machine learning is a model, and all models are wrong.
What do you mean by Chomsky logic?
Nah, they mean UG and his theorizing about the in-born language faculties of the human brain.
But there's nothing intrinsically fallacious about positing UG, nor does it require crazy extrapolations.
I agree with you, I'm just pointing out what (imo) OP was referring to.