AI in the Lab: Can Large Language Models Outperform Human Experts in Biomedical Research?


The integration of Large Language Models (LLMs) into scientific research is moving from theory to high-stakes reality. Recent findings suggest that AI-generated code is not just assisting researchers—it is beginning to match, and occasionally surpass, the analytical capabilities of human experts.

A study published in Cell Reports Medicine has highlighted a significant shift in how complex medical data can be processed. By leveraging LLMs, junior researchers—including a graduate student and a high school student—successfully generated highly accurate code to predict preterm birth risks, a task that traditionally requires years of specialized expertise.

The Breakthrough: Speed and Accuracy

The research utilized massive datasets from the DREAM (Dialogue for Reverse Engineering Assessments and Methods) Challenges. These datasets are incredibly complex, involving:
  • Blood transcriptomics: Analyzing RNA to see which genes are active.
  • Epigenetic data: Examining chemical tags on DNA that control gene expression.
  • Microbiome data: Studying bacterial compositions in vaginal fluid.

Traditionally, analyzing these variables to predict gestational age or preterm birth would take months of manual work by highly trained bioinformaticians. However, the junior researchers in this study used simple prompts to task eight different LLMs with the analysis.
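The study's actual prompts and generated code are not reproduced here, but the sketch below illustrates the kind of analysis such a prompt might elicit: a simple nearest-centroid classifier separating "preterm" from "term" profiles. Everything in it is invented for the example (synthetic expression vectors, an assumed two-gene effect); real DREAM data involves thousands of features and far more careful modeling.

```python
import random
import statistics

# Illustrative only: synthetic "expression" vectors standing in for
# blood-transcriptomics features; real datasets have thousands of genes.
random.seed(0)

def make_sample(preterm):
    # Hypothetical effect: preterm samples shift two marker genes upward.
    base = [random.gauss(0, 1) for _ in range(5)]
    if preterm:
        base[0] += 2.0
        base[1] += 2.0
    return base

train = [(make_sample(p), p) for p in [True, False] * 20]

def centroid(samples):
    # Mean expression profile per class, gene by gene.
    return [statistics.mean(col) for col in zip(*samples)]

preterm_c = centroid([x for x, y in train if y])
term_c = centroid([x for x, y in train if not y])

def dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

def predict(x):
    # Nearest-centroid rule: assign the class whose mean profile is closer.
    return dist(x, preterm_c) < dist(x, term_c)

test = [(make_sample(p), p) for p in [True, False] * 10]
accuracy = sum(predict(x) == y for x, y in test) / len(test)
```

The point is not the specific classifier but the workflow: a one-sentence prompt can yield runnable analysis code, which is exactly why the study's junior researchers could move so quickly.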

The results were striking. Four models produced functional code: DeepSeek-R1, Gemini, and two ChatGPT variants (o3-mini-high and 4o). Notably, OpenAI’s o3-mini-high performed as well as the original human expert teams and even outperformed them in certain epigenetic analyses.

Perhaps most significantly, the timeline for discovery has been compressed:
  • Human teams: Took years to complete similar analyses.
  • AI-assisted junior researchers: Produced results in three months and a completed manuscript within six.

The Evolution Toward “Agentic” AI

The current wave of AI assistance is moving toward “agentic” AI. Unlike standard chatbots that simply respond to prompts, agentic systems are designed to act as autonomous researchers. They can:
1. Develop multi-step research workflows.
2. Iterate on their own work to correct errors.
3. Execute tasks like searching the internet or running code independently.
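The three capabilities above can be sketched as a plan-execute-check loop. The `llm_*` functions below are placeholders standing in for real model calls, not any specific product's API; the structure, not the stubs, is the point.

```python
# Minimal sketch of an "agentic" loop: plan, execute, check, and retry.
# All llm_* functions are hypothetical stand-ins for real model calls.

def llm_plan(goal):
    # Placeholder: a real system would ask the model for a step list (step 1).
    return ["load data", "fit model", "evaluate"]

def llm_execute(step, state):
    # Placeholder: a real system might generate and run code here (step 3).
    state.append(step)
    return state

def llm_check(state):
    # Placeholder self-critique: did every planned step complete? (step 2)
    return len(state) == 3

def run_agent(goal, max_retries=3):
    for attempt in range(max_retries):
        state = []
        for step in llm_plan(goal):
            state = llm_execute(step, state)
        if llm_check(state):
            return state  # success: the workflow passed its own check
    return None  # gave up after repeated failures

result = run_agent("predict preterm birth from transcriptomics")
```

In a real agentic system each stub would be a model call or a sandboxed code execution, and the check step is what distinguishes an agent that iterates on its own work from a chatbot that answers once.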

However, this autonomy brings a significant “accuracy gap.” A study in Nature Biomedical Engineering found that when LLMs were allowed to create workflows entirely on their own, their accuracy dropped below 40%.

To close this gap, researchers are moving toward a “human-in-the-loop” framework. By requiring the AI to present a step-by-step plan for human review before executing code, accuracy rose from below 40% to 74%. This suggests that the future of AI in science is not about replacing scientists, but about augmenting them through supervised reasoning.
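The review-before-execute idea can be expressed as a simple gate: no step runs until the whole plan is approved. The function names below are illustrative, not taken from the study, and the reviewer is modeled as a callback standing in for a human.

```python
# Sketch of a human-in-the-loop gate: the model's plan must be approved
# before any code runs. All names here are hypothetical.

def propose_plan(task):
    # Placeholder for the LLM's step-by-step plan.
    return [f"step {i}: ..." for i in (1, 2, 3)]

def review(plan, approve):
    # 'approve' stands in for a human reviewer's per-step decision.
    return all(approve(step) for step in plan)

def run_with_oversight(task, approve, execute):
    plan = propose_plan(task)
    if not review(plan, approve):
        return None  # nothing executes without sign-off
    return [execute(step) for step in plan]

# A reviewer who approves everything vs. one who rejects everything.
ran = run_with_oversight("analysis", lambda s: True, lambda s: s.upper())
blocked = run_with_oversight("analysis", lambda s: False, lambda s: s.upper())
```

The design choice is that approval is all-or-nothing over the visible plan: the reviewer sees the intermediate steps before execution, which is precisely the visibility the framework is meant to provide.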

Challenges: Standards, Safeguards, and “AI Slop”

As AI becomes a permanent fixture in the laboratory, the scientific community faces three critical hurdles:

  • The Benchmarking Problem: AI evolves so rapidly that by the time a standard benchmark is created to test it, the models have already surpassed it. Researchers at Stanford are currently working to establish standardized medical benchmarks to keep pace with this evolution.
  • The Supervision Requirement: Experts warn against “blind trust.” The goal is to integrate AI into the scientific method without sacrificing rigor or creating “AI slop”—low-quality, unverified research output.
  • The Perfection Myth: There is a tendency to hold AI to an impossible standard of perfection. As computer science professor Ian McCulloh notes, the objective is not for AI to be flawless, but for it to perform more reliably and accurately than humans do.

“The goal is not to ask researchers to blindly trust an AI system,” says study co-author Zifeng Wang. “The goal is to design frameworks where the reasoning, planning, and intermediate steps are visible enough that researchers can supervise and validate the process.”

Conclusion

AI is rapidly lowering the barrier to entry for complex biomedical analysis, turning years of work into months. While the potential to improve maternal and infant health is immense, the scientific community must prioritize rigorous human oversight and new standardized benchmarks to ensure these powerful tools remain reliable.