IBM's MAMMAL Is a Quiet Demonstration That Biomedical AI Is Moving Beyond Single-Purpose Models

IBM Research's MAMMAL is a 458-million-parameter multimodal biomedical foundation model trained on 2 billion biological samples spanning proteins, small molecules, and gene expression data. It achieves state-of-the-art results on 9 of 11 drug discovery benchmarks and outperforms AlphaFold 3 on antibody binding classification for 3 of 4 tested targets. The model weights and code are publicly available on GitHub and Hugging Face for researchers to build on.

The paper is peer-reviewed, the benchmarks are named and specific, and the model is available to download. Those three facts distinguish MAMMAL from a large fraction of biomedical AI announcements, which tend to describe capabilities without releasing the systems that demonstrate them. IBM has published the MAMMAL architecture on arXiv, released the pretrained model as ibm/biomed.omics.bl.sm.ma-ted-458m, and made the fine-tuning codebase accessible under the BiomedSciAI GitHub organisation. Researchers can reproduce the benchmark results, apply the model to their own datasets, and evaluate the AlphaFold 3 comparison for themselves. That is a different category of claim than a press release citing proprietary internal evaluation.

The specific benchmark results are worth being precise about. MAMMAL achieves state-of-the-art performance on 9 of the 11 downstream tasks the paper evaluates, covering classification, regression, and generation tasks across small molecules, proteins, and gene expression profiles. On the molecular toxicity benchmarks ClinTox and BBBP, it achieves AUROC scores of 0.986 and 0.937 respectively, improvements of 4% and 2.2% over the prior state of the art, MoLFormer. On a peptide classification task, it improves F1 by 7.5% over the previous best. The AlphaFold 3 comparison is more specific and more nuanced than the headline suggests: MAMMAL outperforms AF3 on antibody binding classification for three of the four tested targets, including CD206 and VWF, two larger and structurally complex targets, while AF3 outperforms MAMMAL on TBG, a smaller target where AF3's structural prediction advantages hold. This is not a claim that MAMMAL is generally superior to AlphaFold 3. It is a demonstration that for specific binding classification tasks, a modality-aligned foundation model can outperform a structure-prediction system that was not designed for that task type. The two systems are not doing exactly the same thing, and the comparison has to be read with that context in mind.
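For readers less familiar with the metric, AUROC can be read as the probability that a classifier scores a randomly chosen positive example above a randomly chosen negative one. The sketch below computes it directly from that definition. The labels and scores are toy values invented for illustration and have no connection to ClinTox, BBBP, or the paper's evaluation code.

```python
def auroc(labels, scores):
    """Fraction of (positive, negative) pairs in which the positive
    example receives the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only (not taken from the paper).
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
toy_auroc = auroc(toy_labels, toy_scores)
print(f"toy AUROC = {toy_auroc:.3f}")
```

A score of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why values such as 0.986 leave little headroom on a benchmark. The pairwise loop is quadratic in the number of examples; a rank-based implementation would be preferable on real datasets.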

The architectural choice that makes MAMMAL interesting is the alignment mechanism rather than scale. At 458 million parameters, it is not a large model by 2026 standards. Frontier general-purpose models such as GPT-5 and Claude 4 are widely assumed to be orders of magnitude larger, although their parameter counts have not been published. What MAMMAL does differently is align representations across modalities during pretraining, enabling the model to reason about relationships between protein sequences, small molecule structures, and gene expression profiles in a unified embedding space. That is meaningful because drug discovery is fundamentally a multi-entity problem: a candidate molecule does not exist in isolation; it exists in relation to a target protein, which exists in relation to a biological pathway, which exists in relation to a disease context expressed in gene regulation data. A model that encodes relationships across those entities simultaneously can, in theory, make predictions that single-modality models cannot, because it has learned the cross-domain correlations that single-modality training excludes by construction. The 9-of-11 benchmark performance suggests this theoretical advantage is producing real results on standard evaluation tasks, even if the gap between benchmark performance and validated clinical compounds remains vast.
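What a unified embedding space buys in practice can be illustrated with a toy example. If a protein and a set of small molecules are all represented as vectors in the same space, candidate molecules can be ranked against a target by simple vector similarity. The sketch below is not MAMMAL's training procedure or API: the three-dimensional vectors and the names molecule_a, molecule_b, and molecule_c are hypothetical values made up for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, candidates):
    """Return candidate names ordered from most to least similar to query."""
    return sorted(
        candidates,
        key=lambda name: cosine(query, candidates[name]),
        reverse=True,
    )


# Hypothetical embeddings in a shared space (assumed values, for illustration).
protein_target = [0.9, 0.1, 0.3]
molecules = {
    "molecule_a": [0.8, 0.2, 0.4],
    "molecule_b": [-0.5, 0.9, 0.1],
    "molecule_c": [0.1, 0.1, 0.9],
}

ranking = rank_by_similarity(protein_target, molecules)
print(ranking)
```

The comparison is only meaningful because both modalities live in one space. With separately trained single-modality models, the protein vector and the molecule vectors would have unrelated coordinate systems and their similarity would carry no information.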

IBM's positioning in the biomedical AI space is not a recent pivot. The company's AI for Health programme and its foundation models group at IBM Research Haifa have been publishing in computational biology for years. MAMMAL fits into a broader portfolio that includes work on antibody design, vaccine selection, and antigenicity modelling, all of which require exactly the cross-domain reasoning MAMMAL is built for. IBM is not competing for consumer AI mindshare. It is competing for enterprise biotech infrastructure contracts with pharmaceutical and biotech companies that need AI tools integrated into their computational drug discovery workflows. Those workflows are long, expensive, and dominated by bespoke single-purpose models trained on proprietary datasets with narrow transfer capability. A publicly available foundation model that achieves competitive benchmark performance across multiple stages of the drug discovery pipeline, and that enterprise customers can fine-tune on their own data, is a credible entry point into that procurement process.

The single-purpose versus foundation-model question is where the industry is genuinely unsettled. The dominant biotech AI tools in production today (Schrödinger's computational chemistry platform, Atomwise's structure-based drug design system, and Recursion's phenomics platform) are all task-specific systems built around deep domain expertise and proprietary training data. They are valuable precisely because they encode years of accumulated knowledge about specific biological domains and because their outputs have been validated in wet-lab experiments that establish their predictive reliability. A foundation model that achieves competitive benchmark performance without that validation history has demonstrated something real but has not demonstrated clinical utility. The gap between a high AUROC on a standard benchmark and a drug candidate that survives preclinical testing is wide, and it has swallowed many technically impressive AI systems over the past decade of computational drug discovery hype.

What MAMMAL actually represents for the startup ecosystem building in computational biology is a publicly available, high-quality pretraining foundation that can reduce the data and compute requirements for fine-tuned downstream models. A startup working on a specific protein-ligand interaction problem can fine-tune MAMMAL on their proprietary dataset rather than pretraining a domain model from scratch, which compresses the time and capital required to reach competitive performance. IBM's decision to release weights and code rather than commercialise the foundation model directly suggests a platform strategy: create the infrastructure layer, allow the ecosystem to build on it, and compete at the enterprise integration and services level rather than at the model layer. Whether that strategy generates the kind of pharma partnership revenue IBM's enterprise business requires is a separate question from whether the model itself is technically sound. On the technical question, the paper provides evidence that it is. On the commercial question, the evidence will take longer to accumulate.
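The economics of that workflow can be sketched in a few lines. In the simplest variant, often called linear probing, the pretrained model is kept frozen and only a small task-specific head is trained on the proprietary data. The example below substitutes this lighter technique for full fine-tuning and does not use the MAMMAL codebase: frozen_encoder is a hypothetical stand-in for a pretrained embedding function, and the four samples are invented.

```python
import math


def frozen_encoder(sample):
    """Hypothetical stand-in for a pretrained model's embedding function.
    Here it simply passes the sample's numeric features through unchanged."""
    return list(sample)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Probability of the positive class under the logistic head."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_head(features, labels, lr=0.5, epochs=500):
    """Train a logistic-regression head with plain stochastic gradient
    descent; the encoder that produced the features is never updated."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            err = predict(weights, bias, x) - y
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


# Toy "proprietary" dataset, invented for illustration only.
samples = [(2.0, 1.0), (1.5, 2.0), (-1.0, -1.5), (-2.0, -0.5)]
labels = [1, 1, 0, 0]

features = [frozen_encoder(s) for s in samples]
weights, bias = train_head(features, labels)
predictions = [1 if predict(weights, bias, x) >= 0.5 else 0 for x in features]
print(predictions)
```

Only the handful of head parameters are learned here, which is the source of the data and compute savings the paragraph describes. Full fine-tuning would also update the encoder's weights, at higher cost and typically with more labelled data.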
