To evaluate the performance and generalization capabilities of ibm/biomed.omics.bl.sm.ma-ted-458m, we selected a diverse set of existing benchmarks spanning multiple task types and stages of the drug discovery pipeline, prioritizing benchmarks with clearly defined splits where available. We assessed model quality through a fine-tuning-based evaluation strategy, in which the pretrained model is adapted to each benchmark and compared against specialized state-of-the-art (SOTA) models. The evaluation methodology and fine-tuning protocol, as well as detailed descriptions of each benchmark—including background, significance for drug discovery, prior models, and data statistics—are provided in the subsections below. A summary of performance across tasks is presented in Table 1 and visualized in Fig. 1E, and representative encoder-decoder examples are provided in Supplementary Table S1.
Fig. 1: Overview of MAMMAL pretraining data, model architecture, and downstream tasks.
A We introduce a multi-align model pretrained on six datasets, each containing tens to hundreds of millions of data points. These data points include protein sequences, small molecules, and gene expression profiles, with a combined sample size of 2 billion. B The multi-align model combines flexible encoder-only and encoder-decoder components. It takes sequences as input, which may contain any combination of tokens and scalar elements, processed by an encoder stack consisting of self-attention blocks. In encoder-only mode, a dedicated token prediction head outputs logits for token predictions, with an optional scalar prediction head for scalar outputs. In encoder-decoder mode, residual connections inject features from the encoder’s final hidden layer into each decoder layer, and a decoder-specific prediction head outputs the final logits. C Diverse downstream tasks performed by the multi-align model, mapped to their contributions within the steps of a typical drug discovery pipeline. D Diverse downstream tasks performed by the multi-align model, categorized by data type used in the fine-tuning process. E Performance of the multi-align model across a diverse set of tasks compared to SOTA. Panel (E) was generated using Matplotlib. Panels (A–D) were created using Illustrator and PowerPoint.
Table 1 Comparison of SOTA and MAMMAL Performance Across Benchmarks
AlphaFold22, whose development contributed to the 2024 Nobel Prize in Chemistry, revolutionized protein structure prediction. Its extension AlphaFold-Multimer23 enabled modeling of antibody-antigen complexes, while AlphaFold 3 (AF3)24 further improved accuracy and added nucleic acid/small molecule support. Motivated by AF3’s reported advances, we evaluated its performance on therapeutic Antibody and Nanobody complexes (Subsection 2.10). Comparative analysis reveals that MAMMAL achieves better classification performance than AF3 in five of seven targets (Table 2).
Fig. 2: AF3-predicted nanobody binding poses on HER2 and TBG.
a HER2 extracellular domain (ECD) structure with representative AF3-predicted complexes for a binder and a non-binder. The FDA-approved therapeutic antibodies trastuzumab (blue) and pertuzumab (purple) are shown for reference. AF3 predicts both binders and non-binders engaging the same region of the HER2 ECD, which is distinct from the known therapeutic epitopes, consistent with its poor discriminative performance on this target (AUROC = 0.45). b Thyroxine-binding globulin (TBG) structure with AF3-predicted complexes for binding and non-binding VHHs. In contrast to HER2, AF3 predicts distinct binding poses for binders versus non-binders on TBG, consistent with its strong discriminative performance on this target. Visualizations were generated using PyMOL.
Evaluation
We compiled a comprehensive set of 11 benchmarks covering multiple data domains and task types, including classification, regression and generation, as well as single-entity, multi-entity, and multi-domain tasks. These benchmarks address key stages of the drug discovery process: Identifying target cell types (Cell Type) and advancing precision medicine (Cancer-Drug Response 1-3); predicting drug efficacy (BBBP) and safety (ClinTox); predicting the binding affinity of small-molecule drugs to target proteins (DTI); predicting interactions of biological drugs (PPI); and designing new drugs, such as antibodies, to target specific proteins (Ab Infilling).
To enable fair and direct comparison to prior work, benchmark selection prioritized datasets with predefined train, validation, and test splits or with established splitting strategies reported in the corresponding state-of-the-art studies. For each benchmark, we followed the data splits and evaluation metrics used in the original benchmark or SOTA reference. When explicit train/validation/test splits were available, ibm/biomed.omics.bl.sm.ma-ted-458m was fine-tuned on the training set, the best checkpoint was selected using the validation set, and final performance was reported on the test set. For benchmarks evaluated using cross-validation, we adopted the same protocol as the corresponding prior work. Unless otherwise noted, standard errors were estimated by training the models with three different random seeds and calculating the standard deviation of their performance on the held-out test set. Detailed descriptions of each benchmark, the fine-tuning procedures, and the evaluation protocols are provided alongside the corresponding results. For the DTI benchmark, performance is reported using normalized root mean square error (NRMSE), defined as the root mean square error divided by the standard deviation of the test labels. The same normalization is applied to both MAMMAL and reported SOTA results, yielding values below 1 and enabling joint visualization alongside other performance metrics in Fig. 1E. We consider MAMMAL to outperform the existing state of the art when the relative improvement, computed as ∣SOTA − MAMMAL∣/SOTA, exceeds 1%.
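As a concrete illustration, the NRMSE normalization and the relative-improvement criterion described above can be sketched as follows (function names are ours, for illustration only):

```python
import math

def nrmse(y_true, y_pred):
    """Root mean square error divided by the standard deviation of the
    true (test) labels, as used for the DTI benchmark."""
    n = len(y_true)
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    mean = sum(y_true) / n
    std = math.sqrt(sum((t - mean) ** 2 for t in y_true) / n)
    return rmse / std

def relative_improvement(sota, mammal):
    """|SOTA - MAMMAL| / SOTA; values above 0.01 (1%) count as
    outperforming the state of the art."""
    return abs(sota - mammal) / sota
```

Note that predicting the test-label mean for every sample yields an NRMSE of exactly 1, which is why the normalized values fall below 1 for any model that beats this trivial baseline.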
Cell Type Annotation
Cell type prediction enables researchers to distinguish between different cell populations, such as those associated with various diseases11,12,13,14. It is also crucial for understanding how diseases or drugs affect different cell types. In recent years, a variety of methods have been developed for this task, including approaches based on marker genes, correlation-based techniques, and annotation using classification25. Recent advances in transformer-based and large-scale foundation models26,27,28 have shown improved performance.
The input for this task is single-cell gene expression data. The benchmark we used was based on the Zheng68k dataset29, which is composed of human peripheral blood mononuclear cells and is widely used for evaluating cell-type annotation performance, due to the similarity of the cell types involved. The dataset contains 68,579 cells across 11 cell types and originally included 32,738 genes; after removing non-expressed genes, 20,387 genes remain in the benchmark. Preprocessing involved normalization and log transformation of expression values, followed by binning. Similar to the approach in30, our model uses a ranked list of expressed gene names, ordered by their expression levels, as input. The label to predict is provided in the cell ontology format “CL:NNNNNN” (see Supplementary Table S1).
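A toy sketch of the ranked-gene-list input construction is shown below. The normalization constant, tie-breaking, and gene names are illustrative assumptions; the actual pipeline (including the binning step) is not reproduced here.

```python
import math

def expression_to_ranked_genes(counts, top_k=None):
    """Toy sketch: library-size normalize raw counts, log1p-transform,
    and return gene names sorted by descending expression.
    `counts` maps gene name -> raw count; non-expressed genes are dropped.
    Ties are broken alphabetically for determinism (an assumption)."""
    total = sum(counts.values())
    logged = {g: math.log1p(1e4 * c / total) for g, c in counts.items() if c > 0}
    ranked = sorted(logged, key=lambda g: (-logged[g], g))
    return ranked[:top_k] if top_k is not None else ranked
```

For example, a cell expressing LYZ most highly would yield a token sequence beginning with "LYZ", which the model consumes alongside the cell-ontology label target.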
Following prior work27, we adopted a 5-fold cross-validation strategy to fine-tune and evaluate ibm/biomed.omics.bl.sm.ma-ted-458m, ensuring similar proportions of cell types across folds, and assessed performance using accuracy and macro F1 score. MAMMAL outperforms the previous state-of-the-art performance in both accuracy and F1 (Table 1 and detailed results in Supplementary Table S2), achieving a 7.5% improvement in F1.
BBBP and ClinTox
To ensure the development of safe and effective drugs, candidates must satisfy rigorous criteria related to both efficacy and safety. In this study, we selected two relevant benchmarks from MoleculeNet31, a widely used suite of benchmarks for evaluating machine learning models on small-molecule drug properties: BBBP and ClinTox. The BBBP benchmark focuses on predicting the ability of drugs to penetrate the blood-brain barrier, a critical consideration for drugs targeting the central nervous system. The ClinTox benchmark comprises two related tasks: (1) predicting failure in clinical toxicity trials, and (2) predicting FDA approval status. The overall performance on ClinTox is reported as the average performance across these two tasks.
MoLFormer32, a well-established model for molecular embeddings trained on 1.1 billion SMILES sequences, has achieved state-of-the-art performance on both the BBBP and ClinTox benchmarks. In our study, we adopted the benchmarks from32, which provided predefined splits for training, validation, and testing. MAMMAL surpasses MoLFormer on both benchmarks (Table 1), achieving an average area under the receiver operating characteristic curve (AUROC) score of 0.957 on BBBP and 0.986 on ClinTox, representing improvements of 2.2% and 4%, respectively, over the state of the art.
Cancer-Drug Response
Identifying drug response at the cellular level is a critical step in the development of new drugs. Two key public databases supporting this effort, particularly in cancer drug development, are the Cancer Cell Line Encyclopedia (CCLE)33 and the Genomics of Drug Sensitivity in Cancer (GDSC)34. CCLE provides multi-omics profiles for around 1000 cancer cell lines, while GDSC offers data on the drug responses of these lines to hundreds of drugs, commonly measured using the half-maximal inhibitory concentration (IC50). Notable computational models addressed this task35,36,37.
For our study, we used three subsets of the GDSC database: GDSC1 and GDSC2, available through the Therapeutics Data Commons (TDC)38, and referred to in the paper as Cancer-Drug Response 1 and Cancer-Drug Response 2, respectively; and a subset published in36, referred to as Cancer-Drug Response 3. A dataset statistics table summarizing the number of cell lines, drugs, and cell–drug pairs is provided in Supplementary Table S3. We used the random splits provided by TDC for Cancer-Drug Response 1 and 2, while for Cancer-Drug Response 3, we followed the split methodology outlined in36, reserving 5% of the data for the test set, stratified by TCGA39 pathways associated with the cancer cell lines.
During fine-tuning, we used only gene-expression profiles and SMILES representations of drugs, as shown in the example prompt in the Supplementary Table S1. Similar to the input format for cell type annotation, gene-expression profiles were provided as ranked lists of gene names based on their expression levels. For predicting continuous IC50 values, MAMMAL was utilized in regression mode, taking advantage of its built-in support for floating-point scalar predictions. Our model outperforms the current SOTA models for Cancer-Drug Response 1 and 2 (Table 1), achieving a 3.4% increase in Pearson correlation values. Additionally, it yields results comparable to the SOTA for the Cancer-Drug Response 3 benchmark, with a slight improvement of 0.5%.
To further evaluate MAMMAL’s predictive capability on novel compounds, we assessed drug response predictions for four drugs not present in the GDSC training data: Carfilzomib, Nintedanib, Infigratinib, and Vemurafenib. Tanimoto similarity analysis confirmed that three of these drugs (Carfilzomib, Nintedanib, and Infigratinib) have no structurally similar compounds in the training set (Tanimoto coefficient <0.7), while Vemurafenib shares moderate similarity (0.82) with PLX-4720, a BRAF inhibitor present in GDSC. We performed experimental validation using the same assay protocol employed in GDSC: cell viability was measured using CellTiter-Glo following 72-hour drug incubation, and IC50 values were determined with Prism (GraphPad). The experimental measurements revealed a consistent potency ranking across all tested cell lines: Carfilzomib (most potent), followed by Nintedanib, Infigratinib, and Vemurafenib (least potent). MAMMAL predictions reproduced this exact ranking for the tested cell lines. When extended to all 805 cell lines in GDSC, the model preserved this relative ordering in approximately 90–95% of cases, suggesting that the predicted potency differences are largely cell line–independent.
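The Tanimoto similarity used for the novelty check above is the standard set-overlap coefficient on molecular fingerprints. A minimal sketch (operating on fingerprints already expressed as sets of on-bit indices; the fingerprinting itself, typically done with a cheminformatics toolkit such as RDKit, is outside this snippet):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two binary fingerprints given as
    sets of on-bit indices: |A ∩ B| / |A ∪ B|. A value below 0.7 was
    used above as the threshold for 'no structurally similar compound'."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Identical fingerprints give 1.0 and disjoint ones 0.0, so the 0.82 similarity between Vemurafenib and PLX-4720 indicates substantial shared substructure.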
Notably, Carfilzomib is a proteasome inhibitor approved exclusively for hematological malignancies (multiple myeloma), with limited efficacy in cells of solid tumors40. The model’s prediction of Carfilzomib as the most potent agent across diverse solid tumor cell lines aligns with our experimental observations and suggests potential broader applicability that warrants further investigation.
Ab Infilling
Antibodies are a family of proteins produced by the immune system to neutralize foreign antigens and are of particular interest due to their high specificity and strong binding to target molecules41,42. These characteristics have made them a crucial class of therapeutics, driving significant research efforts into the design of new antibody-based drug candidates7,43,44,45. Antigen-binding fragments (Fabs) are the antibody fragments that bind to antigens. Each Fab is composed of one constant domain and one variable domain from each of the heavy and light chains. Each variable region is further divided into four framework (FR) regions and three complementarity-determining regions (CDRs). While FR regions are typically conserved, CDRs exhibit significant variation in their amino acid composition and are generally the primary determinants of binding affinity to the target antigen. When designing novel antibodies for a specific antigen, the typical approach is to explore alternative CDRs that could produce a new, functional antibody with high binding affinity to the target41,42,46,47.
Recently, several deep learning methods have been developed for targeted antibody design, framing CDR prediction as an infilling task46,47,48,49,50,51,52. These models predict missing CDR regions, represented by MASK tokens, using the amino acid sequences of both the antigen and the antibody’s FR regions. While prior approaches often relied on structural data, this information is scarce and challenging to obtain53. In contrast, we fine-tune MAMMAL for the targeted antibody design task using only the sequence data of the antigen and the sequence of the antibody’s FR regions.
The targeted antibody design task benchmark is based on the SAbDab dataset53. Following the data processing outlined in47, we filtered out samples with missing CDRs to enable direct comparison, even though MAMMAL supports samples that contain missing CDRs. Consistent with47, we randomly partitioned the dataset into training, validation, and test folds while ensuring that samples with similar heavy-chain third CDR (CDRH3) sub-sequences remained in the same fold. MAMMAL demonstrates superior amino acid recovery (AAR), defined as the fraction of correctly predicted residues, across all masked CDRs (Table 1; detailed results are provided in Supplementary Table S4). Notably, in CDRH3, the most variable region, it exhibits a remarkable improvement of 19%.
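Amino acid recovery, the metric reported above, can be computed directly from the predicted and reference CDR sequences; a minimal sketch:

```python
def amino_acid_recovery(predicted, reference):
    """AAR: fraction of positions where the predicted residue matches
    the reference. CDR infilling preserves the masked region's length,
    so the two sequences are assumed to be equally long."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference CDRs must be the same length")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)
```

For example, a predicted CDRH3 matching the reference at 3 of 4 positions scores an AAR of 0.75; the benchmark averages this quantity over all masked CDRs in the test fold.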
T-Cell Receptor-Epitope Binding
T-cell receptor (TCR) binding to immunogenic peptides (epitopes) presented by major histocompatibility complex molecules is a critical mechanism in the adaptive immune system, essential for antigen recognition and triggering immune responses. The TCR repertoire exhibits considerable diversity, consisting of an α-chain and a β-chain that function together to enable T cells to recognize a wide array of epitopes. The β-chain is especially significant, as it is crucial for the early stages of T-cell development and possesses greater variability, which enhances the TCR’s capacity to identify diverse pathogens effectively. However, understanding the specific interactions between TCRs and epitopes remains a significant challenge due to the vast variability in TCR sequences. Accurate prediction of TCR-peptide binding from sequence data would advance immunology by offering deeper insights into a patient’s immune status and disease history. This capability holds potential applications in personalized immunotherapy, early diagnosis, and the treatment of diseases such as cancer and autoimmune disorders. In silico tools designed to model TCR-peptide interactions could facilitate the study of therapeutic T-cell efficacy and assess cross-reactivity risks, presenting an opportunity for precision medicine.
We evaluated the model on the task of predicting TCR-epitope binding from sequence data using the Weber benchmark (ref. 54, https://tdcommons.ai/multi_pred_tasks/tcrepitope), which consists of 47,182 TCR β-chain epitope pairs. This dataset covers 192 distinct epitopes and includes 23,139 unique TCR β-chain sequences, with 50% of the pairs serving as negative samples created by randomly pairing TCR sequences with epitopes they are not known to bind. The dataset also includes the CDR3 subsequence for each TCR β-chain, the most hypervariable region of the chain. We used 10-fold cross-validation with the folds pre-defined in ref. 54. Fine-tuning involved three concurrent tasks: TCR β-chain mask infilling and two classification tasks, (i) TCR β-chain epitope binding prediction and (ii) TCR β-chain CDR3 epitope binding prediction. Here, we report the performance only for the TCR β-chain epitope binding prediction task. Our model achieves an average AUROC of 0.879 (Table 1), representing a statistically significant improvement of 2% over the SOTA, as our result falls outside the SOTA’s confidence interval.
Protein-Protein Interaction – ΔΔG Prediction
An important factor in drug design is binding affinity, commonly measured by the equilibrium dissociation constant, KD, which is related to the Gibbs free energy ΔG through the equation
$$\Delta G=kT\,\ln({K}_{D}),$$
(1)
where k is the Boltzmann constant and T is the temperature55.
The effect of introducing mutations into a protein–protein complex is commonly quantified by the change in binding free energy relative to the reference (wild-type) complex. This mutation-induced effect is captured by the difference in Gibbs free energy, defined as
$$\Delta \Delta G=\Delta G_{\mathrm{mutant}}-\Delta G_{\text{wild-type}}.$$
By subtracting the wild-type free energy, ΔΔG isolates the energetic contribution of the mutation itself. As a result, ΔΔG provides a direct measure of whether a mutation stabilizes or destabilizes binding and is a standard target in studies of mutational effects on protein–protein interactions56,57,58.
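Combining Eq. (1) with the definition above, the constant terms cancel and ΔΔG reduces to a log-ratio of dissociation constants. A minimal sketch, expressed per mole (via the gas constant R = NA·k, in kcal/mol) rather than per molecule; units and temperature are our illustrative choices:

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)

def delta_g(kd, temperature=298.0):
    """Per-mole form of Eq. (1): ΔG = RT ln(KD), with KD in molar units."""
    return R_KCAL * temperature * math.log(kd)

def ddg(kd_mutant, kd_wildtype, temperature=298.0):
    """ΔΔG = ΔG_mutant - ΔG_wild-type = RT ln(KD_mut / KD_wt).
    Positive values indicate the mutation weakens binding
    (larger KD, less favorable ΔG)."""
    return delta_g(kd_mutant, temperature) - delta_g(kd_wildtype, temperature)
```

For instance, a mutation that shifts KD from 1 nM to 1 μM (a 1000-fold loss of affinity) corresponds to a ΔΔG of roughly +4.1 kcal/mol at 298 K.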
The SKEMPI dataset55 provides experimentally measured changes in thermodynamic parameters, including ΔG and kinetic rate constants, for mutations in protein–protein complexes with known structures in the Protein Data Bank59. This dataset is widely used to benchmark methods for predicting mutation-induced changes in binding affinity, particularly ΔΔG. A commonly used subset of SKEMPI comprising 1131 single-point mutations (S1131) is adopted as our benchmark. Following standard practice, we report 10-fold cross-validation performance on this subset. The input for our model consists solely of amino acid sequences for the wild-type and mutant complexes, without structural information. Leveraging MAMMAL’s support for continuous-valued outputs, we formulate ΔΔG prediction as a regression task. Performance results are reported in Table 1. Our model achieves an average Pearson correlation of 0.852, substantially exceeding the previous sequence-only state of the art (0.663), and remains competitive with structure-based methods, falling only 1.6% short of the reported best performance of 0.86656.
Drug-Target Interaction
Predicting drug-target binding affinity plays a crucial role in the early stages of drug discovery. Traditionally, binding affinities are measured through high-throughput screening experiments, which, while accurate, are resource-intensive and limited in their scalability to evaluate large sets of drug candidates. In this task, we focus on predicting binding affinities using pKD, the negative logarithm of the dissociation constant, which reflects the strength of the interaction between a small molecule (drug) and a protein (target). We utilize the PEER (Protein sEquence undERstanding) benchmark60 for DTI prediction. This benchmark leverages data from the BindingDB dataset61, with a specific test split that holds out four protein classes – estrogen receptor, G-protein-coupled receptors, ion channels, and receptor tyrosine kinases – for assessing generalization performance on unseen classes.
For model fine-tuning, we conducted hyperparameter optimization, selecting an initial learning rate of 0.0004, with no dropout and no weight decay. We standardized the pKD values based on the mean and standard deviation of the training set. For evaluation, we transformed the predicted values back to their original scale. Our model achieves an average NRMSE of 0.906 (Table 1), demonstrating a solid improvement of 3.8% over the SOTA reported by60.
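The label-standardization step described above can be sketched as follows; the class and method names are illustrative, not taken from the actual fine-tuning code. The key point is that the mean and standard deviation are computed on the training labels only and reused to map predictions back to the pKD scale at evaluation time.

```python
class LabelStandardizer:
    """Standardize regression targets (e.g., pKD) using train-set
    statistics, and invert the transform for evaluation. Minimal sketch."""

    def fit(self, train_labels):
        n = len(train_labels)
        self.mean = sum(train_labels) / n
        var = sum((x - self.mean) ** 2 for x in train_labels) / n
        self.std = var ** 0.5
        return self

    def transform(self, labels):
        # Applied to targets before fine-tuning.
        return [(x - self.mean) / self.std for x in labels]

    def inverse_transform(self, preds):
        # Applied to model outputs before computing metrics such as NRMSE.
        return [z * self.std + self.mean for z in preds]
```

Fitting on the training split and inverting at test time avoids leaking test-label statistics into the model while still reporting metrics on the original pKD scale.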
Antibody-Antigen Binding Prediction
Accurate prediction of antigen-antibody binding can enhance the design and optimization of therapeutic antibodies, leading to improved efficacy and specificity. We employ the human epidermal growth factor receptor 2 (HER2) dataset62 as a benchmark for predicting antibody-antigen binding. HER2 is a key target for certain types of breast and stomach cancers. The dataset includes variations of the clinically approved therapeutic antibody trastuzumab and their corresponding affinities for the HER2 antigen. The dataset comprises 8,935 binding and 25,114 non-binding trastuzumab CDR H3 mutants, each with up to 10 mutations, following de-duplication and the removal of samples labeled as both binding and non-binding.
For the closest possible comparison with the SOTA (refs. 62,63), the HER2 dataset was divided into train (70%), validation (15%), and test (15%) sets. For increased robustness, the training set was further divided into 5 folds. The reported results are from the 5 models trained on the different training folds, each evaluated on the same test set.
Fine-tuning involved feeding the target antigen sequence as well as the entire heavy-chain variable region as input and predicting binding to the target sequence. Our model achieves an average AUROC of 0.928 (Table 1), slightly surpassing the SOTA, which, unlike our model, incorporated structural data.
Comparison of AlphaFold 3 and MAMMAL in Predicting Antibody-Antigen and Nanobody-Antigen Binding
Accurate prediction of antibody-antigen and nanobody-antigen interactions is essential for evaluating therapeutic efficacy and guiding protein engineering. Although AlphaFold 3 (AF3) is not explicitly designed as a binary protein-protein interaction (PPI) classifier, binding likelihood can be inferred from structure-derived confidence scores, such as predicted template modeling (pTM) and interface predicted template modeling (ipTM), computed from predicted protein-protein complexes. These confidence scores are derived from structural predictions rather than classification objectives. Recent studies suggest that these scores correlate with true binding events64,65.
Accordingly, we conduct an exploratory comparison between AF3-derived confidence scores and a fine-tuned MAMMAL model for distinguishing binders from non-binders. We emphasize that AF3 provides detailed 3D structural hypotheses, whereas MAMMAL is a sequence-only model that produces probabilistic binding predictions; the comparison is intended to assess relative discriminative power for binding prediction rather than to equate the underlying modeling approaches.
We first evaluated the extracellular domain (ECD) of HER2, a well-characterized therapeutic antigen with experimentally validated binding epitopes. We used the HER2-specific MAMMAL model described in Subsection 2.9. Due to the computational demands and limited availability of AF3, the HER2 benchmark test set was downsampled to 60 examples, comprising 30 binders and 30 non-binders. The HER2-specific MAMMAL model demonstrates strong discriminative performance, achieving an AUROC of 0.88. In contrast, AF3 exhibits no meaningful separation between binders and non-binders (AUROC = 0.45), and the difference in performance between the two models is highly significant (DeLong test, P = 1.5 × 10⁻⁶). Structural analysis further reveals that AF3-predicted binding sites are indistinguishable between binders and non-binders and deviate from the known epitopes of the FDA-approved antibodies trastuzumab and pertuzumab (Fig. 2a). An extended comparative analysis of MAMMAL (pre-trained and fine-tuned) and AF3 is provided in Supplementary Table S5. This includes several AF3 confidence-score variants, using ipTM- and pTM-based scoring, and heavy-chain-only and heavy+light-chain input configurations.
Next, we evaluated nanobody binding across six structurally diverse antigen targets: albumin, mannose receptor (CD206), epidermal growth factor receptor (EGFR), thyroxine-binding globulin (TBG), tumor necrosis factor alpha (TNFα), and von Willebrand factor (VWF). Binding nanobodies were collected from SAbDab-nano66, patents, and proprietary datasets. Non-binders consisted of nanobodies experimentally confirmed as non-binding in phage-display library screenings, as well as nanobodies targeting unrelated antigens. From a total of 668 nanobody–antigen pairs (131 binders and 537 non-binders), we selected 475 sequences (64 binders and 411 non-binders) for MAMMAL fine-tuning and reserved 193 sequences (67 binders and 126 non-binders) for held-out evaluation. A single MAMMAL model was fine-tuned on training data comprising binders and non-binders across all targets, and subsequently evaluated separately on the test subset corresponding to each target, with performance compared against AF3 confidence scores computed on the same samples. Test set statistics and per-target MAMMAL and AF3 performance are summarized in Table 2. Additional performance metrics for MAMMAL and AF3 are presented in Supplementary Tables S6–S7. As shown, MAMMAL significantly outperforms AF3 on the larger targets: albumin, CD206, EGFR, and VWF. In contrast, AF3 achieves superior performance on the smaller TBG target, and analysis of the predicted structures highlights distinct binding sites for binders versus non-binders (Fig. 2b and Supplementary Figure S4). For the smallest protein, TNFα, the two models exhibit comparable performance.
Table 2 Per-target AUROC comparison between MAMMAL and AF3 on held-out antibody/nanobody–antigen test subsets