Can a general-purpose causal LLM (OLMo-7B) match a dedicated Message-Passing Graph Neural Network (GNN) on molecular toxicity prediction when evaluated on a rigorous, data-leakage-free scaffold split?
Most published baselines use random splitting, which lets molecules that share a scaffold appear in both the training and test sets and so inflates the score. Under a Murcko scaffold split, the numbers drop sharply:
| Model | Architecture | Split | ROC-AUC | Status |
|---|---|---|---|---|
| RF + ECFP | Fingerprint | Random | 0.8183 | ❌ Inflated (Data Leakage) |
| RF + ECFP | Fingerprint | Scaffold | 0.6135 | |
| Pure PyTorch GNN | Graph (atoms+bonds) | Scaffold | 0.6388 | 🥇 Topology Aware |
| OLMo-7B QLoRA | 1D Text (SMILES) | Scaffold | 0.6300 | 🥈 Language Aware |
Gap between GNN and OLMo: 0.0088 (≈ 0.009). This near-zero gap suggests two things:
- OLMo learns molecular grammar from 1D SMILES almost as effectively as a dedicated 2D graph model.
- Motivation for TSM: The remaining 0.009 gap motivates injecting 2D graph constraints directly into the LLM decoding loop, which is exactly what our Topological State Machine (TSM) does.
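
To make the scaffold-split idea behind the table concrete, here is a minimal sketch of a greedy, largest-group-first split. It is an illustration under stated assumptions, not necessarily what `src/scaffold_split.py` does: the function name, the default fractions, and the tie-breaking rule are all hypothetical, and the scaffold keys are assumed to be computed beforehand (for example with RDKit's `MurckoScaffold.MurckoScaffoldSmiles`).

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Split sample indices so that no scaffold spans two subsets.

    `scaffolds` holds one scaffold key per molecule, e.g. a Murcko
    scaffold SMILES string computed elsewhere (assumption of this sketch).
    """
    # Group molecule indices by their scaffold key.
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest groups first, so rare scaffolds tend to land in valid/test.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    # Round to whole molecules to avoid floating-point edge cases.
    train_cutoff = round(frac_train * n)
    valid_cutoff = round((frac_train + frac_valid) * n)

    train, valid, test = [], [], []
    for group in ordered:
        # A whole scaffold group always goes into exactly one subset.
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cutoff:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Hypothetical scaffold keys for ten molecules (placeholders, not real data).
scaffolds = ["A", "A", "A", "A", "B", "B", "B", "C", "C", "D"]
train, valid, test = scaffold_split(scaffolds, frac_train=0.7, frac_valid=0.2)
```

Because whole scaffold groups are assigned at once, the test set contains only scaffolds the model never saw during training, which is why scaffold-split scores are lower than random-split scores.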
We built a custom GNN in pure PyTorch, without heavy dependencies such as DGL or PyG. It consumes the outputs of DeepChem's `MolGraphConvFeaturizer` directly.
- Nodes: Atoms (30-dim features)
- Edges: Bonds
- Pipeline: 3 Message Passing Rounds ➔ Global Mean Pooling ➔ 12-Task Classification Head
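
The pipeline above can be sketched in plain PyTorch as follows. This is an illustrative sketch, not the code in `src/mol_gnn.py`: the class name, the hidden size of 64, the sum aggregation, and the ReLU updates are assumptions, and for brevity it uses bond connectivity only (no bond features). The inputs mirror the `node_features` and `edge_index` arrays of a DeepChem `GraphData` object, with a `batch` vector mapping each atom to its molecule.

```python
import torch
from torch import nn


class MolGNNSketch(nn.Module):
    """Illustrative message-passing network (hypothetical, for exposition)."""

    def __init__(self, node_dim=30, hidden_dim=64, num_rounds=3, num_tasks=12):
        super().__init__()
        self.embed = nn.Linear(node_dim, hidden_dim)
        # One update layer per message-passing round.
        self.updates = nn.ModuleList(
            [nn.Linear(2 * hidden_dim, hidden_dim) for _ in range(num_rounds)]
        )
        self.head = nn.Linear(hidden_dim, num_tasks)

    def forward(self, x, edge_index, batch, num_graphs):
        # x: [num_atoms, node_dim]; edge_index: [2, num_edges] (source, target);
        # batch: [num_atoms], the molecule index of each atom.
        h = torch.relu(self.embed(x))
        src, dst = edge_index
        for update in self.updates:
            # Sum the hidden states of each atom's bonded neighbours.
            messages = torch.zeros_like(h).index_add_(0, dst, h[src])
            h = torch.relu(update(torch.cat([h, messages], dim=1)))
        # Global mean pooling: average atom states within each molecule.
        sums = h.new_zeros(num_graphs, h.size(1)).index_add_(0, batch, h)
        counts = torch.bincount(batch, minlength=num_graphs).clamp(min=1)
        pooled = sums / counts.unsqueeze(1)
        return self.head(pooled)  # one logit per task


# Toy batch: two "molecules" with 3 and 2 atoms and random features.
x = torch.randn(5, 30)
edge_index = torch.tensor([[0, 1, 1, 2, 3, 4], [1, 0, 2, 1, 4, 3]])
batch = torch.tensor([0, 0, 0, 1, 1])
logits = MolGNNSketch()(x, edge_index, batch, num_graphs=2)
```

Each bond appears twice in `edge_index` (once per direction) so that messages flow both ways. The output has one row per molecule and one column per task, so a sigmoid over the logits gives per-task toxicity probabilities.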
This repository provides the empirical foundation for the DeepChem LLM Integration proposal:
- Baseline Validation: Establishes honest scaffold split numbers.
- GIMF Motivation: Quantifies the structural gap between LLMs and GNNs.
- DeepChem Native: Fully compliant with `dc.feat` and `dc.data` standards.
├── notebooks/
│ ├── 01_rf_ecfp_baseline.ipynb # Random Forest baseline
│ ├── 02_gnn_scaffold_split.ipynb # Pure PyTorch GNN
│ └── 03_olmo_tox21_classification.ipynb # OLMo-7B
├── results/
│ └── benchmark_table.csv # Extracted metrics
├── src/
│ ├── scaffold_split.py # Murcko scaffold splitter
│ └── mol_gnn.py # PyTorch GNN Implementation
└── README.md