NovoExpert: State-of-the-art ADMET prediction on TDC benchmarks
Five SOTA wins on Therapeutics Data Commons ADMET endpoints using CatBoost ensembles with MapLight and GIN fingerprints.
Abstract
We present NovoExpert, a family of ADMET prediction models achieving state-of-the-art performance on five Therapeutics Data Commons (TDC) benchmark endpoints. Our approach combines CatBoost gradient-boosted trees with MapLight (2573-bit) and GIN (300-dimensional) molecular fingerprints, supplemented by Chemprop v2 directed message-passing neural networks for specific endpoints.
Results
| Endpoint | Metric | Score | Improvement over previous best |
|---|---|---|---|
| CYP2D6 Veith | AUPRC | 0.778 | +0.028 |
| CYP3A4 Veith | AUPRC | 0.916 | +0.016 |
| CYP3A4 Substrate | AUPRC | 0.648 | +0.004 |
| Clearance Hepatocyte | Spearman | 0.602 | +0.024 |
| DILI | AUROC | 0.922 | +0.006 |
Method
For four of the five winning endpoints, the final model is a CatBoost model trained on concatenated MapLight and GIN fingerprints: a classifier for the three CYP endpoints and a regressor for Clearance Hepatocyte, which is scored by Spearman correlation. For DILI, a Chemprop v2 D-MPNN achieved the best performance. All models were validated using the TDC benchmark scaffold split.
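The feature construction described above can be sketched as follows. This is a minimal illustration, not the released training code. It assumes the MapLight and GIN features have already been computed as per-molecule arrays; random placeholders stand in for them here. The names `concat_features`, `fit_catboost`, and `TASK_BY_ENDPOINT` are hypothetical, and the task type for each endpoint is inferred from the reported metric. CatBoost hyperparameters are left at library defaults because the source does not state them.

```python
import numpy as np

MAPLIGHT_DIM = 2573  # MapLight fingerprint length stated in the abstract
GIN_DIM = 300        # GIN embedding length stated in the abstract

# Hypothetical mapping: task type inferred from the metric reported per endpoint.
TASK_BY_ENDPOINT = {
    "CYP2D6 Veith": "classification",
    "CYP3A4 Veith": "classification",
    "CYP3A4 Substrate": "classification",
    "Clearance Hepatocyte": "regression",
}


def concat_features(maplight, gin):
    """Stack per-molecule MapLight and GIN features side by side."""
    maplight = np.asarray(maplight, dtype=np.float32)
    gin = np.asarray(gin, dtype=np.float32)
    if maplight.ndim != 2 or gin.ndim != 2:
        raise ValueError("expected 2-D arrays of shape (n_molecules, n_features)")
    if maplight.shape[0] != gin.shape[0]:
        raise ValueError("both feature blocks must describe the same molecules")
    return np.hstack([maplight, gin])


def fit_catboost(features, labels, task, seed=0):
    """Fit a CatBoost classifier or regressor with default hyperparameters."""
    # Imported lazily so the feature helper works without catboost installed.
    from catboost import CatBoostClassifier, CatBoostRegressor

    model_cls = CatBoostClassifier if task == "classification" else CatBoostRegressor
    model = model_cls(random_seed=seed, verbose=False)
    model.fit(features, labels)
    return model


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_molecules = 8
    # Placeholder features; real inputs would come from the fingerprint pipelines.
    features = concat_features(
        rng.random((n_molecules, MAPLIGHT_DIM)),
        rng.random((n_molecules, GIN_DIM)),
    )
    print(features.shape)  # (8, 2873)
```

With real features and labels, a call such as `fit_catboost(features, labels, TASK_BY_ENDPOINT["Clearance Hepatocyte"])` would select the regressor, while the CYP endpoints would select the classifier.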