How Post-Training Shapes Biological Reasoning Models
Abstract
Post-training stages affect generalization in biological reasoning models in different ways: continued pre-training aligns models with biological language, supervised fine-tuning improves in-domain performance but reduces out-of-domain generalization, and reinforcement learning recovers out-of-domain performance when applied to well-aligned checkpoints.
Scientific reasoning models for biology combine language models with foundation models trained on multimodal biological data, including DNA, RNA, and proteins. These models are built through post-training, yet how each stage shapes reasoning and generalization remains poorly understood. We study when post-training improves performance and when it induces over-specialization. Across genomics, transcriptomics, and proteins, we train and evaluate more than 100 biological reasoning models under controlled variation in backbone, continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL), measuring both in-domain (ID) and out-of-domain (OOD) performance. We find that each post-training stage reshapes generalization in a distinct way rather than contributing uniform gains. CPT improves downstream performance by aligning models with biological language. SFT consistently increases ID performance but causes OOD performance to peak early and decline as models fit the training distribution. RL, when applied to strong SFT checkpoints with aligned rewards, improves OOD performance and partially recovers generalization. These results show that biological reasoning does not improve monotonically with additional supervision or compute. Instead, performance depends on how training stages are composed. Under fixed post-training budgets, the strongest ID-OOD trade-off comes from brief SFT, larger RL allocations, and asymmetric adaptation capacity across stages.
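As a rough illustration of the "brief SFT, larger RL" composition described in the abstract, the sketch below picks the SFT checkpoint at which OOD performance peaks and assigns the rest of a fixed budget to RL. This is a minimal sketch under stated assumptions, not the authors' procedure: the function names `ood_peak_step` and `split_budget`, the use of training steps as the budget unit, and the choice of the OOD peak as the handoff point are all hypothetical, and the scores are made-up placeholders rather than results from the paper.

```python
from typing import Dict, Sequence


def ood_peak_step(steps: Sequence[int], ood_scores: Sequence[float]) -> int:
    """Return the SFT step whose checkpoint has the highest OOD score.

    Ties resolve to the earliest step, which favors shorter SFT.
    """
    if not steps or len(steps) != len(ood_scores):
        raise ValueError("steps and ood_scores must be non-empty and equal length")
    best_index = max(range(len(steps)), key=lambda i: ood_scores[i])
    return steps[best_index]


def split_budget(total_steps: int, sft_steps: int) -> Dict[str, int]:
    """Split a fixed post-training budget: SFT first, the remainder to RL."""
    if not 0 <= sft_steps <= total_steps:
        raise ValueError("sft_steps must lie between 0 and total_steps")
    return {"sft": sft_steps, "rl": total_steps - sft_steps}


# Made-up placeholder curves (NOT from the paper): ID keeps rising with
# more SFT, while OOD peaks early and then declines.
sft_steps_logged = [100, 200, 300, 400, 500]
id_scores = [0.50, 0.60, 0.68, 0.73, 0.76]
ood_scores = [0.40, 0.47, 0.45, 0.41, 0.36]

handoff = ood_peak_step(sft_steps_logged, ood_scores)
budget = split_budget(total_steps=1000, sft_steps=handoff)
print(f"hand off to RL at SFT step {handoff}; budget split: {budget}")
```

Selecting on ID score alone would choose the last checkpoint in this toy example, which is the pattern the abstract warns about: ID gains continue while OOD performance has already declined.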
Community
This study of 100+ models reveals that biological reasoning follows non-monotonic dynamics: more post-training compute does not always yield better generalization. We identify a "generalization collapse" in which SFT drives in-domain accuracy up while out-of-domain robustness peaks early and then declines sharply. Reinforcement learning (RL) acts as a corrective, recovering lost robustness and pushing the performance frontier on unseen biological systems. Our findings identify a recipe for scientific reasoning under a fixed post-training budget: prioritize brief SFT paired with larger RL allocations and asymmetric adaptation capacity across stages.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Introspective X Training: Feedback Conditioning Improves Scaling Across all LLM Training Stages (2026)
- On Asymmetric Optimization of Reasoning and Perception in Vision-Language Model Post-Training (2026)
- OmicsLM: A Multimodal Large Language Model for Multi-Sample Omics Reasoning (2026)
- GRLO: Towards Generalizable Reinforcement Learning in Open-Ended Environments from Zero (2026)
- AdaMame: A Training Recipe for Adaptive Multilingual Reasoning (2026)
- Hybrid-LoRA: Bridging Full Fine-Tuning and Low-Rank Adaptation for Post-Training (2026)
- ProteinJEPA: Latent prediction complements protein language models (2026)