Explanation
AutoSDT is designed to maximize ecological validity of scientific programming tasks while minimizing manual curation cost.
- AutoSDT-Search improves repository recall by generating discipline-specific search keywords.
- AutoSDT-Select improves data quality by verifying the scientific relevance of each task and isolating its executable dependencies.
- AutoSDT-Adapt improves usability by converting fragmented repository code into standalone tasks with natural-language instructions.
This design supports both training (task-solution supervision) and evaluation (realistic scientific coding benchmarks).
Training and Inference
Supervised Fine-tuning
We use the LLaMA-Factory library for SFT. Example config files are provided in the models/ folder.
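As a sketch, a minimal LLaMA-Factory SFT config might look like the following. All values here are illustrative placeholders, not the project's actual settings; consult the config files shipped in models/ for those.

```yaml
# Illustrative LLaMA-Factory SFT config (placeholder values only --
# see the configs provided in models/ for the real settings).
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct  # placeholder base model
stage: sft
do_train: true
finetuning_type: full
dataset: autosdt_tasks        # hypothetical dataset name registered in dataset_info.json
template: llama3
cutoff_len: 4096
output_dir: saves/autosdt-sft
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-5
num_train_epochs: 3.0
bf16: true
```

A config like this is launched with `llamafactory-cli train <config>.yaml`.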
Inference and Evaluation
For ScienceAgentBench, follow its original repository instructions.
For DiscoveryBench, first start an LLM engine on localhost with vLLM, then run:
python evaluate_with_llm_engine.py
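The evaluation script talks to the locally served model. As a minimal sketch of that interaction, assuming the engine was started with something like `vllm serve <model> --port 8000` (which exposes an OpenAI-compatible API), a chat-completions request to it could be built as follows. The function and field choices here are illustrative, not the actual internals of evaluate_with_llm_engine.py:

```python
import json


def build_chat_request(model: str, prompt: str,
                       base_url: str = "http://localhost:8000/v1"):
    """Build the URL and JSON payload for an OpenAI-compatible
    chat-completions call, as served by vLLM on localhost.
    (Sketch only; names and defaults are assumptions.)"""
    url = f"{base_url}/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,  # deterministic decoding for evaluation
    }
    return url, json.dumps(payload)


url, body = build_chat_request("my-sft-model", "Summarize the dataset.")
```

The resulting `body` can be POSTed to `url` with any HTTP client once the vLLM server is up.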
Then compute final metrics:
python cal_eval_avg.py
Acknowledgements
This work is supported in part by the National Science Foundation (NSF)-funded AI Institute for Intelligent Cyberinfrastructure with Computational Learning in the Environment (ICICLE) under award OAC 2112606.