📄️ Distributed Training Estimator of LLMs
This component implements a time cost estimator for distributed training of large language models (LLMs). It predicts the time required to train one batch across multiple GPUs. The predictor module requires only a CPU; the computation sampling module needs one or more GPUs, and the communication sampling module requires multiple GPUs, depending on your computing platform.
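The hardware requirements above can be summarized as a simple capability check. This is a minimal sketch, assuming module names (`predictor`, `computation_sampling`, `communication_sampling`) that mirror the description; they are illustrative, not the project's actual API.

```python
def runnable_modules(gpu_count: int) -> list[str]:
    """Return which estimator modules can run given the number of available GPUs.

    Thresholds follow the description above: the predictor needs only a CPU,
    computation sampling needs at least one GPU, and communication sampling
    needs multiple GPUs. Module names here are assumptions for illustration.
    """
    modules = ["predictor"]  # CPU-only: always available
    if gpu_count >= 1:
        modules.append("computation_sampling")    # one or more GPUs
    if gpu_count >= 2:
        modules.append("communication_sampling")  # multiple GPUs
    return modules

print(runnable_modules(0))  # ['predictor']
print(runnable_modules(4))  # all three modules
```

In practice the GPU count could be detected at runtime (e.g. with `torch.cuda.device_count()`), but the check is kept pure here so it is easy to test.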
📄️ Tutorials
Environment