Tutorials
Environment
The software environment is determined by the specific training frameworks employed, such as the versions of CUDA, PyTorch, FlashAttention, and others. While the requirements.txt file enumerates the necessary packages, it is the user's responsibility to specify the appropriate versions required for their use case.
cd distributed_training_estimator_of_LLM
pip install -r requirements.txt
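For example, a minimal sketch of pinning PyTorch and FlashAttention to a particular CUDA build (the versions and CUDA tag below are illustrative, not requirements of this project):
# Illustrative only: choose versions that match your cluster's CUDA toolkit and drivers.
pip install "torch==2.3.*" --index-url https://download.pytorch.org/whl/cu121
pip install flash-attn --no-build-isolation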
Alternatively, you can install only the packages for the Estimator if you already have the sampling data.
cd distributed_training_estimator_of_LLM/Estimator
pip install -r requirements-estimator.txt
Predictor
To run the Predictor, the training configuration and the sampling data for the computing and communication operators are required. The target_config folder contains two example configuration YAML files, and the regressors folder contains the required sampling data collected from two real clusters as examples. The Predictor can run on any CPU from the past five years, since it only relies on Random Forest and XGBoost.
cd Estimator
python mml_3d_prediction.py --config_path <path_to_config.yml>
The commands below run the two example configurations, using the sampling data for Perlmutter and Vista provided in Estimator/regressors.
cd Estimator
# One-batch runtime estimate for llemma-7B with pipeline parallelism 4, model parallelism 2, and data parallelism 2 on Perlmutter.
python mml_3d_prediction.py --config_path ./target_config/llemma_7b_4_2_2_P.yml
# One-batch runtime estimate for llemma-7B with pipeline parallelism 4, model parallelism 2, and data parallelism 2 on Vista.
python mml_3d_prediction.py --config_path ./target_config/llemma_7b_4_2_2_V.yml
# Each run prints a message in the terminal like this:
Estimated timecost of current training configs is 9480819.171239894 us.
The same estimate can also be obtained programmatically via the one_batch_predict function:
from mml_3d_prediction import one_batch_predict
configs_path = 'path_to/training_config.yml'
one_batch_cost = one_batch_predict(configs_path) # microseconds
The return value is the estimated time cost of a single parameter update, measured in microseconds. For example, the message shown above (9480819.17 us) corresponds to roughly 9.5 seconds per parameter update.
Computation Sampling
The computing operator sampling module requires a configuration for each operator in the form of a YAML file. The /configs/collect and /configs/test directories provide details about the configuration files. Sampling can be run on a single GPU or split across multiple GPUs.
cd Kernel_sampling
# For example, sampling the baddbmm operator with fp16
python sampling_controller.py --config_path ./configs/collect/baddbmm.yml --precision fp16
The files run_collection.sh and run_test.sh contain details about how to collect and test the sampling data for each operator. The --parts option specifies how many parts the sampling work should be split into, and the --part option specifies which part of the work is being processed on the current GPU.
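As a concrete illustration, here is a minimal sketch of splitting the baddbmm collection into four parts on a single node with four GPUs, one part per GPU. This assumes --part is 1-indexed, as in the example above; adjust the GPU count and config path to your setup.
cd Kernel_sampling
# Launch one sampling process per GPU, each handling one part of the work.
for i in 1 2 3 4; do
  CUDA_VISIBLE_DEVICES=$((i-1)) python sampling_controller.py \
    --config_path ./configs/collect/baddbmm.yml \
    --precision fp16 \
    --parts 4 \
    --part $i &
done
wait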
Communication Sampling
Like the computation sampling, this part also requires a configuration for each communication operator in the form of a YAML file. The /configs/collect and /configs/test directories provide details about the configuration files. The example below shows how to collect P2P communication between two nodes, with only one GPU active on each node.
# Get master address and port
nodes=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) )
head_node=${nodes[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
# Get the number of nodes
NNODES=$(scontrol show hostnames $SLURM_JOB_NODELIST | wc -l)
# Activate only GPU 0 on each node
srun --export=ALL,CUDA_VISIBLE_DEVICES=0 torchrun \
--nnodes $NNODES \
--nproc_per_node 1 \
--rdzv_id $RANDOM \
--rdzv_backend c10d \
--rdzv_endpoint $head_node_ip:29500 \
sampling_controller.py \
--config_path ./configs/test/p2p.yml \
--precision fp16 \
--parts 1 \
--part 1
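The snippet above reads SLURM job environment variables, so it is intended to run inside a batch job. A minimal sketch of an sbatch header it could be embedded in follows; the job name, GPU request syntax, and time limit are illustrative and should be adapted to your cluster:
#!/bin/bash
#SBATCH --job-name=p2p_sampling   # illustrative job name
#SBATCH --nodes=2                 # two nodes for P2P communication sampling
#SBATCH --ntasks-per-node=1       # one torchrun launcher per node
#SBATCH --gpus-per-node=1         # only GPU 0 is used per node; some clusters use --gres=gpu:1 instead
#SBATCH --time=00:30:00           # illustrative time limit

# ...followed by the "Get master address and port" snippet and the srun/torchrun command above.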