DFLOP is a data-driven optimization framework designed to improve distributed training efficiency for Multimodal Large Language Models (MLLMs).
Unlike existing data-agnostic frameworks that parallelize computation blindly, DFLOP adapts parallelism and scheduling to the actual characteristics of the training data, mitigating computation imbalance and input-dependent performance variance.
DFLOP consists of three core components:
- Profiling Engine
  - Profiles both model and data workloads.
  - Builds predictive models for memory and throughput across input shapes (see the first sketch below).
  - Analyzes the empirical input-shape distribution from real datasets.
- Data-aware 3D Parallelism Optimizer
  - Uses profiling results to determine optimal 3D parallelism configurations (Tensor / Pipeline / Data Parallelism) for each module independently.
  - Minimizes expected makespan under memory and hardware constraints (see the second sketch below).
- Online Microbatch Scheduler
  - Dynamically partitions each training batch using Integer Linear Programming (ILP).
  - Balances computation load across pipeline stages in real time.
  - Reduces GPU idle time caused by pipeline bubbles.
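To make the Profiling Engine's predictive models concrete, here is a minimal sketch that fits affine cost models (step time and memory vs. token count) by least squares. The sample numbers and the linear-in-tokens assumption are illustrative only, not DFLOP's actual estimators.

```python
# Illustrative only: fit least-squares cost models from profiled samples.
# DFLOP's real predictive models may use different features and estimators.
import numpy as np

# Hypothetical profiling samples for one module:
# (sequence length in tokens, step time in ms, peak activation memory in MB)
samples = np.array([
    [256,   41.0,  900.0],
    [512,   79.5, 1750.0],
    [1024, 158.0, 3480.0],
    [2048, 321.0, 6900.0],
])

tokens = samples[:, 0]
# Design matrix [tokens, 1] -> affine model: cost = a * tokens + b
X = np.stack([tokens, np.ones_like(tokens)], axis=1)

# Two independent fits: step time and peak memory vs. token count.
time_coef, *_ = np.linalg.lstsq(X, samples[:, 1], rcond=None)
mem_coef, *_ = np.linalg.lstsq(X, samples[:, 2], rcond=None)

def predict_time_ms(num_tokens: float) -> float:
    """Predicted step time for this module at a given input length."""
    return time_coef[0] * num_tokens + time_coef[1]

def predict_memory_mb(num_tokens: float) -> float:
    """Predicted peak activation memory at a given input length."""
    return mem_coef[0] * num_tokens + mem_coef[1]

print(predict_time_ms(1536), predict_memory_mb(1536))
```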
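Similarly, the Data-aware 3D Parallelism Optimizer can be pictured as a constrained search over (TP, PP, DP) factorizations of the GPU count. The sketch below uses invented placeholder cost and memory functions; in DFLOP these would come from the profiling-based predictors above.

```python
# Toy search over (tp, pp, dp) factorizations of the GPU count.
# The cost and memory functions are placeholders standing in for the
# profiling-based predictors; they are NOT DFLOP's real models.
from itertools import product

NUM_GPUS = 16
MEMORY_BUDGET_GB = 80.0  # assumed per-GPU memory constraint

def expected_makespan(tp: int, pp: int, dp: int) -> float:
    # Placeholder: compute shrinks with tp*dp, pipeline bubbles grow with pp,
    # tensor parallelism adds all-reduce overhead.
    return 1000.0 / (tp * dp) + 15.0 * (pp - 1) + 5.0 * tp

def memory_per_gpu(tp: int, pp: int, dp: int) -> float:
    # Placeholder: model states shard across tp * pp.
    return 400.0 / (tp * pp)

best = None
for tp, pp in product(range(1, NUM_GPUS + 1), repeat=2):
    if NUM_GPUS % (tp * pp):
        continue  # tp * pp must divide the GPU count
    dp = NUM_GPUS // (tp * pp)
    if memory_per_gpu(tp, pp, dp) > MEMORY_BUDGET_GB:
        continue  # violates the per-GPU memory constraint
    cost = expected_makespan(tp, pp, dp)
    if best is None or cost < best[0]:
        best = (cost, tp, pp, dp)

print("best (makespan, tp, pp, dp):", best)
```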
- Install package

```bash
conda create -n dflop python=3.10 -y
conda activate dflop
pip install --upgrade pip  # enable PEP 660 support
pip install -e .[dev] --extra-index-url https://download.pytorch.org/whl/cu124
```

- Install an additional package

```bash
pip install flash-attn==2.7.3 --no-build-isolation
```
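Before moving on, a quick sanity check (our suggestion, not part of DFLOP's scripts) confirms that CUDA-enabled PyTorch and flash-attn import cleanly:

```python
# Optional post-install sanity check (not part of DFLOP itself).
import torch
import flash_attn

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)
```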
After downloading:

- Place the Single Image Dataset and the Multiple Image Dataset inside the image_folder (e.g., `data/image_folder/`)
- Place the Video Dataset inside the video_folder (e.g., `data/video_folder/`)

Set these dataset paths in `configs/dataset_config.yaml`.
- `mllm_model_name` can be selected from the following options:
  - `llavaov`
  - `internvl`
- `llm_model_name` can be selected from:
  - `qwen2.5`
  - `llama3`
DFLOP uses a separate configuration file, `configs/dflop_config.yaml`, to define model selection, dataset paths, hardware resources, and training parameters.
You must fill in the commented sections before running DFLOP.
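A short script like the following can catch unfilled fields before launch. This is our sketch, and the key names are assumptions; match them to the actual commented sections of the config file.

```python
# Minimal sketch: verify that required fields in configs/dflop_config.yaml
# have been filled in before launching DFLOP. The key names below are
# illustrative guesses, not DFLOP's guaranteed schema.
import yaml

REQUIRED_KEYS = ["mllm_model_name", "llm_model_name"]  # assumed key names

with open("configs/dflop_config.yaml") as f:
    cfg = yaml.safe_load(f)

missing = [k for k in REQUIRED_KEYS if cfg.get(k) in (None, "")]
if missing:
    raise SystemExit(f"Fill in these fields before running DFLOP: {missing}")
print("dflop_config.yaml looks complete.")
```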
Navigate to the `scripts` folder.
The `run_profiling_engine.sh` script launches the Profiling Engine of DFLOP across multiple nodes.
Each node must have a unique `rank_number`, assigned sequentially (e.g., 0, 1, 2, 3, ...), so that every node can correctly identify its role in the distributed profiling job.

```bash
bash run_profiling_engine.sh <num_nodes> <rank_number> <master_addr>
```

For example, on a four-node cluster, node 0 would run `bash run_profiling_engine.sh 4 0 <master_addr>`.

After completing the profiling stage, run the Data-aware 3D Parallelism Optimizer to automatically search for optimal parallel configurations based on the profiling results.
```bash
bash run_data_aware_optimization.sh
```

Once the optimized configuration is generated, you can start the training phase. During this stage, DFLOP’s Online Microbatch Scheduler runs asynchronously to dynamically balance workloads across GPU pipeline stages in real time (a toy sketch of its partitioning step follows the training command below).
As with profiling, each node must have a unique `rank_number`, assigned sequentially (e.g., 0, 1, 2, 3, ...), so that every node can correctly identify its role in the distributed training job.
```bash
bash run_training.sh <num_nodes> <rank_number> <master_addr>
```
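To make the scheduler's partitioning step concrete, here is a toy ILP formulation using PuLP. The per-sample costs, variable names, and objective are illustrative assumptions; DFLOP's actual ILP, and how it consumes the profiling predictors, may differ.

```python
# Toy ILP: assign the samples of one batch to microbatches so that the
# heaviest microbatch (the pipeline bottleneck) is minimized. The costs are
# hypothetical per-sample compute estimates, NOT DFLOP's real formulation.
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, PULP_CBC_CMD

costs = [7.0, 3.0, 9.0, 4.0, 6.0, 2.0, 8.0, 5.0]  # est. cost per sample
num_microbatches = 4

prob = LpProblem("microbatch_partition", LpMinimize)
# x[i][j] = 1 iff sample i is assigned to microbatch j.
x = [[LpVariable(f"x_{i}_{j}", cat="Binary") for j in range(num_microbatches)]
     for i in range(len(costs))]
makespan = LpVariable("makespan", lowBound=0)

prob += makespan  # objective: minimize the heaviest microbatch load
for i in range(len(costs)):
    prob += lpSum(x[i]) == 1  # each sample placed exactly once
for j in range(num_microbatches):
    prob += lpSum(costs[i] * x[i][j] for i in range(len(costs))) <= makespan

prob.solve(PULP_CBC_CMD(msg=False))
for j in range(num_microbatches):
    members = [i for i in range(len(costs)) if x[i][j].value() == 1]
    print(f"microbatch {j}: samples {members}, "
          f"load {sum(costs[i] for i in members):.1f}")
```

Minimizing the load of the heaviest microbatch is one way to shrink pipeline bubbles, since the slowest microbatch gates every pipeline stage behind it.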