Tutorial: Training Multi-modal Agent Tuning Projects with LLaMA-Factory
Background
In our research work, Multi-modal Agent Tuning (MAT), we developed a framework for automatically generating multi-modal tool-usage trajectories (the 20K-trajectory MM-Traj dataset), boosting the tool-use ability of MiniCPM and Qwen-VL by about 20%. This work was accepted at ICLR 2025.
At the time we did this work, LLaMA-Factory did not yet support training Qwen-VL or MiniCPM, so we had to modify the official code released by the Qwen-VL and MiniCPM teams. As a result, training these two models in our repository requires two separate codebases, which is inconvenient.
In this tutorial, I will show you how to train MAT with the latest LLaMA-Factory, so that you only need to download the dataset and then use a single codebase for training.
Tutorial
Step 1: Install LLaMA-Factory
This is very simple: just follow the official installation instructions.
conda create -n mat python=3.10
conda activate mat
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
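To quickly confirm that the editable install works, you can import the package from Python (a minimal check; the pip command above installs the package under the name llamafactory):
# Minimal sanity check that the editable install of LLaMA-Factory is importable.
import llamafactory

# Should point into the cloned LLaMA-Factory/src directory.
print(llamafactory.__file__)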
Step 2: Download and parse dataset
You can download the dataset from Hugging Face with huggingface-cli:
# assume you are in the root directory of LLaMA-Factory
mkdir data/mat
huggingface-cli download PengxiangLi/MAT --local-dir data/mat
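If you prefer to stay in Python, the same download can be done with the huggingface_hub API; this sketch mirrors the CLI call above:
# Equivalent of the huggingface-cli call above, using the huggingface_hub Python API.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="PengxiangLi/MAT", local_dir="data/mat")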
You need to unzip files.zip in data/mat to get the images. The dataset format we released is different from what LLaMA-Factory supports, so I wrote a simple script to do the conversion (run it from the LLaMA-Factory root so the relative paths resolve).
import json
import os

from tqdm import tqdm

data_path = "data/mat/mat_train.json"
with open(data_path, "r") as f:
    data = json.load(f)

processed_data = []
for item in tqdm(data):
    images = item["image"]
    conversation = item["conversations"]
    conversation_processed = []
    system_prompt = None

    for conv in conversation:
        if conv["role"] == "user":
            content = conv["content"]
            # Replace the named placeholders (e.g. "<image_00>") with the
            # generic "<image>" token that LLaMA-Factory expects.
            try:
                temp = json.loads(images)
                for key in temp.keys():
                    content = content.replace(key, "<image>")
            except Exception:
                pass
            conversation_processed.append({"role": "user", "content": content})
        elif conv["role"] == "system":
            system_prompt = conv["content"]
        else:
            conversation_processed.append({"role": "assistant", "content": conv["content"]})

    # The "image" field is a JSON-encoded dict mapping placeholders to paths,
    # or an (almost) empty string for text-only samples.
    if isinstance(images, str) and len(images) <= 2:
        images = []
    elif isinstance(images, str) and len(images) > 2:
        try:
            images = json.loads(images)
            images_processed = []
            for i in range(len(images)):
                images_processed.append(images[f"<image_0{i}>"])
            images = images_processed
        except Exception:
            images = [images]
    else:
        raise ValueError(f"Invalid images type: {type(images)}")

    # Point the image paths at the unzipped files under data/mat/tongagent/.
    for i in range(len(images)):
        raw = images[i]
        images[i] = f"data/mat/tongagent/{images[i].replace('data/open_llava_next/', '')}"
        assert os.path.exists(images[i]), f"Image {raw} {type(raw)} does not exist"

    processed_item = {
        "messages": conversation_processed,
        "images": images,
        "system": system_prompt,
    }
    processed_data.append(processed_item)

# Only the first 500 samples are written here; drop the slice to keep the full set.
with open("data/mat_train_processed.json", "w") as f:
    json.dump(processed_data[:500], f, indent=4)
Note: in this script I rewrite the image paths and replace the named image placeholders with the generic <image> token. You should double-check the image paths, since you may have downloaded or unzipped the files to a different location.
This will give you a structure similar to data/mllm_demo.json. You will end up with something like this:
[
  {
    "messages": [
      {
        "content": "<image>Who are they?",
        "role": "user"
      },
      {
        "content": "They're Kane and Gretzka from Bayern Munich.",
        "role": "assistant"
      },
      {
        "content": "What are they doing?",
        "role": "user"
      },
      {
        "content": "They are celebrating on the soccer field.",
        "role": "assistant"
      }
    ],
    "images": [
      "mllm_demo_data/1.jpg"
    ]
  },
  ...
]
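Before wiring the file into LLaMA-Factory, it is worth a quick sanity check: in this sharegpt-style multimodal format, the number of <image> placeholders in each sample's messages must match the length of its images list, and every image path must exist. A minimal check along those lines, assuming the output path used by the conversion script above:
import json
import os

# Path written by the conversion script above.
with open("data/mat_train_processed.json", "r") as f:
    samples = json.load(f)

for idx, sample in enumerate(samples):
    n_placeholders = sum(msg["content"].count("<image>") for msg in sample["messages"])
    n_images = len(sample["images"])
    assert n_placeholders == n_images, f"sample {idx}: {n_placeholders} <image> tags vs {n_images} images"
    for path in sample["images"]:
        assert os.path.exists(path), f"sample {idx}: missing image {path}"

print(f"Checked {len(samples)} samples.")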
Step 3: Configure LLaMA-Factory
Now that the dataset is ready, you need to register it in LLaMA-Factory's dataset configuration. In data/dataset_info.json, simply add:
...
  "mat": {
    "file_name": "mat_train_processed.json",
    "formatting": "sharegpt",
    "columns": {
      "messages": "messages",
      "images": "images",
      "system": "system"
    },
    "tags": {
      "role_tag": "role",
      "content_tag": "content",
      "user_tag": "user",
      "assistant_tag": "assistant"
    }
  },
...
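If you want to double-check the registration, you can verify that dataset_info.json still parses and that the referenced file exists; by default, file_name is resolved relative to LLaMA-Factory's data/ directory, so run this from the repository root (a small optional check, not part of the official workflow):
import json
import os

# dataset_info.json lives in LLaMA-Factory's data/ directory, and file_name
# is resolved relative to that same directory by default.
with open("data/dataset_info.json", "r") as f:
    dataset_info = json.load(f)

assert "mat" in dataset_info, "the 'mat' entry is missing from dataset_info.json"
data_file = os.path.join("data", dataset_info["mat"]["file_name"])
assert os.path.exists(data_file), f"missing dataset file: {data_file}"
print(f"'mat' is registered and points at {data_file}")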
Write the following YAML file as the training config. This one is for MiniCPM-V-2_6; save it, for example, as examples/train_lora/minicpm_v_lora_sft_mat.yaml to match the command used in Step 4.
### model
model_name_or_path: openbmb/MiniCPM-V-2_6
image_resolution: 262144
video_resolution: 16384
trust_remote_code: true
### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all
### dataset
dataset: mat # video: mllm_video_demo
template: minicpm_v
cutoff_len: 10240
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16
### output
output_dir: saves/minicpm_v-2_6/lora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
### eval
# val_size: 0.1
# per_device_eval_batch_size: 1
# eval_strategy: steps
# eval_steps: 500
This one is for Qwen2-VL; save it, for example, as examples/train_lora/qwen2vl_lora_sft_mat.yaml.
### model
model_name_or_path: Qwen/Qwen2-VL-7B-Instruct
image_resolution: 262144
video_resolution: 16384
trust_remote_code: true
### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all
### dataset
dataset: mat # video: mllm_video_demo
template: qwen2_vl
cutoff_len: 10240
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16
### output
output_dir: saves/qwen2_vl-7b/lora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
### eval
# val_size: 0.1
# per_device_eval_batch_size: 1
# eval_strategy: steps
# eval_steps: 500
Step 4: Train MAT
Training is straightforward. Just run:
llamafactory-cli train examples/train_lora/qwen2vl_lora_sft_mat.yaml
# or
llamafactory-cli train examples/train_lora/minicpm_v_lora_sft_mat.yaml
Troubleshooting
AttributeError: 'MiniCPMVProcessor' object has no attribute 'audio_feature_extract'
When you run the training, you might encounter this error. It happens because MiniCPMVProcessor does not have an audio_feature_extract method, so I simply removed the code blocks related to audio. I suspect the problem comes from changes made for MiniCPM-o, which does have audio feature extraction, but those changes somehow break MiniCPM-V. I used LLaMA-Factory 0.9.2, so this might already be fixed in the latest version.
Example of modified code:
# in src/llamafactory/data/mm_plugin.py line 624
# Comment the following code
# if len(audios) != 0:
# audio_parts_ls = kwargs.get("audio_parts_ls", None)
# new_audios = []
# for audio in audios:
# if not isinstance(audio, np.ndarray):
# audio = librosa.load(audio, sr=processor.feature_extractor.sampling_rate)[0]
# new_audios.append(audio)
# audios_ls = []
# idx = 0
# for audio_parts in audio_parts_ls:
# audios_ls.append(new_audios[idx : idx + len(audio_parts)])
# idx += len(audio_parts)
# audio_features, audio_feature_lens, audio_phs = processor.audio_feature_extract(
# audios_ls,
# audio_parts_ls,
# chunk_input=True,
# sampling_rate=16000,
# )
# mm_inputs.update({"audio_features": audio_features, "audio_feature_lens": audio_feature_lens})
# if kwargs.get("ret_phs", False):
# mm_inputs.update({"audio_phs": audio_phs})
I'm happy to answer any questions, so please feel free to ask.