Tutorial: Training Multi-modal Agent Tuning (MAT) with LLaMA-Factory

Background

In our research work, Multi-modal Agent Tuning (MAT), we developed a framework for auto-generating multi-modal tool-usage trajectories (the 20K MM-Traj dataset), boosting the tool-use performance of MiniCPM and Qwen-VL by 20%. This work was accepted at ICLR 2025.

At the time we did this work, LLaMA-Factory did not yet support training Qwen-VL and MiniCPM, so we had to modify the official code released by the Qwen-VL and MiniCPM teams. As a result, training these two models in our code required two separate codebases, which is not very convenient.

In this tutorial, I will show you how to use the latest LLaMA-Factory to train MAT, so that you only need to download the dataset and use a single codebase for training.

Tutorial

Step 1: Install LLaMA-Factory

This is very simple; just follow the official instructions.

conda create -n mat python=3.10
conda activate mat
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"

Step 2: Download and parse dataset

You can download the dataset from HF with huggingface-cli.

# assume you are in the root directory of LLaMA-Factory
mkdir data/mat
huggingface-cli download PengxiangLi/MAT --local-dir data/mat

You need to unzip files.zip in data/mat to get the images. The dataset format we released also differs from what LLaMA-Factory supports, so I wrote a simple script to do the conversion; both steps are shown below.
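
First, extract the image archive. Here is a minimal sketch using Python's zipfile module (just a convenience; a system unzip tool works equally well). It assumes the archive sits at data/mat/files.zip as downloaded above and extracts it in place; if the extracted layout does not produce paths under data/mat/tongagent/, adjust the target directory or the path rewrite in the conversion script.

import zipfile

# Extract the image archive next to where it was downloaded (target path is an assumption).
with zipfile.ZipFile("data/mat/files.zip") as zf:
    zf.extractall("data/mat")

Then run the conversion script: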

import json
import os

from tqdm import tqdm

data_path = "data/mat/mat_train.json"

with open(data_path, "r") as f:
    data = json.load(f)

processed_data = []
for item in tqdm(data):
    images = item["image"]
    conversation = item["conversations"]
    conversation_processed = []
    system_prompt = None
    for conv in conversation:
        if conv["role"] == "user":
            content = conv["content"]
            # Replace dataset placeholders such as <image_00> with LLaMA-Factory's <image> token.
            try:
                placeholder_map = json.loads(images)
                for key in placeholder_map:
                    content = content.replace(key, "<image>")
            except Exception:
                # No image placeholders to replace for this sample.
                pass
            conversation_processed.append({"role": "user", "content": content})
        elif conv["role"] == "system":
            system_prompt = conv["content"]
        else:
            conversation_processed.append({"role": "assistant", "content": conv["content"]})

    # "image" is a JSON string: "{}" when the sample has no images, otherwise a
    # dict mapping placeholders like <image_00> to file paths.
    if isinstance(images, str) and len(images) <= 2:
        images = []
    elif isinstance(images, str) and len(images) > 2:
        try:
            image_map = json.loads(images)
            # Keys are <image_00>, <image_01>, ...; keep them in order.
            images = [image_map[f"<image_0{i}>"] for i in range(len(image_map))]
        except Exception:
            # Fall back to treating the raw string as a single image path.
            images = [images]
    else:
        raise ValueError(f"Invalid images type: {type(images)}")

    # Rewrite the image paths to where the images were extracted locally.
    for i, raw in enumerate(images):
        images[i] = f"data/mat/tongagent/{raw.replace('data/open_llava_next/', '')}"
        assert os.path.exists(images[i]), f"Image {raw} {type(raw)} does not exist"

    processed_data.append({
        "messages": conversation_processed,
        "images": images,
        "system": system_prompt,
    })

with open("data/mat_train_processed.json", "w") as f:
    # Write all converted samples; slice (e.g. processed_data[:500]) for a quick test run.
    json.dump(processed_data, f, indent=4)

Note: This script rewrites the image paths and replaces the image placeholders. You should double-check the image paths, since you may have downloaded or extracted the images somewhere else, and run the script from the LLaMA-Factory root directory so the relative paths resolve. A quick sanity-check sketch follows the example below.

This gives you a structure similar to data/mllm_demo.json; the converted file looks something like this:

[
  {
    "messages": [
      {
        "content": "<image>Who are they?",
        "role": "user"
      },
      {
        "content": "They're Kane and Gretzka from Bayern Munich.",
        "role": "assistant"
      },
      {
        "content": "What are they doing?",
        "role": "user"
      },
      {
        "content": "They are celebrating on the soccer field.",
        "role": "assistant"
      }
    ],
    "images": [
      "mllm_demo_data/1.jpg"
    ]
  },
  ...
]
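
Before registering the dataset, it may be worth a quick sanity check on the converted file. The sketch below is my own addition, not part of the original pipeline: it verifies that every listed image exists on disk and that the number of <image> placeholders in each sample matches the number of images, which LLaMA-Factory expects to line up.

import json
import os

# Quick sanity check on the converted dataset before training.
with open("data/mat_train_processed.json", "r") as f:
    samples = json.load(f)

for idx, sample in enumerate(samples):
    # Every referenced image should exist on disk.
    for path in sample["images"]:
        assert os.path.exists(path), f"sample {idx}: missing image {path}"
    # The <image> placeholder count should match the number of images.
    num_tags = sum(msg["content"].count("<image>") for msg in sample["messages"])
    assert num_tags == len(sample["images"]), (
        f"sample {idx}: {num_tags} <image> tags vs {len(sample['images'])} images"
    )

print(f"Checked {len(samples)} samples.")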

Step 3: Configure the LLaMA-Factory

Now that the dataset is ready, you need to register it in LLaMA-Factory's dataset configuration. In data/dataset_info.json, simply add:

...
  "mat": {
    "file_name": "mat_train_processed.json",
    "formatting": "sharegpt",
    "columns": {
      "messages": "messages",
      "images": "images",
      "system": "system"
    },
    "tags": {
      "role_tag": "role",
      "content_tag": "content",
      "user_tag": "user",
      "assistant_tag": "assistant"
    }
  },
  ...

Write the following YAML file as the training config. This one is for MiniCPM-V-2_6; save it as examples/train_lora/minicpm_v_lora_sft_mat.yaml so it matches the command in Step 4.

### model
model_name_or_path: openbmb/MiniCPM-V-2_6
image_resolution: 262144
video_resolution: 16384
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all

### dataset
dataset: mat  # video: mllm_video_demo
template: minicpm_v
cutoff_len: 10240
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/minicpm_v-2_6/lora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
# val_size: 0.1
# per_device_eval_batch_size: 1
# eval_strategy: steps
# eval_steps: 500

This one is for Qwen2-VL; save it as examples/train_lora/qwen2vl_lora_sft_mat.yaml.

### model
model_name_or_path: Qwen/Qwen2-VL-7B-Instruct
image_resolution: 262144
video_resolution: 16384
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all

### dataset
dataset: mat  # video: mllm_video_demo
template: qwen2_vl
cutoff_len: 10240
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/qwen2_vl-7b/lora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
# val_size: 0.1
# per_device_eval_batch_size: 1
# eval_strategy: steps
# eval_steps: 500

Step 4: Train MAT

Training is straightforward; just run one of the following:

llamafactory-cli train examples/train_lora/qwen2vl_lora_sft_mat.yaml
# or
llamafactory-cli train examples/train_lora/minicpm_v_lora_sft_mat.yaml

Troubleshooting

  1. AttributeError: 'MiniCPMVProcessor' object has no attribute 'audio_feature_extract'

When you run the training, you might encounter this error because MiniCPMVProcessor has no audio_feature_extract method. I simply removed all code blocks related to audio. I suspect the problem comes from modifications made for MiniCPM-O, which does have audio feature extraction, and this change somehow broke MiniCPM-V. Since I used LLaMA-Factory 0.9.2, the problem might already be fixed in the latest version.

Example of modified code:

# in src/llamafactory/data/mm_plugin.py line 624
# Comment the following code
# if len(audios) != 0:
#     audio_parts_ls = kwargs.get("audio_parts_ls", None)
#     new_audios = []
#     for audio in audios:
#         if not isinstance(audio, np.ndarray):
#             audio = librosa.load(audio, sr=processor.feature_extractor.sampling_rate)[0]
#         new_audios.append(audio)

#     audios_ls = []
#     idx = 0
#     for audio_parts in audio_parts_ls:
#         audios_ls.append(new_audios[idx : idx + len(audio_parts)])
#         idx += len(audio_parts)

#     audio_features, audio_feature_lens, audio_phs = processor.audio_feature_extract(
#         audios_ls,
#         audio_parts_ls,
#         chunk_input=True,
#         sampling_rate=16000,
#     )
#     mm_inputs.update({"audio_features": audio_features, "audio_feature_lens": audio_feature_lens})
#     if kwargs.get("ret_phs", False):
#         mm_inputs.update({"audio_phs": audio_phs})

I am happy to answer any questions; please feel free to ask.