Multimodal LoRA on Text-Only Data: The Visual-Block Gradient Sink

May 2026 · Black Sheep AI Research

If you fine-tune a multimodal model with LoRA on text-only data and don't filter your target_modules, the adapters that PEFT inserts on the vision encoder receive zero gradient signal, and you've silently wasted ~20% of your trainable parameter budget on layers that won't move from initialization.

The setup

You're fine-tuning Qwen3.6-27B (or Gemma-4-27B-vision, or any other multimodal LLM) on a text-only reasoning corpus. You apply LoRA with target_modules=["q_proj", "v_proj", "down_proj"], the standard suffix-name pattern. You train on 50,000 examples and evaluate.

Your trainable parameter count looks fine on paper: 491 adapted modules at rank 16, with hidden dimensions in the ~5K range, for roughly 110 M trainable parameters. Your eval results are mixed: better than the BF16 baseline on some tasks, worse than expected on others. You inspect the trained adapter and find that some adapter weights are bit-for-bit identical to their initialization values.

This is the visual-block gradient sink.
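The budget arithmetic in the setup can be sanity-checked by hand. A LoRA adapter on a linear layer mapping d_in to d_out adds r × (d_in + d_out) trainable parameters, because lora_A is r × d_in and lora_B is d_out × r. The sketch below is illustrative only: the helper names are hypothetical, and the square 5120-wide projection is an assumed shape, not a measured Qwen3.6 dimension.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer (no bias terms)."""
    # lora_A has shape (r, d_in); lora_B has shape (d_out, r).
    return r * d_in + d_out * r


def total_lora_params(shapes, r: int) -> int:
    """Sum LoRA parameters over an iterable of (d_in, d_out) layer shapes."""
    return sum(lora_params(d_in, d_out, r) for d_in, d_out in shapes)


# Assumed example: one square 5120 x 5120 projection at rank 16.
per_module = lora_params(5120, 5120, 16)
print(per_module)  # 163840
```

Feeding `total_lora_params` the real (d_in, d_out) pairs of the adapted layers gives the figure to compare against whatever count your training framework reports.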

What's happening

Multimodal models are typically structured as:

model.
├── visual.                    # vision encoder (CLIP-like or similar)
│   ├── blocks.0.attn.qkv      # ← fused `qkv`; a `q_proj` suffix does not match this name
│   ├── blocks.0.attn.proj
│   └── blocks.0.mlp.linear_fc1
│   ...
└── language_model.            # text decoder (transformer)
    ├── layers.0.self_attn.q_proj
    ├── layers.0.self_attn.v_proj
    └── layers.0.mlp.down_proj
    ...
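Whether a suffix list actually reaches the vision tower depends on how that tower names its layers, so it is worth checking rather than assuming. The sketch below mirrors, in simplified form, how PEFT treats a list-valued target_modules (exact name or dot-delimited suffix). `matches_target` is a hypothetical helper, and the last module name is an assumed CLIP-style path used for illustration, not one taken from Qwen3.6.

```python
def matches_target(name: str, targets: list[str]) -> bool:
    """Simplified list-mode rule: exact name, or a dot-delimited suffix."""
    return any(name == t or name.endswith("." + t) for t in targets)


targets = ["q_proj", "v_proj", "down_proj"]

names = [
    "model.visual.blocks.0.attn.qkv",    # from the tree above
    "model.visual.blocks.0.attn.proj",   # from the tree above
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.mlp.down_proj",
    # Assumed CLIP-style tower that reuses the decoder's layer names:
    "model.vision_tower.encoder.layers.0.self_attn.q_proj",
]

matched = [n for n in names if matches_target(n, targets)]
for n in matched:
    print(n)
```

Running the same check over the names from your own model's named_modules() before training shows exactly which vision layers, if any, a given target list will pick up.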

When you train with text-only inputs (no images), the vision encoder contributes nothing to the loss, whether because its forward pass is skipped or because its output is never used.

Either way, LoRA adapters on visual modules see zero gradient on every step. They never move from their initialization (lora_A random, Kaiming-uniform under PEFT's default; lora_B zeros). They take up trainable parameter budget but contribute nothing.
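To see why the gradient is exactly zero rather than merely small, write the adapted layer's output as h and the upstream gradient as g. This is a short derivation under the standard LoRA parameterization, with scaling α/r; the symbols are introduced here for illustration.

```latex
\begin{aligned}
h &= W x + \tfrac{\alpha}{r}\, B A x, \qquad g = \frac{\partial \mathcal{L}}{\partial h},\\
\frac{\partial \mathcal{L}}{\partial B} &= \tfrac{\alpha}{r}\, g\,(A x)^{\top},\\
\frac{\partial \mathcal{L}}{\partial A} &= \tfrac{\alpha}{r}\, B^{\top} g\, x^{\top}.
\end{aligned}
```

If h never reaches the loss, then g = 0 and both gradients vanish on every step. Separately, because B = 0 at initialization, the gradient for A is zero on the first step even in modules that do receive signal, which is why lora_B is the weight to inspect when looking for dead adapters.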

How big is the silent waste?

We measured on Qwen3.6-27B (a multimodal hybrid SSM+attention model):

Module category              Module count   Trainable params at r=16
Vision encoder linears       83             ~22 M (20%)
Language model linears       408            ~89 M (80%)
Total if you don't filter    491            111 M
Total if you filter visual   408            89 M

20% of your LoRA budget is silently allocated to dead weight if your target_modules includes suffixes that match vision modules. That's not just inefficient: it makes any "trainable parameter count" comparison between multimodal and text-only models meaningless until you filter.
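A split like the one in the table can be reproduced by bucketing trainable parameters by name. The following is a minimal sketch, assuming vision modules can be recognized by the substrings "visual" or "vision" in their path; `tally_by_tower` is a hypothetical helper and the toy sizes are made up for illustration.

```python
from collections import defaultdict


def tally_by_tower(named_sizes, vision_markers=("visual", "vision")):
    """Sum parameter counts into 'vision' and 'language' buckets by name."""
    totals = defaultdict(int)
    for name, numel in named_sizes:
        is_vision = any(marker in name for marker in vision_markers)
        totals["vision" if is_vision else "language"] += numel
    return dict(totals)


# Toy input; with a real model, pass
# ((n, p.numel()) for n, p in model.named_parameters() if p.requires_grad).
toy = [
    ("model.visual.blocks.0.attn.qkv.lora_A.default.weight", 100),
    ("model.visual.blocks.0.attn.qkv.lora_B.default.weight", 300),
    ("model.language_model.layers.0.self_attn.q_proj.lora_A.default.weight", 800),
    ("model.language_model.layers.0.self_attn.q_proj.lora_B.default.weight", 800),
]
totals = tally_by_tower(toy)
vision_share = totals["vision"] / sum(totals.values())
print(totals, f"{vision_share:.0%}")
```

Reporting the language-side total alongside the headline number makes parameter counts comparable across multimodal and text-only base models.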

How to filter

Two clean approaches:

Option 1, Regex-based exclusion (preferred):

target_modules = r"model\.language_model\..*\.(q_proj|v_proj|down_proj)$"

This matches only modules under model.language_model.*; the vision encoder is excluded by name. When target_modules is a string, PEFT treats it as a regex and tests it against each full module name with re.fullmatch, so adapters land precisely where you want them.
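The pattern can be dry-run against a handful of module names with the standard library before any model is loaded. This is only a sketch: the sample names follow the tree shown earlier, and the last one is a hypothetical vision path that reuses decoder-style naming.

```python
import re

# The pattern from Option 1. fullmatch must consume the whole module name,
# so the trailing `$` is harmless but redundant.
PATTERN = r"model\.language_model\..*\.(q_proj|v_proj|down_proj)$"

sample_names = [
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.self_attn.k_proj",
    "model.language_model.layers.0.mlp.down_proj",
    "model.visual.blocks.0.attn.qkv",
    # Hypothetical vision path that reuses decoder-style names:
    "model.visual.blocks.0.attn.q_proj",
]

selected = [n for n in sample_names if re.fullmatch(PATTERN, n)]
print(selected)
```

Because the match is anchored at both ends, a checkpoint that nests the decoder under a different prefix would match nothing, so it is worth running the pattern over the real names from model.named_modules() once.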

Option 2, Pre-filter your module list:

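import re
import torch.nn as nn
from peft import LoraConfig
linear_paths = [name for name, mod in model.named_modules() if isinstance(mod, nn.Linear)]  # every nn.Linear, by full name
language_only = [p for p in linear_paths if "visual" not in p and "vision" not in p]  # drop the vision encoder
config = LoraConfig(r=16, target_modules="|".join(re.escape(p) for p in language_only))  # rank from the setup; add your other arguments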

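More verbose, but it lets you also exclude mtp., multi_token_pred., or other architectural components you don't want adapted. Note that, as written, it adapts every nn.Linear outside the vision encoder, not only q_proj, v_proj, and down_proj; add a suffix check if you want parity with Option 1.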
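One design choice in the pre-filter is whether to exclude by substring or by path component. Substring checks such as "vision" in path can also drop unrelated layers whose names merely contain that text. A hedged alternative is to compare whole dot-separated components; `keep_path` is a hypothetical helper, and the MTP and revision_proj paths below are invented to show the difference.

```python
def keep_path(path: str, excluded_components: set[str]) -> bool:
    """True if no dot-separated component of `path` is in the excluded set."""
    return excluded_components.isdisjoint(path.split("."))


# "visual", "mtp" and "multi_token_pred" come from the text; "vision_tower"
# is an assumed name for towers that use that convention.
EXCLUDED = {"visual", "vision_tower", "mtp", "multi_token_pred"}

paths = [
    "model.language_model.layers.0.mlp.down_proj",
    "model.visual.blocks.0.mlp.linear_fc1",
    "model.mtp.layers.0.self_attn.q_proj",           # invented MTP head path
    "model.language_model.layers.3.revision_proj",   # invented; contains "vision"
]

kept = [p for p in paths if keep_path(p, EXCLUDED)]
print(kept)
```

The component-based filter keeps the invented revision_proj layer, which a plain substring test on "vision" would have discarded.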

How to detect the problem after the fact

If you've already trained and you're not sure whether your adapters covered visual modules, this short check tells you:

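import torch
from peft import PeftModel

# base_model and adapter_path are the base model and adapter directory you trained with
adapter = PeftModel.from_pretrained(base_model, adapter_path)

# Iterate parameters rather than modules: the lora_B container module has no .weight
for name, param in adapter.named_parameters():
    if "lora_B" in name and "visual" in name:
        if torch.all(param == 0):
            print(f"DEAD: {name}  (B is still zero, adapter never trained)")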

Any "DEAD" output means a visual module's lora_B is still at its zero initialization, which confirms it received no gradient. Drop those adapters and retrain with the language-only filter, so that every parameter you count as trainable sits on a module that actually receives signal.

Why this matters for paper-grade comparisons

Suppose paper A claims "LoRA at rank=16 with 110M trainable parameters achieves X% on MMLU-Pro" and paper B claims "LoRA at rank=16 with 89M trainable parameters achieves X% on MMLU-Pro." Naively, paper B wins on parameter efficiency.

But if paper A trained on a multimodal model with text-only data and included vision modules in target_modules, the effective trainable parameter count for A is 89M, not 110M. The 20M parameters in vision modules contributed zero. The two papers are equivalent on effective LoRA budget, and the apparent "B beats A on parameter efficiency" is a measurement artifact.

Any LoRA comparison study that doesn't audit visual-module adapters will silently overstate the trainable parameter count of its multimodal runs.

When you actually want LoRA on visual modules

Three legitimate cases:

  1. You're training on multimodal data (image-and-text inputs). Vision modules see real activations and meaningful gradients. Adapt them.
  2. You're doing visual-only fine-tuning: image classification, image captioning, OCR. Language modules become the dead-weight side; you'd want a regex like model\.visual\..* to exclude language.
  3. You're studying cross-modal transfer: initializing a visual adapter from text-task gradients to see if anything transfers via shared layers (some recent papers explore this for vision-language alignment). This is a research-grade question, not a default mode.

The default for text-only SFT is to filter visual modules out of target_modules. Make the exclusion explicit; don't trust suffix matching to handle it.

The shorter rule

For text-only fine-tuning of a multimodal model, always restrict target_modules to model.language_model.* paths. Verify with the dead-weight check after the first optimizer step that uses a nonzero learning rate (a warmup schedule that starts at zero leaves every lora_B untouched on step one). If 20% of your lora_B weights are still at zero, you have the bug.
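The rule above can be turned into a small pass/fail summary. This sketch assumes you have already reduced each lora_B tensor to a single all-zero flag; `dead_fraction` is a hypothetical helper, and it counts tensors rather than individual parameters.

```python
def dead_fraction(lora_b_is_zero):
    """Fraction of lora_B tensors still all-zero, plus their sorted names.

    `lora_b_is_zero` maps parameter name -> True if that tensor is all zeros.
    """
    dead = sorted(name for name, is_zero in lora_b_is_zero.items() if is_zero)
    fraction = len(dead) / len(lora_b_is_zero) if lora_b_is_zero else 0.0
    return fraction, dead


# Toy flags; with a loaded PEFT model they could be built as
# {n: bool((p == 0).all()) for n, p in model.named_parameters() if "lora_B" in n}.
flags = {
    "model.visual.blocks.0.attn.qkv.lora_B.default.weight": True,
    "model.language_model.layers.0.self_attn.q_proj.lora_B.default.weight": False,
    "model.language_model.layers.0.self_attn.v_proj.lora_B.default.weight": False,
    "model.language_model.layers.0.mlp.down_proj.lora_B.default.weight": False,
    "model.language_model.layers.1.self_attn.q_proj.lora_B.default.weight": False,
}
fraction, dead = dead_fraction(flags)
print(f"{fraction:.0%} of lora_B tensors are still zero: {dead}")
```

A nonzero fraction whose names all sit under the vision encoder is the signature described in this post; dead tensors elsewhere point to a different problem.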


Source: observed on Qwen3.6-27B fine-tuning experiments, 2026. Pattern generalises to any multimodal LLM with an unused vision encoder during training.
