If you fine-tune a multimodal model with LoRA on text-only data and don't filter your target_modules, any adapters PEFT inserts on the vision encoder receive zero gradient signal. In the case measured below, that silently wastes ~20% of the trainable parameter budget on layers that never move from initialization.
The setup
You're fine-tuning Qwen3.6-27B (or Gemma-4-27B-vision, or any other multimodal LLM) on a text-only reasoning corpus. You apply LoRA with a suffix-style target list such as target_modules=["q_proj", "v_proj", "down_proj"], the standard pattern. You train on 50,000 examples and evaluate.
Your trainable parameter count looks fine on paper: 491 adapted modules at rank 16, each contributing r × (d_in + d_out) parameters, for ~110 M trainable in total. Your eval results are mixed: better than the BF16 baseline on some tasks, worse than expected on others. You inspect the trained adapter and find that some adapter weights are bit-for-bit identical to their initialization values.
This is the visual-block gradient sink.
What's happening
Multimodal models are typically structured as:
model.
├── visual.                           # vision encoder (CLIP-like or similar)
│   ├── blocks.0.attn.qkv             # fused `qkv`: a `q_proj` suffix does not match this, but broader targets do
│   ├── blocks.0.attn.proj
│   ├── blocks.0.mlp.linear_fc1
│   └── ...
└── language_model.                   # text decoder (transformer)
    ├── layers.0.self_attn.q_proj
    ├── layers.0.self_attn.v_proj
    ├── layers.0.mlp.down_proj
    └── ...
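To see which of the names in the tree a list-style target_modules would pick up, the sketch below emulates the suffix rule PEFT applies to lists (a module matches when its name equals a target or ends with "." plus the target). The helper names `matches_suffix` and `matched` and the name list are illustrative assumptions, not PEFT's own code.

```python
# Illustrative module names taken from the tree above.
MODULE_NAMES = [
    "model.visual.blocks.0.attn.qkv",
    "model.visual.blocks.0.attn.proj",
    "model.visual.blocks.0.mlp.linear_fc1",
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.self_attn.v_proj",
    "model.language_model.layers.0.mlp.down_proj",
]


def matches_suffix(name: str, targets: list[str]) -> bool:
    """Hypothetical helper: exact match, or name ends with '.<target>'."""
    return any(name == t or name.endswith("." + t) for t in targets)


def matched(targets: list[str]) -> list[str]:
    """Names from MODULE_NAMES that a given target list would adapt."""
    return [n for n in MODULE_NAMES if matches_suffix(n, targets)]


narrow = matched(["q_proj", "v_proj", "down_proj"])
broad = matched(["q_proj", "v_proj", "down_proj", "qkv", "proj", "linear_fc1"])

print(len(narrow), "modules matched by the narrow list")
print(len(broad), "modules matched by the broad list")
print([n for n in broad if ".visual." in n])
```

With these particular names, the narrow list leaves the vision encoder alone. The waste appears once the target list also covers the vision encoder's names, or once the encoder's layer names overlap with the decoder's.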
When you train with text-only inputs (no images), the vision encoder's forward pass either:
- Doesn't run at all (if the input pipeline skips vision when no image is provided): the visual modules receive no gradient at all
- Runs on a placeholder zero tensor (some implementations): same effect, zero gradient
- Runs on real image features, but the loss doesn't depend on them for text-only outputs: same effect, by the chain rule
In all three cases, LoRA adapters on visual modules see zero gradient on every step. They never move from their initialization: lora_A random (Kaiming-uniform by default in PEFT, Gaussian if you request it) and lora_B all zeros. They count toward your trainable parameter budget but contribute nothing.
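The chain-rule argument can be written out for a single adapted layer. This is a sketch under the usual LoRA parameterisation; the symbols W (frozen weight), A and B (adapter matrices), x (layer input), h (layer output), s (the LoRA scaling factor) and L (the loss) are notation introduced here, not taken from the measurements in this post.

```latex
h = W x + s\,B A x,
\qquad
\frac{\partial \mathcal{L}}{\partial B}
  = s\,\frac{\partial \mathcal{L}}{\partial h}\,(A x)^{\top},
\qquad
\frac{\partial \mathcal{L}}{\partial A}
  = s\,B^{\top}\,\frac{\partial \mathcal{L}}{\partial h}\,x^{\top}
```

Both gradients carry the factor ∂L/∂h. When the layer never runs, or the loss does not depend on its output, that factor is zero, so both gradients vanish on every step and B stays at its zero initialization.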
How big is the silent waste?
We measured on Qwen3.6-27B (a multimodal hybrid SSM+attention model):
| Module category | Module count | Trainable params at r=16 |
|---|---|---|
| Vision encoder linears | 83 | ~22 M (20%) |
| Language model linears | 408 | ~89 M (80%) |
| Total if you don't filter | 491 | 111 M |
| Total if you filter visual | 408 | 89 M |
If your target_modules matches vision-encoder layers, 20% of your LoRA budget is silently allocated to dead weight. That is not just inefficient: it makes "trainable parameter count" comparisons between multimodal and text-only models meaningless until you filter.
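The arithmetic behind the table can be sketched in a few lines. The per-layer formula is the standard LoRA count (A is rank × d_in, B is d_out × rank); the function names and the 5120-wide layer shape are hypothetical, chosen only for illustration. The 22 M and 89 M figures are the ones from the table.

```python
def lora_params(rank: int, d_in: int, d_out: int) -> int:
    """Trainable parameters LoRA adds to one linear layer.

    A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * (d_in + d_out)


def budget_share(part_millions: float, total_millions: float) -> float:
    """Fraction of the trainable budget held by one module category."""
    return part_millions / total_millions


# Hypothetical layer shape, used only to illustrate the per-layer formula.
per_layer = lora_params(rank=16, d_in=5120, d_out=5120)
print(per_layer)  # 163840

# Figures from the table above, in millions of parameters.
vision_m, language_m = 22, 89
share = budget_share(vision_m, vision_m + language_m)
print(f"vision share of budget: {share:.1%}")  # 19.8%
```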
How to filter
Two clean approaches:
Option 1, Regex-based exclusion (preferred):
target_modules = r"model\.language_model\..*\.(q_proj|v_proj|down_proj)$"
This matches only modules under model.language_model.*; the vision encoder is excluded by name. When target_modules is a single string, PEFT treats it as a regex and tests it against each full module name with re.fullmatch, so adapters land only where the pattern matches.
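A quick way to sanity-check the pattern before training is to run it over a handful of module names with plain re.fullmatch. The names below are illustrative; the `model.visual.blocks.0.attn.q_proj` entry is hypothetical, added to show a vision layer whose suffix collides with a language-model suffix.

```python
import re

# Pattern from the text, applied with re.fullmatch as described above.
PATTERN = r"model\.language_model\..*\.(q_proj|v_proj|down_proj)$"

# Illustrative names; the visual q_proj entry is a hypothetical collision case.
names = [
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.mlp.down_proj",
    "model.language_model.layers.0.self_attn.k_proj",
    "model.visual.blocks.0.attn.qkv",
    "model.visual.blocks.0.attn.q_proj",
]

# Keep only names the regex matches in full.
selected = [n for n in names if re.fullmatch(PATTERN, n)]
for n in selected:
    print("adapt:", n)
```

Only the two language-model q_proj and down_proj entries are selected. Model wrappers can change the leading prefix, so it is worth printing a few names from your own model's named_modules() and adjusting the pattern if the prefix differs.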
Option 2, Pre-filter your module list:
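import re
import torch.nn as nn
from peft import LoraConfig
all_target_paths = [(path, m) for path, m in model.named_modules() if isinstance(m, nn.Linear)]  # named_modules() yields (path, module) pairs
language_only = [path for path, _ in all_target_paths if "visual" not in path and "vision" not in path]
config = LoraConfig(target_modules="|".join(re.escape(p) for p in language_only))  # add r, lora_alpha, etc. as needed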
More verbose but lets you also exclude mtp., multi_token_pred., or other architectural components you don't want adapted.
How to detect the problem after the fact
If you've already trained and you're not sure whether your adapters covered visual modules, this short check tells you:
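import torch
from peft import PeftModel

# base_model and adapter_path are assumed to be defined by the caller
adapter = PeftModel.from_pretrained(base_model, adapter_path)
for name, module in adapter.named_modules():
    # lora_B itself is a ModuleDict; its per-adapter children (e.g. lora_B.default) hold the weights
    if "lora_B" in name and "visual" in name and isinstance(module, torch.nn.Linear):
        if torch.all(module.weight == 0):
            print(f"DEAD: {name} (B is still zero, adapter never trained)")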
Any "DEAD" output means visual modules have untrained lora_B weights: they are still at their zero initialization, confirming they received no gradient. Drop them and retrain with the language-only filter, so that every parameter in your trainable budget sits on a module that actually receives signal.
Why this matters for paper-grade comparisons
Suppose paper A claims "LoRA at rank=16 with 110M trainable parameters achieves X% on MMLU-Pro" and paper B claims "LoRA at rank=16 with 89M trainable parameters achieves X% on MMLU-Pro." Naively, paper B wins on parameter efficiency.
But if paper A trained on a multimodal model with text-only data and included vision modules in target_modules, the effective trainable parameter count for A is 89M, not 110M. The ~21M parameters in vision modules contributed nothing. The two papers are equivalent on effective LoRA budget, and the apparent "B beats A on parameter efficiency" is a measurement artifact.
Any LoRA comparison study that doesn't audit visual-module adapters risks overstating trainable parameter counts in this way, and with them the apparent parameter waste of the method being compared.
When you actually want LoRA on visual modules
Three legitimate cases:
- You're training on multimodal data: image-and-text inputs. Vision modules see real activations and meaningful gradients. Adapt them.
- You're doing visual-only fine-tuning: image classification, image captioning, OCR. Language modules become the dead-weight side; you'd want a regex like model\.visual\..* to exclude language.
- You're studying cross-modal transfer: initializing a visual adapter from text-task gradients to see if anything transfers via shared layers (some recent papers explore this for vision-language alignment). Note that this is a research-grade question, not a default mode.
The default for text-only SFT is to filter visual modules out of target_modules. Make the exclusion explicit; don't trust suffix matching to handle it.
The shorter rule
For text-only fine-tuning of a multimodal model, always restrict target_modules to model.language_model.* paths. Verify with the dead-weight check after a single training step. If a sizeable share of your lora_B weights (about 20% of the LoRA parameters in the case above) is still exactly zero, you have the bug.
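A cheaper, complementary guard is to check parameter names before any training step. This is a name-based check, not the zero-weight check itself. The function `visual_lora_params`, its `markers` default, and the example names are all hypothetical; the markers are an assumption about how the vision encoder is named in your model.

```python
def visual_lora_params(param_names, markers=("visual", "vision")):
    """Return LoRA parameter names that sit under a vision-encoder path.

    `markers` is an assumption about vision-encoder naming; adjust it
    to match your model.
    """
    return [
        n for n in param_names
        if "lora_" in n and any(m in n for m in markers)
    ]


# Hypothetical parameter names in the style used for injected LoRA adapters.
example_names = [
    "base_model.model.model.language_model.layers.0.self_attn.q_proj.lora_A.default.weight",
    "base_model.model.model.language_model.layers.0.self_attn.q_proj.lora_B.default.weight",
    "base_model.model.model.visual.blocks.0.attn.qkv.lora_B.default.weight",
]

offenders = visual_lora_params(example_names)
print(offenders)
```

In a real run the names could come from `[n for n, _ in peft_model.named_parameters()]`, where `peft_model` stands for your wrapped model. An empty result means no adapter landed on the vision encoder.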
Source: observed on Qwen3.6-27B fine-tuning experiments, 2026. Pattern generalises to any multimodal LLM with an unused vision encoder during training.