
3-minute Pitch: Robust Adaptation of Large Multimodal Models for Retrieval Augmented Hateful Meme Detection

Published in EMNLP 2025 Main Conference (Oral)

Posted by Jingbiao on September 13, 2025. Reading time: 4 minutes.

In our earlier work RGCL (ACL 2024), we showed that retrieval guidance over a CLIP-style encoder could sharpen hateful meme detection. The natural next question is whether Large Multimodal Models (LMMs) — which can both detect and explain — can enjoy the same retrieval benefits without losing their general vision-language skills. In RA-HMD we show that naive supervised fine-tuning (SFT) of LMMs on meme corpora under-delivers, degrades cross-domain generalisation, and causes catastrophic forgetting of general VL capabilities. RA-HMD proposes a two-stage robust adaptation recipe that fixes all three.


Motivation

Off-the-shelf LMMs and their SFT variants exhibit three systematic failures on hateful meme detection:

  1. Sub-optimal in-domain accuracy on benchmarks like HatefulMemes and HarMeme.
  2. Poor out-of-domain generalisation — accuracy collapses when the training distribution shifts (e.g., HarMeme → HatefulMemes).
  3. Catastrophic forgetting. Standard SFT erodes general vision-language competence on benchmarks such as MMMU, SEED-Bench and GQA, and leaves the model brittle to adversarial perturbation.

Prior retrieval-augmented detectors (including our own RGCL) built on frozen encoders with small logistic-regression heads. Transferring that idea onto a full generative LMM without destroying its reasoning capacity is the central technical challenge.

Two-Stage Robust Adaptation

RA-HMD adapts an LMM backbone (Qwen2-VL, LLaVA, InternVL) with LoRA, adding a trainable MLP projection and a Logistic Regression Classifier (LRC) head on top of the backbone’s pooled representation.
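To make the architecture concrete, here is a minimal sketch of the trainable head. The dimensions, pooling choice, and module names are illustrative assumptions for this post, not the released RA-HMD code; the LoRA adaptation of the backbone happens separately.

```python
# Sketch only: hidden_dim / proj_dim and the two-layer MLP are assumptions.
import torch
import torch.nn as nn

class RAHMDHead(nn.Module):
    def __init__(self, hidden_dim: int = 3584, proj_dim: int = 1024, num_classes: int = 2):
        super().__init__()
        # Trainable MLP projection producing the meme embedding g_i
        self.proj = nn.Sequential(
            nn.Linear(hidden_dim, proj_dim),
            nn.GELU(),
            nn.Linear(proj_dim, proj_dim),
        )
        # Logistic Regression Classifier (LRC) over the projected embedding
        self.lrc = nn.Linear(proj_dim, num_classes)

    def forward(self, pooled_hidden: torch.Tensor):
        g = self.proj(pooled_hidden)   # (batch, proj_dim) embedding used for retrieval
        logits = self.lrc(g)           # (batch, num_classes) hateful vs. benign scores
        return g, logits
```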

Stage 1 — Logistic-Regression-Augmented SFT. The LMM is trained jointly with \mathcal{L}_{\text{Stage1}} = \mathcal{L}_{\text{LM}} + \lambda \, \mathcal{L}_{\text{LRC}}, where \mathcal{L}_{\text{LM}} is the standard next-token loss producing the target label tokens (“hateful” / “benign”) together with a rationale, and \mathcal{L}_{\text{LRC}} is a cross-entropy on the LRC head. LoRA keeps the backbone mostly frozen, so the model’s general VL skills are preserved.
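A compact sketch of how the two terms can be combined in one training step; the loss weight lambda_lrc and the variable names are placeholders, not the paper's exact settings, and the prompt format / LoRA setup follow the backbone's own recipe.

```python
# Sketch of the Stage-1 objective (placeholder names, illustrative lambda).
import torch.nn.functional as F

def stage1_loss(lm_logits, target_token_ids, lrc_logits, labels, lambda_lrc: float = 1.0):
    # Next-token loss over the generated label tokens + rationale
    lm_loss = F.cross_entropy(
        lm_logits.view(-1, lm_logits.size(-1)),
        target_token_ids.view(-1),
        ignore_index=-100,   # mask prompt and padding positions
    )
    # Cross-entropy on the LRC head (hateful / benign)
    lrc_loss = F.cross_entropy(lrc_logits, labels)
    return lm_loss + lambda_lrc * lrc_loss
```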

Stage 2 — LMM Contrastive Fine-tuning. With the LMM frozen, we fine-tune the MLP + LRC with a contrastive objective on top of the LRC cross-entropy. Using FAISS nearest-neighbour search over the training-set embedding bank \mathbf{G}, each anchor \mathbf{g}_i receives:

  • a pseudo-gold positive \mathbf{g}_i^{+}: highest similarity, same label;
  • hard negatives \mathbf{G}_i^{-}: high similarity, opposite label.

The retrieval-guided contrastive term reuses the RGCL formulation: \mathcal{L}_{i}^{\text{RGCL}} = -\log \frac{ e^{\text{sim}(\mathbf{g}_{i},\mathbf{g}_{i}^{+})}}{ e^{\text{sim}(\mathbf{g}_{i},\mathbf{g}_{i}^{+})} + \sum_{\mathbf{g}\in\mathbf{G}_{i}^{-}} e^{\text{sim}(\mathbf{g}_i,\mathbf{g})}}.
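Below is a hedged sketch of the Stage-2 retrieval and loss computation. It assumes L2-normalised embeddings (so FAISS inner-product search is cosine similarity), and k, the number of hard negatives, and the neighbour-selection helper are illustrative choices that may differ from the paper's implementation.

```python
# Sketch of retrieval-guided contrastive training (Stage 2), illustrative settings.
import faiss
import numpy as np
import torch
import torch.nn.functional as F

def build_index(bank_embeds: np.ndarray) -> faiss.IndexFlatIP:
    # Inner-product index over the L2-normalised training-set embedding bank G
    index = faiss.IndexFlatIP(bank_embeds.shape[1])
    index.add(bank_embeds.astype(np.float32))
    return index

def select_neighbours(index, labels, anchor_embed, anchor_id, anchor_label, k=64, n_neg=8):
    # Retrieve top-k neighbours, drop the anchor itself, then split by label.
    # (A real implementation should handle the case where no same-label neighbour is found.)
    _, ids = index.search(anchor_embed[None].astype(np.float32), k)
    ids = [j for j in ids[0] if j != anchor_id]
    pos = next(j for j in ids if labels[j] == anchor_label)        # pseudo-gold positive
    negs = [j for j in ids if labels[j] != anchor_label][:n_neg]   # hard negatives
    return pos, negs

def rgcl_loss(g_i: torch.Tensor, g_pos: torch.Tensor, g_negs: torch.Tensor):
    # g_i: (d,) anchor; g_pos: (d,) positive; g_negs: (n_neg, d) hard negatives
    sim_pos = (g_i * g_pos).sum()        # sim(g_i, g_i^+)
    sim_negs = g_negs @ g_i              # sim(g_i, g) for g in G_i^-
    logits = torch.cat([sim_pos.unsqueeze(0), sim_negs])
    # Softmax cross-entropy with the positive in slot 0 reproduces the formula above
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```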

Inference. We combine the LRC head with Retrieval-based KNN Classification (RKC) majority voting over the retrieval bank, which lets operators update the detector by simply inserting new examples into the database, with no retraining required.
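A minimal sketch of the RKC vote at inference time, reusing the FAISS index from the Stage-2 sketch; k, the 0/1 label encoding, and the tie-breaking rule are illustrative assumptions.

```python
# Sketch of RKC majority voting at inference (illustrative k and tie-breaking).
import numpy as np

def rkc_predict(index, bank_labels: np.ndarray, query_embed: np.ndarray, k: int = 9) -> int:
    # Majority vote over the labels of the k nearest retrieval-bank entries
    _, ids = index.search(query_embed[None].astype(np.float32), k)
    votes = bank_labels[ids[0]]
    return int(np.round(votes.mean()))   # 1 = hateful, 0 = benign

# Updating the detector at deployment time is just growing the bank:
#   index.add(new_embeds.astype(np.float32))
#   bank_labels = np.concatenate([bank_labels, new_labels])
```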

Experiments

We evaluate on six meme classification benchmarks: HatefulMemes, HarMeme, MAMI, Harm-P, MultiOFF and PrideMM.

In-domain. On HatefulMemes, Qwen2-VL-7B + RA-HMD reaches 91.1 AUC / 82.1 Acc, compared with RGCL’s 87.04 AUC / 78.82 Acc and outperforming the much larger VPD-55B agentic baseline.

Cross-domain generalisation (HarMeme → HatefulMemes). RA-HMD with RKC inference achieves 77.1 AUC / 69.3 Acc, a +21.6 AUC gain over the SFT few-shot baseline.

Adversarial robustness (SaltPepper-I-High on HatefulMemes). RA-HMD + RKC degrades by only 4.0 AUC to 86.8, versus 5.8 points for SFT. Populating the retrieval bank with perturbed exemplars recovers performance to 88.4 AUC.

Rationale quality. In a pairwise, rubric-based human comparison on the HatefulMemes validation set, RA-HMD rationales are preferred over SFT rationales 61.5% of the time, with rubric scores of 5.6 vs 4.9 out of 10.

General VL capability. Performance on MMMU, SEED-Bench and GQA remains close to the pretrained baseline — SFT alone degrades on all three.

Conclusion

RA-HMD shows how to equip large multimodal models with retrieval-guided supervision without sacrificing either their general vision-language competence or their ability to explain their decisions. The two-stage pipeline — LoRA-based SFT with an auxiliary LRC head, followed by frozen-LMM contrastive fine-tuning — delivers state-of-the-art in-domain accuracy, large cross-domain gains, adversarial robustness and higher-quality rationales, all while keeping the retrieval bank hot-swappable at deployment time.

Citation

@inproceedings{RAHMD2025Mei,
    title = "Robust Adaptation of Large Multimodal Models for Retrieval Augmented Hateful Meme Detection",
    author = "Mei, Jingbiao  and
      Chen, Jinghong  and
      Yang, Guangyu  and
      Lin, Weizhe  and
      Byrne, Bill",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.1215/",
    pages = "23817--23839",
    ISBN = "979-8-89176-332-6",
}

