
3-minute Pitch: ExPO-HM — Learning to Explain-then-Detect for Hateful Meme Detection

Published in ICLR 2026

Posted by Jingbiao on January 22, 2026. Reading time: 3 minutes.

Hateful meme detection has largely been framed as binary classification: a model outputs hateful or benign and, at best, a confidence score. For real-world content moderation that is not enough — moderators need to know who is being attacked and what kind of attack is being levelled. In ExPO-HM we push detectors toward an Explain-then-Detect formulation: the model first generates a policy-aware rationale identifying target and attack type, and then issues a detection verdict. Surprisingly, naively adding a Chain-of-Thought prompt or applying vanilla GRPO / DPO to this setup performs worse than a plain SFT baseline. ExPO-HM fixes this with a three-stage recipe built around a new reward signal: Conditional Decision Entropy (CDE).
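
For concreteness, here is a minimal sketch of what an Explain-then-Detect output could look like. The field names and values are illustrative assumptions on our part, not the paper's exact schema:

# Illustrative Explain-then-Detect output contract.
# Field names are hypothetical; the paper defines its own format.
example_output = {
    "rationale": (
        "The caption mocks the group shown in the image by comparing "
        "them to vermin; the attack dehumanizes a protected group."
    ),
    "target": "ethnicity",           # protected category being attacked
    "attack_type": "dehumanization", # kind of attack being levelled
    "verdict": "hateful",            # final detection decision
}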

Why existing Explain-then-Detect fails

Our analysis points to two specific failure modes of prior Explain-then-Detect systems (CoT prompting, LMM agents, vanilla GRPO):

  1. The model does not hypothesise policy-relevant cues. Without explicit pressure, LMM rationales drift toward surface description instead of naming the protected target or the attack type.
  2. Binary rewards give no gradient toward better reasoning. A correct/incorrect label carries no signal about whether the explanation led to the decision — so the policy learns to skip the thinking and just guess.

ExPO-HM’s central bet is that we need a reward that explicitly measures how much the explanation determines the decision.

Method

ExPO-HM trains an LMM backbone in three stages, mirroring how human annotators are trained and evaluated.

Stage 1 — SFT warmup. Supervised fine-tuning on annotated rationales teaches the model the output format and seeds policy-aware priors (targets, attack types, protected categories).

Stage 2 — Curriculum GRPO. Group Relative Policy Optimization provides the RL-style alignment, applied with a curriculum: easier, clearly-hateful examples first, followed by subtle, policy-sensitive memes. The curriculum prevents the policy from collapsing onto the easy shortcut of guessing the majority class.
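
To make the Stage 2 machinery concrete, here is a rough sketch of the group-relative advantage at the core of GRPO, plus one way a difficulty-ordered curriculum could be wired up. The difficulty score and sorting rule are our assumptions, not the paper's schedule:

import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    # Standardize rewards within the group of G completions sampled
    # for the same meme; no learned value network is needed.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def curriculum_order(examples: list, difficulty: list) -> list:
    # Assumption: 'difficulty' is a per-example score (e.g. annotator
    # disagreement); sort ascending so clearly-hateful memes come first.
    return [ex for _, ex in sorted(zip(difficulty, examples), key=lambda t: t[0])]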

Stage 3 — Conditional Decision Entropy (CDE) as reward. CDE is the key novelty. It scores an explanation e by how sharply it reduces the model’s uncertainty over the final decision y given the meme x:

$$\mathrm{CDE}(e \mid x) = H\!\left[p(y \mid x)\right] - H\!\left[p(y \mid x, e)\right].$$

A rationale that actually drives the decision produces a large entropy drop and receives high reward; a generic boilerplate rationale produces near-zero CDE. Plugged into GRPO, CDE gives a dense, reasoning-aware signal that the binary label reward cannot provide. CDE doubles as an evaluation metric for rationale quality.
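
Since CDE is just an entropy difference, it is easy to sketch. Below, p_prior and p_cond stand for the model's verdict probabilities without and with the rationale in context; treating the decision as binary is our simplifying assumption:

import math

def entropy(p: float) -> float:
    # Binary entropy H[p] in nats, where p = P(hateful).
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def cde(p_prior: float, p_cond: float) -> float:
    # CDE(e | x) = H[p(y|x)] - H[p(y|x,e)]
    # p_prior: P(hateful | meme alone)
    # p_cond:  P(hateful | meme + explanation)
    return entropy(p_prior) - entropy(p_cond)

# A decision-driving rationale sharpens the verdict distribution:
print(cde(p_prior=0.5, p_cond=0.95))  # large entropy drop -> high reward
print(cde(p_prior=0.5, p_cond=0.5))   # boilerplate rationale -> ~0 reward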

Experiments

ExPO-HM is evaluated across three hateful meme benchmarks, covering binary detection, fine-grained target/attack classification, and rationale quality.

  • State-of-the-art on all three axes (detection, fine-grained classification, reasoning quality).
  • +15% F1 over the GRPO baseline — evidence that CDE-shaped reward, not RL alone, is what delivers the gains.
  • +17% F1 over the DPO baseline — preference-pair RL without CDE underperforms.
  • Ablating either the SFT warmup or the curriculum hurts performance, confirming that all three stages are load-bearing.

ExPO-HM is complementary to our prior retrieval-guided work: RGCL (ACL 2024) sharpens the embedding space, RA-HMD (EMNLP 2025) lifts retrieval onto full LMMs, and ExPO-HM adds policy-aware reasoning and explanation quality on top.

Conclusion

For hateful meme detection to be useful in production moderation, models must explain as well as detect. ExPO-HM shows that this is not automatic — CoT prompting, vanilla GRPO and DPO all fail to close the gap to ordinary SFT. The combination of SFT warmup, curriculum GRPO and CDE-based reward turns Explain-then-Detect into a state-of-the-art approach on both detection accuracy and rationale quality.

Citation

@inproceedings{EXPOHM2026Mei,
  title     = {ExPO-HM: Learning to Explain-then-Detect for Hateful Meme Detection},
  author    = {Jingbiao Mei and Mingsheng Sun and Jinghong Chen and Pengda Qin and Yuhong Li and Da Chen and Bill Byrne},
  booktitle = {The Fourteenth International Conference on Learning Representations},
  year      = {2026},
  url       = {https://openreview.net/forum?id=bEejbORUI5}
}