
3-minute Pitch: ExPO-HM — Learning to Explain-then-Detect for Hateful Meme Detection

Published in ICLR 2026

Posted by Jingbiao on January 22, 2026. Reading time: 3 minutes.

Hateful meme detection has largely been framed as binary classification: a model outputs hateful or benign and, at best, a confidence score. For real-world content moderation that is not enough — moderators need to know who is being attacked and what kind of attack is being levelled. In ExPO-HM we push detectors toward an Explain-then-Detect formulation: the model first generates a policy-aware rationale identifying target and attack type, and then issues a detection verdict. Surprisingly, naively adding a Chain-of-Thought prompt or applying vanilla GRPO / DPO to this setup performs worse than a plain SFT baseline. ExPO-HM fixes this with a three-stage recipe built around a new reward signal: Conditional Decision Entropy (CDE).
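
For concreteness, here is a minimal sketch of what an Explain-then-Detect output could look like. The field names and values are illustrative assumptions on our part, not the paper's exact schema:

# Illustrative Explain-then-Detect output contract.
# Field names are hypothetical; the paper defines its own format.
example_output = {
    "rationale": (
        "The caption mocks the group shown in the image by comparing "
        "them to vermin; the attack dehumanizes a protected group."
    ),
    "target": "ethnicity",           # protected category being attacked
    "attack_type": "dehumanization", # kind of attack being levelled
    "verdict": "hateful",            # final detection decision
}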

Why existing Explain-then-Detect fails

Our analysis points to two specific failure modes of prior Explain-then-Detect systems (CoT prompting, LMM agents, vanilla GRPO):

  1. The model does not hypothesise policy-relevant cues. Without explicit pressure, LMM rationales drift toward surface description instead of naming the protected target or the attack type.
  2. Binary rewards give no gradient toward better reasoning. A correct/incorrect label carries no signal about whether the explanation led to the decision — so the policy learns to skip the thinking and just guess.

ExPO-HM’s central bet is that we need a reward that explicitly measures how much the explanation determines the decision.

Method

ExPO-HM trains an LMM backbone in three stages, mirroring how human annotators are trained and evaluated.

Stage 1 — SFT warmup. Supervised fine-tuning on annotated rationales teaches the model the output format and seeds policy-aware priors (targets, attack types, protected categories).

Stage 2 — Curriculum GRPO. Group Relative Policy Optimization provides the RL-style alignment, applied with a curriculum: easier, clearly-hateful examples first, followed by subtle, policy-sensitive memes. The curriculum prevents the policy from collapsing onto the easy shortcut of guessing the majority class.
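
To make the Stage 2 machinery concrete, here is a rough sketch of the group-relative advantage at the core of GRPO, plus one way a difficulty-ordered curriculum could be wired up. The difficulty score and sorting rule are our assumptions, not the paper's schedule:

import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    # Standardize rewards within the group of G completions sampled
    # for the same meme; no learned value network is needed.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def curriculum_order(examples: list, difficulty: list) -> list:
    # Assumption: 'difficulty' is a per-example score (e.g. annotator
    # disagreement); sort ascending so clearly-hateful memes come first.
    return [ex for _, ex in sorted(zip(difficulty, examples), key=lambda t: t[0])]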

Stage 3 — Conditional Decision Entropy (CDE) as reward. CDE is the key novelty. It scores an explanation e by how sharply it reduces the model’s uncertainty over the final decision y given the meme x:

$$\mathrm{CDE}(e \mid x) = H\!\left[p(y \mid x)\right] - H\!\left[p(y \mid x, e)\right].$$

A rationale that actually drives the decision produces a large entropy drop and receives high reward; a generic boilerplate rationale produces near-zero CDE. Plugged into GRPO, CDE gives a dense, reasoning-aware signal that the binary label reward cannot provide. CDE doubles as an evaluation metric for rationale quality.
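
Since CDE is just an entropy difference, it is easy to sketch. Below, p_prior and p_cond stand for the model's verdict probabilities without and with the rationale in context; treating the decision as binary is our simplifying assumption:

import math

def entropy(p: float) -> float:
    # Binary entropy H[p] in nats, where p = P(hateful).
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def cde(p_prior: float, p_cond: float) -> float:
    # CDE(e | x) = H[p(y|x)] - H[p(y|x,e)]
    # p_prior: P(hateful | meme alone)
    # p_cond:  P(hateful | meme + explanation)
    return entropy(p_prior) - entropy(p_cond)

# A decision-driving rationale sharpens the verdict distribution:
print(cde(p_prior=0.5, p_cond=0.95))  # large entropy drop -> high reward
print(cde(p_prior=0.5, p_cond=0.5))   # boilerplate rationale -> ~0 reward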

Experiments

ExPO-HM is evaluated across three hateful meme benchmarks, covering binary detection, fine-grained target/attack classification, and rationale quality.

  • State-of-the-art on all three axes (detection, fine-grained classification, reasoning quality).
  • +15% F1 over the GRPO baseline — evidence that CDE-shaped reward, not RL alone, is what delivers the gains.
  • +17% F1 over the DPO baseline — preference-pair RL without CDE underperforms.
  • Ablating either the SFT warmup or the curriculum hurts performance, confirming that all three stages are load-bearing.

ExPO-HM is complementary to our prior retrieval-guided work: RGCL (ACL 2024) sharpens the embedding space, RA-HMD (EMNLP 2025) lifts retrieval onto full LMMs, and ExPO-HM adds policy-aware reasoning and explanation quality on top.

Conclusion

For hateful meme detection to be useful in production moderation, models must explain as well as detect. ExPO-HM shows that this is not automatic — CoT prompting, vanilla GRPO and DPO all fail to close the gap to ordinary SFT. The combination of SFT warmup, curriculum GRPO and CDE-based reward turns Explain-then-Detect into a state-of-the-art approach on both detection accuracy and rationale quality.

Citation

@inproceedings{EXPOHM2026Mei,
  title     = {ExPO-HM: Learning to Explain-then-Detect for Hateful Meme Detection},
  author    = {Jingbiao Mei and Mingsheng Sun and Jinghong Chen and Pengda Qin and Yuhong Li and Da Chen and Bill Byrne},
  booktitle = {The Fourteenth International Conference on Learning Representations},
  year      = {2026},
  url       = {https://openreview.net/forum?id=bEejbORUI5}
}