AI-Generated Image Detection on Edge
2026 · 🥈 LPCVC Track 3, 2nd Place — an on-device vision-language model that detects AI-generated images and explains why.
ECV Workshop @ CVPR 2026, Denver · Sponsored by Qualcomm
Can a phone tell a real photo from an AI-generated one — and explain its reasoning? Our entry to the 2026 IEEE Low-Power Computer Vision Challenge does both, fully on-device, under the contest's strict latency and power budgets.
The challenge
Track 3 raises the bar past a yes/no classifier: the model has to decide and justify.
Saying “fake” isn’t enough — the model has to point to what gives it away.
Every prediction therefore has two parts:
- Detection — is the image Real or AI-Generated?
- Explanation — a score and written evidence for each of 8 forensic criteria:
The organizers grade the submitted binary in two stages: first the model reads the image and writes a free-form analysis across the 8 criteria (Stage 1), then it folds that analysis into one structured JSON — per-criterion score, evidence, and final verdict (Stage 2).
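As a rough illustration of the Stage 2 output, the sketch below assembles one structured record and serializes it to JSON. The field names (`verdict`, `criteria`, `score`, `evidence`) and the placeholder criterion names are assumptions made for this example; the contest's actual schema and criterion names are not reproduced here.

```python
import json

# Hypothetical placeholder names; the real 8 forensic criteria are not listed here.
CRITERIA = [f"criterion_{i}" for i in range(1, 9)]
VERDICTS = ("Real", "AI-Generated")


def build_prediction(scores, evidence, verdict):
    """Assemble one record: per-criterion score, evidence, and final verdict."""
    if set(scores) != set(CRITERIA) or set(evidence) != set(CRITERIA):
        raise ValueError("every criterion needs a score and an evidence string")
    if verdict not in VERDICTS:
        raise ValueError(f"verdict must be one of {VERDICTS}")
    return {
        "verdict": verdict,
        "criteria": {
            name: {"score": scores[name], "evidence": evidence[name]}
            for name in CRITERIA
        },
    }


# Example: a clean photo with no flagged criteria.
scores = {name: 0 for name in CRITERIA}
evidence = {name: "no artifacts" for name in CRITERIA}
record = build_prediction(scores, evidence, "Real")
print(json.dumps(record, indent=2))
```

Validating the record before serializing keeps malformed output from ever reaching the grader, which matters when a single JSON object carries both the verdict and the explanation.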
How it’s scored
Two constraints drive every design decision:
⏱️ Speed gate
Inference must run faster than 15 tokens/s on the phone — miss it and the entry is disqualified.
🎯 Accuracy
A per-image score rewarding both the verdict and the explanation.
The accuracy score combines three measurements:
| Component | How it’s measured |
|---|---|
| Detection | accuracy of the overall Real / AI-Generated call |
| Criterion | exact-match accuracy of each per-criterion judgment |
| Evidence | semantic similarity of the written evidence to ground truth |
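To make the three components concrete, here is a minimal scoring sketch for a single image. Several parts are assumptions: the equal weights, the record layout, the choice to compare per-criterion scores for the exact-match component, and the word-overlap function that stands in for a real semantic-similarity model. Only the 15 tokens/s floor comes from the description above.

```python
SPEED_FLOOR_TOKENS_PER_S = 15.0  # speed gate stated above


def word_overlap(a, b):
    """Crude stand-in for semantic similarity: Jaccard overlap of word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)


def image_score(pred, truth, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Combine detection, criterion, and evidence components for one image.

    The equal weights are an assumption, not the contest's weighting.
    """
    names = list(truth["criteria"])
    detection = 1.0 if pred["verdict"] == truth["verdict"] else 0.0
    criterion = sum(
        pred["criteria"][n]["score"] == truth["criteria"][n]["score"]
        for n in names
    ) / len(names)
    evidence = sum(
        word_overlap(pred["criteria"][n]["evidence"],
                     truth["criteria"][n]["evidence"])
        for n in names
    ) / len(names)
    w_det, w_crit, w_ev = weights
    return w_det * detection + w_crit * criterion + w_ev * evidence


def passes_speed_gate(tokens_generated, seconds):
    """True only if throughput is strictly above the floor."""
    return tokens_generated / seconds > SPEED_FLOOR_TOKENS_PER_S


# Example with two hypothetical criteria, c1 and c2.
truth = {"verdict": "AI-Generated", "criteria": {
    "c1": {"score": 2, "evidence": "warped fingers"},
    "c2": {"score": 0, "evidence": "no artifacts"}}}
pred = {"verdict": "AI-Generated", "criteria": {
    "c1": {"score": 1, "evidence": "warped fingers"},
    "c2": {"score": 0, "evidence": "no artifacts"}}}
score = image_score(pred, truth)
```

In this example the verdict and the evidence match, but one of the two criterion scores is off, so the criterion component drops to one half while the other two stay at one.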
Approach
flowchart LR
A["~788K images<br/>ADM · BigGAN · SID<br/>SynthScars · ImageNet · COCO"] -->|"Qwen2.5-VL auto-annotation<br/>8 criteria · evidence · domain"| B["SFT splits"]
B -->|"Step 0 → 1 → 2<br/>LoRA+ on Qwen2-VL-2B"| C["Merged detector"]
C -->|"AIMET W4A16<br/>ONNX → QNN"| D["On-device<br/>Snapdragon 8 Elite"]
1 · Data & annotation
The hard part: almost none of the source images came with the 8-criteria labels the task needs.
- Sources — fakes from GenImage (ADM, BigGAN), SID-Set, and SynthScars; real photos from ImageNet and COCO.
- Auto-labeling — Qwen2.5-VL annotates every image with a domain tag, text/person flags, and a 0–2 score + evidence per criterion.
- Real images get all-zero scores and a “no artifacts” note — turning a pile of unlabeled images into a fully supervised set.
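The labeling rule for real images is simple enough to sketch directly. The record layout, field names, placeholder criterion names, and the example path below are all hypothetical; the validator shows the kind of check one might run on any annotation, including those produced by the VLM annotator, before it enters a training split.

```python
# Hypothetical placeholder names; the real 8 criteria are not listed here.
CRITERIA = [f"criterion_{i}" for i in range(1, 9)]


def label_real_image(image_path, domain, has_text=False, has_person=False):
    """Build a fully supervised record for a real photo: all-zero scores."""
    return {
        "image": image_path,
        "label": "Real",
        "domain": domain,
        "has_text": has_text,
        "has_person": has_person,
        "criteria": {
            name: {"score": 0, "evidence": "no artifacts"} for name in CRITERIA
        },
    }


def validate_annotation(record):
    """Reject records with a missing criterion, bad score, or empty evidence."""
    for name in CRITERIA:
        entry = record["criteria"][name]  # KeyError if a criterion is missing
        if entry["score"] not in (0, 1, 2):
            raise ValueError(f"{name}: score must be 0, 1, or 2")
        if not entry["evidence"].strip():
            raise ValueError(f"{name}: evidence must not be empty")
    return record


# Example path is illustrative only.
real_record = validate_annotation(
    label_real_image("images/example.jpg", domain="landscape")
)
```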
2 · A 3-step training curriculum
A general VLM doesn’t know what AI artifacts look like, so we taught Qwen2-VL-2B in stages with LoRA+ (LoRA, DoRA and PiSSA were also tried; the 7B model overfit, so 2B won):
- Step 0: learn to analyze. Free-form “real or fake, and why” reasoning, so the model learns to see artifacts.
- Step 1: learn the format. Compress that reasoning into the contest’s compact template (~300 tokens); the token budget is part of the speed gate.
- Step 2: learn the JSON. Emit valid structured output, with a consistency rule so any fake-criterion forces an AI-Generated verdict.
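The consistency rule from the last step can be sketched as a small check on a structured record. Two things are assumed here: that a criterion counts as "fake" when its score is at least 1, and that records use the hypothetical `verdict` / `criteria` layout shown in the code. Whether the rule was applied to training targets, at inference time, or both is not stated, so treat this as an illustration of the rule itself.

```python
def enforce_consistency(record, fake_threshold=1):
    """Return (fixed_record, flagged_names).

    If any criterion score reaches fake_threshold, the verdict is forced to
    "AI-Generated". The input record is not modified.
    """
    flagged = [
        name
        for name, entry in record["criteria"].items()
        if entry["score"] >= fake_threshold
    ]
    fixed = dict(record)  # shallow copy; only the verdict may change
    if flagged:
        fixed["verdict"] = "AI-Generated"
    return fixed, flagged


# Example: an inconsistent record that says "Real" but flags one criterion.
inconsistent = {
    "verdict": "Real",
    "criteria": {
        "c1": {"score": 0, "evidence": "no artifacts"},
        "c2": {"score": 2, "evidence": "text on the sign is unreadable"},
    },
}
fixed, flagged = enforce_consistency(inconsistent)
```

Note that the rule is one-directional: a record with no flagged criteria keeps whatever verdict it already had.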
The trained adapter is then merged into the base model to give a single deployable network.
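For a standard LoRA-style adapter, merging means folding the low-rank update into each adapted weight matrix, so no extra adapter computation remains at inference time. The symbols below follow the usual LoRA notation and are not taken from the project itself: $W_0$ is a frozen base weight, $A$ and $B$ are the trained low-rank factors, $r$ is the rank, and $\alpha$ is the scaling hyperparameter. LoRA+ differs from plain LoRA in the learning rates used for $A$ and $B$ during training, so the merged form is the same.

```latex
W_{\text{merged}} = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\quad
A \in \mathbb{R}^{r \times k},\quad
r \ll \min(d, k)
```

Because $W_{\text{merged}}$ has the same shape as $W_0$, the merged network can be exported and quantized like an ordinary model.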
3 · Quantization & deployment
- Quantize the merged model with AIMET (W4A16) — both vision encoder and language model.
- Export through ONNX → QNN binary for the Snapdragon NPU.
- Match calibration data to the real inference distribution; a mismatch quietly wrecks quantized accuracy.
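The sketch below illustrates what W4A16 means numerically and why calibration data matters. It uses generic symmetric per-tensor quantization in plain Python; it is not AIMET's algorithm or API, and the sample values are made up for the example.

```python
def calibrate_scale(samples, bits):
    """Symmetric per-tensor scale from the largest magnitude seen in calibration."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(x) for x in samples)
    return max_abs / qmax if max_abs > 0 else 1.0


def fake_quantize(values, scale, bits):
    """Round to the integer grid, clip to the signed range, map back to floats."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    out = []
    for x in values:
        q = max(qmin, min(qmax, round(x / scale)))
        out.append(q * scale)
    return out


def max_error(values, scale, bits):
    """Largest absolute difference between values and their quantized form."""
    quantized = fake_quantize(values, scale, bits)
    return max(abs(x - y) for x, y in zip(values, quantized))


# W4: weights stored on a 4-bit grid (16 levels).
weights = [-0.7, -0.2, 0.05, 0.3, 0.7]
w_scale = calibrate_scale(weights, bits=4)
print("weight error:", max_error(weights, w_scale, 4))

# A16: activations on a 16-bit grid, with the scale fixed from calibration data.
calib_acts = [-1.0, -0.5, 0.25, 1.0]
a_scale = calibrate_scale(calib_acts, bits=16)
matched = [-0.9, 0.1, 0.8]      # same range as calibration
mismatched = [-6.0, 0.1, 5.0]   # outside the calibrated range, so it clips
print("matched error:", max_error(matched, a_scale, 16))
print("mismatched error:", max_error(mismatched, a_scale, 16))
```

For in-range values the error stays within half a quantization step. Values outside the calibrated range are clipped to the edge of the grid, which is one way a calibration mismatch can degrade accuracy without any visible failure.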
Results
- 2nd of all teams
- 0.72 challenge score
- 31.2 tokens/s · 2× the floor
- 2.6 GB QNN binary
Team
Team SSUPER_POWER — VIP Lab, Soongsil University:
- Dayoung Kil
- Doeon Kim
- Junyoon Lee
Tech stack
Qwen2.5-VL-7B, Qwen2-VL-2B, LoRA+ / DoRA / PiSSA, PyTorch 2.10, LLaMA-Factory 0.9.1, AIMET Pro 1.34 (W4A16), QAIRT 2.31 · QNN
Datasets: GenImage (ADM · BigGAN) · SID-Set · SynthScars · ImageNet · COCO · ARForensics.
Code released under the MIT License.