AI-Generated Image Detection on Edge

2026 · 🥈 LPCVC Track 3, 2nd Place — an on-device vision-language model that detects AI-generated images and explains why.

🥈 2nd Place LPCVC 2026 · Track 3 On-device · Snapdragon 8 Elite

ECV Workshop @ CVPR 2026, Denver · Sponsored by Qualcomm

Can a phone tell a real photo from an AI-generated one — and explain its reasoning? Our entry to the 2026 IEEE Low-Power Computer Vision Challenge does both, fully on-device, under the contest's strict latency and power budgets.



The challenge

2026 IEEE Low-Power Computer Vision Challenge — Track 3: AI Generated Images Detection.

Track 3 raises the bar past a yes/no classifier: the model has to decide and justify.

Saying “fake” isn’t enough — the model has to point to what gives it away.

Every prediction therefore has two parts:

  • Detection — is the image Real or AI-Generated?
  • Explanation — a score and written evidence for each of 8 forensic criteria:
💡 Lighting & Shadows
✂️ Edges & Boundaries
🧵 Texture & Resolution
📐 Perspective & Space
⚖️ Physical / Common-Sense
🔤 Text & Symbols
🧍 Human / Biological
🧱 Material & Object Detail

The organizers grade the submitted binary in two stages: first the model reads the image and writes a free-form analysis across the 8 criteria (Stage 1), then it folds that analysis into one structured JSON — per-criterion score, evidence, and final verdict (Stage 2).
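As a rough illustration of the Stage 2 output, the sketch below checks that a JSON string carries a verdict plus a score and evidence for each of the 8 criteria. The field names (`verdict`, `criteria`, `score`, `evidence`) and the `validate_stage2` helper are assumptions made for this example; the contest's actual schema may differ.

```python
import json

# The 8 forensic criteria named in the task description.
CRITERIA = [
    "Lighting & Shadows",
    "Edges & Boundaries",
    "Texture & Resolution",
    "Perspective & Space",
    "Physical / Common-Sense",
    "Text & Symbols",
    "Human / Biological",
    "Material & Object Detail",
]
VERDICTS = {"Real", "AI-Generated"}


def validate_stage2(raw: str) -> list[str]:
    """Return the problems found in a Stage 2 JSON string (empty list if none).

    All field names here are hypothetical, not the official schema.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    if data.get("verdict") not in VERDICTS:
        problems.append("verdict must be 'Real' or 'AI-Generated'")

    criteria = data.get("criteria")
    if not isinstance(criteria, dict):
        problems.append("missing 'criteria' object")
        return problems

    for name in CRITERIA:
        entry = criteria.get(name)
        if not isinstance(entry, dict):
            problems.append(f"missing criterion: {name}")
            continue
        score = entry.get("score")
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            problems.append(f"non-numeric score: {name}")
        if not isinstance(entry.get("evidence"), str):
            problems.append(f"missing evidence text: {name}")
    return problems
```

A check like this is cheap to run on every generated sample, which makes malformed outputs easy to count during evaluation.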

How it’s scored

Two constraints drive every design decision:

⏱️ Speed gate

Inference must run faster than 15 tokens/s on the phone — miss it and the entry is disqualified.

🎯 Accuracy

A per-image score rewarding both the verdict and the explanation.

The accuracy score combines three measurements:

| Component | How it’s measured |
|---|---|
| Detection | accuracy of the overall Real / AI-Generated call |
| Criterion | exact-match accuracy of each per-criterion judgment |
| Evidence | semantic similarity of the written evidence to ground truth |
\[\text{Explanation} = 0.5\,(\text{Criterion}) + 0.5\,(\text{Evidence})\]

\[\text{Image score} = \begin{cases} \text{Detection} & \text{Real} \\[2pt] 0.5\,(\text{Detection}) + 0.5\,(\text{Explanation}) & \text{AI-Generated} \end{cases}\]

\[\text{Final} = \frac{\sum \text{Image score}}{\#\,\text{images}}\]
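The scoring rule can be sketched in a few lines of Python. This is an illustration of the formulas as written here, not the organizers' evaluation code. It assumes each component is already a number in [0, 1] and that the Real / AI-Generated case is chosen by the ground-truth label. The function names are made up for the example.

```python
def explanation_score(criterion: float, evidence: float) -> float:
    """Equal-weight mix of criterion accuracy and evidence similarity."""
    return 0.5 * criterion + 0.5 * evidence


def image_score(is_real: bool, detection: float,
                criterion: float = 0.0, evidence: float = 0.0) -> float:
    """Per-image score: detection only for real images,
    detection plus explanation (equal weights) for AI-generated ones."""
    if is_real:
        return detection
    return 0.5 * detection + 0.5 * explanation_score(criterion, evidence)


def final_score(image_scores: list[float]) -> float:
    """Mean of the per-image scores."""
    return sum(image_scores) / len(image_scores)


# Worked example: one real image called correctly, and one fake image called
# correctly with criterion accuracy 0.75 and evidence similarity 0.5.
real = image_score(True, detection=1.0)                                 # 1.0
fake = image_score(False, detection=1.0, criterion=0.75, evidence=0.5)  # 0.8125
print(final_score([real, fake]))                                        # 0.90625
```

Under these formulas, a correct verdict on a fake image earns at most half the available points; the other half depends on the explanation.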

Approach

flowchart LR
  A["~788K images<br/>ADM · BigGAN · SID<br/>SynthScars · ImageNet · COCO"] -->|"Qwen2.5-VL auto-annotation<br/>8 criteria · evidence · domain"| B["SFT splits"]
  B -->|"Step 0 → 1 → 2<br/>LoRA+ on Qwen2-VL-2B"| C["Merged detector"]
  C -->|"AIMET W4A16<br/>ONNX → QNN"| D["On-device<br/>Snapdragon 8 Elite"]

1 · Data & annotation

The hard part: almost none of the source images came with the 8-criteria labels the task needs.

  • Sources — fakes from GenImage (ADM, BigGAN), SID-Set, and SynthScars; real photos from ImageNet and COCO.
  • Auto-labeling — Qwen2.5-VL annotates every image with a domain tag, text/person flags, and, for each criterion, a 0–2 score plus written evidence.
  • Real images get all-zero scores and a “no artifacts” note — turning a pile of unlabeled images into a fully supervised set.
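A minimal sketch of what the label for a real image might look like, assuming a plain dictionary record. The key names, the `make_real_label` helper, and the exact wording of the note are hypothetical; only the all-zero scores and the “no artifacts” idea come from the description above.

```python
CRITERIA = [
    "Lighting & Shadows", "Edges & Boundaries", "Texture & Resolution",
    "Perspective & Space", "Physical / Common-Sense", "Text & Symbols",
    "Human / Biological", "Material & Object Detail",
]


def make_real_label(domain: str, has_text: bool, has_person: bool) -> dict:
    """Build a supervised target for a real photo: every criterion scores 0."""
    return {
        "label": "Real",
        "domain": domain,          # domain tag, e.g. "landscape" (example value)
        "has_text": has_text,      # text flag
        "has_person": has_person,  # person flag
        "criteria": {
            name: {"score": 0, "evidence": "No artifacts observed."}
            for name in CRITERIA
        },
    }


record = make_real_label("landscape", has_text=False, has_person=False)
print(len(record["criteria"]))  # 8
```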

2 · A 3-step training curriculum

A general VLM doesn’t know what AI artifacts look like, so we taught Qwen2-VL-2B in three stages with LoRA+. (Plain LoRA, DoRA, and PiSSA were also tried; the 7B model overfit, so we kept the 2B.) The stages:

  • Step 0 — learn to analyze. Free-form “real or fake, and why” reasoning, so the model learns to see artifacts.
  • Step 1 — learn the format. Compress that reasoning into the contest’s compact template (~300 tokens) — token budget is part of the speed gate.
  • Step 2 — learn the JSON. Emit valid structured output, with a consistency rule: if any criterion is flagged as fake, the verdict must be AI-Generated.
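The consistency rule from Step 2 could look like the sketch below when applied to a single prediction. It assumes a criterion counts as flagged when its score is above 0, and it uses hypothetical field names. Whether the team applied the rule while building training targets, as a post-processing step, or both is not stated here.

```python
def enforce_consistency(prediction: dict) -> dict:
    """Return a copy in which any flagged criterion forces an AI-Generated verdict."""
    fixed = dict(prediction)  # shallow copy; the input is left unchanged
    flagged = any(entry["score"] > 0 for entry in prediction["criteria"].values())
    if flagged:
        fixed["verdict"] = "AI-Generated"
    return fixed


inconsistent = {
    "verdict": "Real",
    "criteria": {
        "Text & Symbols": {"score": 2, "evidence": "Garbled lettering on the sign."},
        "Lighting & Shadows": {"score": 0, "evidence": "No artifacts observed."},
    },
}
print(enforce_consistency(inconsistent)["verdict"])  # AI-Generated
```

The rule only works in one direction: a prediction with no flagged criteria keeps whatever verdict it already had.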

The trained adapter is then merged into the base model to give a single deployable network.
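For reference, in the standard LoRA parametrization (assumed here; the exact training configuration is not given), merging folds the low-rank update back into each adapted weight matrix, so the deployed network has the same shape as the base model:

\[W_{\text{merged}} = W_0 + \frac{\alpha}{r}\, B A, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)\]

Here \(W_0\) is the frozen base weight, \(r\) the adapter rank, and \(\alpha\) the LoRA scaling factor. LoRA+ differs from plain LoRA in how the two factors are trained (it uses different learning rates for \(A\) and \(B\)), not in this merge step.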

3 · Quantization & deployment

  • Quantize the merged model with AIMET (W4A16) — both vision encoder and language model.
  • Export through ONNX → QNN binary for the Snapdragon NPU.
  • Match calibration data to the real inference distribution; a mismatch quietly wrecks quantized accuracy.
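To make “W4” concrete, here is a toy pure-Python sketch of symmetric 4-bit weight quantization for a single row of weights. It is a generic illustration, not AIMET's API or its actual algorithm, and the function names are invented for the example.

```python
def quantize_w4(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to signed 4-bit integers in [-8, 7] with one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0  # largest magnitude maps to +/-7
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the 4-bit integers."""
    return [q * scale for q in quantized]


q, scale = quantize_w4([1.75, -0.5, 0.0, 0.25])
print(q, scale)              # [7, -2, 0, 1] 0.25
print(dequantize(q, scale))  # [1.75, -0.5, 0.0, 0.25]
```

The example values happen to sit exactly on the 4-bit grid. In general each weight is off by up to half a scale step. Activation ranges are typically estimated from calibration samples, which is why calibration data that does not match real inputs can hurt accuracy.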

Results

  • 2nd of all teams
  • 0.72 challenge score
  • 31.2 tokens/s · 2× the floor
  • 2.6 GB QNN binary

Official LPCVC 2026 leaderboard — SSUPER_POWER placed 2nd in Track 3, behind OptimAI (KETI) and ahead of teams from UC Irvine and Clemson.

Team

Team SSUPER_POWER — VIP Lab, Soongsil University:

  • Dayoung Kil
  • Doeon Kim
  • Junyoon Lee

Tech stack

Models and tools: Qwen2.5-VL-7B · Qwen2-VL-2B · LoRA+ / DoRA / PiSSA · PyTorch 2.10 · LLaMA-Factory 0.9.1 · AIMET Pro 1.34 (W4A16) · QAIRT 2.31 · QNN

Datasets: GenImage (ADM · BigGAN) · SID-Set · SynthScars · ImageNet · COCO · ARForensics.
Code released under the MIT License.