AI-Generated Image Detection on Edge

🥈 2nd Place LPCVC 2026 · Track 3 On-device · Snapdragon 8 Elite

ECV Workshop @ CVPR 2026, Denver · Sponsored by Qualcomm

Can a phone tell a real photo from an AI-generated one — and explain its reasoning?

Our entry in the 2026 IEEE Low-Power Computer Vision Challenge does both, fully on-device, under the contest's strict latency and power budgets.

View code on GitHub · 🤗 Model weights

The challenge

2026 IEEE Low-Power Computer Vision Challenge — Track 3: AI Generated Images Detection.

Track 3 goes beyond a yes/no classifier: the model has to decide and justify.

Saying “fake” isn’t enough — the model has to point to what gives it away.

Every prediction therefore has two parts:

Detection — is the image Real or AI-Generated?
Explanation — a score and written evidence for each of 8 forensic criteria:

Lighting & Shadows

Edges & Boundaries

Texture & Resolution

Perspective & Space

Physical / Common-Sense

Text & Symbols

Human / Biological

Material & Object Detail

The organizers grade the submitted binary in two stages: first the model reads the image and writes a free-form analysis across the 8 criteria (Stage 1), then it folds that analysis into one structured JSON object holding the per-criterion score, evidence, and final verdict (Stage 2).

How it’s scored

Two constraints drive every design decision:

⏱️ Speed gate

Inference must run faster than 15 tokens/s on the phone — miss it and the entry is disqualified.

🎯 Accuracy

A per-image score rewarding both the verdict and the explanation.

The accuracy score combines three measurements:

Component	How it’s measured
Detection	accuracy of the overall Real / AI-Generated call
Criterion	exact-match accuracy of each per-criterion judgment
Evidence	semantic similarity of the written evidence to ground truth

\[\text{Explanation} = 0.5\,(\text{Criterion}) + 0.5\,(\text{Evidence})\]

\[\text{Image score} = \begin{cases} \text{Detection} & \text{Real} \\[2pt] 0.5\,(\text{Detection}) + 0.5\,(\text{Explanation}) & \text{AI-Generated} \end{cases}\]

\[\text{Final} = \frac{\sum \text{Image score}}{\#\,\text{images}}\]
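The scoring formulas above can be read as the following minimal Python sketch. It is our own illustration, not the organizers' grader: the `ImageResult` container and its field names are hypothetical, and we assume the Real / AI-Generated case split follows the ground-truth label and that all three components are already scaled to [0, 1].

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ImageResult:
    """Per-image measurements (hypothetical container, not the official format)."""
    is_fake: bool      # assumed: ground-truth label, True if AI-Generated
    detection: float   # 1.0 if the Real / AI-Generated call is correct, else 0.0
    criterion: float   # exact-match accuracy over the per-criterion judgments
    evidence: float    # semantic similarity of the evidence to ground truth


def explanation_score(r: ImageResult) -> float:
    # Explanation = 0.5 * Criterion + 0.5 * Evidence
    return 0.5 * r.criterion + 0.5 * r.evidence


def image_score(r: ImageResult) -> float:
    # Real images are scored on detection alone;
    # AI-generated images split the score between detection and explanation.
    if not r.is_fake:
        return r.detection
    return 0.5 * r.detection + 0.5 * explanation_score(r)


def final_score(results: List[ImageResult]) -> float:
    # Final = mean of the per-image scores.
    if not results:
        raise ValueError("need at least one image")
    return sum(image_score(r) for r in results) / len(results)
```

For example, a correctly detected fake with Criterion 0.75 and Evidence 0.5 scores 0.5 × 1 + 0.5 × 0.625 = 0.8125, so half of a fake image's score depends on the quality of the explanation.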

Approach

The system comes together in three parts: building a labeled dataset, a staged training curriculum, and on-device quantization & deployment.

1 · Data & annotation

Almost none of the source images came with the 8-criteria labels the task needs — so we generated them ourselves, using Qwen2.5-VL-7B as a teacher model. 89,263 images in all, balanced ~50 : 50 real / fake.

AI-generated sources

ARForensics (Infinity · Janus-Pro · LlamaGen · RAR · …) · GenImage (ADM · BigGAN) · SID-Set · SynthScars

Real sources

ImageNet · COCO train2017 · SID-Set (real split)

Auto-labeling — 4 questions per image

Instead of asking for one yes/no call, we run the teacher through three vision passes, each scanning a slice of the 8 criteria, then fuse the results into JSON in a final text-only pass.

Q1 · Vision
edges · texture · material

Q2 · Vision
physics · text · human

Q3 · Vision
lighting · perspective

Q4 · Text-only synthesis — fold Q1–Q3 into one structured JSON: an aigc score + written evidence per criterion, plus an overall Real / AI-Generated verdict.

Fake images get a one-line hint prepended to the prompt so the teacher looks for artifacts; real images come out all-zero with "looks authentic" evidence. The result is a fully supervised set.
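The four-question flow can be sketched roughly as below. This is an illustration under stated assumptions, not our actual labeling script: `ask_teacher` is a hypothetical callable standing in for a Qwen2.5-VL-7B call, the prompt wording and `FAKE_HINT` text are invented for the example, and we assume the hint is only attached to the three vision passes.

```python
from typing import Callable, List, Optional

# Criteria covered by each vision pass, following the Q1-Q3 split above.
VISION_PASSES = {
    "Q1": ["Edges & Boundaries", "Texture & Resolution", "Material & Object Detail"],
    "Q2": ["Physical / Common-Sense", "Text & Symbols", "Human / Biological"],
    "Q3": ["Lighting & Shadows", "Perspective & Space"],
}

# Hypothetical wording of the one-line hint used for fake images.
FAKE_HINT = "Note: this image is known to be AI-generated; look for artifacts."

# (prompt, image path or None for text-only) -> teacher answer
Teacher = Callable[[str, Optional[str]], str]


def build_vision_prompt(criteria: List[str], is_fake: bool) -> str:
    prompt = "Analyze the image for: " + "; ".join(criteria) + "."
    return (FAKE_HINT + "\n" + prompt) if is_fake else prompt


def label_image(image_path: str, is_fake: bool, ask_teacher: Teacher) -> str:
    # Q1-Q3: three vision passes, each over a slice of the 8 criteria.
    analyses = []
    for name, criteria in VISION_PASSES.items():
        answer = ask_teacher(build_vision_prompt(criteria, is_fake), image_path)
        analyses.append(f"[{name}] {answer}")
    # Q4: text-only synthesis of the three analyses into one JSON string.
    synthesis = (
        "Fold the analyses below into one JSON with an aigc score and evidence "
        "per criterion plus an overall verdict.\n\n" + "\n\n".join(analyses)
    )
    return ask_teacher(synthesis, None)
```

Passing the teacher in as a callable keeps the orchestration testable with a stub, independent of how the model is actually served.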

Each image becomes one annotation JSON (trimmed here to 3 of the 8 criteria):

{
  "per_criterion": [
    { "criterion": "Lighting & Shadows Consistency",
      "evidence": "Lighting is consistent throughout; no abrupt brightness or shadow changes.",
      "aigc score": 0 },
    { "criterion": "Physical & Common Sense Logic",
      "evidence": "The fish has an unusual shape and size; the person's features look exaggerated.",
      "aigc score": 1 },
    { "criterion": "Human & Biological Structure Integrity",
      "evidence": "Unrealistic proportions and facial features; fish anatomy is inconsistent.",
      "aigc score": 1 }
    // … 8 criteria total
  ],
  "overall_likelihood": "AI-Generated"
}
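A teacher-generated file like this is worth sanity-checking before it enters training. The sketch below is one possible validator, not part of our released pipeline: the function name is hypothetical, and it assumes the score is a binary 0 / 1 integer (as in the sample) and that the two verdict strings are exactly "Real" and "AI-Generated".

```python
import json

# Assumed verdict vocabulary; only "AI-Generated" appears in the sample above.
ALLOWED_VERDICTS = {"Real", "AI-Generated"}


def validate_annotation(raw: str, expected_criteria: int = 8) -> dict:
    """Parse one annotation and raise ValueError if it breaks the expected shape."""
    ann = json.loads(raw)
    items = ann.get("per_criterion")
    if not isinstance(items, list) or len(items) != expected_criteria:
        raise ValueError(f"expected {expected_criteria} criteria")
    for item in items:
        # Every entry needs a criterion name and non-empty written evidence.
        if not item.get("criterion") or not item.get("evidence"):
            raise ValueError("criterion and evidence must be non-empty")
        score = item.get("aigc score")
        # Assumed binary score; reject booleans, floats, and out-of-range values.
        if type(score) is not int or score not in (0, 1):
            raise ValueError(f"bad aigc score: {score!r}")
    if ann.get("overall_likelihood") not in ALLOWED_VERDICTS:
        raise ValueError("overall_likelihood must be Real or AI-Generated")
    return ann
```

Note that a real annotation file must omit the `// …` comment line shown in the trimmed sample, since `json.loads` rejects comments.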

2 · A staged LoRA training chain

A general VLM doesn’t know what AI artifacts look like, so we adapt Qwen2-VL-2B over four LoRA phases. Each phase trains an adapter, then merges it into the base before the next starts, so skills stack into one deployable network. In our comparisons LoRA+ won out over LoRA, DoRA, and PiSSA, and the 7B model overfit, so we kept the 2B.
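The train-then-merge chain can be illustrated on a single weight matrix. The sketch below uses the standard LoRA merge, W ← W + (α / r) · B · A, in plain Python lists so it runs without any framework; the function names are hypothetical, and in practice the merge is handled by the training tooling rather than hand-written code like this.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    # Plain-Python matrix product, for illustration only.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight: Matrix, lora_a: Matrix, lora_b: Matrix,
               alpha: float, rank: int) -> Matrix:
    """Return W + (alpha / rank) * B @ A, with B: out x r and A: r x in."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


def run_chain(base: Matrix, adapters: List[Tuple[Matrix, Matrix, float, int]]) -> Matrix:
    # Each phase's adapter is merged before the next phase starts,
    # so the next adapter is trained on top of the already-updated weights.
    weight = base
    for lora_a, lora_b, alpha, rank in adapters:
        weight = merge_lora(weight, lora_a, lora_b, alpha, rank)
    return weight
```

Because every adapter is folded into the dense weights, the final network has the same shape as the base model and carries no adapter overhead at inference time.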

P1 Detect

Emit the per-criterion + overall verdict from a single prompt.

token CE + 2.0 × BCE
aux real/fake head on vision features

P2 Stay consistent

Multi-prompt robustness (4 phrasings/image), scores tied to the written answer.

token CE + 0.1 × overall BCE + 0.05 × criterion BCE

P3 Explain

Image → free-form written evidence for each criterion.

standard SFT token CE

P4 Format

Text-only → one valid JSON; any fake criterion forces an AI-Generated verdict.

standard SFT token CE
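The loss mixes on the phase cards, and the P4 verdict rule, can be written out as a small sketch. It is an illustration only: the token-level CE is taken as a given number, the auxiliary heads are represented by raw logits, averaging the per-criterion BCE over criteria is our assumption, and so is returning "Real" when no criterion is flagged.

```python
import math
from typing import Optional, Sequence

# (weight on overall real/fake BCE, weight on per-criterion BCE) per phase,
# read off the phase cards above. P3 and P4 use token CE only.
PHASE_WEIGHTS = {"P1": (2.0, 0.0), "P2": (0.1, 0.05), "P3": (0.0, 0.0), "P4": (0.0, 0.0)}


def bce_with_logits(logit: float, target: float) -> float:
    # Numerically stable binary cross-entropy on a raw logit.
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def phase_loss(phase: str, token_ce: float,
               overall_logit: Optional[float] = None,
               overall_target: Optional[float] = None,
               criterion_logits: Sequence[float] = (),
               criterion_targets: Sequence[float] = ()) -> float:
    w_overall, w_criterion = PHASE_WEIGHTS[phase]
    loss = token_ce
    if w_overall and overall_logit is not None and overall_target is not None:
        loss += w_overall * bce_with_logits(overall_logit, overall_target)
    if w_criterion and criterion_logits:
        # Assumed reduction: mean BCE over the criteria.
        per_criterion = [bce_with_logits(l, t)
                         for l, t in zip(criterion_logits, criterion_targets)]
        loss += w_criterion * sum(per_criterion) / len(per_criterion)
    return loss


def overall_verdict(criterion_scores: Sequence[int]) -> str:
    # P4 rule: any criterion flagged as fake forces an AI-Generated verdict.
    return "AI-Generated" if any(s == 1 for s in criterion_scores) else "Real"
```

The heavy 2.0 weight in P1 pushes the vision features toward the real/fake signal, while the much smaller P2 weights act more like a regularizer next to the token loss.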

Vision tower — unfrozen only in P1, so the ViT itself learns to see artifacts; frozen for P2–P4. The auxiliary heads (P1, P2) are dropped after training — only the LoRA delta is merged.

P4 runs in two passes — a warmup (lr 5e-5) then a low-lr final (lr 1e-5) to lock the JSON format without overfitting. The merged P4 model is what ships to quantization.

3 · Quantization & deployment

Quantize the merged model with AIMET (W4A16) — both vision encoder and language model.
Export through ONNX → QNN binary for the Snapdragon NPU.
Match calibration data to the real inference distribution; a mismatch quietly wrecks quantized accuracy.
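Why calibration data matters can be shown with a toy example. The sketch below is a generic symmetric uniform quantizer in plain Python, not the AIMET API and not our export script; the function names are hypothetical and the actual quantization scheme (per-channel, asymmetric, and so on) may differ. It quantizes weights to 4 bits and activations to 16 bits, with the activation range taken from a calibration set.

```python
from typing import List, Sequence


def fake_quantize(values: Sequence[float], bits: int, max_abs: float) -> List[float]:
    """Round to a signed `bits`-bit grid spanning [-max_abs, max_abs], then dequantize.

    Values outside the range are clipped, which is where calibration mismatch hurts.
    """
    if max_abs <= 0:
        raise ValueError("max_abs must be positive")
    qmax = 2 ** (bits - 1) - 1
    scale = max_abs / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def calibrate(samples: Sequence[float]) -> float:
    # Simplest possible calibration: the largest magnitude seen in the samples.
    return max(abs(v) for v in samples)


def mean_abs_error(reference: Sequence[float], approx: Sequence[float]) -> float:
    return sum(abs(r - a) for r, a in zip(reference, approx)) / len(reference)


# W4: weights on a 4-bit grid (15 levels for a symmetric signed range).
weights = [0.7, -0.35, 0.1, 0.02]
w4 = fake_quantize(weights, bits=4, max_abs=calibrate(weights))

# A16: activations on a 16-bit grid, range taken from calibration samples.
activations = [0.5, -2.0, 8.0]
matched = fake_quantize(activations, 16, calibrate([0.4, -7.5, 8.0]))
mismatched = fake_quantize(activations, 16, calibrate([0.2, -0.9, 1.0]))

print(mean_abs_error(activations, matched))     # tiny rounding error
print(mean_abs_error(activations, mismatched))  # large error from clipping
```

With a calibration range that never saw values above 1.0, the activations at -2.0 and 8.0 are silently clipped. Nothing crashes, which is why the accuracy loss is easy to miss.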

Results

2nd of all teams
0.72 challenge score
31.2 tokens/s · 2× the floor
2.6 GB QNN binary

Team

Team SSUPER_POWER — VIP Lab, Soongsil University:

Dayoung Kil
Doeon Kim
Junyoon Lee

Tech stack

Qwen2.5-VL-7B · Qwen2-VL-2B · LoRA+ / DoRA / PiSSA · PyTorch 2.10 · LLaMA-Factory 0.9.1 · AIMET Pro 1.34 (W4A16) · QAIRT 2.31 · QNN

Datasets: GenImage (ADM · BigGAN) · SID-Set · SynthScars · ImageNet · COCO · ARForensics.
Code released under the MIT License.