[Evo-ViT] Slow-Fast Token Evolution for Dynamic Vision Transformer

Pruning AAAI 2022

Yifan Xu, Zhijie Zhang, Mengdan Zhang, Kekai Sheng, Ke Li, Weiming Dong, Liqing Zhang, Changsheng Xu, Xing Sun · CAS / SJTU / Tencent Youtu Lab


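TL;DR. Tokens are never discarded. They are split into informative and placeholder tokens: important tokens are updated slowly through the full transformer block, while less important tokens are updated quickly via a representative token. Because the spatial structure is preserved, the method also accelerates deep-narrow architectures and training from scratch.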

Background

Existing token pruning methods (DynamicViT, PS-ViT) are effective but have two limitations.

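Structural damage: discarding tokens in an unstructured way breaks the 2D spatial layout, which makes these methods hard to apply to deep-narrow models (modern ViTs that use structured compression, such as LeViT).
Dependence on pretraining: they usually require an already trained ViT plus a time-consuming additional procedure, so they cannot be trained from scratch.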

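Figure 1. Comparison of token-based computation reduction pipelines: (1) unstructured pruning (requires a pretrained model), (2) structured compression, (3) Evo-ViT, whose unstructured updates preserve the structure and therefore also apply to structurally compressed models.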

If discarding tokens is the problem, why not keep them and process the less important ones cheaply?

Key idea

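Less important tokens are called placeholder tokens and are not removed. Instead, they are updated through a different computation path from the informative tokens (slow-fast). Since the spatial structure and information flow remain intact, both flat (DeiT) and deep-narrow (LeViT) models can be accelerated from the very start of training.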

“Self-motivated”: the ViT's class token identifies the informative tokens by itself (class attention), so no separate prediction module like the one in DynamicViT is needed.

Figure 3. Overall architecture of Evo-ViT: global class attention selects the informative and placeholder tokens, which are then updated along two paths, slow (informative + representative) and fast (placeholder).

Method

1) Structure-preserving token selection

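Global class attention selects the top-k tokens as informative; the remaining N−k are kept as placeholders (not discarded).
Class attention is evolved across layers through a residual connection for stability (in the formula, the superscript k is the layer index, not the top-k count):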

\[A^{k}_{cls,g} = \alpha\, A^{k-1}_{cls,g} + (1-\alpha)\, A^{k}_{cls} \quad (\alpha = 0.5)\]
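The selection step can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the use of Python lists instead of tensors, the attention values, and rounding k down with `int` are all assumptions made for the example.

```python
def evolve_class_attention(prev_global, current, alpha=0.5):
    """Blend the previous global class attention with the current layer's
    class attention: A_g^k = alpha * A_g^(k-1) + (1 - alpha) * A^k."""
    return [alpha * p + (1.0 - alpha) * c for p, c in zip(prev_global, current)]


def split_tokens(global_attn, keep_ratio=0.5):
    """Return (informative, placeholder) index lists.

    Both lists are sorted by position, so every token keeps its place in
    the spatial layout; nothing is dropped."""
    n = len(global_attn)
    k = max(1, int(n * keep_ratio))  # rounding down is an assumption
    ranked = sorted(range(n), key=lambda i: global_attn[i], reverse=True)
    return sorted(ranked[:k]), sorted(ranked[k:])


# Made-up class-attention scores for four patch tokens.
prev_global = [0.10, 0.40, 0.05, 0.45]
current = [0.30, 0.20, 0.15, 0.35]

global_attn = evolve_class_attention(prev_global, current)
informative, placeholder = split_tokens(global_attn)
```

With these toy values, tokens 1 and 3 carry the highest blended attention and become informative, while tokens 0 and 2 remain in the sequence as placeholders.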

2) Slow-fast token updating

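The placeholder tokens are summarized into a single representative token x_rep (weighted sum).
Slow path: the informative tokens x_inf and x_rep are updated carefully with MSA+FFN.
Fast path: the placeholder tokens x_ph are updated quickly with the residual of x_rep (roughly a simple copy). This acts like a skip connection and keeps information flowing.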
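The three steps above can be sketched as follows. This is an illustrative sketch under stated assumptions: `slow_block` is a hypothetical stand-in for the MSA+FFN block, `toy_block` is a made-up replacement so the example runs without a real transformer, and weighting placeholders by their normalized global class attention is one plausible reading of "weighted sum".

```python
def weighted_sum(tokens, weights):
    """Weighted average of token vectors (normalization is an assumption)."""
    total = sum(weights)
    dim = len(tokens[0])
    return [sum(w * t[d] for w, t in zip(weights, tokens)) / total
            for d in range(dim)]


def slow_fast_update(tokens, global_attn, informative, placeholder, slow_block):
    # 1) Summarize all placeholder tokens into one representative token.
    x_rep = weighted_sum([tokens[i] for i in placeholder],
                         [global_attn[i] for i in placeholder])

    # 2) Slow path: informative tokens plus x_rep go through the full block.
    slow_out = slow_block([tokens[i] for i in informative] + [x_rep])
    residual = [new - old for new, old in zip(slow_out[-1], x_rep)]

    # 3) Fast path: each placeholder just receives x_rep's residual.
    updated = [list(t) for t in tokens]
    for pos, i in enumerate(informative):
        updated[i] = list(slow_out[pos])
    for i in placeholder:
        updated[i] = [v + r for v, r in zip(tokens[i], residual)]
    return updated  # same length and order as the input tokens


def toy_block(xs):
    """Hypothetical stand-in for MSA+FFN: adds the mean token to every token."""
    dim = len(xs[0])
    mean = [sum(x[d] for x in xs) / len(xs) for d in range(dim)]
    return [[v + m for v, m in zip(x, mean)] for x in xs]


tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [1.0, 1.0]]
global_attn = [0.2, 0.3, 0.1, 0.4]
updated = slow_fast_update(tokens, global_attn, [1, 3], [0, 2], toy_block)
```

The output has as many tokens as the input, in the same order, which is the property that lets the method sit inside models that rely on a fixed 2D layout. Only the informative tokens and the single representative token pay the cost of the full block.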

3) Training strategy

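Layer-to-stage schedule: for the first 200 epochs, selection and updating happen at every layer; for the remaining 100 epochs, selection happens only at the start of each stage, which is faster. (This exploits the observation that, once the model is well trained, selection results are consistent across layers.)
Assisted CLS token loss: the classification loss is applied to both the CLS token and the average-pooled feature, so the method also applies to ViTs without a CLS token. (At inference, classification uses the average feature; CLS is used only for selection.)
Settings: keep ratio 0.5, selection starts at the 5th layer, 300 epochs.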
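A rough sketch of how the schedule and the two-headed loss might look in code. The epoch counts and the starting layer come from the notes above; everything else is an assumption for illustration: the helper names, 1-indexed layers and 0-indexed epochs, the `STAGE_STARTS` boundaries, and the unweighted sum of the two cross-entropy terms.

```python
import math

LAYERWISE_EPOCHS = 200     # first 200 of 300 epochs: select at every layer
FIRST_SELECTION_LAYER = 5  # selection starts at the 5th layer (1-indexed)
STAGE_STARTS = (5, 9)      # hypothetical stage boundaries, illustration only


def selects_at(epoch, layer):
    """Whether token selection is recomputed at this layer in this epoch."""
    if layer < FIRST_SELECTION_LAYER:
        return False
    if epoch < LAYERWISE_EPOCHS:
        return True                 # layer-wise phase
    return layer in STAGE_STARTS    # stage-wise phase


def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for one sample."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def assisted_cls_loss(cls_logits, avg_logits, target):
    """Classification loss on both the CLS head and the average-pooled head.
    Summing the two terms without weights is an assumption."""
    return cross_entropy(cls_logits, target) + cross_entropy(avg_logits, target)


loss = assisted_cls_loss([2.0, 0.5, -1.0], [1.5, 0.2, 0.1], target=0)
```

In the stage-wise phase, layers between stage starts would reuse the most recent selection instead of recomputing it, which is where the extra training speed comes from.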

Results

Table 1. On DeiT-T/S/B, better accuracy and throughput than existing token pruning methods (PS-ViT, DynamicViT, SViTE, IA-RED²).

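DeiT-S: throughput +60.6%, top-1 −0.4% (79.4), better in both accuracy and speed than pruning methods of the same family.
Deep-narrow models are accelerated too: even LeViT, which pruning could not handle, is accelerated thanks to structure preservation. However, deeper layers have fewer tokens and therefore less redundancy, so the accuracy drop is larger; the method is more effective with dense input (384²).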
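To get a feel for where the savings come from, the following back-of-the-envelope estimate compares the cost of one transformer block on all tokens with the cost of the slow path at keep ratio 0.5. It is not taken from the paper: the FLOP formula is the usual multiply-add approximation for a ViT block, and the DeiT-S shape (width 384, 196 patch tokens plus CLS at 224² input) is background knowledge rather than something stated in these notes.

```python
def block_flops(num_tokens, dim, mlp_ratio=4):
    """Approximate multiply-adds for one ViT block (MSA + FFN)."""
    attn_proj = 4 * num_tokens * dim * dim        # Q, K, V and output projections
    attn_mix = 2 * num_tokens * num_tokens * dim  # QK^T and attention-weighted V
    ffn = 2 * mlp_ratio * num_tokens * dim * dim  # two linear layers
    return attn_proj + attn_mix + ffn


dim = 384                  # DeiT-S embedding width
full_tokens = 196 + 1      # 14x14 patches + CLS
slow_tokens = 98 + 1 + 1   # informative (keep ratio 0.5) + representative + CLS

ratio = block_flops(slow_tokens, dim) / block_flops(full_tokens, dim)
print(f"slow-path cost relative to a full block: {ratio:.2f}")
```

The per-block cost falls to a little under half. The end-to-end gain is smaller than that, since the first four layers still process every token and this estimate ignores the selection step, the fast-path additions, and memory traffic.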

Ablation

Table 3. Effect of each module: starting from naive selection (simply dropping tokens), adding structure preservation, global attention, fast updating, and layer-to-stage improves results step by step.

Simply dropping tokens (naive) lowers DeiT-T from 72.2 to 70.8. Structure preservation recovers it to 71.6 and global attention to 72.0. Fast updating is especially effective on deep-narrow models (LeViT: 72.5→73.0).

Table 4. Comparison of token selection criteria: global class attention (72.0) outperforms pooling/conv, random, last-attention, and column-mean.

Global class attention is the best selection criterion (72.0) > attention column mean 71.2 > pooling/conv ~70 > random 66.4.
Using the same keep ratio (0.5) at every layer is optimal (better than progressively shrinking schedules).

One-line summary & significance

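Like EViT, selection is based on CLS attention, but whereas EViT fuses the less important tokens into a single token (reducing the token count), Evo-ViT keeps every placeholder and updates it through the fast path, fully preserving the spatial structure. Support for deep-narrow architectures and training from scratch is therefore the key differentiator.
Limitations / next steps. Focused on classification; extension to downstream tasks such as detection and segmentation is left as future work.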