vlm

11 notes tagged “vlm”

[LLaVA-OneVision] Easy Visual Task Transfer

The first single open LMM to perform strongly across single-image, multi-image, and video scenarios, with cross-scenario task transfer (video understanding emerges from image training) enabled by a balanced AnyRes token budget.

arXiv 2024

2024 · vlm
[InternVL] Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Argues the vision encoder is too small next to the LLM, so it scales the encoder to 6B (InternViT-6B) and bridges it with an 8B language middleware (QLLaMA) via progressive contrastive→generative alignment.

CVPR 2024

2023 · vlm
[LLaVA-1.5] Improved Baselines with Visual Instruction Tuning

A systematic study of LLaVA's design choices — an MLP connector, a 336px CLIP encoder, and academic-task VQA data with response-format prompts — sets SOTA on 11 benchmarks with only public data.

CVPR 2024

2023 · vlm
[Qwen-VL] A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Adds vision to Qwen-7B via a ViT + position-aware cross-attention adapter (256 query tokens), trained in 3 stages — and adds grounding and text-reading via box/ref special tokens.

arXiv 2023

2023 · vlm
[InstructBLIP] Towards General-purpose Vision-Language Models with Instruction Tuning

Instruction-tunes BLIP-2 on 26 datasets, and makes the Q-Former instruction-aware — feeding the instruction to the Q-Former so it extracts visual features tailored to the task.

NeurIPS 2023

2023 · vlm
[MiniGPT-4] Enhancing Vision-Language Understanding with Advanced Large Language Models

Aligns a frozen vision encoder (BLIP-2's ViT+Q-Former) and a frozen Vicuna with a single linear projection layer — showing GPT-4-like abilities emerge from minimal alignment of a strong LLM.

arXiv 2023

2023 · vlm
[LLaVA] Visual Instruction Tuning

Uses language-only GPT-4 to generate multimodal instruction-following data, then connects a frozen CLIP encoder to Vicuna with a single linear projection and instruction-tunes end-to-end.

NeurIPS 2023

2023 · vlm
[Kosmos-1] Language Is Not All You Need: Aligning Perception with Language Models

A multimodal LLM trained from scratch on web-scale multimodal corpora (text, image-caption pairs, and interleaved image-text) rather than built on a frozen LLM: it perceives general modalities, learns in context (few-shot), and follows instructions (zero-shot).

arXiv 2023

2023 · vlm
[BLIP-2] Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Bridges a frozen image encoder and a frozen LLM with a lightweight Querying Transformer (Q-Former), pre-trained in two stages — representation learning then generative learning.

ICML 2023

2023 · vlm
[Flamingo] Flamingo: a Visual Language Model for Few-Shot Learning

Bridges a frozen vision encoder and a frozen LLM with a Perceiver Resampler + gated cross-attention, unlocking GPT-3-style few-shot in-context learning on image/video tasks.

NeurIPS 2022

2022 · vlm
[BLIP] Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

A unified vision-language model (MED) that handles both understanding and generation, plus CapFilt — a captioner+filter that bootstraps noisy web captions into cleaner training data.

ICML 2022

2022 · vlm