Being-VL-0: From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities

Oct 03, 2024 (Accepted: ICLR 2025)
Wanpeng Zhang / Zilong Xie / Yicheng Feng / Yijiang Li / Xingrun Xing / Sipeng Zheng / Zongqing Lu
Peking University / CUHK / UC San Diego / BAAI / CASIA

Multimodal LLMs often rely on modality-specific encoders, which makes vision tokens feel foreign to the language model. Being-VL-0 instead teaches the model a new visual vocabulary by applying Byte-Pair Encoding (BPE) directly on quantized image tokens. The result is a tokenizer that injects structural priors into the visual stream so the LLM can learn alignment more naturally.

Overview

BPE image tokenizer overview
BPE Image Tokenizer merges quantized image tokens into structure-aware visual tokens before they enter the LLM.

Why Tokenization Matters

Our theoretical analysis shows that Transformers can collapse to unigram behavior on certain 2D sequence distributions. By explicitly merging visual tokens, BPE provides the missing structure and makes the learning problem tractable.

Transformer without tokenizer
Without a tokenizer, Transformer training stays near the unigram baseline.
Transformer with tokenizer
With BPE tokenization, even small Transformers reach the optimal loss.
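
To make the two reference lines in these plots concrete, here is a schematic statement in our own notation (an illustration of the claim, not the paper's exact theorem). Write p for the distribution over quantized token grids and p1 for its per-token marginal.

```latex
% Unigram baseline: best achievable loss when each token is predicted without context
\mathcal{L}_{\text{unigram}} = H(p_1) = -\sum_{v \in \mathcal{V}} p_1(v)\,\log p_1(v)

% Optimal loss: the entropy rate of the underlying 2D token process
\mathcal{L}_{\text{opt}} = \lim_{n \to \infty} \frac{1}{n}\, H(x_1, \dots, x_n) \;\le\; H(p_1)
```

Without tokenization, training loss stays near the unigram baseline; once BPE merges expose the correlated token pairs, the model can approach the optimal loss.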

BPE Image Tokenizer

We build a 2D extension of BPE that merges frequent adjacent token pairs in both the horizontal and vertical directions. The tokenizer grows an extended vocabulary on top of the VQ-GAN codebook, producing composite visual tokens that encode spatial structure. These tokens are then concatenated with text tokens into a single sequence, so the LLM learns cross-modal reasoning end-to-end.
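
To make the merging rule concrete, here is a minimal sketch of greedy 2D BPE training, assuming images have already been quantized into integer token grids by a VQ-GAN. The function names, the -1 sentinel for consumed cells, and the simple greedy loop are our simplifications for illustration, not the released tokenizer.

```python
import numpy as np
from collections import Counter

def count_pairs(grids):
    """Count horizontally ("H") and vertically ("V") adjacent token pairs across all grids."""
    counts = Counter()
    for g in grids:
        h, w = g.shape
        for i in range(h):
            for j in range(w):
                if g[i, j] < 0:                      # cell already consumed by a merge
                    continue
                if j + 1 < w and g[i, j + 1] >= 0:
                    counts[(g[i, j], g[i, j + 1], "H")] += 1
                if i + 1 < h and g[i + 1, j] >= 0:
                    counts[(g[i, j], g[i + 1, j], "V")] += 1
    return counts

def apply_merge(grids, pair, new_id):
    """Replace every occurrence of `pair` with `new_id`, marking the second cell as consumed."""
    a, b, direction = pair
    for g in grids:
        h, w = g.shape
        for i in range(h):
            for j in range(w):
                if direction == "H" and j + 1 < w and g[i, j] == a and g[i, j + 1] == b:
                    g[i, j], g[i, j + 1] = new_id, -1
                if direction == "V" and i + 1 < h and g[i, j] == a and g[i + 1, j] == b:
                    g[i, j], g[i + 1, j] = new_id, -1

def train_2d_bpe(grids, codebook_size, num_merges):
    """Greedily learn `num_merges` 2D merge rules on top of the VQ codebook (mutates grids)."""
    merges, next_id = [], codebook_size
    for _ in range(num_merges):
        counts = count_pairs(grids)
        if not counts:
            break
        pair, _ = counts.most_common(1)[0]           # most frequent adjacent pair
        apply_merge(grids, pair, next_id)
        merges.append((pair, next_id))
        next_id += 1
    return merges
```

For example, train_2d_bpe([np.random.randint(0, 8192, (24, 24)) for _ in range(100)], codebook_size=8192, num_merges=1000) would learn 1,000 merge rules on top of a hypothetical 8,192-entry codebook; note this sketch drops adjacency across consumed cells, a simplification of the full 2D procedure.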

Training Recipe

Being-VL-0 uses a two-stage training pipeline on top of Llama-3.1-8B:

  • Image understanding pretraining (PT): freeze text embeddings and learn the new image token embeddings from captioned data.
  • Supervised fine-tuning (SFT): unfreeze all parameters and train on multi-turn vision-language instruction data.

This mirrors the intuition from text LLMs: learn a tokenizer-aware embedding first, then scale to richer reasoning data.
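
A minimal sketch of how the PT-stage freezing could be wired up with PyTorch and Hugging Face Transformers follows; the vocabulary sizes and the gradient-masking hook are assumptions for illustration, not the released training code.

```python
from transformers import AutoModelForCausalLM

# Hypothetical vocabulary layout: Llama-3.1 text tokens first, then the VQ codebook,
# then the merged tokens learned by the BPE image tokenizer.
TEXT_VOCAB, VQ_CODEBOOK, BPE_MERGES = 128_256, 8_192, 8_192

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
model.resize_token_embeddings(TEXT_VOCAB + VQ_CODEBOOK + BPE_MERGES)

# Stage PT: keep the original text embeddings fixed and learn only the new visual
# rows by zeroing their portion of the embedding gradient.
emb = model.get_input_embeddings()

def mask_text_rows(grad):
    grad = grad.clone()
    grad[:TEXT_VOCAB] = 0            # text-token rows receive no update
    return grad

pt_hook = emb.weight.register_hook(mask_text_rows)
# ... run image-understanding pretraining on captioned data here ...

# Stage SFT: drop the mask and fine-tune all parameters on multi-turn
# vision-language instruction data.
pt_hook.remove()
for p in model.parameters():
    p.requires_grad_(True)
```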

Results at a Glance

Across VQAv2, MMBench, MME, POPE, and VizWiz, BPE tokenization consistently improves over VQ-only baselines. The gains are strongest when the text embeddings are frozen during pretraining, suggesting cleaner visual grounding. Additional data scaling further lifts performance, indicating the approach has headroom.

Vocabulary Size and Token Usage

We study how the BPE vocabulary size affects performance and how the model distributes weight across original VQ tokens versus merged tokens. The sweet spot appears near an 8K extended vocabulary, balancing fine-grained details with higher-level semantics.
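
A coarse version of this usage analysis can be reproduced by tokenizing a set of images and counting how many emitted tokens are base VQ codes versus merged BPE tokens; the id layout and the -1 sentinel here are carried over from the 2D BPE sketch above and are assumptions, not the paper's exact measurement.

```python
from collections import Counter

def vq_vs_bpe_usage(token_grids, codebook_size):
    """Fraction of emitted tokens that are base VQ codes vs. merged BPE tokens.
    Assumes merged tokens carry ids >= codebook_size and cells consumed by a
    merge are marked with -1, as in the 2D BPE sketch above."""
    counts = Counter()
    for grid in token_grids:
        for tok in grid.flatten():
            if tok < 0:                              # skip consumed cells
                continue
            counts["bpe" if tok >= codebook_size else "vq"] += 1
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()}
```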

Vocabulary size ablation
Performance improves with vocabulary size up to 8K, then plateaus or drops.
Token embedding weight visualization
Token usage shifts with vocabulary size, revealing a balance between VQ and BPE tokens.

Takeaway

Being-VL-0 shows that tokenization is not just a compression trick for multimodal models. By making visual tokens compositional and structure-aware, we can turn a text LLM into a stronger multimodal reasoner with less data.

Citation

@inproceedings{zhang2025pixelstokens,
  title={From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities},
  author={Zhang, Wanpeng and Xie, Zilong and Feng, Yicheng and Li, Yijiang and Xing, Xingrun and Zheng, Sipeng and Lu, Zongqing},
  booktitle={International Conference on Learning Representations},
  year={2025}
}