Selftok: Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning

Selftok Team

Media Technology Institute, Huawei Singapore

Our preliminary work has been accepted as an oral presentation at CVPR 2025. You can find more details at the following link.

Abstract

We completely discard the conventional spatial prior in image representation and introduce a novel discrete visual tokenizer: the Self-Consistency Tokenizer (Selftok). At its design core, we compose an autoregressive (AR) prior—mirroring the causal structure of language—into visual tokens by using the reverse diffusion process of image generation. The AR property makes Selftok fundamentally distinct from traditional spatial tokens in the following two key ways:

- Selftok offers an elegant and minimalist way to unify diffusion and AR for vision-language models (VLMs): once images are represented as Selftok tokens, a VLM can be trained as a purely discrete AR model, just like an LLM, without extra modules or training objectives.
- Selftok supports effective reinforcement learning (RL) on visual tokens: because the tokens obey a causal AR structure, free of the anti-causal dependencies that plague spatial tokens, policy-gradient RL improves visual generation as reliably as it does language.

Beyond the AR property, Selftok is also a SOTA tokenizer, achieving both high-quality reconstruction and a high compression rate. After representing the training images as Selftok tokens, our VLM, a pure AR model, achieves SOTA performance on both visual comprehension and generation. Impressively, without using any text-image training pairs, a simple policy-gradient RL procedure operating on the visual tokens significantly boosts visual generation benchmark scores, surpassing all existing models by a large margin.

Therefore, we believe that Selftok effectively addresses the long-standing challenge that visual tokens cannot support effective RL. Combined with the well-established strengths of RL in LLMs, this brings us one step closer to realizing a truly multimodal LLM.

Introduction

Why Discrete?

Comparison of pure discrete autoregressive model (dAR) and hybrid dAR/cAR model

Comparison of (a) a pure discrete autoregressive model (dAR) and (b) a hybrid model that combines a dAR with a continuous autoregressive model (cAR). $\texttt{<BOS>}/\texttt{<BOV>}$ indicates the start of a sentence/image. $w_i/v_i$ denotes the $i$-th language/visual token. Both models predict the next token given all previous ones, e.g., $[\texttt{<BOS>}, \dots, \texttt{<}v_3\texttt{>}] \to \texttt{<}v_4\texttt{>}$.
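To make the unification concrete, the sketch below (our illustration, not the released code; vocabulary sizes and module names are assumptions) shows how a pure dAR model treats language and visual tokens as entries of one shared vocabulary and trains both with a single next-token cross-entropy, exactly as in an LLM:

```python
# Illustrative sketch only; vocabulary sizes and module names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB, VISUAL_VOCAB = 32000, 32768        # hypothetical vocabulary sizes
VOCAB = TEXT_VOCAB + VISUAL_VOCAB + 2          # plus <BOS> and <BOV> specials
BOS, BOV = VOCAB - 2, VOCAB - 1

class TinyDAR(nn.Module):
    """Toy decoder-only model over the unified text+visual token space."""
    def __init__(self, d=256, n_layers=2, n_heads=4, max_len=512):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, d)
        self.pos = nn.Embedding(max_len, d)
        layer = nn.TransformerEncoderLayer(
            d_model=d, nhead=n_heads, dim_feedforward=4 * d, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d, VOCAB)

    def forward(self, tokens):                 # tokens: (B, T) int64
        T = tokens.shape[1]
        x = self.emb(tokens) + self.pos(torch.arange(T, device=tokens.device))
        causal = torch.triu(                   # mask out future positions
            torch.full((T, T), float("-inf"), device=tokens.device), diagonal=1)
        return self.head(self.blocks(x, mask=causal))  # (B, T, VOCAB) logits

# One unified sequence: <BOS> w_1 ... w_n <BOV> v_1 ... v_m
seq = torch.tensor([[BOS, 5, 17, 99, BOV, TEXT_VOCAB + 7, TEXT_VOCAB + 301]])
model = TinyDAR()
logits = model(seq[:, :-1])                    # predict token t+1 from tokens <= t
loss = F.cross_entropy(logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1))
loss.backward()                                # one objective for w_i and v_i alike
```

Because the objective and architecture are identical for both modalities, no extra diffusion head or continuous regression loss is needed, which is what separates design (a) from the hybrid design (b).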

Why Not Spatial?

Illustration of spurious dependencies in spatial tokens and anti-causal links

(a) As an image can be viewed as the effect of spatial pixels (or patches), observing any part of it introduces spurious dependencies among spatial tokens, making them non-AR. (b) Due to the anti-causal links (red) for spatial tokens, learning the locally optimal policy $\pi^*_4$ for a later token (e.g., $v_4$) can propagate backward and interfere with earlier tokens that were already optimized (e.g., $v_1, v_2, v_3$). In contrast, AR tokens without such links do not have this issue.
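The practical upshot is that standard policy-gradient RL can run directly on the visual token sequence. Below is a minimal REINFORCE sketch (our illustration under assumed interfaces; `model` is any next-token predictor over visual tokens, and `reward_fn` is a hypothetical scorer, e.g. an image-quality or text-alignment reward):

```python
# Minimal REINFORCE sketch; not the paper's training code.
# `model` and `reward_fn` are assumed interfaces.
import torch

def reinforce_step(model, optimizer, reward_fn, bov_id, num_tokens):
    """Sample v_1 ... v_K autoregressively, then do one policy-gradient update."""
    device = next(model.parameters()).device
    seq = torch.full((1, 1), bov_id, dtype=torch.long, device=device)
    log_probs = []
    for _ in range(num_tokens):
        logits = model(seq)[:, -1]                     # distribution over v_t
        dist = torch.distributions.Categorical(logits=logits)
        tok = dist.sample()                            # v_t ~ pi(. | v_<t)
        log_probs.append(dist.log_prob(tok))
        seq = torch.cat([seq, tok[:, None]], dim=1)
    reward = reward_fn(seq[:, 1:])                     # scalar score of the image
    loss = -reward * torch.stack(log_probs).sum()      # REINFORCE estimator
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(reward)
```

Under the AR factorization, each update only reshapes $\pi(v_t \mid v_{<t})$; there is no anti-causal path through which optimizing a later token rewrites the role of earlier, already-optimized ones.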

Progressive Reconstruction & Interpolation

Illustration of progressive reconstruction and interpolation

Progressive reconstruction (left to right): Reconstructions by progressively masking out a shorter sequence of tokens before inputting to the decoder. Interpolation (left to right): Reconstructions by gradually replacing tokens of the left image with those of the right one. All methods except Selftok exhibit strong spatial characteristics, i.e., tokens $\Leftrightarrow$ patches.
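Both probes are easy to state in code. The sketch below uses an assumed `decode` interface and a hypothetical `mask_id`; it shows progressive reconstruction (masking the token suffix before decoding) and interpolation (splicing one image's tokens into another's):

```python
# Illustrative helpers; `decode` and `mask_id` are assumed interfaces, not
# the released API. `tokens` is a (B, K) tensor of discrete token ids.
import torch

def progressive_reconstruction(decode, tokens, keep, mask_id):
    """Decode after masking all but the first `keep` tokens."""
    masked = tokens.clone()
    masked[:, keep:] = mask_id        # later tokens refine earlier content
    return decode(masked)

def interpolate(decode, tokens_a, tokens_b, k):
    """Decode image A with its last k tokens replaced by image B's."""
    mixed = tokens_a.clone()
    mixed[:, tokens_a.shape[1] - k:] = tokens_b[:, tokens_b.shape[1] - k:]
    return decode(mixed)
```

For a spatial tokenizer these operations edit contiguous image patches; for Selftok they sweep along the token order induced by the reverse diffusion process rather than over spatial regions.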

Selftok: Self-Consistency Tokenizer

Architecture

Selftok architecture diagram

Selftok architecture diagram. +POS: addition of positional embeddings; LN: LayerNorm; AdaLN: token-aware adaptive LayerNorm, which differentiates token embeddings; MSE: mean squared error for the reconstruction objective; DM: pre-trained diffusion model.
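The diagram implies a simple end-to-end objective: encode an image into K discrete tokens, then let a frozen pre-trained DM score the reconstruction. Below is a minimal sketch under loud assumptions: every module is a placeholder, and we assume a rectified-flow-style noising schedule, which may differ from the actual DM:

```python
# Sketch of the implied training loss; every module here is a placeholder.
import torch
import torch.nn.functional as F

def selftok_style_loss(encoder, quantizer, decoder, dm, image):
    z = encoder(image)                     # (B, K, d) latents (+POS applied inside)
    tokens, vq_loss = quantizer(z)         # straight-through vector quantization
    cond = decoder(tokens)                 # token-conditioned signal for the DM
    t = torch.rand(image.shape[0], device=image.device).view(-1, 1, 1, 1)
    noise = torch.randn_like(image)
    x_t = (1 - t) * image + t * noise      # rectified-flow interpolation (assumed)
    pred = dm(x_t, t.flatten(), cond)      # frozen pre-trained diffusion model
    target = noise - image                 # velocity target under this schedule
    return F.mse_loss(pred, target) + vq_loss
```

Because the reconstruction signal comes from the reverse diffusion process, the token sequence inherits its causal, AR-compatible ordering.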

Main Results

| Tokenizer | Type | #Token | #Code | rFID↓ | PSNR↑ | SSIM↑ | LPIPS↓ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LlamaGen | 2D | | | | | | |
| MaskBiT | 2D | | | | | | |
| Cosmos | 2D | | | | | | |
| VQGAN-LC | 2D | | | | | | |
| OpenMagViT-V2 | 2D | | | | | | |
| ViT-VQGAN | 2D | | | | | | |
| LlamaGen | 2D | | | | | | |
| Cosmos | 2D | | | | | | |
| VAR | 2D | | | | | | |
| TiTok-L-32 | 1D | | | | | | |
| TiTok-B-64 | 1D | | | | | | |
| TiTok-S-128 | 1D | | | | | | |
| FlexTok | 1D | | | | | | |
| FlowMo-Lo | 1D | | | | | | |
| FlowMo-Hi | 1D | | | | | | |
| Selftok (Ours) | 1D | | | | | | |
| Selftok (Ours) | 1D | | | | | | |

Reconstruction performance of different tokenizers on the ImageNet 50k validation set. Results are taken from the original papers.
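As a reading aid for the #Token and #Code columns: a tokenizer's bit cost per image is #Token × log2(#Code). The helper below computes it (the example values are hypothetical, not taken from the table):

```python
import math

def bits_per_image(num_tokens: int, codebook_size: int) -> float:
    """Bit cost of a discrete tokenization: #tokens * log2(#codes)."""
    return num_tokens * math.log2(codebook_size)

# e.g., a 1D tokenizer emitting 512 tokens from a 2**15-entry codebook:
print(bits_per_image(512, 2 ** 15))   # -> 7680.0 bits per image
```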

Evaluation of text-to-image generation ability on the GenEval benchmark. Janus-Pro-7B-Zero denotes a model that has undergone the same visual RL process as Selftok-Pre-Zero and Selftok-Zero. † denotes the result of our evaluation. ‡ GPT-4o results are from OpenAI's technical report.
| Type | Method | Single Obj. | Two Obj. | Counting | Colors | Position | Color Attr. | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Diffusion Only | PixArt-α | | | | | | | |
| | SDXL | | | | | | | |
| | FLUX.1-dev | | | | | | | |
| | DALL-E 3 | | | | | | | |
| | SD3-Medium | | | | | | | |
| | CogView4-6B | | | | | | | |
| | HiDream-I1 | | | | | | | |
| Hybrid Model | SEED-X | | | | | | | |
| | Transfusion | | | | | | | |
| | D-DiT | | | | | | | |
| | Show-o | | | | | | | |
| | GPT-4o‡ | | | | | | | |
| Pure dAR | Emu3-Gen | | | | | | | |
| | TokenFlow-XL | | | | | | | |
| | ILLUME+ | | | | | | | |
| | Infinity | | | | | | | |
| | Janus-Pro-7B | | | | | | | |
| | Janus-Pro-7B† | | | | | | | |
| | Janus-Pro-7B-Zero | | | | | | | |
| | Selftok-Pre | | | | | | | |
| | Selftok-Pre-Zero | | | | | | | |
| | Selftok-SFT | | | | | | | |
| | Selftok-Zero | | | | | | | |
Performance on DPG-Bench. All methods in this table are generation-specific models except Show-o, Janus-Pro, and Selftok. † denotes the result of our evaluation.
| Type | Method | Global | Entity | Attribute | Relation | Other | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Diffusion Only | PixArt-α | | | | | | |
| | SDXL | | | | | | |
| | DALL-E 3 | | | | | | |
| | SD3-Medium | | | | | | |
| | FLUX.1-dev | | | | | | |
| | CogView4-6B | | | | | | |
| | HiDream-I1 | | | | | | |
| Hybrid Model | Show-o | | | | | | |
| Pure dAR | Emu3-Gen | | | | | | |
| | Janus | | | | | | |
| | Infinity | | | | | | |
| | Janus-Pro-7B | | | | | | |
| | Janus-Pro-7B† | | | | | | |
| | Janus-Pro-7B-Zero | | | | | | |
| | Selftok-Pre | | | | | | |
| | Selftok-SFT | | | | | | |
| | Selftok-Zero | | | | | | |

Examples

Text-to-Image Generation

Text-to-image generation results by Selftok using text prompts from DPG-Bench.

T2I examples

Selftok-based Visual RL

Visual RL examples

Single-turn Image Editing

Single-turn editing examples

Multi-turn Image Editing

Multi-turn editing examples

BibTeX citation