Step3-VL-10B

10B Parameters · 2026-01 · StepFun

Frontier Performance, Minimal Cost

Figure: Average Benchmark Score vs Model Size. The y-axis is the average score over MMMU, MathVision, MathVista, MMBench (EN), and MMBench (CN); the x-axis is parameters (B). Series: Step3-VL-10B (SeRe), Step3-VL-10B (PaCoRe), 7-10B models, and flagship models.
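The plotted average for Step3-VL-10B can be reproduced from the scores reported in the benchmark table further down. A small sketch, assuming the figure uses an unweighted mean of the five listed benchmarks (the page does not state the weighting):

```python
# Scores copied from the benchmark table for the two Step3-VL-10B modes.
SCORES = {
    "SeRe": {
        "MMMU": 78.11, "MathVision": 70.81, "MathVista": 83.97,
        "MMBench (EN)": 92.05, "MMBench (CN)": 91.55,
    },
    "PaCoRe": {
        "MMMU": 80.11, "MathVision": 75.95, "MathVista": 85.50,
        "MMBench (EN)": 92.38, "MMBench (CN)": 91.96,
    },
}


def average_score(scores):
    """Unweighted mean over the benchmarks (an assumption about the figure)."""
    return sum(scores.values()) / len(scores)


for mode, scores in SCORES.items():
    print(f"{mode}: {average_score(scores):.2f}")
```

Under that assumption the averages come out at roughly 83.30 for SeRe and 85.18 for PaCoRe.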

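Step3-VL-10B is a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. Despite having only 10B parameters, Step3-VL-10B excels at visual perception, complex reasoning, and human alignment.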

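The model consistently performs best among models under the 10B scale, and it rivals or surpasses open-source models 10×–20× larger (e.g. GLM-4.6V 106B-A12B, Qwen3-VL-Thinking 235B-A22B) as well as top closed-source flagship models (e.g. Gemini 2.5 Pro, Seed-1.5-VL).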

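Step3-VL-10B's success stems from two core designs: unified pre-training on a high-quality multimodal corpus (1.2T tokens) and scaled multimodal reinforcement learning (more than 1,400 RL iterations). It also introduces Parallel Coordinated Reasoning (PaCoRe), which aggregates evidence from parallel visual exploration.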
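The general shape of parallel reasoning with aggregation can be sketched as follows. This is only an illustration of the pattern (sample several rollouts in parallel, then combine them into one answer), not the model's actual procedure, which this page does not specify. `generate` and `synthesize` are hypothetical callables standing in for model calls, and the default of 16 rollouts follows the evaluation note on this page.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def parallel_then_aggregate(
    prompt: str,
    generate: Callable[[str, int], str],          # hypothetical: (prompt, seed) -> one rollout
    synthesize: Callable[[str, List[str]], str],  # hypothetical: (prompt, rollouts) -> answer
    num_rollouts: int = 16,
) -> str:
    """Sample independent rollouts concurrently, then aggregate them."""
    with ThreadPoolExecutor(max_workers=num_rollouts) as pool:
        rollouts = list(
            pool.map(lambda seed: generate(prompt, seed), range(num_rollouts))
        )
    return synthesize(prompt, rollouts)


# Toy stand-ins so the sketch runs without a model.
def fake_generate(prompt: str, seed: int) -> str:
    # Every fourth rollout disagrees, to give the aggregator something to do.
    return "42" if seed % 4 else "41"


def fake_synthesize(prompt: str, rollouts: List[str]) -> str:
    # Majority vote as a placeholder; NOT the model's actual aggregation.
    return max(set(rollouts), key=rollouts.count)


answer = parallel_then_aggregate("What is 6 x 7?", fake_generate, fake_synthesize)
print(answer)
```

In a real system the aggregation step would itself be a model call that reads the rollouts; majority voting is used here only to keep the example self-contained.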

Benchmark

Figure: Benchmark Performance by Category. Panels: MMMU, MathVista, MathVision, and MMBench (multimodal); AIME2025 and MultiChallenge (text). Series: Step3-VL-10B (SeRe), Step3-VL-10B (PaCoRe), and other 7-10B models, compared with the flagship models Gemini 2.5 Pro, Seed-1.5-VL, GLM-4.6V (106B-A12B), and Qwen3-VL (235B-A22B).

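The evaluation covers core dimensions including STEM reasoning, recognition, OCR & documents, GUI grounding, spatial understanding, and code, and it presents the score differences across several peer models side by side. The comparison table emphasizes consistent measurement: the same dataset versions, a unified evaluation script, and fixed temperature and sampling parameters.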

| Benchmark | STEP3-VL-10B SeRe (10B) | STEP3-VL-10B PaCoRe (10B) | GLM-4.6V (106B-A12B) | Qwen3-VL (235B-A22B) | Gemini-2.5 Pro | Seed-1.5-VL |
|---|---|---|---|---|---|---|
| STEM / Multimodal Reasoning | | | | | | |
| MMMU | 78.11 | 80.11 | 75.20 | 78.70 | 83.89 | 79.11 |
| MMMU-Pro | 64.08 | 67.18 | 65.84 | 72.37 | 76.96 | 70.60 |
| MathVision | 70.81 | 75.95 | 63.50* | 72.10 | 73.30* | 68.70* |
| MathVista | 83.97 | 85.50 | 83.51 | 85.10 | 83.88 | 85.60 |
| LogicVista | 66.89 | 71.36 | 64.88 | 73.15 | 69.80 | 72.93 |
| DynaMath | 56.39 | 61.48 | 56.29 | 60.30 | 52.30 | 58.88 |
| ZeroBench (main) | 1.00 | 5.00 | 1.00 | 3.00 | 4.00 | 1.00 |
| ZeroBench (sub) | 27.54 | 29.94 | 29.04 | 28.40 | 33.53 | 31.74 |
| MathVerse (vision) | 75.73 | 78.30 | 72.84 | 76.65 | 78.30 | 77.79 |
| We-Math | 73.03 | 73.90 | 71.14 | 74.70 | 80.10 | 79.05 |
| VisuLogic | 29.68 | 32.70 | 28.30 | 31.80 | 31.40 | 34.30 |
| PhyX | 59.45 | 66.01 | 59.70 | 66.30 | 67.56 | 62.53 |
| Recognition / General VQA | | | | | | |
| MMBench (EN) | 92.05 | 92.38 | 92.75 | 92.70 | 93.19 | 92.11 |
| MMBench (CN) | 91.55 | 91.96 | 91.88 | 91.80 | 93.13 | 91.76 |
| SimpleVQA | 53.08 | 54.64 | 57.95 | 59.30 | 66.85 | 64.72 |
| MMStar | 77.48 | 77.64 | 75.30 | 76.80 | 79.18 | 77.91 |
| HallusionBench | 64.91 | 64.54 | 60.63 | 65.58 | 65.63 | 64.13 |
| MMVP | 68.16 | 68.00 | 71.33 | 71.30 | 70.67 | 74.00 |
| ReMI | 67.29 | 69.12 | 64.42 | 74.70 | 71.69 | 72.19 |
| M3GIA | 78.33 | 73.50 | 78.72 | 81.00 | 83.11 | 83.22 |
| DoYouSeeMe | 67.48 | 68.54 | 67.50 | 72.89 | 71.19 | 71.94 |
| Counting | | | | | | |
| CountBench | 88.75 | 88.80 | 92.06 | 92.46 | 87.78 | 91.85 |
| CountQA | 33.69 | 38.29 | 36.32 | 45.62 | 38.02 | 48.89 |
| PixMo-Count | 70.85 | 71.61 | 76.47 | 79.80 | 75.54 | 83.38 |
| OCR | | | | | | |
| OCRBench | 86.75 | 89.00 | 86.20 | 87.30 | 85.90 | 85.20 |
| OmniOCR | 76.98 | 78.14 | 84.53 | 87.20 | 66.05 | 87.80 |
| CC-OCR (Multi-Lang-OCR) | 76.59 | 77.51 | 74.08 | 80.80 | 81.10 | 78.82 |
| 2D / 3D Spatial Understanding | | | | | | |
| BLINK | 66.79 | 67.39 | 68.17 | 67.12 | 72.01 | 71.54 |
| CVBench | 83.49 | 85.92 | 83.72 | 87.86 | 84.36 | 86.27 |
| MMSI-Bench | 32.18 | 36.40 | 30.80 | 32.50 | 40.40 | 30.60 |
| ERQA | 48.87 | 51.75 | 47.75 | 53.50 | 62.25 | 48.50 |
| OmniSpatial | 51.58 | 52.58 | 50.49 | 53.10 | 55.64 | 51.99 |
| All-Angles-Bench | 57.21 | 64.71 | 62.94 | 60.59 | 65.88 | 57.65 |
| MindCube-tiny | 62.81 | 68.58 | 52.83 | 47.58 | 58.92 | 39.83 |
| RealWorldQA | 74.44 | 75.56 | 77.78 | 78.80 | 77.78 | 79.61 |
| SpatialViz-Bench | 45.51 | 52.03 | 37.46 | 46.36 | 45.34 | 35.25 |
| STARE | 61.75 | 64.57 | 60.38 | 70.89 | 62.36 | 62.99 |
| CoreCognition | 66.69 | 71.54 | 69.50 | 72.66 | 78.78 | 72.38 |
| V* | 82.85 | 84.29 | 85.86 | 89.53 | 80.63 | 90.58 |
| ViewSpatial | 46.14 | 48.41 | 43.87 | 48.58 | 44.15 | 44.14 |
| Exam (Text-Centric) | | | | | | |
| MMLU-Pro | 76.02 | 77.09 | 79.96 | 83.75 | 86.45 | 83.39 |
| GPQA-Diamond | 70.83 | 73.99 | 69.19 | 77.68 | 84.06 | 71.91 |
| SuperGPQA | 50.38 | 53.15 | 53.28 | 64.20 | 65.00 | 60.50 |
| LiveBench (2024-11-25) | 69.71 | 71.69 | 62.75 | 80.14 | 76.34 | 65.62 |
| Mathematics (Text-Centric) | | | | | | |
| AIME 2024 | 90.94 | 93.33 | 80.63 | 91.93 | 79.53 | 79.48 |
| AIME 2025 | 87.66 | 94.43 | 71.88 | 83.59 | 83.96 | 64.06 |
| HMMT 2025 | 78.18 | 92.14 | 57.29 | 67.71 | 65.68 | 51.30 |
| CNMO 2024 | 78.20 | 81.17 | 72.11 | 88.36 | 74.53 | 83.67 |
| Beyond AIME | 63.23 | 74.00 | 39.83 | 57.42 | 54.45 | 42.83 |
| IMO-AnswerBench | 62.12 | 76.66 | 51.25 | 69.25 | 72.00 | 44.75 |
| Code | | | | | | |
| LiveCodeBench (2408-2505) | 75.77 | 76.43 | 48.71 | 69.45 | 72.01 | 57.10 |

Note: SeRe (Sequential Reasoning) uses a maximum of 64K tokens; PaCoRe (Parallel Coordinated Reasoning) aggregates 16 parallel rollouts with a maximum of 128K tokens. Evaluation settings: temperature=1, top_p=1, top_k=0.
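To make the reported sampling settings concrete, here is a minimal pure-Python sketch of temperature, top-k, and top-p (nucleus) filtering. It assumes the common convention that `top_k=0` disables top-k filtering; the page does not say which inference framework was used, so that convention is an assumption. With temperature=1, top_p=1, and top_k=0, the filters leave the distribution unchanged and tokens are drawn from the plain softmax.

```python
import math
from typing import List


def sampling_distribution(
    logits: List[float],
    temperature: float = 1.0,
    top_p: float = 1.0,
    top_k: int = 0,
) -> List[float]:
    """Token probabilities after temperature, top-k and top-p filtering.

    Assumes top_k=0 means "no top-k filtering" (a common convention).
    """
    # Temperature-scaled, numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank tokens from most to least likely, then apply top-k if enabled.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]

    # Nucleus filtering: keep the smallest prefix whose mass reaches top_p.
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Renormalize over the kept tokens; dropped tokens get probability 0.
    kept_mass = sum(probs[i] for i in kept)
    return [probs[i] / kept_mass if i in kept else 0.0 for i in range(len(probs))]


logits = [2.0, 1.0, 0.0]
reported = sampling_distribution(logits)           # temperature=1, top_p=1, top_k=0
restricted = sampling_distribution(logits, top_k=1)  # for contrast: only the top token
```

Here `reported` matches the unfiltered softmax of the logits, while `restricted` puts all probability on the most likely token.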

Showcase

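The showcase demonstrates Step3-VL-10B's multimodal reasoning on real cases: Case 1 focuses on parsing a Morse code table, and the other cases cover GUI perception as well as visual recognition and reasoning.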

Method

Architecture

  • Visual Encoder: PE-lang (Language-Optimized Perception Encoder), 1.8B parameters.
  • Decoder: Qwen3-8B.
  • Projector: Two consecutive stride-2 layers (4× downsampling along each spatial dimension, i.e. a 16× reduction in visual tokens).
  • Resolution: Multi-crop strategy consisting of a 728×728 global view and multiple 504×504 local crops.
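The architecture numbers above imply a visual token budget per view. A rough sketch, assuming a ViT patch size of 14 pixels (not stated in this section; both 728 and 504 are multiples of 14) and assuming each stride-2 layer exactly halves both spatial dimensions:

```python
def tokens_per_view(side_px: int, patch_px: int = 14, stride2_layers: int = 2) -> int:
    """Visual tokens for one square view after patchification and the projector.

    patch_px=14 is an assumption; the section does not state the patch size.
    """
    per_side = side_px // patch_px
    for _ in range(stride2_layers):
        per_side //= 2  # each stride-2 layer halves both spatial dimensions
    return per_side * per_side


def total_visual_tokens(num_local_crops: int) -> int:
    """One 728x728 global view plus a given number of 504x504 local crops."""
    return tokens_per_view(728) + num_local_crops * tokens_per_view(504)


print(tokens_per_view(728))    # global view
print(tokens_per_view(504))    # one local crop
print(total_visual_tokens(4))  # global view + 4 local crops
```

Under these assumptions the global view yields 169 tokens (13×13) and each local crop yields 81 tokens (9×9), so an image with four local crops would cost 493 visual tokens. How many local crops an image receives depends on the tiling rule, which is not described here.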

Training Pipeline

Pre-training: Single-stage, fully unfrozen strategy using AdamW optimizer (Total: 1.2T tokens, 370K iterations).

  • Phase 1: 900B tokens.
  • Phase 2: 300B tokens.
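As a quick consistency check on the pre-training figures, the two phases sum to the stated 1.2T tokens, and dividing by 370K iterations gives the average tokens consumed per iteration. The sketch assumes a constant batch size across both phases, which the section does not state:

```python
PHASE_TOKENS = {
    "Phase 1": 900_000_000_000,
    "Phase 2": 300_000_000_000,
}
TOTAL_ITERATIONS = 370_000

total_tokens = sum(PHASE_TOKENS.values())
# Average tokens per optimizer iteration (assumes a constant batch size).
tokens_per_iteration = total_tokens / TOTAL_ITERATIONS

print(f"total: {total_tokens / 1e12:.1f}T tokens")
print(f"per iteration: {tokens_per_iteration / 1e6:.2f}M tokens")
```

This works out to about 3.24M tokens per iteration on average.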

Supervised Finetuning (SFT): Two-stage approach (Total: ~226B tokens).

  • Stage 1: 9:1 text-to-multimodal ratio (~190B tokens).
  • Stage 2: 1:1 text-to-multimodal ratio (~36B tokens).
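The stage ratios translate into approximate token counts per modality. A small sketch, assuming the ratios are measured in tokens (the section does not say whether they count tokens or samples):

```python
def split_by_ratio(total_b: float, text_parts: int, multimodal_parts: int):
    """Split a token budget (in billions) by a text:multimodal ratio."""
    whole = text_parts + multimodal_parts
    return total_b * text_parts / whole, total_b * multimodal_parts / whole


stage1_text, stage1_mm = split_by_ratio(190, 9, 1)  # Stage 1, 9:1
stage2_text, stage2_mm = split_by_ratio(36, 1, 1)   # Stage 2, 1:1

total_text = stage1_text + stage2_text
total_mm = stage1_mm + stage2_mm
print(f"text: ~{total_text:.0f}B, multimodal: ~{total_mm:.0f}B")
```

Under that assumption, SFT uses roughly 189B text tokens and 37B multimodal tokens, which together match the stated total of about 226B.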

Reinforcement Learning: Total >1,400 iterations.

  • RLVR: 600 iterations (Tasks: mathematics, geometry, physics, perception, grounding).
  • RLHF: 300 iterations (Task: open-ended generation).
  • PaCoRe Training: 500 iterations.