Step3-VL-10B: |
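Step3-VL-10B is a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier multimodal intelligence. Despite having only 10B parameters, STEP3-VL-10B excels at visual perception, complex reasoning, and human alignment.
The model consistently ranks best among models under the 10B scale, and it matches or even surpasses open-source models 10×–20× larger (such as GLM-4.6V 106B-A12B and Qwen3-VL-Thinking 235B-A22B) as well as top closed-source flagship models (such as Gemini 2.5 Pro and Seed-1.5-VL).
The success of Step3-VL-10B stems from two core designs: unified pre-training on a high-quality multimodal corpus (1.2T tokens) and scaled multimodal reinforcement learning (more than 1,400 RL iterations). It also introduces Parallel Coordinated Reasoning (PaCoRe), which aggregates evidence from parallel visual exploration.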
Benchmark
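The evaluation covers core dimensions including STEM reasoning, recognition, OCR & documents, GUI grounding, spatial understanding, and code, and presents the scores of several peer models side by side. The comparison table emphasizes consistent measurement: the same dataset versions, a unified evaluation script, and fixed temperature and sampling parameters.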
| Benchmark | STEP3-VL-10B SeRe (10B) | STEP3-VL-10B PaCoRe (10B) | GLM-4.6V (106B-A12B) | Qwen3-VL (235B-A22B) | Gemini-2.5 Pro | Seed-1.5-VL |
|---|---|---|---|---|---|---|
| STEM / Multimodal Reasoning | ||||||
| MMMU | 78.11 | 80.11 | 75.20 | 78.70 | 83.89 | 79.11 |
| MMMU-Pro | 64.08 | 67.18 | 65.84 | 72.37 | 76.96 | 70.60 |
| MathVision | 70.81 | 75.95 | 63.50* | 72.10 | 73.30* | 68.70* |
| MathVista | 83.97 | 85.50 | 83.51 | 85.10 | 83.88 | 85.60 |
| LogicVista | 66.89 | 71.36 | 64.88 | 73.15 | 69.80 | 72.93 |
| DynaMath | 56.39 | 61.48 | 56.29 | 60.30 | 52.30 | 58.88 |
| ZeroBench (main) | 1.00 | 5.00 | 1.00 | 3.00 | 4.00 | 1.00 |
| ZeroBench (sub) | 27.54 | 29.94 | 29.04 | 28.40 | 33.53 | 31.74 |
| MathVerse (vision) | 75.73 | 78.30 | 72.84 | 76.65 | 78.30 | 77.79 |
| We-Math | 73.03 | 73.90 | 71.14 | 74.70 | 80.10 | 79.05 |
| VisuLogic | 29.68 | 32.70 | 28.30 | 31.80 | 31.40 | 34.30 |
| PhyX | 59.45 | 66.01 | 59.70 | 66.30 | 67.56 | 62.53 |
| Recognition / General VQA | ||||||
| MMBench (EN) | 92.05 | 92.38 | 92.75 | 92.70 | 93.19 | 92.11 |
| MMBench (CN) | 91.55 | 91.96 | 91.88 | 91.80 | 93.13 | 91.76 |
| SimpleVQA | 53.08 | 54.64 | 57.95 | 59.30 | 66.85 | 64.72 |
| MMStar | 77.48 | 77.64 | 75.30 | 76.80 | 79.18 | 77.91 |
| HallusionBench | 64.91 | 64.54 | 60.63 | 65.58 | 65.63 | 64.13 |
| MMVP | 68.16 | 68.00 | 71.33 | 71.30 | 70.67 | 74.00 |
| ReMI | 67.29 | 69.12 | 64.42 | 74.70 | 71.69 | 72.19 |
| M3GIA | 78.33 | 73.50 | 78.72 | 81.00 | 83.11 | 83.22 |
| DoYouSeeMe | 67.48 | 68.54 | 67.50 | 72.89 | 71.19 | 71.94 |
| Counting | ||||||
| CountBench | 88.75 | 88.80 | 92.06 | 92.46 | 87.78 | 91.85 |
| CountQA | 33.69 | 38.29 | 36.32 | 45.62 | 38.02 | 48.89 |
| PixMo-Count | 70.85 | 71.61 | 76.47 | 79.80 | 75.54 | 83.38 |
| OCR | ||||||
| OCRBench | 86.75 | 89.00 | 86.20 | 87.30 | 85.90 | 85.20 |
| OmniOCR | 76.98 | 78.14 | 84.53 | 87.20 | 66.05 | 87.80 |
| CC-OCR (Multi-Lang-OCR) | 76.59 | 77.51 | 74.08 | 80.80 | 81.10 | 78.82 |
| 2D / 3D Spatial Understanding | ||||||
| BLINK | 66.79 | 67.39 | 68.17 | 67.12 | 72.01 | 71.54 |
| CVBench | 83.49 | 85.92 | 83.72 | 87.86 | 84.36 | 86.27 |
| MMSI-Bench | 32.18 | 36.40 | 30.80 | 32.50 | 40.40 | 30.60 |
| ERQA | 48.87 | 51.75 | 47.75 | 53.50 | 62.25 | 48.50 |
| OmniSpatial | 51.58 | 52.58 | 50.49 | 53.10 | 55.64 | 51.99 |
| All-Angles-Bench | 57.21 | 64.71 | 62.94 | 60.59 | 65.88 | 57.65 |
| MindCube-tiny | 62.81 | 68.58 | 52.83 | 47.58 | 58.92 | 39.83 |
| RealWorldQA | 74.44 | 75.56 | 77.78 | 78.80 | 77.78 | 79.61 |
| SpatialViz-Bench | 45.51 | 52.03 | 37.46 | 46.36 | 45.34 | 35.25 |
| STARE | 61.75 | 64.57 | 60.38 | 70.89 | 62.36 | 62.99 |
| CoreCognition | 66.69 | 71.54 | 69.50 | 72.66 | 78.78 | 72.38 |
| V* | 82.85 | 84.29 | 85.86 | 89.53 | 80.63 | 90.58 |
| ViewSpatial | 46.14 | 48.41 | 43.87 | 48.58 | 44.15 | 44.14 |
| Exam (Text-Centric) | ||||||
| MMLU-Pro | 76.02 | 77.09 | 79.96 | 83.75 | 86.45 | 83.39 |
| GPQA-Diamond | 70.83 | 73.99 | 69.19 | 77.68 | 84.06 | 71.91 |
| SuperGPQA | 50.38 | 53.15 | 53.28 | 64.20 | 65.00 | 60.50 |
| LiveBench (2024-11-25) | 69.71 | 71.69 | 62.75 | 80.14 | 76.34 | 65.62 |
| Mathematics (Text-Centric) | ||||||
| AIME 2024 | 90.94 | 93.33 | 80.63 | 91.93 | 79.53 | 79.48 |
| AIME 2025 | 87.66 | 94.43 | 71.88 | 83.59 | 83.96 | 64.06 |
| HMMT 2025 | 78.18 | 92.14 | 57.29 | 67.71 | 65.68 | 51.30 |
| CNMO 2024 | 78.20 | 81.17 | 72.11 | 88.36 | 74.53 | 83.67 |
| Beyond AIME | 63.23 | 74.00 | 39.83 | 57.42 | 54.45 | 42.83 |
| IMO-AnswerBench | 62.12 | 76.66 | 51.25 | 69.25 | 72.00 | 44.75 |
| Code | ||||||
| LiveCodeBench (2408-2505) | 75.77 | 76.43 | 48.71 | 69.45 | 72.01 | 57.10 |
Note: SeRe (Sequential Reasoning) uses a maximum of 64K tokens; PaCoRe (Parallel Coordinated Reasoning) aggregates 16 parallel rollouts, with a maximum of 128K tokens. Evaluation settings: temperature=1, top_p=1, top_k=0.
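The decoding settings in the note can be written down as a small configuration sketch. The class and field names below are hypothetical, reading "64K" and "128K" as multiples of 1024 tokens is an assumption, and so is treating SeRe as a single rollout; only the numeric settings themselves come from the note.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalConfig:
    """Hypothetical container for the evaluation settings in the note."""
    mode: str
    max_tokens: int           # assumed: "K" means 1024 tokens
    num_rollouts: int         # rollouts aggregated per question
    temperature: float = 1.0
    top_p: float = 1.0
    top_k: int = 0            # assumed: 0 means no top-k cutoff


# SeRe as one sequential rollout is an assumption; the note gives only its token limit.
SERE = EvalConfig(mode="SeRe", max_tokens=64 * 1024, num_rollouts=1)
PACORE = EvalConfig(mode="PaCoRe", max_tokens=128 * 1024, num_rollouts=16)

for cfg in (SERE, PACORE):
    print(cfg)
```

Keeping both modes in one frozen structure makes it harder for the sampling parameters to drift between runs that are meant to be compared.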
Showcase
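The showcase demonstrates Step3-VL-10B's multimodal reasoning on real cases: Case 1 focuses on parsing a Morse code table, and the other cases cover GUI perception and visual recognition and reasoning.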
Thinking Process
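The user needs each letter converted to Morse code, so first look up each letter's code in the table. Start by splitting "STEPFUN" into its letters: S, T, E, P, F, U, N.
- S: first entry in the third column on the right; S is ··· (three dots).
- T: second entry in the third column; T is - (one dash).
- E: fifth entry in the first column on the left; E is · (one dot).
- P: in the middle column (J to R), the symbol for P is ·--· (dot, dash, dash, dot).
- F: sixth row of the leftmost column; the symbol for F is ··-· (two dots, one dash, one dot).
- U: third row of the rightmost column; the symbol for U is ··- (two dots, one dash).
- N: fifth row of the middle column (J to R); the symbol for N is -· (one dash, one dot).
Then concatenate the letter codes in order (separating letters with spaces for clarity). The Morse code for "STEPFUN" is: ··· - · ·--· ··-· ··- -·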
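The lookup-and-concatenate procedure in this trace can be reproduced with a short sketch. The function name `to_morse` is illustrative, the table holds only the seven letters needed here, and ASCII `.` and `-` stand in for the dot and dash glyphs used above.

```python
# International Morse code for the letters that appear in "STEPFUN".
MORSE = {
    "S": "...", "T": "-", "E": ".", "P": ".--.",
    "F": "..-.", "U": "..-", "N": "-.",
}


def to_morse(word: str) -> str:
    """Encode a word letter by letter, separating letters with spaces."""
    return " ".join(MORSE[ch] for ch in word.upper())


print(to_morse("STEPFUN"))  # ... - . .--. ..-. ..- -.
```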
Final Answer
Thinking Process
So, let's look at the tabs at the top of the editor.
- chapte... (leftmost, maybe chapter1.tex as the active file in the center)
- chapte... (next)
- thesis... (this is thesis.tex, not a chapter)
- chapte... (next)
- chapte... (next)
- chapte... (last one before the dark theme part)
To determine the minimum number of chapter .tex files opened, we analyze the tabs at the top of the editor.
- Identify tabs: The image shows multiple tabs labeled chapte... (truncated for "chapter...") and one tab labeled thesis....
- Exclude non-chapter tabs: The thesis... tab is not a chapter file, so we count only the chapte... tabs.
- Count chapter tabs: There are 5 visible chapte... tabs (excluding the thesis... tab).
Final Answer
Thinking Process
Got it, let's break this down. First, find the monitor's brand.
...
The logo on the monitor's base is HP. Then, we need to do a Caesar cipher with a left shift of 5.
...
- First, H is the 8th letter (A=1, B=2... H=8). Left shift 5: 8-5=3 → C.
- Then, P is the 16th letter. 16-5=11 → K.
So "HP" shifted left 5 is "CK".
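The shift arithmetic in this trace can be checked with a minimal Caesar cipher sketch. The function name is illustrative, and wrapping around the alphabet with modulo 26 is the usual convention, which the trace does not need because neither letter wraps.

```python
def caesar_shift_left(text: str, shift: int) -> str:
    """Shift each ASCII letter `shift` places toward 'A', wrapping around."""
    out = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base - shift) % 26 + base))
        else:
            out.append(ch)  # leave non-letters unchanged
    return "".join(out)


print(caesar_shift_left("HP", 5))  # CK
```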
Final Answer
Thinking Process
To determine the number of edges with the shortest length, we first identify the smallest edge weight in the diagram.
...
The smallest weight is 1. Next, we count all edges labeled with 1:
...
- A → T: labeled 1
- T → X: labeled 1
- P → O: labeled 1
- W → X: labeled 1
...
No other edges have a weight of 1. Thus, there are 4 edges with the shortest length.
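The two steps of this trace (find the minimum weight, then count the edges that carry it) can be sketched as follows. Only the four weight-1 edges come from the trace; the heavier edges are hypothetical placeholders for the part of the diagram that the trace elides.

```python
edges = [
    # Edges listed in the trace, all with weight 1.
    ("A", "T", 1), ("T", "X", 1), ("P", "O", 1), ("W", "X", 1),
    # Hypothetical heavier edges, for illustration only.
    ("A", "P", 3), ("O", "W", 2),
]

min_weight = min(weight for _, _, weight in edges)
shortest = [(u, v) for u, v, weight in edges if weight == min_weight]

print(min_weight, len(shortest))  # 1 4
```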
Final Answer
Method
Architecture
- Visual Encoder: PE-lang (Language-Optimized Perception Encoder), 1.8B parameters.
- Decoder: Qwen3-8B.
- Projector: Two consecutive stride-2 layers (resulting in 16× spatial downsampling).
- Resolution: Multi-crop strategy consisting of a 728×728 global view and multiple 504×504 local crops.
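The visual token budget implied by the projector and crop sizes can be estimated with a short calculation. This is a sketch under one loud assumption: the encoder patch size of 14 pixels is not stated in this section and is used here only as a hypothetical value. The function name is also illustrative. Two stride-2 layers halve each spatial side twice, which is where the 16× reduction in token count comes from.

```python
def visual_tokens(side_px: int, patch_px: int = 14, stride2_layers: int = 2) -> int:
    """Tokens for one square view after the projector.

    patch_px=14 is an assumed value, not stated in the source.
    """
    if side_px % patch_px:
        raise ValueError("image side must be a multiple of the patch size")
    side = side_px // patch_px          # patches per side from the encoder
    for _ in range(stride2_layers):     # each stride-2 layer halves a side
        side = -(-side // 2)            # ceiling division
    return side * side


print(visual_tokens(728))  # 169 tokens for the global view
print(visual_tokens(504))  # 81 tokens per local crop
```

Under the same assumption, an image split into `n` local crops would cost `169 + 81 * n` visual tokens before any separator tokens.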
Training Pipeline
Pre-training: Single-stage, fully unfrozen strategy using the AdamW optimizer (Total: 1.2T tokens, 370K iterations).
- Phase 1: 900B tokens.
- Phase 2: 300B tokens.
Supervised Finetuning (SFT): Two-stage approach (Total: ~226B tokens).
- Stage 1: 9:1 text-to-multimodal ratio (~190B tokens).
- Stage 2: 1:1 text-to-multimodal ratio (~36B tokens).
Reinforcement Learning: Total >1,400 iterations.
- RLVR: 600 iterations (Tasks: mathematics, geometry, physics, perception, grounding).
- RLHF: 300 iterations (Task: open-ended generation).
- PaCoRe Training: 500 iterations.
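The stage budgets above can be cross-checked with a few lines of arithmetic. The variable names are illustrative, the SFT figures are the approximate values quoted above, and the tokens-per-iteration figure assumes every pre-training iteration consumes the same number of tokens, which the source does not state.

```python
B = 10**9  # one billion tokens

pretrain_phases = {"phase_1": 900 * B, "phase_2": 300 * B}
sft_stages = {"stage_1": 190 * B, "stage_2": 36 * B}   # approximate values
rl_iterations = {"RLVR": 600, "RLHF": 300, "PaCoRe": 500}

pretrain_total = sum(pretrain_phases.values())
sft_total = sum(sft_stages.values())
rl_total = sum(rl_iterations.values())

# Assumes a constant number of tokens per iteration across pre-training.
tokens_per_iter = pretrain_total / 370_000

print(f"pre-training: {pretrain_total / 1e12:.1f}T tokens")
print(f"SFT: ~{sft_total // B}B tokens")
print(f"RL: {rl_total} iterations")
print(f"~{tokens_per_iter / 1e6:.2f}M tokens per pre-training iteration")
```

The listed RL stages add up to 1,400 iterations, which the section header states as ">1,400".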