While generative artificial intelligence has advanced significantly across text, image, audio, and video domains, 3D generation remains comparatively underdeveloped due to fundamental challenges such as data scarcity, algorithmic limitations, and ecosystem fragmentation. To this end, we present Step1X-3D, an open framework addressing these challenges through: (1) a rigorous data curation pipeline processing >5M assets to create a 2M-asset high-quality dataset with standardized geometric and textural properties; (2) a two-stage 3D-native architecture combining a hybrid VAE-DiT geometry generator with an SD-XL-based texture synthesis module; and (3) the full open-source release of models, training code, and adaptation modules. For geometry generation, the hybrid VAE-DiT component produces watertight TSDF representations by employing perceiver-based latent encoding with sharp edge sampling for detail preservation. The SD-XL-based texture synthesis module then ensures cross-view consistency through geometric conditioning and latent-space synchronization. Benchmark results demonstrate state-of-the-art performance that exceeds existing open-source methods, while also achieving competitive quality with proprietary solutions. Notably, the framework uniquely bridges 2D and 3D generation paradigms by supporting direct transfer of 2D control techniques (e.g., LoRA) to 3D synthesis. By simultaneously advancing data quality, algorithmic fidelity, and reproducibility, Step1X-3D aims to establish new standards for open research in controllable 3D asset generation.
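To make the geometry stage more concrete, below is a minimal PyTorch sketch of a perceiver-style latent encoder in the spirit of the description above: a fixed set of learned latent queries cross-attends to features of points sampled from the asset surface (with extra samples near sharp edges), compressing the shape into a compact latent set that a diffusion transformer could later denoise. This is an illustrative assumption, not the released Step1X-3D code; the class name `PerceiverShapeEncoder`, the input features (xyz + normal), and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn


class PerceiverShapeEncoder(nn.Module):
    """Hypothetical sketch of perceiver-based latent encoding for surface point samples."""

    def __init__(self, num_latents=256, latent_dim=512, point_dim=3 + 3):
        super().__init__()
        # Learned latent queries (the perceiver bottleneck).
        self.latents = nn.Parameter(torch.randn(num_latents, latent_dim) * 0.02)
        # Project per-point inputs (xyz + normal here) into the latent width.
        self.point_proj = nn.Linear(point_dim, latent_dim)
        # Cross-attention: latents attend to the (possibly very large) point set.
        self.cross_attn = nn.MultiheadAttention(latent_dim, num_heads=8, batch_first=True)
        self.ff = nn.Sequential(
            nn.LayerNorm(latent_dim),
            nn.Linear(latent_dim, 4 * latent_dim),
            nn.GELU(),
            nn.Linear(4 * latent_dim, latent_dim),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (B, N, point_dim) surface samples; in the paper's setting these
        # would include additional samples drawn near sharp edges so fine detail
        # survives the compression into a small latent set.
        b = points.shape[0]
        kv = self.point_proj(points)                      # (B, N, D) point features
        q = self.latents.unsqueeze(0).expand(b, -1, -1)   # (B, M, D) latent queries
        z, _ = self.cross_attn(q, kv, kv)                 # latents gather shape info
        return z + self.ff(z)                             # (B, M, D) shape latent set


if __name__ == "__main__":
    # Toy batch: 2 shapes, 8192 surface samples each (xyz + normal).
    pts = torch.randn(2, 8192, 6)
    latents = PerceiverShapeEncoder()(pts)
    print(latents.shape)  # torch.Size([2, 256, 512])
```

In the full pipeline described above, latents of this kind would be decoded back to a watertight TSDF and paired with the SD-XL-based texture stage; those components are omitted here for brevity.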
@article{li2025step1x,
  title={Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D Assets},
  author={Li, Weiyu and Zhang, Xuanyang and Sun, Zheng and Qi, Di and Li, Hao and Cheng, Wei and Cai, Weiwei and Wu, Shihao and Liu, Jiarui and Wang, Zihao and others},
  journal={arXiv preprint arXiv:2505.07747},
  year={2025}
}