Turbo2K: Towards Ultra-Efficient and High-Quality 2K Video Synthesis
Overview
Demand for 2K video synthesis is growing alongside rising expectations for ultra-clear visuals. While diffusion transformers (DiTs) excel at high-quality video generation, scaling them to 2K resolution is computationally expensive due to quadratic memory and processing costs.
We present Turbo2K, an efficient framework for detail-rich 2K video generation with significantly improved training and inference efficiency. Turbo2K operates in a highly compressed latent space, drastically reducing computational demands. To compensate for compression-induced quality loss and limited model capacity, we employ a knowledge distillation strategy, enabling a lightweight student model to inherit the generative power of a larger teacher model. Despite architectural and latent-space differences, we observe strong alignment in DiTs' internal representations, supporting effective knowledge transfer.
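The feature-level knowledge transfer described above can be sketched as a simple alignment loss: student hidden states are mapped through a learnable projection into the teacher's feature space and matched with MSE. This is a minimal illustration, not the paper's implementation; the function name, shapes, and the use of a single projection matrix are assumptions, and a fixed random matrix stands in for a trained projection.

```python
import numpy as np

def feature_alignment_loss(student_feats, teacher_feats, proj):
    """MSE between projected student features and teacher features.

    student_feats: (tokens, d_s) hidden states from a student DiT block (hypothetical shapes)
    teacher_feats: (tokens, d_t) hidden states from the matched teacher block
    proj:          (d_s, d_t) learnable projection bridging the two model widths
    """
    aligned = student_feats @ proj  # map student width d_s -> teacher width d_t
    return float(np.mean((aligned - teacher_feats) ** 2))

rng = np.random.default_rng(0)
s = rng.standard_normal((16, 64))     # stand-in student features
t = rng.standard_normal((16, 128))    # stand-in teacher features
W = rng.standard_normal((64, 128)) * 0.1  # stand-in for a trained projection
loss = feature_alignment_loss(s, t, W)
```

In training, this loss would be added to the usual diffusion objective so the lightweight student is pulled toward the teacher's internal representations despite the architectural and latent-space differences.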
Additionally, we introduce a two-stage synthesis pipeline: low-resolution multi-level features are generated first, then used to guide high-resolution video synthesis. This ensures structural coherence and fine-grained detail while eliminating redundant encoding-decoding steps.
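A rough sketch of that two-stage flow: stage 1 produces low-resolution multi-level features, and stage 2 consumes them directly as guidance on the high-resolution grid, with no intermediate decode to pixels and re-encode. Everything here is a hypothetical stand-in (function names, level counts, and the averaging "fusion" step are illustrative assumptions, not the actual DiT computation).

```python
import numpy as np

def stage1_low_res(rng, levels=3, base=8, dim=16):
    """Return multi-level low-res features (stand-in for the small DiT's outputs)."""
    return [rng.standard_normal((base * 2**k, base * 2**k, dim)) for k in range(levels)]

def upsample(feat, scale=2):
    """Nearest-neighbor upsampling so low-res guidance matches the high-res grid."""
    return feat.repeat(scale, axis=0).repeat(scale, axis=1)

def stage2_high_res(features, rng, scale=2):
    """Fuse upsampled guidance with high-res noise (stand-in for the 2K-stage DiT)."""
    guide = upsample(features[-1], scale)  # finest low-res level, upsampled
    noise = rng.standard_normal(guide.shape)
    return 0.5 * guide + 0.5 * noise       # placeholder fusion step

rng = np.random.default_rng(0)
feats = stage1_low_res(rng)            # stage 1: low-res multi-level features
video_latent = stage2_high_res(feats, rng)  # stage 2: guided high-res synthesis
```

The key point the sketch captures is that stage 2 takes features, not decoded pixels, as input, which is what removes the redundant encode-decode round trip.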
Turbo2K achieves compelling efficiency, generating 5-second, 24fps 2K videos up to 20× faster than existing methods, making scalable, high-resolution video generation practical for real-world applications.
Text‑to‑Video
Contact Us
Feel free to contact Wenbo Li at fenglinglwb@gmail.com for cooperation, communication, or internship applications.