Towards Flexible, Scalable, and Adaptive Multi-Modal Conditioned Face Synthesis

1 ROAS Thrust, The Hong Kong University of Science and Technology (Guangzhou); 2 Centre of Smart Health, The Hong Kong Polytechnic University; 3 School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University

Our method's versatile synthesis capabilities: high-fidelity facial images generated from flexible combinations of modalities, including mask, text, sketch, lighting, expression, pose, and low-resolution images. Remarkably, these diverse face synthesis tasks are all handled within a single sampling process of a unified diffusion U-Net, which reflects the method's efficiency and its seamless integration of multi-modal information.

Various Facial Synthesis and Editing Applications



Recent progress in multi-modal conditioned face synthesis has enabled the creation of visually striking and accurately aligned facial images. Yet current methods still struggle with scalability and flexibility, and they apply a one-size-fits-all control strength that ignores how conditional entropy (a measure of how unpredictable the data remains once a condition is given) differs across modalities.
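As a rough illustration of why a single control strength may not suit every modality, the sketch below computes the conditional entropy H(X | C) for two made-up joint distributions. The tables `mask_like` and `text_like` are hypothetical toy numbers chosen for illustration only, not measurements from any face dataset: the first stands in for a condition that nearly determines the output, the second for a condition that leaves more of it open.

```python
import math


def conditional_entropy(joint):
    """Return H(X | C) in bits for a joint table joint[c][x] = p(c, x)."""
    h = 0.0
    for row in joint.values():
        p_c = sum(row.values())  # marginal probability of this condition
        for p_cx in row.values():
            if p_cx > 0.0:
                # -p(c, x) * log2 p(x | c)
                h -= p_cx * math.log2(p_cx / p_c)
    return h


# Hypothetical toy joints over two conditions and two outcomes.
# "mask-like": the condition pins down the outcome 90% of the time.
mask_like = {
    "c0": {"x0": 0.45, "x1": 0.05},
    "c1": {"x0": 0.05, "x1": 0.45},
}
# "text-like": the condition only weakly constrains the outcome.
text_like = {
    "c0": {"x0": 0.30, "x1": 0.20},
    "c1": {"x0": 0.20, "x1": 0.30},
}

h_mask = conditional_entropy(mask_like)
h_text = conditional_entropy(text_like)
print(f"H(X | mask-like) = {h_mask:.3f} bits")
print(f"H(X | text-like) = {h_text:.3f} bits")
```

In this toy setting the text-like condition leaves roughly twice as much uncertainty as the mask-like one, which is the kind of gap a fixed control strength cannot account for.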

To address these challenges, we introduce a novel uni-modal training approach with modal surrogates, coupled with an entropy-aware modal-adaptive modulation, to support flexible, scalable, and adaptive multi-modal conditioned face synthesis. Our uni-modal training leverages only uni-modal data: modal surrogates decorate each condition with modal-specific characteristics and serve as linkers for inter-modal collaboration, so the network fully learns each modality's control over the face synthesis process as well as the collaboration between modalities. The entropy-aware modal-adaptive modulation finely adjusts the diffusion noise according to modal-specific characteristics and the given conditions, enabling well-informed steps along the denoising trajectory and ultimately leading to synthesis results of high fidelity and quality. Our framework improves multi-modal face synthesis under various conditions, surpassing current methods in image quality and fidelity, as demonstrated by our thorough experimental results.



Uni-modal training with modal surrogates and the entropy-aware modal-adaptive modulation mechanism. During training, we randomly sample uni-modal data, whose condition is fused with its own modal surrogate to learn modal-specific characteristics and with the other modal surrogates to learn inter-modal collaboration. The fused features are sent to the diffusion U-Net to guide the denoising of the corrupted input image. The output noise is further modulated according to the condition features and the U-Net features, adaptively adjusting the noise level given the conditions. The training process is provided in Algorithm 1.
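The data flow described in this caption can be sketched in a few lines of plain Python. This is a toy sketch under stated assumptions, not the training code: Algorithm 1 remains the reference. Everything named here is hypothetical, including the additive fusion in `fuse`, the stand-in network `toy_unet`, the sigmoid gate in `modulate`, the three-modality list, and the feature size `DIM`; the page does not specify these operations.

```python
import math
import random

MODALITIES = ["mask", "text", "sketch"]  # hypothetical subset of modalities
DIM = 4                                  # hypothetical feature size


def make_surrogates(rng):
    # One vector per modality; learnable in practice, fixed random values here.
    return {m: [rng.gauss(0.0, 1.0) for _ in range(DIM)] for m in MODALITIES}


def fuse(active, cond_feat, surrogates):
    """One token per modality: the active slot holds condition + own surrogate
    (assumed additive fusion); every other slot holds that modality's surrogate."""
    tokens = []
    for m in MODALITIES:
        s = surrogates[m]
        if m == active:
            tokens.append([c + v for c, v in zip(cond_feat, s)])
        else:
            tokens.append(list(s))
    return tokens


def toy_unet(noisy, tokens):
    # Stand-in for the diffusion U-Net: returns (noise prediction, feature).
    ctx = [sum(t[i] for t in tokens) / len(tokens) for i in range(DIM)]
    feat = [x + c for x, c in zip(noisy, ctx)]
    return [0.5 * f for f in feat], feat


def modulate(eps_pred, cond_feat, feat):
    # Assumed gate: a scalar in (0, 2) computed from condition and U-Net features.
    score = sum(c * f for c, f in zip(cond_feat, feat)) / DIM
    scale = 2.0 / (1.0 + math.exp(-score))
    return [scale * e for e in eps_pred], scale


def training_step(rng, surrogates):
    active = rng.choice(MODALITIES)                     # sample uni-modal data
    cond_feat = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
    x0 = [rng.gauss(0.0, 1.0) for _ in range(DIM)]      # clean "image"
    eps = [rng.gauss(0.0, 1.0) for _ in range(DIM)]     # true noise
    a = rng.uniform(0.1, 0.9)                           # noise-schedule value
    noisy = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    tokens = fuse(active, cond_feat, surrogates)
    eps_pred, feat = toy_unet(noisy, tokens)
    eps_mod, scale = modulate(eps_pred, cond_feat, feat)
    loss = sum((p - e) ** 2 for p, e in zip(eps_mod, eps)) / DIM
    return active, loss, scale


rng = random.Random(0)
surrogates = make_surrogates(rng)
active, loss, scale = training_step(rng, surrogates)
print(active, round(loss, 4), round(scale, 4))
```

Because every modality slot is always filled, either by a real condition or by a surrogate, the same fused input layout would work for any subset of conditions at sampling time, which is one plausible reading of how uni-modal training supports multi-modal inference.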

Video Demo