Enhancing Diffusion Face Generation with Contrastive Embeddings and SegFormer Guidance

A benchmark study comparing UNet and DiT architectures for unconditional generation, with a novel InfoNCE contrastive loss and SegFormer-based segmentation guidance for attribute-conditioned face synthesis.

Overview

A comprehensive benchmark of diffusion models for human face generation on the CelebAMask-HQ dataset, evaluating both unconditional and conditional pipelines. Building on multi-conditioning approaches, our main contributions are the integration of an InfoNCE loss for attribute embedding and the adoption of a SegFormer-based segmentation encoder, enhancing semantic alignment and controllability of attribute-guided synthesis.

Research Motivation

Diffusion models have become state-of-the-art for high-fidelity image synthesis, but generating controllable human faces with fine-grained attribute control remains challenging, especially with limited training data. We investigated how contrastive learning and advanced segmentation encoding can improve attribute-guided face generation.

Methodology

Unconditional Generation

  • UNet Architectures: Implemented multiple UNet variants with varying depths (4-6 blocks) and attention placements
  • DiT (Diffusion Transformer): Explored transformer-based backbone as alternative to UNet
  • EMA Stabilization: Applied an Exponential Moving Average over model weights, substantially reducing FID (a minimal update sketch follows this list)
  • Training: DDPM framework at 128×128 resolution with learning rates 1e-5 to 5e-4
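
As a reference, here is a minimal sketch of the EMA weight update; the decay value 0.999 is an illustrative assumption, not a reported hyperparameter:

```python
import copy
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.999):
    """Blend the latest weights into a frozen shadow copy after each optimizer step."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

# Usage: ema_model = copy.deepcopy(model).eval(); after every optimizer.step(),
# call ema_update(ema_model, model) and sample from ema_model at evaluation time.
```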

Conditional Generation with LoRA

  • Fine-tuned Stable Diffusion v2 using Low-Rank Adaptation (LoRA)
  • Added rank-decomposed matrices to UNet attention layers for efficient adaptation (see the sketch after this list)
  • VQ-VAE encoder compresses 512×512 images into 64×64×4 latents
  • Compared conditional vs unconditional fine-tuning approaches
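
Conceptually, LoRA keeps the pretrained projection W frozen and learns a low-rank residual, y = Wx + (α/r)·B(Ax). A minimal sketch of such a layer; the rank and scaling values are illustrative assumptions, not the project's exact configuration:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)      # freeze pretrained weights
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.normal_(self.lora_a.weight, std=0.01)
        nn.init.zeros_(self.lora_b.weight)          # zero init: no change at step 0
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```

Wrapping only the attention projections (in a diffusers-style UNet, the to_q, to_k, to_v, and to_out layers) keeps the trainable parameter count a small fraction of the full model.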

Attribute-Conditioned Generation

  • Latent Space Encoding: Pre-trained VQ-VAE compresses images to 64×64 latent space
  • Conditional UNet: UNet2DConditionModel with 256D embeddings via cross-attention
  • Attribute Embedder: MLP maps 40D multi-hot attribute vectors to 256D embeddings (sketched after this list)
  • Novel InfoNCE Loss: Contrastive objective ensures embeddings capture semantic relationships
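
A minimal sketch of the attribute embedder, assuming a two-layer MLP; the hidden width and activation are assumptions, as the report specifies only the 40D input and 256D output:

```python
import torch
import torch.nn as nn

class AttributeEmbedder(nn.Module):
    """Map a 40D multi-hot CelebA attribute vector to a 256D conditioning embedding."""
    def __init__(self, num_attrs: int = 40, embed_dim: int = 256, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_attrs, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, attrs: torch.Tensor) -> torch.Tensor:
        # attrs: (B, 40) multi-hot -> (B, 1, 256) sequence for UNet cross-attention
        return self.net(attrs.float()).unsqueeze(1)
```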

Segmentation Guidance with SegFormer

  • Replaced the baseline ResNet-18 mask encoder with a pretrained SegFormer
  • Processes combined segmentation masks into 128D spatial embeddings
  • Joint attribute + segmentation embeddings are concatenated and projected (512→256D); see the fusion sketch after this list
  • Provides precise spatial and attribute control for face synthesis
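
A minimal sketch of the fusion step. Since the write-up states a 512→256D projection over the concatenated embeddings, this sketch assumes the 128D spatial SegFormer features are average-pooled and lifted to 256D before concatenation with the 256D attribute embedding; the exact pooling scheme is an assumption:

```python
import torch
import torch.nn as nn

class JointConditioner(nn.Module):
    """Fuse attribute and segmentation embeddings into one 256D conditioning vector."""
    def __init__(self, attr_dim: int = 256, seg_dim: int = 128, out_dim: int = 256):
        super().__init__()
        self.seg_pool = nn.AdaptiveAvgPool2d(1)        # (B, 128, H, W) -> (B, 128, 1, 1)
        self.seg_proj = nn.Linear(seg_dim, attr_dim)   # 128 -> 256
        self.fuse = nn.Linear(attr_dim * 2, out_dim)   # 512 -> 256

    def forward(self, attr_emb: torch.Tensor, seg_feat: torch.Tensor) -> torch.Tensor:
        seg_emb = self.seg_proj(self.seg_pool(seg_feat).flatten(1))   # (B, 256)
        return self.fuse(torch.cat([attr_emb, seg_emb], dim=-1))      # (B, 256)
```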

Key Innovation: InfoNCE for Attribute Embedding

The InfoNCE loss, a form of noise-contrastive estimation, trains the attribute embedder to capture semantic relationships between attribute vectors. The loss is formulated as:

$$\mathcal{L}_{\text{InfoNCE}} = -\frac{1}{B}\sum_{i=1}^{B}\frac{1}{|\mathrm{pos}(i)|}\sum_{j\in\mathrm{pos}(i)}\log\frac{\exp\!\big(\mathrm{sim}(z_i,z_j)/\tau\big)}{\sum_{k\neq i}\exp\!\big(\mathrm{sim}(z_i,z_k)/\tau\big)}$$

Where:

  • B: Batch size
  • z_i: Embedding of sample i from the attribute embedder
  • sim(·,·): Cosine similarity between embeddings
  • τ: Temperature parameter (default 0.07)
  • pos(i): Indices of positive partners for sample i (attribute similarity > 0.8)

Training Process:

  1. Attribute Similarity: Cosine similarity between multi-hot vectors identifies positive pairs (threshold 0.8)
  2. Embedding Similarity: Cosine similarities between embeddings are scaled by the temperature τ
  3. Contrastive Learning: Encourages similar attributes to cluster and dissimilar ones to separate
  4. Result: More discriminative embeddings that enhance the UNet's conditioning precision (see the PyTorch sketch below)
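
A minimal PyTorch sketch of this objective under the stated settings (τ = 0.07, positive threshold 0.8); anchors with no positive partner in the batch are simply skipped:

```python
import torch
import torch.nn.functional as F

def attribute_infonce(embeddings, attrs, tau=0.07, pos_threshold=0.8):
    """Contrastive loss pulling together embeddings of attribute-similar samples.

    embeddings: (B, D) outputs of the attribute embedder
    attrs:      (B, 40) multi-hot attribute vectors
    """
    B = embeddings.size(0)
    eye = torch.eye(B, dtype=torch.bool, device=embeddings.device)

    # Positive pairs: cosine similarity of attribute vectors above the threshold.
    a = F.normalize(attrs.float(), dim=-1)
    pos_mask = (a @ a.t() > pos_threshold) & ~eye

    # Temperature-scaled cosine similarities between embeddings; self excluded.
    z = F.normalize(embeddings, dim=-1)
    logits = (z @ z.t() / tau).masked_fill(eye, float('-inf'))
    log_prob = F.log_softmax(logits, dim=-1)

    # Average log-probability of the positives for each anchor that has any.
    has_pos = pos_mask.any(dim=-1)
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0).sum(-1)
    loss = -pos_log_prob[has_pos] / pos_mask.sum(-1)[has_pos]
    return loss.mean() if has_pos.any() else embeddings.new_zeros(())
```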

Experimental Results

Unconditional Diffusion

| Model | Configuration | FID ↓ |
|-------|---------------|-------|
| UNet-R3 | LR 2e-4, Warmup 3000, EMA | 72.62 |
| UNet-R5 | LR 2e-4, Warmup 3000, EMA | 76.08 |
| UNet-R6 | LR 2e-4, Warmup 1500, EMA | 73.35 |
| UNet-R5 | No EMA | 92.90 |
| DiT-Large | 2.7k samples | 89.90 |
| DiT-Small | 2.7k samples | 94.00 |

Conditional Generation with Attributes

| Configuration | FID ↓ |
|---------------|-------|
| LC_UNet_3 (Baseline) | 74.07 |
| LC_UNet_3 + InfoNCE | 70.98 |
| LC_UNet_3 + InfoNCE + Segmentation | 63.85 |

LoRA Fine-tuned Stable Diffusion v2

| Condition | Inference Steps | FID ↓ |
|-----------|-----------------|-------|
| Conditional (No Tuning) | 50 | 114.73 |
| Conditional + LoRA | 150 | 65.31 |
| Unconditional + LoRA | 50 | 91.32 |

Key Findings

  • Architecture: Deeper UNet variants with mid-level attention achieve superior FID (72.62–76.08)
  • EMA Impact: Weight averaging reduces UNet-R5's FID from 92.90 to 76.08 (~17 points)
  • InfoNCE Benefit: Improves attribute-guided FID from 74.07 to 70.98
  • Segmentation Guidance: Adding SegFormer masks further reduces FID to 63.85
  • LoRA Efficiency: Conditional LoRA fine-tuning achieves the best LoRA FID of 65.31, clearly outperforming the unconditional variant (91.32)

Key Achievements

  • Best unconditional FID of 72.62 with UNet + EMA stabilization
  • Best conditional FID of 63.85 with attribute + segmentation conditioning
  • Improved attribute-guided FID from 74.07 to 70.98 through InfoNCE loss integration
  • LoRA fine-tuned Stable Diffusion v2 achieved FID of 65.31

Technologies Used

PyTorch · Stable Diffusion · LoRA · SegFormer · VQ-VAE · HuggingFace · UNet · DiT · InfoNCE

Skills Applied

Stable Diffusion · Contrastive Learning · Transformers · Generative AI · Latent Diffusion Models

Project Details

  • Role: Team Lead, Researcher
  • Team Size: 4
  • Duration: Mar 2025 – Jun 2025
