Direct pixel-space megapixel image generation with diffusion models

Stability AI¹, LMU Munich², Birchlabs³, Independent Researchers⁴

Preprint, 2024

*Indicates equal contribution

Samples generated directly in RGB pixel space using our HDiT models trained on FFHQ-1024² and ImageNet-256².

Abstract

We present the Hourglass Diffusion Transformer (HDiT), an image generative model whose computational cost scales linearly with pixel count, supporting training at high resolutions (e.g. 1024²) directly in pixel space.
Building on the Transformer architecture, which is known to scale to billions of parameters, it bridges the gap between the efficiency of convolutional U-Nets and the scalability of Transformers.

HDiT trains successfully without typical high-resolution training techniques such as multiscale architectures, latent autoencoders or self-conditioning.
We demonstrate that HDiT performs competitively with existing models on ImageNet-256², and sets a new state-of-the-art for diffusion models on FFHQ-1024².

Efficiency

Scaling of computational cost w.r.t. target resolution of our HDiT-B/4 model vs. DiT-B/4 (Peebles & Xie, 2023), both in pixel space. At megapixel resolutions, our model incurs less than 1% of the computational cost of the standard diffusion transformer DiT at a comparable size.
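
To make the scaling argument concrete, the following minimal sketch counts query-key pairs for global self-attention versus attention restricted to a fixed local window, at patch size 4. It is an illustration only, not the cost model behind the figure: the function names are hypothetical, the 7×7 window size is an assumed value, and the count ignores every other part of the network, so it is not expected to reproduce the reported numbers.

```python
def tokens(resolution: int, patch_size: int = 4) -> int:
    """Number of tokens for a square image cut into non-overlapping patches."""
    side = resolution // patch_size
    return side * side


def global_attention_pairs(n_tokens: int) -> int:
    """Every token attends to every token: quadratic in the token count."""
    return n_tokens * n_tokens


def neighborhood_attention_pairs(n_tokens: int, kernel: int = 7) -> int:
    """Each token attends to a fixed kernel x kernel window: linear in the
    token count. The window size of 7 is an assumption for illustration."""
    return n_tokens * kernel * kernel


if __name__ == "__main__":
    for res in (256, 512, 1024):
        n = tokens(res)
        print(
            f"{res}x{res}: tokens={n}, "
            f"global pairs={global_attention_pairs(n)}, "
            f"neighborhood pairs={neighborhood_attention_pairs(n)}"
        )
```

Doubling the resolution quadruples the token count, so the global pair count grows 16-fold while the windowed count grows only 4-fold, which is the difference between quadratic and linear scaling in pixel count.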

High-level Architecture Overview

High-level overview of our HDiT architecture, specifically the three-level version for ImageNet at an input resolution of 256² with patch size p = 4. For each doubling in target resolution, another neighborhood attention block is added. “lerp” denotes a linear interpolation with a learnable interpolation weight. All HDiT blocks take the noise level and the conditioning (embedded jointly using a mapping network) as additional inputs.
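
The two rules in this caption can be sketched in a few lines of Python. This is a hedged illustration, not the released implementation: the function names are hypothetical, the direction of the interpolation (from the skip branch toward the upsampled branch) is an assumption, and reading "another neighborhood attention block" as one extra hierarchy level per doubling is an interpretation of the caption. In a trained model the weight `w` would be a learned parameter; here it is a plain float.

```python
import numpy as np


def lerp(skip: np.ndarray, upsampled: np.ndarray, w: float) -> np.ndarray:
    """Linear interpolation between two feature maps.

    w = 0 returns `skip`, w = 1 returns `upsampled`. Which branch plays
    which role is an assumption made for this sketch.
    """
    return skip + w * (upsampled - skip)


def num_levels(resolution: int, base_resolution: int = 256, base_levels: int = 3) -> int:
    """Levels for a target resolution, assuming one extra level per doubling
    relative to the three-level 256x256 configuration in the caption."""
    doublings = (resolution // base_resolution).bit_length() - 1
    return base_levels + doublings


if __name__ == "__main__":
    skip = np.zeros((2, 2))
    upsampled = np.ones((2, 2))
    print(lerp(skip, upsampled, 0.25))
    for res in (256, 512, 1024):
        print(res, num_levels(res))
```

Compared with concatenating or summing the skip and upsampled features, an interpolation keeps the merged activations on the same scale as its inputs for any weight between 0 and 1.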

Files

We provide the 50k generated samples used for FID computation for our 557M ImageNet model without CFG (part 1, 2, 3, 4, 5, 6, 7, 8), with CFG = 1.3 (part 1, 2, 3, 4, 5, 6, 7, 8), and for our FFHQ-1024² model without CFG (part 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21).

BibTeX

@misc{crowson2024hourglass,
    title = {{S}calable {H}igh-{R}esolution {P}ixel-{S}pace {I}mage {S}ynthesis with {H}ourglass {D}iffusion {T}ransformers},
    author = {Katherine Crowson and Stefan Andreas Baumann and Alex Birch and Tanishq Mathew Abraham and Daniel Z Kaplan and Enrico Shippole},
    year = {2024}
}
