February 24, 2023
Using Stable Diffusion techniques to create 2D game environments
Tools and Techniques Used
- Stable Diffusion v1.5
- Automatic1111 WebUI
- Alpaca Photoshop Plugin
- Boosting Monocular Depth
- Substance Designer
- Unity URP
- Amplify Shader Editor
Whilst exploring ways of generating isometric backgrounds, I came across a post by Ivan Garcia Filho showing game assets created with a Stable Diffusion-based platform using a detailed prompt and a high step count.
An intricate modular top-down isometric concept art with PBR materials of a victorian gothic ornated steampunk lamp, in ominous hellish industrial mood and a neat and clean composition with sharp precisely stabilized straight lines, colorful tone mapped cinematic volumetric lighting and global illumination producing shinning edge reflections and detailed ambient occlusion with smooth cold shadows and hot highlights increasing depth and perspective
I started testing some prompts using the same structure but changing the content and the style modifiers to see what sort of futuristic / cyberpunk elements it could generate and get a feel for how the prompt was working.
Early results weren’t great, but that was down to using a low step count; I didn’t think a high count would be necessary, but it makes a huge difference here. Bumping the steps into the 100s with a higher CFG scale of 15-30 snapped it into much more interesting results.
I really liked the style of the open building section, so I carried on iterating through ranges of steps and CFG values with the X/Y Plot script in Automatic1111 WebUI, using the same prompt.
An intricate modular top-down isometric concept art with PBR materials of a cyberpunk building, in ominous hellish industrial mood and a neat and clean composition with sharp precisely stabilized straight lines, colorful tone mapped cinematic volumetric lighting and global illumination producing shinning edge reflections and detailed ambient occlusion with smooth cold shadows and hot highlights increasing depth and perspective
The almost infinite nature of Stable Diffusion generations can make it difficult to settle on a particular output. Early on I experienced a lot of FOMO, feeling like I had missed the perfect seed or setting, but doing X/Y plots and being brutal about curation has helped me reach a desired result more quickly over time. I chose CFG 16, iterated through the step counts, and hit a great result at 100.
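The X/Y Plot script is essentially a brute-force sweep over two settings axes. As a rough sketch of what it enumerates (the `render` call named in the comment is hypothetical, standing in for whatever txt2img backend is used):

```python
from itertools import product

def xy_grid(steps_range, cfg_range):
    """Return every (steps, cfg) pair the plot would render,
    with the prompt and seed held fixed so only the axes vary."""
    return list(product(steps_range, cfg_range))

# Sweep steps 50-150 at CFG 16, as in the text.
settings = xy_grid(range(50, 151, 25), [16])
# Each pair would then be passed to the generator, e.g.
#   render(prompt, seed=1234, num_steps=steps, cfg_scale=cfg)
```

Fixing the seed across the grid is what makes the comparison fair: any change between cells comes from the steps/CFG axes alone.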
I mainly work in Photoshop for editing and cleaning up generated images and have been beta testing the Alpaca Stable Diffusion plugin, which allows me to continue working in a familiar environment whilst accessing features like inpainting and img2img.
I placed the building image onto a larger canvas and used outpainting to extend the rest of the building and some more of the walkway using the same prompt.
I cleaned up the background and ran the final image through img2img at double resolution to get more details.
One of my ideas for using the backgrounds in the Unity game engine was to remove the lighting from the image and then add it back in using custom shaders. I achieved this by painting out the strong colors in Photoshop on a new layer set to the Color blending mode, using neutral grey tones sampled from the original image.
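Photoshop's Color blend mode keeps the base layer's luminosity while taking hue and saturation from the blend layer, so painting with neutral grey collapses the painted area to greyscale. A rough numpy equivalent (using Rec. 601 luma as a stand-in for Photoshop's own luminosity model, which differs slightly):

```python
import numpy as np

def delight_region(rgb, mask):
    """Replace masked pixels with their luminosity, approximating
    a neutral-grey layer in Color blend mode.

    rgb  -- float array (H, W, 3) in [0, 1]
    mask -- bool array (H, W), True where color should be removed
    """
    # Rec. 601 luma coefficients (an approximation of Photoshop's
    # luminosity calculation, not an exact match).
    luma = rgb @ np.array([0.299, 0.587, 0.114])
    out = rgb.copy()
    out[mask] = luma[mask, None]
    return out
```

In practice the manual paint-out in Photoshop gives more control, since only the strongest colored light needs removing rather than every masked pixel.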
In order to light the scene myself I needed a normal map for the 2D image. Rather than hand-painting one (which is an option), I tried automating the process by using MiDaS and LeRes in Boosting Monocular Depth to generate depth maps of the image.
I brought the MiDaS depth map and the (inverted) LeRes depth map into Substance Designer and used the Height to Normal World Units node to generate a normal map from each, then combined the two with the Normal Blend node, since individually they roughly represent the large- and small-scale detail.
The resulting normal map is far from perfect but sufficient for my testing. I masked out the background in Photoshop and filled it with the flat base normal value RGB(128, 128, 255), or #8080FF.
Here is how the scene looks in Unity, using the de-lit image as the base color and the normal map on the background material, applied to a 3D plane with two colored point lights in 3D space. The lighting wraps around surfaces in an almost convincing way in places, creating a crude illusion of scene lighting.
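The diffuse term the engine evaluates per pixel is just a dot product between the decoded normal and the light direction. A minimal numpy sketch of the same math (a single directional light, Lambert shading only, no attenuation or specular) shows how the normal map relights the flat image:

```python
import numpy as np

def relight(albedo, normals, light_dir, light_color=(1.0, 1.0, 1.0)):
    """Lambert-shade a de-lit image using its normal map.

    albedo    -- (H, W, 3) base color in [0, 1]
    normals   -- (H, W, 3) unit normals in [-1, 1]
    light_dir -- direction *towards* the light, any length
    """
    l = np.asarray(light_dir, dtype=float)
    l /= np.linalg.norm(l)
    ndotl = np.clip(normals @ l, 0.0, 1.0)  # per-pixel N.L
    return albedo * ndotl[..., None] * np.asarray(light_color)
```

Point lights work the same way, except the light direction (and a distance falloff) is computed per pixel instead of being constant across the image.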
I tried a few different techniques using the depth and normal maps in a custom Unity URP shader I created in Amplify Shader Editor. I used the depth map to try Parallax Occlusion Mapping to add some subtle fake 3D perspective to the camera movement, but it looked pretty bad, since the depth values are incorrect for the isometric view.
I created an implementation of Normal Mapping Shadows, which let the 2D scene cast shadows from the sun directional light. Though the effect added some interesting visual qualities, it is not great for representing actual lighting; it could potentially be used as a custom lighting pass to shade 3D characters.
For the main alleyway environment featured in the video at the top of the page, I followed the same method of iterating through X/Y plots of an alleyway prompt until I had a starting point I was happy with.
I mirrored the image and placed it into a 2048×2048 canvas in Photoshop, then used the Alpaca plugin to outpaint the rest of the environment with the same prompt, slightly altering the wording from alleyway to building or street etc. to guide the content being generated. For each new section I picked my favourite of 5 possible generations.
Upscaling is often a tricky process, which I like to handle in stages. For this 2K image I divided it into quarters and ran each through img2img at double resolution with the same initial prompt. Then, due to inconsistencies between the quarters, I repeated the process for the overlapping seam areas and the center, and composited everything in Photoshop, using masks to blend each area into a seamless final 4K image.
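The tiling arithmetic is easy to get wrong, so here is a sketch of the crop regions I mean: four overlapping quarters whose shared margins give img2img matching context to blend across the seams (the overlap size is an arbitrary choice):

```python
def quarter_boxes(width, height, overlap=64):
    """Return (left, top, right, bottom) crops for four overlapping
    quarters of an image, in a form usable with PIL's Image.crop()."""
    hw, hh = width // 2, height // 2
    return [
        (0, 0, hw + overlap, hh + overlap),           # top-left
        (hw - overlap, 0, width, hh + overlap),       # top-right
        (0, hh - overlap, hw + overlap, height),      # bottom-left
        (hw - overlap, hh - overlap, width, height),  # bottom-right
    ]
```

Each crop goes through img2img at double resolution, and the doubled-up strip around each seam (2× the overlap, also doubled by the upscale) is where the Photoshop masks do the blending.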
I am quite happy with the look of the backgrounds, though I would definitely like more control over the scene content which could be enabled with future techniques. Some of the testing didn’t result in usable content and there are areas for improvement:
- Explore new methods of guiding the scene content
- Find ways to increase style consistency of different types of locations
- Develop better shader techniques for creating pseudo 3D effects from 2D backgrounds
- Train a custom model for generating Normal Maps from backgrounds