BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion

1 Dept. of Electrical and Computer Engineering, 2 Interdisciplinary Program in Artificial Intelligence, Seoul National University, Korea
* These authors contributed equally to this work
ECCV 2024

Generating an 8K image using BeyondScene

This 8192×8192 image, generated by BeyondScene, contains 64× as many pixels as SDXL's 1024×1024 training resolution and exceeds the technical definition of 8K (7680×4320).
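The resolution figures above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the helper name `pixels` is hypothetical, and it assumes the 64× figure refers to total pixel count relative to SDXL's nominal 1024×1024 training resolution.

```python
def pixels(width, height):
    """Total pixel count of a width x height image."""
    return width * height


# Assumption: SDXL's nominal training resolution is 1024x1024.
sdxl_train = pixels(1024, 1024)
output = pixels(8192, 8192)    # the BeyondScene image shown here
uhd_8k = pixels(7680, 4320)    # the 8K classification quoted above

ratio_vs_sdxl = output // sdxl_train   # 64
ratio_vs_8k = output / uhd_8k          # roughly 2.02

print(f"{output:,} pixels, {ratio_vs_sdxl}x SDXL, {ratio_vs_8k:.2f}x 8K")
```

Under that reading, each side is 8× longer than the training resolution, which gives 8 × 8 = 64× the pixels, and the image holds about twice as many pixels as a 7680×4320 frame.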






Resolution: 8192x8192; Prompt: “A background of the top of the mountain, A hiker with a broad smile, proudly wearing red hiking jacket. A content hiker, taking in the fresh air decked out in orange hiking jacket. An excited hiker, eyes sparkling with joy donned in yellow hiking jacket. A hiker, laughing at a joke, comfortably wearing green hiking jacket. A thoughtful hiker, with a serene expression, clad in blue hiking jacket. A jubilant hiker, radiating happiness dressed in purple hiking jacket.”



Abstract

Generating higher-resolution human-centric scenes with details and controls remains a challenge for existing text-to-image diffusion models. This challenge stems from limited training image size, limited text encoder capacity (a limited number of tokens), and the inherent difficulty of generating complex scenes involving multiple humans. While current methods have attempted to address only the training size limit, they often yield human-centric scenes with severe artifacts. We propose BeyondScene, a novel framework that overcomes prior limitations, generating exquisite higher-resolution (over 8K) human-centric scenes with exceptional text-image correspondence and naturalness using existing pretrained diffusion models. BeyondScene employs a staged and hierarchical approach. It first generates a detailed base image, focusing on the elements crucial to instance creation for multiple humans and on detailed descriptions beyond the token limit of the diffusion model. It then seamlessly converts the base image into a higher-resolution output that exceeds the training image size and incorporates text-aware and instance-aware details, via our novel instance-aware hierarchical enlargement process, which consists of our proposed high-frequency injected forward diffusion and adaptive joint diffusion. BeyondScene surpasses existing methods in terms of correspondence with detailed text descriptions and naturalness, paving the way for advanced applications in higher-resolution human-centric scene creation beyond the capacity of pretrained diffusion models without costly retraining.
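To give a rough sense of what joint diffusion over a canvas larger than the training size involves, the sketch below computes a generic fixed-stride layout of overlapping windows and counts how many windows cover each position. It is not the adaptive joint diffusion proposed in the paper: the function names, the window size of 128, the stride of 64, and the canvas length of 1024 (an 8192-pixel side under an assumed 8× latent downsampling) are all illustrative assumptions.

```python
def window_starts(length, window, stride):
    """Start offsets of sliding windows that together cover [0, length)."""
    if window >= length:
        return [0]
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] != length - window:
        # Add a final window so the far border is always covered.
        starts.append(length - window)
    return starts


def coverage_counts(length, window, stride):
    """Number of windows covering each position along one axis.

    In a generic joint-diffusion scheme, per-window predictions are
    averaged where windows overlap, so these counts act as the divisor.
    """
    counts = [0] * length
    for start in window_starts(length, window, stride):
        for i in range(start, start + window):
            counts[i] += 1
    return counts


# Illustrative values only (assumptions, not taken from the paper).
starts = window_starts(1024, 128, 64)
counts = coverage_counts(1024, 128, 64)
print(len(starts), "windows per axis;", len(starts) ** 2, "windows in total")
print("coverage per position ranges from", min(counts), "to", max(counts))
```

With a fixed stride, every window is treated the same regardless of its content. The abstract describes the paper's process as instance-aware and adaptive, which this sketch does not attempt to reproduce.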



Results

Comparison of BeyondScene with Baselines: MultiDiff, ScaleCrafter, Regional MultiDiff, SDXL, SDXL with ControlNet, SyncDiffusion, T2I Adapter

Higher-Resolution Synthesized Images, Resolution: 7168 x 4096

Prompt: “A background of the middle of snowy mountain, there are a man is riding ski with red ski suit, a woman is riding ski with yellow ski suit, a man is riding ski with orange ski suit, a woman is riding ski with blue ski suit, a man is riding ski with green ski suit and a man is riding ski with black ski suit.”


Prompt: “The background of alps in spring, there are a girl with blond hair, wearing traditional red alps dress with elaborate embroidery and dancing with joy, a girl with blond hair, wearing traditional blue alps dress with elaborate embroidery and dancing with joy, a girl with blond hair, wearing traditional yellow alps dress with elaborate embroidery and dancing with joy, a girl with blond hair, wearing traditional white alps dress with elaborate embroidery and dancing with joy and a girl with blond hair, wearing traditional pink alps dress with elaborate embroidery and dancing with joy.”


Higher-Resolution Synthesized Images, Resolution: 4096 x 4096

Prompt: “In the background garden of 3D game, there are girl in red dress, Zelda character, wearing red dress with traditional leather accessories and colorful patterns girl in blue dress, Zelda character, wearing blue dress with traditional leather accessories and colorful patterns girl in green dress, Zelda character, wearing green dress with traditional leather accessories and colorful patterns, girl in yellow dress, Zelda character, wearing yellow dress with traditional leather accessories and colorful patterns.”


Higher-Resolution Synthesized Images, Resolution: 3584 x 2048

Prompt: “A background of the large wave in sunny beach, there are a woman enjoying surfing with yellow wet suit, a woman enjoy surfing with pink wet suit, a man enjoy surfing with blue wet suit, a man enjoy surfing with white wet suit and a woman enjoy surfing with black wet suit.”


Prompt: “In the alien planet with galaxy in the sky, there are a storm trooper, a storm trooper, a storm trooper, a Darth Vader, a storm trooper, a storm trooper, a storm trooper, a storm trooper”


Prompt: “A background of the beach in sunny day, there are a man running and wearing blue swim pants, a man running and wearing purple swim pants, a woman running and wearing green swim suit, a woman running and wearing pink swim suit, a man running and wearing red swim pants, a woman running and wearing yellow swim suit and a woman running and wearing red swim suit.”


Prompt: “A background of the forest at night and campfire in the middle of the forest, there are a girl sitting down in front of the fire with yellow shirts and listening the story, a girl sitting down in front of the fire with pink shirts and listening the story, a boy sitting down in front of the fire with blue shirts and listening the story, a boy sitting down in front of the fire with white shirts and talking about scary story and a girl sitting down in front of the fire with black shirts and listening the story”