Generative Human-Scene Interactions

Breathing Life into Virtual Worlds through Generative Human-Scene Interaction

The Vision

As virtual reality, the metaverse, and spatial computing continue to evolve, the demand for truly immersive digital environments has never been higher. However, populating these worlds with lifelike, intelligent characters remains a significant challenge. Traditional methods often rely on rigid, hand-crafted rules or focus solely on a single character’s basic movements, ignoring the rich, dynamic context of the world around them.

Our projects introduce novel frameworks for Generative Human-Scene Interaction. Instead of merely placing static avatars in a 3D space, we give virtual characters the “cognitive ability” to understand their surroundings and the “physical awareness” to interact naturally with objects and other characters over time.

How It Works: “Brains” for Smarter Virtual Interactions

To bridge the gap between high-level human intent and low-level 3D execution, our frameworks approach behavior generation through two complementary pathways:

  • The Semantic Brain (Multimodal Large Language Models) (Li et al., 2025): We leverage the reasoning capabilities of multimodal language models, giving them “eyes” on the virtual environment. By analyzing a scene’s spatial layout and semantic context, the model acts like a virtual director: it drafts logical, context-aware “scripts”, or activity descriptions, that specify what characters should do, where they should stand, and how they should interact with specific objects, grounded in real-world common sense (see the first sketch after this list).
  • The Structural Brain (Graph Generative Models) (Li & Yu, 2023): For complex scenarios involving multiple people and objects (like a waiter taking orders from two seated customers), we represent the scene as a dynamic web of relationships, or a “graph.” Our system predicts how these relationships evolve frame by frame, generating fluid “activity snippets.” This ensures that the interactions are not just meaningful in a single moment, but logically connected over time (see the second sketch after this list).
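
To make the first pathway concrete, here is a minimal sketch of the semantic-brain idea: serialize the scene layout into a prompt, ask a language model to act as a virtual director, and parse the resulting activity script. The OpenAI Python client, the gpt-4o model name, the scene schema, and the output format below are all illustrative stand-ins, not the setup used in the paper.

```python
# Sketch of the "semantic brain": prompt a language model with a scene
# layout and parse a structured activity script. All names and schemas
# here are hypothetical; the paper's actual prompts and model may differ.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical scene layout: object names, positions (meters), affordances.
scene = {
    "objects": [
        {"name": "sofa", "position": [2.0, 0.0, 1.5], "affordance": "sit"},
        {"name": "coffee_table", "position": [2.0, 0.0, 2.4], "affordance": "place"},
        {"name": "bookshelf", "position": [0.5, 0.0, 3.0], "affordance": "reach"},
    ]
}

prompt = (
    "You are a virtual director. Given this 3D scene layout, write a short "
    "activity script for two characters. For each step give the actor, the "
    "action, and the target object. Answer as a JSON object of the form "
    '{"steps": [{"actor": ..., "action": ..., "target": ...}]}.\n\n'
    f"Scene: {json.dumps(scene)}"
)

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative multimodal model choice
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},  # request machine-readable output
)

script = json.loads(response.choices[0].message.content)
for step in script["steps"]:
    print(f'{step["actor"]}: {step["action"]} -> {step["target"]}')
```

In practice the returned script would then be handed to a motion-synthesis layer that places and animates the characters; the sketch stops at the “director” stage.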
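
The structural-brain pathway can likewise be sketched as a data structure: a scene graph whose relation edges are rolled out frame by frame. The node and edge schema and the hand-written transition stub below are illustrative only; in the actual system, a learned graph generative model predicts how the relationships evolve.

```python
# Sketch of the "structural brain": a scene represented as a graph of humans
# and objects whose relation edges are rolled out frame by frame. The
# hand-written transition is a stub standing in for the learned model.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Node:
    name: str   # e.g. "waiter", "customer_1", "table"
    kind: str   # "human" or "object"

@dataclass
class SceneGraph:
    nodes: list[Node]
    # Relation edges: (source, relation, target),
    # e.g. ("waiter", "approaches", "table").
    edges: set[tuple[str, str, str]] = field(default_factory=set)

def predict_next(graph: SceneGraph) -> SceneGraph:
    """Stub transition: in the real system a learned graph generative model
    predicts how the relation edges evolve to the next frame."""
    nxt = SceneGraph(graph.nodes, set(graph.edges))
    if ("waiter", "approaches", "table") in nxt.edges:
        nxt.edges.discard(("waiter", "approaches", "table"))
        nxt.edges.add(("waiter", "takes_order_from", "customer_1"))
    return nxt

# Roll out a short activity snippet: each step yields a graph consistent
# with the previous one, so interactions stay connected across frames.
g = SceneGraph(
    nodes=[Node("waiter", "human"), Node("customer_1", "human"), Node("table", "object")],
    edges={("waiter", "approaches", "table"), ("customer_1", "sits_at", "table")},
)
for frame in range(2):
    print(f"frame {frame}: {sorted(g.edges)}")
    g = predict_next(g)
```

Keeping the rollout graph-structured is what lets the system reason about multiple people and objects jointly, rather than animating each character in isolation.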

Real-World Impact & Applications

By automating the creation of contextually meaningful, multi-character interactions, these projects open up exciting possibilities across various fields:

  • Immersive VR/AR Experiences: Creating rich, populated digital worlds where non-player characters (NPCs) react organically to their environment and to users.

  • Social Metaverse: Generating authentic social behaviors and crowd dynamics in shared virtual spaces.

  • Gaming & Content Creation: Allowing developers and storytellers to populate scenes with complex activities using simple, high-level prompts rather than tedious manual animation.

  • Robotics & Simulation: Providing highly realistic, synthetic training environments for teaching robots how to navigate human-centric spaces.

References

2025

  1. Crafting Dynamic Virtual Activities with Advanced Multimodal Models
    Changyang Li, Qingan Yan, Minyoung Kim, Zhan Li, Yi Xu, and Lap-Fai Yu
    In IEEE International Symposium on Mixed and Augmented Reality (ISMAR), 2025

2023

  1. Generating Activity Snippets by Learning Human-scene Interactions
    Changyang Li and Lap-Fai Yu
    ACM Transactions on Graphics (Proceedings of SIGGRAPH), 2023