Generating Activity Snippets by Learning Human-Scene Interactions

We present an approach to generate virtual activity snippets, which comprise sequenced keyframes of multi-character, multi-object interaction scenarios in 3D environments, by learning from recordings of human-scene interactions. The generation consists of two stages. First, we use a sequential deep graph generative model with a temporal module to iteratively generate keyframe descriptions, which represent abstract interactions as graphs, while preserving spatial-temporal relations throughout the activities. Second, we devise an optimization framework that instantiates the activity snippets in virtual 3D environments, guided by the generated keyframe descriptions. Our approach optimizes the poses of the character and object instances encoded by the graph nodes to satisfy the relations and constraints encoded by the graph edges. The instantiation process consists of a coarse 2D optimization followed by a fine 3D optimization, which together explore the complex solution space for placing and posing the instances. In experiments and a perceptual study, we apply our approach to generate plausible activity snippets under different settings.
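As a rough illustration of the second stage, the Python sketch below places the instances of a single keyframe graph with a coarse 2D search followed by a fine 3D refinement. It is not the authors' implementation: the `Keyframe` class, the distance-only edge constraint, the random grid search, and the hill-climbing refinement are all assumptions chosen to keep the example short, and the node names and distances are made up.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Keyframe:
    """Hypothetical keyframe description: nodes are instances, edges are relations."""
    nodes: list
    # Each edge is (source, target, desired distance in metres).
    # A distance target is an assumed stand-in for the relations in the abstract.
    edges: list


def cost(kf, pos):
    """Sum of squared deviations from each edge's desired distance."""
    return sum((math.dist(pos[a], pos[b]) - d) ** 2 for a, b, d in kf.edges)


def coarse_2d(kf, room=4.0, cells=8, samples=2000, seed=0):
    """Coarse stage (assumed): random search over grid-cell centres on the floor."""
    rng = random.Random(seed)
    step = room / cells
    best, best_cost = None, math.inf
    for _ in range(samples):
        pos = {n: ((rng.randrange(cells) + 0.5) * step,
                   (rng.randrange(cells) + 0.5) * step) for n in kf.nodes}
        c = cost(kf, pos)
        if c < best_cost:
            best, best_cost = pos, c
    return best


def fine_3d(kf, pos2d, iters=3000, sigma=0.05, seed=1):
    """Fine stage (assumed): lift to 3D, then hill-climb with small perturbations."""
    rng = random.Random(seed)
    pos = {n: (x, y, 0.0) for n, (x, y) in pos2d.items()}
    current = cost(kf, pos)
    for _ in range(iters):
        n = rng.choice(kf.nodes)
        old = pos[n]
        pos[n] = tuple(v + rng.gauss(0.0, sigma) for v in old)
        new = cost(kf, pos)
        if new < current:
            current = new      # keep the improvement
        else:
            pos[n] = old       # revert the perturbation
    return pos


# Made-up "serve food" keyframe with three instances and three distance relations.
kf = Keyframe(
    nodes=["chef", "customer", "plate"],
    edges=[("chef", "plate", 0.5), ("customer", "plate", 0.6), ("chef", "customer", 1.0)],
)
coarse = coarse_2d(kf)
fine = fine_3d(kf, coarse)
print("coarse cost:", cost(kf, coarse))
print("fine cost:", cost(kf, fine))
```

The coarse stage only narrows the search to a promising region of the floor plan; the fine stage then adjusts continuous positions, accepting a perturbation only when it lowers the constraint cost, so the refined layout is never worse than the coarse one.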

Our approach generates an activity snippet as a sequence of high-level keyframe descriptions of multi-character, multi-object interactions, each represented as a graph. Based on these descriptions, it automatically instantiates virtual characters and objects in a 3D environment.

Results

Generated "serve food" activity snippets.

A generated "prepare burger" activity snippet. The keyframe description generator was trained on a synthetic Overcooked dataset.

An interactive "prepare cake" activity snippet. The activity advances based on the player's actions.

A "serve food" activity snippet in mixed reality.
