Research

Beyond the Pose: Towards an Interdisciplinary Understanding of Dynamic Human Movement.

We have brought together an interdisciplinary team comprising members from Performing Arts, Computer Science, Kinesiology, and Inclusive Design, to research and develop methods for improving the representation and interpretation of human movement—in all its diversity—to enable a richer, more embodied, and more equitable presence for bodies of all sorts in engagements with machine learning systems.

THE CHALLENGE: How we are situated in the world within our individual bodies defines much of our experience and agency as humans. We navigate the world with purposeful movements and expressive gestures involving joints and muscles working in complex synergies. As we become increasingly surrounded by machine learning (ML) systems, it is essential that these systems be sensitive and responsive to the nuances of our actions. How our movements are represented within ML systems will significantly determine our experience of being in the world: how our embodied presence is accommodated and valued in our increasingly ML-mediated lives.

Real-time Generation of Panoramic scenes from Voice using a custom Stable Diffusion pipeline (2023)

Awwal Malhi, David Rokeby

We have extended our exploration of text-to-scrolling-image generation to work with the latest version of Stable Diffusion and are adapting it to our particular performance-based projects. Here is a sneak peek of a very casual interaction with the most recent version. This is the direct output responding in real time to the spoken words, with no editing and no cherry-picking.

Real-time Generation of Panoramic scenes from Voice using CLIP and VQGAN (2021-2022)

Kathy Zhuang, Awwal Malhi, David Rokeby, Xavier Snelgrove

Advances in machine-learning-based image generation from text prompts have led us to wonder how such technologies might be used in live performance. We are exploring optimized generation of scrolling panoramas generated in real time from live spoken text. We are using OpenAI’s CLIP and CompVis’s VQGAN in our current experiments, but are also exploring the use of newer diffusion-based generators.

Our primary challenges include optimizing the system to generate the scrolling images quickly and fluidly, and massaging the incoming transcribed speech into effective unfolding prompts to provide a tangible and immediate sense that the generated images are responding to that speech in compelling ways.
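One simple way to turn a growing transcript into an unfolding prompt is a sliding window over the most recent words. The sketch below is an illustrative assumption, not the project's actual prompt logic; the `RollingPrompt` class, the `max_words` parameter, and the sample transcript fragments are all hypothetical.

```python
from collections import deque


class RollingPrompt:
    """Keep only the most recent words of a live transcript as the prompt."""

    def __init__(self, max_words=8):
        # A deque with maxlen drops the oldest words as new ones arrive.
        self.words = deque(maxlen=max_words)

    def feed(self, fragment):
        """Add a newly transcribed fragment and return the current prompt."""
        self.words.extend(fragment.split())
        return " ".join(self.words)


# Hypothetical transcript fragments arriving one after another.
prompter = RollingPrompt(max_words=5)
first = prompter.feed("a skeleton in love")
second = prompter.feed("painted by Gustav Klimt")
```

A short window keeps the image close to what was just said, while a longer one lets earlier words keep shaping the scroll; the right length would have to be found by experiment.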

The above image is a small section of a much larger panorama that was the result of a recent experiment.

This is another section of the same scroll that includes a variety of prompts about “a skeleton in love” using references to different artists and art movements (from left to right: Cubism, Diego Rivera, Gustav Klimt, Piet Mondrian, something we can’t remember, and finally H. R. Giger).

Real-Time Pose and Gesture-based cueing using Motion Capture (2021-2022)

David Rokeby

As part of our foray into Motion Capture Research for performance, we developed a machine-learning system that was trained to associate specific poses of a performer in a motion capture suit with specific lighting, sound, and video cues.

Our aim with this particular project was to explore ways that motion capture and machine learning could be used to give a live performer more agency within complex multimedia performances, so that they could judge the timing of cues as best suited the audience and the flux of the moment.
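The idea of tying poses to cues can be sketched with a nearest-template matcher. This is a deliberately simple stand-in for the trained system described above, whose actual model is not specified here; the cue names, joint-angle vectors, and `max_distance` threshold are hypothetical.

```python
import math

# Hypothetical pose templates: each pose is a short vector of joint angles (degrees).
POSE_CUES = {
    "lights_up": [90.0, 90.0, 0.0, 0.0],
    "sound_sting": [0.0, 90.0, 0.0, 45.0],
    "video_start": [45.0, 0.0, 90.0, 0.0],
}


def match_cue(pose, templates=POSE_CUES, max_distance=20.0):
    """Return the cue whose template is nearest to `pose`, or None if none is close."""
    best_cue, best_dist = None, float("inf")
    for cue, template in templates.items():
        dist = math.dist(pose, template)  # Euclidean distance in joint-angle space
        if dist < best_dist:
            best_cue, best_dist = cue, dist
    # Reject poses that are not close to any template, to avoid accidental cues.
    return best_cue if best_dist <= max_distance else None


cue = match_cue([88.0, 93.0, 2.0, 1.0])
no_cue = match_cue([10.0, 10.0, 10.0, 10.0])
```

The distance threshold matters in performance: without it, every incidental posture would fire whichever cue happened to be nearest.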

The video below documents the incorporation of this system into a scene from Bertolt Brecht’s The Resistible Rise of Arturo Ui.

Mask-Guided Discovery of Semantic Manifolds in Generative Models

Mengyu Yang, David Rokeby, Xavier Snelgrove: 2020-10-09
accepted for the Workshop on Machine Learning for Creativity and Design (NeurIPS), 2020

Paper: pdf
Project Page and Code: GitHub

Abstract

Advances in the realm of Generative Adversarial Networks (GANs) have led to architectures capable of producing amazingly realistic images such as StyleGAN2 which, when trained on the FFHQ dataset, generates images of human faces from random vectors in a lower-dimensional latent space. Unfortunately, this space is entangled – translating a latent vector along its axes does not correspond to a meaningful transformation in the output space (e.g., smiling mouth, squinting eyes). The model behaves as a black box providing neither control over its output nor insight into the structures it has learned from the data.

However, the smoothness of the mappings from latents to faces plus empirical evidence suggest that manifolds of meaningful transformations are in fact hidden inside the latent space but obscured by not being axis-aligned or even linear. Travelling along these manifolds would provide puppetry-like abilities to manipulate faces while studying their geometry would provide insight into the nature of the face variations present in the dataset – revealing and quantifying the degrees-of-freedom of eyes, mouths, etc.

We present a method to explore the manifolds of changes of spatially localized regions of the face. Our method discovers smoothly varying sequences of latent vectors along these manifolds suitable for creating animations. Unlike existing disentanglement methods that either require labelled data or explicitly alter internal model parameters, our method is an optimization-based approach guided by a custom loss function and manually defined region of change.
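To make the idea of a loss guided by a manually defined region more concrete, here is a toy sketch. It is not the loss function from the paper; it only assumes, for illustration, that a candidate image should differ from a reference inside a mask and stay unchanged outside it. The function name, the `inside_weight` parameter, and the pixel values are hypothetical.

```python
def masked_change_loss(reference, candidate, mask, inside_weight=1.0):
    """Penalize change outside the mask and reward change inside it.

    All three arguments are flat lists of equal length; mask entries are 0 or 1.
    """
    outside = inside = 0.0
    n_out = n_in = 0
    for ref, cand, m in zip(reference, candidate, mask):
        sq = (cand - ref) ** 2
        if m:
            inside += sq
            n_in += 1
        else:
            outside += sq
            n_out += 1
    mean_out = outside / n_out if n_out else 0.0
    mean_in = inside / n_in if n_in else 0.0
    # Lower is better: little change outside the mask, more change inside it.
    return mean_out - inside_weight * mean_in


# Hypothetical 4-pixel "images"; the mask selects the last two pixels.
reference = [0.2, 0.2, 0.2, 0.2]
mask = [0, 0, 1, 1]
local_edit = [0.2, 0.2, 0.8, 0.8]   # changes only inside the mask
global_edit = [0.8, 0.8, 0.8, 0.8]  # changes everywhere

local_loss = masked_change_loss(reference, local_edit, mask)
global_loss = masked_change_loss(reference, global_edit, mask)
```

In this toy setting the localized edit scores lower than the global one, which is the behaviour an optimizer searching the latent space would be steered toward.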

Reaching Through the Screen: Exploring real-time video processing to allow for multi-person interaction within Zoom video conferencing sessions

David Rokeby: 2020-06-12

Abstract

The COVID-19 lockdown has forced us all into a situation where most of our contact with people outside of our families is over video conferencing software such as Zoom. While these systems have been of great utility in the current situation, they have left most of our social engagement narrowly framed through video and sound on flat screens with bad speakers. We are exploring ways to reach through these screens to have different kinds of engagements with each other within these frames. Using real-time screen capture, we route the whole Zoom screen to real-time video processing software where we can separate the group of images into individual video streams. These individual streams can be analyzed and processed in real time in various ways to create alternative video and audio experiences. Preliminary experiments include: collaging all participants into a single shared screen where they can engage and collaborate together visually, or giving each participant control over a part of a collective improvisation, either through controlling sounds contributing to a shared soundscape, or by giving each control over one joint in a computer-animated puppet that all can see through Zoom screen-sharing.
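Separating a captured gallery view into individual streams starts with knowing where each participant's tile sits in the frame. The sketch below assumes a uniform grid that fills the whole captured frame with no margins or borders, which a real Zoom window will not match exactly; `tile_rects` and the frame dimensions are hypothetical.

```python
def tile_rects(frame_width, frame_height, rows, cols):
    """Return (x, y, width, height) crop rectangles for a rows x cols grid."""
    tile_w = frame_width // cols
    tile_h = frame_height // rows
    # Tiles are listed row by row, left to right.
    return [
        (col * tile_w, row * tile_h, tile_w, tile_h)
        for row in range(rows)
        for col in range(cols)
    ]


# Hypothetical 1920x1080 capture showing six participants in a 2 x 3 grid.
rects = tile_rects(1920, 1080, rows=2, cols=3)
```

Each rectangle can then be used to crop every captured frame, giving one video stream per participant for further analysis or compositing.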

Guided Text Generation Tools for Performance

David Rokeby: 2020-03-24

A tool for exploring GPT-2 models fine-tuned on corpora of plays. This tool provides the ability to interact with the model, showing the likelihood for next words in the generation, allowing for stepping backwards, changing word selections, guiding the generation through word suggestions, and changing text generation hyper-parameters on the fly.

Abstract

Recent advances in text generation using Transformer architectures, like OpenAI’s GPT-2, offer new possibilities for the generation of text for creative applications. The simplest examples involve processes like fine-tuning a GPT-2 model on the works of Shakespeare to create a system that will generate properly structured and surprisingly coherent alternative Shakespeare-ish texts. But the utility of this method is limited by the standard mode of usage of these systems, where the system is given a text as a prompt and then proceeds to suggest next words in sequence; once the prompt is given, the transformer generates without further interaction. We are developing tools for exploring, visualizing and manipulating GPT-2 models to discover approaches that make it possible to guide or direct the output of transformers in order to allow for more sustained engagement beyond the choice of an initial prompt.
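Showing next-word likelihoods and adjusting generation settings on the fly both come down to how raw model scores become probabilities. The sketch below uses made-up scores rather than a real GPT-2 model, and is only meant to illustrate how temperature and a top-k cutoff reshape the list of candidate words; `next_word_probs` and `toy_logits` are hypothetical.

```python
import math


def next_word_probs(logits, temperature=1.0, top_k=None):
    """Convert raw next-word scores into probabilities, highest first."""
    # Lower temperature sharpens the distribution; higher temperature flattens it.
    scaled = {word: score / temperature for word, score in logits.items()}
    ranked = sorted(scaled.items(), key=lambda item: item[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]  # keep only the k most likely candidates
    peak = ranked[0][1]  # subtract the maximum for numerical stability
    weights = [(word, math.exp(score - peak)) for word, score in ranked]
    total = sum(weight for _, weight in weights)
    return [(word, weight / total) for word, weight in weights]


# Hypothetical scores for the word following "To be, or not to".
toy_logits = {"be": 4.0, "sleep": 2.0, "die": 1.5, "dream": 0.5}
probs = next_word_probs(toy_logits, temperature=0.7, top_k=3)
```

An interactive tool can display this ranked list at every step and let the user pick a word other than the top candidate, which is one way to guide generation beyond the initial prompt.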

Voxel-based Mapping of Space for Performance

David Rokeby: 2019-11-20

A prototype app that converts a depth map to homogeneous voxels. The voxel representation is easier to process than a point cloud. We further allow for arbitrarily complex remapping of voxels into regions which can be set up as triggers or modulators for other systems.

Abstract

Depth cameras offer enormous potential for interactive performance. Most approaches use skeleton tracking to map joints into 3-D space. While this is effective for many applications, the skeleton data tends to be noisy and fragile, and extracting clean motion information is challenging. Mapping the 3-D point cloud into voxel space can yield much cleaner and more stable motion information that can be used to provide movement-based interaction possibilities for performance.
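The mapping from points to voxels, and from voxels to trigger regions, can be sketched in a few lines. This is a minimal illustration under assumed conventions, not the prototype app's code; the point coordinates, the 0.25 m voxel size, the region definitions, and the `min_points` threshold are hypothetical.

```python
from collections import Counter


def voxelize(points, voxel_size):
    """Count how many 3-D points fall into each cubic voxel."""
    counts = Counter()
    for x, y, z in points:
        # Integer voxel indices from floor division of each coordinate.
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        counts[key] += 1
    return counts


def region_active(counts, region, min_points=2):
    """A region is a set of voxel keys; it fires when enough points occupy it."""
    return sum(counts[key] for key in region) >= min_points


# Hypothetical points in metres, as might come from a depth camera.
points = [(0.05, 1.20, 2.01), (0.08, 1.22, 2.04), (0.90, 0.10, 3.50)]
counts = voxelize(points, voxel_size=0.25)

trigger_zone = {(0, 4, 8)}    # two nearby points land in this voxel
quiet_zone = {(3, 0, 14)}     # only one point lands here
```

Because nearby points collapse into the same voxel, small jitter in individual depth readings has little effect on whether a region fires, which is the stability advantage described above.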

Learning Mappings Between Spaces for Creative Control

Xavier Snelgrove: 2019-10-20

Abstract

We propose a research program to use machine learning techniques to find alignments between the geometric structures on manifolds of user input (e.g. pose, voice, hand gesture, etc.) and manifolds of generative models (e.g. face-generating GANs, vocal synthesis, etc.). We hypothesize that this may allow for more intuitive control of generative models, with particular application in live creative performance.