Longread

2023-10-10

How TRI is using Generative AI to teach robots

Toyota Research Institute (TRI) is using generative AI to teach robots new dexterous skills from demonstration data alone. This mirrors how large language models learn to generate text without being explicitly programmed.

TRI has trained robots in over 60 complex manual tasks such as pouring, tool use, and deformable object manipulation. Critically, this was done without writing any new code; the only change was supplying new demonstration data.

A TRI VP remarked on how rapidly and reliably this approach acquires new skills. Because it relies on learned representations of camera images and tactile sensing rather than hand-written models, it handles traditionally tricky areas like deformable materials well.

The parallels to large language model training point towards developing vast generalizable "behavior models" for robots. This could massively reduce the time and data needed to expand robots' capabilities.

At the RoboBusiness conference, to be held October 18-19 in Santa Clara, California, robotics industry leaders will discuss large language models (LLMs) and text-generation applications in robotics. Sessions will also explore fundamental ways of applying generative AI to robot design, model learning, simulation, control algorithms, and product commercialization.

The panel will include Pras Velagapudi, Vice President of Innovation at Agility Robotics; Jeff Linnell, CEO and founder of Formant; Ken Goldberg, William S. Floyd Jr. Distinguished Chair in Engineering at the University of California, Berkeley; Amit Goel, Director of Product Management at NVIDIA; and Ted Larson, CEO of OLogic.

Leveraging parallels with recent AI breakthroughs could accelerate unlocking versatile robotic dexterity. But challenges remain around sim-to-real transfer and ensuring safety.

This research direction shows promise for scaling robots' learned skills exponentially. Rather than coding behaviors, generative models could allow "programming" robots simply by providing more demonstrations.

If robots can gain dexterity the way language models gain linguistic competence, it would revolutionize their practical abilities. TRI's work offers an early glimpse of this future powered by generative AI.

Toyota Research Institute's robot learning approach combines human demonstrations with natural-language goal descriptions. A diffusion-based policy-learning method then autonomously acquires the demonstrated skills from dozens of examples.
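To make the data pipeline concrete, here is a minimal sketch of how demonstrations paired with language goals might be stored. The schema and field names are hypothetical illustrations, not TRI's actual data format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Demonstration:
    """One teleoperated demonstration: per-step sensor observations and
    commanded actions, tagged with a natural-language goal description.
    Field names are illustrative, not TRI's actual API."""
    language_goal: str                                       # e.g. "pour from the pitcher"
    observations: List[list] = field(default_factory=list)   # per-step camera/tactile features
    actions: List[list] = field(default_factory=list)        # per-step commanded joint targets

    def add_step(self, obs, action):
        self.observations.append(obs)
        self.actions.append(action)

# A skill dataset is then just dozens of such demonstrations sharing a goal.
demos = [Demonstration(language_goal="pour from the pitcher") for _ in range(3)]
demos[0].add_step(obs=[0.1, 0.2], action=[0.0, 1.0])
```

A policy trainer would iterate over such records, conditioning on both the observations and an embedding of `language_goal`.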

The training process is interface-agnostic and works with various inexpensive input devices. More dexterous behaviors are taught with bimanual haptic rigs that use position coupling for precise pose tracking.

This positional mapping lets the teacher feel resistance forces from the robot. Closing this tactile feedback loop is crucial for complex skills, especially where visual observation is insufficient.

For example, in bimanual tool use, there are many ambiguous internal force configurations not visible externally. Tactile sensing provides vital information about contact geometry, slippage, and forces.

TRI utilizes soft bubble sensors covering robot surfaces for rich spatial contact data. Historically, leveraging this dense tactile data has proven challenging.

But diffusion modeling effectively condenses the myriad possibilities of visual-tactile sensors into reliable skill acquisition. In tests, tactile feedback was critical for success in delicate tasks like egg-beating.

This demonstrates how rich behavioral models can emerge from combining human guidance and modern AI techniques. Leveraging multimodal sensory streams is key to learning robust, dexterous skills.

As robots venture into unstructured real-world situations, tactile intelligence will only grow in importance. TRI's research provides a promising template for future robots to learn complex manual abilities through generative demonstration and feedback.

Instead of generating images from text, Toyota Research Institute uses diffusion models to generate robot actions from sensor inputs and natural language commands.

This offers three key advantages over prior methods:

  • Applicability to multimodal demonstrations - humans can teach naturally without restricting inputs.
  • Handles high-dimensional action spaces - enables longer-term planning to avoid short-sighted actions.
  • Stable and reliable training - scales robot learning without meticulous configuration or control point tuning.
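The core sampling idea can be sketched with a toy reverse-diffusion loop: start from Gaussian noise over a whole action trajectory and iteratively denoise it, conditioned on an observation embedding. The "denoiser" below is a stand-in for a learned network, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

H, D = 8, 2      # action horizon and action dimension (toy sizes)
STEPS = 50       # number of denoising iterations

def toy_denoiser(traj, obs_embedding, t):
    """Stand-in for a learned noise-prediction network: here it simply
    points toward a trajectory determined by the observation embedding.
    A real diffusion policy would predict noise from (traj, obs, t)."""
    target = np.tile(obs_embedding, (H, 1))   # conditioning defines the goal
    return traj - target                      # "predicted noise" direction

def sample_action_trajectory(obs_embedding):
    traj = rng.normal(size=(H, D))            # start from pure Gaussian noise
    for t in range(STEPS):
        eps = toy_denoiser(traj, obs_embedding, t)
        traj = traj - 0.1 * eps               # small denoising step
    return traj

obs = np.array([0.5, -0.2])                   # e.g. fused camera + tactile features
trajectory = sample_action_trajectory(obs)    # full H-step action trajectory
```

Note that the output is an entire H-step trajectory rather than a single action, matching the multimodal, high-dimensional outputs the list above highlights.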

Diffusion excels at the high-dimensional output prediction that complex, multi-limb robots require. It also predicts full action trajectories rather than individual time steps.
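Predicting whole trajectories enables receding-horizon execution: the policy plans an H-step chunk, the robot executes only the first few actions, then re-plans from fresh observations. This is a generic control-loop sketch with illustrative numbers, not TRI's implementation.

```python
from collections import deque

H = 8            # length of each predicted action chunk
EXECUTE = 2      # actions executed before re-planning

def predict_chunk(step):
    """Stand-in for the diffusion policy: returns H planned actions
    starting from the current step (here just consecutive integers)."""
    return [step + i for i in range(H)]

executed = []
step = 0
while step < 6:
    chunk = deque(predict_chunk(step))
    for _ in range(EXECUTE):          # commit only the chunk's prefix
        executed.append(chunk.popleft())
        step += 1
```

Committing short prefixes of longer plans keeps behavior far-sighted while still reacting to new sensor data at each re-plan.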

Remarkably, diffusion policies require very little training optimization. This avoids the practical bottleneck of real-world robot evaluations to find optimal control parameters.

Most AI systems can be accurately evaluated with offline metrics. But real-world robot learning requires closed-loop physical testing, which is expensive and time-consuming.

Diffusion's out-of-the-box reliability bypasses this constraint. It greatly expands the horizons of robot skills that can be feasibly learned.

This research exemplifies diffusion's strengths for robotics - leveraging multimodal human guidance for stable large-scale skill acquisition, while minimizing real-world evaluation needs.

As robot learning moves beyond labs, overcoming training burdens will be key. TRI's work points to generative diffusion models making real-world robot instruction safer, easier and more scalable.

While generative models can rapidly teach robots new skills, Toyota Research Institute notes they remain fragile in novel conditions. Failures tend to occur due to:

  • Lack of recovery demonstrations in training data
  • Changes in camera angles or backgrounds
  • Unseen test-time manipulations
  • Cluttered environments absent in training
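One common mitigation for camera-angle and background brittleness is augmenting demonstration images during training. The crop routine below is a hypothetical, dependency-free sketch over nested lists; real pipelines would typically use an image library.

```python
import random

def random_crop(image, crop_h, crop_w, rng):
    """Return a random crop_h x crop_w sub-image of a row-major 2D image.
    Random offsets simulate small camera shifts at training time."""
    h, w = len(image), len(image[0])
    top = rng.randrange(h - crop_h + 1)
    left = rng.randrange(w - crop_w + 1)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]

rng = random.Random(0)
img = [[r * 10 + c for c in range(6)] for r in range(6)]  # 6x6 toy "image"
crop = random_crop(img, 4, 4, rng)
```

Training on such perturbed views encourages the policy to rely on task-relevant features rather than a fixed viewpoint or background.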

To address this, TRI leverages Drake, its model-based design and simulation toolbox for robotics. Drake's physics simulation helps narrow the gap between simulated and real conditions, improving skill generalization.

TRI robots have already acquired more than 60 dexterous skills. Through further data expansion, TRI aims to reach hundreds of skills by the end of 2023 and 1,000 by the end of 2024.

As large language models compose concepts in new ways from limited examples, TRI envisions similarly capable "large behavior models". These would combine semantic generalization with physical intelligence and creativity.

This is key for versatile robots that actively interact with varied environments and improvise skills as needed.

While diffusion models have accelerated early skill acquisition, brittleness to novel scenarios remains a key challenge. TRI's roadmap for combating this issue is further dataset growth and physics-based simulation.

If robots can learn generalizable behavioral repertoires at scale like humans, it would be transformative. TRI's incremental progress points to pathways for developing this level of adaptive physical intelligence.
