1X, a robotics firm backed by OpenAI, is advancing toward its goal of providing physical labor through intelligent androids. In a recent update, its humanoid robot Eve autonomously completed consecutive tasks, an early step toward an AI system that chains simple tasks into complex actions through voice commands and supports multi-robot control and remote operation.
The androids created by 1X employ Embodied Learning, with the AI software trained on and deployed directly in the robots' physical bodies. While previous demonstrations focused on the robots' ability to manipulate simple objects, the team emphasizes that mastering task chaining is essential for them to become effective service robots.
Researchers at 1X faced challenges in integrating multiple tasks into a single neural network model. Small multi-task models (<100M parameters) often suffered from a forgetting problem, where fixing one task's performance degraded others. Increasing the parameter count can mitigate this issue, but it lengthens training time and slows iteration.
To address this, 1X kept the individual models small and short-horizon, and relies on a human to sequence them into longer behaviors through a voice-controlled natural language interface.
“To accomplish this, we’ve built a voice-controlled natural language interface to chain short-horizon capabilities across multiple small models into longer ones. With humans directing the skill chaining, this allows us to accomplish the long-horizon behaviors,” said Eric Jang, vice president of AI at 1X Technologies, in a blog post.
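To make the mechanism concrete, here is a minimal Python sketch of what such human-directed skill chaining could look like: each short-horizon skill is a small policy, and the operator's commands simply pick which policy runs next. The class and function names are illustrative assumptions, not 1X's actual code.

```python
# Hypothetical sketch of human-directed skill chaining: small single-task
# policies are run in the order an operator specifies.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Observation:
    """Placeholder for camera images, proprioception, etc."""
    data: dict


# Each short-horizon skill is a small policy: observation -> low-level actions.
SkillPolicy = Callable[[Observation], List[float]]


class SkillChainer:
    def __init__(self, skills: Dict[str, SkillPolicy]):
        self.skills = skills  # e.g. {"pick up the cup": pick_policy, ...}

    def run(self, commands: List[str], get_obs: Callable[[], Observation]) -> None:
        """Execute a human-specified sequence of short-horizon skills.

        The operator supplies the high-level ordering; each small model only
        has to solve its own short task from the state the previous skill
        leaves behind.
        """
        for command in commands:
            policy = self.skills[command]
            obs = get_obs()            # re-observe: the scene has changed
            actions = policy(obs)      # low-level actions for this skill
            execute_on_robot(actions)  # hypothetical robot interface


def execute_on_robot(actions: List[float]) -> None:
    print(f"executing {len(actions)} joint commands")


if __name__ == "__main__":
    chainer = SkillChainer({"wave": lambda obs: [0.0, 0.1, 0.2]})
    chainer.run(["wave"], get_obs=lambda: Observation(data={}))
```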
Chaining autonomous robot skills poses difficulties, as each subsequent skill must adapt to variations caused by preceding actions. This complexity compounds with each successive skill, since every policy must cope with the slightly different starting conditions the previous one leaves behind.
A high-level language interface streamlines the user experience, letting a single operator direct multiple robots. The same setup also simplifies data collection and evaluation, since operators can compare a new model's predictions against existing baselines while testing it on the robot.
“From the user perspective, the robot is capable of doing many natural language tasks and the actual number of models controlling the robot is abstracted away. This allows us to merge the single-task models into goal-conditioned models over time,” said Jang.
Once the goal-conditioned model's predictions align well with those of the single-task models, researchers can switch to the unified, more powerful model without disrupting the operator's workflow.
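One simple way to make that "aligns well" check concrete is to measure how closely the goal-conditioned model's predicted actions track each single-task baseline on the same observations. The sketch below assumes a mean-squared-error metric and a fixed threshold purely for illustration; 1X has not described its exact criterion.

```python
# Illustrative consistency check between a goal-conditioned model and
# single-task baselines; the metric and threshold are assumptions.
import numpy as np


def prediction_gap(goal_conditioned_actions: np.ndarray,
                   baseline_actions: np.ndarray) -> float:
    """Mean squared error between the two models' predicted action trajectories."""
    return float(np.mean((goal_conditioned_actions - baseline_actions) ** 2))


def ready_to_switch(eval_episodes, threshold: float = 1e-3) -> bool:
    """Return True when the unified model tracks every baseline closely.

    `eval_episodes` is assumed to be a list of (goal_conditioned, baseline)
    action arrays predicted on the same observations during testing.
    """
    gaps = [prediction_gap(g, b) for g, b in eval_episodes]
    return max(gaps) < threshold


# Toy usage with random stand-ins for predicted action trajectories.
rng = np.random.default_rng(0)
baseline = rng.normal(size=(100, 20))  # 100 steps, 20 action dimensions
unified = baseline + rng.normal(scale=0.01, size=baseline.shape)
print(ready_to_switch([(unified, baseline)]))
```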
Using this high-level language interface to direct robots provides a novel user experience for data collection. “Instead of using VR to control a single robot, an operator can direct multiple robots with high-level language and let the low-level policies execute low-level actions to realize those high-level goals,” said Jang.
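The sketch below illustrates that workflow under stated assumptions: one operator pushes high-level language commands to several robots, and each robot's stubbed-out low-level policy handles execution. The queue-and-worker structure is a hypothetical stand-in, not 1X's system.

```python
# Hypothetical multi-robot dispatch: one operator, several robots, no
# per-robot VR teleoperation.
from queue import Queue
from threading import Thread


class RobotWorker(Thread):
    def __init__(self, name: str, commands: Queue):
        super().__init__(daemon=True)
        self.name = name
        self.commands = commands

    def run(self):
        while True:
            command = self.commands.get()
            if command is None:  # sentinel: operator is done with this robot
                break
            # A low-level policy would turn this command into joint actions
            # and log the resulting trajectory as training data.
            print(f"{self.name}: executing '{command}'")


queues = {f"eve-{i}": Queue() for i in range(3)}
workers = [RobotWorker(name, q) for name, q in queues.items()]
for w in workers:
    w.start()

queues["eve-0"].put("pick up the cup")
queues["eve-1"].put("open the drawer")
queues["eve-2"].put("wave at the camera")
for q in queues.values():
    q.put(None)
for w in workers:
    w.join()
```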