Low-Code Building Blocks for Speech AI Robotics

When examining an intricate speech AI robotic system, it’s easy for developers to feel intimidated by its complexity. Arthur C. Clarke claimed, “Any…

When examining an intricate speech AI robotic system, it’s easy for developers to feel intimidated by its complexity. Arthur C. Clarke claimed, “Any sufficiently advanced technology is indistinguishable from magic.”

From accepting natural-language commands to safely interacting in real-time with its environment and the humans around it, today’s speech AI robotics systems can perform tasks to a level previously unachievable by machines.

Join experts from Google, Meta, NVIDIA, and more at the first annual NVIDIA Speech AI Summit. Register now.

Take Spot, a speech AI-enabled robot that can fetch drinks on its own, for example. To easily add speech AI skills, such as automatic speech recognition (ASR) or text-to-speech (TTS), many developers leverage simpler low-code building blocks when building complex robot systems.

Photograph of Spot, an intelligent robotic system, after it has successfully completed a drink order. — *Figure 1. Spot, a robotic dog, fetches a drink in real time after processing an order using ASR and TTS skills provided by NVIDIA Riva.*

For developers creating robotic applications with speech AI skills, this post breaks down the low-code building blocks provided by the NVIDIA Riva SDK.

By following along with the provided code examples, you learn how speech AI technology makes it possible for intelligent robots to take food orders, relay those orders to a restaurant employee, and finally navigate back home when prompted.

Design an AI robotic system using building blocks

Complex systems consist of several building blocks. Each building block is much simpler to understand on its own.

When you understand the function of each component, the end product becomes less daunting. If you’re using low-code building blocks, you can now focus on domain-specific customizations requiring more effort.

Our latest project uses “Spot,” a four-legged robot, and an NVIDIA Jetson Orin, which is connected to Spot through an Ethernet cable. This project is a prime example of using AI building blocks to form a complex speech AI robot system.

Architectural diagram of an AI robotics system with Riva low-code speech AI blocks shown as core components for platform, navigation, and interaction. — *Figure 2.* *A speech AI robot system with Riva low-code speech AI blocks to add ASR and TTS skills*

Our goal was to build a robot that could fetch us snacks on its own from a local restaurant, with as little intervention from us as possible. We also set out to write as little code as possible by using what we could from open-source libraries and tools. Almost all the software used in this project was freely available.

To achieve this goal, an AI system must be able to interact with humans vocally, perceive its environment (in our case, with an embedded camera), and navigate through the surroundings safely. Figure 2 shows how interaction, platform, and navigation represent our Spot robot’s three fundamental operation components, and how those components are further subdivided into low-code building blocks.

This post focuses solely on the human interaction blocks from the Riva SDK.

Add speech recognition and speech synthesis skills using Riva

We have so many interactions with people every day that it is easy to overlook how complex those interactions actually are. Speaking comes naturally to humans but is not nearly so simple for an intelligent machine to understand and talk.

Riva is a fully customizable, GPU-accelerated speech AI SDK that handles ASR and TTS skills, and is deployable on-premises, in all clouds, at the edge, and on embedded devices. It facilitates human-machine speech interactions.

Riva runs entirely locally on the Spot robot. Therefore, processing is secure and does not require internet access. It is also completely configurable with a simple parameter file, so no extra coding is needed.

Riva code examples for each speech AI task

Riva provides ready-to-use Python scripts and command-line tools for real-time transformation of audio data captured by a microphone into text (ASR, speech recognition, or speech-to-text) and for converting text into an audio output (TTS, or speech synthesis).

Adapting these scripts for compatibility with Open Robotics (ROS) requires only minor changes. This helps simplify the robotic system development process.

ASR customizations

The Riva OOTB Python client ASR script is named transcribe_mic.py. By default, it prints ASR output to the terminal. By modifying it, the ASR output is routed to a ROS topic and can be read by anything in the ROS network. The critical additions to the script’s main() function are shown in the following code example:

   inter_pub = rospy.Publisher('intermediate', String, queue_size=10)
   final_pub = rospy.Publisher('final', String, queue_size=10)
   rospy.init_node('riva_asr', anonymous=True)

The following code example includes more critical additions to main:

       for response in responses:
           if not response.results:
               continue
           partial_transcript = ""
           for result in response.results:
               if not result.alternatives:
                   continue
               transcript = result.alternatives[0].transcript
               if result.is_final:
                   for i, alternative in enumerate(result.alternatives):
                       final_pub.publish(alternative.transcript)
              else:
                  partial_transcript += transcript
           if partial_transcript:
               inter_pub.publish(partial_transcript)

TTS customizations

Riva also provides the talk.py script for TTS. By default, you enter text in a terminal or Python interpreter, from which Riva generates audio output. For Spot to speak, the input text talk.py script is modified so that the text comes from a ROS callback rather than a human’s keystrokes. The key changes to the OOTB script include this function for extracting the text:

def callback(msg):
   global TTS
   TTS = msg.data

They also include these additions to the main() function:

   rospy.init_node('riva_tts', anonymous=True)
   rospy.Subscriber("speak", String, callback)

These altered conditional statements in the main() function are also key:

       while not rospy.is_shutdown():
           if TTS != None:
               text = TTS

Voice interaction script

Simple scripts like voice_control.py consist primarily of the callback and talker functions. They tell Spot what words to listen for and how to respond.

def callback(msg):
   global pub, order
   rospy.loginfo(msg.data)
   if "hey spot" in msg.data.lower() and "fetch me" in msg.data.lower():
       order_start = msg.data.index("fetch me")
       order = msg.data[order_start + 9:]
       pub.publish("Fetching " + order)

def talker():
   global pub
   rospy.init_node("spot_voice_control", anonymous=True)
   pub = rospy.Publisher("speak", String, queue_size=10)
   rospy.Subscriber("final", String, callback)
   rospy.spin()

In other words, if the text contains “Hey Spot, … fetch me…” Spot saves the rest of the sentence as an order. After the ASR transcript indicates that the sentence is finished, Spot activates the TTS client and recites the word “Fetching” plus the contents of the order. Other scripts then engage a ROS action server instructing Spot to navigate to the restaurant, while taking care to avoid cars and other obstacles.

When Spot reaches the restaurant, it waits for a person to take its order by saying “Hello Spot.” If the ASR analysis script detects this sequence, Spot recites the order and ends it with “please.” The restaurant employee places the ordered food and any change in the appropriate container on Spot’s back. Spot returns home after Riva ASR recognizes that the restaurant staffer has said, “Go home, Spot.”

The technology behind a speech AI SDK like Riva for building and deploying fully customizable real-time speech AI applications deployable on-premises, in all clouds, at the edge, and embedded, brings AI robotics into the real world.

When a robot seamlessly interacts with people, it opens up a world of new areas where robots can help without needing a technical person on a computer to do the translation.

Deploy your own speech AI robot with a low-code solution

Teams such as NVIDIA, Open Robotics, and the robotics community, in general, have done a fantastic job in solving speech AI and robotics problems and making that technology available and accessible for everyday robotics users.

Anyone eager to get into the industry or to improve the technology they already have can look to these groups for inspiration and examples of cutting-edge technology. These technologies are usable through free SDKs (Riva, ROS, NVIDIA DeepStream, NVIDIA CUDA) and capable hardware (robots, NVIDIA Jetson Orin, sensors).

I am thrilled to see this level of community support from technology leaders and invite you to build your own speech AI robot. Robots are awesome!

For more information, see the following related resources:

Explore how to get started with integrating and deploying Riva ASR and TTS models in production with high-performance inference and minimal effort with the free ebook, Building Speech AI Applications.
Explore the NVIDIA Riva Quick Start Guide for Embedded Platforms for scripts and utilities for getting started with Riva Speech Skills on Embedded platforms.
Get started with downloading Riva, try guided Riva labs on ready-to-run infrastructure in LaunchPad, or get support for large-scale Riva deployments with NVIDIA AI Enterprise software.