Speech AI Spotlight: Visualizing Spoken Language and Sounds on AR Glasses

Audio can include a wide range of sounds, from human speech to non-speech sounds like barking dogs and sirens. When designing accessible applications for people with hearing difficulties, the application should be able to recognize sounds and understand speech.

Such technology helps deaf or hard-of-hearing individuals by visualizing both speech, such as conversations, and non-speech sounds. By combining speech AI and sound AI, you can overlay these visualizations onto AR glasses, so users can see and interpret sounds that they would otherwise be unable to hear.

According to the World Health Organization, about 1.5B people (nearly 20% of the global population) live with hearing loss. This number could rise to 2.5B by 2050.

Cochl, an NVIDIA partner based in San Jose, is a deep-tech startup that uses sound AI technology to understand any type of audio. Cochl is also a member of the NVIDIA Inception Program, which helps startups build their solutions faster by providing access to cutting-edge technology and NVIDIA experts.

The company's Cochl.Sense platform can recognize 37 environmental sounds, and the company went one step further by adding speech-to-text technology, so the platform now covers both speech and non-speech audio.

AR glasses to visualize any sound

AR glasses have the potential to greatly improve the lives of people with hearing loss as an accessible tool to visualize sounds. This technology can help enhance their communication abilities and make it easier for them to navigate and participate in the world around them.

Video 1. Cochl.Sense and NVIDIA Riva working on Microsoft HoloLens 2!

In this scenario, automatic speech recognition (ASR) is used to enable the glasses to recognize and understand human speech. This technology can be integrated into the glasses in several ways:

  • A microphone captures the speech of a person talking to a deaf or hard-of-hearing individual, and ASR transcribes it into text. The text is then displayed on the glasses so that the wearer can read and understand what was said.
  • ASR also enables the glasses to respond to voice commands, so users can control the glasses with their voice.
  • The glasses can display all conversations on screen, including voice directions from a maps application while you drive, as well as other sounds such as horns, sirens from emergency vehicles, and wind noise.
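
As a rough illustration of how transcribed speech and recognized sounds might share one display, here is a minimal Python sketch. It is not Cochl's implementation: the Transcript and SoundEvent types, the URGENT_SOUNDS label set, the confidence threshold, and the three-line limit are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical label set: sounds the wearer probably wants surfaced first.
URGENT_SOUNDS = {"siren", "car_horn"}


@dataclass
class Transcript:
    start: float  # seconds since capture began
    text: str     # ASR output for one utterance


@dataclass
class SoundEvent:
    start: float       # seconds since capture began
    label: str         # e.g. "siren", "dog_bark"
    confidence: float  # 0.0 to 1.0, as reported by the sound classifier


def build_overlay(
    transcripts: List[Transcript],
    events: List[SoundEvent],
    min_confidence: float = 0.5,
    max_lines: int = 3,
) -> List[str]:
    """Return caption lines to draw: urgent sounds first, then newest items."""
    items = []
    for t in transcripts:
        items.append((False, t.start, f'"{t.text}"'))
    for e in events:
        if e.confidence < min_confidence:
            continue  # drop low-confidence detections to reduce clutter
        urgent = e.label in URGENT_SOUNDS
        marker = "!" if urgent else "~"
        items.append((urgent, e.start, f"[{marker} {e.label.replace('_', ' ')}]"))
    # Sort urgent items to the top, then most recent first within each group.
    items.sort(key=lambda item: (not item[0], -item[1]))
    return [text for _, _, text in items[:max_lines]]


if __name__ == "__main__":
    lines = build_overlay(
        [Transcript(0.0, "Turn left ahead"), Transcript(2.0, "Watch out")],
        [SoundEvent(1.0, "siren", 0.9), SoundEvent(1.5, "wind", 0.3)],
    )
    for line in lines:
        print(line)
```

Putting urgent sounds ahead of speech is one possible design choice. A real product would tune the ordering, thresholds, and layout with its users.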

The technology behind the solution

Cochl used NVIDIA Riva to power the ASR capabilities within its software stack. Riva is a GPU-accelerated, fully customizable SDK for developing speech AI applications. With Riva handling speech, the platform can now understand a wide range of audio, covering both speech and non-speech sounds.

“We’ve tested lots of speech recognition services, but only Riva provided exceptionally high and stable real-time performance. So now we can make our sound AI system be closer to human auditory perception,” said Yoonchang Han, co-founder and CEO at Cochl.

“As we have observed, AR glasses are most likely to be used in open spaces with noisy environments. NVIDIA Riva has helped us transcribe speech accurately even in noisy environments and has given us a seamless experience to integrate into our Cochl.Sense platform.”
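
Real-time captioning usually means handling streaming results. Streaming speech recognizers commonly emit interim hypotheses that change as more audio arrives, followed by a final result for each utterance. The sketch below shows one way a caption line could be kept stable on a small display under that assumption. The CaptionBuffer class and its max_chars budget are hypothetical and are not part of Riva or Cochl.Sense. In a real application, the (text, is_final) pairs would come from the ASR service.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CaptionBuffer:
    """Holds committed caption text plus the current, still-changing hypothesis."""

    max_chars: int = 80  # hypothetical display budget; assumed to be greater than 3
    final_parts: List[str] = field(default_factory=list)
    interim: str = ""

    def update(self, text: str, is_final: bool) -> None:
        """Apply one streaming result from the recognizer."""
        text = text.strip()
        if is_final:
            if text:
                self.final_parts.append(text)  # commit the finished utterance
            self.interim = ""                   # the hypothesis is now obsolete
        else:
            self.interim = text                 # overwrite, never append

    def render(self) -> str:
        """Return the line to draw, truncated so the newest words stay visible."""
        parts = self.final_parts + ([self.interim] if self.interim else [])
        line = " ".join(parts)
        if len(line) > self.max_chars:
            line = "..." + line[-(self.max_chars - 3):]
        return line


if __name__ == "__main__":
    buf = CaptionBuffer(max_chars=20)
    for text, is_final in [
        ("turn", False),
        ("turn left", False),
        ("Turn left.", True),
        ("Then go straight.", True),
    ]:
        buf.update(text, is_final)
        print(buf.render())
```

Overwriting the interim text instead of appending it avoids showing the same words twice when the recognizer revises its hypothesis.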

Future of assistive technology

Creating a generalized AI system that perceives sounds like humans is a huge challenge. To make AR glasses more accessible, lighter wearable technology is required.

Even so, AR glasses are already an ideal medium for translating sounds and speech into visual information. With machine listening functionality integrated, they can make daily life safer, more convenient, and more enjoyable for deaf or hard-of-hearing people around the world.

Cochl is also exploring more use cases for speech AI, such as offering closed captioning for any video on AR glasses and visualizing multi-speaker transcriptions. To provide the best experience for individuals with hearing difficulties, the company is exploring ways to analyze and visualize music so that users can understand, at a minimum, its genre and emotion.

They are excited to experiment with more NVIDIA solutions including Riva, NVIDIA NeMo, and NVIDIA TensorRT.

Get started with speech AI today

Interested in adding speech AI to your AR or VR applications? Browse these resources to get started:

  • To learn about speech AI, from end-to-end pipeline basics to developing your first speech AI application, see the free ebooks.
  • To gain knowledge on how to customize a speech recognition pipeline for your application, see the self-paced Get Started with Highly Accurate Custom ASR for Speech AI course.
  • To gain insight on how to incorporate speech AI into your XR, VR, and AR applications, see Developing the Next Generation for Extended Reality Applications with Speech AI.

Source: NVIDIA