Voice Ordering Is Innovating the Quick Serve Industry – a Technical Overview

Businesses embracing the digital transformation of Industry 4.0 are looking to AI to solve big problems and increase efficiency. Of the various frontiers in AI, voice is a major contender. The tsunami of voice-based AI technologies disrupting some of the largest markets in the world is being led by a veritable race to the top among digital assistants such as Amazon’s Alexa and Microsoft’s Cortana.

One such industry, Quick Serve Restaurants (generally referred to as “fast food” or “fast casual dining”), sought to use AI to improve the customer experience while increasing labor efficiency. The industry explored how AI could improve the drive-thru experience and provide new possibilities and analytical insights. Voice AI already permeates our households and businesses, but how does it work?

Voice-based AI technologies need a hardware-to-software connection within a single comprehensive solution that produces near-real-time output in a structured format (e.g., for a Point-of-Sale (POS) system or a database dashboard). Specialized hardware designed to handle speech recognition, noise and echo cancellation, and related tasks is required, and much of this technology is highly proprietary. The process of leveraging speech can be thought of in three distinct parts: transforming the sound into words, discerning the meaning of those words, and producing an action or response. Finally, once a system has captured, analyzed, and “understood” speech, it needs to deliver the result to the intended output as close to real time as possible.
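The three parts can be sketched as a chain of functions that ends in a structured payload. This is a minimal illustration under stated assumptions, not how any particular vendor builds its system: the stage functions (`transcribe`, `understand`, `act`), the `OrderResult` type, and the 0.5-second `LATENCY_BUDGET_S` are all hypothetical, and the speech-recognition stage is a stub that simply treats its input bytes as text.

```python
import json
import time
from dataclasses import dataclass, field

# Hypothetical end-to-end latency target, in seconds.
LATENCY_BUDGET_S = 0.5


@dataclass
class OrderResult:
    transcript: str
    intent: str
    items: list = field(default_factory=list)


def transcribe(audio: bytes) -> str:
    """Part 1 (stub): sound -> words. A real system would run a speech recognizer here."""
    return audio.decode("utf-8")  # stub: pretend the "audio" is already text


def understand(transcript: str) -> OrderResult:
    """Part 2 (stub): words -> meaning, using a trivial keyword check."""
    words = transcript.lower().split()
    intent = "add_item" if "burger" in words else "unknown"
    items = ["burger"] if intent == "add_item" else []
    return OrderResult(transcript, intent, items)


def act(result: OrderResult) -> str:
    """Part 3: meaning -> action, as a JSON payload a POS system could ingest."""
    return json.dumps({"intent": result.intent, "items": result.items})


def run_pipeline(audio: bytes) -> tuple[str, float]:
    """Run all three parts and measure how long the whole chain took."""
    start = time.perf_counter()
    payload = act(understand(transcribe(audio)))
    return payload, time.perf_counter() - start


payload, elapsed = run_pipeline(b"one burger please")
print(payload)  # {"intent": "add_item", "items": ["burger"]}
print("within budget:", elapsed < LATENCY_BUDGET_S)
```

Keeping each part behind its own function boundary makes it easier to swap in a real recognizer or language model later and to time each step against a near-real-time target.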

To better understand how sound can be turned into action, think of it as data. Sound is captured as an analog signal and digitized so that machine learning models can use it. When audio is recorded, the sample rate measures how often a data point (a sample) is captured, generally expressed in kilohertz (kHz), or thousands of samples per second. These samples are then fed into a statistical model to determine what the overall sound represents.
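To make sampling concrete, the sketch below samples a pure 440 Hz tone at 16 kHz (a rate commonly used for speech recognition) and quantizes each sample to a signed 16-bit integer, as in uncompressed PCM audio. The function name `sample_tone` and the chosen tone are illustrative assumptions; real recordings come from a microphone and an analog-to-digital converter rather than a formula.

```python
import math


def sample_tone(freq_hz: float, sample_rate_hz: int, duration_s: float) -> list[int]:
    """Sample a sine tone and quantize each sample to a signed 16-bit integer."""
    n_samples = round(sample_rate_hz * duration_s)  # samples = rate x duration
    max_amp = 2**15 - 1  # 32767, the largest positive 16-bit value
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz  # time of the n-th sample, in seconds
        samples.append(round(max_amp * math.sin(2 * math.pi * freq_hz * t)))
    return samples


# A 440 Hz tone sampled at 16 kHz for 10 ms.
pcm = sample_tone(440.0, 16_000, 0.01)
print(len(pcm))  # 160
```

The number of data points is the sample rate multiplied by the duration: 16,000 samples per second over 0.01 s gives 160 samples, each one a number a model can consume.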

The sound then needs to be separated into words, which are learned from a dataset in which recordings are labeled with the words they contain. Neural networks, a foundational technology in deep learning and AI, are a popular way of accomplishing this, especially when not all of the data is labeled with what the recording is supposed to represent. This completes only the first part of the process, but it is a significant hurdle by itself.
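One common way neural speech recognizers turn their output into words is Connectionist Temporal Classification (CTC), which lets a model train on recordings labeled only with a transcript, without a frame-by-frame alignment. At inference time, the network outputs a probability for each symbol (including a special “blank”) at every audio frame. The minimal greedy decoder below picks the most likely symbol per frame, merges consecutive repeats, and drops blanks. The per-frame probabilities here are made up by a hypothetical `frame` helper rather than produced by a trained network, and the tiny alphabet is an assumption.

```python
BLANK = "_"  # CTC blank symbol
ALPHABET = [BLANK, " ", "a", "c", "d", "g", "o", "t"]  # hypothetical symbol set


def frame(sym: str) -> dict[str, float]:
    """Hypothetical network output: 0.7 on `sym`, the remaining 0.3 split evenly."""
    rest = 0.3 / (len(ALPHABET) - 1)
    return {s: (0.7 if s == sym else rest) for s in ALPHABET}


def ctc_greedy_decode(frame_probs: list[dict[str, float]]) -> str:
    """Greedy CTC decoding: argmax per frame, merge repeats, then drop blanks."""
    best = [max(p, key=p.get) for p in frame_probs]  # most likely symbol per frame
    out = []
    prev = None
    for sym in best:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)


frames = [frame(s) for s in "cc_aa_t  dd_oo_g"]
text = ctc_greedy_decode(frames)
print(text.split())  # ['cat', 'dog']
```

The blank is what allows genuinely doubled letters to survive: frames reading `t_oo_o` decode to “too”, because the blank between the two groups of `o` frames stops them from being merged.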

Once the recordings have been transcribed into text, the next step is natural language processing (NLP). Depending on the application, various machine learning techniques can be used to extract meaning. From this stage, actions or responses can be automated directly by the solution. This alone may deliver tremendous value for the business; however, much more is possible once these pieces are in place (per Girish Khanzode).
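As an illustration of turning a transcript into something a POS system can act on, the sketch below fills quantity, size, and item “slots” from a drive-thru utterance. It deliberately swaps in a simple rule-based keyword matcher rather than a machine learning model, and the menu, number words, and `parse_order` function are hypothetical.

```python
import re

# Hypothetical menu vocabulary; a production system would use a trained NLU model.
MENU = {"cheeseburger", "fries", "coke"}
NUMBERS = {"a": 1, "an": 1, "one": 1, "two": 2, "three": 3}
SIZES = {"small", "medium", "large"}


def parse_order(transcript: str) -> list[dict]:
    """Extract (quantity, size, item) slots from a transcribed utterance."""
    order = []
    qty, size = 1, None
    for tok in re.findall(r"[a-z]+", transcript.lower()):
        if tok in NUMBERS:
            qty = NUMBERS[tok]
        elif tok in SIZES:
            size = tok
        else:
            # Crude singularization: "cheeseburgers" -> "cheeseburger", but keep "fries".
            item = tok[:-1] if tok.endswith("s") and tok[:-1] in MENU else tok
            if item in MENU:
                order.append({"item": item, "qty": qty, "size": size})
                qty, size = 1, None  # reset slots for the next item
    return order


print(parse_order("I'd like two cheeseburgers and a large coke"))
```

Each extracted dictionary is a structured line item that could be handed to a POS system, which is the “action” half of this stage.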

Given the amount of research and development being poured into improving the performance and power of voice-related technologies, it’s no surprise that new advances appear almost daily. It also becomes possible to offer customers a more personalized experience, armed with rudimentary knowledge of the individual. For example, with sentiment analysis a solution could help turn a poor interaction into a positive one by recommending a verbal response or course of action to an employee in real time. It is also possible to use vocal patterns to gain insight into otherwise anonymous transactions!
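The sentiment-analysis example above can be sketched with a very small lexicon-based scorer that counts positive and negative words in a transcribed utterance and maps the score to a suggested prompt for the employee. This is a simplified stand-in for the trained sentiment models real systems typically use; the word lists, the `suggest_response` function, and the prompts it returns are all hypothetical.

```python
import re

# Hypothetical word lists; real systems typically use trained sentiment models.
POSITIVE = {"great", "thanks", "perfect", "love", "good"}
NEGATIVE = {"wrong", "cold", "slow", "forgot", "missing", "terrible"}


def sentiment_score(utterance: str) -> int:
    """Number of positive words minus number of negative words."""
    words = re.findall(r"[a-z']+", utterance.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def suggest_response(utterance: str) -> str:
    """Map the score to a hypothetical real-time prompt for the employee."""
    score = sentiment_score(utterance)
    if score < 0:
        return "Apologize and offer to remake the item"
    if score > 0:
        return "Thank the customer"
    return "Continue the order"


print(suggest_response("My fries are cold and the order is wrong"))
```

A word-count lexicon misses sarcasm and negation (“not great”), which is one reason production systems lean on learned models, but the flow from transcript to score to recommendation is the same.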

When AI is used to transcribe customer orders and requests, the advantages to both the customer and the business are easy to imagine. Innovation is particularly important in competitive markets, both to maintain market share and to give customers the best possible experience. The extent to which voice is transforming industries now and in the future quickly becomes an exercise of the imagination.