André Woodley Jr.

founder. engineer. creative.

Skipping Speech to Text

Nearly all voice applications translate speech to text. Voice applications do much of their analysis on that text, yet we still return to the audio itself for deeper processing.

For example, we can analyze the text to determine intent, but we need to analyze the wave patterns of the audio itself to get "real" sentiment analysis. I wonder, though, if translating to text first is the wrong approach.
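To make "analyzing the wave patterns" a little more concrete, here is a rough sketch of the kind of prosodic cues you can pull straight from a waveform with no transcript involved. The frame size and the specific features (RMS energy and zero-crossing rate) are assumptions for illustration, not a claim about how production sentiment models work.

```python
# Illustrative only: crude prosodic features pulled straight from a waveform,
# the kind of signal a transcript throws away. Frame length and the choice of
# features (RMS energy, zero-crossing rate) are assumptions for this sketch.
import numpy as np

def frame_features(samples: np.ndarray, sample_rate: int, frame_ms: int = 25):
    """Return per-frame RMS energy and zero-crossing rate for a mono clip."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    energies, zcrs = [], []
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energies.append(float(np.sqrt(np.mean(frame ** 2))))                 # loudness proxy
        zcrs.append(float(np.mean(np.abs(np.diff(np.sign(frame))) > 0)))     # pitch/noisiness proxy
    return np.array(energies), np.array(zcrs)

if __name__ == "__main__":
    # Synthetic one-second clip at 16 kHz standing in for real speech audio.
    sr = 16000
    t = np.linspace(0, 1, sr, endpoint=False)
    clip = 0.5 * np.sin(2 * np.pi * 220 * t)
    energy, zcr = frame_features(clip, sr)
    print(f"{len(energy)} frames, mean energy {energy.mean():.3f}, mean ZCR {zcr.mean():.3f}")
```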

I am learning Spanish, and right now I translate the Spanish words I hear into English before processing them further. This multi-step process is time consuming and delays my response. My friend Yoni at Mister C's Barbershop dealt with the same issue when he had to pick up English as a Spanish speaker. He said he didn't really learn English until he challenged himself to stop translating in his head and instead associated the words with the objects themselves. At that point he could skip translating and respond extremely fast because he had essentially learned the language - we do this with our native tongue.

What if we took the same approach with audio in voice AI applications? We could skip translating speech to text, understanding what is being said by analyzing the audio itself and comparing it to past records in a vector database - records of audio, translations, intent, speech-to-text output, and so on. If we don't have a similar record, we translate speech to text (aka inquire).
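As a rough sketch of what that lookup might look like: embed each incoming clip, search past records for a close match, and only fall back to speech to text when nothing similar exists. The embed_audio and transcribe functions below are hypothetical stand-ins for whatever embedding model and STT service you use, the in-memory AudioMemory class stands in for a real vector database, and the 0.85 threshold is an assumed value you would tune.

```python
# A minimal sketch of the "skip speech-to-text" path. embed_audio() and
# transcribe() are hypothetical placeholders, and AudioMemory is an in-memory
# stand-in for a real vector database.
import numpy as np

def embed_audio(clip: np.ndarray) -> np.ndarray:
    """Placeholder: a real system would use a learned audio embedding model."""
    return clip[:512] / (np.linalg.norm(clip[:512]) + 1e-9)

def transcribe(clip: np.ndarray) -> str:
    """Placeholder: the expensive speech-to-text call we are trying to avoid."""
    return "<transcript from STT>"

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

class AudioMemory:
    """In-memory stand-in for a vector database of past clips and their labels."""
    def __init__(self):
        self.records = []   # each: {"embedding", "transcript", "intent", "sentiment"}

    def add(self, embedding, **labels):
        self.records.append({"embedding": embedding, **labels})

    def nearest(self, embedding):
        if not self.records:
            return None, 0.0
        best = max(self.records, key=lambda r: cosine(embedding, r["embedding"]))
        return best, cosine(embedding, best["embedding"])

def understand(clip: np.ndarray, memory: AudioMemory, threshold: float = 0.85) -> dict:
    """Reuse a past record when the audio is similar enough; otherwise 'inquire'
    via speech to text and remember the result for next time."""
    embedding = embed_audio(clip)
    record, score = memory.nearest(embedding)
    if record is not None and score >= threshold:
        return record                      # skip speech to text entirely
    transcript = transcribe(clip)          # fall back: translate speech to text
    record = {"embedding": embedding, "transcript": transcript,
              "intent": None, "sentiment": None}   # labels filled in by later analysis
    memory.add(**record)
    return record
```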

If this works, data is once again at the top of the list, and this approach could significantly reduce the cost and time required for voice AI applications to understand and respond.

What is also interesting is that it follows the way humans learn. We need to be exposed to things, directed, and corrected for a certain amount of time before we understand a language.

My final quick thoughts:

- We are in the early days of voice AI, so protect your data. Companies with the data will significantly outpace those with little.
- Speech to text should be a beginning step. Keep the raw audio.
- Cut the raw audio into clips and pair each clip with its translation, sentiment, intent, etc.
- Store the raw audio in a vector database so it can be queried by similarity (a rough sketch of this step follows below).
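Here is one way that last pairing-and-indexing step could look, under the same assumptions as the sketch above: split a recording into clips on silence, attach whatever labels you already produce for each clip, and keep the raw audio in the record so it can be embedded and queried later. The silence threshold and record layout are illustrative choices, not a prescription.

```python
# Illustrative pipeline for the "keep the raw audio" advice: split a recording
# into clips on silence, pair each clip with its labels, and keep everything
# in one record ready to upsert into a vector database. Thresholds are guesses.
import numpy as np

def split_on_silence(samples: np.ndarray, sample_rate: int,
                     frame_ms: int = 25, silence_rms: float = 0.01):
    """Yield (start, end) sample indices of non-silent clips."""
    frame_len = int(sample_rate * frame_ms / 1000)
    in_clip, start = False, 0
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        loud = np.sqrt(np.mean(samples[i:i + frame_len] ** 2)) > silence_rms
        if loud and not in_clip:
            in_clip, start = True, i
        elif not loud and in_clip:
            in_clip = False
            yield start, i
    if in_clip:
        yield start, len(samples)

def build_records(samples: np.ndarray, sample_rate: int, labeler):
    """Pair every clip with its labels; `labeler` stands in for your STT/NLU calls."""
    records = []
    for start, end in split_on_silence(samples, sample_rate):
        clip = samples[start:end]
        labels = labeler(clip)   # e.g. {"transcript": ..., "intent": ..., "sentiment": ...}
        records.append({
            "audio": clip,                  # keep the raw audio, not just the text
            "start_sample": start,
            "end_sample": end,
            **labels,
        })
    return records
```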