I’ve been looking for a cloud speech recognition capability that works over the plain old telephone network (PSTN), but it looks like there are only three options that I’m aware of.
Cisco Tropo ASR
Tropo Automatic Speech Recognition (ASR) has been on the market for over 2 years. I have sample code that you can try on GitHub (https://github.com/julianfrank/tropotry). The limitation I felt with this solution is that it is a word-spotting solution: you provide a set of words or sentences that you expect to hear, and Tropo ASR matches what the caller says against that set. While this would have been cool a decade ago, we live in an era where the Google Chrome browser can convert spoken sentences to text almost immediately, even on budget phones! I also found Tropo ASR to have an average delay of 3-7 seconds between the user finishing their utterance and the ASR returning its detection result. So I give this one a skip.
The second option is one I have seen in some trials: take the ‘Recording’ available in Twilio’s Gather verb, send it to a cloud-based speech-to-text provider such as IBM Watson, and then use the result. While this solution works, the UX suffers because the delay is upwards of 10 seconds! So this too is not a practical solution.
Twilio Speech to Text
Twilio has now announced its own Speech to Text capability as part of its Gather verb, and it worked perfectly in my tests. The service is in beta right now and costs about $0.02 per 15-second batch of speech. Of course, the rate reduces with volume.
In this blog I’ll show how to quickly try out Twilio Speech to Text in 10-15 minutes.
I’m going to use the Glitch web IDE to build the code. You can remix your own version from the URL provided to get started even more quickly.
I’m going to make a very simple app that will receive the call, look at the result, tell the caller what it heard, and then disconnect. Simple. As per the Gather documentation, there are three important receivers that we need to build in our app.
Call Hook -> This is the URL that you configure on the Twilio portal as part of the number procurement. When Twilio gets a call on this number, it immediately sends the available call details to the ‘call-hook’ URL. For this blog I’m going to use ‘https://twiliostt.glitch.me/call’. You can replace the URL as per your solution later.
This URL needs to return a TwiML XML message with the next steps. I plan to just tell Twilio to gather the speech and send it back to the ‘action’ URL.
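A minimal sketch of the TwiML the call hook could return, written as plain Node.js without any framework. The function names, the `/action` path and the prompt text are assumptions for illustration and are not taken from the linked repo; only `Gather`, `input="speech"` and `action` come from the Gather verb itself.

```javascript
// Escape characters that are unsafe inside XML attribute values and text.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Build the TwiML for the call hook: prompt the caller, gather speech,
// and ask Twilio to POST the transcription to `actionUrl`.
// Both parameters are hypothetical inputs chosen for this sketch.
function buildCallHookTwiml(actionUrl, prompt) {
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    `  <Gather input="speech" action="${escapeXml(actionUrl)}" method="POST">`,
    `    <Say>${escapeXml(prompt)}</Say>`,
    "  </Gather>",
    "</Response>",
  ].join("\n");
}

// Example usage: "/action" is an assumed path on the same host.
const callHookTwiml = buildCallHookTwiml("/action", "Please say something.");
console.log(callHookTwiml);
```

Whatever web framework serves the ‘call’ URL would send this string back with an XML content type.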
action -> This is the URL that Twilio contacts when the user has stopped speaking and Twilio has done the speech-to-text conversion. In my tests this took less than a second. Twilio expects TwiML in return, telling it what to do next. I intend to just parse the incoming message for SpeechResult and Confidence and speak them back.
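The action receiver described above could look roughly like this. It is a sketch under a few assumptions: the request body arrives form-encoded (Twilio's default for POST webhooks), Confidence is a number between 0 and 1, and the function names and spoken wording are made up for the example.

```javascript
// Escape text before placing it inside a TwiML element.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Pull SpeechResult and Confidence out of the form-encoded body.
function parseSpeechResult(formBody) {
  const params = new URLSearchParams(formBody);
  return {
    speechResult: params.get("SpeechResult") || "",
    confidence: Number(params.get("Confidence") || 0),
  };
}

// Build TwiML that reads the result back to the caller and hangs up.
function buildActionTwiml(formBody) {
  const { speechResult, confidence } = parseSpeechResult(formBody);
  const percent = Math.round(confidence * 100);
  const message = speechResult
    ? `I heard: ${speechResult}. Confidence ${percent} percent.`
    : "Sorry, I did not hear anything.";
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    `  <Say>${escapeXml(message)}</Say>`,
    "  <Hangup/>",
    "</Response>",
  ].join("\n");
}

// Example with a hand-written body standing in for Twilio's request.
const sampleBody = "SpeechResult=hello+world&Confidence=0.92";
console.log(buildActionTwiml(sampleBody));
```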
partialResultCallback -> This is the most exciting part and is similar to Google’s StreamingRecognize solution. Here Twilio sends the text (UnstableSpeechResult) while the user is still speaking, instead of waiting for the user to stop. This feature could be useful if your solution is doing some real-time NLP.
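As a rough sketch of the partial-result receiver, the handler below keeps the latest interim text per call so that some real-time logic could inspect it. It assumes the Gather element also carries a `partialResultCallback` attribute pointing at this receiver, that the body is form-encoded, and that each request includes `CallSid` alongside `UnstableSpeechResult`; the function and variable names are hypothetical.

```javascript
// Latest interim transcript for each active call, keyed by CallSid.
const latestPartialByCall = new Map();

// Record the interim text from one partialResultCallback request.
function handlePartialResult(formBody) {
  const params = new URLSearchParams(formBody);
  const callSid = params.get("CallSid") || "unknown-call";
  const partialText = params.get("UnstableSpeechResult") || "";
  latestPartialByCall.set(callSid, partialText);
  return { callSid, partialText };
}

// Example: two hand-written requests arriving as the caller keeps talking.
handlePartialResult("CallSid=CA123&UnstableSpeechResult=book+a");
handlePartialResult("CallSid=CA123&UnstableSpeechResult=book+a+table");
console.log(latestPartialByCall.get("CA123"));
```

A real app would also remove the entry once the call ends, so the map does not grow without bound.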
Show me the Code
I have used a simple Swagger-based Node.js app to implement this. You can find the code at https://github.com/julianfrank/twiliostt.
Hope this helps ….