Okay, here's some brutal feedback; please take it as given with your best interests at heart. I am a native English speaker who has lived in Spain for 10 years and has become fluent.
1) These speech-to-text models are poor when it comes to non-natives. This is unfortunate, as the idea you had and the product you've designed could be incredible for language learning. However, it's a bit crap, sorry. I can speak Spanish well and was asked in the conversation if I wanted a medium-sized cup of coffee. I replied "sí, mediano", the resulting text came out as "mariano", and in the role play the coffee shop worker then assumed my name was Mariano! Completely ludicrous and frustrating.... In real life the coffee shop worker is clearly expecting the word 'mediano', will hear what I said, and will know that's what I was trying to say. The speech-to-text model completely fails to get this.
Until speech-to-text models trained on non-natives are made readily available, products like this with so much promise will infuriate learners, which will stop them paying for it.
And this was ordering a coffee.... imagine an actually complicated conversation.
So my advice would be, right now the speech-to-text models aren't capable of doing what you're hoping they can do... but.... once you get a model that can, this will be insanely popular....
So hang in there, other than that it was a fun experience, and critically, people are scared of practising with real people, something like this would be insanely popular if it actually worked well. Good luck.
Eh, I disagree. It's not perfect, but that's just about expectation management. Users don't expect voice-to-text to be perfect; in fact, for the past 10 years shitty experiences have been the norm. I think it depends on how many mistakes the transcription makes, but that's only going to improve with time, with advances in AI and as the product evolves (using the context of the conversation, like you say, would be a great start).
Even in its current state this is an awesome product. There are so, so many people in the world learning a language, and one of the hardest parts is practicing after you stop learning (like you leaving Spain). People like that will really love something like this.
In future, especially to cater to people who are still learning, it should be feasible to use a similar product to train and correct people's pronunciation.
Speech recognition is far from perfect but even then it's incredibly useful. It CAN be infuriating (or downright hilarious). Hell, it's infuriating when occasionally a (human!) waiter switches to English after I mangled a sound in Hungarian, even though I'm C1/advanced in the language.
Problem or opportunity? Like someone else pointed out, the limitations of speech-to-text can be turned into an opportunity to improve one's pronunciation. It's getting extremely good for native speakers. As foreign speakers it's a chance to improve.
In any case, I'm sure I can add a layer or two to the code to reduce misunderstandings. This is actually exciting!
p.s. as mentioned in the OP, feedback on pronunciation is planned (actually in the works).
It completely misunderstood me as well in Spanish; it actually transcribed my speech as English instead of Spanish. I like the idea, but this is not working at all for me, at least for now.
For the creator: Do not get discouraged, I hope you do get this working properly and see a lot of traction.
Exactly the level of quality we expect when the product idea starts out with "AI" and "scalable" and completely forgets (or doesn't bother to think about) what beginning students actually need.
I also had difficulty getting it to understand me. There are a couple of solutions I can think of that may make this more usable:
1) Put the speech-to-text output into an input field and allow the user to modify it before submitting.
2) I presume this uses an LLM to generate the responses. Submit the new text and give it the entire convo as context, but first ask it to "correct" the text to what would make sense in context, based on similar-sounding words.
Edit: Hah oh it's not too great right now at all. Tried it again and it ended up writing Cyrillic as my response despite me speaking Spanish.
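For what it's worth, here is a rough sketch of the idea behind suggestion 2, using plain spelling similarity from Python's standard library as a cheap stand-in for the LLM. It is not how the app works; `snap_to_expected`, the `expected_words` list and the cutoff value are all made up for illustration, and spelling similarity is only a crude proxy for "similar sounding".

```python
import difflib
import re


def snap_to_expected(transcript: str, expected_words: list[str], cutoff: float = 0.6) -> str:
    """Replace words that look like a near miss of a word the scene expects.

    `expected_words` is a hypothetical per-scene vocabulary (lowercase), e.g.
    the cup sizes the barista just offered. Words with no close match are kept.
    """

    def fix(match: re.Match) -> str:
        word = match.group(0)
        if word.lower() in expected_words:
            return word  # already an expected word, leave it alone
        close = difflib.get_close_matches(word.lower(), expected_words, n=1, cutoff=cutoff)
        return close[0] if close else word

    # \w+ matches accented letters too, so "sí" is treated as one word.
    return re.sub(r"\w+", fix, transcript)


# Hypothetical coffee shop scene: the barista asked which size I wanted.
sizes = ["pequeño", "mediano", "grande"]
print(snap_to_expected("sí, mariano", sizes))
```

A real version would probably want a phonetic comparison or the LLM pass described above, since a low cutoff will also "correct" words the learner actually meant to say.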
Slightly off topic, but I could imagine that what you are alluding to, the expectation of certain words or phrases depending on the context of the conversation, could be used to improve speech-to-text models. The speech could be parsed into multiple candidate transcripts, which can then be ranked by a language model using the conversation context.
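A minimal sketch of that reranking idea, assuming the recognizer can return several candidate transcripts with a score each. Everything here is hypothetical: the `Hypothesis` type, the scores, and especially `context_score`, which is just keyword overlap standing in for a real language model.

```python
import re
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    acoustic_score: float  # higher is better, e.g. a log-probability from the recognizer


def context_score(text: str, expected_words: set[str]) -> int:
    """Toy stand-in for a language model: count words the conversation expects."""
    words = set(re.findall(r"\w+", text.lower()))
    return len(words & expected_words)


def rerank(hypotheses: list[Hypothesis], expected_words: set[str],
           context_weight: float = 1.0) -> Hypothesis:
    """Pick the hypothesis with the best combined acoustic + context score."""
    return max(
        hypotheses,
        key=lambda h: h.acoustic_score + context_weight * context_score(h.text, expected_words),
    )


# Made-up scores: the recognizer slightly prefers the wrong transcript.
candidates = [Hypothesis("sí, mariano", -0.2), Hypothesis("sí, mediano", -0.5)]
best = rerank(candidates, {"pequeño", "mediano", "grande"})
print(best.text)
```

The `context_weight` knob decides how much the conversation is allowed to overrule what the recognizer heard; set too high, it would hide genuine mistakes from the learner.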
It does and that's indeed Whisper I'm currently using. I do have mixed feelings about it:
- On the one hand, it performs well in so many cases… and having multilingual support built-in is great!
- On the other hand: there's actually NO OPTION in Whisper to recognize just two languages. You either recognize ONE language or ANY language with it, which can cause issues depending on one's pronunciation and the language at hand.
Will definitely turn OFF multilingual speech recognition by default, because the huge majority of negative reactions in this thread stem from this.
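One possible workaround for the two-language case, sketched under assumptions: if the recognizer exposes per-language probabilities (as far as I know the open-source `whisper` package's `detect_language` returns such a mapping, but treat that as an assumption), the app can pick the most likely language from an allowed set and then force that single language for transcription. `pick_allowed_language` and the probabilities below are made up for illustration.

```python
def pick_allowed_language(language_probs: dict[str, float], allowed: set[str]) -> str:
    """Return the most probable language code among the allowed ones.

    `language_probs` maps language codes to probabilities, as a language
    detection step might produce. Languages outside `allowed` are ignored,
    even if the detector ranked them highest.
    """
    candidates = {lang: p for lang, p in language_probs.items() if lang in allowed}
    if not candidates:
        raise ValueError("none of the allowed languages were scored")
    return max(candidates, key=candidates.get)


# Made-up detector output for a learner's accented Spanish.
probs = {"ru": 0.6, "es": 0.3, "en": 0.1}
language = pick_allowed_language(probs, allowed={"es", "en"})
print(language)
```

The chosen code would then be passed as the single language for the actual transcription call, so a stray top guess like Russian never reaches the learner.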
Yes, with German something similar happened and it misinterpreted "bitte" (please) as Peter and called me Peter from then on. I know my German is far from fluent, but I'm pretty sure what I said sounded more like bitte than Peter!