Self-initiated university experiment
Research, Concept, Rapid Prototyping

Making dialect recognizable for speech recognition.


The way we pronounce words affects how "natural" communication feels to us. Would it be possible to develop a voice assistant that can also deal with very differently pronounced words? I wanted to experiment based on this question.

Laptop showing a sound visualization in the browser, with differently styled sound visualizations floating around it.

An unnatural way of speaking with voice assistants.

Problem observed

"Alexa, turn on the radio! Alexa, lower the light!" Even though voice recognition systems are already well developed and part of our daily lives, talking to them often feels unnatural. This might be caused by the adaptation of our voice to make these systems understand what we want or need.

A man speaking to a voice assistant.

Let's visualize pronunciation to generate training data.


How do you bring a word with the same meaning but a different pronunciation into a format that a system can analyze, without "typing" the word as it is pronounced?
Well, the idea was simple: draw an image of it!

p5.FFT: what is the amplitude at different frequency levels?
Comparing similarly pronounced letters in first tests.
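The core question p5.FFT answers (how much amplitude is present at each frequency) can be sketched without a browser as a plain discrete Fourier transform. This is a hypothetical standalone version for illustration; in the actual project, p5.sound's FFT does this on live microphone input:

```javascript
// Minimal DFT: amplitude per frequency bin, the same kind of
// information p5.FFT.analyze() returns in the browser.
function amplitudeSpectrum(samples) {
  const N = samples.length;
  const bins = [];
  for (let k = 0; k < N / 2; k++) {          // one bin per frequency up to Nyquist
    let re = 0, im = 0;
    for (let n = 0; n < N; n++) {
      const phi = (2 * Math.PI * k * n) / N;
      re += samples[n] * Math.cos(phi);
      im -= samples[n] * Math.sin(phi);
    }
    bins.push((2 / N) * Math.hypot(re, im)); // normalized amplitude
  }
  return bins;
}

// A pure sine wave with 4 cycles over 64 samples should peak at bin 4.
const N = 64;
const sine = Array.from({ length: N }, (_, n) => Math.sin((2 * Math.PI * 4 * n) / N));
const spectrum = amplitudeSpectrum(sine);
const peak = spectrum.indexOf(Math.max(...spectrum));
console.log(peak); // → 4
```

Drawing these per-bin amplitudes over time is what turns a spoken word into an image a visual model can compare.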

Train IBM Visual Recognition.


Now that sound and single words could be visualized, about 200 generated images were used to train a visual recognition model. For this, IBM Visual Recognition was used.

To keep it simple, only two pronunciation types (standard German & Franconian dialect) were chosen for training and testing. The chosen word was "potato" (ger. Kartoffel 🥲).
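The training setup, two pronunciation classes plus negative examples, boils down to sorting the generated images into example sets. A minimal sketch with hypothetical filenames (IBM Visual Recognition expects positive examples per class plus a shared set of negatives):

```javascript
// Hypothetical filenames for the generated spectrogram images.
const images = [
  'kartoffel_standard_01.png', 'kartoffel_standard_02.png',
  'kartoffel_franconian_01.png', 'kartoffel_franconian_02.png',
  'background_noise_01.png', 'silence_01.png',
];

// Group images into positive example sets per pronunciation class;
// everything else serves as negative examples for training.
function groupForTraining(files) {
  const sets = { standard: [], franconian: [], negative: [] };
  for (const f of files) {
    if (f.includes('_standard_')) sets.standard.push(f);
    else if (f.includes('_franconian_')) sets.franconian.push(f);
    else sets.negative.push(f);
  }
  return sets;
}

const sets = groupForTraining(images);
console.log(sets.standard.length, sets.franconian.length, sets.negative.length); // → 2 2 2
```

In the real experiment each set contained far more images (about 200 in total), and the sets were uploaded to the IBM service for training.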

Training the IBM model with "potato" pronounced in standard and Franconian German. Negative examples help to optimize training.
A person speaks to the interface, which draws an image and sends it to the IBM server. A few seconds later the result is visible.

Dialect is easily recognizable for a trained model when circumstances are similar.


The experiment ended with a model that was able to recognize whether the word was pronounced in Franconian or standard German. In total, the tested results were all correct, even if the confidence ratio fluctuated considerably from time to time. There are many uncertainties when it comes to different surroundings or more background noise.

Results remain correct when new images by other people are sent.

More data & effort needed. But it was fun!

Learnings & Outlook

1. Data is key! Huge data sets that also cover factors like voice tonality and background noise would be recommendable.

2. Just one word... uff. Using this method for a whole language might not be very efficient.

3. It was cool to use IBM's open-access services for experiments! It's also interesting not knowing what the model uses for its decision making.

4. It's sad that people are forced to speak standard German when interacting with the digital world, since dialects are already dying out. But keeping some distance between system and human might also be healthy. I'm still unsure whether I would go for a dialect-supporting system or not.

Thank you for watching.

Special Thanks

Thanks to Tim for helping me set up the code! Always nice to experiment with you!