I have been looking for TensorFlow models pre-trained on speech
data, preferably in js/python, that I can use to extract embeddings
from streaming or recorded audio up to 1 min long.
I intend to use the embeddings as an input to my machine
So far, I have found only this:
This is trained to classify 20 voice commands, so I suspect the
embeddings from this model may not have sufficient discriminative
power to identify, say, phonemes, or 1000 words each from
English, French and a few other popular languages.
I am not worried about the embedding->word mapping. At the
current stage, I am happy to use the embeddings to compute a
similarity score between two sound samples. For example, I am not
worried about resolving the confusion between ‘red’ and ‘read’
(past tense). In fact, ‘I read a red book’ and ‘Eye red a read
buk’ should result in a 95+% match.
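To make the scoring part concrete, here is a rough sketch of what I have in mind, assuming the model returns one vector per audio frame. The helper names (mean_pool, cosine_similarity) are my own and not from any library, and using cosine similarity as the "match" score is just my assumption; a score of about 0.95 would correspond to the 95% figure above.

```python
import math

def mean_pool(frames):
    """Average per-frame vectors (equal-length lists) into one clip-level vector."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; returns 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy stand-ins for per-frame embeddings of two clips (hypothetical values).
clip_a = mean_pool([[1.0, 2.0], [3.0, 4.0]])
clip_b = mean_pool([[2.0, 3.0], [2.0, 3.0]])
score = cosine_similarity(clip_a, clip_b)
```

Mean pooling throws away word order, so it may be too crude for whole sentences; an alignment method such as dynamic time warping over the frame vectors might be a better fit, but I have not tried either.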
Any hints or pointers are also greatly appreciated. Perhaps
there are simpler ways to achieve the same thing.