We present the Factorial Deep Markov Model (FDMM) for representation learning of speech. The FDMM learns disentangled, interpretable, and lower-dimensional latent representations from speech without supervision. We use a static and a dynamic latent variable to exploit the fact that information in a speech signal evolves at different time scales. Latent representations learned by the FDMM outperform a baseline i-vector system on speaker verification and dialect identification, while also reducing the error rate of a phone recognition system in a domain-mismatch scenario.
A Factorial Deep Markov Model for Unsupervised Disentangled Representation Learning from Speech
Sameer Khurana, Shafiq Joty, Ahmed Ali, and James Glass. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP'19), pages 6540-6544, 2019.