Abstract

Automatic sung speech recognition is a challenging problem that remains largely unsolved. The challenges stem both from the intrinsically poor intelligibility of sung speech and from the difficulty of separating the vocals from the musical accompaniment. In recent years, deep neural network techniques have revolutionised spoken speech recognition systems through advances in both acoustic modelling and audio source separation.

This thesis evaluates whether these new techniques can be adapted to work for sung speech recognition. To this end, it first presents an analysis of the differences between spoken and sung speech. Then, motivated by this analysis, the thesis makes four major contributions.

First, the thesis addresses the lack of large, standardised sung speech datasets suitable for evaluating sung speech recognition. The opportunity to build such a dataset has recently arisen with the release of Smule’s DAMP-MVP dataset, a large collection of unaccompanied karaoke performances. However, constructing a well-balanced and easy-to-use evaluation dataset from this weakly-labelled and weakly-annotated data presents many challenges. This thesis presents solutions to these challenges.

Second, the thesis reconsiders the problem of sung speech acoustic modelling. New musically-motivated features are considered to capture vocal source information, including pitch, voicing degree, voice quality, and beat-based features. It is shown that pitch and voicing degree features are useful for improving recognition performance.

Third, accompanied sung speech recognition poses a challenging source separation problem. This thesis investigates the use of modern time-domain source separation networks. It also investigates whether ‘speaker embedding’ ideas can be employed for music source separation by considering the use of ‘instrument’ embeddings.

Finally, a complete system that combines the deep neural network based source separation and speech recognition components is jointly evaluated, dealing with the mismatch between the distorted sung speech produced by the separation network and the ‘clean’ sung speech used for acoustic modelling.