The Sheffield University System for the MIREX 2020: Lyrics Transcription Task.
Published in MIREX, 2020
Recommended citation: Roa Dabike, Gerardo and Barker, Jon. (2020). "The Sheffield University System for the MIREX 2020: Lyrics Transcription Task." MIREX 2020.
This extended abstract describes the system we submitted to the MIREX 2020 Lyrics Transcription task. The system consists of two modules: a source separation front-end and an ASR back-end. The first module separates the vocals from a polyphonic song using a convolutional time-domain audio separation network (Conv-TasNet). The second module transcribes the lyrics from the separated vocals using a factored-layer time-delay neural network (fTDNN) acoustic model and a 4-gram language model. Both the separation and the ASR modules are trained on large open-source singing corpora, namely Smule DAMP-VSEP and Smule DAMP-MVP. Pre-processing the audio with the separation module reduced the transcription error by roughly 11% absolute WER for polyphonic songs compared with transcription without vocal separation. However, the best WER achieved was 52.06%, which is still very high compared with the WERs as low as 19.60% that we achieved previously for unaccompanied singing.
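The error figures above are word error rates (WER), i.e. the word-level edit distance between the reference lyrics and the hypothesis, normalised by the reference length. As a reference point only (this helper is illustrative, not code from the submission), the metric can be sketched as:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length,
    computed with a word-level Levenshtein distance via dynamic programming."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match / substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# one substituted word out of four reference words -> WER 0.25
print(round(wer("the answer lies within", "the answer lies in"), 2))  # → 0.25
```

A WER of 52.06% therefore means that, on average, slightly more than half of the reference words require an edit, which is why polyphonic lyrics transcription remains much harder than the unaccompanied case.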