Deciphering Speech: a Zero-Resource Approach to Cross-Lingual Transfer in ASR

We present a method for cross-lingual training of an ASR system using absolutely no transcribed training data from the target language, and with no phonetic knowledge of the language in question. Our approach uses a novel application of a decipherment algorithm, which operates given only unpaired speech and text data from the target language. We apply this decipherment to phone sequences generated by a universal phone recogniser trained on out-of-language speech corpora, which we follow with flat-start semi-supervised training to obtain an acoustic model for the new language. To the best of our knowledge, this is the first practical approach to zero-resource cross-lingual ASR which does not rely on any hand-crafted phonetic information. We carry out experiments on read speech from the GlobalPhone corpus, and show that it is possible to learn a decipherment model on just 20 minutes of data from the target language. When the deciphered output is used to generate pseudo-labels for semi-supervised training, we obtain WERs that range from 32.5% to just 1.9% absolute worse than the equivalent fully supervised models trained on the same data.


Introduction
In recent years there has been considerable research devoted to reducing the amount of human effort required to build an automatic speech recognition (ASR) system for a new language. Conventional ASR training requires large quantities of manually-transcribed training data, as well as a hand-crafted pronunciation dictionary. Recent grapheme-based hybrid-HMM approaches [1] have shown success at removing the need for explicit pronunciation knowledge, whilst more recent end-to-end systems [2] have removed the need for a lexicon entirely by modelling output tokens at the character or word-piece level. However, transcribed training data is typically still required, with end-to-end systems being particularly data hungry.
The process of manual transcription can be extremely time-consuming and expensive. Consequently, a body of research has focused on reducing the need for such data, for example through the use of approximately-matching "in-the-wild" speech and text data, known as lightly-supervised training [3], and through the use of unlabelled data transcribed with a seed model, known as semi-supervised training [4]. However, in both cases, manual expertise is required to train the initial model.
In his 2012 position paper, Glass [5] described the road towards unsupervised speech processing through a set of scenarios that, he noted, "might seem increasingly outlandish and impractical". He suggested a move from "expert-based" systems, with a dictionary and phoneme set provided, through "data-based" systems with parallel speech and text data, to what he called "decipher-based" systems, through which ASR training could be achieved using entirely untranscribed speech, together with unpaired text data. This scenario has the significant advantage that, at least for languages with a significant web presence, both resources are likely to be relatively abundant without any human effort.
Since Glass's paper, significant effort has been devoted to this so-called "zero-resource" scenario. Approaches to this problem tend to fall into two categories: those attempting to learn phoneme- or word-like patterns from speech in a bottom-up manner, often motivated by child speech learning [6,7]; and those using cross-lingual information to inform the target model. The latter category extends a long strand of research into cross-lingual ASR methods, which seek to improve supervised training on a target language through the use of out-of-language data, to the case where no transcribed data exists for the target language. There have been a variety of recent approaches to this problem, all of which in one way or another address the problem of matching the modelling units of the out-of-language model to meaningful units in the target language. The earliest approaches used IPA-based phone mapping schemes [8], whilst more recent related methods have used automatic multilingual pronunciation mining from the web [9] or cross-lingual transfer from languages with similar orthographies [10]. [11] uses knowledge of compositional phonetics in the target language to remove the need for resources in the target language, building on earlier supervised approaches such as [12], whilst [13] used a semi-supervised approach, extending an initial lexicon using unpaired phonemic transcripts and text data.
Separately, there has been significant work towards building language-universal systems, generally with shared phonetic knowledge [14]. These approaches can be problematic due to differing phonotactics between languages [15], though language-specific embeddings may be used [16]. Again, these methods require knowledge of pronunciations in a target language in order to produce word output. Purely graphemic multilingual systems have been developed [1] but require the target language to be in the training set; supervised transliteration approaches have been used in the context of end-to-end systems [17].
Purely bottom-up approaches to zero-resource ASR, whilst interesting, have not generally yielded state-of-the-art ASR performance when compared to cross-lingual methods. However, groundbreaking work in this area [18] uses Facebook's wav2vec 2.0 architecture [19] to produce phone-like sequences in an entirely bottom-up manner, which are then mapped to phonemized text sequences using an adversarial objective [20]. Nonetheless, this work relies on manually-obtained phone units and a system trained on a large hand-crafted pronunciation dictionary, with the authors noting that it is easier to learn a mapping between the speech audio and phone units. Further, the wav2vec 2.0 models need very large amounts of training data to be effective.
We believe that no prior research has yet achieved Glass's vision of removing both the need for transcribed audio data and human phonetic knowledge of the target language, thus building a system with no expert input. In this paper, we propose to return to his original term of "decipher-based" systems. Inspired by this, and by similar work for unsupervised transliteration and machine translation [21,22], we here present a method for deciphering speech data using only mismatched text data from the language of interest. Our method starts with a cross-lingual approach, taking a universal phone recogniser trained on a variety of source languages. No self-supervised pre-training is needed, and we find that the technique is highly data efficient, requiring just 20 minutes of speech data from the target language to achieve a successful decipherment. Furthermore, no phonetic knowledge of the target language is used, making the method applicable in principle to almost any language.

Zero-Resource Cross-Lingual Transfer
Our method for zero-resource cross-lingual transfer uses a three-stage approach. First, a universal phone recogniser transcribes audio into phones. Then, we decipher this phone sequence into graphemes from the target language; for this, only language models trained on target-language text data are required. Finally, a flat-start semi-supervised training procedure is used to train a new acoustic model using the deciphered pseudo-labels. The complete pipeline is illustrated in Figure 1. We describe the three steps in detail below.

Universal Phone Recognition
The aim of a universal phone recogniser is to phonetically transcribe speech from any language. To achieve good generalisation to unseen languages, it is necessary to train the model on a diverse set of languages, in order to cover as wide a set of phones as possible. One way to train such a system is to simply pool data and phonemic lexicons and train a multilingual model with a shared phoneme set. In this work, for simplicity, we use a shared phoneme set to train a conventional hybrid HMM-DNN system on six well-resourced languages. We are aware that [14] notes that pooling the phoneme sets is sub-optimal as phonemes might have different surface forms in different languages, and that the use of linguistically-derived allophone mappings [14] might be beneficial.

Decipherment
The task of decipherment is to convert a cipher into plain natural language, a classic example being deciphering a letter substitution cipher. The use of this technique to decipher the output of a multilingual phone recogniser is the most significant contribution of this paper. We start with the work of Knight [23], who showed that a noisy-channel framework can be used for decipherment. In this framework the probability of deciphering a cipher X into an English text Y is modelled as
$$P(Y \mid X) \propto P_{\mathrm{lex}}(X \mid Y) \, P_{\mathrm{lm}}(Y),$$
where we call Plex(X | Y) the lexical model and Plm(Y) the language model. The lexical model produces the probability that an English letter y corresponds to a cipher letter x, and the language model Plm(Y) assigns probabilities to sequences of English letters. The language model can be trained on any text corpus, and the lexical model Plex(X | Y) can be trained in an unsupervised fashion with the Baum-Welch algorithm [24].
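As an illustration only, the following Python sketch trains the lexical model of a plain one-to-one substitution cipher with Baum-Welch while keeping a letter-bigram language model fixed. It is not the OpenFst-based pipeline used in this work: the function names `forward_backward` and `train_lexical_model`, the dictionary-based model layout, the random initialisation and the tiny count floor are all assumptions made for the example.

```python
import math
import random


def forward_backward(obs, states, init, trans, emit):
    """Scaled forward-backward pass over one cipher sequence.

    init[y] and trans[p][y] come from a fixed bigram language model;
    emit[y][x] plays the role of the lexical model Plex(x | y).
    Returns per-position state posteriors and the log-likelihood.
    """
    n = len(obs)
    alpha, scale = [], []
    for t, x in enumerate(obs):
        row = {}
        for y in states:
            if t == 0:
                prior = init[y]
            else:
                prior = sum(alpha[t - 1][p] * trans[p][y] for p in states)
            row[y] = prior * emit[y][x]
        z = sum(row.values())
        alpha.append({y: v / z for y, v in row.items()})
        scale.append(z)
    beta = [dict.fromkeys(states, 1.0) for _ in range(n)]
    for t in range(n - 2, -1, -1):
        nxt = obs[t + 1]
        for y in states:
            total = sum(trans[y][q] * emit[q][nxt] * beta[t + 1][q] for q in states)
            beta[t][y] = total / scale[t + 1]
    gamma = [{y: alpha[t][y] * beta[t][y] for y in states} for t in range(n)]
    return gamma, sum(math.log(z) for z in scale)


def train_lexical_model(obs, states, cipher_symbols, init, trans, iters=20, seed=0):
    """Estimate Plex(x | y) with EM; the language model is never updated."""
    rng = random.Random(seed)
    emit = {}
    for y in states:
        weights = {x: rng.random() + 0.1 for x in cipher_symbols}
        total = sum(weights.values())
        emit[y] = {x: w / total for x, w in weights.items()}
    history = []
    for _ in range(iters):
        # E-step: expected number of times plaintext y emitted cipher symbol x.
        gamma, loglik = forward_backward(obs, states, init, trans, emit)
        history.append(loglik)
        counts = {y: dict.fromkeys(cipher_symbols, 1e-12) for y in states}
        for t, x in enumerate(obs):
            for y in states:
                counts[y][x] += gamma[t][y]
        # M-step: renormalise the expected counts per plaintext symbol.
        emit = {}
        for y in states:
            total = sum(counts[y].values())
            emit[y] = {x: c / total for x, c in counts[y].items()}
    return emit, history
```

The returned `history` holds the training log-likelihood before each update, which gives a simple way to compare runs started from different random initialisations.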
Once the lexical model is trained, the most probable English text corresponding to the cipher can be deciphered with the Viterbi algorithm to obtain
$$\hat{Y} = \arg\max_{Y} P_{\mathrm{lex}}(X \mid Y) \, P_{\mathrm{lm}}(Y).$$
In the past, decipherment was used in various NLP applications such as unsupervised machine transliteration [25], unsupervised machine translation [21,26] or unsupervised Chinese pronunciation learning [27]. However, phoneme-to-grapheme (P2G) conversion is much more difficult than solving a deterministic substitution cipher for several reasons. First, a grapheme can be mapped to many phonemes, for example the English grapheme "a" can be pronounced as AH, AA, AE or EH. Second, one grapheme can correspond to multiple sequential phonemes, for example "x" is pronounced as "K S"; similarly, one phoneme can align with multiple sequential graphemes, for example "th" is often pronounced as "DH". Finally, when applying phoneme-to-grapheme conversion at the utterance level, our model needs to be able to perform word segmentation. These challenges are further multiplied when we deal with noisy inputs from the universal phone recogniser.
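For the simple one-to-one substitution case, the Viterbi search for the most probable plaintext can be sketched as below. This is a minimal example under stated assumptions, not the decoder used in this work: it handles neither insertions and deletions nor word segmentation, and the function name `viterbi_decipher` and the dictionary layout (`init[y]`, `trans[prev][y]` for the bigram language model, `emit[y][x]` for Plex(x | y)) are hypothetical.

```python
import math


def _log(p):
    """Log of a probability, with log(0) mapped to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi_decipher(cipher, states, init, trans, emit):
    """Return the plaintext sequence maximising Plex(X | Y) * Plm(Y)."""
    if not cipher:
        return []
    # best[y] is the best log-score of any plaintext prefix ending in y.
    best = {y: _log(init[y]) + _log(emit[y].get(cipher[0], 0.0)) for y in states}
    backpointers = []
    for x in cipher[1:]:
        new_best, pointers = {}, {}
        for y in states:
            prev = max(states, key=lambda p: best[p] + _log(trans[p][y]))
            new_best[y] = best[prev] + _log(trans[prev][y]) + _log(emit[y].get(x, 0.0))
            pointers[y] = prev
        backpointers.append(pointers)
        best = new_best
    # Trace back from the best final symbol.
    path = [max(states, key=lambda y: best[y])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]
```

Working in log space avoids numerical underflow on long sequences, and the language model can override a weak lexical preference whenever the bigram evidence is strong enough.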
To be able to deal with the insertions and deletions inherent to P2G conversion, we use the parameterisation proposed by Nuhn [22], in which the decipherment model consists of three components: the lexical model Plex(X | Y), the alignment model Pali(A), and the language model Plm(Y). The random variable A represents a sequence of substitution, insertion and deletion operations. We can also express decipherment in WFST notation as the composition X ∘ L ∘ A ∘ G, where X is an input phone acceptor, L is the lexical model transducer, A is the alignment model transducer and G is the language model acceptor.
The role of the lexical model is to model the probabilities of mapping phones into graphemes, and also to model the probability of phone deletions Plex(x | ϵ). The lexical model is implemented as a simple one-state flower transducer. It is initialised to allow all possible substitutions, but during training the unseen substitutions are pruned from the model, which results in faster training and inference due to a smaller composition. However, since we train on small amounts of data in an unsupervised fashion, it is possible that some important arcs are pruned from the model, which can be detrimental. Therefore, following [26], we smooth the lexical model at various stages of training with the following equation:
$$\hat{P}_{\mathrm{lex}}(x \mid y) = \alpha \, P_{\mathrm{lex}}(x \mid y) + \frac{1 - \alpha}{|X|},$$
where α is a smoothing parameter (we use 0.9) and |X| denotes the size of the input phone-set. Finally, the lexical model always maps silence phones to silence or a word boundary in the output. This results in faster training and inference (because silence prunes the space of possible word segmentations) and in more accurate decipherment.
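One reading of this smoothing step, assuming it linearly interpolates the trained lexical model with a uniform distribution over the input phone-set (which the roles of α and |X| suggest), is sketched below. The function name `smooth_lexical_model` and the nested-dictionary layout `lex[grapheme][phone]` are hypothetical, and the deletion entry Plex(x | ϵ) is treated as just another row.

```python
def smooth_lexical_model(lex, phone_set, alpha=0.9):
    """Interpolate each row of Plex(x | y) with a uniform distribution.

    Pruned arcs (phones missing from a row) come back with probability
    (1 - alpha) / |X|, so later training rounds can revive them.
    """
    uniform = 1.0 / len(phone_set)
    smoothed = {}
    for grapheme, probs in lex.items():
        smoothed[grapheme] = {
            phone: alpha * probs.get(phone, 0.0) + (1.0 - alpha) * uniform
            for phone in phone_set
        }
    return smoothed
```

If every input row sums to one over `phone_set`, so does every smoothed row, so no renormalisation is needed afterwards.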
The language model is the component most important to decipherment, providing information to the training process. The language model predicts the probability of a sequence of graphemes in the target language. Therefore, it is possible to use character n-gram models. It is important to keep in mind that using n-gram models with large contexts results in a large composition when composed with an unpruned lexical model; therefore it is not feasible to use them from the beginning. Hence we start with a simple bigram model and move to using up to 5-gram grapheme models as training progresses. Subsequently, we use a word trigram language model together with a grapheme lexicon for the final round of training. Since the composition of the lexical model, alignment model and the word language model L ∘ A ∘ G is slow, and is required after every training epoch, we reimplemented the standard composition X ∘ (L ∘ A ∘ G) as a three-way composition X ∘ (L ∘ A) ∘ G [28]. We also use pruning to speed up training and inference with the word language models. Our decipherment pipeline is implemented in OpenFst [29] and its design is heavily inspired by the BaumWelch library from OpenGrm [30].

Semi-Supervised Training
In conventional semi-supervised training (SST) we use a seed model to create "pseudo-labels" for untranscribed speech data [4,31]. In our previous work [32], we showed that SST can be successful even with seed models with WER over 80% if lattices are used to model uncertainty in the hypotheses. In the previous section we described how decipherment can be used to convert the output of a universal phone recogniser into a sequence of target-language graphemes, to be used as pseudo-labels for untranscribed data. The pseudo-labels can either be one-best transcripts or decipherment lattices. Unlike conventional SST, we have no seed model for the target language, since there is no equivalence between the outputs of the phone recogniser and the target language graphemes. We therefore choose to train a model with flat-start lattice-free MMI (LF-MMI) [33], initialising the lower layers of the model with the universal phone recogniser, but using a randomly-initialised output layer.

Experiments
To demonstrate that decipherment can be used for cross-lingual transfer, we trained a universal phone recogniser on English, French, German, Spanish, Russian and Polish and we performed decipherment experiments on Bulgarian (BUL), Czech (CES), Hausa (HAU), Portuguese (POR), Swahili (SWA), Swedish (SWE) and Ukrainian (UKR).

Setup
Our experiments were performed using the GlobalPhone corpus [34]. This corpus comes with data from various languages and contains lexicons, which can be mapped to X-SAMPA, enabling the pooling of phonemes across languages. All these properties make GlobalPhone an ideal test-bed for evaluation of zero-resource cross-lingual transfer with decipherment.
To train the universal phone recogniser as a multilingual model with a shared phone-set, we pooled 20 hours of English LibriSpeech [35] with the training data from GlobalPhone German, French, Spanish, Russian and Polish [34], 110 hours of data in total. The multilingual model was a small time-delay neural network [36] with 18 hidden layers, each with a hidden dimension of 798 and a bottleneck dimension of 90. In total the model had 7.2M parameters, used 40-dimensional cepstral mean and variance normalised MFCC features as inputs, and was trained with LF-MMI [37] using the Kaldi toolkit [38]. We used a phone bigram language model estimated on the multilingual training data for cross-lingual phone decoding.
We trained all language models on the text data from CommonCrawl [39]. Because the CommonCrawl text data is noisy, we preprocessed it as follows. We performed word tokenization and removed tokens consisting only of non-alphanumeric characters. We mapped words containing characters outside of the target language alphabet, or containing letters repeated at least 3 times in a row, to <unk>, and removed sentences containing any word longer than 20 characters or with three consecutive single-letter words. We trained language models with SRILM [40] on up to 1B tokens and we pruned the language models to only contain the 300k most frequent words. We did not use the pre-trained GlobalPhone language models because we found that some of them had also been trained on the training transcripts, which could bias the results of semi-supervised training. For completeness, we also include the oracle results obtained with GlobalPhone LMs in Table 1.
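The filtering rules above can be sketched in a few lines of Python. This is an illustrative reading rather than the preprocessing script used in this work: the function name `normalise_line`, whitespace tokenisation, lowercasing, and applying the sentence-level filters before the <unk> mapping are all assumptions, and the alphabet is passed in as a plain set of characters.

```python
import re

# Any character repeated at least three times in a row, e.g. "sooo".
TRIPLE_REPEAT = re.compile(r"(.)\1\1")


def normalise_line(line, alphabet):
    """Return the cleaned sentence, or None if the sentence is dropped."""
    # Assumption: whitespace tokenisation and lowercasing.
    tokens = [t for t in line.lower().split() if any(c.isalnum() for c in t)]
    # Drop sentences with any word longer than 20 characters.
    if any(len(t) > 20 for t in tokens):
        return None
    # Drop sentences with three consecutive single-letter words.
    run = 0
    for t in tokens:
        run = run + 1 if len(t) == 1 else 0
        if run >= 3:
            return None
    # Map out-of-alphabet words and words with tripled letters to <unk>.
    cleaned = []
    for t in tokens:
        if any(c not in alphabet for c in t) or TRIPLE_REPEAT.search(t):
            cleaned.append("<unk>")
        else:
            cleaned.append(t)
    return " ".join(cleaned)
```

For example, with a lowercase Latin alphabet, punctuation-only tokens disappear and a word such as "sooo" is replaced by <unk>, while a line containing "a b c" in sequence is discarded entirely.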
The decipherment model was trained on 20 minutes of the shortest utterances from the development set of each language. We increased the power of the character-level language model over successive epochs from a bigram up to a 5-gram, performing 20 iterations of full-batch training with each language model. These grapheme language models were trained on the first 50k lines of the normalised CommonCrawl text data. To prevent issues with bad initialisation of the decipherment model, we performed 50 random restarts with the bigram language model and we picked the model with the best likelihood on the training data for subsequent training [41]. To speed up training with the larger grapheme models, we pruned the lexical model to retain probabilities for only the top 20 phones for each grapheme after training with the bigram grapheme language model. After this stage we smoothed the lexical model and continued training with the CommonCrawl word language model. Finally, we smoothed the lexical model again and deciphered the GlobalPhone training data. Note that during training we used a word language model containing only the 100k most frequent words, but during inference we used the language model containing the 300k most frequent words.
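Two of the bookkeeping steps in this recipe, choosing the shortest utterances up to a 20-minute budget and keeping only the top 20 phones per grapheme, can be sketched as follows. The helper names `select_shortest` and `prune_top_k`, the dictionary layouts, and the renormalisation after pruning are assumptions made for illustration; the text does not specify how ties or the final partial utterance are handled.

```python
def select_shortest(durations, budget_seconds=20 * 60):
    """Pick utterance ids from shortest to longest until the budget is full.

    durations maps an utterance id to its length in seconds.
    """
    chosen, total = [], 0.0
    for utt_id, seconds in sorted(durations.items(), key=lambda kv: kv[1]):
        if total + seconds > budget_seconds:
            break
        chosen.append(utt_id)
        total += seconds
    return chosen


def prune_top_k(lex, k=20):
    """Keep the k most probable phones for each grapheme in Plex(x | y).

    Assumption: the surviving probabilities are renormalised to sum to one.
    """
    pruned = {}
    for grapheme, probs in lex.items():
        top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
        total = sum(p for _, p in top)
        pruned[grapheme] = {phone: p / total for phone, p in top}
    return pruned
```

Pruning shrinks the lexical transducer before it is composed with the larger grapheme language models, and a later smoothing pass can restore arcs that were removed too aggressively.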
Subsequently we performed two iterations of semi-supervised training with the deciphered lattices representing alternative pseudo-labels. In the first iteration we used these lattices for flat-start LF-MMI training [33]. Instead of training the acoustic model from scratch, we replaced the output layer of the multilingual model with a new layer producing pseudo-likelihoods for mono-graphemes. In the second iteration we used the mono-grapheme model to re-decode the training data. To prevent overfitting to the training data we did not continue training the mono-grapheme model, but again replaced the output layer of the pretrained multilingual model and used that model for training. This time we used bi-grapheme targets, because we now had more reliable pseudo-labels with which to estimate the state clustering tree. Since the decipherment tends to produce a lot of deletion errors, we used a deletion penalty during decoding to allow the model to learn to fix them [42].
We compared the performance with two other approaches. The first was standard supervised training, called Oracle in Table 1, and in the second we used linguistic knowledge to map phones from the target language to the closest phone in the pooled multilingual phone set to generate pseudo-labels for semi-supervised training [8,43], called Phone-mapping in Table 1. In both approaches we also initialised the acoustic models by replacing the output layer of the pretrained multilingual model.

Results
Our results in Table 1 show that decipherment achieves results comparable to or better than the hand-crafted phone-mapping approach for Bulgarian, Czech, Portuguese and Ukrainian, which are well-resourced languages. For all these languages, decipherment followed by semi-supervised training is only 2-4% absolute worse than Oracle with CommonCrawl LM. Swedish is the only well-resourced language for which decipherment performs much worse, with an absolute difference of 32.5%.
For the lower-resourced languages Swahili and Hausa, phone mapping achieves better results. For Swahili we were unable to achieve a successful decipherment. By listening to the GlobalPhone Swahili data we identified several problems, including beeps at the beginning of each utterance and a lot of leading and trailing silence. Even when we removed the beeps and the leading and trailing silence in the Swahili data, the performance was poor (as reported in Table 1). Therefore, we decided to evaluate the method also on Swahili data from the ALFFA corpus [44], which has been used for unsupervised speech recognition experiments in [18]. On this dataset decipherment followed by semi-supervised training achieves a WER of 41.8%, which compares to 32.2% reported in [18] and 24.6% achieved by our oracle model. These Swahili results suggest that in order to be able to decipher speech from a new language we need to find speech amenable to decipherment.
We believe that our results in all languages could be further improved by using a better universal phone recogniser [14], a better pre-trained model for initialisation [19] and by leveraging more crawled data for SST [32,45]. Our results are in line with our previous work on conventional low-resource SST, where we showed that it is possible to perform SST even with a bad seed acoustic model provided that we have a good language model [32].

Conclusions and Future Work
We presented a method for zero-resource cross-lingual transfer of ASR models based on decipherment that allows training of ASR models using only untranscribed speech, text corpora and a universal phone recogniser. Across seven test languages our method was able to produce a working acoustic model for six, which could be further improved by using more untranscribed data for SST. In future we plan to apply decipherment to more challenging languages, but we believe that for this it may be necessary to train a more robust universal phone recogniser that works reliably across a wider range of languages. We also intend to improve our decipherment algorithm to enable selection of utterances that can be reliably deciphered. Further, we hope to replace the universal phone recogniser with automatic unit discovery to create a pronunciation lexicon-free alternative for unsupervised speech recognition [18].

Figure 1: A diagram of a zero-resource cross-lingual transfer pipeline.

Figure 2: A diagram of an alignment model which allows one insertion or one deletion in a row.