Anti-Spoofing for Text-Independent Speaker Verification: An Initial Database, Comparison of Countermeasures, and Human Performance

In this paper, we present a systematic study of the vulnerability of automatic speaker verification to a diverse range of spoofing attacks. We start with a thorough analysis of the spoofing effects of five speech synthesis and eight voice conversion systems, and the vulnerability of three speaker verification systems under those attacks. We then introduce a number of countermeasures to prevent spoofing attacks from both known and unknown attackers. Known attackers are spoofing systems whose output was used to train the countermeasures, while an unknown attacker is a spoofing system whose output was not available to the countermeasures during training. Finally, we benchmark automatic systems against human performance on both speaker verification and spoofing detection tasks.


I. INTRODUCTION
The task of automatic speaker verification (ASV), sometimes described as a type of voice biometrics, is to accept or reject a claimed identity based on a speech sample. There are two types of ASV system: text-dependent and text-independent. Text-dependent ASV assumes constrained word content and is normally used in authentication applications because it can deliver the high accuracy required. In contrast, text-independent ASV places no constraints on word content and is normally used in surveillance applications. For example, in call-center applications 1,2 , a caller's identity can be verified during the course of a natural conversation without forcing the caller to speak a specific passphrase. Moreover, as such a verification process usually takes place remotely, without any face-to-face contact, a spoofing attack (an attempt to manipulate a verification result by mimicking a target speaker's voice in person or by using computer-based techniques such as voice conversion or speech synthesis) is a fundamental concern. Hence, in this work, we focus on spoofing and anti-spoofing for text-independent ASV.
This work was partially supported by EPSRC under Programme Grant EP/I031022/1 (Natural Speech Technology) and EP/J002526/1 (CAF) and by TUBITAK 1001 grant No 112E160. This article is an expanded version of [1], [2]. Z. Wu is the corresponding author, and the remaining authors are listed in alphabetical order to indicate equal contributions. T. Toda is with the Information Technology Center, Nagoya University, Japan (e-mail: tomoki@icts.nagoya-u.ac.jp).
Due to a number of technical advances, notably channel and noise compensation techniques, ASV systems are being widely adopted in security applications [3], [4], [5], [6], [7]. A major concern when deploying an ASV system, however, is its resilience to spoofing attacks. As identified in [8], there are at least four types of spoofing attack: impersonation [9], [10], [11], replay [12], [13], [14], speech synthesis [15], [16] and voice conversion [17], [18], [19], [20], [21]. Among these, replay, speech synthesis, and voice conversion present the highest risk to ASV systems [8]. Although replay might be the most common spoofing technique, presenting a risk to both text-dependent and text-independent ASV systems [12], [13], [14], it is not viable for generating utterances of specific content, such as would be required to maintain a live conversation in a call-center application. On the other hand, open-source software for state-of-the-art speech synthesis and voice conversion is readily available (e.g., Festival 3 and Festvox 4 ), making these two approaches perhaps the most accessible and effective means of carrying out spoofing attacks, and therefore a serious risk to deployed ASV systems [8]. For that reason, this work focuses only on those two types of spoofing attack.

A. Speech Synthesis and Voice Conversion Spoofing
Many studies have reported and analysed the vulnerability of ASV systems to speech synthesis and voice conversion spoofing. The potential vulnerability of ASV to synthetic speech was first evaluated in [22], [23], where an HMM-based speech synthesis system was used to spoof an HMM-based, text-prompted ASV system. The authors reported that the false acceptance rate (FAR) increased from 0% to over 70% under a speech synthesis spoofing attack. In [15], [16], the vulnerability of two ASV systems, a GMM-UBM system (Gaussian mixture models with a universal background model) and an SVM system (support vector machine using a GMM supervector), was assessed using a speaker-adaptive, HMM-based speech synthesizer. Experiments using the Wall Street Journal (WSJ) corpus (283 speakers) [24] showed that FARs increased from 0.28% and 0.00% to 86% and 81% for the GMM-UBM and SVM systems, respectively. These studies confirm the vulnerability of ASV systems to speech synthesis spoofing attacks.
Voice conversion as a spoofing method has also been attracting increasing attention. The potential risk of voice conversion to a GMM ASV system was evaluated for the first time in [25], which used the YOHO database (138 speakers). In [26], [27], [17], text-independent GMM-UBM systems were assessed when faced with voice conversion spoofing on NIST speaker recognition evaluation (SRE) datasets. These studies showed an increase in FAR from around 10% to over 40% and confirmed the vulnerability of GMM-UBM systems to voice conversion spoofing attack.
Recent studies [18], [19] have evaluated more advanced ASV systems based on joint factor analysis (JFA), i-vectors, and probabilistic linear discriminant analysis (PLDA), on the NIST SRE 2006 database. The FARs of these systems increased more than five-fold, from about 3% to over 17%, under voice conversion spoofing attacks.

B. Spoofing countermeasures
The vulnerability of ASV systems to spoofing attacks has led to the development of anti-spoofing techniques, often referred to as countermeasures. In [28], a synthetic speech detector based on the average inter-frame difference (AIFD) was proposed to discriminate between natural and synthetic speech. This countermeasure works well if the dynamic variation of the synthetic speech is different from that of natural speech; however, if global variance compensation is applied to the synthetic speech, the countermeasure becomes less effective [15].
In [29], [30], a synthetic speech detector based on image analysis of pitch-patterns was proposed for human versus synthetic speech discrimination. This countermeasure was based on the observation that there can be artefacts in the pitch contours generated by HMM-based speech synthesis. Experiments showed that features extracted from pitch-patterns can be used to significantly reduce the FAR for synthetic speech. The performance of the pitch-pattern countermeasure was not evaluated for detecting voice conversion spoofing.
In [31], a temporal modulation feature was proposed to detect synthetic speech generated by copy-synthesis. The modulation feature captures the long-term temporal distortion caused by independent frame-by-frame operations in speech synthesis. Experiments conducted on the WSJ database showed the effectiveness of the modulation feature when integrated with frame-based features. However, whether the detector is effective across a variety of speech synthesis and voice conversion spoofing attacks is unknown. Also using spectro-temporal information, a feature derived from local binary patterns [32] was employed to detect voice conversion and speech synthesis attacks in [33], [34].
Phase- and modified group delay-based features have also been proposed to detect voice conversion spoofing [35]. A cosine-normalised phase feature was derived from the phase spectrogram, while the modified group delay feature contained both magnitude and phase information. Evaluation on the NIST SRE 2006 data confirmed the effectiveness of the proposed features. However, it remains unknown whether the phase-based features are also effective in detecting attacks from speech synthesisers using unknown vocoders. Another phase-based feature called the relative phase shift was proposed in [16], [36], [37] to detect speech synthesis spoofing, and was reported to achieve promising performance for vocoders using minimum phase rather than natural phase.
In [38], an average pair-wise distance (PWD) between consecutive feature vectors was employed to detect voice-converted speech, on the basis that the PWD feature is able to capture short-term variabilities, which might be lost during statistical averaging when generating converted speech. Although the PWD was shown to be effective against attacks from the authors' own voice conversion system, this technique (which is similar to the AIFD feature proposed in [28]) might not be an effective countermeasure against systems that apply global variance enhancement.
In contrast to the above methods focusing on discriminative features, a probabilistic approach was proposed in [39], [40]. This approach uses the same front-end as ASV, but treats the synthetic speech as a signal passed through a synthesis filter. Experiments on the NIST SRE 2006 database showed comparable performance to feature-based countermeasures. In this work, we focus on feature-based anti-spoofing techniques, as they can be optimised independently without rebuilding the ASV systems.

C. Motivations and Contributions of this Work
In the literature, each study assumes a particular spoofing type (speech synthesis or voice conversion) and often just one variant (algorithm) of that type, then designs and evaluates a countermeasure for that specific, known attack. However, in practice it may not be possible to know the exact type of spoofing attack, and therefore evaluations of ASV systems and countermeasures under a broad set of spoofing types are desirable. Most, if not all, previous studies have been unable to conduct a broader evaluation because of the lack of a standard, publicly available spoofing database that contains a variety of spoofing attacks. To address this issue, we have previously developed a spoofing and anti-spoofing (SAS) database including both speech synthesis and voice conversion spoofing attacks [1]. This database includes spoofed speech from two different speech synthesis systems and seven different voice conversion systems. In this work, we first broaden the SAS database by including four more variants: three text-to-speech (TTS) synthesisers and one voice conversion system. They will be referred to as SS-SMALL-48, SS-LARGE-48, SS-MARY and VC-LSP 5 , and are described in Section II.A.
We also develop a joint speaker verification and countermeasure evaluation protocol, then refine that evaluation protocol to enable better generalisability of countermeasures developed using the database. We include contributions from both the speech synthesis and speaker verification communities. This database is offered as a resource for researchers investigating generalised spoofing and anti-spoofing methods 6 . We hope that the availability of a standard database will contribute to reproducible research 7 .
Second, with the SAS database, we conduct a comprehensive analysis of spoofing attacks on six different ASV systems. From this analysis we are able to determine which spoofing type and variant currently poses the greatest threat and how best to counter this threat. To the best of our knowledge, this study is the first evaluation of the vulnerability of ASV using such a diverse range of spoofing attacks and the most thorough analysis of the spoofing effects of speech synthesis and voice conversion spoofing systems under the same protocol.
Third, we present a comparison of several anti-spoofing countermeasures to discriminate between human and artificial speech. In our previous work, we applied cosine-normalised phase [35], modified group delay [35] and segment-based modulation features [31] to detect voice converted speech, and applied pitch pattern based features to detect synthetic speech [29], [30]. In this work, we evaluate these countermeasures against both spoofing types and propose to fuse decisions at the score level in order to leverage multiple, complementary sources of information to create stronger countermeasures. We also extend the segment-based modulation feature to an utterance-level feature, to account for long-term variations.
Finally, we perform listening tests to evaluate the ability of human listeners to discriminate between human and artificial speech 8 . Although the vulnerability of ASV systems in the face of spoofing attacks is known, some questions still remain unanswered. These include whether human perceptual ability is important in identifying spoofing and whether humans can achieve better performance than automatic approaches in detecting spoofing attacks. In this work, we attempt to answer these questions through a series of carefully-designed listening tests. In contrast to the human assisted speaker recognition (HASR) evaluation [43], we consider spoofing attacks in speaker verification and conduct listening tests for spoofing detection, which was not considered in the HASR evaluation.
5 The four systems are new in this article, while the other systems were published in a conference paper [1]. SS-SMALL-48 and SS-LARGE-48 allow us to analyse the effect of the sampling rate of spoofing materials. SS-MARY is useful for understanding the effect of waveform concatenation-based speech synthesis spoofing.
6 Based on this database, a spoofing and countermeasure challenge [41], [42] has already been successfully organised as a special session of INTERSPEECH 2015.
7 The SAS corpus is publicly available: http://dx.doi.org/10.7488/ds/252
8 The preliminary version was published at INTERSPEECH 2015 [2], where we focused on human and automatic spoofing detection performance on wideband and narrowband data. The current work benchmarks automatic systems against human performance on speaker verification and spoofing detection tasks.

II. DATABASE AND PROTOCOL
We extended our SAS database [1] by including additional artificial speech. The database is built from the freely available Voice Cloning Toolkit (VCTK) database of native speakers of British English 9 . The VCTK database was recorded in a hemi-anechoic chamber using an omni-directional head-mounted microphone (DPA 4035) at a sampling rate of 96 kHz. The sentences are selected from newspapers, and the average duration of each sentence is about 2 seconds.
To design the spoofing database, we took speech data from VCTK comprising 45 male and 61 female speakers and divided each speaker's data into five parts:
A: 24 parallel utterances (i.e., same sentences for all speakers) per speaker: training data for spoofing systems.
B: 20 non-parallel utterances per speaker: additional training data for spoofing systems.
C: 50 non-parallel utterances per speaker: enrolment data for client model training in speaker verification, or training data for speaker-independent countermeasures.
D: 100 non-parallel utterances per speaker: development set for speaker verification and countermeasures.
E: Around 200 non-parallel utterances per speaker: evaluation set for speaker verification and countermeasures.
In Parts B-E, sentences were randomly selected from newspapers without any repeating sentence across speakers. In Parts A and B, we have two versions, downsampled to 48 kHz and 16 kHz respectively, while in Parts C, D and E all signals are downsampled to 16 kHz. Parts A and B allow us to analyse the effect of sampling rate on spoofing attacks. For training the spoofing systems, we designed two training sets. The small set consists of data only from Part A, while the large set comprises the data from Parts A and B together.

A. Spoofing systems
We implemented five speech synthesis (SS) and eight voice conversion (VC) spoofing systems, as summarised in Table I. These systems were built using both open-source software (to facilitate reproducible research) and our own state-of-the-art systems (to provide comprehensive results):
NONE: This is a baseline zero-effort impostor trial in which the impostor's own speech is used directly with no attempt to match the target speaker.
SS-LARGE-16: An HMM-based TTS system built with the statistical parametric speech synthesis framework described in [44]. For speech analysis, the STRAIGHT vocoder with mixed excitation is used, which results in 60-dimensional Bark-cepstral coefficients, log F0 and 25-dimensional band-limited aperiodicity measures [45], [46]. Speech data from 257 (115 male and 142 female) native speakers of British English is used to train the average voice model. In the speaker adaptation phase, the average voice model is transformed using structural variational Bayesian linear regression [47] followed by maximum a posteriori (MAP) adaptation, using the target speaker's data from Parts A and B. To synthesise speech, acoustic feature parameters are generated from the adapted HMMs using a parameter generation algorithm that considers global variance (GV) [48]. An excitation signal is generated using mixed excitation and pitch-synchronous overlap and add [49], and used to excite a Mel-logarithmic spectrum approximation (MLSA) filter [50] corresponding to the STRAIGHT Bark cepstrum, to create the final synthetic speech waveform.
SS-LARGE-48: Same as SS-LARGE-16, except that 48 kHz sample rate waveforms are used for adaptation. The use of 48 kHz data is motivated by findings in speech synthesis that speaker similarity can be improved significantly by using data at a higher sampling rate [51].
SS-SMALL-16: Same as SS-LARGE-16, except that only Part A of the target speaker data is used for adaptation.
SS-SMALL-48: Same as SS-SMALL-16, except that 48 kHz sample rate waveforms are used to adapt the average voice.
SS-MARY: Based on the Mary-TTS 10 unit selection synthesis system [52]. Waveform concatenation operates on diphone units. Candidate units for each position in the utterance are found using decision trees that query the linguistic features of the target diphone. A preselection algorithm is used to prune candidates that do not fit the context well. The total cost sums linguistic (target) and acoustic (join) costs. Candidate diphone and target diphone labels and their contexts are used to compute the linguistic sub-cost. Pitch and duration are used for the join cost. Dynamic programming is used to find the sequence of units with the minimum total target plus join cost. Concatenation takes place in the waveform domain, using pitch-synchronous overlap-add at unit boundaries.
VC-C1: The simplest voice conversion method, which modifies the spectral slope simply by shifting the first Mel-Generalised Cepstral (MGC) coefficient [53]. No other speaker-specific features are changed. The STRAIGHT vocoder is used to extract MGCs, band aperiodicities (BAPs) and F0.
VC-EVC: A many-to-many eigenvoice conversion (EVC) system [54]. The eigenvoice GMM (EV-GMM) is constructed from the training data of one pivot speaker in the ATR Japanese speech database [55], and 273 speakers (137 male, 136 female) from the JNAS database 11 . Settings are the same as in [56]. The 272-dimensional weight vectors are estimated using Part A of the training data. STRAIGHT is used to extract 24-dimensional MGCs, 5 BAPs, and F0. The conversion function is applied only to the MGCs.
VC-FEST: The voice conversion toolkit provided by the open-source Festvox system. It is based on the algorithm proposed in [57], which is a joint density Gaussian mixture model with maximum likelihood parameter generation considering global variance. It is trained on the Part A set of parallel training data, keeping the default settings of the toolkit, except that the number of Gaussian components in the mixture distributions is set to 32.
VC-FS: A frame selection voice conversion system, which is a simplified version of exemplar-based unit selection [58], using a single frame as an exemplar and without a concatenation cost. We used the Part A set for training. The same features as in VC-C1 are used, and once again only the MGCs are converted.
VC-GMM: Another GMM-based voice conversion method very similar to VC-FEST but with some enhancements, which also uses the parallel training data from Part A. STRAIGHT is used to extract 24-dimensional MGCs, 5 BAPs, and F0. The search range for F0 extraction is automatically optimized speaker by speaker to reduce errors. Two GMMs are trained for separately converting the 1st through 24th MGCs and the 5 BAPs. The number of mixture components is set to 32 for MGCs and 8 for BAPs. GV-based post-filtering [59] is used to enhance the variance of the converted spectral parameter trajectories.
VC-KPLS: Voice conversion using kernel partial least squares (KPLS) regression [60], trained on the Part A parallel data. Three hundred reference vectors and a Gaussian kernel are used to derive kernel features, and 50 latent components are used in the PLS model. Dynamic kernel features are not included, for simplicity. STRAIGHT is used to extract 24-dimensional MGCs, 25 BAPs, and F0.
11 http://www.milab.is.tsukuba.ac.jp/jnas/instruct.html
VC-TVC: Tensor-based arbitrary voice conversion (TVC) system [56]. To construct the speaker space, the same Japanese dataset as in VC-EVC is used. The size of the weight matrices that represent each speaker is set to 48 × 80. The same part of the SAS database and the same features as in VC-EVC are used, and again only MGCs are converted, without altering other features.
VC-LSP: This system is also based on the standard GMM-based voice conversion method, similar to VC-GMM, using the parallel training data from Part A. STRAIGHT is used as the speech analysis-synthesis method. 24-dimensional line spectral pairs (LSPs) and their delta coefficients are used as the spectral features. A 16-component GMM is trained to model the joint LSP feature vectors. For each component, the four blocks of its covariance matrix are set to be diagonal. No quality enhancement or post-filtering techniques are applied during the reconstruction of converted speech.
In addition to the above descriptions, for all the voice conversion approaches, F0 is converted by a global linear transformation: simple mean-variance normalisation. In VC-KPLS, VC-EVC, VC-TVC, VC-FS and VC-C1, the source speaker BAPs are simply copied, without undergoing any conversion.
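The global linear F0 transformation mentioned above can be illustrated with a minimal sketch. It assumes that the mean and standard deviation are computed on log-F0 over voiced frames and that unvoiced frames are marked by an F0 of zero; neither detail is stated in the text, and the function name `convert_f0` is hypothetical.

```python
import numpy as np


def convert_f0(f0_src, src_mean, src_std, tgt_mean, tgt_std):
    """Mean-variance normalisation of F0 from source to target speaker.

    Assumption: statistics are log-F0 means and standard deviations, and
    unvoiced frames (F0 == 0) are left untouched.
    """
    f0_src = np.asarray(f0_src, dtype=float)
    converted = np.zeros_like(f0_src)
    voiced = f0_src > 0
    log_f0 = np.log(f0_src[voiced])
    # Normalise with source statistics, then rescale with target statistics
    converted[voiced] = np.exp((log_f0 - src_mean) / src_std * tgt_std + tgt_mean)
    return converted
```

With equal standard deviations, the mapping reduces to a constant shift in log-F0, i.e. a fixed multiplicative scaling of the source pitch contour.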

B. Speaker Verification and Countermeasure Evaluation Protocol
For the evaluation of ASV systems, enrolment data for each client (speaker) were selected from Part C under two conditions: 5-utterance or 50-utterance enrolment. Five utterances correspond to about 5-10 seconds of speech, while 50 utterances correspond to about 1 minute of speech. The development set, used to tune the ASV system and decide thresholds, was taken from Part D and involves both genuine and impostor trials. All utterances from a client speaker in Part D were used as genuine trials, which results in 1498 male and 1999 female genuine trials. For the impostor trials, ten randomly selected non-target speakers were used as impostors. All Part D utterances from a specific impostor were used as impostor trials against the client's model, leading to 12981 male and 17462 female impostor trials. The evaluation set is taken from Part E and is arranged into genuine and impostor trials in a similar way to the development set, with 4053 male and 5351 female genuine trials, and 32833 male and 46736 female impostor trials. A summary of the development and evaluation sets is shown in Table II. We used the speech synthesis and voice conversion systems described above to generate artificial speech for both development and evaluation sets. During the execution of spoofing attacks, the transcript of an impostor trial was used as the textual input to each speech synthesis system, and the speech signal of the impostor trial was the input to each voice conversion system. As a result, the zero-effort impostor trial, the speech synthesis spoofed trial, and the voice conversion spoofed trial all have the same linguistic content (i.e., word sequence). In this way, the number of spoofed trials of one spoofing system is exactly the same as the number of impostor trials presented in Table II. This allows a fair comparison between non-spoofed and spoofed speaker verification results.
Only five of the available spoofing systems were used during development, with all thirteen spoofing systems (Table I) being run on the evaluation set. Hence, the total number of spoofed trials is 12981×5 and 17462×5 for males and females, respectively, for the development set, and 32833×13 and 46736×13 for male and female speakers, respectively, for the evaluation set. In the countermeasure evaluation protocol, we used a further 25 speakers' voices as training data and only implemented five attacks (as known attacks) on the training set. The 25 speakers do not appear in the development and evaluation sets for ASV, and this allows us to develop speaker- and gender-independent countermeasures. For countermeasure development and evaluation sets, the same speakers and same spoofed trials are used as those for ASV. This allows us to integrate countermeasures with ASV systems and to evaluate the integration performance. A summary of the countermeasure protocol is presented in Table III.
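The spoofed-trial totals quoted above follow directly from the impostor-trial counts, since each spoofing system produces one spoofed trial per impostor trial. The short sketch below simply carries out that arithmetic; the variable and function names are illustrative only.

```python
# Impostor-trial counts quoted in the text (see Table II)
DEV_IMPOSTOR = {"male": 12981, "female": 17462}
EVAL_IMPOSTOR = {"male": 32833, "female": 46736}
N_DEV_SYSTEMS, N_EVAL_SYSTEMS = 5, 13


def spoofed_trials(impostor_counts, n_systems):
    """Each spoofing system yields one spoofed trial per impostor trial."""
    return {gender: n * n_systems for gender, n in impostor_counts.items()}


dev_spoofed = spoofed_trials(DEV_IMPOSTOR, N_DEV_SYSTEMS)
eval_spoofed = spoofed_trials(EVAL_IMPOSTOR, N_EVAL_SYSTEMS)
print(dev_spoofed)   # {'male': 64905, 'female': 87310}
print(eval_spoofed)  # {'male': 426829, 'female': 607568}
```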

III. SPEAKER VERIFICATION SYSTEMS
We used three classical ASV systems: Gaussian Mixture Models with a Universal Background Model (GMM-UBM) [61], Joint Factor Analysis (JFA) [62] and i-vector with Probabilistic Linear Discriminant Analysis (PLDA) [63]. In this paper, we use PLDA to refer to this i-vector-PLDA system. Each system was implemented under the two enrolment scenarios: 5-utterance and 50-utterance enrolment. All systems used the same front-end to extract acoustic features: 19-dimensional Mel-Frequency Cepstral Coefficients (MFCCs) plus log-energy, with delta and delta-delta coefficients. By excluding the static energy feature (but retaining its delta and delta-delta), 59-dimensional feature vectors are obtained. To extract MFCCs, we applied a Hamming analysis window of 25 ms with a 10-ms shift and a mel-filter bank with 24 channels. We note that C0 is not retained in the extracted MFCCs. In practice, the SPro toolkit 12 was used to extract MFCCs. The AudioSeg toolkit was used to perform voice activity detection (VAD) [64].
GMM-UBM: with 512 Gaussian components in the UBM, and a client speaker model obtained by performing maximum a posteriori (MAP) adaptation, with the relevance factor set to 10. Only mean vectors were adapted, keeping diagonal covariance matrices and mixture weights the same as in the UBM.
JFA: using a UBM with the same 512 components as the GMM-UBM as well as eigenvoice and eigenchannel spaces with 300 and 100 dimensions, respectively. Cosine scoring was performed on the speaker variability vectors.
PLDA: a PLDA system operating in i-vector space. An i-vector is a low-dimensional vector representing a speaker- and channel-dependent GMM supervector M through a low-rank matrix T, as M = m + Tw, where m is a speaker- and channel-independent supervector, realised by a UBM supervector in this work; T is the total variability matrix; and w is the i-vector. In this work, 400-dimensional i-vectors were extracted with the maximum a posteriori (MAP) criterion using the same UBM as the JFA system. Linear discriminant analysis (LDA) was first applied to reduce the i-vector dimension to 200. Then, i-vectors were centred, length-normalised, and whitened. The whitening transformation was learned from i-vectors in the development set. After that, a Gaussian PLDA model was trained using the expectation-maximisation (EM) algorithm, which was run for 20 iterations. The rank of the eigenspace (number of columns in the eigenmatrix) was set to 100. Scoring was done with a log-likelihood ratio test. In practice, the MSR Identity Toolbox [65] was used to implement the PLDA system.
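As an illustration of the mean-only MAP adaptation used for the GMM-UBM system described above, the following is a minimal sketch of the standard relevance-MAP update for a diagonal-covariance GMM. It is not the implementation used in this work; the function name `map_adapt_means` and the array layout are assumptions, and only the relevance factor of 10 comes from the text.

```python
import numpy as np


def map_adapt_means(ubm_means, ubm_vars, ubm_weights, feats, relevance=10.0):
    """Mean-only MAP adaptation of a diagonal-covariance GMM.

    ubm_means, ubm_vars: (K, D); ubm_weights: (K,); feats: (T, D).
    Returns adapted means (K, D); variances and weights stay as in the UBM.
    """
    # Log-likelihood of every frame under every Gaussian: shape (T, K)
    diff = feats[:, None, :] - ubm_means[None, :, :]
    log_gauss = -0.5 * (np.sum(diff ** 2 / ubm_vars, axis=2)
                        + np.sum(np.log(2 * np.pi * ubm_vars), axis=1))
    log_post = np.log(ubm_weights) + log_gauss
    log_post -= log_post.max(axis=1, keepdims=True)   # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)           # component posteriors

    n_k = post.sum(axis=0)                            # soft counts, (K,)
    data_means = (post.T @ feats) / np.maximum(n_k, 1e-10)[:, None]

    # Data-dependent interpolation between the data mean and the UBM mean
    alpha = (n_k / (n_k + relevance))[:, None]
    return alpha * data_means + (1.0 - alpha) * ubm_means
```

Components that see little enrolment data (small soft count) stay close to the UBM mean, which is why the 5-utterance and 50-utterance conditions can yield quite different client models.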
We used three WSJ databases (WSJ0, WSJ1, and WSJCAM) and the Resource Management database (RM1) for training the UBM, eigenspaces, and LDA. The statistics of these databases are presented in Table IV. The sampling rate of all four databases is 16 kHz. We note that our preliminary experimental results suggested that WSJCAM was very useful for improving verification performance. The maximum likelihood criterion was employed to train the UBM and eigenspaces, while the Fisher criterion was used to train LDA.
The 50 enrolment utterances were merged into 10 sessions (each being the concatenation of 5 utterances); either 1 or 10 of these sessions were used in enrolment, for the two enrolment scenarios. For PLDA, when using 10 enrolment sessions, i-vectors were extracted from each session and then averaged, as suggested in [66]; for JFA, all features from all sessions were merged. We denote the ASV systems with 5 enrolment utterances (presented as 1 session) as GMM-UBM-5, JFA-5 or PLDA-5 and those with 50 enrolment utterances (presented as 10 sessions) as GMM-UBM-50, JFA-50 or PLDA-50.
12 Available at: http://www.irisa.fr/metiss/guig/spro/
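The session handling for the 50-utterance condition can be sketched as follows. The grouping of consecutive utterances into a session is an assumption (the text only states that each session concatenates 5 utterances), and both function names are hypothetical.

```python
import numpy as np


def merge_into_sessions(utterances, per_session=5):
    """Group enrolment utterances into sessions of `per_session` items.

    Assumption: consecutive utterances form a session.
    """
    return [utterances[i:i + per_session]
            for i in range(0, len(utterances), per_session)]


def average_ivector(session_ivectors):
    """Average per-session i-vectors into one enrolment i-vector."""
    return np.mean(np.asarray(session_ivectors, dtype=float), axis=0)
```

For example, 50 utterance identifiers passed to `merge_into_sessions` give 10 sessions of 5, and the 10 per-session i-vectors are then reduced to one vector by `average_ivector`.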

IV. ANTI-SPOOFING COUNTERMEASURES
We now examine five countermeasures 13 , described below along with the features they are based on, and then propose a fusion of these countermeasures in order to learn complementary information and improve anti-spoofing performance.
Given a speech signal x(n), short-time Fourier analysis can be applied to transform the signal from the time domain to the frequency domain by assuming the signal is quasi-stationary within a short time frame, e.g., 25 ms. The short-time Fourier transform of the speech signal can be represented as follows:
$$X(\omega) = |X(\omega)| \, e^{j\phi(\omega)} \qquad (1)$$
where X(ω) is the complex spectrum, |X(ω)| is the magnitude spectrum and ϕ(ω) is the phase spectrum. It is usual to define |X(ω)|^2 as the power spectrum, from which features that only contain magnitude information, e.g., MFCCs, can be derived. The proposed feature-based countermeasures are derived from the complex spectrum X(ω), which has two parts, a real part X_R(ω) and an imaginary part X_I(ω), and from which the phase spectrum ϕ(ω) can be obtained.
To extract frame-wise features, we employ a Hamming window of 25 ms with a 5 ms shift. The FFT length is set to 512.
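The framing described above can be sketched with NumPy as follows. The 16 kHz sampling rate is taken from the database description (Parts C-E); the function name `stft_frames` and the choice to discard a trailing partial frame are assumptions of this sketch.

```python
import numpy as np


def stft_frames(signal, sample_rate=16000, win_ms=25, shift_ms=5, n_fft=512):
    """Complex short-time spectrum, shape (n_frames, n_fft // 2 + 1)."""
    signal = np.asarray(signal, dtype=float)
    win_len = int(sample_rate * win_ms / 1000)    # 400 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)    # 80 samples at 16 kHz
    if len(signal) < win_len:
        raise ValueError("signal shorter than one analysis window")
    n_frames = 1 + (len(signal) - win_len) // shift
    window = np.hamming(win_len)
    # Index matrix selecting each overlapping frame
    idx = np.arange(win_len)[None, :] + shift * np.arange(n_frames)[:, None]
    frames = signal[idx] * window
    # rfft zero-pads each 400-sample frame to the 512-point FFT length
    return np.fft.rfft(frames, n=n_fft, axis=1)
```

The magnitude and phase spectra are then available as `np.abs(X)` and `np.angle(X)` for the returned array `X`.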

A. Cosine Normalised Phase Feature
Even though phase information is important in human speech perception [67], most speech synthesis and voice conversion systems use a simplified, minimum phase model which may introduce artefacts into the phase spectrum. The cosine normalised phase (CosPh) feature is derived from the phase spectrum, and can be used to discriminate between human and synthetic speech. The feature is computed as follows:
1) Unwrap the phase spectrum.
2) Compute the CosPh spectrum by applying the cosine function to the spectrum in 1) to normalise to [−1.0, 1.0].
3) Apply a discrete cosine transform (DCT) to the spectrum in 2).
4) Keep the first 18 cepstral coefficients, and compute their delta and delta-delta coefficients as features.
By normalizing the values of the unwrapped phase spectrum, we can simplify subsequent statistical modeling. We note that the motivation for applying the DCT is decorrelation and dimensionality reduction; C0 is not retained.
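The four steps above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the DCT is taken to be an orthonormal DCT-II, "C0 is not retained" is read as keeping coefficients 1 to 18, and the delta coefficients are approximated with `np.gradient` because the delta window is not specified in the text. The function names are hypothetical.

```python
import numpy as np


def dct_ii(x, n_out):
    """Orthonormal DCT-II along the last axis; returns the first n_out coefficients."""
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n)[None, :] + 1) / (2 * n))
    scale = np.full(n_out, np.sqrt(2.0 / n))
    scale[0] = np.sqrt(1.0 / n)
    return x @ (basis * scale[:, None]).T


def cosph_features(spectrum, n_ceps=18):
    """spectrum: complex array (n_frames, n_bins). Returns (n_frames, 3 * n_ceps)."""
    phase = np.unwrap(np.angle(spectrum), axis=1)   # step 1: unwrap along frequency
    cos_phase = np.cos(phase)                       # step 2: values in [-1, 1]
    ceps = dct_ii(cos_phase, n_ceps + 1)[:, 1:]     # steps 3-4: drop C0, keep 18
    delta = np.gradient(ceps, axis=0)               # assumed delta approximation
    delta2 = np.gradient(delta, axis=0)
    return np.hstack([ceps, delta, delta2])
```

With 18 static coefficients plus delta and delta-delta, each frame yields a 54-dimensional vector in this sketch.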

B. Modified Group Delay Cepstral Feature
In addition to the artefacts in the phase spectrum, the statistical averaging inherent in parametric modeling of the magnitude spectrum may also introduce artefacts, such as over-smoothed spectral envelopes. The use of both phase and magnitude spectra can therefore be useful for detecting synthetic speech. The Modified Group Delay Cepstral Coefficients (MGDCCs) can be used to detect artefacts in both spectra of synthetic speech. The MGDCC feature has also been used in speech recognition [68] and speaker verification [69]. The MGDCCs are derived from the complex spectrum as follows:
1) Apply the fast Fourier transform (FFT) to the windowed speech signals x(n) and nx(n) to compute X(ω) and Y(ω), respectively. Here nx(n) is the re-scaled signal of x(n).
2) Compute the cepstrally-smoothed power spectrum 14 |S(ω)|^2 of |X(ω)|^2.
3) Compute the MGD spectrum (R and I denote the real and imaginary parts of the spectrum):
$$\tau_{\rho}(\omega) = \frac{X_R(\omega) Y_R(\omega) + X_I(\omega) Y_I(\omega)}{|S(\omega)|^{2\rho}} \qquad (2)$$
4) Reshape τ_ρ(ω) as
$$\tau_{\rho,\gamma}(\omega) = \frac{\tau_{\rho}(\omega)}{|\tau_{\rho}(\omega)|} \, |\tau_{\rho}(\omega)|^{\gamma} \qquad (3)$$
5) Apply the DCT to τ_{ρ,γ}(ω) and keep the first 18 cepstral coefficients with their delta and delta-delta coefficients as MGDCC features.
In (2) and (3), ρ and γ are two weighting variables that control the shape of the MGD spectrum. We set ρ = 0.7 and γ = 0.2 based on the performance on the development set.

C. Segment-Based Modulation Feature
In speech synthesis and voice conversion, the speech signal is usually divided into overlapping frames for modeling, and this frame-by-frame or state-by-state modeling may introduce artefacts in the temporal domain due to the independence assumptions made by the underlying statistical model. These temporal artefacts are evident in the modulation domain and can be used to detect synthetic and voice-converted speech. The Segment-based Modulation Feature (SMF) is extracted from the MGD cepstrogram based on our previous work [31]. The procedure for computing the SMF is illustrated in Fig. 1 and described as follows:
1) Divide the 18-dimensional MGD cepstrogram into overlapping segments using a 50-frame window with a 20-frame shift.
2) Apply mean and variance normalisation to the MGD trajectory of each dimension so that it has zero mean and unit variance$^{15}$.
3) Take the FFT of the normalised 18-dimensional trajectories to compute modulation spectra.
4) Concatenate the modulation spectra in one cepstrogram segment into a supervector, and use this as the SMF feature vector.
5) Average all the SMF vectors of one utterance to get an average feature vector. This averaged feature vector is used as the feature vector for the utterance.
In practice, we used a 64-point FFT to extract a 32-dimensional modulation spectrum for each MGD trajectory. Hence, the modulation supervector of each segment has dimensionality 18 × 32 = 576. We pass this supervector to a support vector machine (SVM) for classification. We employed the LIBSVM toolkit [70] to implement the SVM, with a radial basis kernel and the penalty factor set to 34.
14 The cepstrally-smoothed spectrum is obtained through the following steps: a) compute the log-amplitude spectrum from X(ω), and apply a median filter to smooth the spectrum; b) apply the DCT to the log spectrum and keep the first 30 cepstral coefficients; c) apply the inverse DCT to the cepstral coefficients to obtain the cepstrally-smoothed spectrum S(ω).
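The segment-based procedure can be sketched as below, taking a precomputed 18-dimensional cepstrogram as input. This is an illustrative sketch only: the function name is hypothetical, and we assume that the 32-dimensional modulation spectrum consists of the first 32 magnitude bins of the 64-point FFT, which the description above does not state explicitly.

```python
import numpy as np

def segment_modulation_features(ceps, win=50, shift=20, n_fft=64):
    """Average SMF supervector for one utterance.

    ceps: array of shape (frames, 18), e.g. an MGD cepstrogram.
    """
    if ceps.shape[0] < win:
        raise ValueError("utterance is shorter than one segment")
    vectors = []
    for start in range(0, ceps.shape[0] - win + 1, shift):       # step 1: segmentation
        seg = ceps[start:start + win]
        seg = (seg - seg.mean(axis=0)) / (seg.std(axis=0) + 1e-10)  # step 2: normalise
        mod = np.abs(np.fft.rfft(seg, n_fft, axis=0))[:n_fft // 2]  # step 3: 32 bins
        vectors.append(mod.T.ravel())                             # step 4: 18 x 32 = 576
    return np.mean(vectors, axis=0)                               # step 5: utterance average
```

The returned 576-dimensional vector is what would be passed to the SVM; the classifier itself is not shown here.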

D. Utterance-Based Modulation Feature
To extract the segment-based modulation feature, a speech signal needs to be divided into short segments first and then the corresponding modulation features are extracted for each segment. An alternative approach is to extract modulation features at the utterance level, to obtain Utterance-based Modulation Features (UMFs).
The process to extract UMFs is similar to that of SMFs, but only steps 2-4 are applied, without first dividing the utterance into segments. In practice, we used a 1024-point FFT to extract the modulation spectrum for each MGD trajectory, then applied a DCT to the modulation spectrum and kept the first 32 coefficients as features. Hence, the dimensionalities of UMF and SMF for each utterance are the same: 18 × 32 = 576. Again, we pass the feature vector to an SVM for classification. The configuration of the SVM here is the same as that for SMF in Section IV-C.
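A corresponding utterance-level sketch is given below. The function name is hypothetical, the DCT is an unnormalised DCT-II written out explicitly, and we assume that the DCT is applied to the magnitude of the modulation spectrum. Note that `np.fft.rfft` with `n=1024` truncates trajectories longer than 1024 frames, which is an artefact of this sketch.

```python
import numpy as np

def utterance_modulation_features(ceps, n_fft=1024, n_keep=32):
    """UMF vector for one utterance from a (frames, 18) cepstrogram."""
    norm = (ceps - ceps.mean(axis=0)) / (ceps.std(axis=0) + 1e-10)  # per-dimension normalisation
    mod = np.abs(np.fft.rfft(norm, n_fft, axis=0))                  # (513, 18) modulation spectra
    n_bins = mod.shape[0]
    k = np.arange(n_keep)[:, None]
    m = np.arange(n_bins)[None, :]
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n_bins))          # DCT-II basis, (32, 513)
    return (basis @ mod).T.ravel()                                  # 18 x 32 = 576 values
```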

E. Pitch Pattern Feature
The prosody of synthetic speech is generally not the same as natural speech [71] and therefore the pitch pattern is another good candidate feature for a countermeasure.
and n, m are the sample instant and lag, respectively, over which the autocorrelation is computed. The lag parameter is chosen such that pitch frequencies can be observed [72]; in this work, we choose 32 ≤ m ≤ 320 for a sample rate of 16 kHz.
Once the pitch pattern is computed, we segment it into a binary pitch pattern image through the rule where θ is a threshold; we set θ = 1/√2 for all n, based on preliminary results on the development set. An example pitch pattern image is shown in Fig. 2.
Extracting features from the pitch pattern is a two-step process: 1) computation of the pitch pattern; 2) image analysis. First, the pitch pattern is computed using (4) and segmented using (7) to form a binary image. In the second step, image processing of the segmented binary pitch pattern is performed in order to extract the connected components (CCs), i.e., black regions in Fig. 2. This processing includes determining the bounding box and area of a CC, which are then used to distinguish between two types of CC: pitch pattern connected components (PPCC) and irregularly-shaped components or artefacts.
The resulting CCs are then analysed and the mean pitch stability µ_S, mean pitch stability range µ_R, and time support (TS) of each CC are computed as in [29]. The proposed image processing-based approach determines parameters on a per-connected-component basis and then computes statistics over the connected components of the utterance. The six-element utterance feature vector used for classification contains µ_R and the TS of the artefacts, the number of artefacts, µ_S and the TS of the PPCC, and the standard deviation of the TS of the PPCC. Other utterance features were considered during the training and development stage but were found not to contribute to the classifier accuracy. For the pitch pattern countermeasure, a maximum likelihood classifier based on the log-likelihoods computed from the utterance feature vectors was used for classification. During training, human and spoofing utterance feature vectors were modeled as multivariate Gaussian distributions with full covariance matrices. During testing, the utterance is determined to be human if the log-likelihood ratio is greater than a threshold calibrated to produce the equal error rate (EER) on the development set.
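The classification stage described above (two full-covariance Gaussians and a log-likelihood ratio test) can be sketched as follows. The sketch covers only the classifier, not the extraction of the six pitch pattern statistics, and the function names are hypothetical.

```python
import numpy as np

def fit_gaussian(vectors):
    """Maximum-likelihood style fit of a full-covariance Gaussian to (N, D) vectors."""
    return vectors.mean(axis=0), np.cov(vectors, rowvar=False)

def log_likelihood(x, mean, cov):
    """Log density of a multivariate Gaussian at x."""
    d = len(mean)
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def llr(x, human_model, spoof_model):
    """Log-likelihood ratio; larger values favour the human hypothesis."""
    return log_likelihood(x, *human_model) - log_likelihood(x, *spoof_model)
```

An utterance would be labelled human when `llr(...)` exceeds a threshold chosen on development data, for example at the EER operating point.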

F. Fused Countermeasure
To benefit from the multiple feature-based countermeasures, we propose a fused countermeasure. In speaker verification, system fusion is one way to combine multiple individual speaker verification systems to achieve better performance [73], [74]. A similar strategy can be applied for anti-spoofing, as each feature-based countermeasure discussed above has its own strengths and weaknesses. For example, the pitch pattern feature-based countermeasure is expected to work well in detecting waveform concatenation-based spoofing attacks, while other countermeasures are expected to detect phase and temporal artefacts. We expect the fused countermeasure to benefit from the strengths of each individual countermeasure.
We perform linear fusion at the score level. We first train a linear fusion function on the development set, which only contains known attacks, and then apply the fusion function to the evaluation scores; finally, the fused score is used to discriminate between human and spoofed speech. In practice, we used the BOSARIS Toolkit$^{16}$ to train the fusion function.
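To make the idea of score-level linear fusion concrete, the following sketch trains one weight per countermeasure plus an offset by logistic regression with plain gradient descent. It is not the BOSARIS Toolkit and does not reproduce its training procedure; the function names, learning rate and iteration count are assumptions chosen for illustration.

```python
import numpy as np

def train_linear_fusion(scores, labels, lr=0.5, n_iter=2000):
    """Fit fusion weights on development scores.

    scores: (N, K) matrix with one column per countermeasure.
    labels: length-N array, 1.0 for human speech and 0.0 for spoofed speech.
    """
    n, k = scores.shape
    w, b = np.zeros(k), 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(scores @ w + b)))  # predicted probability of "human"
        grad = p - labels                             # gradient of the logistic loss
        w -= lr * scores.T @ grad / n
        b -= lr * grad.mean()
    return w, b

def fuse(scores, w, b):
    """Fused score: weighted sum of the individual countermeasure scores."""
    return scores @ w + b
```

A countermeasure whose development-set scores separate the classes poorly receives a small weight under this kind of training, so it contributes little to the fused score.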

A. Evaluation Metric
In both speaker verification and spoofing detection, there are two types of errors: 1) genuine or human speech is classified as impostor or spoofed speech; 2) impostor or spoofed speech is accepted as genuine or human. The first type of error is a false rejection, while the second type is a false acceptance. When the false acceptance rate (FAR) equals the false rejection rate (FRR), we are at the equal error rate (EER) point. In this work, when reporting the FARs and FRRs for a specific spoofing algorithm, the decision threshold is set to achieve the EER operating point for that spoofing algorithm. When reporting overall spoofing performance, all the spoofed samples are pooled together and treated as one (unknown) spoofing algorithm when setting the threshold, because in practice one may not know the exact type of spoofing algorithm.
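As a small worked example of these definitions, the sketch below computes FAR and FRR at a given threshold and locates the EER by sweeping the threshold over the observed scores. It assumes that higher scores indicate genuine (or human) speech and that a trial is accepted when its score is at least the threshold; the function names are hypothetical.

```python
def far_frr(genuine, impostor, threshold):
    """False acceptance and false rejection rates at one threshold."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr

def equal_error_rate(genuine, impostor):
    """Return (EER estimate, threshold) where FAR and FRR are closest."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = far_frr(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]

eer, threshold = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
# eer == 0.25 at threshold 0.6: one of four impostors is accepted
# and one of four genuine trials is rejected.
```

With finite score sets the FAR and FRR curves rarely cross exactly, so the sketch reports the average of the two rates at the closest point.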

B. Spoofing ASV Systems without Countermeasures
We evaluated the performance of the ASV systems for the various synthetic speech and voice conversion variants described in Section II-A. Prior to the evaluation, the ASV decision threshold was set to the EER point on the development set, using only human speech.
Speaker verification results are presented in Table V. The FARs for the baseline experiment, which uses only human speech, are low (as expected) because the SAS database has near-ideal recordings, free from channel and background noise. In particular, the lowest FARs for GMM, JFA and PLDA systems are 0.09%, 1.25% and 1.16%, respectively. Note that the short duration of the trials precludes even lower FARs and FRRs.
Whilst the ASV systems achieve excellent verification performance, they are still vulnerable to spoofing. The simple VC-C1 spoofing attack, which only modifies the spectral slope of the source speaker, increases the FAR for nearly every ASV system. The attacks using speech synthesis or voice conversion, with more advanced algorithms, lead to FARs as high as 99.95%. On average, speech synthesis leads to FARs of over 95% for male voices and over 80% for female voices, and more sophisticated voice conversion algorithms lead to FARs of close to 80% for both male and female voices. These observations are consistent with previous studies on clean speech [16] and telephone quality speech [18], [19], and confirm the vulnerability of ASV systems to a diverse range of spoofing attacks. In general, our experiments suggest that it is easier to spoof male speakers than female speakers, in the sense that the FARs for the various spoofing attacks for female speakers are generally lower than those for male speakers. We speculate that it is relatively harder to model female speech or perform female-to-female conversion due to the higher variability of female speech.
16 https://sites.google.com/site/bosaristoolkit/
Although ASV systems that have more enrolment data available to them give lower FARs in the baseline case, they are not necessarily more resistant to spoofing attack. For example, under the VC-FEST attack, the FARs of JFA-5 and PLDA-5 male systems are 91.25% and 97.41%, respectively, and the FARs of JFA-50 and PLDA-50 are even higher at 97.71% and 99.54%, respectively. Similar patterns can be observed for other spoofing algorithms, as well as for female speech.
From the perspective of spoofing, the first interesting observation is that voice conversion is as effective at spoofing as speech synthesis, given the same amount of training data. Most of the speech synthesis systems used in this work require a large amount of data to train the average voice model, which is adapted to the target. On the other hand, most voice conversion algorithms, including VC-FEST, VC-GMM and VC-FS, only need source and target speech data to train their conversion functions. Voice conversion spoofing is sometimes even more effective than speech synthesis. It is worth highlighting that the publicly-available voice conversion toolkit VC-FEST is at least as effective as the other voice conversion and speech synthesis techniques.
The second interesting observation is that, although VC-TVC and VC-EVC use a Japanese database to train eigenvoices for adaptation to English data, these methods still increase FARs as much as the other variants. This suggests that attackers could use alternate speech resources, i.e. speech corpora in another language, if they cannot find enough resources for the target language.
The third observation is that the use of higher sampling rate training data in speech synthesis results in higher FARs of ASV systems. This suggests that such data includes more speaker-specific characteristics and that attackers can use this to conduct more effective spoofing if they have access to such data.
The last observation is that more training data can improve the effectiveness of speech synthesis and voice conversion spoofing systems. Comparing SS-SMALL-16k and SS-LARGE-16k, using 40 instead of 24 training utterances results in an increase of about 4% absolute FAR. In contrast, using more enrollment data for ASV systems does not seem to be helpful in defending against spoofing attacks (except VC-C1), although it does improve the baseline ASV performance without spoofing. We speculate that, as the spoofed speech sounds more like the target speaker, it will achieve higher likelihood scores under any target speaker model that has been trained using more enrollment data, and hence results in higher FARs. This also explains why ASV systems with more enrollment data succeed in defending against the VC-C1 attack, which can be easily distinguished by the human ear in terms of speaker similarity, as shown in Table VIII.
Given the wide-ranging spoofing results in Table V

C. Evaluation of Stand-Alone Countermeasures
We conducted experiments to evaluate the performance of stand-alone countermeasures, i.e. their ability to discriminate between human and artificial speech. When training countermeasures, five of the spoofing systems listed in Table I,  For MFC, CosPh, MGD and PP features, GMM-based maximum likelihood classifiers were employed, while for SMS and UMS features, SVM classifiers were used. Whilst many combinations of features and classifier could of course be imagined, these choices give us a representative range of countermeasures to compare. For each countermeasure, the detection threshold was set to achieve the EER point on the development set under all five known attacks, and then the countermeasure was applied to the evaluation set to compute the FARs shown in Table VI. These results show that the frame-based features MFCC, CosPh and MGD achieve better performance than the long-term features SMS, UMS and PP. Even though the modulation features SMS and UMS are derived from the MGD features, they do not perform as well as frame-based MGD features. This observation is consistent with our previous work [31]. In the database, due to the short duration of trials, long-term features generally only provide a rather small number of feature vectors per utterance.
With respect to the frame-based features, the MGD-based countermeasure achieves the best overall performance in terms of low FARs and works well at detecting most types of spoofed speech, with the notable exception of the SS-MARY attack. The MGD features include both magnitude and phase spectrum information, whereas MFCCs only capture the magnitude spectrum and CosPh only the phase spectrum. With respect to long-term features, both SMS and UMS perform well at detecting statistical parametric speech synthesis spoofing, yet fail to detect most of the voice conversion algorithms or unit selection speech synthesis.
The pitch pattern countermeasure detects synthetic speech well, but does not detect some voice conversion speech such as that from VC-C1, VC-FEST, VC-KPLS and VC-LSP. This is probably due to the fact that speech synthesis usually predicts fundamental frequency (F0) from text (and so produces rather unnatural trajectories) whereas voice conversion usually copies a source speaker's F0 trajectories to generate a target speaker's voice. Hence, voice conversion introduces fewer pitch pattern artefacts than speech synthesis. We note that the pitch pattern countermeasure achieves the best performance of 1.96% FAR against the SS-MARY unit selection synthesis attack. In general, most of the countermeasures achieve better performance for known attacks than for unknown attacks, as spoofing data from known attacks are available for training countermeasures and those from unknown attacks are not available to train the detectors. From the perspective of spoofing algorithms, SS-MARY is the most difficult to detect, and this is presumed to be due to the fact that it uses original waveforms to generate spoofed speech and thus introduces fewer artefacts when compared with other methods.
We also fused the six individual countermeasures at the score level to create a new countermeasure as detailed in Section IV-F. The linear combination weights for MFC, CosPh, MGD, SMS, UMS and PP countermeasures are 26.71, 9.56,

[Table residue. Caption fragment: The decision threshold is set to the ASV equal error rate (EER) point on the development set using only human speech. Columns: male and female speakers, each with GMM-UBM-5, GMM-UBM-50, JFA-5, JFA-50, PLDA-5 and PLDA-50 systems; rows: spoofing algorithms.]

Although the PP countermeasure can discriminate extremely well between human and SS-MARY speech, this ability is not picked up by the fused countermeasure because PP has a low weight. This is because the weights were learned on the development set, which of course only contains known attacks (the first group of five spoofing systems in Table VI), but the PP countermeasure performs poorly on many of those known attacks, especially the voice conversion ones. Hence, it is given a low weight, and essentially ignored in the fused countermeasure.

D. Spoofing ASV Systems that Employ a Countermeasure
We conducted experiments to evaluate the overall performance of speaker verification systems that include a countermeasure. We only consider the proposed fused countermeasure here, because it exhibited better overall performance than any individual countermeasure. We integrated the fused countermeasure with each of the ASV systems as a post-processing module, as illustrated in Fig. 3, to reflect the practical use case in which a separately-developed stand-alone countermeasure is added to an already-deployed ASV system [16] without significant modification of that system.
Fig. 3. A speaker verification system with an integrated countermeasure.
The integrated system only accepts a claimed identity if it is accepted by the speaker verification system and classified as human speech by the countermeasure [16].
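The decision rule of the integrated system amounts to a logical AND of two thresholded scores, which can be sketched as below. The function names and the convention that higher scores mean "target speaker" and "human speech" are assumptions made for the illustration.

```python
def integrated_accept(asv_score, cm_score, asv_threshold, cm_threshold):
    """Accept only if the ASV system accepts AND the countermeasure says human."""
    return asv_score >= asv_threshold and cm_score >= cm_threshold

def false_acceptance_rate(spoof_trials, asv_threshold, cm_threshold):
    """Fraction of spoofed trials, given as (asv_score, cm_score) pairs, that are accepted."""
    accepted = sum(
        integrated_accept(asv, cm, asv_threshold, cm_threshold)
        for asv, cm in spoof_trials
    )
    return accepted / len(spoof_trials)
```

Because a trial must pass both stages, adding the countermeasure can only lower the FAR of the ASV system; the price is that human trials wrongly flagged by the countermeasure add to the FRR.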
A good countermeasure should reduce FARs by rejecting non-human speech. The FAR results of systems with an integrated countermeasure are presented in Table VII. Comparing against the FARs of the ASV systems without a countermeasure in Table V, we can make the following observations. First, the FARs of all ASV systems are reduced dramatically for both male and female speech, and go down from about 70%-100% to below 1% in the face of most types of spoofing attack. This indicates that the fused countermeasure can be effectively integrated with any ASV system without needing additional joint optimisation. Second, the integrated system is robust against attacks from various state-of-the-art statistical parametric speech synthesis and voice conversion systems. However, it is still vulnerable to the unit selection synthesis (SS-MARY) spoofing attack. This suggests that new countermeasures are needed specifically for waveform selection-based spoofing attacks. Third, although our stand-alone ASV systems achieve better performance for male than for female speakers, the integrated systems work equally well for both. In contrast, others have reported integrated systems working better for male speakers than for female speakers [40].
In general, by using the proposed fused countermeasure, the FARs of ASV systems under spoofing attack are reduced significantly. This indicates that the countermeasure is effective in detecting spoofing attacks.

E. Human Versus Machine
To complement the comparisons already presented, we now benchmark automatic (machine-based) methods against speaker verification by human listeners. To do this, three listening tests were conducted: two speaker verification tasks and one spoofing detection task. The first verification task contained only human speech signals; the second also contained human speech, but all test signals were artificial (synthetic and voice-converted speech). The third task, a detection task, contained both human and artificial speech signals, and the goal for the listener was to correctly discriminate between them. All three tasks covered the 46 target speakers in the evaluation set of the SAS corpus.
In order to encourage listeners to engage with the tasks to the best of their ability, they were presented as role-play scenarios. The human listening tasks were designed to be as similar to the ASV tasks as possible (to facilitate direct comparisons), whilst taking into account listener constraints such as fatigue or memory limitations. Listening protocols were inspired by the ones used in [75] and the experiments were carried out via a web browser. In total, 100 native English listeners took part in the experiments. They were seated in sound-isolated booths and listened to all samples using Beyerdynamic DT 770 PRO headphones. Each listener performed three tasks and, on average, it took about an hour to complete the experiment. We only report results from listeners who completed all sessions in each task.
Task 1: Speaker Verification of Human Speech: In the speaker verification task, listeners were asked to imagine they were responsible for giving people access to their bank accounts. They were informed that they would only have a short recording of a person's voice to base their judgement on. It was stressed that it was important to not give access to "impostors" but equally important that access was given to the "bank account holder".
The listeners were given five sentences from each target speaker to familiarise themselves with the voice. After listening to the training samples, they were given 21 trials to judge as "same" or "different." The trials were pairs of samples which include a reference and a test sample. This was repeated for three different target speakers. In this task, each target speaker was judged by 5 listeners. The number of targets versus non-targets varied per speaker to keep listeners from keeping count for individual speakers. On average there were 10 targets and 11 non-targets per speaker. Genders were not mixed within a trial.
Listeners recognised impostors as genuine targets 2.39% of the time (FAR) while 9.38% of genuine trials were misclassified as impostors (FRR). Comparing with the baseline ASV performance in Table V, the results demonstrate that the speaker verification performance of humans is not as good as that of the best automatic systems. For example, PLDA-5 gives a FAR around 1.5% for both male and female speakers. This finding is similar to that in [76] for the NIST SRE 2008 dataset.
Task 2: Speaker Verification of Artificial Speech: In the second task, listeners were asked to decide whether an artificial voice$^{17}$ sounded like the original speaker's voice. The listeners were informed that the artificial voice would sometimes sound quite degraded but were asked to ignore the degradations as much as possible. Additionally, they were told that there would be artificial voices that were supposed to sound like the intended speaker as well as artificial voices that were not supposed to match the original speaker. The task was framed as "Your challenge is to decide which of the artificial voices are based on the 'bank account holder's voice' and which are based on an 'impostor's voice.'" As in the first task, listeners were given five natural speech samples from the intended speaker to familiarise themselves with the voice. After listening to these training samples, subjects were presented with pairs of reference and test samples to judge as "same" or "different." It was made clear to the listeners that the test sample would be an artificial voice. Each target speaker was judged by 5 listeners. For each target speaker there were 65 trials (13 systems, each presented 5 times). On average there were 39 targets and 26 non-targets per speaker. Once again gender was not mixed within any of the trials.
The results are presented in Table VIII (second column). The acceptance rate is not directly comparable with the automatic results presented in Table V, but the relative differences across spoofing algorithms are comparable$^{18}$.
It can be observed that SS-MARY gives the highest acceptance rate (i.e., listeners said that it sounded like the original speaker), while VC-C1 gives the lowest acceptance rate; this pattern is similar to that in the ASV results, where SS-MARY achieves relatively high FARs and VC-C1 relatively low FARs. The results also indicate that spoofing systems that use more training data generally achieve higher acceptance rates with human listeners, mirroring what we saw earlier in the ASV results in Section V-B. An interesting difference between the ASV and human listener results is that, for human listeners, the use of higher sampling rate speech by some spoofing systems (SS-SMALL-48, SS-LARGE-48) leads to a lower acceptance rate than for lower sampling rate training data (SS-SMALL-16, SS-LARGE-16). This suggests that, whilst these types of spoofing systems (SS: statistical parametric speech synthesis) are able to generate information above 8 kHz that contributes to improved naturalness [51], listeners judge it as being more dissimilar to the natural speaker. This similarity observation is different from that in [51], where speaker-dependent speech synthesis is examined. An informal listening test gives the impression that SS-LARGE-48/SS-SMALL-48 produces more natural speech than SS-LARGE-16/SS-SMALL-16, as expected. However, as the reference target speech is a clean recording without any distortion, we speculate that it is more challenging for listeners to decide on the speaker similarity of the poor quality, buzzy-sounding speech of SS-LARGE-16/SS-SMALL-16.
Task 3: Detection: In the final task, listeners were asked to judge whether a speech sample was a recording of a human voice, or a sample of an artificial voice. The challenge to the listeners was formulated as: "Imagine an impostor trying to gain access to a bank account by mimicking a person's voice using speech technology. You must not let this happen. Your challenge in this final section is to correctly tell whether or not the sample is of a human or of a machine." Listeners were again given some training speech signals. They listened to five samples of human speech from one speaker (not present in the detection task) and five examples of artificial speech generated using five known spoofing systems.

VI. DISCUSSION AND FUTURE WORK
In this section, we summarise the findings in this work, and also discuss some of its limitations. Both the findings and the limitations suggest areas needing further research.

A. Research Findings
The main findings from this study are:

B. Limitations and Future Directions
We suggest future work in ASV spoofing and countermeasures along the following lines:
• More diverse spoofing materials: The current SAS database is biased towards the STRAIGHT vocoder, and only one type of unit selection system was used to generate the waveform concatenation materials. Moreover, replay attack, which does not require any speech processing knowledge on the part of the attacker, was not considered here. A generalised countermeasure should be robust against all spoofing algorithms and any vocoder. The development of generalised countermeasures might be accelerated by collecting more diverse spoofing materials. As the amount of spoofing materials increases, ASV systems can access more representative prior information about spoofing, and the security of ASV systems should be enhanced as a result.
• Truly generalised countermeasures: The proposed countermeasures did not generalise well to unknown attacks, and in particular to the SS-MARY attack. This is because the proposed countermeasures were biased towards detecting phase artefacts. To detect the SS-MARY attack or similar waveform concatenation attacks, we suggest further development of pitch pattern-based countermeasures. Discontinuity detection for concatenative speech synthesis [77] might also be useful in inspiring novel countermeasures against such attacks. Lastly, novel system fusion methods might also be a way to implement generalised countermeasures. A good fusion method should be able to benefit from all the individual countermeasures. Our proposed fusion method failed to take advantage of the strengths of the pitch pattern countermeasure, for example.
• Noise or channel robustness: The work here deliberately focussed on clean speech without significant noise or channel effects. To make the proposed countermeasures appropriate for practical applications, it would of course be important to take channel and noise issues into consideration.
• Text-dependent ASV: The current work assumes text-independent speaker verification. To make systems suitable for other voice authentication applications, spoofing countermeasures for text-dependent ASV must also be developed.

VII. CONCLUSIONS
All existing literature that we are aware of in the areas of ASV spoofing and anti-spoofing reports results for just one or two spoofing algorithms, and generally assumes prior knowledge of the spoofing algorithm(s) in order to implement matching countermeasures. As discussed in [8], the lack of a large-scale, standardised dataset and protocol was a fundamental barrier to progress in this area. We hope that this situation is now rectified by our release of the standard dataset SAS, combined with the benchmark results presented in this paper.
To achieve this, speech synthesis, voice conversion, and speaker verification researchers worked together to develop state-of-the-art systems from which to generate spoofing materials, and thus to develop countermeasures. The SAS corpus developed in this work is publicly released under a CC-BY license [78]. We hope that the availability of the SAS corpus will facilitate reproducible research and, as a consequence, drive forward the development of novel generalised countermeasures against speaker verification system spoofing attacks.