A Multilinear Tongue Model Derived from Speech Related MRI Data of the Human Vocal Tract

We present a multilinear statistical model of the human tongue that captures anatomical and tongue pose related shape variations separately. The model was derived from 3D magnetic resonance imaging data of 11 speakers sustaining speech related vocal tract configurations. The extraction was performed by using a minimally supervised method that uses as basis an image segmentation approach and a template fitting technique. Furthermore, it uses image denoising to deal with possibly corrupt data, palate surface information reconstruction to handle palatal tongue contacts, and a bootstrap strategy to refine the obtained shapes. Our experiments concluded that limiting the degrees of freedom for the anatomical and speech related variations to 5 and 4 respectively produces a model that can reliably register unknown data while avoiding overfitting effects.


Introduction
The tongue as one of the main human articulators plays an important role in speech production. In speech science, it is therefore of great interest to understand its shape and motion during human articulation. The results of such an analysis could be used to derive a tongue model that is able to replicate those shape changes by manipulating a few parameters. Ideally, these parameters should be separated into two sets: One set should control the appearance of the tongue that is related to the anatomy of the speaker. The other set adapts the shape to the sound that should be produced. Such a tongue model would also provide insight into how a change of anatomy affects the articulation for speech production.
Areas of application for a tongue model are, for example, virtual avatars for multimodal spoken interaction or computer-aided pronunciation training. In the latter case, the user can be provided with visual information on how to move the tongue to produce a specific sound (Engwall, 2008). Additionally, such a model might be employed in an articulatory speech synthesis framework to approximate the vocal tract area function. Finally, a tongue model can also simply be used as a prior to register new data that is possibly very sparse.
We notice that we need data about the tongue shape during speech production to derive such a model. However, most of the articulators are contained inside the human mouth and therefore partially or completely hidden from view. This means that traditional imaging modalities based on light, e.g. photography, are of limited use for acquiring the desired shape information. Currently, magnetic resonance imaging (MRI) can be regarded as the stateof-the-art technique for investigating the interior of the human vocal tract during speech. It is non-invasive and non-hazardous to the observed speaker and in contrast to ultrasound or electromagnetic articulography (EMA), it is able to provide dense volumetric measurements of the vocal tract. Moreover, there is work on adapting the MRI method to the needs of speech research: Here, advances have been made to improve the measurement time (Kim et al., 2009) and quality of the acquired scans (Lingala et al., 2016).
The measured MRI data only contains raw image data and has to be further processed to extract the desired shape information. In our case, a suitable shape representation is given by a polygon mesh. Such a representation offers the advantage that it can be directly used in different fields. For example, in computer graphics such meshes are used to generate animations of complex objects (Botsch et al., 2010, Chapter 9) or to model objects of highly complex geometry and topology. Furthermore, polygon models have been used in speech processing to generate acoustical simulations (Blandin et al., 2015). More importantly, they have been used to perform a statistical analysis of a class of shapes, like for example human bodies (Allen et al., 2003), faces (Blanz and Vetter, 1999), or tongues (Badin and Serrurier, 2006).
Ideally, the extraction process should be at least minimally supervised, as doing it manually takes a lot of time and might require the expertise of an anatomical expert.
Afterwards, the collection of estimated meshes can be analyzed to derive a statistical tongue model. Such a model offers the advantage that for each generated tongue shape its probability can be measured. This is helpful in situations where the tongue model is used as a statistical prior.
In literature, a lot of research has focused on analyzing the tongue shape during speech production. The works by Engwall (2000), Serrurier (2006), andFang et al. (2016) used 3D MRI data of a single speaker to analyze the speech related shape variations by using principal component analysis (PCA) or linear component analysis (LCA) in the case of Engwall. They annotated the contour of the tongue manually in the scan data and used a mesh as shape representation.
There also exists research that aims at analyzing the anatomical and speech related shape variations separately: Harshman et al. (1977) investigated these variations in 2D X-Ray data. We note that this image modality is nowadays no longer used for this purpose due to its known negative health effects. Analysis on 2D MRI was conducted by Hoole et al. (2000), Ananthakrishnan et al. (2010), Vargas et al. (2012b), and Vargas et al. (2012a). Finally, Zheng et al. (2003) performed this analysis on sparse sets of 65 points that were manually extracted from 3D MRI scans.
On the whole, we see that there are still some open issues: Previous work focused only on 2D data or sparse 3D data to analyze the anatomical and speech related variations. This sparse data representation might not be sufficient to capture the whole complex structure of the tongue. Initial work investigating these variations in 3D meshes obtained from MRI data of 9 speakers was presented with Hoole et al. (2003), but neither evaluated nor published (Hoole, personal communication). Moreover, work that focused on the speech related shape variations of a more dense 3D representation of the tongue required manual annotation of the used MRI data, which makes it infeasible for large collections of data. Here, work exists that aims at facilitating the tongue shape extraction from MRI data. However, such approaches are often limited because they are restricted to 2D (Peng et al., 2010;Eryildirim and Berger, 2011;Raeesy et al., 2013), produce only a lowlevel volume segmentation (Lee et al., 2013), or require an anatomical expert to provide the tongue templates (Harandi et al., 2014).
In this paper, we present an extended version of our previous work (Hewer et al., 2014). Its contributions can be summarized as follows: We propose a minimally supervised framework for extracting tongue meshes from 3D MRI data. It is minimally supervised in the sense that a user only has to annotate a few landmarks in the scan data, which significantly reduces the burden on the user compared to annotating the entire tongue surface. Originally, it combined an image segmentation technique and a template matching approach to achieve that goal. Here, we add an image denoising method to the framework in order to deal with possibly corrupt data. Moreover, we modify the template matching approach to better handle volumetric point clouds. Furthermore, we integrate a strategy for restoring missing tongue surface information that occurs due to contact between hard palate and tongue. This improvement increases the amount of tongue shape configurations we can register. Finally, the framework is augmented by making use of a bootstrapping strategy, which refines the quality of the obtained shape meshes.
All these modifications allowed us to register speech related tongue shapes of 11 speakers that we used to derive a multilinear statistical model that captures almost the entire complex 3D surface geometry of the tongue and allows the anatomy and pose related variations to be modified separately.
We examined the obtained model with respect to its compactness, generalization, and specificity properties. In the case of the specificity analysis, we investigated the surface parts of the tongue mesh that play an important role during human articulation. The results of our experiments motivated us to choose a model with 5 degrees of freedom for the anatomy and 4 for the speech related tongue pose.
The remainder of the paper is organized as follows: In the next section, we start by describing how surface information of the vocal tract can be extracted from a given 3D MRI scan by denoising it and applying an image segmentation approach. We proceed by discussing the modified template matching approach in section 3 and also present the used templates of our approach. The next section, section 4, is dedicated to describing how we estimate a tongue mesh from the surface information by using the template fitting. Here, we present the bootstrapping strategy used and our approach to restore missing tongue surface information that is caused by contact between tongue and hard palate. Next, we turn to the multilinear tongue model in section 5. In this section, we outline how the acquired mesh collection can be aligned to only contain speech and anatomy related tongue shape variations and how the model is derived. We then turn to the evaluation of our approach in section 6 where we apply it to MRI scans of two datasets. Furthermore, we conduct experiments to evaluate the compactness, generalization, and specificity properties of the acquired model. Finally, we conclude in section 7 and outline possible future work.

Extracting Surface Information From MRI
As a first step, we want to extract a point cloud Q := {(q i , n i )} from an MRI scan that contains the surface points q i and the associated normals n i of the major articulators and related tissue. We are using a pure geometric representation of this surface information because it is easy to combine two point clouds into a single one. This is helpful in situations where we want to restore missing information in a point cloud Q that is present in another cloud Q * . As we are using image processing methods, we briefly describe how we are treating a volumetric MRI scan as a 3D image. We may represent an MRI scan as a function where s min and s max are real values. Here, Ω ⊂ R 3 is a discrete rectangular domain consisting of the sample positions where the scanner took the measurements. These coordinates are arranged on a regular grid where we have the grid spacings h x , h y , and h z . We say that s(q) represents the measured hydrogen molecule density at sample position q ∈ Ω.
This scan can be interpreted as a gray-scale 3D image by applying a quantization operator to the hydrogen density values that scales them to a standard gray-scale. Here, we decided to use a standard visualization where bright values indicate a high density and dark values a low density. Figure 1 shows two typical visualizations of such a representation: A sagittal slice and a coronal one showing an (x, y)-plane and a (y, z)-plane of the scan image respectively. As in general the original scan data contains much more information than the vocal tract itself, we usually crop it to a selected region of interest. This reduces the memory requirements and the processing time of our framework. By inspecting the scan, we observe that  the data is degraded due to measurement noise. As a remedy, we apply a 3D variant of the edge-enhancing diffusion (Weickert, 1998) to the image. An example result of the approach can be inspected in Figure 1. We see that the noise was removed and structural information like edges were preserved and enhanced.
We now want to extract a point cloud Q of the wanted surface information from the denoised MRI scan. First, we detect the spatial support of the region whose surface information we want to derive. That is, we want to find a partition such that Ω O contains the region of the major articulators and related tissue and Ω B = Ω \ Ω O everything else. By inspecting the denoised data, we notice that tissue can be distinguished from non-tissue, such as air and bone for example, by using color information. This observation motivates the use of image segmentation methods that make use of such a feature. In our case, we decided to use the method by Otsu (1979) to perform this task as it is fully automatic. An example segmentation can be seen in Figure 2.
As we are interested in the shape information of the surface, we proceed by extracting the surface points of the tissue from the obtained partition. We call q i ∈ Ω O a surface point if at least one of its neighbors is part of Ω B . Additionally, we use the partition to estimate normal information at the extracted surface points.
The obtained surface points and associated normals are then assembled in a point cloud. An example of such a point cloud can be inspected in Figure 2.

Template Matching
Now, we want to estimate the surface of the desired articulator from such a point cloud Q. Here, we use a polygon mesh M := (V, F) as surface representation. The set V := {v i } with v i ∈ R 3 is called the vertex set of the mesh. The other set, F, is the face set of our mesh.
We observe that a point cloud Q is a loose collection of points. In general, this collection contains more information than the desired articulator and there might be holes in the cloud with missing data. However, a subset of Q implicitly represents the surface of the desired articulator.
In order to identify this subset and estimate the articulator shape from it, we can apply a template fitting technique.
Given a template mesh M = (V, F) that resembles the wanted articulator and a point cloud Q, it finds a set A : The template matching finds this set A of deformations by minimizing the energy The data term E data is minimized if applying A to the mesh moves it towards some points in the point cloud. The smoothness term E smooth penalizes deformations that alter the original shape of the template. Finally, the landmark term E landmark produces energy if correspondences between landmarks on the mesh and user-provided points are violated by the deformation. As Equation 4 is not differentiable, it is usually optimized by minimizing a series of energies E t Def (A t ) where t ∈ [1,t max ]. We note that each energy uses adapted weights β t and γ t : where β min and γ min are set by the user. Originally, we used a standard heuristic (Allen et al., 2003;Li et al., 2009) to distinguish valid data observations from invalid ones in the optimization of E data . In particular, we say that q is a valid data point candidate for a deformed vertex A i (v i ) if the Euclidean distance between both is not too large and if their normals do not differ too much from each other.
We have modified this nearest neighbor heuristic somewhat: We collect all valid data point candidates within a fixed radius and then select the best candidate that lies below the current mesh surface. If no such candidate exists below the surface, we will select the best one above it. This modification is intended to prevent the template mesh from getting stuck at unrelated points  in the volumetric cloud during the optimization. An example showing the benefits of this modified heuristic can be inspected in Figure 3. Here, we note that we are showing the projection of the matched template on the scan data for the sake of visibility and interpretability.
In our framework, we use two templates: One for the tongue and one for the hard palate. Both templates were extracted from MRI data by means of a medical imaging software. Afterwards, we made the templates symmetric to remove this particular bias towards the original speaker.
The palate template consists of 994 vertices and 1828 faces with an average edge length of 1.4 mm. The tongue template contains 3100 vertices and 6102 faces with an average edge length of 1.8 mm. We note that the tongue template is missing the sublingual part of the tongue that is negligible for speech production.
Both templates can be inspected in Figure 4 together with the landmarks used.

Tongue and Palate Shape Estimation
We first estimate the palate shape for each MRI scan. This shape information is needed in some cases to restore tongue surface information that is missing due to contacts between tongue and palate.
First, we select a scan for each speaker where the hard palate is clearly visible and perform template matching. We note that, in general, using a single template might produce suboptimal results in some matching cases. In order to improve the results, we set up an iterative bootstrapping approach. In each iteration, we first compute a PCA model of the palate (Hewer et al., 2015) by using the results of the previous iteration. This model is then fitted to each point cloud and the results are afterwards used as the initialization for the template matching.
After we acquired the hard palate mesh for each considered speaker, we want to align this mesh to each scan of the corresponding speaker. This procedure serves the purpose of restoring tongue surface information that is missing due to contacts between tongue and palate as shown in Figure 5.
Here, we have to address the issue that the corresponding speaker might have moved between the scans. Fortunately, as the hard palate can only undergo rigid body transformations, we only have to estimate this type of motion. However, as the palate surface information might be partly missing, we fall back to color information for this task. To this end, we define the color profile set E(M, f ) ⊂ R of a mesh M in a scan f . A profile e i (M, f ) ∈ E(M, f ) is a vector such that its entries are given by where v i is a mesh vertex, n i its corresponding normal, and d the chosen sampling distance. We see that we start above the palate surface in order to avoid taking samples in the possible contact area between tongue and palate. Then, we can estimate the rigid body motion A for aligning a palate mesh M obtained from a scan f to a scan g by maximizing the energy: where J(V ) is the index set of the vertex set V , NCC the normalized crosscorrelation between its operands, and A(M) the transformed mesh. We decided to use the NCC as similarity measure because it is known to be robust against noise and brightness differences. Furthermore, the NCC between color profiles was already successfully used in a nearest neighbor heuristic for template matching (Harandi et al., 2014). A result of this alignment approach can be seen in Figure 5. Figure 6: Effect of the bootstrapping strategy. Left: Initial template matching has trouble correctly registering the tongue shape. Right: Bootstrapping improves the result.
We now inject this aligned palate mesh information into the point cloud of the corresponding scan in order to restore missing tongue surface information by using the palate surface as a replacement. Additionally, we use the aligned mesh as a boundary to remove points in the point cloud above the palate that are unrelated to the tongue. Finally, we use a template matching to extract the tongue shape from the corresponding modified point cloud. As in the palate case, we use a bootstrapping strategy to refine the results. This time, we use a multilinear model in each iteration as a statistical prior that is described in the next section. Effects of this bootstrapping operation can be seen in Figure 6.

Multilinear Tongue Model
Having obtained a collection of tongue meshes, we then want to derive a function M : S × P → M where M is a set of meshes. The set S ⊆ R m consists of coordinates s that describe a speaker's anatomical shape of the tongue. The set P ⊆ R n contains coordinates p that determine the shape for a specific speech related tongue pose. Here, we call S the speaker subspace and P the pose subspace of the model. Meshes M ∈ M should have the same face set as our tongue template mesh. Their vertex sets V (s, p), however, may differ from the original template with respect to their vertex positions.

Preparing the Training Mesh Collection
Deriving the function in Equation 9 implies we want to analyze only the anatomical and speech related variations in our mesh collection, which means we have to remove all other variations present. The Procrustes alignment technique (Dryden and Mardia, 1998) is a method suitable for this task as it may be used to remove any translational and rotational differences among the meshes in the collection. However, applying this technique directly to the acquired tongue meshes might destroy critical information, e.g., related to the speech related tongue pose. This is, for example, due to the fact that the tongue also undergoes translational and rotational motions because it is connected to the lower jaw.
As a remedy, we apply the Procrustes alignment to the hard palate meshes we obtained earlier to remove translational and rotational differences between the speakers that are unrelated to the tongue motion. The results are afterwards used as a reference to align the tongue meshes. To this end, we use a speaker's palate mesh that was earlier aligned to the corresponding scan. Here, we then estimate the rigid transformation that maps this aligned palate mesh to its Procrustes variant and apply the same motion to the corresponding tongue mesh. By doing so, we remove any translational and rotational differences related to head motions or position differences without destroying any speech or anatomy specific information.
Finally, we have to ensure that for each speaker the meshes for all selected poses are available. Here, we reconstruct a missing pose shape of a speaker by averaging available data: First, we compute the average shape of all meshes that are present for the speaker. Afterwards, we compute the mean shape of all meshes that are available for this specific pose from the other speakers. Finally, both meshes are averaged again. We note that there exist more sophisticated methods to restore missing information, like for example HALRTC (Liu et al., 2013). In our case, however, this averaging approach was sufficient.

Deriving the Model
In order to derive our desired function in Equation 9, we need to analyze the anatomical and speech related variations separately. In several works (Harshman et al., 1977;Hoole et al., 2000Hoole et al., , 2003Ananthakrishnan et al., 2010;Vargas et al., 2012b,a;Zheng et al., 2003), the PARAFAC method (Harshman, 1970) was used to perform this analysis. This method, also known as CAN-DECOMP, decomposes a tensor into a sum of r rank-1 tensors where r is provided by the user. Therefore, this technique can be regarded as an extension of the singular value decomposition to tensors. However, literature reports issues with this method: Hoole et al. (2000) found that it might be difficult to find reliable solutions. Vargas et al. (2012a) pointed out that the PARAFAC decomposition requires a lot of components to describe the observed data in a satisfactory way, which limits its usefulness as a dimensionality reduction method. Moreover, De Silva and Lim (2008) discovered that the associated standard approximation problem is mathematically ill-posed, which can lead to the problem of diverging components in a numerical setting.
Another suitable method is the Tucker decomposition (Tucker, 1966) that is sometimes also called higher-order singular value decomposition (HOSVD). This method computes the orthonormal spaces of a tensor associated with its modes. It may be regarded as a more flexible variant of the PARAFAC method (Kiers and Krijnen, 1991) and has previously been used to analyze 2D tongue shape data (Vargas et al., 2012b).
Considering the issues of PARAFAC, we decided to use the Tucker decomposition to analyze our data. Here, we follow the approach of  who used it to analyze the variations of human faces in different expressions. To this end, we first turn our tongue meshes into feature vectors by serializing the vertex sets V into vectors f i . Then, we compute the mean µ, and center the vectors. Afterwards, we organize those centered vectors in a tensor A ∈ R m×n×k . Here, we refer to the first mode of the tensor as the speaker mode where m represents the number of speakers, to the second mode as pose mode with n being the amount of different tongue poses, and to the third mode as the vertex mode with k representing the dimension of the vectors f i .
The HOSVD makes use of the fact that A can be decomposed as follows: In our case, the row vectors of U 1 ∈ R m×m are coordinates in our speaker space S that determine the anatomical shape for each of the original speakers. A similar observation applies to U 2 ∈ R n×n where the row vectors are coordinates in the pose space P. The tensor C ∈ R m×n×k is the core tensor of the decomposition that acts as a link between S and P. The operation C × n U is called the n-th mode multiplication of the tensor C with the matrix U.
The core tensor is the multilinear model we can use to create our function in Equation 9: Essentially, given s ∈ S and p ∈ P, we can use C to generate serialized vertex sets that represent the generated shape as follows: By letting V (s, p) be the vertex set reconstructed from v(s, p), we finally can define our function as: where F is the face set of our original template. We remark that the dimensionality of the speaker and pose subspace can be truncated to remove shape variations that may be considered negligible or related to noise. This means that our subspaces have dimensionalities m ≤ m and n ≤ n.

Model Fitting
We can use this derived model to register data, for example a point cloud Q. This time, we want to optimize for the parameters s ∈ S and p ∈ P that best describe the speaker anatomy and tongue pose that is represented in the data.
To this end, we minimize the following energy: where the data and landmark terms are equivalent in their modeling idea to their counterparts in the template matching case. Furthermore, we use the same nearest neighbor heuristic and optimization approach as in the template matching. This time, the weights for both terms remain constant during the optimization of the energy series. However, we note that if the correct neighbors are known, they can be set directly and only one energy has to minimized in that case.
It is common to limit the admissible values for s and p to avoid highly unlikely shapes. In particular, we limit each entry of s and p individually to an interval where σ i is the standard deviation of the corresponding variation in the used mesh collection and m i the corresponding entry of the mean coordinate in the respective subspace. Finally, h ∈ R + is a scale factor. We note that the above energy can also be used to fit a PCA model: In this case, the energy depends only on one parameter.

Evaluation
Our next goal is to apply the described framework to MRI data and evaluate the quality of the obtained tongue model.

Used Data
We use two datasets to derive our model: The dataset of Adam Baker (Baker, 2011) and the full dataset of the Ultrax project (Ult, 2014), which provides us with data of 12 speakers in total.
The Ultrax project consists of static MRI scans of 11 adult speakers where 7 are female and 4 are male. All speakers are phonetically trained and were recorded while sustaining the vocal tract configuration for different phones. For each speaker, 13 speech related scans are available that correspond to the phone set [i, e, E, a, A, 2, O, o, u, 0, @, s, S].
The Baker dataset was recorded as part of the Ultrax project, but released separately. It contains 25 scans of one male speaker that are speech related and depict different articulatory configurations.
The data was recorded at the Clinical Research Imaging Centre in Edinburgh using a Siemens Verio 3T scanner where they were acquired with an echo time of 0.93 ms and a repetition time of 2.36 ms. The individual scans consist of 44 sagittal slices with a thickness of 1.2 mm and a slice size of 320 × 320 pixels. Here, we have as grid spacings h x = h y = 1.1875 mm and h z = 1.2 mm.
For our analysis, we decided to exclude one speaker of the Ultrax dataset that showed a high activity of the soft palate, which caused problems in our framework. Furthermore, we use the whole phone set that was recorded for the Ultrax data. However, we note that the Baker dataset is lacking scans for the phones [a, O, 0, @, s, S] where the shape information has to be reconstructed.
In total, we are using the shape information of 11 speakers with 13 different tongue shape configurations. This means that we arrive at a tensor A ∈ R 11×13×9300 where the dimension of the vertex mode is related to the vertex count of the tongue template we are using.

Applied Settings
For this data, the following settings were applied in our framework to extract the mesh collection: Figure 7: Speech related regions of the tongue surface: Tongue tip (red), tongue blade (brown), tongue back (violet), tongue dorsum (blue), and the lateral regions (green).
In the case of template matching, we used α = 1, β = 10, β min = 6, and γ = 10. Thus, we start with a high weight for the smoothness and landmark terms to drive the template to the correct neighborhood at the beginning of the optimization. The template matching for the tongue used γ min = 0 to damp the effects of falsely placed landmarks. We used γ min = 10 for the palate matching to ensure that its extremities were correctly aligned. For the model fitting that is applied during the bootstrapping, we used α = γ = 1. In the nearest neighbor heuristic, we set the search radius to 4 mm and limited the maximally allowed angle difference between the normals to 60 degrees. The optimization for the template matching used a series of 40 energies, the one for the model fitting applied a series of 10 energies to find the minimizer. For the palate alignment, we decided to use sufficiently long profiles with a length of = 15 and a sampling distance of d = 1 mm.
In the bootstrapping strategy, we applied iterations until a satisfactory visual result was obtained: We used 1 iteration for the hard palate and 5 iterations for the tongue. For the scale factor h in the model fitting, we used 0.5 for the tongue and 1 for the palate in order to prevent overfitting.
The landmarks needed for the hard palate and the tongue were placed on the MRI scans by a user that is not an anatomical expert.

Model Analysis
It is common to evaluate such statistical models by analyzing their compactness, generalization, and specificity (Styner et al., 2003) in order to find the optimal subspace dimensionality.
Compactness investigates how much the individual components of s and p contribute to the description of the used training data. In Figure 8, we see that using m = 5 is sufficient to represent 91 percent of data variability. Approximately the same holds for n = 4.
Generalization measures how well the model can represent data that was not part of the training. To evaluate the speaker generalization, we designed the following experiment: The data of each speaker was once excluded from the training set. The derived model was then used to register this excluded data where we measured the average Euclidean distance between the registered mesh and the original one. Additionally, we analyzed the fitting results for different values of m. The dimensionality of the pose subspace was fixed to n = 4 during these experiments to prevent overfitting caused by this subspace. In the analysis of the pose generalization, we used the same approach. In this case, the dimensionality of the speaker subspace was fixed to m = 5. The results of these experiments are depicted in Figure 8. During this evaluation, we used the scale factor h = 2 in the model fitting optimization. The specificity tries to assess how much the generated tongue shapes of the model differ from the original training data. In particular, we wanted to investigate how large these differences were for the regions of the tongue mesh that are speech related. Figure 7 shows an overview of those regions. To this end, we designed a few experiments where samples from the two subspaces were drawn randomly in order to generate a tongue shape. The first experiment investigated the specificity of the speaker subspace. Here, the pose subspace is again fixed to n = 4 and the speaker subspace size was varied. For each value of m, we generated random tongue shapes and evaluated the average Euclidean distance between the created mesh and the closest one in the mesh collection. In this comparison and distance evaluation, a region consisting of all speech related parts was considered. The same experiment was conducted for analyzing the specificity of the pose subspace where the speaker subspace size was set to m = 5. The results of both experiments can be inspected in Figure 8.
Finally, we wanted to find out how much the tongue shapes belonging to specific phones differ from the corresponding ones generated by the model. Here, we performed for each phone the following experiment: We froze the coordinates in the pose subspace to the ones belonging to the given phone. Moreover, we only allowed the generated meshes to be compared to meshes belonging to that phone. Then, for each dimensionality of the speaker subspace, we generated samples and computed the average Euclidean distance to the closest mesh. This time, we used in the distance evaluation and comparison parts of the tongue that are considered critical for this specific phone, cf. Jackson and Singampalli (2009). For the vowels [i, e, E, a, A, 2, O, o, u, 0, @], we selected a region consisting of the tongue blade, tongue back, and the tongue dorsum. The area for the sibilants [s, S] contains the tongue tip and the tongue blade. The results of these experiments are shown in Figure 9.
In all specificity experiments, we generated 10 6 samples.
The performed experiments provide an interesting insight into the model properties. The results of the generalization experiments show that only a few components of p and s are needed to reliably register unseen data. In particular, for p, 3 components are enough to reach an average error that is slightly above the measurement precision of the MRI scan data. For s, 7 components are needed to reach this kind of precision. Furthermore, we observe that a high number of components leads to errors below the measurement precision of the scan data, which can be considered as overfitting.
Here, we observe that the pose subspace has better generalization abilities than the speaker subspace. We suspect this might be related to redundancies in our training data: For example, the phones  to each other with respect to shape (Ladefoged, 1982). This means that excluding one still provides the model with enough information to capture the related variation. Moreover, we notice that the phone [0] shows a significantly bad result in the fixed phone specificity evaluation, which might be related to its unusual role in the phonology of British English. We suspect that some speakers might have pronounced it inconsistently and applied different strategies, which led to a high variation in the data that is then integrated into the model. From this observation, we draw the conclusion that the multilinear model might be used to detect such inconsistencies by performing the fixed phone experiments.
Overall, we decided that setting m = 5 and n = 4 provides a good compromise between specificity, generalization, and compactness. We note that this choice also limits the effects of overfitting.

Conclusion
In this work, we presented a multilinear tongue model that was derived from volumetric MRI scans in a minimally supervised way. In particular, we saw during the experiments that a model with a low dimensionality can reliably register unknown data with an acceptable precision.
In the future, we plan to investigate whether more shape variations can be obtained using more data. To this end, we want to use additional datasets in our framework. This implies that we also have to extract the shapes of phones like [g, k] that are known for having a contact with the soft palate. Here, we have to address the issue of recovering the surface of the soft palate in the corresponding scans that can also deform in a non-rigid way. Additionally, the datasets we use might differ with respect to the recorded phones, which leads to missing data in our training set. In this case, the simple averaging method for reconstructing missing shapes is no longer sufficient. Furthermore, using more data also increases the risk of encountering falsely labeled or corrupt scans.
Our hypothesis that the multilinear model could be used to find inconsistencies in phone production could be tested in the future by choosing the phones in the training data as follows: One set should consist of phones that show little variance among speakers. The other set should contain phones that show a high variance among speakers because different strategies are available to produce them, for example [l, T] (Keating, 2014). If the hypothesis were true, the fixed phone specificity experiments would show good results for the first set and bad results for the second one. In this case, this experiment could also be used as a heuristic to cluster speakers according to the articulation strategy they use. However, we note that the described experiment would require a dataset with recordings of the phones needed.
Moreover, we want to explore whether the derived model can be used to extract realistic 3D tongue motions from real-time 2D MRI data that was recorded in the mid-sagittal plane. We think that the acquired results can help to understand how the typical transitions between phones appear in the pose subspace and how they are affected by the anatomy of the speaker. Ultimately, this could lead to another multilinear model that could be used to generate these transitions.