The nonverbal structure of patient case discussions in multidisciplinary medical team meetings

Meeting analysis has a long theoretical tradition in social psychology, with established practical ramifications in computer science, especially in computer supported cooperative work. More recently, a good deal of research has focused on the issues of indexing and browsing multimedia records of meetings. Most research in this area, however, is still based on data collected in laboratories, under somewhat artificial conditions. This article presents an analysis of the discourse structure and spontaneous interactions at real-life multidisciplinary medical team meetings held as part of the work routine in a major hospital. It is hypothesized that the conversational structure of these meetings, as indicated by sequencing and duration of vocalizations, enables segmentation into individual patient case discussions. The task of segmenting audio-visual records of multidisciplinary medical team meetings is described as a topic segmentation task, and a method for automatic segmentation is proposed. An empirical evaluation based on hand labelled data is presented, which determines the optimal length of vocalization sequences for segmentation, and establishes the competitiveness of the method with approaches based on more complex knowledge sources. The effectiveness of Bayesian classification as a segmentation method, and its applicability to meeting segmentation in other domains are discussed.


INTRODUCTION
Group dialogue at meetings has been a topic of systematic study from quantitative and qualitative perspectives since at least the 1950s, with the works of Bales [1950] and others. This line of work has investigated issues such as group performance [McGrath 1991], group cohesiveness, and the process of verbal and nonverbal activities [Hackman and Morris 1975; Dabbs and Ruback 1987]. Advances in computer technology have stimulated research on similar topics in the computer science disciplines of human-computer interaction (HCI) and computer supported cooperative work (CSCW). This (lexical cohesion statistics as well as dialogue structure, vocalization, and silence statistics) [Galley et al. 2003], prosodic features [Shriberg et al. 2000], video features [Dielmann and Renals 2007; Hsueh and Moore 2007], and other contextual features such as dialogue type and speaker role [Hsueh and Moore 2007]. Only a few of these information sources can be reliably extracted from recordings obtained at a real MDTM, where the fast pace of the dialogue, the large number of participants, the diverse composition of the medical teams, and other factors make clean recording of individual speakers a practical challenge. Very high word error rates for automatic speech recognition, for instance, would preclude the use of dialogue acts, lexical features, and lexical cohesion statistics for our MDTM data (even though some topic segmentation systems have been shown to be resilient to moderate word error rates in other domains [Garofolo et al. 1999; Hsueh and Moore 2007]). This is, in part, our motivation for investigating the use of content-free features of talk for segmentation.
However, the investigation is also motivated by theoretical interests. Extending early work on Markovian models for dyadic interactions and monologues [Jaffe and Feldstein 1970], Dabbs and Ruback [1987] have argued that such content-free features as patterns of turn-taking (vocalization) and silence can tell an analyst much about the nature and structure of a meeting. In the case of MDTMs (and patient case discussions, in particular), the conversations, despite being fast-paced and apparently chaotic to an outside observer, are highly structured events in which the participants have very well defined roles, in accordance with their medical specialities, which determine to a great extent their patterns of participation in the meeting.
As a matter of practical relevance to both indexing of meeting content for browsing and information retrieval purposes and the theoretical analysis of meeting process, this article investigates whether segmentation of MDTMs into their constituent PCDs can be reliably performed based on speaker roles and patterns of vocalization and silence. These features form part of what is arguably the simplest account of the sequential structure of dialogue [Dabbs and Ruback 1987] and therefore seem like a promising starting point, from which analyses incorporating richer elements (such as transcription, semantic interpretation, and visual modalities) can be further developed.
The article is structured as follows. The next section outlines the background of this study, describing the PCD and the MDTM in greater detail, presenting basic descriptive statistics of participant roles and interaction, and motivating the research from a practical point of view. Section 3 presents a review of related work on topic segmentation, highlighting the similarities and differences between MDTM segmentation and meeting topic segmentation in general, and tracing back the origins of many dialogue segmentation approaches to early work on text segmentation. This is followed by the presentation of our approach to data representation and of the data preparation and annotation procedures adopted for this study. The main segmentation technique is then introduced, followed by the results of a cross validation experiment performed in order to assess the effectiveness of combining Naïve Bayes classification and the proposed content-free representation in detecting PCD boundaries. The analysis of results is complemented by a baseline analysis, a study of the effect of diarization errors on segmentation accuracy, and an analysis of the effect of redundancy on the Naïve Bayes classifier. Comparisons with hidden Markov models, decision trees, kernel methods, and nearest neighbor classifiers are presented, along with a discussion of evaluation issues and other state of the art approaches to topic segmentation. Conclusions and plans for future work close the article.

BACKGROUND
The meetings analyzed in this article take place in a hospital setting and are attended by a varying number of participants who constitute a multidisciplinary medical team. Multidisciplinary medical team meetings are an established part of the process of diagnosis and treatment of cancer patients, and are a practice recommended by several national health services [Calman-Hine 1995]. In an MDTM, health professionals of different specialities meet to discuss diagnosis, treatment options, and patient management. Additionally, these meetings serve educational purposes (training of students and junior doctors) and broader healthcare management and organizational functions [Kane 2008].
The MDTM is structured as a sequence of PCDs, where the patient's medical record is reviewed, evidence from pathology and radiology is presented, the possibility of surgery is discussed, and a patient management plan is agreed upon. In addition to the medical team, the meetings are generally attended by medical students and junior staff, who do not play an active role in the discussions. The presentations and discussions make intensive use of visual aids (e.g., display of pathology slides on a large screen, radiology images on high-resolution displays, etc.), and are often also attended by remote participants connected through teleconferencing. Support for collaboration at MDTMs is a topic that has attracted the interest of the CSCW community lately, and detailed analyses of organizational processes surrounding the meeting, its different functions in the hospital environment, and mechanisms that add dependability to their decision-making have been conducted [Robertson et al. 2010; Groth et al. 2009; Kane and Luz 2006].
Figure 1 shows the physical environment in which the MDTMs recorded for the corpus used in this study take place. It is a dedicated teleconferencing room equipped with projection equipment, a high resolution screen for radiological images, a large plasma screen, as well as microscopes and document readers that can be connected to the large display. The recordings were taken from two separate sources: (a) the existing teleconferencing equipment in the meeting room, which recorded the audio through a pressure-zone microphone and alternating recordings of the video channel between a view of the participants and views of the different medical images under discussion, and (b) a high-end camcorder mounted on a tripod, which recorded the audio through a sensitive directional microphone. These two sources were aligned (synchronized) using a multimedia annotation tool. While small and lacking in annotation detail compared to the major meeting corpora previously mentioned, the MDTM corpus is unique in that it was collected in a real-world environment, with the meeting participants engaged in a complex professional task [Kane 2008].
Over 28 hours of meeting data were collected in total. For the study reported in this article, a dataset of 54 PCDs (approximately 220 minutes) was segmented and annotated using the ELAN Linguistic Annotator [MPI 2005]. The original purpose of data collection was to investigate the diagnosis and decision-making processes of multidisciplinary medical teams [Kane and Luz 2006, 2009] within an interaction analysis framework [Jordan and Henderson 1995]. This initial work revealed, among other things, that although the medical team works under severe time constraints and consequently the PCDs need to be well structured, the group managed to balance task and socio-emotional exchanges, which, as McGrath [1991] suggests, is a means of avoiding tension and negative reactions in group collaboration.
The distribution of talk at MDTMs is very skewed, as evidenced in Figure 2(a), which shows that the vast majority of vocalizations are of short duration. This is in agreement with the general pattern in multiparty dialogues [Dabbs and Ruback 1987]. MDTM vocalizations tend to last longer than those in scenario based meetings such as the ones recorded for the AMI corpus [Carletta 2007]. The mean duration of an MDTM vocalization is 8.2s (median 3.5s) while the mean duration of an AMI meeting vocalization is 3.9s (median 1.4s) [Luz and Su 2010]. The average number of active participants in an MDTM is also greater, as is the number of distinct roles they perform in meetings. We identified 10 distinct participant roles, whose proportional contributions (in terms of amount of talk) to the meetings are summarized in Figure 2(b). This again is in contrast to the AMI data, where only four roles are defined: user interface designer, industrial designer, project manager, and marketing expert. Unlike the speakers in the broadcast news data analyzed by Vinciarelli [2007], who can play different roles at different times during the broadcast, each speaker in the MDTM corpus has a unique medical specialist role throughout the PCDs. The mapping from speaker identities to roles is not one-to-one, however, as more than one speaker can perform the same role in the same PCD. A total of 21 different speakers actively participated in the MDTMs. The remaining time is distributed between pauses (silences, 3.4%) and group vocalization (overlapping speech, 1.2%). Averaged with respect to PCDs (Table I), talk duration is consistent with the general MDTM figures. Table I also shows the average number of vocalizations per participant during a PCD, the average number of speakers participating in each case discussion, and the average duration of intervals of silence. The last row contains the mean values for a metric we call participation ratio [Luz and Kane 2009]. The participation ratio of a meeting attendee is defined as the ratio of the number of PCDs in which the attendee took active part to the total number of cases discussed. The figures for mean participation ratio (over $n$ speakers) in Table I were calculated as $r = \sum_{i=1}^{n} |C_i| / (n|C|)$, where $C_i$ represents the set of PCDs in which speaker $s_i$ produced at least one vocalization and $C$ is the entire set of PCDs.
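As a minimal sketch of how the mean participation ratio could be computed, assume each PCD is given as the set of speakers who produced at least one vocalization in it. The function name, the input layout, and the choice to take $n$ as the number of speakers who spoke at least once are assumptions made for illustration, not details taken from the study.

```python
def mean_participation_ratio(pcd_speakers):
    """Mean participation ratio r = sum_i |C_i| / (n * |C|).

    pcd_speakers: list of sets, one per PCD, holding the speakers who
    produced at least one vocalization in that PCD (hypothetical layout).
    """
    if not pcd_speakers:
        raise ValueError("at least one PCD is required")
    # Assumption: n counts only speakers who vocalized in at least one PCD.
    speakers = set().union(*pcd_speakers)
    if not speakers:
        raise ValueError("no active speakers found")
    # |C_i|: number of PCDs in which speaker i took an active part.
    counts = {s: sum(s in pcd for pcd in pcd_speakers) for s in speakers}
    return sum(counts.values()) / (len(speakers) * len(pcd_speakers))


# Toy data, invented for this example.
pcds = [
    {"radiologist", "surgeon"},
    {"surgeon"},
    {"surgeon", "pathologist"},
    {"radiologist"},
]
ratio = mean_participation_ratio(pcds)
```

In the toy data three speakers take part in 2, 3, and 1 of the 4 PCDs, so the ratio is 6 / (3 × 4) = 0.5.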
Participation ratio figures summarize variability in the composition of the groups across case discussions. Table I indicates a high degree of variation, showing that a speaker will on average take part in only around 39% of all PCDs. A different measure of variability is given by the entropy ($H$) of the probability distribution of vocalizations by $n$ speakers, calculated by averaging over the probabilities $p_i$ that speaker $s_i$ is speaking at a given time during a PCD, in the usual way: $H = \sum_{i=1}^{n} p_i \log (1/p_i)$. The $H$ score for speaker transitions ($H = 0.78$, sd = 0.35) reveals a predictable pattern of speaker transitions, while the entropy of the vocalization length indicates a process that is less predictable ($H = 2.23$, sd = 0.46), though the amount of uncertainty in the distribution of vocalization length is still quite small considering that we have on average more than eight participants in a PCD.
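The entropy calculation can be sketched as follows, assuming the probabilities $p_i$ are estimated from each speaker's share of total talk time in a PCD. The logarithm base is not stated in the text; base 2 is used here purely as an assumption, and the function name and input format are illustrative.

```python
import math


def vocalization_entropy(talk_time_by_speaker, base=2.0):
    """H = sum_i p_i * log(1 / p_i) over speakers' shares of talk time.

    talk_time_by_speaker: dict mapping a speaker to seconds of talk in
    one PCD (hypothetical input format). The base is an assumption.
    """
    total = sum(talk_time_by_speaker.values())
    if total <= 0:
        raise ValueError("total talk time must be positive")
    # Speakers with zero talk time contribute nothing to the sum.
    probs = [t / total for t in talk_time_by_speaker.values() if t > 0]
    return sum(p * math.log(1.0 / p, base) for p in probs)


# Four speakers with equal talk time give the maximum for n = 4.
h_uniform = vocalization_entropy({"a": 10, "b": 10, "c": 10, "d": 10})
# A single dominant speaker gives zero entropy.
h_single = vocalization_entropy({"a": 30, "b": 0})
```

With base 2, a uniform distribution over eight speakers would give 3 bits, which offers one way to read the reported values against the number of participants.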

The Practical Relevance of MDTM Segmentation
It has been acknowledged in group research that descriptive statistics alone, such as the ones shown in Table I, do not suffice to characterize a meeting. The interaction process that links group task inputs to task outputs also needs to be considered [Hackman and Morris 1975; Dabbs and Ruback 1987]. The Interaction Process Analysis proposed by Bales [1950] and related systems provide an account of process based on careful coding of ongoing group interactions. Dabbs and Ruback [1987] argue that, although useful for analysis, the method of coding the content of speech interaction with reference to a system of categories tends to miss important information found in the more general paralinguistic features of meetings. From a CSCW perspective, a system such as the one outlined in this article, capable of automating the collection of content-free (paralinguistic) features and segmenting the recorded data into meaningful subunits, would provide a tool for analyzing meetings with respect to their effectiveness, the impact of new meeting-support technologies on the interaction, and so on.
In terms of meeting indexing, searching, and browsing, structuring a multimedia meeting record by topics of discussion could help users access audio content even in the absence of speech transcription, by providing reference points on the time line [Bouamrane and Luz 2007; Moran et al. 1997]. While a variety of techniques have been employed to add structure to different kinds of audio recordings, including role identification on radio broadcasts [Barzilay et al. 2000], summarization of broadcast news based solely on prosodic features [Maskey and Hirschberg 2006], and rich transcription of meetings [Fiscus et al. 2008], the approach presented in this article aims to identify reference points on meeting recordings based exclusively on the amount and structure of talk. While it is clear that a complete meeting storage and retrieval system will also require modality translation algorithms such as speech to text conversion and video analysis, as currently pursued by many researchers, we believe that a focus on content-free features will contribute in its own right to a better understanding of the organization of meeting data.

BRIEF REVIEW OF DISCOURSE SEGMENTATION AND RELATED WORK
For the purposes of meeting segmentation, PCDs can be regarded as sequences of vocalizations grouped under a common topic (the discussion of a particular patient's case). In this sense, the task of segmenting MDTMs into PCDs is similar to the topic segmentation task as defined by Galley et al. [2003] and tackled in recent work on meeting analysis [Dielmann and Renals 2007; Hsueh 2008; Hsueh and Moore 2007; Banerjee and Rudnicky 2007].
Meeting segmentation has been influenced by early work on text segmentation [Hearst 1997; Beeferman et al. 1999], with which it shares evaluation metrics and methods. Approaches to broadcast news segmentation [Rosenberg et al. 2007; Shriberg et al. 2000] and lecture segmentation [Malioutov et al. 2007] have also influenced meeting-segmentation research. However, it is generally acknowledged that segmentation of spontaneous speech produced by interacting speakers in a group is more challenging than text segmentation, where the topic structure is in most cases carefully designed by the writer [Gruenstein et al. 2005], and than segmentation of broadcast news audio and other nonconversational speech, where the production environment and other contextual factors might provide acoustic clues as to where segment boundaries lie [Rosenberg et al. 2007; Gruenstein et al. 2005].
Related techniques are the identification and clustering of individual group actions [Zhang et al. 2006; McCowan et al. 2003, 2005] and the labelling of topics [Blei and Moreno 2001]. While the present work is not concerned with these tasks, we acknowledge that they could play an important role in the automatic structuring of spontaneous speech. We investigated labelling issues (PCD content categorization) elsewhere [Luz and Kane 2009]. Here, the meeting is simply treated as a sequence of vocalizations and pauses, and an attempt is made to mark out those vocalizations that signal the beginning of a PCD.
Boundary vocalizations are similarly distributed for PCDs in our MDTM corpus, where only about 3.6% of all vocalizations indicate the start of a PCD, and in the AMI corpus, where about 3.3% of talk spurts indicate a topic change [Hsueh and Moore 2007]. However, despite these similarities, MDTM segmentation differs from meeting topic segmentation in that the latter seeks to identify segments that are different as they appear in the vocalization sequence, whereas the former aims to segment the stream into essentially similar subsequences. Topics in the AMI corpus, for instance, can be categorized as top-level and functional topics [Hsueh and Moore 2007], denoting segments that could also be described as meeting states [Banerjee and Rudnicky 2004], such as presentation, discussion, opening, closing, agenda, and so on, which can then be subdivided into subtopics, forming a shallow hierarchy which is usually flattened for the purposes of segmentation. Similarly, Gruenstein et al. [2005] annotated the ICSI corpus [Janin et al. 2003] hierarchically according to topic, identifying, in addition, action items and decision points. For some applications, however, annotation focuses simply on topic changes that produce high interannotator agreement scores, with no further specification of topic label or discourse structure [Galley et al. 2003].
In MDTM segmentation, due to the self-contained nature of PCDs, annotators have little difficulty in identifying case discussion boundaries. The consistency of the manual segmentation of the MDTM corpus was ensured by the close collaboration between the researcher who gathered the data and members of the medical team who reviewed the annotation. It should also be remarked that MDTM segmentation can be hierarchical, since PCDs exhibit an identifiable set of internal discussion states, including presentation of symptoms and clinical findings, questioning and correlation of pathology, radiology, and examination data, disease stage classification, discussion of patient management options, and articulation of the decision [Kane and Luz 2009]. However, this level of topic structure has not been fully annotated in the MDTM corpus and is therefore not addressed in this article.
Different strategies have been employed for conversational topic segmentation. As mentioned in the preceding, Galley et al. [2003] model meeting topic segmentation after a text segmentation approach (namely, TextTiling [Hearst 1997]), relying on transcribed speech to compute lexical cohesion probabilities for adjacent analysis windows. Renals and Ellis [2003], on the other hand, consider nonlexical methods for segmentation, which bear some similarity with our approach in that their data representation is based on patterns of talk spurts encoded as transition matrices. However, their segmentation algorithm, which is analogous to acoustic speaker segmentation using the Bayesian Information Criterion, does not produce satisfactory results, leading the authors to speculate that "turn pattern boundaries are not directly related to discussion topics" [Renals and Ellis 2003]. The results presented in this article seem to contradict that conjecture.
More recent approaches to meeting segmentation have tended to work with richer data representation schemes. Banerjee and Rudnicky [2004] define their model's data instances as short time windows over meeting segments whose features are described by low-level conversational statistics (number of speakers, number of speaker changes, and speech overlap). They train a decision tree classifier to distinguish between windows that contain topic changes and windows that do not, obtaining an 18% accuracy gain over a baseline (random) classifier. In more recent work, implicit supervision in the form of participant notes has been employed in order to segment meetings into speech intervals that correspond to agenda items [Banerjee and Rudnicky 2007]. Dielmann and Renals [2007] segmented meetings from the M4 corpus [McCowan et al. 2003] into a predefined set of five basic group meeting actions. They used dynamic Bayesian networks to integrate different feature streams (prosody, turn-taking, lexical, and video) into a two-level model comprising individual and group actions. Hsueh et al. [2006, 2007] used talk spurts as data instances, assessing the effectiveness of different combinations of features for topic boundary classification, including the previously mentioned features as well as prosody and motion data extracted from the video source. They tested feature integration using a C4.5 (decision tree) classifier [Hsueh et al. 2006] and maximum entropy models [Hsueh and Moore 2007]. Although most approaches employ supervised learning, unsupervised learning has also been attempted [Hsueh 2008] using features derived from phonotactic models [Schwarz et al. 2004] or regularities in acoustic patterns [Malioutov et al. 2007] with some degree of success.
The method presented in the following employs vocalization matrices as a data representation mechanism for summarizing conversational history. An attractive aspect of this approach is that it does not rely on transcribed speech and is therefore unaffected by speech recognition errors. A Naïve Bayes classifier is employed on a combination of continuous and discrete variables [John and Langley 1995], yielding promising segmentation results. The method is described in detail and evaluated in the following section.

MDTM SEGMENTATION
There is evidence to suggest that paralinguistic, nonlexical features of speech can be indicative of discourse structure [Grosz and Hirschberg 1992]. Prosodic features, for instance, have been employed as the exclusive means of segmenting speech data from the Switchboard and Broadcast News corpora into sentences and topics [Shriberg et al. 2000]. There is also evidence that the durations of pauses and speech overlaps have predictive value in terms of topic segmentation. Oliveira [2002]  ). In terms of the roles performed by the various medical specialists, differences have also been observed. Although medical consultants and pathologists tend to speak at the beginning of PCDs more often than their colleagues (over 44% of boundary vocalizations altogether), their boundary vocalizations are about 4s shorter than their other vocalizations ($p < .01$). On the other hand, medical registrars, who also often open PCDs with presentations of symptoms and findings, spend about 8.3s longer in their PCD boundary vocalizations than in their other interventions (CI = [3.7, 14.2], $t[14] = 3.65$, $p < .01$), in agreement with informal observations reported by Kane and Luz [2009]. These differences suggest that content-free features, combined with participant role information, can indeed inform segmentation.
Content-free analysis summarizes dialogues as vocalization matrices, which basically encode the amount of speech produced by a speaker in a continuous talk spurt, the duration of speech pauses, and the probabilities that a particular speaker's vocalization will be followed by another speaker's vocalization. In general, a conversation is modelled as a Markov process with respect to such transition probabilities [Jaffe and Feldstein 1970]. This assumption has been shown to be effective for classification of presegmented PCDs according to the nature of the discussion (medical, surgical, referral, etc.) in Luz and Kane [2009], where a graph-based representation of the PCD is adopted. The approach adopted here relaxes this assumption by allowing a number of preceding vocalizations to be encoded as part of the feature set.
The dataset consists of an interval of silences and vocalizations to be classified as either boundary or nonboundary instances. A boundary instance indicates the beginning of a PCD. The features used to describe an instance are encoded as a vector $s$ encompassing duration of silences or vocalizations and the roles of the speakers who uttered the vocalizations, as shown in Equation (1). $V_i$ is a nominal variable denoting the speaker role (or a pause type or group speech, in the cases of silences and vocalizations by more than one speaker, respectively). The speaker roles that can instantiate $V_0, \ldots, V_n$ range over the values shown in Figure 2(b). Although these roles are specific to MDTMs, other meetings exhibit distinct speaker roles that influence conversational structure [Laskowski et al. 2003]. Recent results suggest that more general roles, such as those defined in the AMI corpus, can be employed for topic segmentation in a manner similar to that described in this article [Luz and Su 2010]. Since medical roles denote specializations, it can be assumed that within a stable group like the multidisciplinary team, instantiation of the role features can be inferred from speaker identities. Where more than one specialist performed the same role in a PCD (e.g., more than one radiologist took part in the discussion), they were represented as a single role feature as a smoothing technique.
$L_i$ is a continuous variable for the duration of the speech (or silence) interval, and the pairs $V_{-i}, L_{-i}$ and $V_i, L_i$ refer to the $i$th roles and durations of vocalization intervals preceding and following the vocalization described by the instance, respectively.
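The instance representation described above can be illustrated with a short sketch that assembles the role and duration pairs for a target dialogue state and its $n$ neighbours on each side. The ordering of the pairs within the vector, the `PAD` value used where the window runs past the start or end of the recording, and all names are assumptions made for this example; the original layout of Equation (1) may differ.

```python
from typing import List, Tuple

# One dialogue state: (role, pause type, or group label; duration in seconds).
State = Tuple[str, float]

# Hypothetical filler for window positions outside the recording.
PAD: State = ("NONE", 0.0)


def build_instance(states: List[State], t: int, n: int) -> list:
    """Feature vector for the state at index t with a horizon of n.

    Assumed layout: V_0, L_0, then V_-1, L_-1, ..., V_-n, L_-n,
    then V_1, L_1, ..., V_n, L_n.
    """
    def at(j: int) -> State:
        return states[j] if 0 <= j < len(states) else PAD

    vec = list(at(t))                 # target state
    for i in range(1, n + 1):         # preceding states, nearest first
        vec.extend(at(t - i))
    for i in range(1, n + 1):         # following states, nearest first
        vec.extend(at(t + i))
    return vec


# Toy sequence, invented for this example.
meeting = [("surgeon", 5.0), ("switching_pause", 1.2), ("radiologist", 9.0)]
middle = build_instance(meeting, t=1, n=1)
first = build_instance(meeting, t=0, n=1)
```

Each instance has 2(2n + 1) features, so the horizon directly controls how much conversational history the classifier sees.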

Data Preparation
As mentioned in the preceding, a data set consisting of 54 PCDs has been segmented and manually annotated for speaker identities and roles. In addition to PCDs, the segmentation involved marking the set of dialogue states specified in Definition 4.1.
Definition 4.1. The following types of dialogue states are distinguished:
- (Individual) Vocalization: the length of time that a speaker "has the floor". A speaker takes the floor when they begin speaking to the exclusion of everyone else and speak uninterruptedly without pause for at least 1 second. The vocalization ends when a silence, another individual vocalization, or a group vocalization begins. Talk spurts shorter than 1 second (e.g., back channels) are not annotated and are simply incorporated into the main speaker's vocalization.
- Group vocalization occurs when an individual has fallen silent and two or more individuals are speaking together. The group vocalization ends when any individual is again speaking alone, or a period of silence begins. Individual speaker identities are lost when a group vocalization state is entered.
- Silence represents quiet periods of more than 0.9 seconds between vocalizations (including group vocalizations). A Silence ends when an individual or group vocalization begins. A Silence can be further classified as:
  - a pause: a silence between two vocalizations by the same participant;
  - a switching pause: a silence between two vocalizations by different participants;
  - a group pause: a silence between two group vocalizations; or
  - a group switching pause: a silence between a group vocalization and an individual vocalization.
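The silence taxonomy in Definition 4.1 can be sketched as a small classification function. The `GROUP` label, the function names, and the decision to treat a group switching pause symmetrically (group then individual, or individual then group) are assumptions made for illustration; only the 0.9 second threshold and the four categories come from the definition.

```python
GROUP = "GROUP"            # hypothetical label for a group vocalization
SILENCE_THRESHOLD = 0.9    # seconds, as in Definition 4.1


def is_silence(gap_seconds: float) -> bool:
    """A quiet period counts as a Silence only if longer than 0.9 s."""
    return gap_seconds > SILENCE_THRESHOLD


def classify_silence(prev_speaker: str, next_speaker: str) -> str:
    """Classify a Silence by the vocalizations on either side of it.

    prev_speaker / next_speaker are participant identifiers, or GROUP
    when the neighbouring state is a group vocalization.
    """
    prev_group = prev_speaker == GROUP
    next_group = next_speaker == GROUP
    if prev_group and next_group:
        return "group pause"
    if prev_group or next_group:
        # Assumption: either order of group and individual qualifies.
        return "group switching pause"
    if prev_speaker == next_speaker:
        return "pause"
    return "switching pause"


labels = [
    classify_silence("dr_a", "dr_a"),
    classify_silence("dr_a", "dr_b"),
    classify_silence(GROUP, GROUP),
    classify_silence(GROUP, "dr_a"),
]
```

Distinguishing the four pause types, or collapsing them into a single silence label, is one of the contextual parameters varied in the experiments.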
Annotation followed the methodology described in the psychology and computer-supported cooperative work literature [Dabbs and Ruback 1987; Sellen 1995] and therefore focused mainly on amount and structure of speech activity. The metadata created for this set of 54 PCDs are in fact much more detailed, containing information about artifacts employed during the meeting, use of informal language, roles, and so on. For the purposes of this article, however, only speech activity and speaker roles are considered. The dialogue states specified in Definition 4.1 are similar to the ones used by Sellen [1995], with an adjustment to the minimal duration of a vocalization. Our definition of silence is similar to the concept of switching pauses [Dabbs and Ruback 1987]. A richer vocalization event taxonomy could be created through audio sampling at shorter intervals [Brady 1968], but we decided to keep our definition consistent with existing work on group interaction. The threshold of 0.9s in the definition of pause was determined empirically. Sellen [1995], for instance, uses a threshold of 1.5s. However, her data were recorded in 2-participant remote communication scenarios in which pauses tend to be longer due to technology mediation. One could also define simplified notions of turns as sequences of vocalizations and pauses, and analogously group turns as sequences of group vocalizations and group pauses. However, we chose to avoid the word "turn" altogether, as it is used in conversation analysis [Sacks et al. 1974] in a different and more complex sense.
In keeping the basic units of analysis simple, we expect to be able to automate their extraction from recorded audio through existing signal processing techniques [Fiscus et al. 2008]. It should be noted, however, that the processing steps necessary to turn the audio signal into sequences of dialogue states labelled by speaker (or silence) are not straightforward. This process is called speaker (or audio) diarization and is usually performed through change detection with the Bayesian information criterion [Chen and Gopalakrishnan 1998] followed by clustering of audio feature vectors using, for instance, Gaussian mixture models as emission probabilities for continuous density hidden Markov models [Ajmera and Wooters 2003]. Although progress has been made in this area [Fiscus et al. 2008; Tranter and Reynolds 2006], diarization can be quite error prone, especially when the input consists of a single audio stream containing all speaker sources. The method described in the preceding can therefore be regarded as operating under idealized conditions in this respect. This simplification seems warranted as a strategy for testing PCD segmentation as an individual module (unit testing). It is also compatible with the approaches to meeting topic segmentation reviewed in Section 3, which for the most part are trained on force-aligned transcripts and speaker identification through individual audio sources (as is the case of the AMI corpus and the ICSI meeting corpus) and therefore approach the standard used in the tests reported in this article. Nevertheless, it would be interesting to test the method on data containing simulated levels of diarization error in order to assess its performance under more realistic recording conditions. Asking MDTM participants to wear wireless microphones may also offer a solution that could be tested with the user group in real work contexts.

Boundary Detection
The annotation streams were converted from the ELAN annotation format into an R data frame representing a collection of instances of the form specified in Equation (1). Alternative data sets were generated by varying the size of the window over previous and next dialogue states (a horizon of size $n$ role-length pairs on each side of the target dialogue state) and by distinguishing, or not, between different pause types (see Definition 4.1), in order to assess the effect that these contextual parameters might have on segmentation accuracy.
The segmentation method consisted in training a Naïve Bayes classifier to identify instances marked as boundary dialogue states (i.e., vocalizations that start a new PCD). The conditional probabilities for the nominal variables (speaker roles) are estimated on the training set by maximum likelihood and combined into multinomial models [McCallum and Nigam 1998], while the continuous variables are modelled through Gaussian kernels [John and Langley 1995], as shown in Equation (2). For the full model, the probabilities to be estimated are simplified through Bayes' rule and the conditional independence assumption to a posterior P(b|S) proportional to the product of the prior P(b) and the per-feature conditional probabilities, where S denotes a random variable ranging over the vector representation of vocalization events, as defined in Equation (1).
Since the feature sets used in our experiments contained relatively few features, no further preprocessing or feature selection steps were taken during training or classification. The number of PCDs in the test MDTM segments was assumed to be unknown. A maximum a posteriori (MAP) rule [Yang 2001] was adopted for PCD boundary assignment. Other strategies such as SCut and proportional thresholding could also be explored [Luz and Su 2010]. Section 5 discusses thresholding strategies further.
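The classifier described in this section can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the implementation used in the experiments: the class name `MixedNaiveBayes` is hypothetical, each continuous feature is modelled by a single Gaussian per class, and a Laplace smoothing parameter `alpha` is added to the maximum likelihood counts so that unseen role values do not yield zero probabilities.

```python
import math
from collections import Counter, defaultdict


class MixedNaiveBayes:
    """Naive Bayes over nominal (role) and continuous (duration) features."""

    def fit(self, nominal, continuous, labels, alpha=1.0):
        self.classes = sorted(set(labels))
        self.prior = {c: labels.count(c) / len(labels) for c in self.classes}
        self.alpha = alpha                  # smoothing: an assumption, not in the source
        self.counts = defaultdict(Counter)  # (class, feature index) -> value counts
        self.values = defaultdict(set)      # feature index -> values seen in training
        samples = defaultdict(list)         # (class, feature index) -> durations
        for noms, conts, c in zip(nominal, continuous, labels):
            for j, v in enumerate(noms):
                self.counts[(c, j)][v] += 1
                self.values[j].add(v)
            for j, x in enumerate(conts):
                samples[(c, j)].append(x)
        self.gauss = {}
        for key, xs in samples.items():
            mu = sum(xs) / len(xs)
            var = sum((x - mu) ** 2 for x in xs) / len(xs)
            self.gauss[key] = (mu, max(var, 1e-6))  # floor avoids division by zero
        return self

    def log_joint(self, noms, conts, c):
        """log P(c) + sum of log P(feature | c) under conditional independence."""
        lp = math.log(self.prior[c])
        for j, v in enumerate(noms):
            cnt = self.counts[(c, j)]
            total = sum(cnt.values())
            lp += math.log((cnt[v] + self.alpha)
                           / (total + self.alpha * len(self.values[j])))
        for j, x in enumerate(conts):
            mu, var = self.gauss[(c, j)]
            lp += -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
        return lp

    def posterior(self, noms, conts):
        logs = {c: self.log_joint(noms, conts, c) for c in self.classes}
        m = max(logs.values())
        z = sum(math.exp(v - m) for v in logs.values())
        return {c: math.exp(v - m) / z for c, v in logs.items()}

    def predict(self, noms, conts):
        """MAP rule: return the class with the highest posterior probability."""
        post = self.posterior(noms, conts)
        return max(post, key=post.get)
```

With Boolean labels, `posterior(...)[True]` plays the role of P(b|S), and `predict` applies the MAP rule without assuming a known number of PCDs.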

Evaluation
Although IR metrics such as precision, recall, F scores, and accuracy have been used to evaluate applications that combine topic segmentation and detection [Banerjee and Rudnicky 2004], the usual way to evaluate meeting segmentation is to employ metrics originally developed for text segmentation. For a segmentation task defined in terms of classification, as in this article, accuracy scores are misleadingly high due to the fact that the data set is highly imbalanced. Since only about 3% of instances are positive, a trivial classifier assigning non-boundary labels to all instances would predict accurately about 97% of the time. Precision, recall, and F scores are also difficult to interpret, even if restricted to the positive class, since they penalize near misses (hypothesised boundaries that fall near true boundaries) and predictions that are wide of the mark equally. Therefore two slightly different error metrics are employed, which originated in text segmentation research but are now widely used in speech topic segmentation: P_k [Beeferman et al. 1999] and WindowDiff (WD) [Pevzner and Hearst 2002].
The P_k metric gives the probability that two vocalizations occurring k vocalizations apart and otherwise picked randomly from the dataset are incorrectly identified by the algorithm as belonging to the same or to different PCDs. This is formally stated in Equation (5), where r and h denote the reference and hypothesis segmentation, respectively. D_k stands for a distribution with probability fixed at a distance k (chosen to be half the average segment size, in number of vocalizations), a(i, j) returns 1 if i and j belong to the same PCD and 0 otherwise, and δ returns 1 if its two arguments are equal and 0 otherwise (Kronecker delta). This results in an increment if boundaries are assigned inconsistently within a segment.
The WD metric is based on a similar idea. It can also be regarded as an estimate of inconsistencies between reference and hypothesis, obtained by sliding a window of length k segments over the MDTM and counting disagreements. WD, however, takes into account the number of boundaries predicted by the algorithm and the number actually contained in the reference for the calculation of the error score. The score is calculated as shown in Equation (6). N is the number of subsegments of size k, as before, and b(i, j) gives the number of PCD boundaries between segments i and j.
In addition to P_k and WD, we follow Sherman and Liu [2008] in reporting the mean number of boundaries actually assigned by the classifier. This is relevant to the interpretation of the results since both segmentation metrics tend to favor hypotheses with fewer boundaries.
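The two metrics can be illustrated with a short Python sketch that follows the usual formulations of Beeferman et al. [1999] and Pevzner and Hearst [2002]. It assumes segmentations are given as lists of 0/1 flags in which 1 marks an event that starts a new segment; the function names are illustrative, and details such as the normalization may differ from the exact forms of Equations (5) and (6).

```python
def default_k(ref):
    """Half the average reference segment length, in events."""
    segments = sum(ref[1:]) + 1
    return max(1, round(len(ref) / segments / 2))


def p_k(ref, hyp, k=None):
    """Fraction of probes (i, i + k) on which ref and hyp disagree about
    whether the two events belong to the same segment."""
    k = k or default_k(ref)
    n = len(ref) - k
    errors = sum(
        (sum(ref[i + 1:i + k + 1]) == 0) != (sum(hyp[i + 1:i + k + 1]) == 0)
        for i in range(n)
    )
    return errors / n


def window_diff(ref, hyp, k=None):
    """Fraction of windows in which ref and hyp contain a different
    number of boundaries."""
    k = k or default_k(ref)
    n = len(ref) - k
    errors = sum(
        sum(ref[i + 1:i + k + 1]) != sum(hyp[i + 1:i + k + 1])
        for i in range(n)
    )
    return errors / n
```

On a toy reference with one boundary, a hypothesis that places two adjacent boundaries next to the true one is penalized more by `window_diff` than by `p_k`, since only WD compares boundary counts within each window.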

Results
Table II shows the performance of the segmentation algorithm in a five-fold cross-validation experiment in which different window sizes and data representations were compared. Two alternative representations were assessed. In one of them the algorithm distinguished between the various types of pauses specified in Definition 4.1. In the other, it labelled all types of pauses simply as "silence". Results showed that discriminating between pause types (switching pauses, group switching pauses, vocalization pauses and group pauses) and increasing the vocalization context horizon both have a positive effect on segmentation accuracy. However, as the context horizon is increased past five vocalizations on each side of the current segment, performance degrades as a consequence of data sparsity. Furthermore, it should be remarked that while the single context representation (horizon n = 1) results in WD scores close to the best (five-feature horizon) results, its P_k results are clearly inferior. The apparently good performance of the single-feature context in terms of WD is explained by the low average number of PCD boundaries it predicts per segment (see Table II) in conjunction with the fact that the WD metric tends to favor underprediction. Underprediction, as will be seen in Section 4.7, is one of the main challenges in learning from imbalanced data such as the MDTM corpus. The best performing representations are therefore those that have low P_k and WD values, and yield a number of boundary predictions close to the number in the reference. The five-feature horizon representation met these criteria better than the alternatives.
A closer analysis of the predictions then reveals that WD scores are considerably higher than P_k scores due to the fact that the algorithm overpredicts boundaries around the true boundary (sometimes predicting as many as four hypothetical boundaries adjacent to the true boundary). This is an interesting phenomenon, which further supports the hypothesis that the sequential structure of speech exchanges is indicative of higher level (topic) structures. In addition, from a pragmatic perspective, since adjacent boundaries do not occur in practice, this algorithm's behavior offers a straightforward possibility for improvement simply by filtering the excess boundaries at a postprocessing step.
Figure 3 shows the segmentation profile for an interval of an MDTM. The filled horizontal bars on the line labelled "reference" represent dialogue events (vocalizations or silences) marked as PCD boundaries in the gold standard annotation. The bars on the line marked "hypothesis" represent the events marked as boundaries by the classifier, possibly containing adjacent event clusters, which could not possibly all be true boundaries. An overview of the probabilities assigned to each dialogue event by the classifier is shown at the bottom of the chart. The line marked "hypothesis 2" shows the boundary assignment results after a simple filtering algorithm was applied, which selected, among a cluster of adjacent hypothetical boundaries, the one with the highest probability (as assigned by the Naïve Bayes classifier) as the true boundary event.
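The filtering step that produces "hypothesis 2" can be sketched as follows. This is a minimal illustration that assumes the classifier output is a list of boundary probabilities P(b|S), one per dialogue event, and that an event is hypothesised as a boundary when its probability exceeds 0.5 (the two-class MAP rule); the function name and the `threshold` parameter are illustrative.

```python
def filter_adjacent(probs, threshold=0.5):
    """Keep one boundary per run of adjacent hypothesised boundaries,
    namely the event with the highest probability in the run."""
    keep, run = [], []
    for i, p in enumerate(probs):
        if p > threshold:
            run.append(i)  # extend the current cluster of adjacent hypotheses
        elif run:
            keep.append(max(run, key=lambda j: probs[j]))
            run = []
    if run:  # a cluster that reaches the end of the sequence
        keep.append(max(run, key=lambda j: probs[j]))
    return keep
```

For example, `filter_adjacent([0.0, 0.1, 0.6, 0.9, 0.7, 0.0, 0.0, 0.8, 0.2])` reduces the cluster at positions 2 to 4 to the single event at position 3 and keeps the isolated hypothesis at position 7.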

Baseline Analysis
Having observed that PCD (as well as topic) boundaries coincide with vocalization events that are significantly longer on average than nonboundary events (see page 4), one might wonder whether segmentation based on individual vocalization events, rather than the horizon representation proposed above, might suffice. One would also like to be able to quantify the improvement yielded by the vocalization horizon representation over reasonable baselines. In this section an analysis of alternative baselines is presented and compared to the results reported here.
Although random and majority classifiers are often used as baselines in machine learning research, they are inappropriate for PCD segmentation due to the imbalanced nature of the dataset. Better informed baselines that have been employed in the analysis of transcript-based meeting segmentation include random assignment to the test set of the same number of boundaries found in the training set [Sherman and Liu 2008] and Monte Carlo simulated segments [Hsueh et al. 2006]. Employing a Monte Carlo approach and generating a number of segments proportional to the number in a hold-out MDTM interval, averaged over 100 iterations, gives mean PCD segmentation errors of P_k = 45.7% and WD = 50.1%. In terms of this baseline, therefore, the optimal results of the horizon technique represent an improvement of about 61.5% for P_k and 69% for WD.
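A Monte Carlo baseline of this kind might be simulated as in the sketch below. It assumes that random hypotheses receive the same number of boundaries as the reference and are scored with a standard P_k; the function names, the fixed `seed`, and this particular sampling scheme are assumptions for illustration and may differ from the procedure actually used.

```python
import random


def p_k(ref, hyp, k):
    """Fraction of probes (i, i + k) on which ref and hyp disagree about
    whether the two events belong to the same segment."""
    n = len(ref) - k
    errors = sum(
        (sum(ref[i + 1:i + k + 1]) == 0) != (sum(hyp[i + 1:i + k + 1]) == 0)
        for i in range(n)
    )
    return errors / n


def monte_carlo_baseline(ref, k, iterations=100, seed=0):
    """Mean P_k of random segmentations with as many boundaries as ref."""
    rng = random.Random(seed)
    n_boundaries = sum(ref)
    scores = []
    for _ in range(iterations):
        hyp = [0] * len(ref)
        for i in rng.sample(range(len(ref)), n_boundaries):
            hyp[i] = 1  # place a boundary at a randomly chosen event
        scores.append(p_k(ref, hyp, k))
    return sum(scores) / len(scores)
```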
Tests showed that even though vocalizations and pauses tend to be longer at segment boundaries, predicting boundaries simply based on vocalization event duration would not necessarily improve upon the Monte Carlo baseline. Predicting a boundary for all vocalization events that exceed the mean duration by the amount reported in Section 4 would overpredict, yielding worse results than the baseline: P_k = 40.8% and WD = 51.3%. However, a more selective approach, taking only log-transformed segments two standard deviations greater than the mean, would produce an improvement: P_k = 38.5% and WD = 47%. The latter scores show the predictive potential of vocalization length, but fall well short of the results obtained with vocalization horizons. Including speaker role information and training a Bayesian model based only on role-duration pairs (in terms of Table II this would be equivalent to setting n = 0 for the horizon) produces similarly unimpressive results: P_k = 42% and WD = 46%. This analysis shows that considering even a small context (only the immediately preceding and following vocalization event) considerably improves the predictive power of the representation.
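The more selective duration baseline can be sketched as follows, assuming durations are given in seconds and that the population standard deviation of the log-transformed durations is used (the source does not say which estimator was applied); `duration_baseline` and `n_sd` are illustrative names.

```python
import math
from statistics import mean, pstdev


def duration_baseline(durations, n_sd=2.0):
    """Flag events whose log duration exceeds the mean log duration
    by more than n_sd standard deviations."""
    logs = [math.log(d) for d in durations]
    cut = mean(logs) + n_sd * pstdev(logs)
    return [int(value > cut) for value in logs]
```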
Finally, in an attempt to provide a stronger baseline, hidden Markov models (HMMs) were chosen, as they are commonly used for sequence analysis and would therefore appear to be a natural choice for topic segmentation. HMMs have in fact been employed for segmentation of telephone and news broadcast speech into sentences [Liu et al. 2006], a task that has some characteristics in common with topic segmentation. A model was created in which b (boundary) and ¬b (nonboundary) corresponded to the model states, speaker roles corresponded to observations, and transition and emission probabilities were estimated from the vocalization matrix. A five-fold cross-validation procedure was employed for evaluation. The best path hypothesis (Viterbi path) underpredicted, yielding P_k = 38.2% and WD = 41%. In order to mitigate underprediction, a proportional thresholding strategy [Yang 2001] was applied to the posterior probabilities for b states so as to select a number of boundary instances proportional to the number found in the training set. This strategy resulted in P_k = 38.7% and WD = 47.3%. These results are further discussed in Section 4.7 with respect to the class imbalance issue.
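Proportional thresholding can be illustrated with a short sketch. It assumes the posterior probabilities for the b state are available as a list, one value per event, and that `proportion` is the fraction of boundary instances observed in the training set; both names are hypothetical.

```python
def proportional_threshold(probs, proportion):
    """Select the highest-probability events as boundaries so that their
    share of all events matches the training-set proportion."""
    n = max(1, round(proportion * len(probs)))
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return sorted(ranked[:n])
```

Unlike the MAP rule, this strategy always returns roughly the expected number of boundaries, which counteracts underprediction at the cost of possibly selecting low-probability events.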

The Effect of Diarization Errors
These results were obtained by training the segmentation algorithm on a gold-standard (manually annotated and corrected) dataset where timing information and speaker identity are reliable across the sequence of vocalization events. As noted in Section 4, the task of segmenting an audio stream into a vocalization sequence and assigning these segments speaker labels is known as speaker diarization. The effectiveness of this task is usually measured in terms of an optimum one-to-one mapping of reference speaker labels to system output speaker labels. A diarization error rate (DER) metric is then computed as a fraction of speaker time that is mislabelled [NIST 2011].
Although turn-taking boundaries, pauses, and overlaps can be reliably identified for dyadic dialogues recorded under favorable conditions (cf. Heldner and Edlund [2010]), diarization is very much an active area of research. Progress has been made in recent years on diarization of meeting recordings [Fiscus et al. 2008; Tranter and Reynolds 2006]. The problem, however, is still far from solved, and the speaker diarization results from the latest rich transcription meeting recognition evaluations [NIST 2011] vary depending on the type of meeting and the audio capture source. Data captured through single distant microphones, for instance, seem harder to process, with error rates ranging from 15 to 30%. Multiple distant microphone data, on the other hand, can exhibit diarization error rates as low as 8%.
Unlike those of other corpora, the MDTM recordings were taken under challenging acoustic conditions, due to a number of factors. The MDTMs are busy, highly time-constrained events where participants make extensive use of artifacts such as paper records and X-ray films, which produce considerable noise. In addition, the video recording equipment (from which one of the audio sources was extracted) had to be placed at the back of the room so as not to interfere with the work of the medical participants. This adversely affected sound quality. In a less exploratory setting, where recording would be part of the MDTM routine, multiple microphones could be placed favorably, yielding DER levels comparable to those obtained by current diarization systems. This section presents an evaluation of the effects of different levels of DER on our PCD segmentation method.
Diarization consists of various components, including speech detection, change detection, gender and bandwidth classification, clustering, identity finding, and so on [Tranter and Reynolds 2006]. From the perspective of PCD segmentation according to the method proposed in this article, the relevant steps are speech and change detection (to determine the duration of vocalization, pause and overlap events) and clustering/identity finding (to assign speaker labels). As MDTMs have a stable staff membership (the multidisciplinary medical team) with well defined roles, speaker identities map straightforwardly to the role variables (V_0, ..., V_n). In order to assess the effects of different levels of DER on segmentation, two types of noise were added to the MDTM data: change detection errors and speaker labelling errors. These noise types were added to reflect the most typical errors found in current diarization systems. The different errors were assessed both separately and in combination.
Change detection is generally implemented by sliding a window of fixed length over the audio data and looking for change points within it by using a penalized likelihood ratio criterion (usually the Bayesian information criterion, BIC). A consequence of this approach is that the detector tends to miss short vocalizations (less than 2-5 s long) [Tranter and Reynolds 2006]. The distribution of vocalization lengths in the MDTM dataset is highly skewed towards shorter vocalizations (Figure 2a), with about 44% of vocalizations being shorter than 3 s. These facts were taken into account when adding noise to vocalization boundaries so that shorter segments would be more likely to be affected. Thus, vocalization durations were scaled according to noise drawn from an exponential distribution to target four different levels of DER: 4%, 10%, 17%, and 25%, by varying the λ parameter of the distribution. The resulting noisy data sets were tested for actual DER scores using md-eval [NIST 2011] and yielded DER values of 4.1%, 9.9%, 17.6%, and 26%, respectively.
Speaker error (where the system-assigned speaker label differs from the reference speaker) accounts for by far the majority of diarization errors [Tranter and Reynolds 2006; Fiscus et al. 2008]. This type of error was modelled on the MDTM data set by randomly reassigning speaker labels to the vocalizations. The target DER levels were the same as above, and the actual measured scores were: 4.4%, 9.2%, 17.4%, and 28%. Finally, we also generated noisy data sets that combined change detection and speaker errors in the proportions reported in recent diarization evaluations, that is, about 70% of the added noise corresponding to speaker errors and about 30% of noise corresponding to change detection errors. The actual mean DER scores for these data sets were: 5%, 9.2%, 17.5%, and 25.8%.
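The two kinds of simulated error could be injected along the lines of the following sketch. It is one possible reading of the procedure, not the code used in the experiments: events are assumed to be (label, duration) pairs, change detection noise is modelled by shortening each duration by an exponentially distributed amount (so that short events are the most likely to vanish), and speaker noise replaces a role label by a different one with probability `rate`. All function and parameter names are hypothetical.

```python
import random


def add_change_noise(events, lam, rng):
    """Shorten each event by an Exp(lam) amount; events shorter than the
    drawn amount end up with zero duration (i.e., they are missed)."""
    noisy = []
    for label, dur in events:
        shift = rng.expovariate(lam)  # mean shift is 1 / lam seconds
        noisy.append((label, max(0.0, dur - shift)))
    return noisy


def add_speaker_noise(events, rate, roles, rng):
    """With probability `rate`, reassign a vocalization to another role.
    Pause events (labels not in `roles`) are left untouched."""
    noisy = []
    for label, dur in events:
        if label in roles and rng.random() < rate:
            label = rng.choice([r for r in roles if r != label])
        noisy.append((label, dur))
    return noisy
```

In practice the `lam` and `rate` values would be tuned until the resulting data sets reach the target DER levels when scored against the gold standard.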
The best performing representation, the five-vocalization horizon with pause type discrimination, was chosen as the basis for testing. Data sets containing diarization errors in these defined ranges were converted into that representation, and a cross-validation procedure was employed for 10 iterations (noise set generation and segmentation were performed 10 times for each level), after which the P_k and WD results were averaged. This procedure was repeated for each of the three conditions: change detection errors only, speaker labelling errors only, and combined change detection and speaker errors.
The results of this evaluation are shown in Figure 4 in terms of P_k and WD scores. As expected, diarization errors have an adverse effect on PCD segmentation accuracy. However, the technique seems reasonably robust to moderate levels of diarization error, as the deterioration in segmentation accuracy is relatively small (6-9.6% in P_k and 14-17% in WD) for DER scores up to 10%. In terms of the effects of different types of errors on segmentation accuracy, change detection had a smaller impact than speaker errors. In the combined error condition, which better reflects errors produced by current systems, segmentation accuracy was similar to that obtained in datasets containing speaker error only. These results imply that the nominal (role label) features of the horizon representation are more useful than their continuous (vocalization event duration) counterparts. Despite these encouraging results, a usable system for MDTM information access and retrieval will require accurate diarization (as well as other rich transcription functionality) in addition to robust PCD segmentation. In order to address these needs in a more comprehensive way, we are currently seeking permission to gather a greater volume of medical meeting data using individual microphones and well positioned microphone arrays. These data should enable us to investigate diarization performance more realistically.

Redundancy in Data Representation and the Effectiveness of Naïve Bayes
The pattern of boundaries shown in Figure 3, with placement adjacent or clustered around the same regions in the hypothesis (generally around a true boundary), is specific to the use of a Naïve Bayes (NB) classifier for the segmentation task. A rerun of the experiment for the best combination of context size 5 and pause type discrimination on three different types of classification algorithms, namely, C4.5 [Quinlan 1993], SVM [Cortes and Vapnik 1995], and nearest neighbour (k-NN), illustrates this point. Table III shows a summary of results in terms of segmentation metrics and boundary numbers. The results for HMM segmentation reported in Section 4.5 have also been included for comparison. As can be seen, NB outperforms all other classification methods by a large margin. The difference between the mean number of boundaries initially hypothesised by NB and the mean number of boundaries actually placed after adjacent boundaries were filtered out is particularly noteworthy.
The performance of most classifiers degrades under imbalanced class distributions [Japkowicz and Stephen 2002], as is the case of PCD boundaries in the MDTM data. The class imbalance problem was noted in connection with sentence segmentation and HMM by Liu et al. [2006], who attempted different strategies to mitigate it, including a variety of sampling methods in combination with the models. Hsueh et al. [2006] remarked that such strategies appear to be ineffective in meeting topic segmentation, where class imbalance is much more severe. In the tests reported here, the decision trees generated by the C4.5 algorithm had to be left unpruned in order for the classifier to avoid trivial classification of all segments as nonboundaries. Similarly, the SVM was set to use a simple polynomial kernel, k(s_i, s_j) = (s_i · s_j)^d. Despite these adjustments, all tested classifiers with the exception of NB generated on average fewer boundary hypotheses than the reference. As the figures for pre- and post-filtering in Table III show, the clustered placement of boundary hypotheses only occurs with NB. A possible explanation for the good performance of NB lies in the redundant nature of our data representation scheme. Although the original data representation introduced in Section 4 describes a vector of attributes that correspond to a sequence of vocalizations of a certain length, NB's independence assumption implies that order information is lost when the parameters of the model are estimated. This means that the representations for candidate boundary instances that occur next to each other actually share all but two features. Zhang [2004] analyzed the conditions under which NB can exhibit optimal performance. He concluded that regardless of how strong the dependencies among attributes are, good performance can be attained if the dependencies cancel each other out, or are distributed evenly within the classes. In the case of our representation scheme, which always preserves the grouping of roles and vocalization lengths as it slides a window of fixed size (with respect to the number of discrete vocalization events, not time) over the dialogue sequence to generate candidate boundary instances, most dependencies will be cancelled out. Furthermore, the similarity among instances in the neighborhood of a true boundary will have the effect of mitigating the effect of class imbalance. If this is the case, a possibility for improving on the current performance of NB would be to mark nonboundary vocalizations adjacent to true boundaries in the training set as boundaries so as to train the classifier to overpredict around the true boundaries, and then filter out the excess hypotheses through lower order sequence analysis methods such as HMM. This seems a promising topic for future research.

DISCUSSION AND COMPARISONS
The system presented here attains performance levels comparable to those achieved by state-of-the-art supervised systems for segmentation of meetings by topic, while using much simpler content-free features. The decision tree approach presented in Galley et al. [2003], which is based on lexical cohesion features (LCSeg) extracted from hand-transcribed speech from the ICSI corpus, has error rates of 31.9% (P_k) and 35.9% (WD). The authors report these results to be significantly better than results of other approaches originally designed for text segmentation [Utiyama and Isahara 2001; Choi 2000], whose error scores on the same corpus range from 37.4% to 58%. Sherman and Liu [2008] found that hidden Markov models (over sentence sequences) produce better results than LCSeg on the ICSI corpus, including sub-topics (P_k = 32.7%, WD = 42%). Hsueh and Moore [2007] report that a lexical cohesion segmentation approach applied to topic segmentation of the AMI corpus produces a P_k score of about 40% and a WD score of 47%. Their improved maximum entropy segmentation algorithm, which combines lexical, conversational, prosody, video, and contextual features, achieves 34% (P_k) and 36% (WD). These scores were obtained on the task that includes subtopics, whose ratio of boundary segments to total number of segments is similar to the same ratio observed in the MDTM corpus. The authors also show that moderate levels of word error rates in speech recognition cause only slight degradation in performance, and that not all classes of features are equally important. Somewhat in agreement with the hypothesis investigated in this article, they find that conversational features are the most essential nonlexical features for topic segmentation.
Although task and corpus differences do not allow a detailed comparison of our results with the ones reported for these systems, we note that for a comparable proportion of target boundaries, our approach, based solely on amount of speech, speaker transition, and role description features, attains lower error rates (27.6% and 34.7% for P_k and WD respectively) than those more elaborate approaches. A similar content-free approach to the one described in this article, including postfiltering and different threshold strategies as well as further data representation distinctions aimed at better characterizing overlaps and pause events, has been tested on the AMI corpus [Luz and Su 2010], yielding relatively good results (P_k = 27.7% and WD = 36%).
It is likely that PCDs are better structured and more homogeneous with respect to turn-taking than topics in more general meetings (even scenario-based ones) and that this structure is captured by our model. In this regard, a comparison to other nonlexical topic segmentation methods that process better structured data such as news broadcasts [Allan et al. 1998] may be helpful. Shriberg et al. [2000] employed decision trees to estimate topic boundary probabilities in broadcast news audio based solely on pause duration, F0 (pitch) range, turn/no turn at boundary, speaker gender, and turn duration. They reported error results of about 17.3% in terms of the TDT segmentation metric [Allan et al. 1998]. The TDT metric is an adaptation of the P_k metric [Beeferman et al. 1999] which penalizes false alarms more heavily than misses by assigning a 0.7 weight to instances of the former and a 0.3 weight to instances of the latter. This means that in an imbalanced dataset the TDT metric is more forgiving than the P_k metric used in this article (chance in the TDT weighted metric yields a score of 30% while chance on the MDTM yields a P_k of about 45%). In spite of these differences, the results of Shriberg et al. are impressive and further corroborate the hypothesis that nonlexical features can be good indicators of topic structure. Another hindrance to comparisons between news and MDTM segmentation results is the fact that, even though in both cases the topics have relatively well defined structure, news data are characterized by a more marked contrast between very frequent speakers (e.g., the news anchor) and very rare ones (e.g., interviewees and guests). This kind of contrast is even more evident in data from lectures, to which nonlexical approaches have also been applied. The unsupervised approach based on audio features only and tested on lecture data by Malioutov et al. [2007] produced a P_k score of 35.8% and a WD of 37%.
From the practical point of view of implementing a searchable multimedia archive of MDTMs [Luz and Kane 2009] usable in a real-world application, segmentation is an early but important step. Due to their relatively high error rates, it is unlikely that current segmentation methods could be used for storage of PCD records as separate units on a database system in a medical context. Rather, we envision an interaction mode in which the user, for instance, browses time-based media containing recordings of MDTMs in order to locate the information of interest. The method presented here, even though it clearly overpredicts, could usefully support this interaction mode. In browsing, high recall is often favored over precision [Bouamrane and Luz 2007; Moran et al. 1997]. When presented with a misidentified PCD boundary (a false positive), the user can usually identify it as such after a few seconds of listening and skip over to the next boundary. In that regard, it is worth pointing again to Figure 3 and noting that the profile is dominated by zero or very low probabilities (representing true negatives), and that for all missed boundaries (false negatives) the probabilities peak to values greater than those of true negatives. Therefore, if one were to adjust the classification threshold, one could optimize the utility of the classifier (in a decision-theoretic sense, valuing recall over precision [Lewis 1995]) for this particular interaction mode. Usability studies to determine and test such parameters are a promising area for future work.
Although the information generated at MDTMs constitutes an invaluable resource for a number of processes in healthcare, from patient management to teaching, the incorporation of MDTM-generated data into existing patient-centered models is far from straightforward. Given that MDT meeting participants work under tight time constraints, automatic recording seems to be the only viable option for data gathering. Recording and storage of multimedia meeting data in digital form have become relatively commonplace in recent years. The challenge consists in finding effective ways of structuring and providing easy access to these data.

CONCLUSION AND FUTURE WORK
Collection of meeting data that are representative of the activities of professionals in the real world is a basic requirement for the analysis of the effects of meeting support technologies on group performance and for the development of systems capable of capturing and indexing meetings. This article described a small but, we believe, representative corpus of speech interaction data generated during multidisciplinary medical team meetings.⁴ A novel use of a simple data representation technique inspired by research on dialogue in the fields of social psychology and computer-supported cooperative work has been presented, which produced surprisingly good results in an automatic topic segmentation task. The combination of nominal and continuous features derived from amount and sequence of speech and speaker roles through a Naïve Bayes classifier yielded promising results when applied to the segmentation of MDTMs into patient case discussions, achieving performance levels that compare favorably to state-of-the-art meeting segmentation techniques.
The work described here forms part of an ongoing study aimed at understanding the task and process at play in MDTMs with a view to identifying ways in which computer technology might be deployed in such settings. This includes an investigation into the possibility of enriching existing electronic health records with automatic segmentation and indexing of patient case discussions. Such a system would potentially allow users to easily retrieve PCDs for teaching and healthcare management purposes. In order to achieve these goals, in addition to segmentation, we are currently tackling issues such as automatic categorization of PCDs [Luz and Kane 2009], as well as carrying out further fieldwork studies with the cooperation of the medical teams. In parallel, we are also conducting a controlled user study of the usefulness of topic segmentation outputs at different levels of accuracy on a browsing task, using AMI corpus data.
As we reach a better understanding of the information needs of the different people involved in MDTM work, we plan on extending the evaluation of our segmentation techniques to encompass the study of empirical correlations between performance at typical information access tasks by, say, senior consultants reviewing similar cases, and the existing segmentation metrics. This could help establish utility criteria based on which parameters such as segmentation thresholds can be tuned. Further work could also explore the detection of specific salient events related to PCD stages. An example of a salient event is the TNM (Tumour, Nodes, Metastases) categorization by the meeting participants. In addition to MDTM-specific research, we are also carrying out further evaluations on standard meeting corpora.
⁴ Due to the confidential and sensitive nature of the material gathered, the audio and video recordings cannot be distributed. However, the anonymized vocalization sequence data sets and the software used for segmentation and evaluations reported in this article can be made available on request. In the future we hope to obtain approval to gather and distribute privacy-preserving audio features from which content cannot be recovered but that can be used for segmentation at different levels [Parthasarathi et al. 2009] so as to extend the range of possible content-free studies based on the data.

Fig. 2. Amount of talk in MDTMs: (a) according to duration of individual vocalizations, and (b) distributed by medical roles.
In Equation (2), μ_b and σ²_b are the mean and variance of the values taken by the features L_i in the dataset, given a PCD boundary, represented here as the Boolean variable b.

Fig. 3. Profile for an MDTM interval showing segmentation results before (hypothesis) and after (hypothesis 2) removal of spurious adjacent boundaries, in relation to the gold standard (reference). The probabilities assigned to each vocalization event by the classifier, P(b|S), are shown in the series below the horizontal bars.

Fig. 4. Segmentation results in P_k (left) and WD (right) for datasets containing diarization errors. The traced line shows error scores obtained for representation horizon n = 5 built on gold standard data (data containing no diarization errors).

Table II. Mean Results for Cross-Validated Segmentation Experiments for 1 ≤ n ≤ 7 Vocalization Horizons, with and without Pause Type Discrimination. Mean number of boundaries per segment fold is 10 in reference.
ACM Transactions on Information Systems, Vol. 30, No. 3, Article 17, Publication date: August 2012.

Table III. Performance of Segmentation Based on Different Types of Classifiers. Data representation set to a context of five vocalizations, including pause type discrimination.