DanPASS - Danish Phonetically Annotated Spontaneous Speech

Nina Grønnum

Goal

The intention was to supply a corpus for acoustic and perceptual phonetic investigations. That is, the primary goal is neither syntactic, pragmatic, socio-linguistic, psychological, nor whichever other aspect of spoken language one might wish to investigate. There are therefore a considerable number of discourse variables that have not been taken into account in the choice of elicitation material. Nevertheless, the corpus may serve as a basis for any number of linguistic and/or speech technological investigations.

The project is financed by a grant from Carlsbergfondet.

The Corpus

The corpus consists of monologues, dialogues and word lists. Apart from the word lists, the corpus represents an approximation to speech in a natural setting: On the one hand the material for elicitation is controlled in the sense that the speakers are given specific tasks to talk about. This is to facilitate comparison across speakers and ensure sufficiently uniform materials for subsequent analyses. On the other hand the speech is non-scripted.

Monologues

The monologues were recorded in 1996. They represent various types of instructions. The speaker was seated alone in the professional recording studio in the department and could communicate with the experimenter only via microphone and headphone. Once the speaker had been instructed in the specific task, (s)he could no longer address the experimenter with questions or comments. In other words, the monologues were recorded in one-way communication with an unseen partner who offered no feedback, in the form of neither questions nor confirmation. Speakers were recorded with professional equipment (Sennheiser Microphone ME64, Revox A700, Agfa PEM368 tape). The analog recordings were digitized later, at a sampling frequency of 48 kHz, and transferred to cd-roms.

The speaker first described a network consisting of various geometrical shapes in various colours. It is an elaboration of Swerts and Collier's (1992) network. It was specifically intended to reveal whether or not speakers look ahead and signal prosodically an upcoming utterance boundary prior to its actual occurrence.

See the network.

The speaker then guided the experimenter through four different routes in a virtual city map, inspired by Swerts (1994).

See the map of "Slotsby."

Finally, the speaker – who had a model – told the experimenter how to assemble a house from its individual pieces. This house is an almost exact copy of Terken's (1984) edifice.

See the house.

See the instructions as they were read to the speakers.

Dialogues

The dialogues were recorded in the summer of 2004. They are replicas of the Human Communication Research Centre's Map Tasks (cf. Anderson et al., 1991; Brown et al., 1984; http://www.hcrc.ed.ac.uk/maptask/).

The exercise involved the cooperation of two participants. They were seated in separate locations, one in the department's recording studio, the other in a recording facility established for the purpose in the main control room, with curtains of very heavy material surrounding the speaker. The speakers communicated via headsets.

A laboratory set-up like this is hardly the most natural environment for communication, but it turned out to be necessary in order to obtain recordings of sufficiently good quality for subsequent acoustic analysis: Seated in the same room, across from each other with eye-contact, speaker A could invariably be heard over speaker B's microphone, and vice-versa; whereas we got clean acoustic signals when the speakers were separated, with no appreciable difference in quality between the studio proper and the ad hoc studio established in the control room. Given the setting, i.e. the lack of visual and direct auditory contact, the participants would presumably be most comfortable if they were not also to communicate with a stranger. Accordingly, the two members of a pair knew each other well. They were recorded via professional headset microphones (Voice Technologies VT700), directly onto cd-roms (HHB Professional Compact Disc Recorder CDR-850) to separate channels in a stereo recording.

Each participant had a map. One, the giver, had a route on his or her map; the other, the follower, did not. Their goal was to collaborate so as to reproduce the giver's route on the follower's map. The maps are not exactly identical: Landmarks are missing on one or the other map, a landmark may appear twice – in two different locations – on one map but in only one location on the other; and the same landmark may have slightly different names on the two maps. This gives rise to a true negotiation, with questions and answers, backtracks, etc. Participants were explicitly informed about these irregularities in written instructions prior to the recording. It was left to them, however, to discover how and where the maps or the designations differed, and to supply the missing items and correct names on their respective maps. Each pair of speakers completed four different sets of maps.

See the first set of maps.

See the second set of maps.

See the third set of maps.

See the fourth set of maps.

Word Lists

After completion of the map sessions subjects were asked to read a word list containing all the feature names from the maps they had encountered. Each name appeared twice, in random order, and subjects were asked to read the list in a distinct speech mode. The lists provide citation forms for comparison with the less distinct dialogue forms. Landmarks and names in the original English maps were designed with specific phonological phenomena and processes in mind. I was more or less bound by the landmarks and their translation into Danish, with only moderate influence over phonological structure.

See the wordlists.

See the written instructions to speakers.

Speakers

There were 27 speakers, all of them (former) students or colleagues in the (former) Department of General and Applied Linguistics, and all except one originating in the greater Copenhagen area. None had any known speech or language deficits.

18 speakers recorded the monologues, 13 men and 5 women. 22 speakers recorded the dialogues, 13 of whom also recorded the monologues.
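The counts above can be reconciled with simple set arithmetic. The following Python sketch is only an illustration of that check; the variable names are chosen here and are not part of the corpus documentation.

```python
# Speaker counts as stated in the text.
monologue_speakers = 18
dialogue_speakers = 22
recorded_both = 13

# Inclusion-exclusion: speakers who recorded both types are counted once.
total_speakers = monologue_speakers + dialogue_speakers - recorded_both
dialogue_only = dialogue_speakers - recorded_both
monologue_only = monologue_speakers - recorded_both

print(total_speakers)   # 27
print(monologue_only)   # 5
print(dialogue_only)    # 9
```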

The speakers appeared to be comfortable with the task(s) and the experimental setting. They produced fluent speech for both monologues and dialogues and were not in any obvious way influenced by the non-naturalness of the circumstances.

See information about the speakers.

Video Recording

In the studio proper a video-recorder was mounted. The camera was placed as close as possible, and as nearly perpendicular as possible, to the frontal plane of the speaker's face without impeding his/her view of the map. The videos are intended as analysis material for whoever should want to attempt to accompany synthetic Danish speech with a model talking face.

Each speaker had to serve as giver as well as follower, in alternation. Each speaker also had to be video-recorded in both roles. The logistics of running two video-cameras were prohibitive and we had only one. Accordingly, after two map sessions, with speaker A being giver and follower, respectively, the speakers changed places in order for speaker B to be video-recorded as well. Thus, each pair of speakers had a run through four different sets of maps. A complete recording session lasted 30-40 minutes.

Statistics

There are 9 hours and 46 minutes of speech altogether: 2 hours and 51 minutes of monologues and 6 hours and 55 minutes of dialogues, including the word lists. There are 2119 different word forms in the corpus as a whole, 1074 in the monologues and 1557 in the dialogues. The total number of words is 21,146 in the monologues and 52,530 in the dialogues, i.e. a grand total of 73,676 words.
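The totals above follow directly from the per-type figures. As a minimal sketch (the helper name `to_minutes` is introduced here for illustration only), the sums can be checked like this:

```python
def to_minutes(hours, minutes):
    """Convert an hours-and-minutes duration to whole minutes."""
    return hours * 60 + minutes

# Durations as stated in the text (dialogue figure includes the word lists).
monologue_minutes = to_minutes(2, 51)
dialogue_minutes = to_minutes(6, 55)
total_hours, total_minutes = divmod(monologue_minutes + dialogue_minutes, 60)
print(total_hours, total_minutes)  # 9 46

# Running word counts as stated in the text.
monologue_words = 21146
dialogue_words = 52530
print(monologue_words + dialogue_words)  # 73676
```

Note that the word-form counts cannot be added in the same way: a form that occurs in both monologues and dialogues is counted in each subtotal but only once in the corpus-wide figure.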

A lexicon comprising the words from monologues and dialogues was extracted from the texts and supplied with a phonological transcription and their frequency of occurrence.

See the lexicon.

Processing

Monologues and dialogues were transcribed orthographically in standard orthography, without punctuation, with capital letters for proper names only, with indication of empty pauses and filled pauses, and with marks for articulatory hesitation. The orthographical representation is supplemented with stress marks (commas directly before the vowel letter representing the vowel of the stressed syllable), intended for researchers who are interested only in the distribution of stress across the texts, regardless of the pronunciation.
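Because the transcription contains no punctuation, every comma in it is a stress mark. The following Python sketch shows one way such marks might be removed or located; the function names and the sample phrase are illustrative assumptions, not taken from the corpus files.

```python
def strip_stress(text):
    """Remove stress marks (commas) from an orthographic transcription."""
    return text.replace(",", "")

def stressed_vowel_positions(text):
    """Return the indices, in the unmarked text, of the stressed vowel letters."""
    positions = []
    marks_seen = 0
    for index, char in enumerate(text):
        if char == ",":
            # The vowel follows the comma; subtract the commas already
            # passed to get its index in the unmarked text.
            positions.append(index - marks_seen)
            marks_seen += 1
    return positions

# Hypothetical example phrase, marked up by hand for this sketch.
marked = "t,il v,enstre"
plain = strip_stress(marked)
print(plain)                                                  # til venstre
print([plain[i] for i in stressed_vowel_positions(marked)])   # ['i', 'e']
```

Stripping the marks in this way is also what a search in the topic and focus tier requires, since that tier omits them.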

See the orthographical transcription of dialogues and monologues.

The speech signals are segmented and annotated in Praat. The acoustic signal is segmented into prosodic phrases, words and syllables, always to the nearest zero-crossing in the waveform.
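Placing a boundary at the nearest zero-crossing can be pictured as follows. This is a plain Python sketch over a list of sample values, written for illustration only; it is not the procedure Praat uses internally, and the function name is an assumption.

```python
def nearest_zero_crossing(samples, index):
    """Return the sample index closest to `index` at which the waveform
    is zero or changes sign relative to the preceding sample.
    Returns None if the signal never crosses zero."""
    def is_crossing(i):
        if samples[i] == 0:
            return True
        return i > 0 and samples[i - 1] * samples[i] < 0

    # Search outwards from the requested position, earlier side first.
    for distance in range(len(samples)):
        for candidate in (index - distance, index + distance):
            if 0 <= candidate < len(samples) and is_crossing(candidate):
                return candidate
    return None

waveform = [3, 2, 1, -1, -2, -1, 2, 3]
print(nearest_zero_crossing(waveform, 4))  # 3
print(nearest_zero_crossing(waveform, 7))  # 6
```

Moving boundaries to zero-crossings avoids audible clicks when a segment is cut out and played in isolation.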

See the conventions.

There are nine separate interval tiers for (1) the orthographical representation, (2) a detailed part-of-speech (PoS) tagging, (3) a simplified PoS-tagging, (4) an abstract phonological representation, (5) a semi-narrow phonetic transcription with the same domain boundaries as tiers 1-4 (this is to facilitate combined searches in the phonological and phonetic representations), (6) the same semi-narrow phonetic transcription but in a syllable-sized domain, (7) a symbolic representation of the pitch relation between each stressed and its first post-tonic syllable, (8) a symbolic representation of the phrasal intonation contour, and (9) a tier for comments.

In a project headed by Patrizia Paggio at the Centre for Language Technology, University of Copenhagen, the information structure of the monologues was analysed, and topic and focus tags have been added to the orthographical representation in a separate tier (10).
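The tier layout described above can be summarised as a small lookup table. In this Python sketch the tier numbers follow the text, but the short labels are chosen here for illustration and are not necessarily the tier names used in the text grids themselves.

```python
# Tier numbers as described in the text; labels are illustrative only.
TIERS = {
    1: "orthography",
    2: "PoS (detailed)",
    3: "PoS (simplified)",
    4: "phonology (abstract)",
    5: "phonetics (semi-narrow, boundaries as tiers 1-4)",
    6: "phonetics (semi-narrow, syllable-sized)",
    7: "pitch relation, stressed to first post-tonic syllable",
    8: "phrasal intonation contour",
    9: "comments",
    10: "topic and focus (monologues only)",
}

# Tiers that the text states have the same domain boundaries.
SHARED_BOUNDARY_TIERS = {1, 2, 3, 4, 5}

def stated_to_share_boundaries(tier_a, tier_b):
    """True if the text states that both tiers use the same boundaries."""
    return tier_a in SHARED_BOUNDARY_TIERS and tier_b in SHARED_BOUNDARY_TIERS

print(stated_to_share_boundaries(4, 5))  # True
print(stated_to_share_boundaries(5, 6))  # False
```

The shared boundaries are what make a combined search possible: an interval in tier 4 and the interval at the same position in tier 5 describe the same stretch of speech.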

See an example of a Praat screen.

Annotation

The PoS-tagging in tiers 2 and 3 is automated. The tagger, developed by Peter Juel Henrichsen, Department of Computer Linguistics at Copenhagen Business School, was trained on written language, not spontaneous speech (Henrichsen, 2002). At the outset there was no way to predict how well the tagger would perform on non-scripted speech. However, although the tagger does make mistakes, they are not random. They are more or less confined to certain types, as revealed in the subsequent manual proof-reading process, and on the whole the tagger is efficient and reliable.

However, the tag set (which was originally designed to be applicable to many more languages than Danish) is not quite appropriate in two respects. For one thing, there is no category 'article' and it over-generalizes the category 'demonstrative pronoun' to contain also definite articles. Accordingly, I have added indefinite and definite articles to the tags, thus:

See a list of the complete set of part-of-speech tags.

And see the full versus reduced tag correspondences here.

The phonological representation in tier 4 is fairly abstract where the segments are concerned, in accordance with the phonological analysis of Danish in Grønnum (2005), but stress marks are added to polysyllables, and stød is designated as well, although both stress and stød are to a very large extent predictable from the segmental and morphological structure and thus – strictly speaking – phonologically redundant. Adding stress and stød, however, will presumably facilitate certain search procedures at a later stage. (Stød is a special kind of creaky voice characterizing certain syllable types under certain morphological conditions. See, e.g., Grønnum and Basbøll (2007).)

The phonetic transcription in tiers 5 and 6 is semi-narrow, with a fairly liberal use of relevant diacritics.

See the conversion between IPA's symbols and Praat's transcription codes.

See the phonetic segmental transcription conventions.

The phonetic transcription has been extracted from tier 5 to running text.

See the phonetic segmental transcription of dialogues and monologues.

See the phonetic segmental transcription of dialogues and monologues, broken into prosodic phrases.

A "pronouncing lexicon" has been extracted, (a) with frequency of occurrence of each pronounced form and (b) with reference to every occurrence in the texts.

See the pronouncing lexicon with frequency of occurrence.

See the pronouncing lexicon with reference to occurrence in the texts.

The symbolic representation in tier 7 of the pitch relation between stressed and first post-tonic syllable is graded in seven steps: The post-tonic is much higher (H/), higher (H), a little higher (h), equal to (=), a little lower (l), lower (L), or much lower (L\) than the stressed syllable. The interval is specified to such a relatively fine degree because in its magnitude lies a correlate to perceived prominence (Grønnum, 1990; Jensen and Tøndering, 2005).
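For quantitative work the seven labels can be treated as an ordinal scale. The Python sketch below maps them to integers from -3 to +3; the numeric values are an assumption made here for illustration and are not part of the annotation itself.

```python
# Tier 7 labels mapped to ordinal ranks (ranks are an illustrative assumption).
# Positive: post-tonic syllable higher than the stressed syllable.
PITCH_STEPS = {
    "H/": 3,    # much higher
    "H": 2,     # higher
    "h": 1,     # a little higher
    "=": 0,     # equal
    "l": -1,    # a little lower
    "L": -2,    # lower
    "L\\": -3,  # much lower (a single backslash in the annotation)
}

def pitch_rank(label):
    """Return the ordinal rank of a tier 7 pitch relation label."""
    return PITCH_STEPS[label]

labels = ["L\\", "h", "=", "H/", "l"]
print(sorted(labels, key=pitch_rank))  # ['L\\', 'l', '=', 'h', 'H/']
```

An ordinal mapping preserves only the order of the steps; it does not imply that the steps are equally spaced in pitch.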

Syllables which are perceived to have more than "normal" stress are marked with an exclamation mark before the star.

See the prosodic transcription conventions.

Phrasal intonation in Danish, tier 8, is characterized by, firstly, the way the stressed syllables are pitch scaled throughout the phrase, i.e. by their mutual relationship, and, secondly, presumably also by the way the phrase onsets and offsets, i.e. by the pitch of the very first and very last syllable in the phrase, be it stressed or unstressed. The pitch of the stressed syllables and the syllables at the phrasal boundaries is represented on a coarse scale of high (h), mid (m) or low (l). However, means also exist for a finer gradation within a succession of stressed syllables in a given range (between high and mid, high and low, or mid and low). E.g., h_>_>_>_m designates a succession of five stressed syllables which descend gradually from high to mid.
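The example label above can be unpacked mechanically. The following Python sketch handles only the pattern shown in that one example, with underscore-separated symbols, one per stressed syllable; the full set of tier 8 symbols is defined in the prosodic transcription conventions and is not assumed here.

```python
def describe_contour(label):
    """Split a tier 8 label of the form shown in the example into its
    parts: starting level, ending level, and number of stressed syllables."""
    tokens = label.split("_")
    return {
        "start": tokens[0],        # level of the first stressed syllable
        "end": tokens[-1],         # level of the last stressed syllable
        "syllables": len(tokens),  # one token per stressed syllable
    }

print(describe_contour("h_>_>_>_m"))
# {'start': 'h', 'end': 'm', 'syllables': 5}
```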

Readers familiar with the ToBI convention for transcribing prosody (e.g., Silverman et al., 1992) should note that any similarity with our annotation is merely superficial. For the description of Danish intonation the phonological assumptions behind ToBI are inappropriate, and as a phonetic transcription system it is not sufficiently fine-grained for our purpose (Grønnum 1985, 1986, 1995). For a general critique of ToBI, see Kohler (2005, 2006, 2007).

Note that, for reasons to do with time and resources, the pitch relation between successive prosodic phrases is not represented. Given the flexibility of Praat, it can easily be added to the grid if and when the need arises.

As mentioned above, the monologues have been supplied with tags for information topic (T) and information focus (F) in tier 10, cf. Paggio (2006).

See Patrizia Paggio's guidelines for focus and topic tagging.

Although topic and focus have been tagged in the text grids, text files are available for easier reading. In the texts, stress marks are omitted, but pauses are retained from the orthographical representation in tier 1. If you open the files in Wordpad or emacs (unix), sentence boundaries, indicated in the text grids by "boundary", are shown as line breaks. Please note that the texts were focus-tagged prior to the final proof-reading, and minor, immaterial differences may appear between tiers 1 and 10.

See the focus and topic tagged text files.

Tier 9 is for ad hoc comments.

The segmental and prosodic annotation in tiers 5-8 was performed independently and in parallel by two assistants. Disagreements between them were resolved in conferences with Nina Grønnum. Subsequently, NG proof-read the entire file. This procedure was repeated through every step: first the phonetic transcription, then the stress-and-pitch relation and finally the phrasal intonation.

Note especially that when searching for strings in tiers 4 and 5 (phonemic and phonetic transcriptions, respectively) you need to write Praat's codes for IPA symbols, cf. the link above to the conversion between IPA symbols and Praat codes, and that when searching in tier 10 (Focus and Topic) you must leave out the stress marks (the commas). Further search tips here.

Link to the search engine.

Access

The zipped folders with sound files and text grids, respectively, are here for downloading. The dialogue sound files exist in stereo sound (both speakers) as well as in mono sound (each speaker in a pair). Note that to open the sound files you need a password. Contact ninag @ hum.ku.dk.

Sound files - monologues.
Sound files - dialogues; stereo-sound.
Sound files - dialogues; mono-sound.
Sound files - word lists.
Text grids. (Last update March 4th, 2010.)

Publications

Grønnum, Nina & Tøndering, John (2007), "Question intonation in non-scripted Danish dialogues", in Proceedings of the XVIth International Congress of Phonetic Sciences 2007, Saarbrücken, Saarbrücken, Saarland University, pp. 1229-1232.

Jensen, Christian & Tøndering, John (2005a), "Perceived prominence and scale types", in A. Eriksson & J. Lindh (eds.), Proceedings FONETIK 2005, The XVIIIth Swedish Phonetics Conference, May 25-27 2005, Göteborg, Sweden: Göteborg University, Department of Linguistics, pp. 111-114.

Jensen, Christian & Tøndering, John (2005b), "Choosing a Scale for Measuring Perceived Prominence", in Proceedings of Interspeech 2005, September 4-8, Lisbon, Portugal, pp. 2385-2388.

Pharao, Nicolai (2009), Consonant Reduction in Copenhagen Danish - A study of linguistic and extra-linguistic factors in phonetic variation and change. Ph.D. dissertation.

Tøndering, John (2003), "Intonation contours in Danish spontaneous speech", in M.J. Solé, D. Recasens and J. Romero (eds.), Proceedings of the 15th International Congress of Phonetic Sciences, Barcelona 3-9 August 2003. Barcelona, Spain: Universitat Autònoma de Barcelona, pp. 1241-1244.

Tøndering, John (2008), Skitser af prosodi i spontant dansk, Ph.d.-afhandling. Det Humanistiske Fakultet, Københavns Universitet.

Acknowledgements

This project would not have been possible without extensive help from many people, nor without external funding. The Carlsberg Foundation provided a grant of 1.08 million Danish kroner.

A number of individuals have each contributed invaluable assistance: Preben Dømler and Svend-Erik Lystlund assisted at the recordings. Gert Foget Hansen segmented a part of the monologues. Maja Dyrby and Line Burholt proof-read the PoS-tags. John Tøndering transcribed orthographically all the monologues. He has written a number of immensely useful scripts for Praat, to locate mistakes, to move boundaries etc. He also uses the corpus for his own Ph.D. project and liberally shares his results with me. Nicolai Pharao supplied the 2140 word forms with an abstract phonological representation. The major and most tedious work, however, is the responsibility of the transcribers, Cem Avus, Jeppe Beck, Andreas Geisler, Louise Astrid Johansson, Ruben Schachtenhaufen and Thit Wange Stærkær. Line Burholt Kristensen and Tina Ringkjær performed the topic and focus annotation of the monologues. Finally, without the twenty-seven speakers who gave liberally of their time and enthusiasm, none of this would have been possible.

References

Anderson, A.H., Bader, M., Bard, E.G., Boyle, E., Doherty, G., Garrod, S., Isard, S., Kowtko, J., McAllister, J., Miller, J., Sotillo, C., Thompson, H.S., Weinert, R., 1991. The HCRC Map Task Corpus. Language and Speech 34, 351-366.

Brown, G., Anderson, A., Shillcock, R., Yule, G., 1984. Teaching Talk. Cambridge University Press, Cambridge.

Grønnum, N., 1985. Intonation and text in Standard Danish, J. Acoust. Soc. Am. 77, 1205-1216.

Grønnum, N., 1986. Sentence intonation in textual context – supplementary data, J. Acoust. Soc. Am. 80, 1040-1047.

Grønnum, N., 1990. Prosodic parameters in a variety of regional Danish standard languages, with a view towards Swedish and German. Phonetica 47, 182-214.

Grønnum, N., 1995. Superposition and subordination in intonation – a nonlinear approach. In Elenius, K., Branderud, P. (Eds.) Proc. XIIIth Int. Cong. Phonetic Sc. Stockholm 1995, vol. II. KTH and Stockholm University, Stockholm, pp. 124-131.

Grønnum, N., 2005. Fonetik og Fonologi, 3. udg., Akademisk Forlag, København.

Grønnum, N. and Basbøll, H., 2007. Danish Stød – Phonological and Cognitive Issues. In Solé Sabater, M.-J., Beddor, P.S. and Ohala, M. (Eds.) Experimental approaches to Phonology. Oxford University Press, Oxford, 192-206.

Henrichsen, P. Juel, 2002. Sidste Års Aviser – Grammatisk opmærkning af et stort dansk aviskorpus. Lambda 27. Institut for Datalingvistik, Handelshøjskolen i København, København.

Jensen, C. and Tøndering, J., 2005. Choosing a Scale for Measuring Perceived Prominence, in Isabel Trancoso (Ed.) Proceedings of Interspeech 2005, September 4-8, Lisbon, Portugal, pp. 2385-2388.

Kohler, K.J., 2005. Timing and Communicative Functions of Pitch Contours. Phonetica 62, 88-105.

Kohler, K.J., 2006. Paradigms in Experimental Prosodic Analysis – From Measurement to Function, in Sudhoff, S., Lenertová, D., Meyer, R., Pappert, S., Augurzky, P., Mleinek, I., Richter, N., Schließer, J. (Eds.) Methods in Empirical Prosody Research. (= Language, context, and cognition, 3). de Gruyter, Berlin, New York.

Kohler, K.J., 2007. Beyond Laboratory Phonology. In Solé Sabater, M.-J., Beddor, P.S. and Ohala, M. (Eds.) Experimental approaches to Phonology. Oxford University Press, Oxford, 41-53.

Paggio, P., 2006. Annotating Information Structure in a Corpus of Spoken Danish, in: Calzolari, N., Choukri, K., Gangemi, A., Maegaard, B., Mariani, J., Odijk, J., Tapias, D. (Eds.), Proceedings from the 5th International Conference on Language Resources and Evaluation, Genova 24-26 May 2006 (cd-rom).

Silverman, K., Beckman, M., Pitrelli, J., Ostendorf, M., Wightman, C., Price, P., Pierrehumbert, J., Hirschberg, J., 1992. ToBI: A standard for Labeling English Prosody, in Proceedings of the International Conference on Spoken Language Processing, pp. 867-870.

Swerts, M., 1994. Prosodic features of discourse units. Technische Universiteit Eindhoven, Eindhoven.

Swerts, M. and Collier, R., 1992. On the controlled elicitation of spontaneous speech. Speech Communication 11, 463-468.

Terken, J.M.B., 1984. The distribution of pitch accents in instructions as a function of discourse structure. Language and Speech 27, 269-289.