Weist-Jarosz Corpus

Richard Weist
Department of Psychology
SUNY Fredonia


Gaja Jarosz
Department of Linguistics
UMass Amherst


Participants: 4
Type of Study: naturalistic, longitudinal
Location: Poland
Media type: audio
DOI: doi:10.21415/T51974

In accordance with TalkBank rules, any use of data from this corpus must be accompanied by at least one of the above references. For this corpus, please cite one reference from Weist and one from Jarosz.

Project Description

Participant Name    Age Range    Sessions        Sex
Marta               1;7-1;10     6 (3 audio)     F
Wawrzon             2;2-3;2      20 (19 audio)   M

All of the children were from middle-class families and were raised in the urban environment of Poznań, Poland. In general, their parents were highly educated. The children were recorded in their homes (typically an apartment) by two experimenters. One experimenter carried a small bag containing the tape recorder, and the other took context notes, which were integrated during transcription.

Phonetic Transcription Description

The children’s productions were transcribed using broad phonetic transcription with the help of the open-source Phon software (Rose et al. 2006). The orthographic transcripts were used as the basis for creating phonetic transcriptions of the children’s target pronunciations, and the audio recordings were used to phonetically transcribe the children’s actual productions and align them with the target transcriptions word by word. The transcription of all child productions was first performed independently by two transcribers trained in phonetic transcription, at least one of whom was a native speaker of Polish. Then, two Polish speakers trained in phonetic transcription worked together to create a consensus transcription of all productions, relying on a third phonetically trained native speaker of Polish to adjudicate in cases when agreement could not be reached. The resulting corpus includes phonetic transcriptions of the children’s productions in all the available audio files, providing word-by-word alignment of target pronunciations and actual pronunciations.

Transcription Conventions

Boundaries: We use word groups to delineate phonological word boundaries. In all cases except one, orthographic word boundaries correspond to phonological word boundaries. The only exception is the pair of proclitics 'w' [v]/[f] and 'z' [z]/[s], which attach to the following word and cannot be pronounced independently. In this case, the orthography tier encodes the orthographic word boundaries, putting the proclitic in its own word group, while the IPA Target and IPA Actual tiers leave that word group empty and encode the proclitic together with the next word. For example, '[z][kotem]' would be '[][skotɛm]' on the Target tier and potentially something like '[][sotɛm]' on the Actual tier.
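
The proclitic convention can be illustrated with a minimal Python sketch. It assumes the bracketed notation used in the example above (one pair of square brackets per word group) as a plain string. This is only an illustrative notation, not the file format used by Phon, and the function names parse_groups and check_proclitic_alignment are hypothetical helpers, not part of any corpus tooling.

```python
import re

# Orthographic proclitics named in the convention above.
PROCLITICS = {"w", "z"}


def parse_groups(tier: str) -> list[str]:
    """Split a bracketed tier string such as '[z][kotem]' into word groups."""
    return re.findall(r"\[([^\[\]]*)\]", tier)


def check_proclitic_alignment(orthography: str, ipa: str) -> bool:
    """Return True if an IPA tier follows the proclitic convention.

    Both tiers must have the same number of word groups. A proclitic on
    the orthography tier must correspond to an empty IPA group that is
    followed by a non-empty group (which carries the proclitic). Every
    other orthographic group must have a non-empty IPA group.
    """
    ortho = parse_groups(orthography)
    phon = parse_groups(ipa)
    if len(ortho) != len(phon):
        return False
    for i, (word, segs) in enumerate(zip(ortho, phon)):
        if word.lower() in PROCLITICS:
            has_host = i + 1 < len(phon) and phon[i + 1] != ""
            if segs != "" or not has_host:
                return False
        elif segs == "":
            return False
    return True


print(parse_groups("[][skotɛm]"))
print(check_proclitic_alignment("[z][kotem]", "[][skotɛm]"))
print(check_proclitic_alignment("[z][kotem]", "[z][kotɛm]"))
```

In this sketch the Target and Actual tiers are checked the same way, since both follow the same grouping rule; the check says nothing about whether the segments themselves are transcribed correctly.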

Tier Conventions: We have maintained many of the conventions from the original CHAT transcripts and introduced several codes to denote special situations regarding phonetic transcription.

The following codes were used on the orthography tier:

IPA Conventions