We apologize in advance that we do not have a writable ftp site
at this point (any volunteers?) and also that we don't yet
have customized tools developed to do this coding. (We are working
on that with the other subgroups for the second homework.) So
please bear with us and use the instructions below.
Send any and all questions to both chn@research.att.com and
traum@cs.umd.edu. As noted in a previous msg, do *NOT* mail questions
on coding choices for the homework dialogues to the whole group, so as
to not bias other coders. Appropriate question-answer summaries of
general value will be posted to the whole group by the co-chairs.
Happy tagging!
Christine and David
Retrieve the files by using ftp.
1a. Using a web browser: goto ftp://ftp.research.att.com/dist/chn
SUN users: goto SUN dir, download hw1sun.tar.gz,
trains.au.gz, toot.au.gz
PC users: goto PC dir, download hw1pc.tar.gz,
trains.wav.gz, toot.wav.gz
SGI users: goto SGI dir, download hw1sgi.tar.gz,
trains.sd.gz, toot.sd.gz
To uncompress: type "gunzip <filename>.gz"
To untar: type "tar -xvf <filename>.tar"
1b. Using anonymous ftp: type "ftp ftp.research.att.com"
login as anonymous
type "cd dist/chn"
type "binary"
You need to ftp three files:
SUN users, goto SUN dir, get: hw1sun.tar.gz, trains.au.gz, toot.au.gz
SGI users, goto SGI dir, get: hw1sgi.tar.gz, trains.sd.gz, toot.sd.gz
PC users, goto PC dir, get: hw1pc.tar.gz, trains.wav.gz, toot.wav.gz
[Step-by-step:
SUN users: type "cd SUN"
type "get hw1sun.tar.gz"
type "get trains.au.gz"
type "get toot.au.gz"
SGI users: type "cd SGI"
type "get hw1sgi.tar.gz"
type "get trains.sd.gz"
type "get toot.sd.gz"
PC users: type "cd PC"
type "get hw1pc.tar.gz"
type "get trains.wav.gz"
type "get toot.wav.gz"
To quit ftp: type "quit"
]
To uncompress: type "gunzip <filename>.gz"
To untar: type "tar -xvf <filename>.tar"
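If gunzip or tar is not available on your machine (PC users in particular), the two steps above can be approximated with a short Python script. This is only a sketch: the function name unpack and its dest_dir parameter are our own invention for illustration, and unlike gunzip it leaves the original .gz file in place.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def unpack(gz_path, dest_dir="."):
    """Decompress a .gz file; if the result is a .tar archive, extract it too.

    Returns the list of file names produced.
    """
    gz_path = Path(gz_path)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    # Rough equivalent of "gunzip <filename>.gz": drop the .gz suffix.
    # (Unlike gunzip, the compressed file is not deleted.)
    out_path = dest_dir / gz_path.with_suffix("").name
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

    # Rough equivalent of "tar -xvf <filename>.tar" for the homework archive.
    if out_path.suffix == ".tar":
        with tarfile.open(out_path) as tar:
            names = tar.getnames()
            tar.extractall(dest_dir)
        return names

    # Speech files (.au, .wav, .sd) are plain gzipped files, not archives.
    return [out_path.name]


# Example calls (SUN package; run these from the download directory):
#   unpack("hw1sun.tar.gz")
#   unpack("trains.au.gz")
#   unpack("toot.au.gz")
```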
Overview of the files
The homework consists of doing CGU analysis on two dialogues, one from
TRAINS-93 (from the TRAINS project at the University of Rochester),
and one from TOOT (courtesy of Diane Litman, AT&T Labs). Thanks to
Peter Heeman and Mark Core for help with TRAINS, and Diane for help
with TOOT.
There are two sets of files, trains files and toot files. Both
dialogues are to be coded the same way.
Here are brief descriptions of the file contents, where
<dialogue> is trains or toot:
TEXT files:
  INSTRUCTIONS        a plaintext copy of this file you are reading
  <dialogue>.toklabs  column of token labels to edit for CGU analysis
  <dialogue>.times    transcript with time indices into speech file
  <dialogue>.pretty   transcript *without* time indices into speech file
SPEECH files:
  <dialogue>.au       speech files from SUN package
  <dialogue>.wav      speech files from PC package
  <dialogue>.sd       speech files from SGI package
  <dialogue>.words    xwaves label files from SGI package
  listen              xwaves script from SGI package
                      for playing audio with aligned words
Steps for doing CGU analysis:
- Read/listen to the entire dialogue once through before starting
analysis. *BEWARE*: the sound levels for the trains file are much
higher than for the toot file, so adjust your volume setting
accordingly. Various parts of both recordings may be difficult
to understand; that's why you have the transcriptions ;)
- view/print out <dialogue>.times or <dialogue>.pretty
[<dialogue>.pretty is easier to read if you don't need time indices].
Please note: overlapping speech is demarcated by special characters,
not simply formatting. The "interrupted" speaker's speech is
marked off by the symbol "=". The "interrupting" speaker's speech
is marked off by the symbol "+", as usual.
e.g.
S.11.13 or =say re=peat to hear...
U.12.1 +relax+
- SUN users: use audiotool or auplay to play the trains.au/toot.au
files. The time indices in <dialogue>.times are meant to help you
keep track of where you are in the audio/transcript. (If you do not
have either audio player tool, please email chn@research.att.com
immediately so you can be supplied with one).
- PC users: play the .wav files. Use <dialogue>.times to help you
align the audio with the transcript.
- SGI users: use xwaves and the accompanying files by typing
> listen toot
or > listen trains
On the default settings, hitting the right button in the
<dialogue>.words label tier should play a single utterance
token at a time. To begin, however, simply select Play entire
file from the waveform menu (hit right button in speechform window).
Hit CONTINUE in the control panel to end the script.
- Assign utterance tokens to CGUs
Edit the file, <dialogue>.toklabs, a column of utterance token labels,
to produce your CGU analysis file, <dialogue>.cgu.<your_user_login>,
e.g. toot.cgu.chn
The CGU analysis file should be formatted as follows:
<CGU number> <token label>, <token label>, ..., <token label> "<CGU description>"
e.g.
14 S.5.1, U.6.1, U.6.2, U.6.3 "decide to move oranges"
PLEASE REMEMBER to separate CGU labels by commas, and to put the
descriptive text in quotation marks.
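Before sending in your CGU file, you may want to check each line against the format mechanically. The following is a minimal sketch, not an official tool: the function name parse_cgu_line is hypothetical, and the regular expression assumes that every token label looks like the ones in the examples in this file (capital letters, then two dot-separated numbers, e.g. S.5.1 or U.6.1).

```python
import re

# ASSUMPTION: token labels have the shape seen in this file, e.g. S.5.1, U.6.1
TOKEN = r"[A-Z]+\.\d+\.\d+"

# <CGU number> <token>, <token>, ... "<description>"
CGU_LINE = re.compile(
    r'^\s*(?P<num>\d+)\s+'
    r'(?P<tokens>' + TOKEN + r'(?:\s*,\s*' + TOKEN + r')*)\s+'
    r'"(?P<desc>[^"]*)"\s*$'
)


def parse_cgu_line(line):
    """Return (cgu_number, [token labels], description), or None if malformed."""
    match = CGU_LINE.match(line)
    if match is None:
        return None
    tokens = [t.strip() for t in match.group("tokens").split(",")]
    return int(match.group("num")), tokens, match.group("desc")


print(parse_cgu_line('14 S.5.1, U.6.1, U.6.2, U.6.3 "decide to move oranges"'))
# -> (14, ['S.5.1', 'U.6.1', 'U.6.2', 'U.6.3'], 'decide to move oranges')

# A missing comma or missing quotation marks makes the line malformed:
print(parse_cgu_line('14 S.5.1 U.6.1 "decide to move oranges"'))  # -> None
```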
Steps for doing IU analysis:
You will start with your CGU file as input, and create
the file, <dialogue>.iu.<your_user_login>.
Listen to the audio as necessary as described above.
IU analysis files should be formatted as follows:
iu.<Gorn number> "IU description"
e.g.
iu.2.5.1 "plan to move oranges to bath"
22 S.22.1, U.23.1 "locate oranges"
23 U.23.2, S.24.1 "decide to move oranges to bath"
iu.2.6 "..."
24 S.24.2, S.24.3 "..."
25 U.25.1, S.26.1 "..."
PLEASE REMEMBER to use Gorn numbering (explained in the manual)
to number each IU segment (you can easily add these at the end if
you find it tedious to do while analysing the structure). And
also remember to enclose the IU description in quotation marks.
Also, use TAB markers to indent.
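To double-check an IU file, it can help to list which CGUs fall under which IU segment. The sketch below is an illustration only, under stated assumptions: the function name group_cgus_by_iu is hypothetical, IU lines are assumed to start with "iu." followed by dot-separated numbers, CGU lines are assumed to start with their CGU number, and leading TABs are simply ignored. It does not check that the Gorn numbers themselves follow the manual.

```python
import re

# ASSUMPTION: an IU line looks like: iu.2.5.1 "plan to move oranges to bath"
IU_HEADER = re.compile(r'^\s*iu\.(?P<addr>\d+(?:\.\d+)*)\s+"(?P<desc>[^"]*)"\s*$')
# ASSUMPTION: a CGU line starts with its CGU number (after optional TABs)
CGU_NUMBER = re.compile(r'^\s*(?P<num>\d+)\s')


def group_cgus_by_iu(lines):
    """Return a list of (iu_number, description, [CGU numbers]) in file order."""
    segments = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        header = IU_HEADER.match(line)
        if header:
            addr = tuple(int(p) for p in header.group("addr").split("."))
            segments.append((addr, header.group("desc"), []))
            continue
        cgu = CGU_NUMBER.match(line)
        if cgu and segments:
            # Attach the CGU to the most recent IU line above it.
            segments[-1][2].append(int(cgu.group("num")))
        else:
            raise ValueError(f"unrecognized line: {line!r}")
    return segments


sample = [
    'iu.2.5.1 "plan to move oranges to bath"',
    '\t22 S.22.1, U.23.1 "locate oranges"',
    '\t23 U.23.2, S.24.1 "decide to move oranges to bath"',
    'iu.2.6 "..."',
    '\t24 S.24.2, S.24.3 "..."',
]
for iu_number, description, cgus in group_cgus_by_iu(sample):
    print(iu_number, description, cgus)
```

A CGU line that appears before any IU line, or an IU description without quotation marks, raises ValueError, which is usually a sign that the file needs another look.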
Extra data to report
- Utterance tokenization: please make a note of any utterance tokens
that you would have liked to split up. The tokens provided for both
dialogues represent full or major intonational phrase units. If you
find multiple tokens you think should be merged, note them too.
- Time: please time yourself separately for CGU and IU analysis,
for each dialogue, so we can estimate coding rates for these
schemes. If you forget, rough estimates will still be helpful.
- Log: for debugging the coding scheme, it's useful to keep
a log of difficult cases you encounter, or problems using
the coding manual guidelines. It will help us organize
revisions/open issues for the workshop if you could provide
concrete examples from your coding experiences. Organize
your log as you see fit.
You can report these data in a separate file, misc.<your_user_login>,
or simply email them to chn@research.att.com.
Last updated on March 9th 1998