DRI Discourse Structure In Dialogue
Homework 2 CGU/IU Coding Instructions
1. Retrieve the files by using ftp.
1a. Using a web browser: go to ftp://ftp.research.att.com/dist/chn/hw2
SUN users: download "hw2sun.tar.gz"
download "maptasksun.tar.gz"
SGI users: download "hw2sgi.tar.gz"
download "maptasksgi.tar.gz"
To uncompress: type "gunzip <file>.gz"
To untar: type "tar -xvf <file>.tar"
----------------------------------------------------------------
1b. Using anonymous ftp: type "ftp ftp.research.att.com"
login as anonymous
type "cd dist/chn/hw2"
type "binary"
You need to ftp two files:
SUN users, get hw2sun.tar.gz
get maptasksun.tar.gz
SGI users, get hw2sgi.tar.gz
get maptasksgi.tar.gz
[Step-by-step:
SUN users: type "get hw2sun.tar.gz"
type "get maptasksun.tar.gz"
SGI users: type "get hw2sgi.tar.gz"
type "get maptasksgi.tar.gz"
To quit ftp: type "quit"
]
To uncompress: type "gunzip <file>.gz"
To untar: type "tar -xvf <file>.tar"
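If gunzip or tar is not available on your machine, the two unpacking steps
above can be sketched in Python. This is only an illustration: the function
name unpack_archive and the destination directory are assumptions, not part
of the homework package. The standard tarfile module reads a gzip-compressed
tar file directly, so no separate uncompress step is needed.

```python
import tarfile
from pathlib import Path


def unpack_archive(archive, dest="."):
    """Extract a .tar.gz archive (gunzip + tar -xvf in one step).

    Returns the list of member names found in the archive.
    `unpack_archive` is a hypothetical helper, not part of the package.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:gz" opens a gzip-compressed tar file for reading.
    with tarfile.open(archive, mode="r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names
```

For example, unpack_archive("hw2sun.tar.gz", "hw2") would place the files in
a directory named hw2 (the directory name is an arbitrary choice).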
-------------------------------------------------------------------------
2. Overview of the files
The homework consists of doing CGU analysis on one dialogue from
VERBMOBIL (see their homepage at
http://www.dfki.uni-sb.de/verbmobil/)
(and one from MAPTASK). Thanks to Norbert Reithinger and
Mark Core for help with VERBMOBIL, Mark Core and Matthew Aylett
for Maptask.
Here are brief descriptions of the file contents, where
<dialogue> = verbmobil (or maptask in the future).
TEXT files:
instructions             an exact copy of this file you are reading
<dialogue>.utttoks       file of utterance tokens
<dialogue>.toklabs       column of token labels to edit for CGU analysis
<dialogue>.fixed.cgus    fixed CGUs for IU analysis
                         ***do not look at this until done with
                         your own CGU analysis***
SPEECH files:
<dialogue>.au            speech files from SUN package
<dialogue>.au.gz         compressed speech files - uncompress according to above
<dialogue>.sd            speech files from SGI package
-------------------------------------------------------------------------
3. Steps for doing CGU analysis:
a. Read/listen to the entire dialogue once through before starting
analysis. *BEWARE*: there is a clipping noise roughly between
speaker turns in the verbmobil audio file. Where you hear a clip, the
actual silence duration is only approximate (due to speech
processing details).
- view/print out <dialogue>.utttoks
- SUN users: use audiotool or auplay to play the speech
files.
- SGI users: use xwaves and the accompanying files by typing
xwaves verbmobil.sd
b. Assign utterance tokens to CGUs
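Edit the file, <dialogue>.toklabs, a column of utterance token labels,
to produce your CGU analysis file, <dialogue>.cgu.<initials>,
e.g. toot.cgu.chn
Each line of the CGU analysis file should be formatted as follows:
<CGU number> <token label>, <token label>, ..., <token label> "<descriptive text>"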
e.g.
14 S.5.1, U.6.1, U.6.2, U.6.3 "decide to move oranges"
*PLEASE REMEMBER* to separate CGU labels by commas, and to put the
descriptive text in quotation marks.
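Before turning in a CGU file, it may help to check each line against the
format above. The following is a minimal sketch, not an official checker:
the function name parse_cgu_line is hypothetical, and the token label
pattern (letters, turn number, token number, as in "S.5.1") is inferred
only from the example line, so adjust it if your labels look different.

```python
import re

# Assumed label shape, inferred only from the example (e.g. "S.5.1").
TOKEN_RE = re.compile(r"^[A-Za-z]+\.\d+\.\d+$")
# CGU number, comma-separated labels, then the description in quotes.
LINE_RE = re.compile(r'^\s*(\d+)\s+([^"]+?)\s*"([^"]*)"\s*$')


def parse_cgu_line(line):
    """Split one CGU line into (number, token labels, description).

    Raises ValueError if the quotation marks or commas are missing.
    """
    m = LINE_RE.match(line)
    if m is None:
        raise ValueError("expected: <number> <labels> \"<text>\": " + line)
    number = int(m.group(1))
    tokens = [t.strip() for t in m.group(2).split(",")]
    bad = [t for t in tokens if not TOKEN_RE.match(t)]
    if bad:
        # A label containing a space usually means a forgotten comma.
        raise ValueError("malformed token label(s): " + repr(bad))
    return number, tokens, m.group(3)
```

A line with a forgotten comma or missing quotation marks raises ValueError,
which catches the two mistakes the reminder above warns about.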
-------------------------------------------------------------------------
4. Steps for doing IU analysis (Verbmobil only)
You will start with the *fixed* CGU file, verbmobil.fixed.cgu,
and create the file <dialogue>.iu.<initials>.
Listen to the audio as needed, as described above.
IU analysis files should be formatted as follows:
iu.<Gorn number> "IU description"
e.g.
		iu.2.5.1 "plan to move oranges to bath"
			22 S.22.1, U.23.1 "locate oranges"
			23 U.23.2, S.24.1 "decide to move oranges to bath"
	iu.2.6 "..."
		24 S.24.2, S.24.3 "..."
		25 U.25.1, S.26.1 "..."
*PLEASE REMEMBER* to use Gorn numbering (explained in the manual)
to number each IU segment (you can easily add these at the end if
you find it tedious to do while analysing the structure). And
also remember to enclose the IU description in quotation marks.
Also, use TAB markers to indent.
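If you prefer to add the Gorn numbers at the end, the numbering and TAB
indentation can be generated from the nesting alone. The sketch below makes
several assumptions that the manual, not this example, should settle:
top-level IUs are numbered iu.1, iu.2, ...; the children of an IU are
numbered in order by appending .1, .2, ...; each nesting level is indented
by one TAB; and CGU lines keep their own numbers. The function name
render_ius and the nested-list layout are hypothetical.

```python
def render_ius(items, prefix=(), depth=0):
    """Return tab-indented lines with a Gorn number assigned to each IU.

    items: a list whose elements are either CGU line strings or
           (description, children) tuples for nested IUs. This in-memory
           layout is a hypothetical convenience, not a homework format.
    """
    lines = []
    iu_index = 0  # counts IU siblings only; CGU lines are not renumbered
    for item in items:
        if isinstance(item, str):
            lines.append("\t" * depth + item)
        else:
            description, children = item
            iu_index += 1
            address = prefix + (iu_index,)
            label = "iu." + ".".join(str(n) for n in address)
            lines.append("\t" * depth + f'{label} "{description}"')
            # Children sit one TAB deeper and extend the parent address.
            lines.extend(render_ius(children, address, depth + 1))
    return lines
```

Calling "\n".join(render_ius(tree)) gives text that can be written straight
to your IU analysis file.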
-------------------------------------------------------------------------
5. Extra data to report
a. Utterance tokenization: please make a note of any utterance tokens
that you would have liked to split up. The tokens provided for both
dialogues represent full or major intonational phrase units. If you
find multiple tokens you think should be merged, note them too.
b. Time: please time yourself separately for CGU and IU analysis,
for each dialogue, so we can estimate coding rates for these
schemes. If you forget, rough estimates will still be helpful.
c. Log: for debugging the coding scheme, it's useful to keep
a log of difficult cases you encounter, or problems using
the coding manual guidelines. It will help us organize
revisions/open issues for the workshop if you could provide
concrete examples from your coding experiences. Organize
your log as you see fit.
You can report these data in a separate file, misc.<initials>,
or simply email them to chn@research.att.com *and* traum@cs.umd.edu.
-------------------------------------------------------------------------
Files to turn in:
1. verbmobil.cgu.<initials>
2. verbmobil.iu.<initials>
3. maptask.cgu.<initials>
4. misc.<initials>
[How to email uuencoded tar files in UNIX:
> tar -cvf <initials>.hw2.tar <files to turn in>
> uuencode <initials>.hw2.tar <initials>.hw2.tar >tmp
> Mail chn@research.att.com,traum@cs.umd.edu <tmp
]