DRI Discourse Structure In Dialogue
Homework 1 Instructions



General Instructions

Please do the CGU and IU analyses for the TRAINS and TOOT homework dialogues as outlined below. The codings are due March 30, 1998. Send the five files listed below to chn@research.att.com, either tarred and uuencoded together or emailed separately as each is completed.

Files to turn in:

  1. toot.cgu.<coder's_login>
  2. toot.iu.<coder's_login>
  3. trains.cgu.<coder's_login>
  4. trains.iu.<coder's_login>
  5. misc.<coder's_login>

[How to email uuencoded tar files in UNIX:

	> tar -cvf codings.tar <file1> <file2> .... <file5>
	> uuencode codings.tar codings.<coder's_login> >tmp
	> Mail chn@research.att.com <tmp
]
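Before mailing, it may help to confirm that all five files exist under the expected names. The following is a minimal sketch, not part of the official instructions: the helper names `expected_files` and `missing_files` are hypothetical, and the naming pattern is taken only from the file list above.

```python
from pathlib import Path

# Dialogue and scheme names as they appear in the list of files to turn in.
DIALOGUES = ("toot", "trains")
SCHEMES = ("cgu", "iu")


def expected_files(login):
    """Return the five file names to turn in for the given coder login."""
    names = [f"{d}.{s}.{login}" for d in DIALOGUES for s in SCHEMES]
    names.append(f"misc.{login}")
    return names


def missing_files(login, directory="."):
    """Return the expected file names that are not present in `directory`."""
    base = Path(directory)
    return [n for n in expected_files(login) if not (base / n).is_file()]


if __name__ == "__main__":
    # "chn" is the example login used elsewhere in these instructions.
    print(missing_files("chn"))
```

An empty list means every expected file was found in the directory that was checked.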

We apologize in advance that we do not have a writable ftp site at this point (any volunteers?) and also that we don't yet have customized tools developed to do this coding. (We are working on that with the other subgroups for the second homework.) So please bear with us and use the instructions below.

Send any and all questions to both chn@research.att.com and traum@cs.umd.edu. As noted in a previous message, do *NOT* mail questions on coding choices for the homework dialogues to the whole group, so as not to bias other coders. Appropriate question-answer summaries of general value will be posted to the whole group by the co-chairs.

Happy tagging!

Christine and David


Retrieve the files by using ftp.

1a. Using a web browser: go to ftp://ftp.research.att.com/dist/chn

	SUN users:	go to SUN dir, download hw1sun.tar.gz,
			trains.au.gz, toot.au.gz

	PC  users:	go to PC dir, download hw1pc.tar.gz,
			trains.wav.gz, toot.wav.gz

	SGI users:	go to SGI dir, download hw1sgi.tar.gz,
			trains.sd.gz, toot.sd.gz

    To uncompress: 	type "gunzip <file>.gz"
    To untar: 		type "tar -xvf <file>.tar"

1b. Using anonymous ftp:

	type "ftp ftp.research.att.com"
	login as anonymous
	type "cd dist/chn"
	type "binary"

    You need to ftp three files:

	SUN users:	go to SUN dir, get: hw1sun.tar.gz, trains.au.gz, toot.au.gz
	SGI users:	go to SGI dir, get: hw1sgi.tar.gz, trains.sd.gz, toot.sd.gz
	PC  users:	go to PC dir, get: hw1pc.tar.gz, trains.wav.gz, toot.wav.gz

    [Step-by-step:

	SUN users:	type "cd SUN"
			type "get hw1sun.tar.gz"
			type "get trains.au.gz"
			type "get toot.au.gz"

	SGI users:	type "cd SGI"
			type "get hw1sgi.tar.gz"
			type "get trains.sd.gz"
			type "get toot.sd.gz"

	PC  users:	type "cd PC"
			type "get hw1pc.tar.gz"
			type "get trains.wav.gz"
			type "get toot.wav.gz"

	To quit ftp:	type "quit"
    ]

    To uncompress: 	type "gunzip <file>.gz"
    To untar: 		type "tar -xvf <file>.tar"
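If gunzip or tar is not available on your machine, the same two steps can be done with Python's standard library. This is only a sketch under stated assumptions: the function names `gunzip` and `untar` are hypothetical helpers named after the commands they imitate, and unlike the real gunzip this version leaves the original .gz file in place.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def gunzip(path):
    """Decompress <name>.gz to <name> in the same directory; return the new path."""
    src = Path(path)
    if src.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {src.name}")
    dst = src.with_suffix("")  # e.g. hw1sun.tar.gz -> hw1sun.tar
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst


def untar(path, dest="."):
    """Extract a .tar archive into `dest`; return the list of member names."""
    with tarfile.open(path, "r") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names
```

For the SUN package, for example, `untar(gunzip("hw1sun.tar.gz"))` would unpack the text files, and `gunzip("trains.au.gz")` would decompress one of the speech files.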

Overview of the files

The homework consists of doing CGU and IU analyses on two dialogues: one from TRAINS-93 (from the TRAINS project at the University of Rochester) and one from TOOT (courtesy of Diane Litman, AT&T Labs). Thanks to Peter Heeman and Mark Core for help with TRAINS, and Diane for help with TOOT.

There are two sets of files, trains files and toot files. Both dialogues are to be coded the same way. Here are brief descriptions of the file contents, where <dialogue> is trains or toot:

TEXT files:

	INSTRUCTIONS		a plaintext copy of this file you are reading
	<dialogue>.toklabs	column of token labels to edit for CGU analysis
	<dialogue>.times	transcript with time indices into speech file
	<dialogue>.pretty	transcript *without* time indices into speech file

SPEECH files:

	<dialogue>.au		speech files from SUN package
	<dialogue>.wav		speech files from PC package
	<dialogue>.sd		speech files from SGI package
	<dialogue>.words	xwaves label files from SGI package
	listen			xwaves script from SGI package for playing
				audio with aligned words

Steps for doing CGU analysis:

  1. Read/listen to the entire dialogue once through before starting analysis. BEWARE: the sound levels for the trains file are much higher than for the toot file, so adjust your volume setting accordingly. Various parts of both recordings may be difficult to understand; that's why you have the transcriptions ;)

    1. view/print out .times or .pretty [.pretty is easier to read if you don't need time indices].

      Please note: overlapping speech is demarcated by special characters, not simply formatting. The "interrupted" speaker's speech is marked off by the symbol "=". The "interrupting" speaker's speech is marked off by the symbol "+", as usual, e.g.

	S.11.13		or =say re=peat to hear...
	U.12.1		+relax+
    2. SUN users: use audiotool or auplay to play the trains.au/toot.au files. The time indices in .times are meant to help you keep track of where you are in the audio/transcript. (If you do not have either audio player tool, please email chn@research.att.com immediately so you can be supplied with one).
    3. PC users: play the .wav files. Use .times to help you align the audio with the transcript.
    4. SGI users: use xwaves and the accompanying files by typing
	> listen toot
       or
	> listen trains
       On the default settings, hitting the right button in the <dialogue>.words label tier should play a single utterance token at a time. To begin, however, simply select "Play entire file" from the waveform menu (hit the right button in the speechform window). Hit CONTINUE in the control panel to end the script.
  2. Assign utterance tokens to CGUs
    Edit the file, <dialogue>.toklabs, a column of utterance token labels, to produce your CGU analysis file, <dialogue>.cgu.<your_user_login>, e.g. toot.cgu.chn

    The CGU analysis file should be formatted as follows:

	<token label> <tok1>, <tok2>, ..., <tokn> "<brief CGU description>"

    e.g.

	14 S.5.1, U.6.1, U.6.2, U.6.3 "decide to move oranges"

    PLEASE REMEMBER to separate CGU labels by commas, and to put the descriptive text in quotation marks.
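A quick way to catch missing commas or quotation marks before submitting is to run each line of your CGU file through a small checker. The following is a minimal sketch, not an official tool: the function name `parse_cgu_line` is hypothetical, and both regular expressions are assumptions inferred from the single example line (a leading label, comma-separated tokens that look like S.5.1 or U.6.1, and a quoted description).

```python
import re

# Assumed line shape, inferred from the example only:
#   <label> <tok1>, <tok2>, ..., <tokn> "<brief CGU description>"
CGU_LINE = re.compile(r'^\s*(\S+)\s+(.*?)\s*"([^"]*)"\s*$')
# Assumed token shape: speaker letters, turn number, token number (e.g. S.5.1).
TOKEN = re.compile(r'^[A-Za-z]+\.\d+\.\d+$')


def parse_cgu_line(line):
    """Split one CGU line into (label, [tokens], description) or raise ValueError."""
    m = CGU_LINE.match(line)
    if m is None:
        raise ValueError(f"missing label or quoted description: {line!r}")
    label, token_field, description = m.groups()
    tokens = [t.strip() for t in token_field.split(",")]
    bad = [t for t in tokens if not TOKEN.match(t)]
    if bad:
        # A forgotten comma shows up here as one oversized "token".
        raise ValueError(f"unrecognized token label(s) {bad} in: {line!r}")
    return label, tokens, description


if __name__ == "__main__":
    print(parse_cgu_line('14 S.5.1, U.6.1, U.6.2, U.6.3 "decide to move oranges"'))
```

If your token labels legitimately differ from the assumed pattern, loosen `TOKEN` rather than editing your coding to fit the checker.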

Steps for doing IU analysis:

You will start with your CGU file as input, and create the file, <dialogue>.iu.<your_user_login>. Listen to the audio as necessary as described above.

IU analysis files should be formatted as follows:

	iu.<n> "IU description"
		<cgu no>
		<cgu no>
		<cgu no>
		<cgu no>

e.g.

		iu.2.5.1 "plan to move oranges to bath"
			22 S.22.1, U.23.1 "locate oranges"
			23 U.23.2, S.24.1 "decide to move oranges to bath"
	iu.2.6 "..."
		24 S.24.2, S.24.3 "..."
		25 U.25.1, S.26.1 "..."

PLEASE REMEMBER to use Gorn numbering (explained in the manual) to number each IU segment (you can easily add these at the end if you find it tedious to do while analysing the structure). Also remember to enclose the IU description in quotation marks, and use TAB markers to indent.
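The IU format can be sanity-checked in the same spirit. The sketch below is an illustration under stated assumptions, not part of the coding manual: `parse_iu_text` and `gorn_parent` are hypothetical names, IU headers are assumed to look like iu.2.5.1 "description", and each CGU line is simply attached to the most recent IU header, ignoring indentation depth.

```python
import re

# Assumed header shape: iu.<Gorn address> "<IU description>"
IU_LINE = re.compile(r'^\s*iu\.(\d+(?:\.\d+)*)\s+"([^"]*)"\s*$')
# Assumed CGU line shape: starts with the CGU number (rest of the line is ignored).
CGU_REF = re.compile(r'^\s*(\d+)\b')


def parse_iu_text(text):
    """Return a list of (gorn_address, description, cgu_numbers) tuples."""
    ius = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        m = IU_LINE.match(line)
        if m:
            address = tuple(int(part) for part in m.group(1).split("."))
            ius.append((address, m.group(2), []))
            continue
        m = CGU_REF.match(line)
        if m and ius:
            # Simplification: attach to the latest IU header seen.
            ius[-1][2].append(int(m.group(1)))
            continue
        raise ValueError(f"line {lineno}: not an IU header or CGU line: {line!r}")
    return ius


def gorn_parent(address):
    """Address of the enclosing IU, or None at the top level: (2, 5, 1) -> (2, 5)."""
    return address[:-1] or None
```

The address tuples make it easy to check nesting afterwards, for example that every IU whose `gorn_parent` is not None has a parent somewhere in the file.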

Extra data to report

  1. Utterance tokenization: please make a note of any utterance tokens that you would have liked to split up. The tokens provided for both dialogues represent full or major intonational phrase units. If you find multiple tokens you think should be merged, note them too.
  2. Time: please time yourself separately for CGU and IU analysis, for each dialogue, so we can estimate coding rates for these schemes. If you forget, rough estimates will still be helpful.
  3. Log: for debugging the coding scheme, it's useful to keep a log of difficult cases you encounter, or problems using the coding manual guidelines. It will help us organize revisions/open issues for the workshop if you could provide concrete examples from your coding experiences. Organize your log as you see fit.

You can report these data in a separate file, misc.<coder's_login>, or simply email them to chn@research.att.com.


Last updated on March 9th 1998