DRI Discourse Structure In Dialogue
Homework 2 CGU/IU Coding Instructions


1. Retrieve the files by using ftp.

1a. Using a web browser: go to ftp://ftp.research.att.com/dist/chn/hw2

        SUN users:      download "hw2sun.tar.gz"
                        download "maptasksun.tar.gz"
        SGI users:      download "hw2sgi.tar.gz"
                        download "maptasksgi.tar.gz"

    To uncompress:      type "gunzip <filename>.gz"
    To untar:           type "tar -xvf <filename>.tar"
   ----------------------------------------------------------------
1b. Using anonymous ftp: type "ftp ftp.research.att.com"
			login as anonymous
			type "cd dist/chn/hw2"
			type "binary"

You need to ftp two files:
	SUN users, get hw2sun.tar.gz
                   get maptasksun.tar.gz
	SGI users, get hw2sgi.tar.gz
                   get maptasksgi.tar.gz

[Step-by-step:
        SUN users:      type "get hw2sun.tar.gz"
                        type "get maptasksun.tar.gz"
        SGI users:      type "get hw2sgi.tar.gz"
                        type "get maptasksgi.tar.gz"
        To quit ftp:    type "quit" ]

    To uncompress:      type "gunzip <filename>.gz"
    To untar:           type "tar -xvf <filename>.tar"
-------------------------------------------------------------------------
2. Overview of the files

The homework consists of doing CGU analysis on one dialogue from
VERBMOBIL (see their homepage at http://www.dfki.uni-sb.de/verbmobil/),
and one from MAPTASK.  Thanks to Norbert Reithinger and Mark Core for
help with VERBMOBIL, and to Mark Core and Matthew Aylett for help with
MAPTASK.

Here are brief descriptions of the file contents, where
dialogue=verbmobil (or maptask in the future).

TEXT files:
        instructions            an exact copy of this file you are reading
        <dialogue>.utttoks      file of utterance tokens
        <dialogue>.toklabs      column of token labels to edit for CGU
                                analysis
        <dialogue>.fixed.cgus   fixed CGUs for IU analysis
                                [***do not look at this until done with
                                your own CGU analysis***]

SPEECH files:
        <dialogue>.au           speech files from SUN package
        <dialogue>.au.gz        compressed speech files - uncompress
                                according to above
        <dialogue>.sd           speech files from SGI package
-------------------------------------------------------------------------
3. Steps for doing CGU analysis:

a. Read/listen to the entire dialogue once through before starting
   analysis.

   *BEWARE*: there is a clipping noise roughly between speaker turns in
   the verbmobil audio file.  Where you hear a clip, the actual silence
   duration is only approximate (due to speech processing details).

        - view/print out <dialogue>.utttoks
        - SUN users: use audiotool or auplay to play the speech files.
        - SGI users: use xwaves and the accompanying files by typing
                xwaves verbmobil.sd

b. Assign utterance tokens to CGUs

   Edit the file, <dialogue>.toklabs, a column of utterance token labels,
   to produce your CGU analysis file, <dialogue>.cgu.<your_user_login>,
   e.g. toot.cgu.chn

   The CGU analysis file should be formatted as follows:

        <token label> <tok1>, <tok2>, ..., <tokn> "<brief CGU description>"

   e.g.
        14 S.5.1, U.6.1, U.6.2, U.6.3 "decide to move oranges"

   *PLEASE REMEMBER* to separate CGU labels by commas, and to put the
   descriptive text in quotation marks.
-------------------------------------------------------------------------
4. Steps for doing IU analysis (Verbmobil only)

You will start with the *fixed* CGU file, verbmobil.fixed.cgu, and
create the file, <dialogue>.iu.<your_user_login>.  Listen to the audio
as necessary as described above.

IU analysis files should be formatted as follows:

iu.<n> "IU description"
	<cgu no>
	<cgu no>
	<cgu no>
	<cgu no>

e.g.
iu.2.5.1 "plan to move oranges to bath"
	22 S.22.1, U.23.1 "locate oranges"
	23 U.23.2, S.24.1 "decide to move oranges to bath"
iu.2.6 "..."
	24 S.24.2, S.24.3 "..."
	25 U.25.1, S.26.1 "..."

*PLEASE REMEMBER* to use Gorn numbering (explained in the manual) to
number each IU segment (you can easily add these at the end if you find
it tedious to do while analysing the structure).  Also remember to
enclose the IU description in quotation marks, and use TAB markers to
indent.
-------------------------------------------------------------------------
5. Extra data to report

a. Utterance tokenization: please make a note of any utterance tokens
   that you would have liked to split up.  The tokens provided for both
   dialogues represent full or major intonational phrase units.  If you
   find multiple tokens you think should be merged, note them too.

b. Time: please time yourself separately for CGU and IU analysis, for
   each dialogue, so we can estimate coding rates for these schemes.
   If you forget, rough estimates will still be helpful.

c. Log: for debugging the coding scheme, it's useful to keep a log of
   difficult cases you encounter, or problems using the coding manual
   guidelines.  It will help us organize revisions/open issues for the
   workshop if you could provide concrete examples from your coding
   experiences.  Organize your log as you see fit.

You can report these data in a separate file, misc.<user_login>, or
simply email them to chn@research.att.com *and* traum@cs.umd.edu.
-------------------------------------------------------------------------
Files to turn in:

        1. verbmobil.cgu.<coder's_login>
        2. verbmobil.iu.<coder's_login>
        3. maptask.cgu.<coder's_login>
        4. misc.<coder's_login>

[How to email uuencoded tar files in UNIX:

        > tar -cvf <login>.hw2.tar <file1> <file2> <file3> <file4>
        > uuencode <login>.hw2.tar <login>.hw2.tar >tmp
        > Mail chn@research.att.com,traum@cs.umd.edu <tmp ]
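
Before turning in a CGU file, it may help to check each line against the
format given in step 3b.  The following is only a minimal sketch, not
part of the homework package.  The function name parse_cgu_line is
hypothetical, and the token shape (a letter, a turn number and a token
number, as in S.5.1) is an assumption inferred from the example line
rather than from the coding manual.

```python
import re

# Assumed shape of a CGU line, inferred from the example in step 3b:
#   <number> <tok1>, <tok2>, ..., <tokn> "<brief CGU description>"
CGU_LINE = re.compile(r'^(\d+)\s+(.+?)\s+"([^"]*)"\s*$')

# Assumed token shape (e.g. S.5.1): one letter, then two dot-separated numbers.
TOKEN = re.compile(r'^[A-Za-z]\.\d+\.\d+$')


def parse_cgu_line(line):
    """Return (number, tokens, description) or raise ValueError."""
    match = CGU_LINE.match(line.strip())
    if match is None:
        raise ValueError('expected: <number> <tok1>, <tok2>, ... "<description>"')
    number, token_field, description = match.groups()
    # Tokens must be separated by commas; surrounding spaces are ignored.
    tokens = [tok.strip() for tok in token_field.split(",")]
    bad = [tok for tok in tokens if not TOKEN.match(tok)]
    if bad:
        raise ValueError("not comma-separated token labels: %r" % (bad,))
    return int(number), tokens, description


if __name__ == "__main__":
    print(parse_cgu_line('14 S.5.1, U.6.1, U.6.2, U.6.3 "decide to move oranges"'))
```

A line with missing commas between tokens, or with an unquoted
description, raises ValueError, which covers the two points flagged
under *PLEASE REMEMBER* in step 3b.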