I computed some numbers on the CGU coding - the easy thing was directly comparing the coded units. In the files, each CGU is annotated with the coders who identified that unit (along with the identifier each assigned to it). Computing reliability from these is not so easy, however, since it's non-trivial to define a sample set or expected agreement. I did note down two numbers - the average ratio of coders who identified each unit, and, for each coder, the average ratio of other coders who also identified the units that coder marked. These numbers are not particularly meaningful, however, because the former over-penalizes when, e.g., only one person picked out a unit, while the latter doesn't penalize coders for leaving out popular units.

I couldn't easily adopt the standard boundary-point markings for CGUs, since CGUs can be both overlapping and discontinuous. I did figure out one way to compute Kappa, however, by reference to implicit "pseudo-grounding acts". What I did was consider whether an utterance token begins, continues, or completes a CGU. This is motivated by the observation that, while a token can appear in multiple CGUs, it doesn't generally perform the same function in each of them. This is not explicitly ruled out, but it does seem to be the case, perhaps with one or two exceptions. So I scored each token as to whether or not it appeared (1) as the first token in a CGU, (2) as the last token in a CGU, and/or (3) in a CGU in neither the first nor the last position.

As an example, if someone coded a dialogue consisting of the following utterance tokens:

    1.1  1.2  2.1  2.2  2.3  3.1

into the following CGUs:

    1   1.1, 1.2, 2.1
    2   2.1, 2.3, 3.1

then the following "acts" would be assigned:

            begin  middle  end
    1.1       1      0      0
    1.2       0      1      0
    2.1       1      0      1
    2.2       0      0      0
    2.3       0      1      0
    3.1       0      0      1

I think this scheme is sufficient to count identical CGUs as full agreements and to assess penalties for all codings that differ, though I'm not sure the weighting of penalties is optimal (e.g., leaving out a "middle" counts as only one point of disagreement, but leaving out an "end" counts as two, since the next-to-last token then gets counted as an "end" rather than a "middle"). From this, I was able to compute agreement and expected agreement (by examining the relative frequencies of these tags), and thus Kappa. The numbers are not too bad for the whole group as a first-time exercise, and some of the pairwise numbers (e.g., Heeman and Traum) are above 0.8.
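
For concreteness, here is a rough Python sketch of the two ratio numbers mentioned at the start. It assumes (my assumption, not the actual file format) that the coding has been read into a dict mapping each unit to the set of coders who identified it:

    def coverage_ratios(units, coders):
        """units: dict mapping unit id -> set of coders who identified it.
        coders: the full set of coders (assumed to have at least two members)."""
        # Average, over units, of the fraction of coders who identified the unit.
        per_unit = sum(len(c) / len(coders) for c in units.values()) / len(units)
        # For each coder, the average fraction of *other* coders who also
        # identified the units this coder marked; then average over coders.
        per_coder = []
        for coder in coders:
            mine = [c for c in units.values() if coder in c]
            if mine:
                per_coder.append(
                    sum((len(c) - 1) / (len(coders) - 1) for c in mine) / len(mine))
        return per_unit, sum(per_coder) / len(per_coder)

    # Hypothetical toy example: three coders, three units.
    units = {"u1": {"A", "B", "C"}, "u2": {"A"}, "u3": {"B", "C"}}
    print(coverage_ratios(units, {"A", "B", "C"}))

As noted above, the first number drops whenever anyone marks an idiosyncratic unit, and the second never penalizes a coder for omitting a unit everyone else found.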
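
And here is a sketch of the begin/middle/end scoring and the resulting Kappa. The data layout is again my own assumption (each coder's coding as a list of CGUs, each CGU an ordered list of utterance-token IDs), and I've treated each token's (begin, middle, end) triple as a single categorical tag, which matches how the penalties are counted above (dropping a "middle" changes one token's tag, dropping an "end" changes two); the actual computation may have framed the categories somewhat differently:

    def acts(tokens, cgus):
        """Map each token to [begin, middle, end] flags for one coder."""
        table = {t: [0, 0, 0] for t in tokens}
        for cgu in cgus:
            table[cgu[0]][0] = 1       # first token in the CGU
            table[cgu[-1]][2] = 1      # last token in the CGU
            for t in cgu[1:-1]:
                table[t][1] = 1        # neither first nor last
        return table

    def kappa(tokens, coding_a, coding_b):
        """Pairwise Kappa, treating each token's flag triple as one tag."""
        a = {t: tuple(v) for t, v in acts(tokens, coding_a).items()}
        b = {t: tuple(v) for t, v in acts(tokens, coding_b).items()}
        n = len(tokens)
        p_obs = sum(a[t] == b[t] for t in tokens) / n
        # Expected agreement from the relative frequencies of the tags.
        labels = set(a.values()) | set(b.values())
        p_exp = sum(
            (sum(a[t] == lab for t in tokens) / n) *
            (sum(b[t] == lab for t in tokens) / n)
            for lab in labels)
        return (p_obs - p_exp) / (1 - p_exp)

    # The worked example above, plus a hypothetical second coder who
    # left the final token out of the second CGU.
    tokens = ["1.1", "1.2", "2.1", "2.2", "2.3", "3.1"]
    coder_x = [["1.1", "1.2", "2.1"], ["2.1", "2.3", "3.1"]]
    coder_y = [["1.1", "1.2", "2.1"], ["2.1", "2.3"]]
    print(acts(tokens, coder_x))            # reproduces the table above
    print(kappa(tokens, coder_x, coder_y))

With this hypothetical coder_y, two tokens disagree (2.3 becomes an "end" and 3.1 gets no tag), which is exactly the weighting quirk mentioned above for a dropped "end".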