Andrew S. Gordon

Choice of Plausible Alternatives (COPA)

An evaluation of commonsense causal reasoning

The Choice Of Plausible Alternatives (COPA) evaluation provides researchers with a tool for assessing progress in open-domain commonsense causal reasoning. COPA consists of 1000 questions, split equally into development and test sets of 500 questions each. Each question is composed of a premise and two alternatives, where the task is to select the alternative that more plausibly has a causal relation with the premise. The correct alternative is randomized so that the expected performance of random guessing is 50%. More details about the creation of the COPA evaluation are described in the following paper:

Roemmele, M., Bejan, C., and Gordon, A. (2011) Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning. AAAI Spring Symposium on Logical Formalizations of Commonsense Reasoning, Stanford University, March 21-23, 2011.

1. Examples

Premise: The man broke his toe. What was the CAUSE of this?
Alternative 1: He got a hole in his sock.
Alternative 2: He dropped a hammer on his foot.

Premise: I tipped the bottle. What happened as a RESULT?
Alternative 1: The liquid in the bottle froze.
Alternative 2: The liquid in the bottle poured out.

Premise: I knocked on my neighbor's door. What happened as a RESULT?
Alternative 1: My neighbor invited me in.
Alternative 2: My neighbor left his house.
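
To make the structure of these items concrete, here is one way a COPA question might be represented in code (a minimal Python sketch; the field names are our own and are not necessarily the format used in the distributed files):

# A COPA item pairs a premise with two alternatives and asks which
# alternative stands in the stated causal relation to the premise.
copa_item = {
    "id": 1,
    "asks_for": "cause",  # question type: "cause" or "effect"
    "premise": "The man broke his toe.",
    "alternative1": "He got a hole in his sock.",
    "alternative2": "He dropped a hammer on his foot.",
    "answer": 2,  # the more plausible alternative (here, the hammer)
}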

For the humans on your team, the set of 500 questions in the development set can be found here: COPA-questions-dev.txt

2. The complete COPA package

The entire collection of resources for the COPA evaluation can be downloaded here: COPA-resources.tgz

Included in this package are the following resources:

- The development and test sets of COPA questions, along with gold-standard answer files
- The statistical significance test software described below, with the bash script used to run it
- Results files for baseline and previously evaluated systems (in the results/ directory)

3. Statistical significance test

To determine whether two systems differ significantly in their performance on the COPA datasets, we provide software that implements a significance test. The test is based on approximate randomization (Noreen, 1989), using a stratified shuffling approach to build a distribution of differences in performance between the two systems.

The software is implemented in Java and can be executed with a bash script that takes as arguments the file with the gold-standard answers and the files with the choices of the two reasoning systems to compare. The files with the systems' choices can be given in either order. For example, try a command of the following form (the script and file names shown here are placeholders; the actual names are included in the package):

./<significance-test-script> results/<gold-answers-dev> results/<baselineFirst-choices> results/<PMIgutenbergW5-choices>

In this example, the script calculates the significance level (p-value) to determine whether the results obtained by the baselineFirst and PMIgutenbergW5 systems on the COPA development set are significantly different. Thresholds of .05, .01, and .001 are commonly used to judge statistical significance; the lower the p-value, the stronger the evidence that the difference between the two systems is not due to chance.
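
For reference, the core of such a test can be sketched in a few lines of Python (a minimal sketch of approximate randomization with per-item stratified shuffling, not the distributed Java implementation; the function and variable names are our own):

import random

def approx_randomization_test(gold, sys1, sys2, trials=10000, seed=0):
    """Approximate randomization test (Noreen, 1989) for the difference
    in accuracy between two systems on the same COPA items.

    gold, sys1, sys2: dicts mapping item_id -> selected alternative (1 or 2).
    Returns an estimated p-value for the observed accuracy difference.
    """
    rng = random.Random(seed)
    items = sorted(gold)
    # Per-item correctness indicators (1 = correct, 0 = incorrect).
    c1 = [int(sys1[i] == gold[i]) for i in items]
    c2 = [int(sys2[i] == gold[i]) for i in items]
    observed = abs(sum(c1) - sum(c2))
    at_least_as_extreme = 0
    for _ in range(trials):
        s1 = s2 = 0
        for a, b in zip(c1, c2):
            # Stratified shuffling: swap the two systems' outcomes on
            # each item independently with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            s1 += a
            s2 += b
        if abs(s1 - s2) >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (at_least_as_extreme + 1) / (trials + 1)

Because each shuffle swaps the two systems' outcomes on the same item, the test respects the paired structure of the data, which is what the stratified shuffling approach refers to.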

The choices file is a text file where each line encodes a system's choice for one item, in the format [item_id 1 0] or [item_id 0 1]. The format [item_id 1 0] indicates that the first alternative was selected as the response for item item_id; likewise, [item_id 0 1] indicates that the second alternative was selected.
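
Reading this format is straightforward. Here is a minimal Python sketch (assuming whitespace-separated fields, treating the square brackets above as notation rather than literal characters, and using a function name of our own):

def read_choices(path):
    """Read a COPA choices file into a dict of item_id -> 1 or 2."""
    choices = {}
    with open(path) as f:
        for line in f:
            fields = line.strip("[] \n").split()
            if len(fields) != 3:
                continue  # skip blank or malformed lines
            item_id, first, second = fields
            choices[item_id] = 1 if first == "1" else 2
    return choices

Dictionaries produced this way can be passed directly to a test like the approx_randomization_test sketch above.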

4. Research ethics

In providing both the development and test sets, we are relying on competitors to exercise ethical research practices. In particular, use only the development set to build and tune your system, and evaluate on the test set only when reporting final results.

Findings of unethical research practices can ruin a career, so please follow these rules.

5. SemEval 2012 Task 7

COPA was used as a shared task (Task 7) at the 6th International Workshop on Semantic Evaluation (SemEval-2012). The winning system was created by Travis Goodwin, Bryan Rink, Kirk Roberts, and Sanda M. Harabagiu of the Human Language Technology Research Institute at the University of Texas at Dallas. Details about this shared task and the performance of competing systems are provided in the following paper:

Gordon, A., Kozareva, Z., and Roemmele, M. (2012) SemEval-2012 Task 7: Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning. Sixth International Workshop on Semantic Evaluation (SemEval-2012), Montreal, Canada, June 7-8, 2012.

6. Competitive results

To encourage progress in this research area, we offer here the results files generated by competing systems. These files allow new approaches to be benchmarked against previous ones, and allow the statistical significance of differences between systems to be computed without re-implementing them.