Andrew S. Gordon: Choice of Plausible Alternatives

Choice of Plausible Alternatives (COPA)

An evaluation of commonsense causal reasoning

The Choice Of Plausible Alternatives (COPA) evaluation provides researchers with a tool for assessing progress in open-domain commonsense causal reasoning. COPA consists of 1000 questions, split equally into development and test sets of 500 questions each. Each question is composed of a premise and two alternatives, where the task is to select the alternative that more plausibly has a causal relation with the premise. The correct alternative is randomized so that the expected performance of randomly guessing is 50%. More details about the creation of the COPA evaluation are described in the following paper:

Roemmele, M., Bejan, C., and Gordon, A. (2011) Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning. AAAI Spring Symposium on Logical Formalizations of Commonsense Reasoning, Stanford University, March 21-23, 2011. pdf

1. Examples

Premise: The man broke his toe. What was the CAUSE of this?
Alternative 1: He got a hole in his sock.
Alternative 2: He dropped a hammer on his foot.

Premise: I tipped the bottle. What happened as a RESULT?
Alternative 1: The liquid in the bottle froze.
Alternative 2: The liquid in the bottle poured out.

Premise: I knocked on my neighbor's door. What happened as a RESULT?
Alternative 1: My neighbor invited me in.
Alternative 2: My neighbor left his house.

For the humans on your team, the set of 500 questions in the development set can be found here: COPA-questions-dev.txt

2. The complete COPA package

The entire collection of resources for the COPA evaluation can be downloaded here: COPA-resources.tgz

Included in this package are the following resources:

datasets/copa-dev.xml : 500 questions of the development set
datasets/copa-test.xml : 500 questions of the test set
datasets/copa-all.xml : 1000 questions of both the development and test sets
datasets/copa.dtd : The format of the XML question files

results/gold.* : Correct answers for each set of questions
results/baselineFirst.* : Choices where the first alternative is always selected
results/PMIgutenbergW5.* : Choices made by the best-performing baseline system of Roemmele et al, 2011.

copa-eval.jar : A Java package for computing the statistical significance of differences between answer sets
copa-eval.sh : A simple shell script for using the java package
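
The XML question files listed above can be read with any standard XML parser. The following is a minimal Python sketch, not part of the COPA package. The element and attribute names used here (item, id, asks-for, most-plausible-alternative, p, a1, a2) are assumptions about the file layout and should be checked against datasets/copa.dtd. The inline sample reuses the first example question, and the gold answer of 2 is also an assumption for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Assumed layout of one item; verify against datasets/copa.dtd before relying on it.
SAMPLE = """<copa-corpus version="1.0">
<item id="1" asks-for="cause" most-plausible-alternative="2">
<p>The man broke his toe.</p>
<a1>He got a hole in his sock.</a1>
<a2>He dropped a hammer on his foot.</a2>
</item>
</copa-corpus>"""


@dataclass
class CopaItem:
    item_id: str
    asks_for: str        # "cause" or "effect" (assumed attribute values)
    premise: str
    alternatives: tuple  # (first alternative, second alternative)
    gold: int            # 1 or 2


def parse_items(xml_text):
    """Return a list of CopaItem objects parsed from an XML string."""
    root = ET.fromstring(xml_text)
    items = []
    for node in root.iter("item"):
        items.append(CopaItem(
            item_id=node.get("id"),
            asks_for=node.get("asks-for"),
            premise=node.findtext("p"),
            alternatives=(node.findtext("a1"), node.findtext("a2")),
            gold=int(node.get("most-plausible-alternative")),
        ))
    return items


items = parse_items(SAMPLE)
```

To read a real file, pass its contents to parse_items, for example parse_items(open("datasets/copa-dev.xml", encoding="utf-8").read()).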

3. Statistical significance test

To determine whether two systems differ significantly in their performance on the COPA datasets, we provide software that implements a significance test. The test is based on approximate randomization (Noreen, 1989) and uses a stratified shuffling approach to build a distribution of differences in performance between the two systems.

The software is implemented in Java and can be executed with a bash script (copa-eval.sh) that takes as arguments the file with the gold-standard answers and the files with the choices of the two reasoning systems to compare. The two files with the systems' choices can be given in either order. For example, try the following command:

./copa-eval.sh results/gold.dev results/baselineFirst.dev results/PMIgutenbergW5.dev

In this example, the script calculates the significance level (p-value) of the difference between the results obtained by the baselineFirst and PMIgutenbergW5 systems on the COPA development set. Thresholds of .05, .01, and .001 are commonly used for statistical significance: the lower the p-value, the stronger the evidence that the two systems differ in performance.
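
For readers who want to see the idea behind the test, here is a minimal Python sketch of approximate randomization with per-item (stratified) shuffling. It is an illustration under stated assumptions, not the implementation in copa-eval.jar: the function name, the number of trials, the two-sided test statistic, and the add-one p-value estimate are all choices made for this example.

```python
import random


def count_correct(choices, gold):
    """Number of items where the system's choice matches the gold answer."""
    return sum(c == g for c, g in zip(choices, gold))


def approximate_randomization(gold, sys_a, sys_b, trials=10000, seed=0):
    """Estimate a two-sided p-value for the difference between two systems.

    gold, sys_a, and sys_b are equal-length lists of choices (1 or 2),
    aligned by item. In each trial, the two systems' answers are swapped
    independently for each item with probability 0.5 (shuffling within
    each item, i.e. stratified by item), and the difference in the number
    of correct answers is recomputed.
    """
    rng = random.Random(seed)
    observed = abs(count_correct(sys_a, gold) - count_correct(sys_b, gold))
    at_least_as_extreme = 0
    for _ in range(trials):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(sys_a, sys_b):
            if rng.random() < 0.5:
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        diff = abs(count_correct(shuffled_a, gold) - count_correct(shuffled_b, gold))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (trials + 1)
```

Two identical answer lists give a p-value of 1.0, since every shuffled difference is at least as large as the observed difference of zero. Because the sketch makes its own design choices, its p-values are not guaranteed to match those reported by copa-eval.sh.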

The choices file is a text file where each line encodes a system's choice for one item. The format of each line is either [item_id 1 0] or [item_id 0 1]. The format [item_id 1 0] indicates that the first alternative was selected as the response for item item_id; similarly, [item_id 0 1] indicates that the second alternative was selected as the response for item item_id.
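
The line format described above can be read and written with a few lines of Python. This is a hedged sketch: the helper names are hypothetical, it assumes fields are separated by whitespace, it accepts lines with or without the surrounding square brackets because the description does not settle whether the brackets are literal, and it assumes the gold-standard answer files use the same line format as the choices files.

```python
def parse_choices(text):
    """Map item_id -> chosen alternative (1 or 2) from choices-file text."""
    choices = {}
    for line in text.splitlines():
        # Tolerate optional surrounding brackets (an assumption about the format).
        fields = line.strip().strip("[]").split()
        if not fields:
            continue  # skip blank lines
        item_id, first, second = fields
        if (first, second) == ("1", "0"):
            choices[item_id] = 1
        elif (first, second) == ("0", "1"):
            choices[item_id] = 2
        else:
            raise ValueError(f"malformed choice line: {line!r}")
    return choices


def format_choice(item_id, alternative):
    """Render one choice line (without brackets) for alternative 1 or 2."""
    return f"{item_id} 1 0" if alternative == 1 else f"{item_id} 0 1"


def accuracy(system, gold):
    """Fraction of gold items for which the system chose the gold alternative."""
    return sum(system.get(item_id) == answer for item_id, answer in gold.items()) / len(gold)


# Toy data for illustration only; these are not real COPA answers.
gold = parse_choices("1 0 1\n2 0 1\n3 1 0\n")
always_first = parse_choices("\n".join(format_choice(i, 1) for i in gold))
```

On this toy data, a system that always selects the first alternative is correct on one of the three items.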

4. Research ethics

In providing both the development and test sets, we are relying on competitors to exercise ethical research practices.

Researchers should not study the 500 questions in the COPA test set, and should avoid any temptation to tune their systems toward the content of this particular set of questions.
Researchers should evaluate the performance of their systems on the COPA test set only once, after they have concluded all of their efforts to improve performance on the COPA development set.

Findings of unethical research practices can ruin your career, so obey the rules.

5. SemEval 2012 Task 7

COPA was used as a shared task (Task 7) in the 6th International Workshop on Semantic Evaluation (SemEval 2012). The winning system was created by Travis Goodwin, Bryan Rink, Kirk Roberts, and Sanda M. Harabagiu from the University of Texas at Dallas, Human Language Technology Research Institute. Details about this shared task and the performance of competing systems are provided in the following paper:

Gordon, A., Kozareva, Z., and Roemmele, M. (2012) SemEval-2012 Task 7: Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning. Proceedings of the 6th International Workshop on Semantic Evaluation (SemEval 2012), June 7-8, 2012, Montreal, Canada. pdf

6. SuperGLUE benchmark for general-purpose language understanding

As of 2019, the COPA evaluation is included in the "SuperGLUE" benchmark for general-purpose language understanding, as one of ten tasks. Website: https://super.gluebenchmark.com/ and Leaderboard: https://super.gluebenchmark.com/leaderboard

7. Competitive results

To encourage progress in this research area, we would like to offer here all of the results generated by competing systems. This will allow us to benchmark different approaches to the problem, and allow us to compute the statistical significance of the differences between systems without having to re-implement them.

Baseline results (provided in the COPA-resources package): PMIgutenbergW5 achieves 58.8% on the test set
Gordon, A., Bejan, C., and Sagae, K. (2011) Commonsense Causal Reasoning Using Millions of Personal Stories. Twenty-Fifth Conference on Artificial Intelligence (AAAI-11), August 7–11, 2011, San Francisco, CA. pdf (COPA-Gordon-2011.tgz): Best system achieves 65.4% on the test set
Goodwin, T., Rink, B., Roberts, K., and Harabagiu, S. (2012) UTDHLT: COPACETIC System for Choosing Plausible Alternatives. Proceedings of the 6th International Workshop on Semantic Evaluation (SemEval 2012), June 7-8, 2012, Montreal, Canada. (COPA-Goodwin-2012.tgz): Best system achieves 63.4% on the test set
Jabeen, S., Gao, X., and Andreae, P. (2014) Using Asymmetric Associations for Commonsense Causality Detection. 13th Pacific Rim International Conference on Artificial Intelligence, Gold Coast, QLD, Australia, December 1-5, 2014. Best system achieves 58.8% on the test set
Luo, Z., Sha, Y., Zhu, K., Hwang, S., and Wang, Z. (2016) Commonsense Causal Reasoning between Short Texts. 15th International Conference on Principles of Knowledge Representation and Reasoning (KR-2016), April 25-29, 2016, Cape Town, South Africa. Achieves 70.2% on test set.
Sasaki, S., Takase, S., Inoue, N., Okazaki, N., and Inui, K. (2017) Handling Multiword Expressions in Causality Estimation. 12th International Conference on Computational Semantics (IWCS), 19-22 September 2017 Montpellier, France. (COPA-Sasaki-2017.tgz): Achieves 71.2% on COPA test set. PDF at ACL Anthology
Roemmele, M. and Gordon, A. (2018) An Encoder-decoder Approach to Predicting Causal Relations in Stories. Storytelling Workshop at the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2018), New Orleans, LA, June 5, 2018. pdf. Achieves 66.2% on the COPA test set.
Radford, A. (2018) Improving Language Understanding with Unsupervised Learning. Blog post, June 11, 2018, and related paper. Reports 78.6% on the COPA test set.
Li, Z., Chen, T., and Van Durme, B. (2019) Learning to Rank for Plausible Plausibility. Proceedings of ACL-2019: 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, July 28-August 2, 2019. arXiv page. Achieves 75.4% on the COPA test set.
Sap, M., Rashkin, H., Chen, D., Le Bras, R., and Choi, Y. (2019) Social IQa: Commonsense Reasoning about Social Interactions. 2019 Conference on Empirical Methods in Natural Language Processing, Hong Kong. arXiv page. Achieves 84.4% using BERT tuned with crowdsourced commonsense social knowledge.
< your results here >

8. Other related publications

Several other papers discuss the COPA evaluation and describe novel approaches to its solution.

Maslan, N., Roemmele, M., and Gordon, A. (2015) One Hundred Challenge Problems for Logical Formalizations of Commonsense Psychology. Twelfth International Symposium on Logical Formalizations of Commonsense Reasoning (Commonsense-2015), March 23-25, 2015, Stanford, CA. pdf
Blass, J. & Forbus, K. (2017). Analogical Chaining with Natural Language Instruction for Commonsense Reasoning. Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), February 4–9, 2017, San Francisco, CA.
Zhang, S., Rudinger, R., Duh, K., and Van Durme, B. (2017) Ordinal Common-sense Inference. Sixteenth conference on Theoretical Aspects of Rationality and Knowledge (TARK 2017), Liverpool, UK, July 24-26, 2017.

9. License (BSD 2-Clause License)

Copyright (c) 2010, University of Southern California
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.