Choice of Plausible Alternatives (COPA)
An evaluation of commonsense causal reasoning
The Choice Of Plausible Alternatives (COPA) evaluation provides researchers with a tool for assessing progress in open-domain commonsense causal reasoning. COPA consists of 1000 questions, split equally into development and test sets of 500 questions each. Each question is composed of a premise and two alternatives, where the task is to select the alternative that more plausibly has a causal relation with the premise. The correct alternative is randomized so that the expected performance of randomly guessing is 50%. More details about the creation of the COPA evaluation are described in the following paper:
- Roemmele, M., Bejan, C., and Gordon, A. (2011) Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning. AAAI Spring Symposium on Logical Formalizations of Commonsense Reasoning, Stanford University, March 21-23, 2011. pdf
Premise: The man broke his toe. What was the CAUSE of this?
Alternative 1: He got a hole in his sock.
Alternative 2: He dropped a hammer on his foot.
Premise: I tipped the bottle. What happened as a RESULT?
Alternative 1: The liquid in the bottle froze.
Alternative 2: The liquid in the bottle poured out.
Premise: I knocked on my neighbor's door. What happened as a RESULT?
Alternative 1: My neighbor invited me in.
Alternative 2: My neighbor left his house.
For the humans on your team, the set of 500 questions in the development set can be found here: COPA-questions-dev.txt
2. The complete COPA package
The entire collection of resources for the COPA evaluation can be downloaded here: COPA-resources.tgz
Included in this package are the following resources:
- datasets/copa-dev.xml : 500 questions of the development set
- datasets/copa-test.xml : 500 questions of the test set
- datasets/copa-all.xml : 1000 questions of both the development and test sets
- datasets/copa.dtd : The format of the XML question files
- results/gold.* : Correct answers for each set of questions
- results/baselineFirst.* : Choices where the first alternative is always selected
- results/PMIgutenbergW5.* : Choices made by the best-performing baseline system of Roemmele et al. (2011)
- copa-eval.jar : A Java package for computing statistical significance of differences in answer sets
- copa-eval.sh : A simple shell script for using the java package
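For readers who want to load the question files programmatically, the following is a minimal Python sketch. The element and attribute names used here (item, id, asks-for, most-plausible-alternative, p, a1, a2) are assumptions that are not stated on this page; check datasets/copa.dtd in the package for the actual format before relying on them. The sample string, including its item id, is made up for illustration from one of the example questions above and is not copied from the dataset.

```python
import xml.etree.ElementTree as ET


def load_questions(xml_text):
    """Parse COPA-style XML into a list of dicts.

    ASSUMPTION: tag and attribute names below are not confirmed by this
    page; verify them against datasets/copa.dtd.
    """
    root = ET.fromstring(xml_text)
    questions = []
    for item in root.iter("item"):
        questions.append({
            "id": item.get("id"),
            "asks_for": item.get("asks-for"),  # "cause" or "effect" (assumed values)
            "gold": int(item.get("most-plausible-alternative")),
            "premise": item.findtext("p"),
            "alternatives": (item.findtext("a1"), item.findtext("a2")),
        })
    return questions


# Illustrative sample only; the id and layout are hypothetical.
SAMPLE = """<copa-corpus version="1.0">
<item id="1" asks-for="effect" most-plausible-alternative="2">
<p>I tipped the bottle.</p>
<a1>The liquid in the bottle froze.</a1>
<a2>The liquid in the bottle poured out.</a2>
</item>
</copa-corpus>"""

questions = load_questions(SAMPLE)
print(questions[0]["premise"], questions[0]["alternatives"][questions[0]["gold"] - 1])
```

To read a file from the package instead of a string, the text can be obtained with open("datasets/copa-dev.xml").read() and passed to the same function.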
3. Statistical significance test
To determine whether two systems differ significantly in their performance on the COPA datasets, we provide software that implements a significance test. The test is based on approximate randomization (Noreen, 1989) and uses a stratified shuffling approach to build a distribution of differences in performance between the two systems.
The software is implemented in Java and can be executed with a bash script (copa-eval.sh) that receives as arguments the file with the gold-standard answers and the files with the choices of the two reasoning systems to compare. The files with the systems' choices can be in any order. For example, try the following command:
./copa-eval.sh results/gold.dev results/baselineFirst.dev results/PMIgutenbergW5.dev
In this example, the script calculates the significance level (p-value) used to decide whether the results obtained by the baselineFirst and PMIgutenbergW5 systems are significantly different on the COPA development set. P-values below thresholds of .05, .01, or .001 are commonly treated as statistically significant; the lower the p-value, the stronger the evidence that the two systems differ in performance.
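The idea behind approximate randomization with per-item shuffling can be sketched in Python as follows. This is an illustrative sketch only, not the implementation in copa-eval.jar: the function name, the default number of trials, and the add-one smoothing of the p-value are assumptions made here. Each input is a list with one entry per question, 1 if that system answered the question correctly and 0 otherwise, in the same question order for both systems.

```python
import random


def approximate_randomization(correct_a, correct_b, trials=10000, seed=0):
    """Two-sided p-value for the difference in accuracy of two systems.

    Each question is its own stratum: in every trial, the two systems'
    outcomes on a question are swapped with probability 0.5.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must answer the same questions")
    rng = random.Random(seed)
    # Work with counts of correct answers; dividing by the number of
    # questions would not change the comparison.
    observed = abs(sum(correct_a) - sum(correct_b))
    at_least_as_extreme = 0
    for _ in range(trials):
        total_a = total_b = 0
        for a, b in zip(correct_a, correct_b):
            if rng.random() < 0.5:
                a, b = b, a  # shuffle within this question only
            total_a += a
            total_b += b
        if abs(total_a - total_b) >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (trials + 1)
```

Two systems with identical outcomes yield a p-value of 1.0, since every shuffled difference equals the observed difference of zero.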
The choices file is a text file in which each line encodes a system's choice for one item. Each line has the form [item_id 1 0] or [item_id 0 1]. The form [item_id 1 0] indicates that the first alternative was selected as the response for item item_id; similarly, [item_id 0 1] indicates that the second alternative was selected.
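A choices file in this format can be read and scored with a few lines of Python. The sketch below rests on two assumptions that this page does not confirm: that the square brackets above may be notation rather than literal characters (the parser accepts lines with or without them), and that the gold-standard file uses the same line format as a system's choices file. The function names are hypothetical.

```python
def parse_choices(lines):
    """Map each item_id to the chosen alternative (1 or 2)."""
    choices = {}
    for raw in lines:
        # ASSUMPTION: brackets are optional notation; strip them if present.
        fields = raw.strip().strip("[]").split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"malformed line: {raw!r}")
        item_id, first, second = fields
        if (first, second) == ("1", "0"):
            choices[item_id] = 1
        elif (first, second) == ("0", "1"):
            choices[item_id] = 2
        else:
            raise ValueError(f"malformed line: {raw!r}")
    return choices


def accuracy(gold, system):
    """Fraction of gold items for which the system chose the gold alternative."""
    correct = sum(1 for item_id, answer in gold.items()
                  if system.get(item_id) == answer)
    return correct / len(gold)


gold = parse_choices(["1 1 0", "2 0 1", "3 0 1", "4 1 0"])
system = parse_choices(["1 1 0", "2 1 0", "3 0 1", "4 1 0"])
print(accuracy(gold, system))  # 0.75
```

With real files, the lines can be supplied by passing an open file object, for example parse_choices(open("results/gold.dev")).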
4. Research ethics
In providing both the development and test sets, we are relying on competitors to exercise ethical research practices.
- Researchers should not study the 500 questions in the COPA test set, and should avoid any temptation to tune their systems toward the content of this particular set of questions.
- Researchers should evaluate the performance of their systems on the COPA test set only once, after they have concluded all of their efforts to improve performance on the COPA development set.
Findings of unethical research practices can ruin your career, so obey the rules.
5. SemEval 2012 Task 7
COPA was used as a shared task (Task 7) in the 6th International Workshop on Semantic Evaluation (SemEval 2012). The winning system was created by Travis Goodwin, Bryan Rink, Kirk Roberts, and Sanda M. Harabagiu from the University of Texas at Dallas, Human Language Technology Research Institute. Details about this shared task and the performance of competing systems are provided in the following paper:
- Gordon, A., Kozareva, Z., and Roemmele, M. (2012) SemEval-2012 Task 7: Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning. Proceedings of the 6th International Workshop on Semantic Evaluation (SemEval 2012), June 7-8, 2012, Montreal, Canada. pdf
6. SuperGLUE benchmark for general-purpose language understanding
As of 2019, the COPA evaluation is included in the "SuperGLUE" benchmark for general-purpose language understanding, as one of ten tasks. Website: https://super.gluebenchmark.com/ and Leaderboard: https://super.gluebenchmark.com/leaderboard
7. Competitive results
To encourage progress in this research area, we would like to offer here all of the results generated by competing systems. This will allow us to benchmark different approaches to the problem, and allow us to compute the statistical significance of the differences between systems without having to re-implement them.
- Baseline results (provided in the COPA-resources package): PMIgutenbergW5 achieves 58.8% on the test set
- Gordon, A., Bejan, C., and Sagae, K. (2011) Commonsense Causal Reasoning Using Millions of Personal Stories. Twenty-Fifth Conference on Artificial Intelligence (AAAI-11), August 7–11, 2011, San Francisco, CA. pdf (COPA-Gordon-2011.tgz): Best system achieves 65.4% on the test set
- Goodwin, T., Rink, B., Roberts, K., and Harabagiu, S. (2012) UTDHLT: COPACETIC System for Choosing Plausible Alternatives. Proceedings of the 6th International Workshop on Semantic Evaluation (SemEval 2012), June 7-8, 2012, Montreal, Canada. (COPA-Goodwin-2012.tgz): Best system achieves 63.4% on the test set
- Jabeen, S., Gao, X., Andreae, P. (2014). Using Asymmetric Associations for Commonsense Causality Detection. 13th Pacific Rim International Conference on Artificial Intelligence, Gold Coast, QLD, Australia, December 1-5, 2014. Best system achieves 58.8% on the test set
- Luo, Z., Sha, Y., Zhu, K., Hwang, S., and Wang, Z. (2016) Commonsense Causal Reasoning between Short Texts. 15th International Conference on Principles of Knowledge Representation and Reasoning (KR-2016), April 25-29, 2016, Cape Town, South Africa. Achieves 70.2% on test set.
- Sasaki, S., Takase, S., Inoue, N., Okazaki, N., and Inui, K. (2017) Handling Multiword Expressions in Causality Estimation. 12th International Conference on Computational Semantics (IWCS), 19-22 September 2017 Montpellier, France. (COPA-Sasaki-2017.tgz): Achieves 71.2% on COPA test set. PDF at ACL Anthology
- Roemmele, M. and Gordon, A. (2018) An Encoder-decoder Approach to Predicting Causal Relations in Stories. Storytelling Workshop at the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2018), New Orleans, LA, June 5, 2018. pdf. Achieves 66.2% on the COPA test set.
- Radford, A. (2018) Improving Language Understanding with Unsupervised Learning. Blog post, June 11, 2018, and related paper. Reports 78.6% on the COPA test set.
- Li, Z., Chen, T., and Van Durme, B. (2019) Learning to Rank for Plausible Plausibility. Proceedings of ACL-2019: 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, July 28-August 2, 2019. arXiv page. Achieves 75.4% on the COPA test set.
- Sap, M., Rashkin, H., Chen, D., Le Bras, R., and Choi, Y. (2019) Social IQa: Commonsense Reasoning about Social Interactions. 2019 Conference on Empirical Methods in Natural Language Processing, Hong Kong. arXiv page. Achieves 84.4% using BERT tuned with crowdsourced commonsense social knowledge.
- < your results here >
8. Other related publications
Several other papers discuss the COPA evaluation and describe novel approaches to its solution.
- Maslan, N., Roemmele, M., and Gordon, A. (2015) One Hundred Challenge Problems for Logical Formalizations of Commonsense Psychology. Twelfth International Symposium on Logical Formalizations of Commonsense Reasoning (Commonsense-2015), March 23-25, 2015, Stanford, CA. pdf
- Blass, J. & Forbus, K. (2017). Analogical Chaining with Natural Language Instruction for Commonsense Reasoning. Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), February 4–9, 2017, San Francisco, CA.
- Zhang, S., Rudinger, R., Duh, K., and Van Durme, B. (2017) Ordinal Common-sense Inference. Sixteenth conference on Theoretical Aspects of Rationality and Knowledge (TARK 2017), Liverpool, UK, July 24-26, 2017.