The Syntactic Similarity Resource Page
Computing syntactic similarity between words using frequencies of parse tree paths
1. Overview
Usually when people talk about the similarity of words, they are thinking about semantic similarity (e.g. synonymy). However, sometimes it is useful to think about the syntactic similarity of words, i.e. how similar are two words with respect to their syntactic function or role? You can think of traditional part-of-speech tags as a coarse theory of syntactic similarity, e.g. all personal pronouns have similar syntactic roles. Still, it would be nice to have a quantitative measure of the exact degree of syntactic similarity between two words.
The method that I have explored in my work represents the syntactic behavior of each word as a normalized feature vector, where each feature is the frequency of a unique parse tree path observed for that word in a large corpus of syntactically parsed text. The syntactic similarity of two words is then the cosine similarity between their vectors. Using this approach, we can compute that the syntactic similarity between the words "whisk" and "pluck" is 0.849.
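As a rough illustration of this computation, the following Python sketch compares two path-frequency profiles by cosine. It is a minimal sketch under stated assumptions, not the code used to build the tables: the function names, the path label strings, and the counts are all invented for illustration, and "normalized" is assumed here to mean scaling each vector to unit Euclidean length.

```python
import math
from collections import Counter


def normalize(counts):
    """Scale a path-frequency profile to unit Euclidean length (assumed normalization)."""
    norm = math.sqrt(sum(v * v for v in counts.values()))
    if norm == 0:
        return {}
    return {path: v / norm for path, v in counts.items()}


def cosine_similarity(profile_a, profile_b):
    """Cosine of the angle between two sparse frequency vectors."""
    a = normalize(profile_a)
    b = normalize(profile_b)
    # Only paths observed for both words contribute to the dot product.
    return sum(weight * b[path] for path, weight in a.items() if path in b)


# Hypothetical profiles: keys stand in for unique parse tree paths,
# values are how often each path was observed for the word.
word_a = Counter({"path-1": 40, "path-2": 35, "path-3": 25})
word_b = Counter({"path-1": 30, "path-2": 50, "path-4": 20})

print(round(cosine_similarity(word_a, word_b), 3))
```

Because frequencies are never negative, the score falls between 0 (no shared paths) and 1 (identical relative path frequencies).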
In the first paper that I wrote about this technique, we calculated the syntactic similarity between verbs in PropBank in order to improve the performance of a simple semantic role labeling system.
- Gordon, A. and Swanson, R. (2007) Generalizing semantic role annotations across syntactically similar verbs. Proceedings of the 2007 meeting of the Association for Computational Linguistics (ACL-07), Prague, Czech Republic, June 23-30, 2007.
In the second paper, I calculated the syntactic similarity between the top five thousand words in the Brown WSJ corpus and clustered them. Kenji Sagae then used the cluster labels to improve semantic argument labeling of HPSG trees.
- Sagae, K. and Gordon, A. (2009) Clustering Words by Syntactic Similarity Improves Dependency Parsing of Predicate-Argument Structures. International Conference on Parsing Technologies (IWPT-09), Paris, France, October 7-9, 2009.
2. Syntactic Similarity Tables
If you are a natural language processing researcher, you can probably think of a few good applications of this measure of syntactic similarity. However, you may not have the time or resources needed to parse millions of words of text, tabulate parse tree path frequencies, compute pairwise cosine similarity scores, and identify clusters using hierarchical agglomerative clustering techniques. You are probably better off just getting this data from me directly, and kindly thanking me for it in your published work.
Here is what I have available:
- PropBank verb tables: Pairwise similarity and hierarchical clustering of 3241 PropBank verbs using labeled up-down constituency parse tree paths from Charniak-parsed GigaWord sentences, 1000 for each verb, where profiles include all lemmas of a particular verb ignoring case.
- News/weblog-story tables: Pairwise similarity, hierarchical clustering, and cross-genre mappings for the 4077 most frequent GigaWord words and 3060 most frequent words in a corpus of personal stories extracted from weblogs, using unlabeled left-right up-down binary parse tree paths from an unsupervised constituency parser, 1000 sentences for each word distinguished by assigned part-of-speech tag, ignoring case.
- Wall Street Journal tables: Pairwise similarity, hierarchical clustering, and multi-granular part-of-speech tags for the 5000 most frequent words in the parsed BLLIP corpus of 30 million WSJ words, using left-middle-right up-down constituency parse tree paths, 1000 sentences for each word distinguished by assigned part-of-speech tag and case.
- More PropBank verb tables: Pairwise similarity and hierarchical clustering of 3241 PropBank verbs, using left-middle-right up-down constituency parse tree paths, 1000 for each verb, where profiles include all lemmas of a particular verb, ignoring part-of-speech tag and case. Clustered using single-link, average-link, and total-link clustering methods.
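For readers unfamiliar with the clustering step, here is a minimal Python sketch of hierarchical agglomerative clustering over a pairwise similarity table, with a choice of merge criterion. It is illustrative only and not the code that produced the tables: the function name, the toy words, and the similarity values are invented, and it assumes that "total-link" above corresponds to what is more commonly called complete-link clustering.

```python
from itertools import combinations


def agglomerate(words, sim, linkage="average"):
    """Greedy bottom-up clustering; returns merges as (cluster_a, cluster_b, score).

    words: list of distinct strings.
    sim: dict mapping frozenset({w1, w2}) to a similarity score.
    """
    combine = {
        "single": max,    # score of the most similar cross-cluster pair
        "complete": min,  # score of the least similar cross-cluster pair
        "average": lambda scores: sum(scores) / len(scores),
    }[linkage]
    clusters = [(w,) for w in words]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            scores = [sim[frozenset((x, y))] for x in a for y in b]
            score = combine(scores)
            if best is None or score > best[2]:
                best = (a, b, score)
        a, b, score = best
        # Replace the two most similar clusters with their union.
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append((a, b, score))
    return merges


# Invented similarity scores for four toy words.
toy_words = ["run", "walk", "the", "a"]
toy_sim = {
    frozenset(("run", "walk")): 0.9,
    frozenset(("the", "a")): 0.8,
    frozenset(("run", "the")): 0.2,
    frozenset(("run", "a")): 0.1,
    frozenset(("walk", "the")): 0.3,
    frozenset(("walk", "a")): 0.2,
}

for left, right, score in agglomerate(toy_words, toy_sim, linkage="average"):
    print(left, right, round(score, 3))
```

The three criteria differ only in how the similarity between two multi-word clusters is summarized, which is why the same pairwise table can yield different hierarchies. This exhaustive search is fine for a toy example but would be slow for thousands of words.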
3. Contact
For information about this research, please contact Andrew S. Gordon (gordon @ ict.usc.edu) of the USC Institute for Creative Technologies.