The Syntactic Similarity Resource Page

Computing syntactic similarity between words using frequencies of parse tree paths

1. Overview

Usually when people talk about the similarity of words, they are thinking about semantic similarity (e.g. synonymy). However, sometimes it is useful to think about the syntactic similarity of words, i.e. how similar are two words with respect to their syntactic function or role? You can think of traditional part-of-speech tags as a coarse theory of syntactic similarity, e.g. all personal pronouns have similar syntactic roles. Still, it would be nice to have a quantitative measure of the exact degree of syntactic similarity between two words.

The method that I have explored in my work represents the syntactic behavior of each word as a normalized feature vector of the frequencies of unique parse tree paths, tabulated over large corpora of syntactically parsed text. The syntactic similarity of two words is then the cosine similarity between their vectors. Using this approach, we can compute that the syntactic similarity between the words "whisk" and "pluck" is 0.849.
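To make the computation concrete, here is a minimal Python sketch of a cosine similarity between two sparse path-frequency vectors. The path strings and counts are made up for illustration and are not taken from any parsed corpus, and the function name is hypothetical; how parse tree paths are extracted and encoded is not shown here.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse frequency vectors."""
    # Only paths observed for both words contribute to the dot product.
    dot = sum(count * b[path] for path, count in a.items() if path in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a word with no observed paths is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical parse tree path counts for two verbs (illustrative only).
whisk = Counter({"VB^VP^S": 12, "VB^VP>NP": 9, "VB^VP>PRT": 4})
pluck = Counter({"VB^VP^S": 10, "VB^VP>NP": 11, "VB^VP>PP": 3})

score = cosine_similarity(whisk, pluck)
print(f"similarity(whisk, pluck) = {score:.3f}")
```

Because the cosine divides by both vector lengths, normalizing the frequency vectors beforehand does not change the score; counts that are all non-negative always give a value between 0 and 1.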

In the first paper that I wrote about this technique, we calculated the syntactic similarity between verbs in PropBank in order to improve the performance of a simple semantic role labeling system.

In the second paper, I calculated the syntactic similarity between the top five thousand words in the Brown WSJ corpus and clustered them. Kenji Sagae then used the cluster labels to improve semantic argument labeling of HPSG trees.

2. Syntactic Similarity Tables

If you are a natural language processing researcher, you can probably think of a few good applications of this measure of syntactic similarity. However, you may not have the time or resources needed to parse millions of words of text, tabulate parse tree path frequencies, compute pairwise cosine similarity scores, and identify clusters using hierarchical agglomerative clustering techniques. You are probably better off just getting this data from me directly, and kindly thanking me for it in your published work.
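For a sense of what the last two steps involve, here is a rough Python sketch that computes pairwise cosine scores and then merges clusters bottom-up. It assumes the path frequencies have already been tabulated. The average-linkage rule, the stopping threshold, the function names, and the toy vectors are all assumptions made for illustration, not a description of the settings used to build the tables.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse frequency vectors (dicts)."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def pairwise_similarity(vectors):
    """Map each unordered pair of words to its cosine similarity."""
    return {
        frozenset((w1, w2)): cosine(vectors[w1], vectors[w2])
        for w1, w2 in combinations(sorted(vectors), 2)
    }


def agglomerate(vectors, threshold):
    """Average-linkage agglomerative clustering (assumed linkage rule).

    Repeatedly merges the two most similar clusters and stops once the
    best available merge falls below `threshold`.
    """
    sims = pairwise_similarity(vectors)
    clusters = [frozenset([w]) for w in sorted(vectors)]

    def linkage(c1, c2):
        # Mean similarity over all cross-cluster word pairs.
        scores = [sims[frozenset((a, b))] for a in c1 for b in c2]
        return sum(scores) / len(scores)

    while len(clusters) > 1:
        best, c1, c2 = max(
            ((linkage(x, y), x, y) for x, y in combinations(clusters, 2)),
            key=lambda t: t[0],
        )
        if best < threshold:
            break
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return clusters


# Toy path-frequency vectors; path names "p1".."p6" are placeholders.
vectors = {
    "whisk": {"p1": 12, "p2": 9, "p3": 4},
    "pluck": {"p1": 10, "p2": 11, "p4": 3},
    "she": {"p5": 20, "p6": 5},
    "he": {"p5": 18, "p6": 7},
}
clusters = agglomerate(vectors, threshold=0.5)
print(clusters)
```

This naive version recomputes every linkage on each pass, which is fine for a handful of words but far too slow for thousands; at that scale a library implementation of hierarchical clustering is the more practical choice.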

Here is what I have available:

3. Contact

For information about this research, please contact Andrew S. Gordon (gordon @ ict.usc.edu) of the USC Institute for Creative Technologies.