Dawn Lawrie constructed a system that extracts the most important terms and phrases, which she calls concepts, from document texts and establishes a subsumption relationship among them. The result is a tree-based structure in which the most general concept terms or phrases occupy the highest level, more specific concepts serve as intermediate nodes, and the documents that contain those concepts are assigned to the leaves of the tree. Dawn calls this tree-like structure the concept hierarchy.
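As a rough illustration of the structure described above, the hierarchy can be modeled as nested nodes: each concept holds its more specific concepts, and the most specific concepts hold document identifiers. This is only a minimal sketch. The class name `ConceptNode`, its fields, and the toy data are assumptions made for this example and are not taken from Dawn's system.

```python
from dataclasses import dataclass, field


@dataclass
class ConceptNode:
    """One concept (a term or phrase) in a hypothetical concept hierarchy."""
    label: str
    # More specific concepts subsumed by this one.
    children: list["ConceptNode"] = field(default_factory=list)
    # Document ids attached at the leaf level (empty for upper-level concepts).
    documents: list[int] = field(default_factory=list)


def concept_depth(node: ConceptNode) -> int:
    """Number of concept levels at or below this node (documents not counted)."""
    return 1 + max((concept_depth(child) for child in node.children), default=0)


# Toy data invented for illustration only.
race = ConceptNode(
    "race",
    children=[
        ConceptNode(
            "yacht race",
            children=[ConceptNode("final race", documents=[3, 17])],
        )
    ],
)

print(concept_depth(race))
```

A strict tree is the simplest model. If a concept can have several parents, a mapping from each node to its parents may be a better fit than nested children.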
I experimented with different interface designs for visualizing the concept hierarchy, and this page briefly describes one of them.
The following figure shows the concept hierarchy created from the top 500 documents returned in response to the query “America Cup”. The hierarchy contains four levels: the top three levels contain the concepts, and the bottom level (the leaf level) contains the documents. Each level is represented as a column in the picture, with the topmost level occupying the leftmost column and the document level visible in the rightmost column.
There are 10 concepts in the top level of the hierarchy. These are the most general concepts that describe the documents, and we call them topics. Each topic is assigned an individual node in the tree and a unique color. The colors are shown as bars in the background of each concept. The length of each bar is proportional to the number of documents in which the topic appears. One can quickly see that the concept “cup” appears in almost every document (it has a very long yellow bar), while the concepts “popular” and “race” are not very frequent.
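The bar length rule can be sketched as a simple linear scaling. This is a minimal sketch: it assumes that a bar spanning the whole column corresponds to a topic appearing in every retrieved document. The function name `bar_length`, the pixel width, and the document counts are hypothetical.

```python
def bar_length(doc_count: int, total_docs: int, column_width: float) -> float:
    """Length of a topic's color bar, proportional to the number of documents it appears in.

    Assumes a full-width bar means the topic occurs in all `total_docs` documents.
    """
    if total_docs <= 0:
        raise ValueError("total_docs must be positive")
    return column_width * doc_count / total_docs


# Hypothetical counts for a 500-document result list and a 200-pixel column.
print(bar_length(480, 500, 200.0))  # a frequent topic gets a long bar
print(bar_length(40, 500, 200.0))   # a rare topic gets a short bar
```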
The gray vertical lines in the column, which resemble bar code marks, designate the individual documents in which the concept appears. The position of each gray line is proportional to the rank of the document: the closer the line is to the left of the column, the higher the document’s rank. For example, the concepts “America” and “food” appear in about the same number of documents (the corresponding bars have similar lengths). However, “America” is much more frequent in documents that appear at the top of the ranked list, while “food” is more often present in documents that are close to the end of the list. This can be observed from the relative frequency of the gray marks for both concepts in the uncolored region of the column.
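The placement of the gray marks can be sketched as a mapping from document rank to a horizontal offset. This is a minimal sketch under stated assumptions: rank 1 maps to the left edge of the column and the last rank maps to the right edge. The function name `tick_positions` and the example ranks are hypothetical.

```python
def tick_positions(ranks: list[int], total_docs: int, column_width: float) -> list[float]:
    """Horizontal offsets of the gray marks for the documents a concept appears in.

    Rank 1 (the best-ranked document) maps to offset 0, the left edge;
    rank `total_docs` maps to `column_width`, the right edge.
    """
    if total_docs < 2:
        raise ValueError("need at least two ranked documents to spread marks")
    return [column_width * (rank - 1) / (total_docs - 1) for rank in sorted(ranks)]


# Two hypothetical concepts with the same number of documents but different rank profiles.
early = tick_positions([1, 2, 5, 9], total_docs=500, column_width=200.0)
late = tick_positions([430, 455, 480, 500], total_docs=500, column_width=200.0)
print(early)
print(late)
```

With equal document counts, the two concepts would get bars of similar length, but the marks of the first cluster on the left and those of the second on the right.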
Each topic from the top level connects to 10 concepts in the second level (“Level 1” in the figure). Several topics may link to the same concept in the second level as indicated by several colors assigned to the concept, e.g., see the word “butter”.
Each concept from the second level connects to 10 more specific concepts in the third level, which in turn connect to documents in the leaf level.
The document column shows the retrieved documents ordered by their rank. The length of the color bars adjacent to each document title is proportional to the number of individual concepts the document is linked to. Specifically, it is equal to the number of different paths we can take through the hierarchy from this document node to reach one of the topic nodes.
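The path count behind the document bars can be sketched as follows. Because several topics may link to the same concept, this sketch stores the hierarchy as a mapping from each node to its parents and counts upward paths. The mapping layout, the function name `count_paths_to_topics`, and the toy data are assumptions made for illustration.

```python
from functools import lru_cache


def count_paths_to_topics(node: str, parents: dict[str, list[str]]) -> int:
    """Count the distinct upward paths from `node` to any topic.

    A topic is taken to be any node with no parents in `parents`.
    """
    @lru_cache(maxsize=None)
    def count(current: str) -> int:
        ups = parents.get(current, [])
        if not ups:
            return 1  # reached a topic: exactly one way to stop here
        return sum(count(parent) for parent in ups)

    return count(node)


# Toy hierarchy invented for illustration: document -> third level -> second level -> topic.
parents = {
    "doc7": ["final race", "butter dish"],
    "final race": ["yacht race"],
    "yacht race": ["race", "cup"],  # a second-level concept shared by two topics
    "butter dish": ["butter"],
    "butter": ["food"],
}

print(count_paths_to_topics("doc7", parents))
```

Here "doc7" reaches "race" and "cup" through "yacht race" and reaches "food" through "butter", which gives three paths.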
This visualization supports dynamic queries. If the user selects one or several concepts in the individual columns, the system reorders the visualization to present only the nodes connected to the selected concepts. For example, if I select “race”, only that topic, its children, and its children’s children remain visible. If I select a couple of documents, only the concepts associated with those documents remain visible.
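The filtering behavior can be sketched as a reachability search: selecting a concept keeps everything below it, and selecting a document keeps everything above it. This is a minimal sketch, not the actual implementation. The child mapping, the helper names `reachable` and `invert`, and the toy data are hypothetical.

```python
def reachable(selected: list[str], edges: dict[str, list[str]]) -> set[str]:
    """Return the selected nodes plus every node reachable from them along `edges`."""
    seen: set[str] = set()
    stack = list(selected)
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(edges.get(node, []))
    return seen


def invert(children: dict[str, list[str]]) -> dict[str, list[str]]:
    """Turn a parent -> children mapping into a child -> parents mapping."""
    parents: dict[str, list[str]] = {}
    for parent, kids in children.items():
        for kid in kids:
            parents.setdefault(kid, []).append(parent)
    return parents


# Toy hierarchy invented for illustration.
children = {
    "race": ["yacht race"],
    "cup": ["yacht race", "trophy"],
    "yacht race": ["doc1"],
    "trophy": ["doc2"],
}

# Selecting a topic keeps the topic and everything beneath it.
print(sorted(reachable(["race"], children)))
# Selecting a document keeps the document and every concept above it.
print(sorted(reachable(["doc2"], invert(children))))
```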
The second screenshot contains a concept hierarchy created for 1000 documents returned from a Chinese text collection in response to an English query. The concepts were automatically translated into English. Here a user selected document 431, and the visualization dynamically reorganized the top levels of the hierarchy to display only the terms that are linked to this document. We believe that a non-Chinese-speaking user can quickly assess the document content from this display. She can also locate similar documents by selecting the terms.