eArchivarius

eArchivarius is a tentative name for a system that provides interactive access to archives of electronic mail. The intended users of the system are historians, social scientists, and others who are looking to understand the content of email collections from dignitaries or government organizations. The main focus is to visualize the relationships between individual messages, people, and events that are described in the messages.

eArchivarius combines ranked retrieval with cluster-based and time-based navigation. The system consists of the following main components:

  1. Each email collection is stored as two separate indexes: one index contains the messages and the other holds information about the people. Both messages and descriptions of people are treated as semi-structured documents, that is, structured documents whose fields contain free text. The open-source search engine Lucene is used to store and index the information.

    The following fields are indexed for email messages: from, to, cc, recipient (the “to” field combined with the “cc” field), audience (the “from”, “to”, and “cc” fields combined together), date, subject, body, and contents (the “subject” field combined with the “body” field plus any quoted text present in the message).

    The following fields are indexed for the people descriptions: email, first name, last name, and description.
  2. I extended the Lucene search capabilities to support relevance feedback and example searches. The two indexes of the collection are tightly linked, so the user can search messages using people information and search for people using message content. For example, the user can retrieve all people who have seen a particular set of messages or find all messages that have a similar audience to a given email.
  3. The Lighthouse visualization and clustering system is used to present the results of the search and helps the user to formulate the queries. The Lighthouse system was extended to provide for multiple meanings of inter-object similarity that can be found in an email collection. For example, messages can be similar based on content, date, or audience. Lighthouse allows the user to define the total inter-message similarity as a weighted sum of the individual similarities and provides controls to dynamically adjust each component.
  4. An infinitely-zoomable timeline presentation is added to the Lighthouse system to represent the distribution of messages sent and received by different people in the collection. The timeline scale varies from one pixel per second to one pixel per month.
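
As a rough illustration of the derived fields from item 1 and the weighted similarity from item 3, the Python sketch below models a message and combines three per-aspect similarities. It is not the eArchivarius or Lighthouse code: the `Message` class, the Jaccard overlap measures, the `scale_days` decay parameter, and the weight names are all assumptions made for the example, since the description above does not say how each individual similarity is computed.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Message:
    """Hypothetical in-memory stand-in for one indexed email message."""
    sender: str          # the "from" field
    to: list[str]
    cc: list[str]
    date: datetime
    subject: str
    body: str            # assumed to already include any quoted text

    @property
    def recipient(self) -> set[str]:
        # "to" combined with "cc"
        return set(self.to) | set(self.cc)

    @property
    def audience(self) -> set[str]:
        # "from", "to", and "cc" combined
        return {self.sender} | self.recipient

    @property
    def contents(self) -> str:
        # "subject" combined with "body"
        return f"{self.subject} {self.body}"


def jaccard(a: set, b: set) -> float:
    """Set overlap in [0, 1]; an assumed measure, not taken from the source."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def content_similarity(m1: Message, m2: Message) -> float:
    # Crude bag-of-words overlap on the combined subject and body text.
    return jaccard(set(m1.contents.lower().split()),
                   set(m2.contents.lower().split()))


def audience_similarity(m1: Message, m2: Message) -> float:
    return jaccard(m1.audience, m2.audience)


def date_similarity(m1: Message, m2: Message, scale_days: float = 30.0) -> float:
    # 1.0 for identical timestamps, decaying as the gap grows (assumed form).
    gap_days = abs((m1.date - m2.date).total_seconds()) / 86400.0
    return 1.0 / (1.0 + gap_days / scale_days)


def total_similarity(m1: Message, m2: Message, weights: dict[str, float]) -> float:
    """Weighted sum of the individual similarities; missing weights count as 0."""
    parts = {
        "content": content_similarity(m1, m2),
        "date": date_similarity(m1, m2),
        "audience": audience_similarity(m1, m2),
    }
    return sum(weights.get(name, 0.0) * value for name, value in parts.items())
```

With weights such as `{"content": 0.5, "audience": 0.5}`, two messages with identical text but only partly overlapping audiences score below 1.0, and shifting weight from audience to content raises their similarity. Adjusting the weights in this way mirrors, in a simplified form, the per-component controls described for Lighthouse.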

More information about eArchivarius:

  1. eArchivarius overview
  2. eArchivarius walk-through (contains some screenshots of the system)
  3. Poster presented at USC/ISI retreat
  4. SIGIR’03 presentation