eArchivarius is a tentative name
for a system that provides interactive access to archives of electronic mail.
The intended users of the system are historians, social scientists, and others
who are looking to understand the content of email collections from dignitaries
or government organizations. The main focus is to visualize the relationships
between individual messages, people, and events that are described in the messages.
eArchivarius combines ranked retrieval with
cluster-based and time-based navigation. The system consists of the following
main components:
- Each email collection is stored as two separate indexes: one index contains
the messages and the other holds information about the people. Both
messages and people descriptions are viewed as semi-structured documents
— structured documents with fields containing free text. The open-source
search engine Lucene is used to store and index the information.
The following fields are indexed for email messages: from, to, cc,
recipient (the "to" field combined with the "cc" field), audience (the
"from", "to", and "cc" fields combined), date, subject, body, and contents
(the "subject" field combined with the "body" field, plus any quoted text
present in the message).
The following fields are indexed for the people descriptions: email, first
name, last name, and description.
- I extended the Lucene search capabilities to provide for relevance feedback
and example searches. The two indexes of the collection are tightly linked
so the user can search messages using people information and search for
people using message content. For example, the user can retrieve
all people who have seen a particular set of messages or find all messages
whose audience is similar to that of a given email.
- The Lighthouse visualization and clustering system presents the search
results and helps the user formulate queries. The Lighthouse system was
extended to support the several notions of inter-object similarity that
an email collection offers. For example, messages can be similar based
on content, date, or audience.
Lighthouse allows the user to define the total inter-message similarity
as a weighted sum of the individual similarities and provides controls
to dynamically adjust each component.
- An infinitely zoomable timeline presentation was added to the Lighthouse
system to show the distribution of messages sent and received by
different people in the collection. The timeline scale ranges from one
pixel per second to one pixel per month.
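
The combined fields listed for email messages (recipient, audience, and contents) can be illustrated with a minimal Python sketch. It does not use Lucene and is not the system's actual indexing code; the `index_fields` function and the dictionary layout of the parsed message, including the `quoted` key, are assumptions made only for illustration.

```python
def index_fields(message):
    """Build the per-message fields from a parsed email.

    `message` is a hypothetical dict with the keys "from", "to", "cc",
    "date", "subject", "body", and "quoted" (an assumed layout).
    """
    sender = message.get("from", "")
    to = list(message.get("to", []))
    cc = list(message.get("cc", []))
    subject = message.get("subject", "")
    body = message.get("body", "")
    quoted = message.get("quoted", "")

    # recipient: the "to" field combined with the "cc" field
    recipient = to + cc
    # audience: the "from", "to", and "cc" fields combined
    audience = ([sender] if sender else []) + recipient
    # contents: subject plus body plus any quoted text
    contents = "\n".join(part for part in (subject, body, quoted) if part)

    return {
        "from": sender,
        "to": to,
        "cc": cc,
        "recipient": recipient,
        "audience": audience,
        "date": message.get("date", ""),
        "subject": subject,
        "body": body,
        "contents": contents,
    }
```

Keeping recipient and audience as separate fields is what lets a query such as "messages with a similar audience" be answered from a single field rather than by merging three at search time.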
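
The weighted-sum similarity described for Lighthouse can be sketched as follows. This is a rough illustration, not the Lighthouse implementation: the `Message` class, the Jaccard overlap used for content and audience, and the decay formula used for dates are all assumptions chosen to keep the example small.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Message:
    sent: datetime
    audience: frozenset  # addresses from the "from", "to", and "cc" fields
    contents: str        # subject plus body


def jaccard(a, b):
    """Set overlap in [0, 1]; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def content_similarity(m1, m2):
    # Assumption: word-set overlap stands in for a real text similarity.
    return jaccard(set(m1.contents.lower().split()),
                   set(m2.contents.lower().split()))


def audience_similarity(m1, m2):
    return jaccard(m1.audience, m2.audience)


def date_similarity(m1, m2, scale_days=30.0):
    # Assumption: similarity decays as the gap between send times grows.
    gap_days = abs((m1.sent - m2.sent).total_seconds()) / 86400.0
    return 1.0 / (1.0 + gap_days / scale_days)


def total_similarity(m1, m2, weights):
    """Weighted sum of the individual similarities.

    `weights` maps "content", "date", and "audience" to user-chosen
    weights; a missing key counts as weight 0.
    """
    parts = {
        "content": content_similarity(m1, m2),
        "date": date_similarity(m1, m2),
        "audience": audience_similarity(m1, m2),
    }
    return sum(weights.get(name, 0.0) * value for name, value in parts.items())
```

Changing one entry of `weights` and recomputing mimics moving one of the controls: with all weight on "audience", two messages sent to the same people score 1.0 regardless of their text. If the weights sum to 1, the total stays in the range 0 to 1.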
More information about eArchivarius:
- eArchivarius overview
- eArchivarius walk-through, with some screenshots of the system
- Poster presented at USC/ISI retreat
- SIGIR'03 presentation.