IBM Research

IBM Almaden Research Center

Avatar Semantic Search


Overview

New Release!

The Avatar team is pleased to announce the AlphaWorks release of System Text for Information Extraction, a platform that brings declarative information extraction to the enterprise. Based on the SystemT research prototype, System Text makes it easier to build enterprise applications that need to extract structured information from unstructured text.

This alpha release features:

  • AQL, a rule language that combines the familiar declarative syntax of SQL with the expressive power of IBM's algebraic extraction technology.
  • The System Text Development Environment, a tool for building extraction rules and testing them over user-defined document collections.

The goal of the Avatar project is twofold: (i) to enable the discovery and extraction of structured information buried in volumes of unstructured text (such as emails, web pages, and blogs), and (ii) to exploit this information to drive the next generation of search and business intelligence applications. Ongoing research in Avatar sits at the intersection of a number of disciplines, ranging from search and information retrieval to machine learning, information extraction, and probabilistic databases.

Members

  • Shivakumar Vaithyanathan (Manager)
  • Huaiyu Zhu
  • Sriram Raghavan
  • Rajasekar Krishnamurthy
  • Frederick Reiss
  • Eser Kandogan
  • Yunyao Li
  • Fatma Ozcan

Alumni

  • Alexander Loeser (post-doc visitor)
  • Erik Vee (post-doc visitor, now at Yahoo! Research)

Collaborators within IBM

  • Almaden Computer Science Theory Group
  • IBM India Research Lab
    • Prasad Deshpande
    • Ganesh Ramakrishnan

External collaborators

Avatar Team releases IBM OmniFind Personal Email Search

Simple keyword or text search is not always effective for quickly finding what you need. IBM® has gone beyond keywords by inventing a fast and accurate semantic search system for personal e-mail. IBM OmniFind Personal E-mail Search enables semantic searching by extracting and organizing concepts and relationships from personal e-mail. Any business e-mail user who must search in order to accomplish a business purpose will find this tool invaluable. Customization of semantic concepts and the ability to share these concepts with colleagues make this tool especially useful for large enterprise customers. IBM OmniFind Personal E-mail Search is easy to install and configure and automatically adjusts to desktop load.

Screenshot of IBM OmniFind Personal Email Search

Try it out by downloading an installer from the IBM AlphaWorks website.

Avatar Technology in Lotus Notes 8.01

Avatar Information Extraction technology in action in Lotus Notes 8.01!!

Some of the core information extraction technology that powers the Lotus Notes Live Text feature (http://www-306.ibm.com/software/lotus/products/notes/productivitytools.html) announced in Lotus Notes 8.01 (http://ross.typepad.com/blog/2008/01/lotusphere-open.html) was provided by the Avatar project.

There are three major research efforts that constitute the core of the Avatar project:

Avatar Information Extraction System

Many enterprises maintain large repositories of unstructured text data, ranging from email and web pages to call-center records and business reports. Unfortunately, this data is of limited use as long as it remains in its unstructured form. Existing software can index unstructured text for keyword search, but deeper analysis is not possible without the ability to derive useful structured information from the raw text. Consequently, there has been an increasing interest in the problem of building annotators that extract structured information from unstructured enterprise data.

Although the value of efficient algorithms for information extraction has long been recognized, comparatively little attention has been paid to the problem of building scalable, easy-to-use systems that can leverage these algorithms. To fill this need, we are developing the Avatar Information Extraction System (Avatar IES) that enables relatively unsophisticated users to build powerful rule-based annotators that can operate over very large corpora. We view the task of building an annotator as a declarative query specification over an annotation database. By adapting the knowledge gained from query evaluation and optimization in database systems, we are building a scalable annotation development infrastructure.
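The idea of an annotator as a declarative query over an annotation database can be illustrated with a minimal Python sketch. This is not Avatar IES or AQL code: the `Span` record, the `extract_regex` and `follows_within` helpers, the regular expressions, and the sample sentence are all hypothetical, chosen only to show base annotations being extracted into relations and then combined by a join-like rule.

```python
import re
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    """One row of the (hypothetical) annotation database."""
    type: str
    begin: int
    end: int
    text: str


def extract_regex(doc: str, type_name: str, pattern: str) -> List[Span]:
    """Base rule: every regex match becomes one annotation."""
    return [Span(type_name, m.start(), m.end(), m.group())
            for m in re.finditer(pattern, doc)]


def follows_within(left: List[Span], right: List[Span],
                   max_gap: int) -> List[Tuple[Span, Span]]:
    """Join rule: pair each left span with right spans that start
    at most max_gap characters after the left span ends."""
    return [(l, r) for l in left for r in right
            if 0 <= r.begin - l.end <= max_gap]


doc = "Please call Alice Smith at 555-1234 or Bob Jones at 555-9876."

# Toy patterns (assumptions), far simpler than real extraction rules.
persons = extract_regex(doc, "Person", r"[A-Z][a-z]+ [A-Z][a-z]+")
phones = extract_regex(doc, "Phone", r"\d{3}-\d{4}")

# A derived annotation defined purely in terms of existing annotations.
person_phone = follows_within(persons, phones, max_gap=5)

for person, phone in person_phone:
    print(person.text, "->", phone.text)
```

Because the derived rule only states which existing annotations to combine, an engine is free to choose how to evaluate it, which is where database-style optimization could apply.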

Avatar Semantic Search

An increasingly important class of keyword search tasks is one in which users are looking for a specific piece of information buried within a few documents in a large collection. Examples include searching for (a) someone's phone number or a package tracking URL within a personal email collection, (b) reviews within blogs, and (c) the internal homepage of a person or a group within the company intranet. Modern information extraction techniques can be used to extract the concepts involved in these tasks (persons, phone numbers, restaurant reviews, etc.). However, because users provide only keywords as input, identifying the documents that contain the information of interest remains a challenge.

In Avatar Semantic Search, we are building a solution to this problem based on the concept of automatically generating "interpretations" of keyword queries. Interpretations are precise structured queries, over the extracted concepts, that model the real intent behind a keyword query. We have formalized the notion of interpretations and are addressing the various challenges in identifying the most likely interpretations for a given keyword query. The resulting interpretations are presented in an intuitive interface, resulting in a dialogue between the user and the system to determine the true user intent (as shown in the screenshots below).

[Screenshot 1] [Screenshot 2]
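To make the notion of interpretations concrete, the following Python sketch enumerates candidate interpretations for a keyword query. It is only an illustration under stated assumptions: the `CONCEPT_TRIGGERS` table, the `interpretations` function, and the ranking by number of concept matches are hypothetical and do not reflect the formalization or ranking actually used in Avatar Semantic Search.

```python
from itertools import product

# Hypothetical mapping from trigger keywords to extracted concepts.
CONCEPT_TRIGGERS = {
    "phone": "PhoneNumber",
    "number": "PhoneNumber",
    "tracking": "TrackingURL",
}


def interpretations(query):
    """Enumerate candidate structured readings of a keyword query.

    Each keyword is read either as a plain keyword or, when it is a
    known trigger, as a request for an extracted concept.
    """
    options = []
    for kw in query.lower().split():
        choices = [("keyword", kw)]
        if kw in CONCEPT_TRIGGERS:
            choices.append(("concept", CONCEPT_TRIGGERS[kw]))
        options.append(choices)

    results = [tuple(combo) for combo in product(*options)]
    # Toy ranking (an assumption): prefer readings with more concepts.
    results.sort(key=lambda interp: -sum(1 for kind, _ in interp
                                         if kind == "concept"))
    return results


for interp in interpretations("alice phone"):
    print(interp)
```

In this toy version, the query "alice phone" yields two readings: one that asks for a PhoneNumber annotation in documents mentioning "alice", and one that treats both words as plain keywords. A dialogue-style interface could present such candidates and let the user pick the intended one.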

Managing Uncertainty and Probabilistic Databases

The rule-based framework in Avatar IES allows users to handcraft rules with varying degrees of precision. In addition, the extracted annotations that are stored in the annotation database can themselves be used to match other rules. Managing the uncertainty resulting from this complex interaction between the rule matches is critical to improving the recall of extracted annotations. This presents a significant challenge for probabilistic database systems, an emerging area that uses probability theory as the formal underpinning of query evaluation over uncertain data. In our work, we have described several issues that arise when mapping rules to queries over a probabilistic database. Solving these challenges leads to a system in which the confidences associated with the annotations have precise semantics, and results in improved recall of extracted annotations.

The extracted annotations, along with their confidences, can also be used to improve the quality of data analysis in business intelligence applications such as OLAP. This leads to a more general problem of handling different types of data ambiguity in OLAP. Our work is the first to formally address this problem and propose a probabilistic database solution for handling uncertainty in OLAP. Building such a probabilistic database from the underlying data involves manipulating the data in a non-sequential fashion, and we have designed sophisticated algorithms that can overcome the resulting I/O bottleneck. Next, evaluating aggregation queries over the probabilistic database becomes challenging even for simple operators such as average and min/max. We have designed fast stream-based algorithms for these problems that use a combination of novel techniques involving generating functions and analytic methods from approximation theory.
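As a small illustration of how generating functions arise in aggregation over uncertain data, the Python sketch below computes the exact distribution of a COUNT aggregate. It is not the algorithm developed in this project: it assumes that each tuple is present independently with a known probability, and the function name `count_distribution` is hypothetical. Under that assumption, the count has generating function equal to the product of (1 - p_i) + p_i x over all tuples, and the coefficient of x^k is the probability that exactly k tuples are present.

```python
def count_distribution(probs):
    """Return [P(count = 0), P(count = 1), ...] for independent tuples.

    The list holds the coefficients of the generating function
    prod_i ((1 - p_i) + p_i * x), built one tuple at a time, so the
    input can be consumed as a stream.
    """
    coeffs = [1.0]  # generating function of the empty relation
    for p in probs:
        nxt = [0.0] * (len(coeffs) + 1)
        for k, c in enumerate(coeffs):
            nxt[k] += c * (1.0 - p)   # tuple absent: count stays k
            nxt[k + 1] += c * p       # tuple present: count becomes k + 1
        coeffs = nxt
    return coeffs


# Three annotations with confidences 0.9, 0.5 and 0.2 (made-up values).
dist = count_distribution([0.9, 0.5, 0.2])
expected_count = sum(k * pk for k, pk in enumerate(dist))
print(dist, expected_count)
```

The expected count recovered from the distribution equals the sum of the confidences, which is a quick sanity check on the construction. This direct expansion takes time quadratic in the number of tuples, so it is a didactic sketch rather than a fast algorithm.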

 

Enterprise Search (ES2) and Cloud Computing

Despite the success of Web search engines, search over large enterprise intranets still suffers from poor result quality. In the ES2 project, we are applying some of the techniques and tools developed as part of our work on information extraction and semantic search to the domain of intranet search. We have developed a prototype search engine that is designed from the ground up to provide high-quality answers to "navigational" intranet queries. Navigational queries are characterized by the fact that a small number of pages (often exactly one) form the "correct" answer. Typical examples include queries to locate the "home pages of entities" such as products, people, services, and research labs. The ES2 solution involves a combination of sophisticated offline text analytics-based page analysis, a specially designed navigational index, and intelligent runtime matching of search terms against this index. In addition, ES2 incorporates techniques to automatically personalize search results based on the work location ("geography") of the search user. Our ongoing research agenda includes (i) designing our core analysis and indexing infrastructure to scale further by exploiting publicly available cloud computing infrastructure (such as Hadoop); (ii) extending our work on personalization to support personalization based on job profiles; (iii) extending our system architecture and algorithms to support incremental updates to the document collection; and (iv) designing tools and administration utilities to help maintain and improve such a search engine as the intranet evolves.
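The interplay of offline analysis, a navigational index, and geography-based personalization described above can be pictured with a toy Python sketch. Everything in it is an assumption made for illustration: the `NavigationalIndex` class, its order-insensitive term matching, the example URLs, and the "same geography first" ranking are hypothetical and are not taken from the ES2 prototype.

```python
from collections import defaultdict


class NavigationalIndex:
    """Toy index mapping entity names to their home pages."""

    def __init__(self):
        self._entries = defaultdict(list)

    @staticmethod
    def _key(text):
        # Normalize to a case-insensitive, order-insensitive term set.
        return frozenset(text.lower().split())

    def add(self, entity_name, url, geography=None):
        """Record a home page found by (hypothetical) offline analysis."""
        self._entries[self._key(entity_name)].append((url, geography))

    def lookup(self, query, user_geography=None):
        """Return matching home pages, user's geography first."""
        hits = self._entries.get(self._key(query), [])
        # sorted() is stable, so ties keep their insertion order.
        ranked = sorted(hits, key=lambda hit: hit[1] != user_geography)
        return [url for url, _ in ranked]


index = NavigationalIndex()
index.add("Travel Services", "http://intranet.example.com/travel/us", "US")
index.add("Travel Services", "http://intranet.example.com/travel/in", "IN")

print(index.lookup("travel services", user_geography="IN"))
```

A user located in India who searches for "travel services" sees the India page first, while a query that matches no indexed entity returns an empty list rather than a long list of weak keyword matches.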

Selected Publications and Presentations
