Open Domain Question Answering on the WWW

By Jason Kantz, August 2001


The appeal of being able to type a question into the computer and receive an answer has been renewed by the broad availability of information on the World Wide Web (WWW).  Ideally, a question answering system that uses the WWW as its knowledge base would be able to answer a broad range of questions.  Practically, the size and unstructured nature of the WWW makes this a very difficult task.  Though question answering has been around for some time in different forms, it is still a challenging topic that is actively researched.

The goal of Question Answering systems is to take a natural language question, and return the correct answer as a short natural language response, but the answer could also refer to a document where one can read the answer.  Question Answering systems have taken different forms since the sixties.  The first Question Answering systems were natural language query interfaces to relational databases (Androutsopoulos et al., 1995).  Question answering systems from the seventies included the text understanding programs such as SAM (Schank and Colby, 1973), Ms. Malaprop (Charniak, 1977), PAM (Wilensky, 1978) and POLITICS (Carbonell, 1979).  These programs all parsed natural language text and created a knowledge base augmented by a knowledge-engineered structure like a situational script, a frame, or a plan.  The programs demonstrated their understanding of a text by producing a paraphrase or by answering questions.  Because of the amount of knowledge engineering involved, the domains of these systems were all limited to certain topics.  SAM dealt with stereotyped episodes like riding a subway or eating in restaurants; PAM dealt with goal-following individuals; POLITICS dealt with newspaper articles about foreign affairs.

Examining three recent systems, FAQ Finder, START, and MULDER will reveal different approaches and challenges to using WWW resources as a knowledge base for a QA system.  The FAQ Finder System (Burke, et al., 1997), developed at the University of Chicago, uses FAQ files and the WordNet lexicon as its knowledge base.  START was developed by Boris Katz et al. and uses sentence level natural language processing to match questions with sentence representations stored in its knowledge base.  The digested sentences can be derived from and refer to resources on the WWW (Katz, 1997).  MULDER (Cody Kwok et al., 2001) developed at the University of Washington, attempts to find answers using search engines that index the WWW.  The next three sections detail how these systems create a knowledge base using WWW resources and match natural language questions with structures in the knowledge base to generate answers.

FAQ Finder

In order to structure FAQ information for question answering, the FAQ Finder system requires an editing process that is helped along by two tools, FAQ Minder and FAQ grinder.  These two tools are used to create a knowledge base by preprocessing FAQ files[1].  FAQ grinder is a subsystem which attempts to automatically match and tag the Question/Answer pairs in documents by trying formats that have already been seen.  Using regular expressions to extract the QA pairs, FAQ grinder integrates the questions into the rest of the system.  FAQ Grinder tags and returns “about 90% of the number of Q&A pairs as hand-tagging”, but with potential parsing errors (Burke et al., 1997, pg 11).  In order to monitor this process, the researchers built FAQ Minder, a user interface that enables inspection of the tagged and untagged version of a FAQ file before it is accepted into the system.

Internally FAQ Finder represents Q&A pairs in a format called a "qtext" or "question text", consisting of a string containing the question, a list of word data structures representing the question with pointers to WordNet information, a term vector, and a pointer to the question's location in the FAQ file.

The first step in Question matching uses the SMART information retrieval system (Buckley, 1985) to match the input question with a FAQ file in the set.  The QA pairs in the top-ranking FAQ file are then matched with the input question.  The matching of a question input and QA pairs in a file consists of computing a score with two components.  One component is a statistical similarity score based on the vector space model (Salton and Yang, 1975) where each term in a vector has a significance value based on the well-known tfidf measure.  The other scoring component is a semantic similarity score.  The combination of a statistical score with a semantic score reflects recent interest in researching how NLP techniques can improve upon current Information Retrieval methods (Voorhees, 1999).

Since term-vector comparison does not take the meaning of a question into account, synonyms and homographs can skew retrieval results.  Homographs are words that can refer to two or more distinct concepts.  For example the word 'bank' can refer to the sides of a river, or it can refer to a financial institution.  Homographs reduce the precision of answer retrieval because false matches might be made.  With synonyms two distinct words represent the same concept, so if a synonymous word appears in the input question, recall of an answer will be worse because true matches may be missed.  In conceptual indexing systems this is called "the paraphrase problem"--the situation that arises when terms used in a request are different from terminology used in the information sought (Woods, 1997).

To illustrate the problem of synonyms and homographs, I broke 72 questions from the Common Lisp FAQ into separate documents.  I then indexed these documents by stemming the words in each document and computing the tfidf value for each word.  I then implemented a vector space model search using the web server included in the Allegro Common Lisp distribution from Franz, Inc.  Figure 1 shows how the query “Is Lisp faster than C?” returns an accurate result.  However, Figure 2 shows that a similar query, “Is Lisp quicker than C?” does not match the correct terms because of the synonym “quicker”.

Figure 1

Figure 2

I tested the problem of homographs with the question “What should I read?” (Figure 3) There is no information to distinguish the term “read” from it’s two senses: read, as in read a book, and read, as in the built-in Common Lisp function.  So the search returns results relevant to both senses of the word.

Figure 3

To address this problem, FAQ Finder uses shallow lexical semantics taken from WordNet and a technique called marker passing based on algorithms developed by Quillian (1968).  To determine the semantic relationship between two words, FAQ Finder traverses hyponym links and synonym sets in WordNet (Miller, 1995).  Hyponym links serve as is-a relations for nouns and verbs, and the synonym sets are used to relate adjectives and adverbs.  Two questions are then scored for similarity by creating a score matrix of all possible combinations of word-to-word comparisons.  Each comparison results in a score based on the scaling of the path length between the two words into a range.  The word-to-word comparison matrix is then reduced to a single number that measures the semantic similarity of two questions.

To solve the problem of homographs that represent different parts of speech, Burke looked at three approaches.  The first approach was to perform unrestricted marker passing.  This meant that if a word had more than one sense, all senses were scored.  However, this approach led to too may false matches.  The second approach was to use an existing part-of-speech tagger.  They tested the Xerox Tagger (Cutting, 1992), which tags words based on their statistically most frequent word sense.  The third approach was to write a bottom-up chart parser that used a context-free grammar built from an on-line dictionary.  The tagging and parsing methods worked equally well when Burke measured their effects on precision (Burke et al., 1997, pg 20).

Since most words take on different forms, it can be hard to match a term as it appears in a question with its entry in WordNet.  For example, the two different terms "informative" and "informant" have the same stem "inform", but the two terms have an entirely different meaning (Burke et al., 1997, pg 19).  The FAQ Finder system performs a morphological reduction of terms using a rule-based algorithm.  The rules specify term endings to match and remove, and endings to add back on to the term.  Burke notes that since this is an open research problem, the rules used by the system are not a comprehensive set of morphological rules for English.

Classifying questions is one way of improving the question matching process.  Questions can have different types based on the beginning word: who, what, when, where, why, how, did.  FAQ Finder leaves this as an open problem, because some questions aren't clearly classified.  Burke gives the following questions as an example,

How often should I change the oil on a shovelhead Harley? 
What is the proper oil change interval for a shovelhead Harley?
(Burke et al., 1997, pg 21)

"How often" is an obvious indicator that the question is temporal, but "What is" can indicate many different types of questions.  In the section that examines the system called MULDER we'll see another approach to question classification.

START: SynTactic Analysis using Reversable Transformations

Two broad approaches to question answering are the federated approach and the distributed approach (Lin, 2002).  In the distributed approach the WWW is viewed as a large corpus of unstructured text with tremendous amounts of data redundancy.  This is the approach taken by MULDER, the system to be examined in the next section.  The federated approach takes portions of the WWW and treats them as if they were databases using techniques for managing semi-structured data.  The FAQ Finder loosely falls under this category as it relies on the semi-structured FAQ files available over the WWW.  START is another system that takes a federated approach to extracting answers from the WWW.  START stores digested representations of English sentences.  These representations then serve as annotations to other sources of information on the WWW.

Modules that compose START include:

1. Blitz, a hybrid heuristic--and corpus-based NL preprocessor that eases the burden on the parser by recognizing human names, institutions, places, addresses, currencies, etc. (Katz et al., 1998).

2. A lexicon for storing information for particular words--hyponyms, synonyms, semantic classes of verbs, etc.

3. An understanding module--a parser that produces embedded representational structures, which are indexed in a discrimination network that composes the knowledge base.

5. A generating module.

6. Omnibase, a "virtual" database that provides a uniform interface to multiple Web knowledge sources (Katz et al., 1999).

The main feature of the embedded representational structures is the ternary expression.  A parse tree is generated from natural language sentences and the parse tree is broken down into kernel sentences that usually contain only one subject and one verb.  These kernel sentences serve as a basis for a set of ternary expressions, the most important containing a subject-verb-object relation.  The following sentence is a simple example of a ternary expression.

Bill Surprised Hillary with his answer.

is transformed into

<<Bill surprise Hillary> with answer>
<answer related-to Bill>    (Katz, 1997)

Notice how the ternary expressions are embeddable within one another.  This is an elegant representation that allows START to store sentences of arbitrary complexity.  Alternative forms of sentences can be generated using an inference module based on S-Rules either at the time sentences are entered into the knowledge base or at the time a query is made without a match.

The rest of the representation consists of a history--a structure for storing adverbs and their position, tense, auxiliaries, voice, negation, etc., and a page that points to other sentences in the knowledge base that yield the same ternary expression.  The embedded representation structures are indexed by the subject-verb-object relations and stored in a discrimination network.

START stands for SyTactic Analysis using Reversable Transformations, and question matching in the generating module is largely the reverse of the understanding module.  Questions are parsed into ternary expressions that serve as templates used to search the knowledge base.  For example,

Whom did Bill surprise with his answer?

is transformed into

Bill surprised whom with his answer.

which is transformed into a template used for matching:

<<Bill surprise whom> with answer>
<answer related-to Bill>   (Katz, 1997)

Once a structure is matched in the knowledge base, the English response is generated from the ternary expressions and the additional information in the history.

As noted before, START takes the federated approach to integrating WWW resources into its knowledge base.  For resources that are unstructured and very difficult to parse into a meaningful representation, natural language annotations can serve as an index into the resource.  For example a long passage about John Adams discovery of Neptune can be annotated by the index sentence:

John Adams discovered Neptune using mathematics. (Katz, 1997)

START can then analyze the sentence and incorporate it into its knowledge base along with a pointer to the text fragment it annotates.  Then START can generate answers to a variety of questions on the topic and present a reference to the passage that was annotated.  This technique can be applied to text, images, sound, video, web pages and more.

For WWW resources that are more structured--like movie databases, weather web pages, map collections, etc.--START implements inference rules that invoke wrapper scripts on certain questions to extract the information from a previously identified source on the WWW.  This enables START use a movie database on the WWW to answer questions like,

Who directed "Raiders of the Lost Ark"?

To extract information from database sites, the developers of START created a tool called LaMeTH (Katz et al., 1999) that consists of a scripting language for matching and referencing structures in HTML and a parser that annotates an HTML document to indicate table boundaries.  LaMeTH is addresses the same problem as the FAQ Grinder and FAQ Minder tools used in the FAQ Finder system.  For semi-structured text, some tool is required to find the structure in the text.  Ideally this identification should be automated, but realistically structures have to be hand-tagged using some sort annotation tool.


The distributed nature of the information source is a major problem in the federated approach to WWW based question-answering systems.  In FAQ Finder the FAQ files were updated and re-parsed weekly.  There is no telling what changes might have been made to the structure of any file.  Likewise, information extracted from Web sites maintained by people who aren't explicitly coordinating their efforts with a particular question answering system are bound to change in a way that will break the system.  For example, a recent test of the START system with the question "What is the weather like in Ann Arbor?" led to a USA Today page announcing that the weather forecast section had changed its URL.  The distributed approach solves this problem by taking advantage of the redundancy of information on the WWW.  The next section will look at MULDER, a question answering system, which uses search engines to pull answers from resources on the WWW.


MULDER (Kwok et al., 2001) limits its scope to answering fact-based questions, and uses all of the WWW (as much of it that is indexed by Google)[2] as its knowledge base.  The system works by parsing questions and formulating specific queries for submission to a search engine.  The modules of MULDER include:

1.    Input: Natural Language question

2.    Parser: parse question to produce a parse tree

3.    Classifier: determine question type

4.    Query Formulator: create search engine query

5.    Search Engine: submit query to Google

6.    Answer Extraction: parse summaries in search result

7.    Answer Selection: determine credibility by redundancy

The classifier helps MULDER choose better candidate answers based on the question type.  MULDER parses questions with a parser based on statistical techniques called the Maximum Entropy-Inspired (MEI) parser (Charniak, 1999).  MULDER uses three question types tagged by the MEI parser to decide what type of answer it is trying to find.

An example of a "wh-adjective" phrase is "how many", which indicates MULDER should look for a number.  An example of a "wh-adverb" phrase is "when did", which indicates MULDER should look for a date.  As noted by Burke in his research with FAQ Finder, "wh-noun" phrases, such as "what year", "what height", or "what car", can start all types of questions and it is hard to tell what kind of answer is needed.  To solve the problem of ambiguous "wh-noun" questions, MULDER uses WordNet to classify the object of the question by tracing the hyponyms of a the object until the "measure" or "time" terms are reached.

MULDER creates a specific search that is a reformulation of the question, and extracts the answer from summaries in the search page based on information from its question classifier.  While the federated approach suffers from the possibility of an answer no longer being available, the distributed approach in a system like MULDER suffers from the possibility that an answer is unreliable.  Since the source of the knowledge might any document returned, it is hard to know if the answer is correct.  One way of measuring the correctness of an answer is to look for multiple occurrences of an answer in different documents.  The voting for an answer by multiple sources lends credibility to an answer, and allows possible answers to be ranked.  In this way MULDER attempts to use redundancy as a possible alternative to structured knowledge bases and understanding. 

MULDER was tested on 200 questions of varying type, topic, and difficulty from the TREC-8 question set.  34% of the TREC-8 questions input to MULDER had the correct answer as the top-ranked result.  1% of the question entered directly into Google displayed an answer in the top ranked results.

Research in Question Answering

New approaches and technologies for building robust open domain question answering systems based on the WWW are being researched and developed.  Three systems were explored here to reveal several challenges faced by these types of systems.  The resources on the WWW are largely unstructured.  Thus, hand editing and knowledge engineering are still required in order to structure these resources for a question answering system that will search and draw inferences.  The resources on the WWW are dynamic and changing, which places a maintenance requirement on the knowledge engineering tasks.  Systems that take a distributed approach and rely on the redundancy of information to avoid knowledge engineering face a credibility problem.  What other methods besides voting will establish the truth of answers scraped from search engines in an environment where anyone can say anything? 

Recent work and interest in developing question answering technology is growing out the Semantic Web Project and the Text Retrieval Conference (TREC).  TREC is an annual conference in Maryland sponsored by the National Institute of Standards and technology.  Each year since 1999 there has been a question answering track at TREC that seeks to produce standardized large-scale evaluation metrics for question answering systems.  The goal of the QA track is “to foster research on systems that retrieve answers rather than documents in response to a question (Voorhees 2001).  Future research into QA systems may also grow out of the Semantic Web project at W3C[3].  While the Semantic Web project hasn't coalesced around a single goal, the opportunity exists to build Question Answering systems should the use of the Resource Description Framework (RDF) become widespread.  RDF along with the language, RDF Schema, is a way of making assertions about Uniform Resource Identifiers (URI).  A well know URI is the URL, which identifies a web page, but a URI can be used to identify anything one wants to reason about.  RDF has been shown to translate readily into conceptual graphs, and it is also supportive of annotation techniques similar to those in the START system.  Thus the technology of the Semantic Web is poised to add the needed structure to the WWW.  This structure may support the semantic inference needs of QA researchers.  Examining systems like FAQ Finder, START, and MULDER provides a good start at understanding the challenges of building an open domain QA system.  Future TREC conferences, the growing popularity of the semantic web, and the appeal of being able to type a natural language question into the computer and receive a meaningful answer makes this an interesting topic for further research.


Ion Androutsopoulos, G. Ritchie, and P. Thanisch.  Natural language interfaces to databases--an introduction.  Natural Language Engineering, 1(1):29-81, 1995.


C. Buckley. 1985.  Implementationof the SMART Information Retrieval Retrieval System.  Thecnical Report 85-686, Cornall University.


Robin D. Burke, Kristian J. Hammond, Vladimir A. Kulyukin, Steven L. Lytinen, Noriko Tomuro, and Scott Schoenberg.  1997.  Question answering from frequently-asked question files: Experiences with the FAQ finder system.  Technical Report TR-97-05, University of Chicago.


J.G. Carbonell. "Subjective Understanding: Computer Models of Belief Revision." Ph.D. dissertation, TR-150. Yale university, Department of Computer Science, 1979.


E. Charniak. "Ms. Malaprop, A language Comprehension Program."  Proceedings of the National Conference on Artificial Intelligence, AAAI-77. Cambirdge, Mass. 1977.


E. Charniak. A Maximum-Entropy-Inspired Parser.  Technical Report CS-99-12, Brown University, Computer Science Department, August 1999.


D. Cutting, J. Kupiec, J. Pedersen and P. Sibun. 1992. A Practical Part-of-Speech Tagger. Proc. 3rd ANLP, Trento, Italy, pp. 133-140.


Boris Katz. "From Sentence Processing to Information Access on the World Wide Web" Online. Internet. 27Feb 1997. Available


Boris Katz, et al. "Blitz: A Preprocessor for Detecting Context-Independent Linguistic Structures" 1998. Online. Internet. 25 June 2002. Available


Boris Katz, Sue Felshin, Deniz Yuret, Ali Ibrahim, Jimmy Lin, Gregory Marton, Alton Jerome McFarland, and Baris Temelkuran. 1999. "Omnibase: Uniform Access to - Heterogeneous Data For Question Answering. Internet. 25 June 2002. Available


B. Katz, D. Yuret, J. Lin, S. Felshin, R. Schulman, A. Ilik, A. Ibrahim, and P. Philip Osafo-Kwaako. 1999. "Integrating Web Resources and Lexicons into a Natural Language Query System, IEEE International Conference on Multimedia Computing and Systems" 


Cody Kwok, Oren Etzioni, and Daniel S. Weld. 2001. "Scaling question answering to the Web" In Proceedings of teh Tenth International World Wide Web Conference (WWW10).



Jimmy Lin. "The Web as a Resource for Question Answering: - Perspectives And Challenges" Online. Internet. 25 June 2002. Available


G. Miller. "WordNet: A Lexical Database for English", Communications of the ACM 38(11) pp. 39-41, 1995.


M. Quillian. Semantic Memory. In M. Minsky, editor, Semantic Information Processing, pages 216 270. MIT Press, Cambridge, MA, 1968.


G. Salton, A. Wong, C.S. Yang.: A vector Space Model for Automatic Indexing.  Communications of the ACM. 18 (1975) 613-620


G. Salton. McGill, M. 1983. Introduction to modern information retrieval. New York: McGraw-Hill.


M. Sanderson: Word Sense Disambiguation and Information Retrieval. Proceedings of the Seventeenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval. Springer-Verlag (1994) 142-151.


R.C. Schank, and K.M. Colby, eds. Computer Models of Thought and Language.  San Francisco: W.H. Freeman, 1973.


C. Towell, Leacock, G., Voorhess, E.M.: Towards Building Contextual Representations of Word Senses Using Statistical Models. In: Boguraev, B., Pustejovsky, j. (eds.): Corpus Processing for Lexical Acquisition. MIT Press (1996) 98-113.


Ellen M. Voorhees, "Natural Language Processing and Information Retrieval," in Pazienza, MT (ed.), Information Extraction: Towards Scalable, Adaptable Systems, New York: Springer, 1999, pp. 32-48.


Ellen M. Voorhees, "Overview of the TREC 2001 Question Answering Track" National Institute of Standards and Technology. 2001. Online. Internet. 25 June 2002 Available


R.W Wilensky. "Understanding Goal-Based Stories." Ph.D. dissertation, TR-240. Yale University, Department of Computer Science, 1978.


W.A. Woods. Conceptual indexing: A better way to organize knowledge. Technical Report TR-97-61, Sun Microsystems Laboratories, April 1997.

[1] Obtained from

[2]Google claimed to have indexed 3 billion Web documents in December of 2001.


Copyright © 2005 Jason Kantz