Question Answering Systems Project: Semantic Technologies in IBM Watson SS 2014

In this lab we will construct a question answering system similar to IBM Watson. Starting from a basic system that is provided, we will build individual components in student projects and combine them into a complete system. This encompasses several sources of knowledge (structured and unstructured), several language technology pre-processing steps, and machine learning methods for generating answers.
In the concurrent seminar "Knowledge Engineering for Question-Answering Systems", the architecture of question answering systems will be explained in detail, exemplified by IBM Watson. Participation in this seminar is highly recommended.

Important Dates

The project lab is largely conducted through individual appointments. There are a few compulsory events:




List of projects

List of projects and descriptions

Implementation is based on the references provided with the project descriptions
Each project implements one of the QA components (see the pipeline here, page 4)
Autonomous: input/output dependencies should be minimized
Collaboration with other groups is encouraged
Write a report
    Related work
    System architecture
    Implementation details

1. Question focus identification

Identify the focus of the question. Refer to slides 21, 79, 80... of the tutorial
Annotate specific categories of questions from Jeopardy! using WebAnno and train a classifier. About 256,600 questions from the Jeopardy! archive are available in a database
Razvan Bunescu and Yunfeng Huang (2010), Towards a General Model of Answer Typing: Question Focus Identification, CICLing 2010, Iasi, Romania, March 2010
Rani Pinchuk, Tiphaine Dalmas, and Alexander Mikhailian (2009), Automated Focus Extraction for Question Answering over Topic Maps, Fifth International Conference on Topic Maps Research and Applications, TMRA 2009, pp. 93-106

2. Abbreviations and QA
Answer questions of the type “What does X stand for?”
Resolve abbreviations in context
Example: What does TM stand for in machine learning?
The JobimText API or calculated model can be used to get abbreviations.
The Wikipedia disambiguation page can be used
Use Acronyms & Abbreviations
Evaluation with acronyms & abbreviations
Biemann, C., Riedl, M. (2013): Text: Now in 2D! A Framework for Lexical Expansion with Contextual Similarity. Journal of Language Modelling 1(1), pp. 55-95
Waleed Ammar, Kareem Darwish, Ali El Kahki, and Khaled Hafez (2011), ICE-TEA: In-Context Expansion and Translation of English Abbreviations, CICLing'11, pp. 41-54
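
The idea of resolving an abbreviation in context can be sketched as follows. This is only a minimal illustration, not the method of the referenced papers: it assumes that candidate expansions and their typical context words have already been collected (for example from JoBimText or a Wikipedia disambiguation page), and it picks the candidate whose context words overlap most with the question. The class name, the method name and the toy candidate data are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class AbbreviationExpander {

    /**
     * Returns the candidate expansion whose (lowercase) context words
     * overlap most with the tokens of the question. Ties are resolved
     * in favour of the candidate that comes first in the map.
     */
    public static String expand(String question, Map<String, Set<String>> candidates) {
        // Naive tokenization: lowercase, split on anything that is not a letter or digit.
        Set<String> tokens = new HashSet<>(
                Arrays.asList(question.toLowerCase().split("[^a-z0-9]+")));
        String best = null;
        int bestScore = -1;
        for (Map.Entry<String, Set<String>> e : candidates.entrySet()) {
            int score = 0;
            for (String word : e.getValue()) {
                if (tokens.contains(word)) {
                    score++;
                }
            }
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy data, invented for illustration only (assumption, not from a real resource).
        Map<String, Set<String>> candidates = new LinkedHashMap<>();
        candidates.put("Turing machine",
                new HashSet<>(Arrays.asList("computation", "automaton", "tape")));
        candidates.put("trademark",
                new HashSet<>(Arrays.asList("brand", "law", "registered")));
        candidates.put("topic model",
                new HashSet<>(Arrays.asList("machine", "learning", "documents")));

        // prints: topic model
        System.out.println(expand("What does TM stand for in machine learning?", candidates));
    }
}
```

A real system would replace the word-overlap score with a contextual similarity measure and would have to obtain the candidate list automatically.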

3. Named Entity Recognition (NER) for question answering
Simple QA system using large collections
Based on data redundancy instead of complex analysis
Only person- and location-related questions (who/where)
Use NER for passage processing

Who is the founder of Google?
Google was founded by [Larry Page]PER and [Sergey Brin]PER while …
[Larry Page]PER and [Sergey Brin]PER met at Stanford….
Google began in January 1996, as a research project by [Larry Page]PER ….
The story behind Google co-founder [Sergey Brin]PER's liaison….
What is the highest peak in Africa?
Mount [Kilimanjaro]LOC,…. It is the highest mountain in Africa
The highest mountain in Africa is Mt [Kilimanjaro]LOC which is found in Tanzania
Wikipedia index
Freebase Triples
Stanford Named Entity Recognizer
Eric Brill, Susan Dumais and Michele Banko (2002), An Analysis of the AskMSR Question-Answering System, EMNLP '02, pp. 257-264
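
The redundancy idea behind this project can be illustrated with a small sketch: count how often each entity of the expected type occurs in the retrieved passages and rank the candidates by frequency. The sketch assumes that the passages have already been tagged (for example by the Stanford Named Entity Recognizer) and are given in the inline [text]TYPE notation used in the examples above. The class and method names are hypothetical, and this is not the AskMSR implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RedundancyVoter {

    // Matches inline annotations such as "[Larry Page]PER" or "[Kilimanjaro]LOC".
    private static final Pattern ENTITY = Pattern.compile("\\[([^\\]]+)\\](PER|LOC)");

    /** Counts how often each entity of the given type occurs across all passages. */
    public static Map<String, Integer> countEntities(List<String> passages, String type) {
        Map<String, Integer> counts = new HashMap<>();
        for (String passage : passages) {
            Matcher m = ENTITY.matcher(passage);
            while (m.find()) {
                if (m.group(2).equals(type)) {
                    counts.merge(m.group(1), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /** Orders candidates by descending count; ties are broken alphabetically. */
    public static List<String> rank(Map<String, Integer> counts) {
        List<String> names = new ArrayList<>(counts.keySet());
        names.sort((a, b) -> {
            int byCount = counts.get(b) - counts.get(a);
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return names;
    }

    public static void main(String[] args) {
        List<String> passages = Arrays.asList(
                "Google was founded by [Larry Page]PER and [Sergey Brin]PER",
                "[Larry Page]PER and [Sergey Brin]PER met at Stanford",
                "Google began as a research project by [Larry Page]PER");
        Map<String, Integer> counts = countEntities(passages, "PER");
        // Prints the candidates with their vote counts, most frequent first.
        for (String name : rank(counts)) {
            System.out.println(name + "\t" + counts.get(name));
        }
    }
}
```

For a "who" question the expected type would be PER and for a "where" question LOC; mapping the question word to the type is left out of the sketch.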

4. Question answering over linked data
Translate natural language questions into linked data queries and retrieve answers
Approach: use the dependency graph of the question
Extract triple patterns using Stanford CoreNLP (SCNLP)
Use SPARQL queries to retrieve the data
Sherzod Hakimov, Hakan Tunc, Marlen Akimaliev, Erdogan Dogdu (2013), Semantic Question Answering System over Linked Data using Relational Patterns, EDBT '13, pp. 83-88
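
The last step, turning an extracted triple pattern into a SPARQL query, might look roughly like the sketch below. It assumes that the triple pattern has already been extracted and mapped to knowledge base identifiers. The DBpedia prefixes and the property dbo:foundedBy are used here as an assumption for illustration (the project description does not prescribe a dataset), and the class and method names are hypothetical.

```java
public class SparqlBuilder {

    /**
     * Builds a SELECT query from one triple pattern. Exactly one of the
     * three slots is expected to be the variable "?answer"; the other two
     * are prefixed names such as "dbr:Google".
     */
    public static String buildSelect(String subject, String predicate, String object) {
        return "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
                + "PREFIX dbr: <http://dbpedia.org/resource/>\n"
                + "SELECT DISTINCT ?answer WHERE { "
                + subject + " " + predicate + " " + object + " . }";
    }

    public static void main(String[] args) {
        // "Who is the founder of Google?" mapped by hand to a triple pattern
        // (the mapping itself is the hard part and is not shown here).
        System.out.println(buildSelect("dbr:Google", "dbo:foundedBy", "?answer"));
    }
}
```

The resulting string can be sent to a SPARQL endpoint with any SPARQL client library. Questions that need more than one triple pattern would require joining several patterns inside the WHERE clause.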

5. Candidate answer extraction using sequence tagging
The goal of the project is to extract candidate answers using sequence tagging
NER, POS tagging, Dependency parsing
Edit features + edit distance
Following Section 3 of the reference paper, use the POS/NER/DEP tags directly, just as one would in a chunking task
Based on the description in the paper, implement the edit script and alignment distance features
Use distant supervision approaches to semi-automatically create training data
Use the Jeopardy! dataset to extract relevant documents from Wikipedia and tag correct answers in CRF format

Xuchen Yao and Benjamin Van Durme (2013), Answer Extraction as Sequence Tagging with Tree Edit Distance, Proceedings of NAACL-HLT 2013, pp. 858–867
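
Creating training data by distant supervision can be sketched as follows: given a retrieved passage and the known answer from the Jeopardy! archive, every occurrence of the answer in the passage is labeled as an answer span, and all other tokens are labeled as outside. The sketch assumes whitespace tokenization and a one-token-per-line, tab-separated output as read by common CRF toolkits. The label names B-ANS, I-ANS and O, as well as the class and method names, are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DistantSupervisionTagger {

    /**
     * Labels every occurrence of the answer in the passage with B-ANS
     * (first token) and I-ANS (following tokens); all other tokens get O.
     * Returns one "token<TAB>label" line per token.
     */
    public static List<String> tag(String passage, String answer) {
        String[] tokens = passage.trim().split("\\s+");
        String[] answerTokens = answer.trim().split("\\s+");
        String[] labels = new String[tokens.length];
        Arrays.fill(labels, "O");

        for (int i = 0; i + answerTokens.length <= tokens.length; i++) {
            boolean match = true;
            for (int j = 0; j < answerTokens.length; j++) {
                if (!tokens[i + j].equalsIgnoreCase(answerTokens[j])) {
                    match = false;
                    break;
                }
            }
            if (match) {
                labels[i] = "B-ANS";
                for (int j = 1; j < answerTokens.length; j++) {
                    labels[i + j] = "I-ANS";
                }
                i += answerTokens.length - 1; // continue after the matched span
            }
        }

        List<String> lines = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            lines.add(tokens[i] + "\t" + labels[i]);
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : tag("Google was founded by Larry Page", "Larry Page")) {
            System.out.println(line);
        }
    }
}
```

Because matching is done on whitespace tokens, an answer followed directly by punctuation is not found; a real pipeline would use a proper tokenizer and add the POS, NER and dependency columns before the label column.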

Your own project idea?
Should fit the IBM Watson QA pipeline given in slide 3
Should be implemented in Java
Training/evaluation dataset should be available or easy to produce
Approach us before starting the project

Resources and datasets
Jeopardy! Question + Answer archive
    In sqlite database
    Including categories, clues, answers
    About 260,000 questions
    Download it here
Wikipedia as Solr Index
UIUC question dataset
    Includes taxonomy, training/test data (labeled)
    Can be used for training Question classification
    Download it here, here, and here
Question-Answer Dataset
    Manually-generated factoid question/answer pairs with difficulty ratings from Wikipedia articles.
    Dataset includes articles, questions, and answers.

Software and Toolbox
The open-source IBM Watson QA software package
BART co-reference resolution
JWNL (Java WordNet Library)
Stanford or Mate dependency parser
