Last update: 2003-08-13

ReX proposal Cambridge - KUB

international exchange of scholars for software projects

Research EXchange Proposal
University of Cambridge and Tilburg University
Ted Briscoe, April 2001

Overview

We propose an exchange visit by Sabine Buchholz of the Centre for the Induction of Linguistic Knowledge (ILK) at Tilburg University, The Netherlands, to the Natural Language and Information Processing (NLIP) Group at the Computer Laboratory, University of Cambridge, UK. The exchange would last 4 months and would support research on using memory-based learning techniques for the automatic construction of electronic dictionaries for text mining and related applications.

Description of the Sending Institute

The ILK (Induction of Linguistic Knowledge) group at Tilburg University was founded in 1995 by Walter Daelemans to study the combination of machine learning and language processing from the perspectives of language technology, linguistic discovery, and cognitive modeling. The group (currently with a staff of 14) is co-directed by Walter and Antal van den Bosch, has received funding for about 15 projects from NWO (Dutch NSF), SOBU, STW, and other agencies, and has a high publication output, mainly in the areas of computational linguistics and machine learning.

Description of the Receiving Institute

The NLIP group in the Computer Laboratory, University of Cambridge has been working on natural language processing and information retrieval for nearly 40 years. Ted has been a member of the group for over 15 years, has been PI/Coordinator of 9 national or EU funded grants, and has published around 50 articles. The automatic construction of electronic dictionaries for use in text processing applications has been a research theme for much of this time, beginning with Karen Sparck Jones' work on machine-readable thesauri in the 1960s.

Description of the Student

Sabine studied Computational Linguistics at the University of Saarbrücken, Germany. During her studies, she worked as a student assistant at the Language Technology Lab of the German Research Center for Artificial Intelligence in Saarbrücken on an electronic German dictionary. Her master's thesis was on building a tool for the semi-automatic validation of the subcategorisation information contained in this dictionary. After graduating in 1997, she started work as a PhD student in the ILK group in Tilburg. Her research focus is on the use of memory-based learning techniques for shallow parsing, especially grammatical relations finding, in English texts. A shallow parse can be the basis for such different applications as question-answering or subcategorisation extraction. The memory-based learning algorithm has not been used for relation finding before, so research includes finding the best task definition, algorithmic settings and feature representation.

Research Exchange Plan

One strand of work on electronic dictionary construction at Cambridge involves automatically subcategorising verbs according to the number and type of syntactic arguments they take and how these arguments contribute to meaning. An electronic dictionary containing such subcategories of verbs is a critical component for text mining, information extraction, web-based question-answering and many other text processing systems. However, extant systems make little or no use of subcategories because of the difficulties of learning them reliably. Such dictionaries must be dynamic objects that are continuously modified as new input is processed, because subcategorisation is closely linked to verb sense, which varies across topics and genres as well as over time.

Constructing verb subcategorisation dictionaries automatically from free text is non-trivial because it requires partial parsing of the phrases around occurrences of a given verb to identify potential arguments and their heads, differentiation of arguments from adjuncts, and identification of the mapping of arguments to the representation of the event or situation denoted by the verb. For example, given the input "Most UK readers of broadsheet newspapers probably believed the American election to have been so grossly mishandled that a constitutional crisis was imminent", we would like to recover the facts that "readers" is head of the noun phrase subject of "believed", that "election" is head of its syntactic noun phrase object but semantically object of "mishandled", and that the infinitive verb phrase "to ... mishandled" is its third syntactic argument. From this we would learn that one subcategory of "believe" can take 3 syntactic arguments where the last two semantically combine to denote the proposition believed.
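To make the target of the analysis concrete, the facts recovered from the example sentence could be recorded roughly as follows. This is only an illustrative sketch: the record layout, field names and category labels are assumptions made for this example and do not describe the actual Cambridge system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Argument:
    head: str            # head word, or the phrase span where no single head is given
    category: str        # syntactic category (labels are illustrative assumptions)
    syntactic_role: str  # role with respect to the verb being subcategorised
    semantic_note: str = ""  # free-text note on the semantic mapping, if it differs


@dataclass(frozen=True)
class VerbOccurrence:
    verb: str
    arguments: tuple


# The analysis of the example sentence, as described in the text above.
occurrence = VerbOccurrence(
    verb="believed",
    arguments=(
        Argument("readers", "NP", "subject"),
        Argument("election", "NP", "object",
                 semantic_note="semantically object of 'mishandled'"),
        Argument("to ... mishandled", "infinitive VP", "third argument",
                 semantic_note="combines with 'election' to denote the proposition believed"),
    ),
)

# The number of syntactic arguments is one ingredient of the subcategory.
arity = len(occurrence.arguments)
```

Records of this kind, collected over many occurrences of a verb, are what a subcategorisation dictionary would be built from.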

We have developed a system that classifies verb occurrences into one of the 160 or so extant verbal subcategories of English. However, this system is inherently noisy because of pervasive ambiguity in text. We have extensively explored statistical techniques to filter out misclassifications and can currently build dictionaries with around 85% precision (ratio of correct classes recovered to all classes recovered) and 65% recall (ratio of correct classes recovered to all correct classes). However, we may have reached a performance ceiling using statistical thresholding, smoothing and/or hypothesis testing because of the large number of rare combinations of specific verbs and subcategories and the lack of correlation between the unconditional probability of a subcategory and its conditional probability given specific verbs.
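The precision and recall measures defined above can be illustrated with a small calculation. The sketch below assumes that a dictionary is represented as a set of (verb, subcategory) pairs. The function name, the class labels and the toy data are hypothetical and are not taken from the Cambridge database.

```python
def precision_recall(recovered, gold):
    """Precision and recall over sets of (verb, subcategory) pairs.

    precision = correct classes recovered / all classes recovered
    recall    = correct classes recovered / all correct classes
    """
    correct = recovered & gold
    precision = len(correct) / len(recovered) if recovered else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data; the subcategory labels are invented for illustration.
gold = {
    ("believe", "NP"),
    ("believe", "NP_TOINF"),
    ("believe", "SCOMP"),
    ("give", "NP_NP"),
}
recovered = {
    ("believe", "NP"),
    ("believe", "NP_TOINF"),
    ("give", "NP_PP"),  # a misclassification
}

precision, recall = precision_recall(recovered, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case two of the three recovered classes are correct and two of the four correct classes are recovered, so precision is 2/3 and recall is 1/2.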

Sabine has applied the memory-based learning techniques developed for text processing applications by Walter Daelemans to the related problem of recognising individual arguments and adjuncts of verbs. In memory-based learning there is no thresholding or filtering of instances, which may ameliorate the handling of rare combinations if the similarity measures for matching existing combinations to new instances are powerful enough. In this project, we want to apply memory-based learning to the very large number of verb-subcategory combinations in the Cambridge database (built from over 20 million words of text). We will attempt to discover the algorithmic settings and feature representation which give the best performance in the memory-based framework and then compare its classification performance to our current best statistically-based system.
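The general idea of memory-based learning can be sketched as a nearest-neighbour classifier that stores every training instance and classifies a new instance by its most similar stored neighbours. The code below is a minimal sketch under stated assumptions, not the ILK software: the class name, the simple overlap distance, the three-feature representation and the toy instances are all chosen here for illustration only.

```python
from collections import Counter


def overlap_distance(a, b):
    """Number of feature positions at which two instances differ."""
    return sum(x != y for x, y in zip(a, b))


class MemoryBasedClassifier:
    """Keeps all training instances in memory; nothing is filtered out."""

    def __init__(self, k=1):
        self.k = k
        self.memory = []  # list of (features, label) pairs

    def train(self, instances):
        # Every instance is stored, however rare its combination is.
        self.memory.extend(instances)

    def classify(self, features):
        # Rank stored instances by distance and let the k nearest vote.
        ranked = sorted(
            self.memory,
            key=lambda inst: overlap_distance(inst[0], features),
        )
        votes = Counter(label for _, label in ranked[: self.k])
        return votes.most_common(1)[0][0]


# Hypothetical features: (verb, first following phrase, second following phrase).
training = [
    (("believe", "NP", "TOINF"), "NP_TOINF"),
    (("believe", "SCOMP", "-"), "SCOMP"),
    (("expect", "NP", "TOINF"), "NP_TOINF"),
    (("give", "NP", "NP"), "NP_NP"),
    (("give", "NP", "PP"), "NP_PP"),
]

classifier = MemoryBasedClassifier(k=3)
classifier.train(training)

# An unseen verb is still classified via its similarity to stored instances.
label = classifier.classify(("consider", "NP", "TOINF"))
print(label)
```

The choice of k, of the distance measure and of the features corresponds to the "algorithmic settings and feature representation" that the experiments would need to explore.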

Deliverables

We propose that Sabine submit a brief email report at the end of the first month and another at the end of the exchange. A final report detailing the work undertaken will be due a month after the exchange is over. This will take the form of a jointly authored technical report in the Computer Laboratory series. Our intention is also to submit a paper to a refereed conference or journal, summarising the outcome of the experiments.

The Computer Laboratory will be responsible for making brief email status reports at bi-monthly intervals and will report any problems or failures to meet expectations as they occur.

Futures

Both the ILK and NLIP groups have undertaken substantial work in the areas of robust wide-coverage language processing and its application to text mining and related tasks. Both groups are now focussing on the critical issue of automatic induction of linguistic knowledge from large textual corpora. However, we have been exploring quite different approaches and have developed complementary resources. We hope that the experiments we undertake will allow us to understand better the relative strengths and weaknesses of our approaches and will lead directly to more substantial collaboration, perhaps within an EU IST framework project on a task such as web-based question answering, where effective dynamic acquisition and deployment of electronic subcategorisation dictionaries would substantially improve the performance of current systems.

Time Frame

Sabine would like to come to Cambridge for 4 months starting in late July or early August 2001 and ending at the latest by December 2001. We think this is the minimum time needed for her to familiarise herself with the Cambridge database and infrastructure, port the ILK memory-based learning software, and run the experiments.