Automatic Word Categorization: An Information-theoretic Approach

M.M. Lankhorst and R. Moddemeijer

University of Groningen, Department of Computing Science,
P.O. Box 800, NL-9700 AV Groningen, The Netherlands,
phone: +31.50.363 3940 - fax: +31.50.363 38005 - e-mail: rudy@cs.rug.nl

Abstract

This paper presents a novel approach to the automatic categorization of words from raw data. We count occurrences of word pairs in text and use a hierarchical clustering technique on this frequency data to obtain a classification of words into linguistic categories. As a distance criterion in the clustering process, we use the loss of mutual information caused by combining two clusters into a single new cluster.
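
The clustering step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that adjacent word pairs are counted, that the same clustering is applied to both positions of a pair, and that the closest pair of clusters is found by exhaustive search. The names mutual_information, merge and cluster are hypothetical, as is the toy corpus.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(pair_counts):
    """Mutual information (in bits) between the left and right cluster of a pair."""
    total = sum(pair_counts.values())
    left, right = Counter(), Counter()
    for (a, b), n in pair_counts.items():
        left[a] += n
        right[b] += n
    return sum(
        (n / total) * math.log2(n * total / (left[a] * right[b]))
        for (a, b), n in pair_counts.items()
        if n > 0
    )


def merge(pair_counts, a, b):
    """Return new pair counts in which clusters a and b are replaced by a | b."""
    new = a | b
    merged = Counter()
    for (x, y), n in pair_counts.items():
        x = new if x in (a, b) else x
        y = new if y in (a, b) else y
        merged[(x, y)] += n
    return merged


def cluster(tokens, n_clusters):
    """Greedy agglomerative clustering; distance = loss of mutual information."""
    # Assumption: a "word pair" is two adjacent tokens; each word starts
    # as its own cluster, represented by a frozenset of words.
    pair_counts = Counter(
        (frozenset([u]), frozenset([v])) for u, v in zip(tokens, tokens[1:])
    )
    clusters = {frozenset([w]) for w in tokens}
    history = []
    while len(clusters) > max(n_clusters, 1):
        current = mutual_information(pair_counts)
        candidates = combinations(sorted(clusters, key=sorted), 2)
        # Pick the merge whose resulting table loses the least information.
        loss, a, b = min(
            ((current - mutual_information(merge(pair_counts, a, b)), a, b)
             for a, b in candidates),
            key=lambda item: item[0],
        )
        pair_counts = merge(pair_counts, a, b)
        clusters = (clusters - {a, b}) | {a | b}
        history.append((a | b, loss))
    return clusters, history


# Toy corpus (hypothetical), reduced from five words to three clusters.
tokens = "the cat sat the dog sat the cat ran the dog ran".split()
clusters, history = cluster(tokens, 3)
for merged, loss in history:
    print(sorted(merged), round(loss, 6))
```

Recomputing the mutual information for every candidate merge keeps the sketch short but is expensive; for a realistic vocabulary one would update only the affected rows and columns of the frequency table.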

The main advantage of such a method is the ability to construct linguistic categories that combine both syntactic and semantic/pragmatic factors, which may help in reducing the number of nonsensical analyses that a language processing system will produce.

We show results of applying this method to letter data and to a word corpus. The resulting categories clearly exhibit this blending of syntactic and semantic aspects.

The method can form the basis of a system that uses a much finer categorization than is feasible using traditional grammar-based approaches. We plan to use it in a layered system of artificial neural networks that are trained to recognize higher-level constituents.


Keywords

linguistic categorization, information theory, mutual information.

Published

University of Groningen, Department of Computing Science, Groningen, Computing Science Report CS-9209