Automatic Word Categorization: An Information-theoretic Approach

M.M. Lankhorst and R. Moddemeijer

University of Groningen, Department of Computing Science,
P.O. Box 800, NL-9700 AV Groningen, The Netherlands,
phone: +31.50.363 3940 - fax: +31.50.363 38005 - e-mail: rudy@cs.rug.nl

Abstract

This paper presents a novel approach to the automatic categorization of words from raw data. We count occurrences of word pairs in text and apply a hierarchical clustering technique to this frequency data to obtain a classification of words into linguistic categories. As the distance criterion in the clustering process, we use the loss of mutual information caused by combining two clusters into a single new cluster.
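The clustering criterion can be illustrated with a minimal sketch. It is not the authors' implementation and rests on assumptions the abstract does not state: word pairs are taken to be adjacent words, a merged cluster replaces its members in both the left and the right position of a pair, and merging is greedy. All function names (`pair_counts`, `mutual_information`, `merge`, `cluster`) are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pair_counts(tokens):
    """Count adjacent word pairs (adjacency is an assumption of this sketch).

    Each word starts as its own singleton cluster, stored as a frozenset.
    """
    return Counter(
        (frozenset([a]), frozenset([b])) for a, b in zip(tokens, tokens[1:])
    )


def mutual_information(counts):
    """Mutual information in bits between the left and right cluster of a pair."""
    total = sum(counts.values())
    left, right = Counter(), Counter()
    for (a, b), n in counts.items():
        left[a] += n
        right[b] += n
    return sum(
        (n / total) * math.log2(n * total / (left[a] * right[b]))
        for (a, b), n in counts.items()
    )


def merge(counts, x, y):
    """Return pair counts with clusters x and y combined into one new cluster."""
    new = x | y
    merged = Counter()
    for (a, b), n in counts.items():
        a = new if a in (x, y) else a
        b = new if b in (x, y) else b
        merged[(a, b)] += n
    return merged


def cluster(tokens, n_clusters):
    """Greedily merge the two clusters whose combination loses the least
    mutual information, until only n_clusters remain."""
    counts = pair_counts(tokens)
    clusters = {frozenset([w]) for w in tokens}
    while len(clusters) > n_clusters:
        current = mutual_information(counts)
        # Sort for a deterministic choice when several merges tie.
        candidates = combinations(sorted(clusters, key=sorted), 2)
        x, y = min(
            candidates,
            key=lambda p: current - mutual_information(merge(counts, *p)),
        )
        counts = merge(counts, x, y)
        clusters -= {x, y}
        clusters.add(x | y)
    return clusters


if __name__ == "__main__":
    text = "the cat sleeps the dog sleeps a cat runs a dog runs".split()
    for group in cluster(text, 3):
        print(sorted(group))
```

This sketch recomputes the mutual information from scratch for every candidate merge, which is only practical for a toy vocabulary; on real text the loss would have to be updated incrementally.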

With this method, words are classified not only on the basis of their syntactic categories, but also with respect to aspects related to their meaning. The method can form the basis of a system that uses a much finer categorization of words than is feasible using traditional grammar-based approaches. We plan to use it in a layered system of artificial neural networks that are trained to recognize higher-level constituents.




Published

Fourteenth Symposium on Information Theory in the Benelux, May 17-18, 1993, Veldhoven, The Netherlands, pp. 62-69, Eds. Schouwhamer Immink, K.A. and Bot, P.G.M., Werkgemeenschap Informatie- en Communicatietheorie, Enschede, and IEEE Benelux Chapter on Information Theory, ISBN 90-71048-09-8