Clustering and Visualization of Large Dissimilarity Datasets
Barbara Hammer
TU Clausthal, Germany

Abstract:

Clustering and visualization constitute key issues in computer-supported data inspection, and a variety of promising tools exist for such tasks, such as the self-organizing map and variations thereof. Real-life data, however, pose severe problems for standard tools. On the one hand, data are given by complex objects such as sequences of possibly different lengths, temporal signals, images, text data, graph structures, etc., and standard methods proposed for finite-dimensional vectors in Euclidean space cannot be applied. On the other hand, massive data sets have to be dealt with, such that the data neither fit into main memory nor can more than one pass over the data be afforded, i.e. standard methods simply cannot be applied due to the sheer amount of data. We present two recent extensions of topographic mappings which can deal with more general proximity data given by pairwise distances, and which can process streaming data of arbitrary size in patches, thus resulting in an efficient linear-time data visualization method for quite general data structures. We present the theoretical background as well as large-scale applications to the areas of text and multimedia processing based on the generalized compression distance.
