Affiliation:
1. LARODEC, ISG Tunis, University of Tunis, 41 Avenue de la Liberté, Cité Bouchoucha, 2000 le Bardo, Tunisie
Abstract
Grouping documents based on their textual content is an important application of clustering referred to as text clustering. This paper deals with two issues in text clustering which are the detection of non-disjoint groups and the representation of textual data. In fact, a text document can discuss several topics and then, it must belong to several groups. The learning algorithm must be able to produce non-disjoint clusters and assigns documents to several clusters. Given that text documents are considered as unstructured data, the application of a learning algorithm requires to prepare a set of documents for numerical analysis by using the vector space model (VSM). This representation of text avoids correlation between terms and does not give importance to the order of words in the text. Therefore, we present in this paper an unsupervised learning method, based on the word sequence kernel, where the correlation between adjacent words in text and the possibility of document to belong to more than one cluster are not ignored. In addition, to facilitate the use of this method in text-analytic practice, we present the "DocCO" software which is publicly available. Experiments performed on several text collections show that the proposed method outperforms existing overlapping methods using VSM representation in terms of clustering accuracy.
Publisher
World Scientific Pub Co Pte Lt
Subject
Artificial Intelligence,Computer Vision and Pattern Recognition,Software
Cited by
9 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献