Using Sequences of Words for Non-Disjoint Grouping of Documents

Published: 2015-04-27 Issue: 03 Volume: 29 Page: 1550013
ISSN: 0218-0014
Container-title: International Journal of Pattern Recognition and Artificial Intelligence
Language: en
Short-container-title: Int. J. Patt. Recogn. Artif. Intell.

Author:

Ben N'Cir Chiheb-Eddine¹, Essoussi Nadia¹

Affiliation:

1. LARODEC, ISG Tunis, University of Tunis, 41 Avenue de la Liberté, Cité Bouchoucha, 2000 le Bardo, Tunisie

Abstract

Grouping documents by their textual content, known as text clustering, is an important application of clustering. This paper addresses two issues in text clustering: the detection of non-disjoint groups and the representation of textual data. A text document can discuss several topics and should therefore belong to several groups, so the learning algorithm must be able to produce non-disjoint clusters and assign documents to several clusters. Because text documents are unstructured data, applying a learning algorithm requires first preparing the documents for numerical analysis using the vector space model (VSM). This representation ignores correlation between terms and gives no importance to the order of words in the text. We therefore present an unsupervised learning method, based on the word sequence kernel, that takes into account both the correlation between adjacent words and the possibility that a document belongs to more than one cluster. In addition, to facilitate the use of this method in text-analytic practice, we present the "DocCO" software, which is publicly available. Experiments on several text collections show that the proposed method outperforms existing overlapping methods that use the VSM representation in terms of clustering accuracy.
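
To make the contrast with VSM concrete, the following is a minimal sketch of a word sequence kernel, assuming the usual gap-weighted subsequence formulation applied to words instead of characters. It is not the authors' implementation and is not taken from DocCO; the function names, the subsequence length `n`, and the decay factor `lam` are illustrative assumptions.

```python
from math import sqrt


def word_sequence_kernel(s, t, n=2, lam=0.5):
    """Gap-weighted kernel over shared word subsequences of length n.

    s, t : lists of words (hypothetical, already tokenized documents).
    lam  : decay in (0, 1]; each shared subsequence contributes lam raised
           to the total span it covers in s plus the span it covers in t.
    """
    if min(len(s), len(t)) < n:
        return 0.0
    # kp[i][j] holds the auxiliary kernel K'_l for prefixes s[:i] and t[:j].
    # K'_0 is 1 everywhere.
    kp = [[1.0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for _ in range(1, n):
        nxt = [[0.0] * (len(t) + 1) for _ in range(len(s) + 1)]
        for i in range(1, len(s) + 1):
            kpp = 0.0  # running K''_l along t for the current word of s
            for j in range(1, len(t) + 1):
                kpp *= lam
                if s[i - 1] == t[j - 1]:
                    kpp += lam * lam * kp[i - 1][j - 1]
                nxt[i][j] = lam * nxt[i - 1][j] + kpp
        kp = nxt
    # Close each subsequence on a matching final word.
    return sum(
        lam * lam * kp[i - 1][j - 1]
        for i in range(1, len(s) + 1)
        for j in range(1, len(t) + 1)
        if s[i - 1] == t[j - 1]
    )


def normalized_kernel(s, t, n=2, lam=0.5):
    """Cosine-style normalization so that a document scores 1 with itself."""
    denom = sqrt(word_sequence_kernel(s, s, n, lam) * word_sequence_kernel(t, t, n, lam))
    return word_sequence_kernel(s, t, n, lam) / denom if denom > 0 else 0.0


a = "dog bites man".split()
b = "man bites dog".split()
# Both documents have the same bag of words, so a plain VSM vector cannot
# tell them apart, but they share no ordered word pair.
print(sorted(a) == sorted(b), normalized_kernel(a, b))
```

A kernel of this kind yields a document similarity matrix that a kernel-based clustering method can consume directly, without building explicit term vectors.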

Publisher

World Scientific Pub Co Pte Lt

Subject

Artificial Intelligence, Computer Vision and Pattern Recognition, Software

Link

https://www.worldscientific.com/doi/pdf/10.1142/S0218001415500135

Reference: 20 articles.

1. Clustering of document collection – A weighting approach

2. A comparative study of efficient initialization methods for the k-means clustering algorithm

3. Research of fast SOM clustering for text information

Cited by 9 articles.

1. H-mrk-means: Enhanced Heuristic mrk-means for Linear Time Clustering of Big Data Using Hybrid Meta-heuristic Algorithm;Journal of Information & Knowledge Management;2024-05-11

2. A novel linear time clustering using heuristically improved mrk-medoids based on modified squirrel search algorithm;Australian Journal of Electrical and Electronics Engineering;2024-04-21

3. A parallel text clustering method using Spark and hashing;Computing;2021-04-07

4. PSubCLUS: A Parallel Subspace Clustering Algorithm Based On Spark;IEEE Access;2021

5. On the use of ensemble method for multi view textual data;Journal of Information and Telecommunication;2020-05-26