Semantic Scholar · Open Access · 2012 · 6 citations

Heuristic Frequent Term-Based Clustering of News Headlines

Nibir Bora, B. P. Mishra, Satchidananda Dehuri


Abstract

Document clustering deals with assigning documents to groups (called clusters) in accordance with the general clustering rule, ‘high intra-cluster document similarity and low inter-cluster document similarity’. In this study, we propose a novel heuristic for clustering news headlines. News headlines are grammatically and semantically different from larger bodies of text, like blog posts and reviews. Based on this heuristic, we implemented versions of the frequent term-based and frequent noun-based clustering algorithms. Both these algorithms, along with k-means, regular frequent term and frequent noun clustering, were evaluated using five datasets: Reuters343 and Reuters2388 (news headlines), and CICLing-2002, Hep-ex and KnCr (scientific abstracts). On interpreting the results based on common external cluster quality evaluation measures (purity, entropy and F-measure), it was found that the heuristic versions performed on par with, or even better than, traditional clustering algorithms and a few other intuitive algorithms when tested on the datasets comprising news headlines. However, on the datasets comprising scientific abstracts, the results were not favorable.
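The abstract names three external cluster quality measures (purity, entropy and F-measure) without defining them. The sketch below shows one common textbook formulation of each, computed from predicted cluster labels and gold class labels. It is an assumption that the paper uses these exact definitions (for example, base-2 logarithms and a class-size-weighted F-measure); the function names and the toy headline labels are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import log2


def contingency(clusters, classes):
    """Count documents per (cluster, class) pair."""
    table = {}
    for k, c in zip(clusters, classes):
        table.setdefault(k, Counter())[c] += 1
    return table


def purity(clusters, classes):
    """Fraction of documents that fall in their cluster's majority class."""
    table = contingency(clusters, classes)
    return sum(max(counts.values()) for counts in table.values()) / len(classes)


def entropy(clusters, classes):
    """Size-weighted mean of per-cluster class entropy in bits (lower is better)."""
    table = contingency(clusters, classes)
    n = len(classes)
    total = 0.0
    for counts in table.values():
        size = sum(counts.values())
        h = -sum((v / size) * log2(v / size) for v in counts.values())
        total += (size / n) * h
    return total


def f_measure(clusters, classes):
    """For each class take the best F1 over all clusters, weighted by class size."""
    table = contingency(clusters, classes)
    class_sizes = Counter(classes)
    n = len(classes)
    score = 0.0
    for c, n_c in class_sizes.items():
        best = 0.0
        for counts in table.values():
            n_kc = counts.get(c, 0)
            if n_kc == 0:
                continue
            precision = n_kc / sum(counts.values())
            recall = n_kc / n_c
            best = max(best, 2 * precision * recall / (precision + recall))
        score += (n_c / n) * best
    return score


# Toy example (hypothetical labels): six headlines, two clusters, two gold classes.
clusters = [0, 0, 0, 1, 1, 1]
classes = ["sports", "sports", "politics", "politics", "politics", "politics"]
print(purity(clusters, classes), entropy(clusters, classes), f_measure(clusters, classes))
```

Higher purity and F-measure indicate better agreement with the gold classes, while lower entropy is better; a clustering that reproduces the gold classes exactly scores purity 1, entropy 0 and F-measure 1 under these definitions.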

Topics & Keywords

Computer Science

Authors (3)

Nibir Bora

B. P. Mishra

Satchidananda Dehuri

Citation Format


Bora, N., Mishra, B. P., & Dehuri, S. (2012). Heuristic Frequent Term-Based Clustering of News Headlines. https://doi.org/10.1016/J.PROTCY.2012.10.052


Journal Information

Year Published: 2012
Language: en
Total Citations: 6
Source Database: Semantic Scholar
DOI: 10.1016/J.PROTCY.2012.10.052
Access: Open Access ✓