Semantic Scholar Open Access 2012 6 citations

Heuristic Frequent Term-Based Clustering of News Headlines

Nibir Bora B. P. Mishra Satchidananda Dehuri

Abstract

Document clustering assigns documents to groups (called clusters) in accordance with the general clustering rule of ‘high intra-cluster document similarity and low inter-cluster document similarity’. In this study, we propose novel heuristics for clustering news headlines. News headlines are grammatically and semantically different from larger bodies of text, such as blog posts and reviews. Based on the heuristics, we implemented versions of the frequent term-based and frequent noun-based clustering algorithms. Both algorithms, along with k-means, regular frequent term-based clustering and regular frequent noun-based clustering, were evaluated using five datasets: Reuters343 and Reuters2388 (news headlines), and CICLing-2002, Hep-ex and KnCr (scientific abstracts). Interpreting the results with common external cluster quality measures (purity, entropy and F-measure), we found that the heuristics performed on par with, or even better than, traditional clustering algorithms and a few other intuitive algorithms when tested on the datasets comprising news headlines. However, on the datasets comprising scientific abstracts, the results were not favorable.
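As a rough illustration of two of the external quality measures named in the abstract, the sketch below computes purity and entropy for a clustering whose true class labels are known. It uses the common textbook definitions, which may differ in detail (for example, in normalization) from the ones used in the paper; the function names `purity` and `entropy` and the toy headline labels are assumptions made for this example only.

```python
from collections import Counter
from math import log2


def _class_counts_per_cluster(clusters, labels):
    """Group true class labels by cluster id and count them."""
    by_cluster = {}
    for cluster_id, label in zip(clusters, labels):
        by_cluster.setdefault(cluster_id, Counter())[label] += 1
    return by_cluster


def purity(clusters, labels):
    """Fraction of documents that belong to the majority class of their cluster."""
    by_cluster = _class_counts_per_cluster(clusters, labels)
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(labels)


def entropy(clusters, labels):
    """Size-weighted average of the class entropy (in bits) inside each cluster."""
    by_cluster = _class_counts_per_cluster(clusters, labels)
    n = len(labels)
    total = 0.0
    for counts in by_cluster.values():
        size = sum(counts.values())
        # Entropy of the class distribution within this cluster.
        h = -sum((k / size) * log2(k / size) for k in counts.values())
        total += (size / n) * h
    return total


# Hypothetical toy data: six headlines, two clusters, two true classes.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["sports", "sports", "politics", "politics", "politics", "sports"]

print(purity(clusters, labels))
print(entropy(clusters, labels))
```

Higher purity and lower entropy both indicate clusters that align more closely with the true classes; a clustering that matches the classes exactly has purity 1 and entropy 0.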

Authors (3)

Nibir Bora

B. P. Mishra

Satchidananda Dehuri

Citation Format

Bora, N., Mishra, B.P., Dehuri, S. (2012). Heuristic Frequent Term-Based Clustering of News Headlines. https://doi.org/10.1016/J.PROTCY.2012.10.052

Journal Information
Year Published
2012
Language
en
Total Citations
6
Source Database
Semantic Scholar
DOI
10.1016/J.PROTCY.2012.10.052
Access
Open Access ✓