Name: Jonathan Koberstein
School: Brigham Young University
Relationship: son
Country: United States of America
First International Conference on Knowledge Science, Engineering and Management (KSEM'06)

August 5-8, 2006, Guilin, China (Co-located with PRICAI'06)

KSEM'06 Accepted Papers

551. Jonathan Koberstein and Yiu-Kai Ng. Using Word Clusters to Detect Similar Web Documents

Jonathan Koberstein and Yiu-Kai Ng, Using Word Clusters to Detect Similar Web Documents. In Proceedings of the First International Conference on Knowledge Science, Engineering and Management (KSEM'06), LNAI 4092, pp. 215-228, August 5-8, 2006, Guilin, China.


Using Word Clusters to Detect Similar Web Documents 
Book Series: Lecture Notes in Computer Science
Publisher: Springer Berlin / Heidelberg
ISSN: 0302-9743
Subject: Computer Science
Volume: 4092/2006
Book: Knowledge Science, Engineering and Management
DOI: 10.1007/11811220
Copyright: 2006
ISBN: 978-3-540-37033-8
DOI: 10.1007/11811220_19
Pages: 215-228
SpringerLink Date: Wednesday, July 26, 2006

Jonathan Koberstein (1) and Yiu-Kai Ng (1)

(1) Computer Science Department, Brigham Young University, Provo, UT 84602, USA

Abstract

It is relatively easy to detect exact matches in Web documents; however, detecting similar content in distinct Web documents with different words and sentence structures is a much more difficult task. A reliable tool for determining the degree of similarity between any two Web documents could help filter or retain Web documents with similar content. Most methods for detecting similarity between documents rely on some kind of textual fingerprinting or on a search for exactly matched substrings. This may not be sufficient, as changing the sentence structure or replacing words with synonyms can cause sentences with similar or identical content to be treated as different. In this paper, we develop a sentence-based Fuzzy Set Information Retrieval (IR) approach that uses word clusters, which capture the similarity between different words, to discover similar documents. Our approach has the advantage of detecting documents with similar, but not necessarily identical, sentences based on fuzzy-word sets. The three fuzzy-word clustering techniques that we consider are the correlation cluster, the association cluster, and the metric cluster, each of which generates word-to-word correlation values. Experimental results show that, by adopting the metric cluster, our similarity detection approach achieves a high accuracy rate in detecting similar documents and improves on previous Fuzzy Set IR approaches based solely on the correlation cluster.
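The abstract does not give formulas, so the following is only a minimal sketch of how word-to-word correlation values could drive a sentence-level fuzzy match. It assumes the textbook fuzzy set IR membership function, mu(w, S) = 1 - prod over k in S of (1 - c(w, k)), which may differ from the authors' exact formulation. The correlation table, the function names, and the averaging step are hypothetical choices made for illustration; in a real system the correlation values would come from a correlation, association, or metric cluster computed over a corpus.

```python
from math import prod

# Hypothetical, hand-picked word-to-word correlation values in [0, 1].
# A real system would derive these from a corpus (e.g. a metric cluster).
TOY_CORRELATION = {
    frozenset(("car", "automobile")): 0.9,
    frozenset(("fast", "quick")): 0.8,
}


def correlation(w1, w2):
    """Return the assumed correlation between two words (1.0 if identical)."""
    if w1 == w2:
        return 1.0
    return TOY_CORRELATION.get(frozenset((w1, w2)), 0.0)


def word_sentence_membership(word, sentence_words):
    """Degree to which `word` belongs to the fuzzy set of a sentence.

    Uses the textbook fuzzy set IR form: 1 - prod(1 - c(word, k)).
    """
    return 1.0 - prod(1.0 - correlation(word, k) for k in sentence_words)


def sentence_similarity(s1, s2):
    """Average membership of the words of s1 in s2 (an illustrative choice).

    Note that this measure is not symmetric: similarity(s1, s2) may
    differ from similarity(s2, s1).
    """
    if not s1:
        return 0.0
    return sum(word_sentence_membership(w, s2) for w in s1) / len(s1)


if __name__ == "__main__":
    a = ["fast", "car"]
    b = ["quick", "automobile"]
    print(sentence_similarity(a, b))
```

Under these toy values, two sentences that share no exact words can still score well above zero because their words are correlated, which is the behavior an exact-substring or fingerprinting method would miss.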