
A Multi-Abstract Deduplication Method Based on Duplicate Detection (cited by: 1)

Multi abstract remove repeat method for web mining research
Abstract: Web information mining currently returns large numbers of duplicate pages. This paper analyzes several key problems from the perspective of how Web information is organized and proposes a keyword-based algorithm for removing partially similar pages: the Web multi-abstract remove repeat method (MARR). MARR modifies the traditional Web information database, which is built on word tables and inverted files, by adding a field that records the number of the abstract block corresponding to each keyword. It applies a text summarization algorithm and indexes the result in inverted-file fashion. During retrieval, it filters or marks partial duplicates related to the query term according to how similar their content is with respect to that term, which yields a more reasonable organization of the retrieval results. MARR also replaces the traditional listing ordered by PageRank value with a tree-structured organization, to make information discovery easier for users. The method has been applied in a prototype Web retrieval system based on the MES intelligent information agent of the Sanming Iron and Steel Group.
Source: Computer Engineering and Design (《计算机工程与设计》), CSCD, Peking University Core Journal, 2006, No. 23, pp. 4521-4524, 4555 (5 pages)
Keywords: information retrieval; remove repeat (deduplication) method; text abstract (summarization); inverted file; tree structure
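The indexing and filtering idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the sentence-based `split_blocks` stands in for the text summarization step, Jaccard word overlap stands in for the unspecified similarity measure, the `threshold` value is arbitrary, and all function names and sample documents are hypothetical. Each posting in the inverted file carries an extra abstract block number, and hits whose query-relevant blocks are near-identical are attached as children of the first such hit, giving a two-level tree.

```python
from collections import defaultdict


def split_blocks(text):
    """Assumed stand-in for text summarization: one block per sentence."""
    return [s.strip() for s in text.split(".") if s.strip()]


def build_index(docs):
    """Return (index, blocks); index maps keyword -> [(doc_id, block_no)]."""
    index = defaultdict(list)
    blocks = {}
    for doc_id, text in docs.items():
        blocks[doc_id] = split_blocks(text)
        for block_no, block in enumerate(blocks[doc_id]):
            for word in set(block.lower().split()):
                # Extra field: the block number is stored beside the doc id.
                index[word].append((doc_id, block_no))
    return index, blocks


def jaccard(a, b):
    """Word-set overlap of two blocks, in [0, 1] (assumed similarity measure)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def search(keyword, index, blocks, threshold=0.8):
    """Group hits into a tree: representative hit -> list of near-duplicates."""
    tree = {}
    for doc_id, block_no in index.get(keyword.lower(), []):
        text = blocks[doc_id][block_no]
        for rep_doc, rep_block in tree:
            if jaccard(text, blocks[rep_doc][rep_block]) >= threshold:
                # Mark as a duplicate of an existing representative.
                tree[(rep_doc, rep_block)].append((doc_id, block_no))
                break
        else:
            # No similar representative found: start a new branch.
            tree[(doc_id, block_no)] = []
    return tree


# Hypothetical sample documents.
docs = {
    "a": "Steel output rose in March. The MES agent logs each batch",
    "b": "Steel output rose in March. Weather was mild",
    "c": "Steel prices fell sharply after the holiday",
}
index, blocks = build_index(docs)
tree = search("steel", index, blocks)
```

Here pages `a` and `b` differ overall but share the block that matches the query, so `b` is grouped under `a`, while `c` starts its own branch. Comparing only the query-relevant blocks, rather than whole pages, is what lets a sketch like this catch partial duplication.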


