
Research on a Webpage Content Similarity Calculation Method Based on the Vector Space Model (cited by: 4)
Abstract: To address data mining over massive collections of web pages, this paper proposes a vector-space-based algorithm for computing webpage content similarity, together with a software system framework. The system uses a search engine to extract the URLs of Chinese-encoded pages from a massive set of web pages, then extracts the Chinese characters from those pages, analyzes them to pick out Chinese notional words (content words), and builds a vector space model to compute the similarity between page contents. The system narrows the range of web documents that require similarity computation, saving considerable time and space, and lays a good foundation for the classification, querying, and intelligent processing of network information.
Source: Computer and Modernization (《计算机与现代化》), 2010, No. 9, pp. 53-55, 58 (4 pages)
Funding: Xihua University Talent Cultivation Fund (R0820208)
Keywords: vector space model; webpage content similarity
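
The pipeline outlined in the abstract (extract Chinese text, reduce it to terms, build term vectors, compare them) can be illustrated with a minimal Python sketch. This is not the authors' implementation. Several choices here are assumptions: character bigrams stand in for the notional-word extraction, raw term counts are used as vector weights because the abstract does not name a weighting scheme, and the function names `extract_terms`, `cosine_similarity`, and `page_similarity` are hypothetical.

```python
import math
import re
from collections import Counter

# Matches runs of common CJK Unified Ideographs (U+4E00..U+9FFF).
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def extract_terms(page_text: str) -> list[str]:
    """Return character bigrams taken from the Chinese runs in the text.

    Assumption: bigrams are only a stand-in for notional-word extraction;
    a real system would use a word segmenter and a part-of-speech filter.
    """
    terms = []
    for run in CJK_RUN.findall(page_text):
        if len(run) == 1:
            terms.append(run)  # keep isolated single characters
        terms.extend(run[i:i + 2] for i in range(len(run) - 1))
    return terms


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a page with no Chinese terms matches nothing
    return dot / (norm_a * norm_b)


def page_similarity(page_a: str, page_b: str) -> float:
    """Similarity of two pages based on their Chinese term vectors."""
    return cosine_similarity(Counter(extract_terms(page_a)),
                             Counter(extract_terms(page_b)))
```

Calling `page_similarity` on two page strings returns a score between 0 and 1. Non-Chinese markup and Latin text are ignored by the regular expression, which mirrors the step of keeping only Chinese characters before building the vectors.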

