The Study of Topic Information Extraction from Web Pages Based on a New Method of Topic Information Calculation
(Original title: 基于新型主题信息量化方法的Web主题信息提取研究; cited by: 1)
Abstract: To address the insufficient precision of topic information extraction from Web pages, this paper proposes a new method for defining and quantifying topic information: topic information is divided into three forms, and each form is quantified with a different method. Building on this idea, the authors combine the DOM specification with page blocking and propose the IB-DOM tree, which is built on top of the DOM tree. Following a divide-and-conquer strategy, the method first locates the region that contains the topic information and then filters out the noise. Experiments show that the proposed method handles the conflict between completeness and accuracy in automatic topic information extraction reasonably well.
Source: New Technology of Library and Information Service (《现代图书情报技术》), indexed in CSSCI and the Peking University Core Journals list, 2008, No. 12, pp. 48-53 (6 pages)
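Funding: One of the research outcomes of the National 863 Program key project "Research on Key Technologies of Cross-Media Search and Development of Service Products" (Grant No. 2006AA010105), the National Natural Science Foundation of China project "Research on Semantics-Based Chinese Text Clustering" (Grant No. 60772081), and the Beijing municipal universities talent development program project "Innovation Team: Intelligent Search Engines and Text Mining" (Grant No. PXM2007_014224_044677).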
Keywords: Topic information of Web page; Information extraction; Information block; Semantic information; IB-DOM tree
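
The two-stage strategy described in the abstract (locate the topic region first, then filter noise) can be illustrated with a minimal sketch. This record does not say how the IB-DOM tree is built or how the three forms of topic information are quantified, so everything below is an assumption made for illustration: the `Node` class is a hypothetical stand-in for a DOM element, and the score (length of non-link text) and the link-ratio noise test are simple placeholders, not the paper's method.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element (not the paper's IB-DOM node)."""
    tag: str
    text: str = ""        # text outside of <a> anchors
    link_text: str = ""   # text inside <a> anchors
    children: List["Node"] = field(default_factory=list)


def score(node: Node) -> int:
    """Placeholder topic score: total length of non-link text in the subtree."""
    return len(node.text) + sum(score(c) for c in node.children)


def link_len(node: Node) -> int:
    """Total length of anchor text in the subtree."""
    return len(node.link_text) + sum(link_len(c) for c in node.children)


def locate_topic_block(root: Node, threshold: float = 0.8) -> Node:
    """Stage 1: descend while one child holds most of the parent's score."""
    node = root
    while node.children:
        best = max(node.children, key=score)
        if score(node) <= 0 or score(best) < threshold * score(node):
            break
        node = best
    return node


def is_noise(node: Node, max_link_ratio: float) -> bool:
    """Assumed noise test: the subtree's text is dominated by link text."""
    total = score(node) + link_len(node)
    return total > 0 and link_len(node) / total > max_link_ratio


def filter_noise(node: Node, max_link_ratio: float = 0.5) -> Node:
    """Stage 2: return a copy of the block without link-dominated children."""
    kept = [filter_noise(c, max_link_ratio)
            for c in node.children if not is_noise(c, max_link_ratio)]
    return Node(node.tag, node.text, node.link_text, kept)


# A toy page: navigation bar, main article, footer.
page = Node("body", children=[
    Node("nav", link_text="Home News Sports About"),
    Node("article", children=[
        Node("h1", text="Topic extraction from Web pages"),
        Node("p", text="The method first locates the topic region "
                       "and then filters out noise."),
        Node("div", link_text="Related: link one, link two"),
    ]),
    Node("footer", text="Copyright", link_text="Contact us Privacy"),
])

block = locate_topic_block(page)   # stage 1: find the topic region
clean = filter_noise(block)        # stage 2: drop noisy sub-blocks
print(block.tag, [c.tag for c in clean.children])
```

On this toy page the search stops at the `article` element because no single child holds 80% of its non-link text, and the link-only "Related" block is then removed. Both thresholds are arbitrary illustrative values.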