The Study of Topic Information Extraction from Web Pages Based on a New Method of Topic Information Calculation (基于新型主题信息量化方法的Web主题信息提取研究)

Cited by: 1

Abstract: To address the imprecise extraction of topic information from Web pages, this paper proposes a new method for defining and quantifying topic information: topic information is divided into three forms, and each form is quantified with its own method. Building on this idea, the authors combine the DOM specification with page blocking and propose the IB-DOM tree, which is built on top of the DOM tree. Following a divide-and-conquer approach, the method first locates the region that contains the topic information and then filters out noise. The experimental results show that the method handles the tension between completeness and accuracy in automatic topic information extraction from Web pages relatively well.
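
The two-stage idea in the abstract (locate the topic region, then filter noise) can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract names neither the three forms of topic information nor the construction of the IB-DOM tree, so the sketch substitutes a plain block tree and two stand-in signals (plain text length and link text length). The names `Block`, `BLOCK_TAGS`, `locate_region`, `extract_topic_text`, and both thresholds are hypothetical.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Hypothetical choice of tags that open a new block; not taken from the paper.
BLOCK_TAGS = {"body", "div", "table", "td", "ul", "p"}

@dataclass
class Block:
    tag: str
    children: list = field(default_factory=list)
    text: list = field(default_factory=list)  # non-link text held directly
    text_len: int = 0                         # characters of non-link text
    link_len: int = 0                         # characters of anchor text

    def totals(self):
        """Return (text_len, link_len) summed over this block's subtree."""
        t, l = self.text_len, self.link_len
        for child in self.children:
            ct, cl = child.totals()
            t, l = t + ct, l + cl
        return t, l

    def score(self):
        """Stand-in score: plain text counts for, link text counts against."""
        t, l = self.totals()
        return t - l

class BlockTreeBuilder(HTMLParser):
    """Build a simplified block tree (a stand-in for the IB-DOM tree)."""
    def __init__(self):
        super().__init__()
        self.root = Block("root")
        self.stack = [self.root]
        self.in_link = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link += 1
        elif tag in BLOCK_TAGS:
            block = Block(tag)
            self.stack[-1].children.append(block)
            self.stack.append(block)

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = max(0, self.in_link - 1)
        elif tag in BLOCK_TAGS and len(self.stack) > 1 and self.stack[-1].tag == tag:
            self.stack.pop()

    def handle_data(self, data):
        s = data.strip()
        if not s:
            return
        if self.in_link:
            self.stack[-1].link_len += len(s)
        else:
            self.stack[-1].text_len += len(s)
            self.stack[-1].text.append(s)

def build_tree(html):
    parser = BlockTreeBuilder()
    parser.feed(html)
    parser.close()
    return parser.root

def locate_region(block, keep=0.8):
    """Stage 1: descend while a single child carries most of the score."""
    while block.children:
        best = max(block.children, key=Block.score)
        if best.score() <= 0 or best.score() < keep * block.score():
            break
        block = best
    return block

def filter_noise(block, max_link_ratio=0.5):
    """Stage 2: collect text, skipping link-dominated sub-blocks."""
    t, l = block.totals()
    if t + l > 0 and l / (t + l) > max_link_ratio:
        return []
    out = list(block.text)
    for child in block.children:
        out.extend(filter_noise(child, max_link_ratio))
    return out

def extract_topic_text(html):
    return filter_noise(locate_region(build_tree(html)))

SAMPLE = """
<body>
<div><a href="/">Home</a> <a href="/news">News</a> <a href="/about">About us</a></div>
<div>
  <p>The first paragraph of the article body carries most of the page text.</p>
  <p>A second paragraph adds further topic information for the reader.</p>
  <div><a href="/a">Related story one</a><a href="/b">Related story two</a></div>
</div>
<div>Copyright <a href="/legal">Legal notice</a></div>
</body>
"""
```

Calling `extract_topic_text(SAMPLE)` keeps the two article paragraphs and drops the navigation bar, the related-links box, and the footer. Separating the stages means the first can stay coarse to preserve completeness, while the second removes noise inside the chosen region to improve accuracy.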

Authors: Lü Juwang (吕聚旺), Du Yuncheng (都云程), Wang Hongwei (王弘蔚), Shi Shuicai (施水才)

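Affiliations: Chinese Information Processing Research Center, Beijing Information Science and Technology University (北京信息科技大学中文信息处理研究中心); Beijing TRS Information Technology Co., Ltd. (北京拓尔思信息技术股份有限公司)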

Source: New Technology of Library and Information Service (《现代图书情报技术》), indexed in CSSCI and the Peking University Core Journals list, 2008, No. 12, pp. 48-53 (6 pages)

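Funding: This work is one of the research outcomes of the National 863 Program key project "Research on Key Technologies for Cross-Media Search and Development of Service Products" (No. 2006AA010105); the National Natural Science Foundation of China project "Research on Semantics-Based Chinese Text Clustering" (No. 60772081); and the Beijing municipal universities talent program (人才强教计划) project "Innovation Team: Intelligent Search Engines and Text Mining" (No. PXM2007_014224_044677).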

Keywords: topic information of Web pages; information extraction; information block; semantic information; IB-DOM tree

Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]
