
Web Page Segmentation Method Based on Title Machine Learning (基于标题机器学习的网页分割方法) Cited by: 1

Novel Method of Web Page Segmentation Based on Title Machine Learning
Abstract: Existing web page segmentation methods are all built on the Document Object Model (DOM) and are difficult to implement. To address this, a new method is proposed that performs segmentation on a string data model. The method uses machine learning to obtain the features of web page titles and then uses those titles to segment the page. First, title features are learned from the page's line-block distribution function and its title tags. Next, the page is split into content blocks at the titles. Finally, the content blocks are merged according to block depth, which completes the segmentation. Theoretical analysis and experimental results show that the algorithms in the method have O(n) time and space complexity, and that the method segments well on page types such as university portals, blog posts, and resource sites. The method can be used in many web page information management applications and has good application prospects.
Authors: LI Jin-sheng (李进生)1, LE Hui-xiao (乐惠骁)2, TONG Ming-wen (童名文)2 (1. Modern Education Technical Center, The Open University of Wuhan, Wuhan 430033, China; 2. School of Education Information Technology, Central China Normal University, Wuhan 430079, China)
Source: Computer Science (《计算机科学》), CSCD, Peking University Core Journal, 2018, Issue B06, pp. 583-587 (5 pages)
Funding: Supported by the Ministry of Education Humanities and Social Sciences Fund project "Research on a Decision Model for Accessible Adaptation of Digital Learning Resources" (15YJA880062)
Keywords: web page segmentation; title; line-block distribution function; block depth; machine learning
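
The three steps named in the abstract (title detection, splitting at titles, merging by block depth) can be illustrated with a minimal Python sketch on a string model. It is not the authors' algorithm: the abstract does not give the learned title features or the definition of block depth. The sketch assumes that a title is any line containing an `<h1>`..`<h6>` tag, that block depth is the heading level, and that a line block is a run of `width` consecutive lines. All function names and the `max_depth` parameter are hypothetical.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")                      # any HTML tag
HEADING_RE = re.compile(r"<h([1-6])\b", re.IGNORECASE)


def line_block_lengths(lines, width=3):
    """Assumed line-block distribution: for each line i, the total length of
    tag-free text in lines i .. i+width-1."""
    text_len = [len(TAG_RE.sub("", ln).strip()) for ln in lines]
    return [sum(text_len[i:i + width]) for i in range(len(lines))]


def find_titles(lines):
    """Return (line_index, depth) pairs. Stand-in for the learned title
    features: a heading tag marks a title, its level is the depth."""
    titles = []
    for i, ln in enumerate(lines):
        m = HEADING_RE.search(ln)
        if m:
            titles.append((i, int(m.group(1))))
    return titles


def segment(html, max_depth=2):
    """Split the page at title lines, then merge any block whose title is
    deeper than max_depth into the preceding block.
    Lines before the first title are not assigned to any block."""
    lines = html.splitlines()
    titles = find_titles(lines)
    blocks = []
    for k, (start, depth) in enumerate(titles):
        end = titles[k + 1][0] if k + 1 < len(titles) else len(lines)
        body = lines[start:end]
        if blocks and depth > max_depth:
            blocks[-1].extend(body)   # merge into the preceding block
        else:
            blocks.append(body)
    return ["\n".join(b) for b in blocks]


page = "<h1>News</h1>\n<p>alpha</p>\n<h3>Sub</h3>\n<p>beta</p>\n<h2>Blog</h2>\n<p>gamma</p>"
for block in segment(page):
    print(block, end="\n---\n")
```

Each function makes a single pass over the lines with a fixed `width`, which is consistent with the O(n) complexity claimed in the abstract. `line_block_lengths` is shown only to make the term concrete; how the paper combines it with title tags during learning is not stated in this record.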
