Abstract
Existing web page segmentation methods are all built on the Document Object Model (DOM) and are difficult to implement. To address this, a new method is proposed that segments web pages using a string data model. The method obtains the features of web page titles through machine learning and uses the titles to segment the page. First, title features are learned from the line-block distribution function of the page and from title tags. Then, the page is split into content blocks based on the titles. Finally, the content blocks are merged according to block depth, which completes the segmentation. Theoretical analysis and experimental results show that the algorithms in the method have O(n) time and space complexity, and that the method segments pages from university portals, blogs, and resource websites well. The method can be used in a variety of web page information management applications and has good application prospects.
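The three steps in the abstract can be illustrated with a minimal Python sketch that works on the raw HTML string and never builds a DOM tree. It is not the authors' implementation. Several things are assumptions made only for illustration: titles are taken to be `<h1>` to `<h6>` elements (the paper learns title features instead, and that learning step is omitted here), heading level stands in for block depth, and the names `line_block_lengths`, `split_by_titles`, `merge_by_depth`, `k`, and `max_depth` are hypothetical.

```python
import re

# Assumption: a title is any <h1>..<h6> element. The paper learns title
# features by machine learning; that step is not reproduced here.
HEADING = re.compile(r"<h([1-6])\b[^>]*>(.*?)</h\1\s*>", re.IGNORECASE | re.DOTALL)
TAG = re.compile(r"<[^>]+>")


def line_block_lengths(html, k=3):
    """Text length of every window of k consecutive lines, tags stripped."""
    lengths = [len(TAG.sub("", line).strip()) for line in html.splitlines()]
    return [sum(lengths[i:i + k]) for i in range(max(len(lengths) - k + 1, 0))]


def split_by_titles(html):
    """Cut the raw string at each heading; a block runs to the next heading."""
    matches = list(HEADING.finditer(html))
    blocks = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(html)
        blocks.append({
            "depth": int(m.group(1)),  # assumption: heading level as block depth
            "title": TAG.sub("", m.group(2)).strip(),
            "text": TAG.sub("", html[m.end():end]).strip(),
        })
    return blocks


def merge_by_depth(blocks, max_depth=2):
    """Fold every block deeper than max_depth into the block before it."""
    merged = []
    for b in blocks:
        if merged and b["depth"] > max_depth:
            prev = merged[-1]
            prev["text"] = (prev["text"] + " " + b["title"] + " " + b["text"]).strip()
        else:
            merged.append(dict(b))
    return merged
```

For example, `merge_by_depth(split_by_titles(page))` on a page with an `<h1>`, a nested `<h3>`, and an `<h2>` yields two blocks, with the `<h3>` content folded into the `<h1>` block. The window lengths from `line_block_lengths` are the kind of signal a title classifier could take as input; how the paper combines them with title tags is not stated in the abstract.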
Authors
LI Jin-sheng (李进生) 1, LE Hui-xiao (乐惠骁) 2, TONG Ming-wen (童名文) 2
1 Modern Education Technical Center, The Open University of Wuhan, Wuhan 430033, China
2 School of Education Information Technology, Central China Normal University, Wuhan 430079, China
Source
Computer Science (《计算机科学》)
Indexed in: CSCD; Peking University Core Journals (北大核心)
2018, Issue B06, pp. 583-587 (5 pages)
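Funding
Supported by the Humanities and Social Sciences Fund of the Ministry of Education of China, project "Research on a Decision Model for Accessible Adaptation of Digital Learning Resources" (15YJA880062).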
Keywords
Web page segmentation
Title
Line-block distribution function
Block depth
Machine learning