HTML页面中的文献记录分析算法

Analysis Algorithm of Reference Record in HTML Page
Abstract: With the rapid development of the Internet, web pages have become a main source of information. To help publishers promptly find the references they need among large numbers of web pages, an algorithm is needed that automatically extracts reference information from HyperText Markup Language (HTML) pages. To this end, a reference record analysis algorithm based on conditional random fields is proposed. First, a segmentation algorithm for the document object tree is designed: segmentation markers divide the page data into independent blocks, each consisting of a sequence of tags and text. Next, these sequences serve as the feature vectors of a conditional random field model, which is used to build a labeling model for reference information. Finally, a heuristic algorithm extracts the reference information data from the labeling model, and experiments verify the effectiveness of the approach.
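The first step in the abstract, splitting a page into blocks of tag and text sequences, can be illustrated with a minimal sketch. This is not the authors' algorithm. The separator tag set `SEPARATOR_TAGS`, the `("TAG", ...)` / `("TEXT", ...)` token format, and the names `BlockSegmenter` and `segment` are all assumptions made for illustration, and the sketch uses a streaming parser from the Python standard library rather than an explicit document object tree.

```python
from html.parser import HTMLParser

# Assumption: these tags act as segmentation markers. The record does not
# list the set the paper actually uses.
SEPARATOR_TAGS = {"li", "p", "tr", "br", "hr"}


class BlockSegmenter(HTMLParser):
    """Split an HTML page into blocks of (kind, value) tokens at separator tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []    # finished blocks
        self._current = []  # tokens of the block being built

    def _flush(self):
        # Keep only blocks that contain at least one text token.
        if any(kind == "TEXT" for kind, _ in self._current):
            self.blocks.append(self._current)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in SEPARATOR_TAGS:
            self._flush()  # a separator closes the block in progress
        else:
            self._current.append(("TAG", tag))

    def handle_endtag(self, tag):
        if tag in SEPARATOR_TAGS:
            self._flush()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self._current.append(("TEXT", text))

    def close(self):
        super().close()
        self._flush()  # emit whatever remains after the last separator


def segment(html):
    """Return a list of blocks, each a list of ("TAG"|"TEXT", value) tuples."""
    parser = BlockSegmenter()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Calling `segment(page_source)` on a page whose reference entries sit in `<li>` elements would yield one token sequence per entry. Under the approach the abstract describes, sequences of this kind would then be passed to a sequence labeling model such as a conditional random field.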
Source: Journal of Beijing University of Posts and Telecommunications (《北京邮电大学学报》), 2017, Issue S1, pp. 85-88 (4 pages). Indexed in EI, CAS, CSCD, and the Peking University Core Journals list.
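Funding: Beijing Municipal Education Commission Science and Technology Innovation Service Capacity Building Project (PXM2016_014223_000025); Beijing Institute of Graphic Communication Key Project (ea201507); Beijing Institute of Graphic Communication Faculty Development, Doctoral Start-up Fund Project (27170116005/062); Beijing Institute of Graphic Communication Research Project, Publication Data Asset Evaluation Laboratory Construction Project (20190116005/006)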
Keywords: digital publishing; conditional random field; reference analysis