
Extraction of Key Information in Web News Based on Improved Hidden Markov Model (cited by: 9)
Abstract: [Objective] This paper uses a Hidden Markov Model (HMM) to extract key information from news web pages, such as the title, date, source, and body text, and adapts the algorithm to this application to improve extraction quality. [Methods] Each web document is converted into a DOM tree and preprocessed. The information items to be extracted are mapped to HMM states, and the observed items are mapped to the observation vocabulary. On this basis, the paper studies the application of HMM to key information extraction from web news and proposes improvements to the algorithm. [Results] With the improved HMM algorithm, average accuracy reaches 97% on websites for which an extraction model has been built. [Limitations] The extraction model is somewhat weak at classification and cannot accurately extract items that differ only slightly. [Conclusions] The method offers high recognition accuracy, strong modeling ability, a small training-data requirement, and fast training.
Authors: Liu Zhiqiang (刘志强); Du Yuncheng (都云程); Shi Shuicai (施水才) (School of Computer, Beijing Information Science and Technology University, Beijing 100101, China; TRS Information Technology Co., Ltd., Beijing 100101, China)
Source: Data Analysis and Knowledge Discovery (《数据分析与知识发现》), indexed in CSSCI, CSCD, and the Peking University Core Journals list, 2019, No. 3, pp. 120-128 (9 pages)
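Funding: One of the research outputs of the Ministry of Education Major Social Science Research Project "Big Data-Driven Urban Public Safety Risk Research" (Grant No. 16JZD023)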
Keywords: Information Extraction; Hidden Markov Model; Machine Learning; DOM Tree
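
The mapping described in the abstract (information items as hidden states, observed items as vocabulary symbols) can be illustrated with a minimal sketch of standard Viterbi decoding. This is not the paper's improved algorithm, which the abstract does not detail. The observation symbols (`heading`, `date_like`, `short`, `long`) and every probability below are hypothetical values chosen for illustration; in practice they would be estimated from labeled pages of a given website.

```python
import math

# Hidden states: the four information items named in the abstract.
STATES = ["title", "date", "source", "text"]

# All probabilities below are invented for illustration, not taken from the paper.
START = {"title": 0.7, "date": 0.1, "source": 0.1, "text": 0.1}
TRANS = {
    "title":  {"title": 0.10, "date": 0.50, "source": 0.30, "text": 0.10},
    "date":   {"title": 0.05, "date": 0.05, "source": 0.60, "text": 0.30},
    "source": {"title": 0.05, "date": 0.25, "source": 0.10, "text": 0.60},
    "text":   {"title": 0.02, "date": 0.04, "source": 0.04, "text": 0.90},
}
# Hypothetical observation vocabulary: one coarse symbol per DOM text node.
EMIT = {
    "title":  {"heading": 0.80, "date_like": 0.02, "short": 0.15, "long": 0.03},
    "date":   {"heading": 0.02, "date_like": 0.90, "short": 0.07, "long": 0.01},
    "source": {"heading": 0.05, "date_like": 0.05, "short": 0.85, "long": 0.05},
    "text":   {"heading": 0.02, "date_like": 0.03, "short": 0.15, "long": 0.80},
}


def viterbi(observations):
    """Return the most probable state sequence for a list of observation symbols."""
    if not observations:
        return []
    # score[s] = best log-probability of any path that ends in state s
    score = {
        s: math.log(START[s]) + math.log(EMIT[s][observations[0]]) for s in STATES
    }
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor of s given the scores so far
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][obs])
            )
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state
    path = [max(STATES, key=lambda s: score[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# One symbol per DOM text node, in document order (hypothetical page).
nodes = ["heading", "date_like", "short", "long", "long"]
labels = viterbi(nodes)
print(list(zip(nodes, labels)))
```

With these made-up parameters, the five nodes are labeled title, date, source, text, text. Working in log space avoids numeric underflow on long node sequences.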