MeTCa: Multi-entity Trusted Confirmation Algorithm Based on Edit Distance (cited by: 2)
Abstract: With the rapid growth of self-media (We-media), anyone can publish and forward information online at will. That information may be accurate, but it may also be hearsay or deliberately tampered with. Data on the Internet is therefore highly redundant and only weakly credible, which makes existing data poorly usable. A Bi-LSTM-CRF (Bidirectional Long Short-Term Memory with a Conditional Random Field layer) network can address the accuracy of named entity recognition, but it cannot guarantee that the recognized entities are credible. This paper proposes a multi-entity trusted confirmation algorithm based on edit distance and validates it on a person named entity recognition task. First, distributed crawlers fetch the Top N web pages returned for the same email address by multiple search engines. Next, a Bi-LSTM-CRF model trained on a bilingual corpus extracts the person named entities from each page. Finally, multi-parameter entity fusion determines the person named entity that corresponds to the email address. Experimental results show that the algorithm raises the Mean Reciprocal Rank (MRR) of matching an email address to its real owner to 91.32%, which is 23.08% higher than an algorithm that uses term frequency alone. The results indicate that the algorithm can obtain strongly credible entities from weakly trusted data and reduce the low-quality characteristics of massive data, thereby enhancing the credibility of entity data sources.
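The abstract names edit distance and MRR but does not list the fusion parameters, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It merges near-duplicate person names by Levenshtein distance, ranks the merged candidates by total mention count, and scores rankings with MRR, the mean over queries of 1/rank of the true owner. The function names, the frequency-only scoring, and the `max_dist` threshold are all hypothetical.

```python
from collections import defaultdict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def rank_candidates(mentions, max_dist=2):
    """Merge near-duplicate names and rank clusters by total mention count.

    `mentions` is a list of name strings extracted from crawled pages.
    `max_dist` is a hypothetical threshold, not a value from the paper.
    """
    counts = defaultdict(int)
    for m in mentions:
        counts[m.strip().lower()] += 1
    clusters = []  # each entry: [representative_name, total_count]
    # Visit frequent names first so they become cluster representatives.
    for name, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        for cluster in clusters:
            if edit_distance(name, cluster[0]) <= max_dist:
                cluster[1] += n
                break
        else:
            clusters.append([name, n])
    clusters.sort(key=lambda c: (-c[1], c[0]))
    return [c[0] for c in clusters]


def mean_reciprocal_rank(rankings, truths):
    """Average of 1/rank of the true owner; contributes 0 when absent."""
    total = 0.0
    for ranked, truth in zip(rankings, truths):
        if truth in ranked:
            total += 1.0 / (ranked.index(truth) + 1)
    return total / len(truths)
```

In this sketch, spelling variants such as "Li Huakang" and "Li Hua-kang" fall into one cluster, so their counts reinforce each other instead of competing, which is one plausible reason fusion could outperform raw term frequency.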
Authors: SUN Guo-zi (孙国梓); LYU Jian-wei (吕建伟); LI Hua-kang (李华康) (School of Computer Science and Technology, Nanjing University of Posts and Telecommunications, Nanjing 210003, China)
Source: Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list, 2020, No. 12, pp. 327-331 (5 pages)
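Funding: National Natural Science Foundation of China (61502247, 11501302, 61502243); China Postdoctoral Science Foundation (2016M600434, 2016M591840); Jiangsu Province Postdoctoral Research Fund (1601128B); Open Fund of the Jiangxi Collaborative Innovation Center for Economic Crime Investigation and Prevention Technology (JXJZXTCX-015); Open Project of the Key Laboratory of Digital Engineering and Advanced Computing (2017A10).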
Keywords: weakly trusted data; Bi-LSTM-CRF (bidirectional long short-term memory with conditional random field network); multi-entity trusted confirmation algorithm; edit distance