Abstract
To address the large amount of noise in training corpora for Distant Supervision relation extraction, this paper proposes a topic-model-based method for identifying noisy labels. The method first analyzes the complex structure of relation sentence instances that Chinese Distant Supervision entity relation extraction faces, then uses custom-defined patterns and pattern clustering to represent and aggregate patterns, and finally uses a topic model to identify noisy labels. Experimental results show that the method identifies noisy labels effectively, and that an entity relation extraction model trained on the data remaining after the noisy labels are filtered out performs significantly better.
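The noise problem the abstract refers to can be illustrated with a minimal sketch of the general Distant Supervision labeling heuristic: every sentence that mentions both entities of a knowledge-base fact is labeled with that fact's relation. The knowledge base, the sentences, and the function name `distant_label` are hypothetical and chosen only for illustration; they are not the paper's data, patterns, or topic-model method.

```python
# Toy illustration of distant-supervision labeling.
# All data and names here are hypothetical, not taken from the paper.

# A one-fact knowledge base: (head entity, tail entity) -> relation.
KB = {("Steve Jobs", "Apple"): "founder_of"}

SENTENCES = [
    "Steve Jobs founded Apple in 1976.",        # expresses the relation
    "Steve Jobs returned to Apple in 1997.",    # does not express it
    "Steve Jobs praised the new Apple store.",  # does not express it
]


def distant_label(kb, sentences):
    """Label each sentence that mentions both entities of a KB fact
    with that fact's relation, regardless of what the sentence says."""
    labeled = []
    for (head, tail), relation in kb.items():
        for sent in sentences:
            if head in sent and tail in sent:
                labeled.append((sent, head, tail, relation))
    return labeled


labeled = distant_label(KB, SENTENCES)
for sent, head, tail, rel in labeled:
    print(f"{rel}({head}, {tail}) <- {sent}")
```

All three sentences receive the `founder_of` label although only the first one expresses that relation; the other two are the kind of noisy labels a filtering step would aim to remove before training.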
Source
Journal of Information Engineering University (《信息工程大学学报》)
2016, No. 3, pp. 303-308 (6 pages)
Funding
National High Technology Research and Development Program of China (863 Program), grant 2011AA7032030D