Abstract
This paper studies entity attribute extraction from unstructured Chinese text. Text simplification (TS) is introduced as a preprocessing step for extraction, to address the poor performance of traditional information extraction methods caused by long, complex sentences and the diversity of natural language expression. TS is modeled as a sequence-to-sequence (seq2seq) translation process and implemented with the seq2seq-RNN model from the machine translation field. To improve simplification quality, the model is optimized at several levels: using pre-trained word vectors, collecting a common vocabulary, introducing POS tagging, and designing a simplification scoring function. These optimizations let the model focus on learning the syntactic transformations involved in simplification. For the simplified text, a simple rule-based method extracts information tuples, from which entity attributes are then extracted. Experiments show that the improvements to seq2seq-RNN yield better text simplification, that more information is extracted from the simplified text than from the original text, and that the extracted information is also more accurate.
Authors
WU Cheng
WANG Chaokun
WANG Muxian
School of Software, Tsinghua University, Beijing 100084, China; School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
Source
Computer Engineering and Applications (《计算机工程与应用》)
CSCD
Peking University Core Journal (北大核心)
2020, Issue 21, pp. 115-122 (8 pages)
Funding
National Natural Science Foundation of China (No. 61872207)
National Key Research and Development Program of China (No. 2017YFC0820402)
Keywords
text simplification
information extraction
entity attributes
natural language processing
neural network