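Abstract: To address the poor recognition of nested and long entities in Chinese automotive-domain entity extraction, an entity category enhanced nested entity extraction (ECE-NER) model is proposed. First, feature-fusion encoding improves the model's awareness of domain entity boundaries. Next, a tail-word recognition module uses a multilayer perceptron to obtain the set of entity tail words. Finally, a forward boundary recognition module uses sememe-based entity category features and a self-attention mechanism to obtain category-enhanced representations of the candidate tail words, fuses the domain entity category features, and uses a biaffine encoder to compute entity span probabilities for a specific tail word and entity type, thereby determining the named entities. The model was validated on a production-line fault dataset from an automotive company, the CCL2022 automotive industry fault extraction evaluation dataset, and the Chinese medical text dataset CHIP2020. Experimental results show that, on the first two datasets, the entity recognition F1 of the proposed model exceeds that of the sequence labeling model (BERT+BiLSTM+CRF) and the span-based entity extraction models PURE (Princeton University Relation Extraction) and SpERT (Span-based Entity and Relation Transformer) by 4.1, 1.8, and 1.6 percentage points and by 9.0, 5.4, and 7.3 percentage points, respectively. On the first and third datasets, the nested entity recognition F1 improves over PURE and SpERT by 13.3 and 8.3 percentage points and by 21.7 and 9.3 percentage points, verifying the effectiveness of the proposed model for nested entity recognition.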
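The biaffine scoring step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the model's exact parameterization, so everything here is an assumption: the function name `biaffine_span_probs`, the common "bilinear term plus linear term plus bias" form, the sigmoid output, and the toy weights are all hypothetical and are not taken from the ECE-NER paper. The sketch scores every candidate head word against one fixed tail word for a single entity type.

```python
import math

def biaffine_span_probs(heads, tail, U, w, b):
    """Probability that each candidate head word starts an entity ending at `tail`.

    heads: list of head-word vectors, each of length d
    tail:  tail-word vector of length d
    U:     d x d bilinear weight matrix for one entity type (assumed form)
    w:     linear weights over the concatenation [head; tail], length 2d
    b:     scalar bias
    """
    probs = []
    for h in heads:
        # Bilinear term: h^T U tail
        bilinear = sum(h[i] * U[i][j] * tail[j]
                       for i in range(len(h)) for j in range(len(tail)))
        # Linear term over the concatenated (head, tail) pair
        linear = sum(wk * xk for wk, xk in zip(w, h + tail))
        score = bilinear + linear + b
        probs.append(1.0 / (1.0 + math.exp(-score)))  # sigmoid
    return probs

# Toy usage with d = 2 and hand-picked weights (not learned, not from the paper).
heads = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
tail = [1.0, 1.0]
U = [[1.0, 0.0], [0.0, -1.0]]
w = [0.0, 0.0, 0.0, 0.0]
b = 0.0
probs = biaffine_span_probs(heads, tail, U, w, b)
```

With these toy weights the first head scores above 0.5, the second below 0.5, and the all-zero head exactly 0.5. In a trained model one such weight set would presumably exist per entity type, and the head with a sufficiently high probability would fix the left boundary of the span.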
Funding: the National Key Research and Development Program of China (2018YFB1004503) and the National Natural Science Foundation of China (NSFC Grant Nos. 61732008, 61532010).
Abstract: A sememe is defined as the minimum semantic unit of languages in linguistics. Sememe knowledge bases are built by manually annotating sememes for words and phrases. HowNet is the best-known sememe knowledge base. It was extensively used in many natural language processing tasks in the era of statistical natural language processing and has proven effective and helpful for understanding and using languages. In the era of deep learning, although data are thought to be of vital importance, some studies have worked on incorporating sememe knowledge bases such as HowNet into neural network models to enhance system performance. Successful attempts have been made in tasks including word representation learning, language modeling, and semantic composition. In addition, considering the high cost of manually annotating and updating sememe knowledge bases, some work has tried to use machine learning methods to automatically predict sememes for words and phrases in order to expand sememe knowledge bases. Some studies also try to extend HowNet to other languages by automatically predicting sememes for words and phrases in a new language. In this paper, we summarize recent studies on the application and expansion of sememe knowledge bases and point out some future directions of research on sememes.
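As a rough illustration of what a sememe knowledge base stores, the sketch below models one as a mapping from word senses to sets of sememes. The entries in `SEMEMES` are toy examples made up for illustration and are not copied from HowNet, and the Jaccard overlap in `sememe_similarity` is only one simple, assumed way to compare two senses; it is not a method attributed to the surveyed work.

```python
# Toy sememe annotations; the entries are illustrative, not copied from HowNet.
SEMEMES = {
    "apple (fruit)": {"fruit"},
    "pear": {"fruit"},
    "apple (brand)": {"computer", "SpeBrand"},
    "laptop": {"computer", "able", "bring"},
}

def sememe_similarity(a, b, kb=SEMEMES):
    """Jaccard overlap between the sememe sets of two annotated senses."""
    sa, sb = kb[a], kb[b]
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

sim_fruit = sememe_similarity("apple (fruit)", "pear")           # identical sets
sim_brand = sememe_similarity("apple (brand)", "laptop")         # one shared sememe of four
sim_cross = sememe_similarity("apple (fruit)", "apple (brand)")  # no shared sememe
```

Annotating senses rather than surface words is what lets the two readings of "apple" end up with no overlap here, even though they share a spelling.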
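Abstract: As machine learning methods have become increasingly widespread in natural language processing in recent years, the security of natural language processing tasks has also drawn researchers' attention. Existing studies have found that applying small perturbations to samples can cause machine learning models to produce wrong results; this approach is called an adversarial attack. Textual adversarial attacks can effectively expose the weaknesses of natural language models so that they can be improved. However, current textual adversarial attack methods focus on designing complex adversarial example generation strategies, yield limited gains in attack success rate, and make highly invasive modifications that tend to degrade sample quality. Improving attack effectiveness more simply and efficiently while producing high-quality adversarial examples has therefore become an important need. To address this problem, from the new perspective of improving the adversarial attack process itself, the sememe-level sentence dilution algorithm (SSDA) and the dilution pool construction algorithm (DPCA) are designed. SSDA is a new procedure that can be freely embedded into the classic adversarial attack process: it first dilutes the input sample using the dilution pool built by DPCA and then generates adversarial examples. Without knowledge of the text dataset or the natural language model, it not only raises the attack success rate of any textual adversarial attack method but also yields adversarial examples of higher quality than the original method. Controlled experiments across different text datasets, dilution pool sizes, natural language models, and several mainstream textual adversarial attack methods verify that SSDA improves the success rate of textual adversarial attack methods and that the dilution pool built by DPCA improves SSDA's dilution capability. The experimental results show that the SSDA dilution process uncovers more model vulnerabilities than the classic adversarial attack process, and that DPCA helps SSDA further improve the text quality of adversarial examples while raising the success rate.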
Abstract: What and how we translate are questions often argued about. Whatever answer one may give, priority in translation should be given to meaning, especially meanings that exist in all the languages concerned. This research defines these as universal sememes and their study as universal semantics, and briefly looks into its applications.