Abstract
[Objective] This paper designs feature fusion and pseudo-label noise reduction strategies to explore methods for automatically identifying term citation objects in scientific papers. [Methods] We cast the identification of term citation objects as a sequence labeling problem. In the input layer of a BiLSTM-CNN-CRF model, we fused two categories of features of term citation objects, linguistic and heuristic, to strengthen their feature representation. We also designed a noise reduction mechanism for pseudo-label learning and used semi-supervised learning to examine how different feature combinations affect identification performance. [Results] The best F1 value of our method on the term citation object identification task reached 0.6018, 8% higher than the result of the BERT model. [Limitations] The experimental data cover only the computer science domain, so the method's portability to other fields remains to be verified. [Conclusions] The feature-fusion-based deep learning method performs well in identifying term citation objects, and pseudo-label learning addresses the shortage of annotated citation-object data; together they offer an effective approach to the automatic identification of term citation objects.
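The sequence labeling formulation and the pseudo-label filtering idea named in the abstract can be illustrated with a small sketch. The abstract specifies neither the tagging scheme nor the noise reduction rule, so everything here is an assumption made for illustration: the BIO scheme, the `TERM` label, the function names `bio_tags` and `filter_pseudo_labels`, and the confidence threshold are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Tagged = Tuple[List[str], List[str]]  # (tokens, tags)


def bio_tags(tokens: List[str], spans: List[Tuple[int, int]],
             label: str = "TERM") -> List[str]:
    """Turn half-open token spans (start, end) into BIO tags.

    Assumption: BIO tagging with a single hypothetical label "TERM".
    """
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = f"B-{label}"          # first token of the term
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the term
    return tags


def filter_pseudo_labels(predictions: List[Tuple[List[str], List[str], float]],
                         threshold: float = 0.9) -> List[Tagged]:
    """Keep only pseudo-labeled sentences whose confidence meets the threshold.

    Assumption: confidence thresholding is one common way to reduce
    pseudo-label noise; the paper's actual mechanism may differ.
    """
    return [(tokens, tags) for tokens, tags, conf in predictions
            if conf >= threshold]


if __name__ == "__main__":
    sentence = ["We", "use", "conditional", "random", "fields", "."]
    print(bio_tags(sentence, [(2, 5)]))
```

In a semi-supervised loop of this kind, the sentences that survive the filter would be added to the labeled training set before the tagger is retrained.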
Authors
Ma Na (马娜)
Zhang Zhixiong (张智雄)
Wu Pengmin (吴朋民)
Affiliations: National Science Library, Chinese Academy of Sciences, Beijing 100190, China; School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China; Wuhan Library, Chinese Academy of Sciences, Wuhan 430071, China; Hubei Key Laboratory of Big Data in Science and Technology, Wuhan 430071, China; Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
Source
Data Analysis and Knowledge Discovery (《数据分析与知识发现》)
Indexed in: CSSCI, CSCD, PKU Core (北大核心)
2020, No. 1, pp. 89-98 (10 pages)
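Funding
This work is one of the research outcomes of the Chinese Academy of Sciences funded project "Application Demonstration of Rich Semantic Retrieval for Scientific and Technical Literature" (Grant No. 院1734).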