Abstract
[Purpose/significance] Current research on cited span identification in academic literature has two problems: for a given citation context, the number of sentences that make up the corresponding cited span is not clearly defined, and the semantic similarity between words is rarely considered during feature construction. Addressing these two issues, this paper refines the existing experimental scheme to improve the performance of cited span identification. [Method/process] First, we segment the cited paper at different sentence granularities and compare identification performance across them, in order to determine the best number of sentences for a cited span. We then add lexical semantic similarity features to the identification model: words are represented as distributed vectors through word embedding, and, together with a lexical semantic network ontology, these are used to measure the semantic similarity between words in different sentences. [Result/conclusion] The experimental results show that identification performance tends to decline as the sentence segmentation granularity increases. In addition, the added lexical semantic similarity features effectively establish fine-grained semantic associations between sentences, which improves model stability and, in turn, identification performance. [Limitations] This work optimizes cited span identification only from the perspective of feature construction, so the improvement is limited. In terms of model selection, it relies on traditional machine learning algorithms and does not consider existing deep learning algorithms.
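The two ideas in the abstract, candidate spans at different sentence granularities and a word-level semantic similarity feature, can be illustrated with a minimal sketch. It is not the authors' implementation. The sliding-window grouping, the function names, and the toy three-dimensional vectors are all assumptions made for illustration; a real system would load trained word embeddings and might add a similarity score from a lexical semantic network ontology.

```python
import math

# Toy word vectors: hypothetical values for illustration only,
# standing in for trained word embeddings.
TOY_VECTORS = {
    "citation": [0.9, 0.1, 0.0],
    "reference": [0.8, 0.2, 0.1],
    "sentence": [0.1, 0.9, 0.0],
    "span": [0.2, 0.8, 0.1],
    "model": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def candidate_spans(sentences, size):
    """Group consecutive sentences into sliding windows of `size` sentences.

    Assumption: "granularity" is modelled here as the window size.
    """
    return [sentences[i:i + size] for i in range(len(sentences) - size + 1)]


def word_similarity_feature(context_words, span_words, vectors=TOY_VECTORS):
    """Average, over the context words, of the best cosine match in the span.

    Words without a vector are skipped; returns 0.0 if nothing can be scored.
    """
    span_vecs = [vectors[w] for w in span_words if w in vectors]
    scores = []
    for word in context_words:
        if word not in vectors:
            continue
        best = max((cosine(vectors[word], v) for v in span_vecs), default=0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0
```

In this sketch, calling candidate_spans with sizes 1, 2, 3 and so on yields the candidate sets whose identification performance would be compared, and word_similarity_feature supplies one numeric feature per pair of citation context and candidate span for a conventional classifier.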
Source
《情报理论与实践》
CSSCI
Peking University Core Journal (北大核心)
2019, Issue 9, pp. 139-145 (7 pages)
Information Studies:Theory & Application
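Funding
An outcome of the Major Project of the National Social Science Fund of China, "Research on the Disciplinary Construction of Information Science and the Future Development Path of Information Work" (Grant No. 17ZDA291)
and of the Jiangsu Province Postgraduate Research and Innovation Program project "Research on Automatic Identification of Citation Scope in Academic Literature" (Grant No. KYCX18_0365)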
Keywords
academic article
cited spans
citation analysis
text classification
semantic similarity