Abstract
[Objective] This paper explores how training sample size, the N value of N-grams, stop words, and term-frequency weighting affect automatic recognition of rhetorical moves in scientific paper abstracts with a support vector machine (SVM) model. [Methods] We extracted more than 1.1 million labeled moves from over 720,000 structured abstracts of scientific papers as experimental data and built an SVM model for move recognition. Using a controlled-variable design that changes one variable at a time, we varied the training sample size, the N value of the N-grams, whether stop words were removed, and the term-frequency weighting method, and compared the effect of each on recognition performance. [Results] The model performed best, reaching 93.50%, with a training set of 600,000 moves, an N-gram range of [1,2], stop words retained, and TF-IDF weighting. [Limitations] The training and test corpus consists mainly of structured abstracts collected by the authors, and the results were not compared with those of other studies. [Conclusions] Training sample size and certain fine-grained features have a significant impact on the performance of traditional machine learning models, so practitioners need to select features carefully according to their specific situation.
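The feature settings varied in these experiments (N-gram range, stop-word handling, and term-frequency weighting) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `ngrams` and `featurize`, the whitespace tokenizer, the smoothed IDF formula, and the toy sentences are all hypothetical. The SVM classifier itself is omitted; it would be trained on the feature vectors produced here.

```python
import math
from collections import Counter


def ngrams(tokens, n_range=(1, 2)):
    """Return all word n-grams with n in the inclusive range n_range."""
    lo, hi = n_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(lo, hi + 1)
        for i in range(len(tokens) - n + 1)
    ]


def featurize(sentences, n_range=(1, 2), stop_words=None, weighting="tfidf"):
    """Turn sentences into sparse {ngram: weight} dicts.

    stop_words=None keeps every token; weighting is "tf" or "tfidf".
    """
    stop = set(stop_words or ())
    docs = []
    for sentence in sentences:
        # Assumption: lowercase + whitespace split stands in for a real tokenizer.
        tokens = [t for t in sentence.lower().split() if t not in stop]
        docs.append(Counter(ngrams(tokens, n_range)))
    if weighting == "tf":
        return [dict(d) for d in docs]
    # Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1, one common variant.
    n_docs = len(docs)
    df = Counter(term for d in docs for term in d)
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in d.items()} for d in docs]


# Hypothetical toy move sentences, not taken from the paper's corpus.
sentences = [
    "this paper explores the influence of sample size",
    "we constructed an svm model for move recognition",
    "the model yielded the best result",
]

# The best-performing configuration reported in the abstract:
# unigrams plus bigrams, stop words kept, TF-IDF weighting.
features = featurize(sentences, n_range=(1, 2), stop_words=None, weighting="tfidf")
```

Changing one argument at a time (for example `n_range=(1, 1)`, `stop_words=["the", "of"]`, or `weighting="tf"`) while holding the others fixed mirrors the single-variable comparison described in the Methods.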
Authors
Ding Liangping
Zhang Zhixiong
Liu Huan
(National Science Library, Chinese Academy of Sciences, Beijing 100190, China; Department of Library, Information and Archives Management, University of Chinese Academy of Sciences, Beijing 100190, China; Wuhan Library, Chinese Academy of Sciences, Wuhan 430071, China)
Source
Data Analysis and Knowledge Discovery (《数据分析与知识发现》)
CSSCI
CSCD
PKU Core Journal
2019, Issue 11, pp. 16-23 (8 pages)
Funding
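This work is one of the research outcomes of "Application Demonstration of Rich Semantic Retrieval for Scientific and Technical Literature" (project no. 院1734), a sub-project of the Chinese Academy of Sciences special program for literature and information capacity building.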
Keywords
Move Recognition
Support Vector Machine
Structured Abstracts