Abstract
Scientific and technical (S&T) literature typically reports a study's purpose, methods, results, and conclusions. Indexing S&T literature with this information is particularly important for helping researchers quickly find content relevant to their needs within a very large body of literature. Building on an analysis of the writing characteristics of S&T literature, this paper proposes a core feature word extraction strategy based on words, English terms (proper nouns and abbreviations), and numerals. It then recasts S&T literature indexing as a sentence classification problem and, using the proposed core feature words, applies a support vector machine classifier to index S&T literature semantically at the sentence level. Experiments on 1168 medical papers on diabetes show that the proposed method can effectively learn and index the sentences in S&T literature, and can thus automatically index the key information points of S&T literature.
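The abstract names three sources of core feature words: words, English terms, and numerals. The sketch below is one possible reading of that idea, not the authors' implementation. The regular expressions, the `EN:`/`NUM:`/`ZH:` feature prefixes, and the use of Chinese character bigrams in place of proper word segmentation are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical patterns for illustration; the source does not give the exact rules.
ENGLISH_RE = re.compile(r"[A-Za-z][A-Za-z0-9-]*")  # abbreviations, proper nouns
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?%?")         # integers, decimals, percentages
CJK_RE = re.compile(r"[\u4e00-\u9fff]+")           # runs of Chinese characters


def extract_features(sentence):
    """Count English, numeric, and Chinese bigram features in one sentence."""
    features = Counter()

    # English tokens such as "HbA1c" or "BMI" are kept verbatim.
    for token in ENGLISH_RE.findall(sentence):
        features["EN:" + token] += 1

    # Blank out English tokens so digits inside them are not counted as numbers.
    remainder = ENGLISH_RE.sub(" ", sentence)

    # Numbers are reduced to a coarse type rather than kept as literal values.
    for token in NUMBER_RE.findall(remainder):
        kind = "percent" if token.endswith("%") else "value"
        features["NUM:" + kind] += 1

    # Assumption: character bigrams stand in for a real word segmenter.
    for run in CJK_RE.findall(remainder):
        grams = [run[i:i + 2] for i in range(len(run) - 1)] or [run]
        for gram in grams:
            features["ZH:" + gram] += 1

    return features


example = extract_features("治疗组HbA1c下降1.2%,BMI无明显变化")
```

Feature counts of this kind could be turned into vectors and passed to a support vector machine that assigns each sentence a label such as purpose, method, result, or conclusion. The label set and the vectorization step are not specified in the abstract.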
Source
《情报理论与实践》
CSSCI
Peking University Core Journals (北大核心)
2014, No. 7, pp. 129-134 (6 pages)
Information Studies: Theory & Application
Funding
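National Social Science Fund of China project "Research on Detecting 'Idea Plagiarism' (意抄) in Academic Literature" (Grant No. 12CTQ032)
Shandong Provincial Natural Science Foundation project "Research on Parallel Processing and Semantic Classification of Large-Scale Academic Literature" (Grant No. ZR2011GL025); this paper is one of the outcomes of these projects.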
Keywords
automatic indexing (自动标引)
support vector machine (支持向量机)
feature selection (特征提取)
S&T literature (科技文献)