
Title Information Classification Based on Hownet Semantics Feature Extension (cited by: 6)
Abstract This paper exploits the semantic relatedness within a text collection to build a core semantic word set for the training set at two granularities: high-frequency words and latent topics. Using HowNet as an external resource, it computes the similarity between the core semantic word set and the feature words of the test set, then adds training-set feature words whose similarity exceeds a threshold to the texts to be classified, which contain only a title. The expanded texts are classified with an SVM. Experiments show that when both the training set and the test set consist of titles only, the gain is largest, 3.1%, with 200 training documents per class, and shrinks as the training set grows. When the training set consists of titles plus abstracts and the test set of titles only, the proposed algorithm raises Macro_F1 by an average of 1.5% on the Fudan corpus and 3.1% on a self-built journal corpus, and Micro_F1 by an average of 2.3% and 5.3%, respectively. By extending the sparse features of titles, the paper aims to improve the classification of journal article titles.
Source Library Journal (《图书馆杂志》), CSSCI, Peking University Core Journal, 2017, No. 2, pp. 11-19 (9 pages)
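Funding A research output of the Social Science Fund project "Research on Automatic Classification of Multiple Types of Digital Text Resources" (Project No. 15BTQ066)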
Keywords Journal title information; Short-text classification; Hownet; LDA
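
The feature-extension step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy corpus, and the `toy_similarity` lookup table are all hypothetical. In particular, the lookup table stands in for a HowNet-based word similarity, only the high-frequency-word granularity is shown (the latent-topic granularity is omitted), and the exact matching rule is one possible reading of the abstract.

```python
from collections import Counter

def core_words_by_frequency(train_docs, top_k):
    """Pick the top_k most frequent words in the tokenized training set."""
    counts = Counter(word for doc in train_docs for word in doc)
    return [word for word, _ in counts.most_common(top_k)]

def expand_title(title_words, core_words, similarity, threshold):
    """Append each core word that is similar enough to some title word.

    `similarity` is any callable returning a score in [0, 1]; a
    HowNet-based measure would be plugged in here (assumption).
    """
    expanded = list(title_words)
    seen = set(title_words)
    for core in core_words:
        if core in seen:
            continue
        if any(similarity(core, word) > threshold for word in title_words):
            expanded.append(core)
            seen.add(core)
    return expanded

# Hypothetical similarity table standing in for HowNet (illustration only).
TOY_SIM = {
    frozenset(("classification", "categorization")): 0.9,
    frozenset(("library", "archive")): 0.6,
}

def toy_similarity(a, b):
    """Return 1.0 for identical words, a table score otherwise, else 0.0."""
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset((a, b)), 0.0)

# Toy tokenized training documents (made up for this sketch).
train_docs = [
    ["text", "categorization", "svm"],
    ["text", "categorization", "archive"],
    ["text", "library"],
]
core_words = core_words_by_frequency(train_docs, top_k=2)

title = ["journal", "title", "classification"]
expanded_title = expand_title(title, core_words, toy_similarity, threshold=0.8)
unchanged_title = expand_title(title, core_words, toy_similarity, threshold=0.95)
print(expanded_title)
```

The expanded word lists would then be vectorized and passed to an SVM classifier. Raising the threshold adds fewer words to each title, which trades coverage for precision.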