Abstract
This paper exploits the semantic relatedness within a text collection. It first derives a semantic core word set from the training set at two granularities, high-frequency words and latent topics, then uses HowNet as an external resource to compute the similarity between the core word set and the feature words of the test set. Feature words from the training set whose similarity exceeds a threshold are expanded into the to-be-classified texts, which consist only of titles, and the expanded texts are classified with SVM. Experimental results show that when both the training and test sets contain only titles, the improvement peaks at 3.1% with 200 training documents per class, and declines as the number of training documents grows. When the training set contains titles plus abstracts and the test set contains only titles, the proposed algorithm raises Macro-F1 by an average of 1.5% and 3.1% on the Fudan corpus and a self-built journal corpus respectively, and Micro-F1 by an average of 2.3% and 5.3%. By performing feature expansion on sparse title text, this paper aims to improve the classification of journal article titles.
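The pipeline described in the abstract (building a semantic core word set from the training set, expanding sparse titles with core words above a similarity threshold, then classifying with SVM) can be sketched as follows. This is an illustrative sketch only: a toy character-overlap similarity stands in for the paper's HowNet-based similarity, plain frequency counts stand in for the high-frequency-plus-LDA-topic word selection, and all data and function names are hypothetical.

```python
# Illustrative sketch of title feature expansion + SVM classification.
# toy_similarity is a placeholder for the HowNet similarity used in the paper;
# core_word_set uses raw frequency in place of high-frequency + LDA topic words.
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def core_word_set(train_docs, top_k=5):
    # Step 1: pick the most frequent words of the training set as the core set.
    counts = Counter(w for doc in train_docs for w in doc.split())
    return [w for w, _ in counts.most_common(top_k)]

def toy_similarity(a, b):
    # Placeholder similarity: Jaccard overlap of character sets.
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def expand(title, core_words, threshold=0.5):
    # Step 2: append core words similar enough to some word of the title.
    extra = [c for c in core_words
             if any(toy_similarity(c, w) >= threshold for w in title.split())]
    return title + " " + " ".join(extra)

# Hypothetical toy data: two classes, four training "titles".
train_docs = ["svm text classification", "topic model lda text",
              "image segmentation network", "deep image recognition"]
train_labels = ["nlp", "nlp", "cv", "cv"]

core = core_word_set(train_docs)
vec = TfidfVectorizer()
X = vec.fit_transform(train_docs)
clf = LinearSVC(random_state=0).fit(X, train_labels)   # Step 3: SVM classifier

test_title = "short text lda"                  # sparse title-only test document
expanded = expand(test_title, core)            # enriched with core words
pred = clf.predict(vec.transform([expanded]))[0]
print(pred)
```

In the paper itself, the expansion vocabulary comes from the training set's high-frequency words and LDA topic words, and the threshold is applied to HowNet similarity scores rather than the toy overlap measure above.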
Source
Library Journal (《图书馆杂志》)
CSSCI; Peking University Core Journals
2017, No. 2, pp. 11-19 (9 pages)
Funding
One of the outputs of the Social Science Fund project "Research on Automatic Classification of Multiple Types of Digital Text Resources" (Project No. 15BTQ066)
Keywords
Journal article title
Short-text classification
HowNet
LDA