Abstract
Artificial intelligence driven by big data has developed rapidly in recent years, and natural language processing (NLP) has become one of its most prominent research frontiers. In short text classification, however, results differ markedly depending on which feature extraction method is paired with which machine learning algorithm. To address the low precision of short text classification, this paper combines different feature extraction methods with different machine learning algorithms and, using preset evaluation metrics, examines how each combination performs on ultra-short text in order to find the best-performing model. Experimental results show that, among the four best combinations selected, the model that uses term frequency-inverse document frequency (TF-IDF) for feature extraction and logistic regression as the classifier achieves the best results on a public dataset, with a precision of 92.13% and a recall of 90.12%, making it well suited to ultra-short text classification.
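The TF-IDF plus logistic regression pairing named in the abstract can be illustrated with a minimal, standard-library-only sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy pre-tokenized texts and labels, the smoothed IDF variant, the plain stochastic gradient descent training loop, and the helper names (`fit_tfidf`, `tfidf_vector`, `train_logreg`, `predict`, `precision_recall`). The paper's dataset, tokenization, and hyperparameters are not given in this record.

```python
import math
from collections import Counter


def fit_tfidf(docs):
    """Learn smoothed IDF weights from tokenized documents (assumed variant)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf_vector(doc, idf):
    """Map one tokenized document to an L2-normalized sparse TF-IDF vector."""
    tf = Counter(t for t in doc if t in idf)  # unseen terms are dropped
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}


def score(x, w, b):
    """Probability of class 1 under the logistic model."""
    z = b + sum(w.get(t, 0.0) * v for t, v in x.items())
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, epochs=200, lr=0.5):
    """Binary logistic regression trained with plain SGD on the log loss."""
    w, b = {}, 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            g = score(x, w, b) - label  # gradient of log loss w.r.t. z
            for t, v in x.items():
                w[t] = w.get(t, 0.0) - lr * g * v
            b -= lr * g
    return w, b


def predict(x, w, b):
    return 1 if score(x, w, b) >= 0.5 else 0


def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy data: 1 = sports, 0 = finance (not from the paper).
train_docs = [
    ["team", "wins", "match"], ["player", "scores", "goal"],
    ["coach", "praises", "team"], ["stock", "price", "falls"],
    ["bank", "raises", "rates"], ["market", "price", "rises"],
]
train_labels = [1, 1, 1, 0, 0, 0]
test_docs = [["team", "scores"], ["stock", "market"]]
test_labels = [1, 0]

idf = fit_tfidf(train_docs)
w, b = train_logreg([tfidf_vector(d, idf) for d in train_docs], train_labels)
test_preds = [predict(tfidf_vector(d, idf), w, b) for d in test_docs]
precision, recall = precision_recall(test_labels, test_preds)
print(test_preds, precision, recall)
```

Swapping `fit_tfidf` and `tfidf_vector` for another feature extractor, or `train_logreg` for another classifier, while holding the evaluation function fixed, mirrors the kind of combination comparison the abstract describes.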
Authors
Liu Xiaopeng (刘晓鹏)
Yang Jiajia (杨嘉佳)
Lu Kai (卢凯)
Tian Changhai (田昌海)
Tang Qiu (唐球)
Affiliations: National Computer System Engineering Research Institute of China, Beijing 100083, China; Information Research Center of Military Science, PLA Academy of Military Science, Beijing 100142, China
Source
Information Technology and Network Security (《信息技术与网络安全》)
2019, No. 5, pp. 48-52 (5 pages)
Keywords
natural language processing
text classification
ultra-short text