
Applications of Statistical Models in Chinese Text Mining (cited by 14)
Abstract: This paper discusses three problems in Chinese text mining: word segmentation, keyword extraction, and text classification. For word segmentation, we introduce the ICTCLAS method, which is based on a hierarchical hidden Markov model, and the WDM method, which treats the boundaries between words as missing data and solves for them with the EM algorithm. For keyword extraction, we propose a method based on the Bayes factor and introduce the CCS method, which uses sparse regression. For text classification, we introduce a method that builds classifiers on keyword frequencies, and another that first trains topic models and then builds classifiers on topic proportions. The paper then compares these methods on two text datasets and offers suggestions on their practical use.
Authors: Wang Jian (王健), Zhang Junni (张俊妮)
Source: Journal of Applied Statistics and Management (《数理统计与管理》), CSSCI, PKU Core Journal, 2017, No. 4, pp. 609-619 (11 pages)
Keywords: word segmentation, keyword extraction, text classification, Bayes factor, L1 penalization, topic model
