Chinese Word Segmentation Technology Based on the Python Language
Cited by: 58
Abstract: Python, an interpreted high-level programming language, is now widely used in fields such as big data and artificial intelligence, and has many applications in data science, including web crawling and data mining. Word segmentation is the process of dividing a continuous sequence of characters into a sequence of words according to a given standard. In English, spaces delimit words, but Chinese is more complicated: characters, sentences, and paragraphs are easy to delimit, whereas words carry no explicit boundary marker, which makes segmenting Chinese text difficult. In this work, a Python crawler collects web page data as the experimental corpus, and the Python segmentation library jieba segments the Chinese text. Keywords are then extracted from the segmentation results with the TF-IDF and TextRank algorithms, and the results are reported to be clearly better than those of a word-frequency-based segmentation algorithm. Finally, the keywords are displayed as a word cloud, so that the segmentation results can be seen at a glance.
Authors: ZHU Yong-zhi (祝永志); JING Jing (荆静) (School of Information Science and Engineering, Qufu Normal University, Rizhao, Shandong 276826, China)
Source: Communications Technology (《通信技术》), 2019, No. 7, pp. 1612-1619 (8 pages)
Funding: Natural Science Foundation of Shandong Province (No. ZR2013FL015); Shandong Province Graduate Education Innovation Funding Program (No. SDYY12060)
Keywords: Python; text segmentation; jieba; word cloud; data visualization
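
The pipeline described in the abstract (segment the text, then rank keywords) can be illustrated with a minimal, standard-library-only sketch. It is not the authors' code and not jieba's algorithm: segmentation here is a simple greedy forward maximum matching over a hypothetical toy dictionary (`TOY_DICT`), and the keyword scores use a plain TF-IDF formula over a three-sentence made-up corpus. The function names `forward_max_match` and `tfidf_keywords` are illustrative assumptions.

```python
import math
from collections import Counter

# Hypothetical toy dictionary; a real system would load a large lexicon.
TOY_DICT = {"中文", "分词", "技术", "研究", "关键词", "提取", "词云", "展示"}
MAX_WORD_LEN = max(len(word) for word in TOY_DICT)


def forward_max_match(text, dictionary=TOY_DICT, max_len=MAX_WORD_LEN):
    """Greedy left-to-right segmentation.

    At each position, take the longest dictionary word that starts there;
    if none matches, emit a single character.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def tfidf_keywords(docs_tokens, doc_index, top_k=3):
    """Rank the tokens of one document by TF-IDF against a small corpus."""
    n_docs = len(docs_tokens)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))
    counts = Counter(docs_tokens[doc_index])
    total = sum(counts.values())
    scores = {
        token: (count / total) * math.log(n_docs / doc_freq[token])
        for token, count in counts.items()
    }
    # Sort by descending score, then by token so ties are deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Made-up example corpus, one short "document" per string.
corpus = ["中文分词技术研究", "关键词提取技术", "词云展示关键词"]
segmented = [forward_max_match(doc) for doc in corpus]
top_keywords = [token for token, _ in tfidf_keywords(segmented, 0)]

print(segmented[0])
print(top_keywords)
```

In this toy corpus, "技术" appears in two of the three documents, so its inverse document frequency is lower and it drops out of the top three keywords for the first document. With jieba installed, `jieba.lcut(text)` and `jieba.analyse.extract_tags(text, topK=3)` would be the natural replacements for the two helper functions, and `jieba.analyse.textrank` offers the TextRank alternative mentioned in the abstract.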
