Abstract
As an interpreted high-level programming language, Python has penetrated popular fields such as big data and artificial intelligence, and it is widely applied in data science, for example in web crawlers and data mining. Word segmentation is the process of dividing a continuous character sequence into word sequences that conform to certain conventions. In English, spaces delimit words, but Chinese is more complicated: characters, sentences, and paragraphs are easy to separate, whereas Chinese words carry no explicit boundary markers, so segmenting Chinese text is considerably harder. In this work, a Python crawler collects web-page data as the experimental text, and Python's word segmentation library jieba segments the Chinese text. Keywords are then extracted from the segmentation results with the TF-IDF and TextRank algorithms, and the experimental results are clearly better than those of a purely frequency-based approach. Finally, the keywords are displayed as a word cloud, making the segmentation results clear at a glance.
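The TF-IDF keyword scoring that the abstract describes can be sketched in a few lines of plain Python. This is a minimal illustration of the idea only, not the paper's implementation: in the paper, jieba produces the token lists from Chinese text, whereas here a toy corpus of pre-tokenized English words stands in for them.

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """Score each term in each document by TF-IDF.

    docs: list of token lists (jieba's segmentation output would
    play this role for Chinese text).
    Returns a list of {term: score} dicts, one per document.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scored = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        # TF-IDF = (term frequency in doc) * log(N / document frequency)
        scored.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scored

docs = [
    ["python", "data", "mining", "data"],
    ["python", "crawler"],
    ["word", "cloud", "keyword"],
]
scores = tfidf_scores(docs)
# "data" is frequent in the first document but rare in the corpus,
# so it outranks the corpus-wide term "python" there.
top = max(scores[0], key=scores[0].get)
```

This illustrates why TF-IDF beats raw frequency for keyword extraction: a term common across all documents (here "python") is discounted by its low inverse document frequency, while a term concentrated in one document rises to the top.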
Authors
祝永志
荆静
ZHU Yong-zhi; JING Jing (School of Information Science and Engineering, Qufu Normal University, Rizhao, Shandong 276826, China)
Source
《通信技术》
2019, No. 7, pp. 1612-1619 (8 pages)
Communications Technology
Funding
Natural Science Foundation of Shandong Province (No. ZR2013FL015)
Shandong Province Graduate Education Innovation Program (No. SDYY12060)