Text Data Preprocessing Based on Term Frequency Statistics Rules (基于词频统计规律的文本数据预处理方法)

Cited by: 11
Abstract: In the era of big data, text mining faces a "high-dimensional, sparse" feature problem. The mismatch between the enormous vocabulary of a corpus and its small number of key features leads to high time and space complexity and severely limits the efficiency of text mining, so effective data preprocessing before mining is essential. Traditional text mining algorithms perform only word segmentation and stop-word removal in the preprocessing stage. To improve performance, a data preprocessing method based on term frequency statistics rules (DPTFSR) is proposed. First, an expression for the number of words that share the same frequency is derived from Zipf's law and the maximum value method. Next, this expression is used to examine how words of each frequency are distributed in a text; the results show that words with a frequency of 1 or 2 have little relevance to the document, yet they account for as much as 2/3 of the words. Finally, data is preprocessed according to these term frequency statistics rules: low-frequency words are removed in the preprocessing stage, which reduces the feature dimension. Experiments on the public data sets Reuters-21578 and 20-Newsgroups show that the distribution rules for words of each frequency hold, and that the proposed method improves classification accuracy, precision, recall, and F1 measure while clearly reducing running time, so the efficiency of text mining is significantly improved.
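Source: Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list, 2017, No. 10, pp. 276-282, 288 (8 pages in total)
Funding: National Natural Science Foundation of China (71271067); Major Program of the National Social Science Foundation of China (13&ZD091); Hebei Province Higher Education Science and Technology Research Project (QN2014196); Hebei Normal University Master's Fund (xj2015003)
Keywords: Big data, Text mining, Data preprocessing, Term frequency statistics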
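
The 2/3 figure in the abstract can be illustrated with a short sketch. It assumes the idealized form of Zipf's law (frequency times rank is a constant), under which the share of distinct words occurring exactly n times is 1 / (n(n + 1)); this is a standard consequence of Zipf's law and not necessarily the exact expression derived in the paper. The function names `expected_share` and `remove_low_frequency` and the `min_count` parameter are hypothetical, introduced only for this illustration.

```python
from collections import Counter
from fractions import Fraction


def expected_share(n: int) -> Fraction:
    """Share of distinct words that occur exactly n times.

    Assumption: idealized Zipf's law, which gives 1 / (n * (n + 1)).
    """
    return Fraction(1, n * (n + 1))


def remove_low_frequency(tokens, min_count=3):
    """Drop every token whose count in `tokens` is below `min_count`.

    Assumption: `tokens` is already segmented and stop-word filtered,
    and frequencies are counted over exactly this token list.
    """
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] >= min_count]


# Words with frequency 1 and 2 together: 1/2 + 1/6 = 2/3 of distinct words.
low_frequency_share = expected_share(1) + expected_share(2)

# Toy document: only "text" occurs at least three times.
tokens = ["text", "mining", "text", "zipf", "text", "mining", "law"]
filtered = remove_low_frequency(tokens)
```

With the default `min_count=3`, words of frequency 1 and 2 are discarded, which mirrors the low-frequency removal step described in the abstract. Whether the paper counts frequencies per document or over the whole corpus is not stated here, so the threshold and the counting scope should be treated as tunable assumptions.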