
Web Text Keyword Extraction Scheme Based on the Hadoop Distributed Platform (cited by: 5)
Abstract: To address keyword extraction from massive Web text, this paper proposes a keyword extraction scheme based on the Hadoop distributed computing platform. First, the Hadoop platform is configured to support natural language processing. Next, the GATE tool performs word and sentence segmentation, part-of-speech tagging, and annotation-rule operations on the Web text to produce a set of candidate keywords. Finally, the traditional TF-IDF algorithm is weighted by word position and span importance factors to compute the relevance between each candidate keyword and the document, yielding the document's keywords, which are then used to label the document's attributes. Experimental results show that the proposed distributed scheme extracts keywords from Web documents quickly and accurately.
Source: Natural Science Journal of Xiangtan University (《湘潭大学自然科学学报》), 2016, No. 2, pp. 79-83 (5 pages). Indexed in CAS and the Peking University Core Journals list (北大核心).
Funding: National Natural Science Foundation of China (61203164, 61174184)
Keywords: Web text; keyword extraction; Hadoop platform; natural language processing; distributed
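
The abstract describes weighting traditional TF-IDF by word position and span factors but does not give the formulas. The following is a minimal single-machine sketch of that idea, not the paper's method: the position factor (earlier first occurrence scores higher), the span factor (distance between first and last occurrence relative to document length), the additive combination with hypothetical coefficients `alpha` and `beta`, and the smoothed IDF are all assumptions chosen for illustration. It also omits the Hadoop distribution and the GATE preprocessing, and takes already tokenized candidate keywords as input.

```python
import math
from collections import Counter


def position_factor(first_index, doc_len):
    """Assumed position factor in (0, 1]: earlier first occurrence scores higher."""
    return 1.0 - first_index / doc_len


def span_factor(first_index, last_index, doc_len):
    """Assumed span factor in (0, 1]: share of the document the term spans."""
    return (last_index - first_index + 1) / doc_len


def weighted_tfidf(docs, alpha=1.0, beta=1.0):
    """Score each candidate term of each document.

    docs: list of token lists (candidate keywords in document order).
    alpha, beta: hypothetical coefficients for the position and span factors.
    Returns one {term: score} dict per document.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    all_scores = []
    for doc in docs:
        doc_len = len(doc)
        tf = Counter(doc)
        first, last = {}, {}
        for i, term in enumerate(doc):
            first.setdefault(term, i)  # keep the first index only
            last[term] = i             # overwrite until the last index
        scores = {}
        for term, count in tf.items():
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0  # smoothed IDF
            weight = (1.0
                      + alpha * position_factor(first[term], doc_len)
                      + beta * span_factor(first[term], last[term], doc_len))
            scores[term] = (count / doc_len) * idf * weight
        all_scores.append(scores)
    return all_scores


def top_keywords(scores, k=3):
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


docs = [
    ["hadoop", "keyword", "web", "text", "hadoop"],
    ["web", "page", "text"],
]
scores = weighted_tfidf(docs)
print(top_keywords(scores[0], k=2))
```

In a distributed setting, the document-frequency count and the per-document scoring would naturally split into separate map and reduce stages, but how the paper partitions this work is not stated in the abstract.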

