Parallel Subject Indexing Algorithm in YARN Platform (original Chinese title: YARN平台上的并行主题标引算法)

Cited by: 2
Abstract: Subject indexing of documents is an important prerequisite for personalized intelligent retrieval, but on large-scale data collections it becomes a performance bottleneck. Subject indexing algorithms implemented on the MapReduce framework typically suffer from long task startup times and excessive disk I/O for intermediate data. To address these problems, this paper adopts YARN (Yet Another Resource Negotiator) as the underlying distributed resource management platform and selects computing frameworks better suited to the workload. Because the subject indexing algorithm has many computation steps and is strongly staged, it is implemented on a directed acyclic graph (DAG) computing model, which avoids unnecessary splitting into separate jobs and thereby reduces disk I/O for intermediate results. In addition, the sorting strategy of MapReduce is time-consuming, and some computations do not need sorted results, so a hash-based data reduction strategy is used instead to improve performance; this, however, introduces a random-read problem. An optimized computing strategy is therefore designed that exploits the fast random reads of solid state disks (SSDs). Experimental comparison shows that using YARN as the underlying management platform, choosing a suitable computing framework on top of it, and optimizing that framework can effectively improve distributed computing performance.
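Source: Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》), CSCD, 2014, No. 12, pp. 1409-1421 (13 pages)
Funding: National Natural Science Foundation of China; National High Technology Research and Development Program of China (863 Program); Huazhong University of Science and Technology Independent Innovation Fund
Keywords: subject indexing; YARN platform; directed acyclic graph (DAG) computation; solid state disk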
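
The abstract's contrast between sort-based and hash-based reduction can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names and the sample (term, count) pairs are hypothetical, and the example runs in memory on a single machine rather than on YARN.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def sort_based_reduce(pairs):
    """Sort (key, value) pairs by key, then sum each run of equal keys."""
    ordered = sorted(pairs, key=itemgetter(0))
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(ordered, key=itemgetter(0))
    }


def hash_based_reduce(pairs):
    """Sum values per key in a hash table; no sort is performed."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Hypothetical map output: (term, count) pairs emitted from several documents.
map_output = [("yarn", 1), ("dag", 2), ("yarn", 3), ("ssd", 1), ("dag", 1)]

sorted_totals = sort_based_reduce(map_output)
hashed_totals = hash_based_reduce(map_output)
```

Both functions produce the same per-key totals. The hash-based variant skips the sort, so its keys come out in first-seen order rather than sorted order, which is acceptable only when the downstream step does not need sorted input.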