Abstract
Hadoop is an Apache top-level project that supports distributed applications involving huge amounts of data and thousands of nodes. It is a free, open-source software framework inspired by Google's MapReduce and the Google File System (GFS), implemented in Java and improved by volunteer developers worldwide. Its subprojects include HDFS, MapReduce, HBase, Hive, and others. HDFS is a distributed file system that underpins Hadoop's performance by providing high-throughput access to application data. MapReduce is a software framework that performs distributed computation over huge amounts of data on clusters. Although Hadoop is widely used, it has some defects that affect its performance, one of which is its poor handling of small files. Hadoop Archives and sequence files are two existing solutions to the small files problem, but each has its own shortcomings. This paper proposes a solution that is expected to combine their merits and give Hadoop better performance in handling small files.
Source
Information Technology (《信息技术》), 2015, No. 10, pp. 142-144, 148 (4 pages)
About the Author
Ai Ming (1989-), male, master's degree candidate; research interest: cloud computing.