015 Distributed Cache in Hadoop: Most Comprehensive Guide
1. Distributed Cache in Hadoop: Objective
In this blog about the Hadoop distributed cache, you will learn what the distributed cache in Hadoop is, how it works, and how it is implemented in the Hadoop framework. This tutorial also covers the advantages and limitations of the Apache Hadoop distributed cache.
2. Introduction to Hadoop
Apache Hadoop is an open-source software framework for the distributed storage and processing of large data sets. Hadoop follows a master-slave architecture, in which the master is the NameNode and the slaves are the DataNodes. The NameNode stores metadata, i.e. the number of blocks, their locations, and their replicas. The DataNodes store the actual data in HDFS and perform read and write operations as requested by the client.
In Hadoop, data chunks are processed in parallel across the DataNodes, using a program written by the user. If we want to access some file from all the DataNodes, we put that file into the distributed cache.
3. What is Distributed Cache in Hadoop?
Distributed Cache is a facility provided by the Hadoop MapReduce framework. It caches files when applications need them. It can cache read-only text files, archives, jar files, etc. Once we have cached a file for our job, Hadoop makes it available on each DataNode where map/reduce tasks are running.
Thus, we can access the file from all the DataNodes in our map and reduce tasks.
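As a rough illustration of why this is useful, the sketch below shows in plain Java (no Hadoop dependency) what a task typically does with its local copy of a cached file: parse it once into an in-memory table, then consult that table for every record it processes. The class name, the tab-separated layout, and the sample data are assumptions made for this example and do not come from Hadoop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CachedLookupJoin {

    // Parse "code<TAB>name" lines, as a task might do once with its local
    // copy of the cached file before it processes any records.
    public static Map<String, String> parseLookup(List<String> lines) {
        Map<String, String> lookup = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) {
                lookup.put(parts[0], parts[1]);
            }
        }
        return lookup;
    }

    // Replace the code in each "code<TAB>amount" record with the name found
    // in the lookup table. Records with an unknown code are skipped.
    public static List<String> join(List<String> records, Map<String, String> lookup) {
        List<String> out = new ArrayList<>();
        for (String record : records) {
            String[] fields = record.split("\t", 2);
            String name = lookup.get(fields[0]);
            if (name != null && fields.length == 2) {
                out.add(name + "\t" + fields[1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical contents of the cached side file.
        Map<String, String> lookup =
                parseLookup(Arrays.asList("IN\tIndia", "US\tUnited States"));
        // Hypothetical input records handled by one task.
        List<String> joined = join(Arrays.asList("IN\t42", "FR\t7", "US\t13"), lookup);
        joined.forEach(System.out::println);
    }
}
```

Because every node holds its own copy of the small file, each task can do this lookup locally instead of fetching the file over the network for every record.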
3.1. Working and Implementation of Distributed Cache in Hadoop
First of all, an application that needs to use the distributed cache to distribute a file should:
- Make sure that the file is available.
- Make sure that the file can be accessed via a URL. The URL can be either hdfs:// or http://.
Once the file is available at such a URL, the user registers it as a cache file with the distributed cache. The MapReduce job then copies the cache file to all the nodes before any tasks start on those nodes.
The process is as follows:
- Copy the requisite file to HDFS (this assumes jar_file.jar is in the current local directory):
$ hdfs dfs -put jar_file.jar /user/dataflair/lib/jar_file.jar
- Set up the application's JobConf:
DistributedCache.addFileToClassPath(new Path("/user/dataflair/lib/jar_file.jar"), conf);
- Add this call in the Driver class.
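The steps above can be sketched as a complete driver. In Hadoop 2 and later the `DistributedCache` class is deprecated in favor of equivalent methods on `Job`, so this sketch uses `Job.addFileToClassPath` and `Job.addCacheFile`. It assumes the Hadoop client libraries are on the classpath; the class name `CacheDriver`, the job name, and the side file `/user/dataflair/lookup.txt` are hypothetical and only serve the example.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CacheDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "distributed-cache-example");
        job.setJarByClass(CacheDriver.class);

        // Put the jar that was copied to HDFS on the classpath of every task.
        job.addFileToClassPath(new Path("/user/dataflair/lib/jar_file.jar"));

        // Hypothetical read-only side file. The "#lookup" fragment names the
        // symlink created in each task's working directory.
        job.addCacheFile(new URI("/user/dataflair/lookup.txt#lookup"));

        // No mapper or reducer is set, so Hadoop's identity defaults apply.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Inside a mapper or reducer, `context.getCacheFiles()` returns the URIs registered here, and the file can be opened locally through the symlink name `lookup`.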
3.2. Size of Distributed Cache in Hadoop
With the cache size property in mapred-site.xml, it is possible to control the size of the distributed cache. By default, the size of the Hadoop distributed cache is 10 GB.
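The text does not name the property. As an assumption based on the mapred-default.xml of Hadoop 1.x, where it is called `local.cache.size` and takes a value in bytes, an override might look like the fragment below. Later MapReduce releases list it as `mapreduce.tasktracker.cache.local.size`, so check the mapred-default.xml shipped with your version before relying on either name.

```xml
<!-- mapred-site.xml: sketch only; property name assumed from Hadoop 1.x -->
<configuration>
  <property>
    <name>local.cache.size</name>
    <!-- 10 GB expressed in bytes: 10 * 1024 * 1024 * 1024 -->
    <value>10737418240</value>
  </property>
</configuration>
```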
4. Benefits of Distributed Cache in Hadoop
Below are some advantages of the MapReduce Distributed Cache:
4.1. Store Complex Data
It distributes simple, read-only text files as well as complex types such as jars and archives. These archives are then un-archived at the slave node.
4.2. Data Consistency
Hadoop Distributed Cache tracks the modification timestamps of the cache files, and the files should not be modified while a job is executing. More generally, in a key-value distributed cache, a hashing algorithm lets the cache engine always determine on which node a particular key-value pair resides. Since there is always a single state of the cache cluster, it is never inconsistent.
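To make the hashing remark concrete, here is a minimal sketch of how a key-value cache could map a key to a node. It describes key-value caches in general, not the way Hadoop ships cache files to nodes. Plain modulo hashing is used as a simple stand-in (real systems often prefer consistent hashing), and the class and node names are made up for the example.

```java
import java.util.Arrays;
import java.util.List;

public class KeyToNode {

    // Map a key to one of the nodes. The same key and the same node list
    // always give the same answer, so any client can locate a key on its own.
    public static String nodeFor(String key, List<String> nodes) {
        // floorMod keeps the index non-negative even for negative hash codes.
        int index = Math.floorMod(key.hashCode(), nodes.size());
        return nodes.get(index);
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("node-0", "node-1", "node-2");
        for (String key : Arrays.asList("user:42", "order:7", "session:abc")) {
            System.out.println(key + " -> " + nodeFor(key, nodes));
        }
    }
}
```

A known weakness of this simple scheme is that changing the number of nodes moves most keys to a different node.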
4.3. No Single Point of Failure
A distributed cache runs as an independent process across many nodes. Thus, the failure of a single node does not result in a complete failure of the cache.
5. Overhead of Distributed Cache
A MapReduce distributed cache has overhead that will make it slower than an in-process cache:
5.1. Object serialization
A distributed cache must serialize objects. But the serialization mechanism has two major problems:
- Very slow: Serialization uses reflection to inspect type information at runtime. Reflection is very slow compared to pre-compiled code.
- Very bulky: Serialization stores the complete class name and other type metadata. It also stores references to other instances held in member variables. All this makes the serialized form very bulky.
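The "very bulky" point can be observed with a small experiment. The sketch below serializes the same object twice, once with Java's built-in object serialization and once by writing only the field values by hand, and compares the byte counts. The `User` class is a hypothetical record type invented for this comparison, and the exact size of the built-in form depends on the class and the JVM.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class SerializationSize {

    // Hypothetical record type used only for this comparison.
    public static class User implements Serializable {
        private static final long serialVersionUID = 1L;
        final int id;
        final String name;

        public User(int id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    // Size in bytes with Java's built-in object serialization, which also
    // writes a stream header and a description of the class.
    public static int javaSerializedSize(User user) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(user);
            }
            return bytes.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Size in bytes when only the field values are written by hand.
    public static int compactSize(User user) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                out.writeInt(user.id);   // 4 bytes
                out.writeUTF(user.name); // 2-byte length prefix + encoded text
            }
            return bytes.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        User user = new User(7, "alice");
        System.out.println("Java serialization:  " + javaSerializedSize(user) + " bytes");
        System.out.println("Hand-written fields: " + compactSize(user) + " bytes");
    }
}
```

The hand-written form is smaller because it carries no class description, at the cost of the reader having to know the field layout in advance.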
6. Distributed Cache in Hadoop – Conclusion
In conclusion, the distributed cache is a mechanism supported by the Hadoop MapReduce framework. Using the distributed cache, we can broadcast small or moderately sized read-only files to all the worker nodes. The distributed cache files are deleted from the worker nodes once the job has run successfully.
If you like this post or have any query about Hadoop Distributed Caching, do leave a comment.