Hadoop provides several ways of accessing HDFS
All of the following support almost all features of the filesystem:
- FileSystem (FS) shell commands: Provide easy access to Hadoop file system operations as well as to other file systems that Hadoop supports, such as Local FS, HFTP FS, and S3 FS. This requires a Hadoop client to be installed, and the client writes blocks directly to one DataNode. Not all versions of Hadoop support all options for copying between filesystems.
- WebHDFS: Defines a public HTTP REST API that lets clients access Hadoop from multiple languages without installing Hadoop. The advantage is that it is language agnostic (curl, PHP, etc.). WebHDFS needs access to all nodes of the cluster, and when data is read it is transmitted directly from the source node. HTTP adds overhead compared with the FS shell, but WebHDFS works from any language and has no problems across different Hadoop clusters and versions.
- HttpFS: Reads and writes data to HDFS in a cluster behind a firewall. A single node acts as a gateway through which all data is transferred. Performance-wise I believe this can be even slower, but it is preferred when data must be pulled from a public source into a secured cluster.
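To make the "language agnostic" point about WebHDFS concrete, here is a minimal Python sketch that only builds the REST URLs a client would request. The host name and paths are hypothetical, and the default port 50070 is the NameNode HTTP default mentioned later in this post; check your own cluster's configuration before relying on it.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, path, op, port=50070, **params):
    """Build a WebHDFS REST URL: http://<host>:<port>/webhdfs/v1/<path>?op=<OP>.

    host and path are placeholders supplied by the caller; port 50070 is
    assumed to be the NameNode HTTP port (the default quoted in this post).
    """
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    query = urlencode({"op": op, **params})
    # quote() keeps "/" intact and percent-encodes spaces and similar characters.
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Hypothetical NameNode host and paths, for illustration only.
print(webhdfs_url("namenode.example.com", "/user/alice", "LISTSTATUS"))
print(webhdfs_url("namenode.example.com", "/user/alice/data.csv", "OPEN"))
```

Any HTTP client (curl, a browser, a PHP script) can issue the same requests, which is why no Hadoop installation is needed on the client side.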
Cloudera Doc about HttpFS
hdfs vs webhdfs Q&A from Cloudera community
1. Which one will be faster?
The native protocol of HDFS is hdfs:// and this is the fastest type (purely TCP, with efficient data packet transfers). Other protocols such as webhdfs:// or the deprecated hftp:// add overheads due to their HTTP usage that make them slower overall.
2. Can we use one protocol at the source and another at the destination (I mean a combination of both)?
3. When should we use webhdfs in particular?
Yes to (2).
See http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_admin_distcp_da... for (3).
Rule of thumb is:
- Use webhdfs:// for the source when it's a different major version (such as a CDH4 source to a CDH5 target).
- Use hdfs:// otherwise, when the major version is the same (such as between any CDH 5.x).
- Prefer webhdfs:// over hftp://, unless it's a very old version (pre CDH3u5) that has no WebHDFS support.
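The rule of thumb above can be sketched as a small Python helper. The function names, host names, and the choice to run DistCp with an hdfs:// destination are illustrative assumptions, not part of the quoted answer; the ports are the defaults mentioned elsewhere in this post.

```python
def source_scheme(source_major, target_major, source_has_webhdfs=True):
    """Pick the source URI scheme for DistCp following the rule of thumb."""
    if source_major == target_major:
        return "hdfs"      # same major version: native protocol, fastest
    if source_has_webhdfs:
        return "webhdfs"   # different major versions: HTTP-based, version agnostic
    return "hftp"          # very old source without WebHDFS support


def distcp_command(src_host, dst_host, path, source_major, target_major,
                   source_has_webhdfs=True):
    """Build a `hadoop distcp <source> <target>` argument list.

    Assumes the job runs on the target cluster, so the destination is always
    hdfs://, and assumes default ports (8020 for hdfs, 50070 for HTTP-based).
    """
    scheme = source_scheme(source_major, target_major, source_has_webhdfs)
    src_port = 8020 if scheme == "hdfs" else 50070
    return [
        "hadoop", "distcp",
        f"{scheme}://{src_host}:{src_port}{path}",
        f"hdfs://{dst_host}:8020{path}",
    ]


# Hypothetical hosts: a CDH4 source copied to a CDH5 target.
print(" ".join(distcp_command("old-nn", "new-nn", "/data/logs", 4, 5)))
```

This also illustrates the "Yes" to question 2: the source and destination URIs in one DistCp command may use different protocols.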
4. Will there be any speed difference in transfer between these protocols?
Yes. This is also a repeat of (1), which I've answered above.
5. Which port numbers are needed when using these (somewhere I saw commands with 50070 and 8020; when should each be used)?
Follow the CDH5 ports guide at http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_ports_cdh5.h... to find the right ports for your environment. The defaults are used in the statements below.
HDFS native protocol transfers require every host on the DistCp job cluster (usually the target) to be able to talk to the source's port 8020 (on the NameNode(s)) and ports 50010/1004 and 50020 (across all DataNodes).
WebHDFS or HFTP transfers (the HTTP-based protocols) require every host on the DistCp job cluster (usually the target) to be able to talk to the source's port 50070 (on the NameNode(s)) and port 50075/1006 (across all DataNodes).
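As a rough aid for checking these firewall requirements, the following Python sketch records the default ports listed above and tests whether a TCP connection can be opened. Reading the pairs `50010/1004` and `50075/1006` as non-secure/secure alternatives is an assumption made here, and the function names are hypothetical; confirm the actual ports against the ports guide for your environment.

```python
import socket

# Default ports from the text above. The "datanode_secure" entries assume that
# 1004 and 1006 replace 50010 and 50075 on secure clusters (an interpretation).
_PORTS = {
    "hdfs": {"namenode": [8020], "datanode": [50010, 50020],
             "datanode_secure": [1004, 50020]},
    "http": {"namenode": [50070], "datanode": [50075],
             "datanode_secure": [1006]},
}


def required_ports(protocol, secure=False):
    """Return the source-cluster ports a DistCp host must be able to reach."""
    if protocol not in ("hdfs", "webhdfs", "hftp"):
        raise ValueError(f"unknown protocol: {protocol}")
    entry = _PORTS["hdfs" if protocol == "hdfs" else "http"]
    datanode_key = "datanode_secure" if secure else "datanode"
    return {"namenode": list(entry["namenode"]),
            "datanode": list(entry[datanode_key])}


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print(required_ports("hdfs"))
print(required_ports("webhdfs", secure=True))
```

Running `can_connect` from each host of the DistCp job cluster against the source NameNode and every source DataNode gives a quick indication of whether the chosen protocol will be able to transfer data.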