一、Big data Examples
H1N1 flu virus
Walmart. Walmart's data analysis focuses on evaluating the effectiveness of pricing strategies and advertising campaigns, and on finding ways to improve inventory management and the supply chain.
Amazon
Citibank. Big data analysis of its database of basic financial transactions can provide global insight into investments, market changes, trade patterns, and economic conditions.
Product development and sales
二、Big data's 4 Vs
Big data's 4V challenges.
Volume---data size
Variety---data formats
Velocity---data streaming speed
Veracity---data trustworthiness
三、HADOOP
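Hadoop is a software framework for distributed processing of large volumes of data. Its strengths are high reliability, high scalability, high efficiency, high fault tolerance, and low cost. It is a big data analysis engine and a reliable shared storage and analysis system.
It handles data storage, access, and analysis.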
Reading and writing data in parallel across multiple hard disks raises several challenges:
hardware failure; cost; combining the analyzed data.
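To see why parallel disks are attractive despite these challenges, a rough back-of-the-envelope calculation helps. The figures below (1 TB of data, 100 MB/s per disk, 100 disks) are illustrative assumptions, not measurements from these notes, and the function name is hypothetical.

```python
def read_time_seconds(data_bytes, transfer_rate_bps, num_disks=1):
    """Time to read data_bytes split evenly across num_disks read in parallel."""
    if num_disks < 1:
        raise ValueError("num_disks must be at least 1")
    return data_bytes / (transfer_rate_bps * num_disks)


TB = 10**12  # assumed data set size unit
MB = 10**6   # assumed transfer rate unit (per second)

# Assumed figures: 1 TB of data, 100 MB/s sustained transfer per disk.
single = read_time_seconds(1 * TB, 100 * MB)          # one disk
parallel = read_time_seconds(1 * TB, 100 * MB, 100)   # 100 disks in parallel

print(f"one disk:  {single / 3600:.2f} hours")
print(f"100 disks: {parallel / 60:.2f} minutes")
```

This idealized model ignores exactly the challenges listed above: more disks mean more hardware failures, and the per-disk results still have to be combined.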
The two major components of Hadoop are HDFS and MapReduce.
HDFS (Hadoop Distributed FileSystem) provides data storage.
MapReduce provides data analysis. It is built on two functions: a map function and a reduce function.
1) HDFS is designed for optimal performance under a WORM (Write Once, Read Many times) pattern, which is a very efficient data processing pattern. Its design treats the time to read the whole data set as more important than the time required to read just the first record.
HDFS clusters use two types of nodes: the namenode (master node) and the datanode (worker node).
The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. It stores this information on the local disk in two files: the namespace image and the edit log.
Datanodes are the workhorses of the filesystem. They store and retrieve blocks when requested by a client or the namenode, and they report back to the namenode periodically with lists of the blocks they are storing.
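The division of labor between the two node types can be sketched with a toy in-memory model. This is not Hadoop's API: the class names, method names, and block IDs below are all hypothetical, and the sketch leaves out replication policy, the namespace image, and the edit log.

```python
from collections import defaultdict


class ToyNameNode:
    """Holds metadata only: which blocks form a file, and where each block lives."""

    def __init__(self):
        self.namespace = {}                      # file path -> ordered block ids
        self.block_locations = defaultdict(set)  # block id -> datanode ids

    def add_file(self, path, block_ids):
        self.namespace[path] = list(block_ids)

    def receive_block_report(self, datanode_id, block_ids):
        # A new report replaces whatever was known about this datanode.
        for holders in self.block_locations.values():
            holders.discard(datanode_id)
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode_id)

    def locate(self, path):
        return [(b, sorted(self.block_locations[b])) for b in self.namespace[path]]


class ToyDataNode:
    """Stores the actual block contents and reports them to the namenode."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.blocks = {}  # block id -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def retrieve(self, block_id):
        return self.blocks[block_id]

    def report_to(self, namenode):
        namenode.receive_block_report(self.node_id, list(self.blocks))


nn = ToyNameNode()
dn1, dn2 = ToyDataNode("dn1"), ToyDataNode("dn2")
dn1.store("blk_1", b"hello ")
dn2.store("blk_1", b"hello ")   # second copy of the same block
dn2.store("blk_2", b"world")
nn.add_file("/demo.txt", ["blk_1", "blk_2"])
dn1.report_to(nn)
dn2.report_to(nn)

locations = nn.locate("/demo.txt")
content = b"".join(dn2.retrieve(block_id) for block_id, _ in locations)
print(locations)
print(content)
```

The point of the sketch is that the namenode never touches file contents. It answers "which blocks, on which datanodes", and the client then fetches the blocks from the datanodes themselves.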
2) MapReduce is a programming model that abstracts the analysis problem away from the stored data. It transforms the analysis problem into a computation over sets of keys and values.
Characteristics of MapReduce: a brute-force data analysis approach; a batch query processor model; support for ad hoc queries; efficient combination of many distributed systems.
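The key-value formulation can be illustrated with the classic word count, written here as a minimal single-process sketch in plain Python. It does not use Hadoop; `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical names, and the grouping step stands in for the shuffle that a real cluster would perform across machines.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: combine all values emitted for one key."""
    return word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Run mapper over (key, value) records, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return dict(reducer(k, v) for k, v in sorted(groups.items()))


lines = ["big data big clusters", "data nodes store data"]
# Input keys are line numbers; the word-count mapper ignores them.
result = run_mapreduce(enumerate(lines), map_fn, reduce_fn)
print(result)
```

Because each call to `map_fn` looks at one record and each call to `reduce_fn` looks at one key, both phases can in principle run in parallel on different machines, which is what makes the brute-force approach practical.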
3) Technical terms used in MapReduce
Seek time: the delay in locating a file on disk.
Transfer rate: the speed at which a file's data is moved.
--- It improves performance by balancing seek and transfer operations optimally.
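One way to picture this balance is to ask how large a contiguous read must be before the seek becomes a small fraction of the total time. The sketch below uses assumed figures (a 10 ms seek time, a 100 MB/s transfer rate, and a 1% target) purely for illustration, and the function name is hypothetical.

```python
def block_size_for_seek_fraction(seek_time_s, transfer_rate_bps, seek_fraction):
    """Bytes to read per seek so that seek time is seek_fraction of transfer time."""
    if not 0 < seek_fraction <= 1:
        raise ValueError("seek_fraction must be in (0, 1]")
    transfer_time_s = seek_time_s / seek_fraction
    return transfer_rate_bps * transfer_time_s


# Assumed figures: 10 ms per seek, 100 MB/s transfer, seek limited to 1% of transfer time.
block_bytes = block_size_for_seek_fraction(0.010, 100 * 10**6, 0.01)
print(f"{block_bytes / 10**6:.0f} MB per seek")
```

Under these assumptions, reads of roughly 100 MB keep seek overhead near 1%, which is why large sequential reads favor transfer rate over seek time.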
四、MapReduce VS RDBMS