How Is Big Data Helping Flipkart Achieve Its Milestones?
Flipkart, one of the world's largest e-commerce platforms, uses analytics and algorithms to gain better insight into its business during sales and festival seasons. This article explains how Flipkart leverages its Big Data platform to process data in both streams and batches. This service-oriented architecture improves the user experience, optimizes logistics, and refines product listings. It will give you an insight into how this ingenious big data platform is able to process such large amounts of data. Before starting, I recommend checking this Big Data guide to better understand the core concepts.
Big Data at Flipkart
The Flipkart Data Platform (FDP) is a service-oriented architecture capable of computing both batch data and streaming data. The platform comprises various micro-services that improve the user experience through efficient product listings and price optimization, and it maintains several types of data stores – Redis, HBase, SQL, etc. The FDP stores around 35 petabytes of data and manages 800+ Hadoop nodes on the server. This is just a brief overview of how Big Data is helping Flipkart. Below is a detailed explanation of the Flipkart data platform architecture that will help you understand the process better.
The Architecture of Flipkart Data Platform
To know how Flipkart is using Big Data, you need to understand the flow of data through Flipkart's data platform architecture, which is explained in the flow chart below –
[Figure: Flipkart data platform architecture]
Let’s take a tour to the complete process of how Flipkart works on Big Data. Starting with the FDP ingestion system –
1. FDP Ingestion System
A Big Data ingestion system is the first place where all the variables start their journey into the data system. Ingestion is the process of importing and storing data in a database. Data can be taken in either as batches or as real-time streams. Simply speaking, a batch is a collection of data points grouped within a specific time interval, whereas streaming deals with a continuous flow of data. Batch data therefore has much higher latency than streaming data, which is processed in sub-seconds. There are three ways in which ingestion can be performed –
- **Specter** – a Java library used to send drafts (payloads) to Kafka.
- **Dart Service** – a REST service that allows payloads to be sent over HTTP.
- **File Ingestor** – a CLI tool used to dump data into HDFS.
Then the user creates a schema, for which a corresponding Kafka topic is created. Using Specter, data is then ingested into the FDP. The payloads landing in HDFS files are stored in the form of Hive tables.
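To make the ingestion flow above concrete, here is a minimal sketch in plain Python. It is purely illustrative: the schema, topic name, and payload fields are invented, and an in-memory dictionary stands in for Kafka.

```python
import json

# Hypothetical illustration of the ingestion flow: the user registers a
# schema, a "topic" buffer is created for it, and incoming payloads are
# validated before being queued (the dict stands in for Kafka).
topics = {}  # topic name -> schema + accepted messages

def register_schema(name, fields):
    """Create a topic for a user-defined schema (field -> expected type)."""
    topics[name] = {"fields": fields, "messages": []}

def ingest(topic, payload):
    """Validate a payload against the topic's schema and enqueue it."""
    schema = topics[topic]["fields"]
    for field, ftype in schema.items():
        if not isinstance(payload.get(field), ftype):
            raise ValueError(f"field {field!r} must be {ftype.__name__}")
    topics[topic]["messages"].append(json.dumps(payload))

register_schema("orders", {"order_id": int, "amount": float})
ingest("orders", {"order_id": 101, "amount": 499.0})
print(len(topics["orders"]["messages"]))  # 1
```

In the real platform the validated payload would be produced to the schema's Kafka topic rather than appended to a list, but the schema-first contract is the key idea.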
2. Batch Compute
This part of the big data ecosystem computes and processes data that arrives in batches. Batch compute is an efficient method for processing large-scale data collected as transactions over a period of time. These batches can be computed at the end of the day, when the data has been collected in large volumes, and processed only once. This is the time to explore Big Data as much as possible – here is a free Big Data tutorial series that will help you master the technology.
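A toy version of such an end-of-day batch job can be sketched as a single pass over the accumulated transactions. The transaction records and product names below are invented for illustration; a real job would run on Hadoop/Spark over far larger volumes.

```python
from collections import defaultdict

# Illustrative batch job: transactions collected over a day are
# processed once, producing per-product revenue totals.
transactions = [
    {"product": "phone", "amount": 12000.0},
    {"product": "book", "amount": 350.0},
    {"product": "phone", "amount": 11500.0},
]

def batch_revenue(batch):
    """Aggregate one day's batch into revenue per product."""
    totals = defaultdict(float)
    for txn in batch:
        totals[txn["product"]] += txn["amount"]
    return dict(totals)

print(batch_revenue(transactions))
# {'phone': 23500.0, 'book': 350.0}
```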
3. Streaming Platform
Streaming platforms process data as it is generated, in sub-seconds. Apache Flink is one of the most popular real-time streaming platforms used to produce fast-paced analytical results. It provides distributed, fault-tolerant, and scalable data-streaming capabilities that industries can use to process millions of transactions at a time with very low latency.
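The core streaming primitive here is windowed aggregation. The plain-Python sketch below shows a tumbling (non-overlapping) time window over a simulated event stream – a toy stand-in for what Flink does in a distributed, fault-tolerant way; the event names and window size are invented.

```python
# Toy tumbling-window aggregation, illustrating the kind of windowed
# computation a streaming platform like Flink performs at scale.
def tumbling_window_counts(events, window_ms):
    """events: (timestamp_ms, key) pairs in time order.
    Returns one {key: count} dict per closed window."""
    counts, window_start, current = [], None, {}
    for ts, key in events:
        if window_start is None:
            window_start = ts
        if ts - window_start >= window_ms:  # window closed: emit and reset
            counts.append(current)
            current, window_start = {}, ts
        current[key] = current.get(key, 0) + 1
    if current:  # emit the final, partially filled window
        counts.append(current)
    return counts

events = [(0, "click"), (100, "click"), (600, "buy"), (1100, "click")]
print(tumbling_window_counts(events, 500))
# [{'click': 2}, {'buy': 1}, {'click': 1}]
```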
4. Messaging Queue
A messaging queue acts as a buffer, or temporary storage system, for messages when the destination is busy or not connected. A message can be a plain message, a byte array with headers, or a prompt that commands the messaging queue to process a task. The messaging queue architecture has two components – the producer and the consumer. A producer generates messages and delivers them to the messaging queue. A consumer is the end destination of the message, where the message is processed.
The most popular tool used in messaging queues is Kafka. Apache Kafka is an open-source stream-processing software system that is heavily inspired by transaction logs. Thousands of companies around the world use Kafka as their primary platform for buffering messages. Its scalability, fault tolerance, reliability, and durability make it an ideal choice for industry professionals.
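The producer/consumer roles above can be sketched with Python's standard-library `queue` as a stand-in buffer. This is a minimal single-process illustration, not Kafka itself: the bounded queue plays the broker, and the `None` sentinel is an invented end-of-stream marker.

```python
import queue
import threading

# Minimal producer/consumer pair: the bounded queue buffers messages
# when the consumer is busy, mirroring the roles described above.
buffer = queue.Queue(maxsize=100)  # temporary storage between the two sides
results = []

def producer():
    for i in range(5):
        buffer.put(f"message-{i}")  # deliver messages to the queue
    buffer.put(None)                # sentinel: no more messages

def consumer():
    while True:
        msg = buffer.get()          # blocks until a message arrives
        if msg is None:
            break
        results.append(msg.upper()) # "process" the message at the destination

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(len(results), results[0])  # 5 MESSAGE-0
```

Unlike this toy, Kafka persists the message log durably, so consumers can replay it and survive crashes – which is why it suits a platform of Flipkart's scale.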
5. Real-time Serving
The real-time serving system acts as a consumer of the messaging queue, retrieving messages from it. With the help of this real-time serving platform, users can gather real-time insights from the data platform. Furthermore, real-time serving lets users access the data through dynamic pipelines.
6. Data Lake
The core component of this architecture is the data storage platform. This is a **Hadoop platform** that stores raw data, journaled data, and derived data. Here, data is stored as backups and archives that can be retrieved or purged according to requirements. The raw data is used mostly by data scientists, who draw on insights from the original data to make decisions and develop data products. The data arrives as batches or real-time streams; the real-time data takes the form of click streams, summarized reports of user data, product insights, reviews, etc.
From the data lake, data is transferred along three main routes –
a. Reports
The reports are generally produced from batch data. They give comprehensive insight into website logs, daily website readings, and other metrics. With the help of these reports, companies like Flipkart are able to quantify market needs.
b. Ad hoc Query
An ad hoc query is designed for a specific purpose or use. The ad hoc queries run against the data lake are handled by data analysts, who use various **business intelligence tools** to discover meaning in the data.
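For a feel of what an ad hoc query looks like, here is a small sketch using an in-memory SQLite table as a stand-in for the data lake. The table, cities, and amounts are invented; in practice an analyst would run this kind of SQL through a BI tool against Hive or a warehouse.

```python
import sqlite3

# Simulated data lake: an in-memory table of orders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (city TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("Delhi", 500.0), ("Mumbai", 700.0), ("Delhi", 300.0)])

# Ad hoc question: total order value per city, highest first.
rows = conn.execute(
    "SELECT city, SUM(amount) FROM orders GROUP BY city ORDER BY 2 DESC"
).fetchall()
print(rows)  # [('Delhi', 800.0), ('Mumbai', 700.0)]
```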
c. Batch Export
This part of the data platform takes data from the data lake and exports it in various formats to further processing platforms. The data is exported in large bulk volumes.
7. Knowledge Graphs
Knowledge graphs represent an inter-linked network of real-world entities or objects from which we can extract information and process it efficiently. The knowledge graph takes its input from metadata, which is beneficial for understanding the underlying semantics used to derive new facts. It also makes use of various machine learning tools and libraries to gain insights and understand the relationships between objects. One of the most popular tools used for building such graphs is Apache Spark's GraphX library.
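The idea of deriving new facts by chaining relations can be shown with a tiny toy graph. The entities and relation names below are invented for illustration; GraphX models the same structure as distributed vertex/edge datasets.

```python
# Tiny knowledge-graph sketch: entities linked by labelled relations,
# with a lookup that chains relations to derive a new fact.
graph = {
    ("phone-x", "category"): "smartphones",
    ("phone-x", "brand"): "acme",
    ("smartphones", "parent_category"): "electronics",
}

def related(entity, relation):
    """Follow one labelled edge from an entity, if it exists."""
    return graph.get((entity, relation))

# Derive a new fact by chaining: phone-x -> category -> parent_category
category = related("phone-x", "category")
print(related(category, "parent_category"))  # electronics
```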
Learn everything about Apache Spark and master the technology
Summary
Hope you now understand how Big Data is helping Flipkart offer the best services around the world. In this article, we looked at the ingenious big data platform designed by Flipkart to handle large-scale data transactions. We saw how Flipkart uses various big data components to deliver dynamic results to the user, and how the platform processes large-scale data queries to produce those results. Still, if you want to ask anything about this, feel free to do so through the comments – I will be happy to help you. Here is our next article, which you must read – Top real-time big data applications.
Keep learning🙂