CREATE THE DATASET
Concept Overview
This section introduces three important concepts:
Datasets
Partitioning
Connections
Datasets in DSS
A dataset in DSS can be any data in a tabular format. Examples of possible DSS datasets include:
an uploaded Excel spreadsheet
an SQL table
a folder of data files on a Hadoop cluster
a CSV file in the cloud, such as an Amazon S3 bucket
DSS represents all of these examples in the Flow of a project as a blue square, with an icon matching the type of the source dataset.
Regardless of the origin of the source dataset, you interact with every DSS dataset in the same way. You can use the same methods to read, write, visualize, and manipulate datasets in DSS. You will find the same Explore, Charts, and Statistics tabs, and the same visual, code, and plugin recipes.
This is possible because DSS decouples data processing logic (such as recipes in the Flow) from the underlying storage infrastructure of a dataset.
With the exception of directly uploading files to DSS (as done in this Basics tutorial), the DSS server does not need to ingest the entire dataset to create its representation in DSS. Generally, creating a dataset in DSS means that the user merely informs DSS of how it can access the data from a particular connection. DSS remembers the location of the original external or source datasets. The data is not copied into DSS. Rather, the dataset in DSS is a view of the data in the original system. Only a sample of the data, as configured by the user, is transferred via the browser.
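The decoupling of processing logic from storage can be sketched in plain Python. This is a conceptual illustration only, not the real dataiku API: two very different backends (a CSV source and a SQL table) are exposed through the same record structure, so downstream logic does not care where the data lives.

```python
# A minimal sketch of the decoupling idea (illustrative only, not the
# dataiku API): processing code sees one uniform record interface, while
# the storage behind it can be a CSV file, a SQL table, etc.
import csv
import io
import sqlite3

def rows_from_csv(text):
    """Read a CSV source into a list of dict records."""
    return list(csv.DictReader(io.StringIO(text)))

def rows_from_sql(conn, table):
    """Read a SQL table into a list of dict records."""
    cur = conn.execute(f"SELECT * FROM {table}")
    cols = [d[0] for d in cur.description]
    return [dict(zip(cols, row)) for row in cur.fetchall()]

# Two different storage backends holding the same logical data...
csv_data = "id,name\n1,alpha\n2,beta\n"
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?)", [(1, "alpha"), (2, "beta")])

# ...but downstream logic consumes the same record structure either way.
a = rows_from_csv(csv_data)
b = rows_from_sql(conn, "items")
```

Either list of records can now feed the same downstream transformation, which is the essence of keeping recipes independent of the storage layer.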
The following example Flow includes different types of datasets, such as an uploaded file, a table in a SQL database, and cloud storage datasets:
Partitioning
Partitioning a dataset refers to the splitting of a dataset based on one or multiple dimensions. When a dataset is partitioned, each chunk or partition of the dataset contains a subset of the data, and the partitions are built independently of each other.
When new data is added at regular intervals, such as daily, you can tell DSS to build only the partition that contains the new data.
In DSS, you can partition both file-based datasets and SQL-based datasets. For file-based datasets, the partitioning is based on the filesystem hierarchy of the dataset. For SQL-based datasets, one partition is created per unique value of the chosen column, and this generally does not involve splitting the dataset into multiple tables.
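The file-based case can be illustrated with a small, hypothetical sketch (the directory layout and names are illustrative, not DSS's actual storage format): each partition is a directory keyed by a dimension value, so one partition can be rebuilt without touching the others.

```python
# A hypothetical sketch of file-based partitioning: each partition lives
# in its own directory (here, one per day), so a single partition can be
# rebuilt independently. Paths and layout are illustrative only.
import os
import tempfile

root = tempfile.mkdtemp()

def write_partition(day, rows):
    """(Re)build exactly one partition: the directory for one day."""
    part_dir = os.path.join(root, f"day={day}")
    os.makedirs(part_dir, exist_ok=True)
    with open(os.path.join(part_dir, "data.csv"), "w") as f:
        f.write("\n".join(rows))

write_partition("2024-01-01", ["a", "b"])
write_partition("2024-01-02", ["c"])
# New data arrives for Jan 2 only; rebuild just that partition.
write_partition("2024-01-02", ["c", "d"])

partitions = sorted(os.listdir(root))
```

Note that rebuilding the Jan 2 partition leaves the Jan 1 directory untouched, which is what makes incremental builds cheap.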
You can recognize a partitioned dataset in the Flow by its distinct stacked representation.
To configure file-based partitioning for a dataset, first activate partitioning by visiting the Partitioning tab under Settings, then specify the partitioning dimensions (e.g., time).
To configure SQL-based partitioning, specify which column contains the values you want to use to logically partition the dataset.
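The SQL-based case is purely logical, as a quick sketch against a plain sqlite3 table shows (table and column names are hypothetical): the table is never physically split; each distinct value of the partitioning column simply defines one partition that can be read or rebuilt on its own.

```python
# A sketch of SQL-style logical partitioning (illustrative only): one
# logical partition per distinct value of a column, with no physical
# splitting of the table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "FR"), (2, "US"), (3, "FR")],
)

def read_partition(country):
    """Read one logical partition: all rows sharing a column value."""
    cur = conn.execute(
        "SELECT id FROM orders WHERE country = ? ORDER BY id", (country,)
    )
    return [row[0] for row in cur.fetchall()]

fr_ids = read_partition("FR")
```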
When running a recipe that builds a partitioned dataset, use the Input / Output tab of the recipe to configure which partitions from the input dataset will be used to build the desired partitions of the output, and to specify if there are any dependencies, such as a time range.
Once this is configured, select the output dataset in the Flow, then click Build to view the configured partition or partitions. The output to input mapping can be one to one, one to many, or more complex, depending on the use case. Once this is set up, you can build the Flow incrementally.
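A time-range dependency of the kind described above can be sketched as a simple mapping function (the three-day window and date format are illustrative assumptions): one output partition maps to the set of input partitions it reads from.

```python
# A hypothetical sketch of an output-to-input partition mapping: building
# the output partition for a given day depends on a "last N days"
# time-range of input partitions. The window size is illustrative.
from datetime import date, timedelta

def input_partitions_for(output_day, window=3):
    """Map one output partition to the input partitions it depends on."""
    return [
        (output_day - timedelta(days=i)).isoformat()
        for i in range(window - 1, -1, -1)
    ]

deps = input_partitions_for(date(2024, 1, 10))
```

Here the mapping is one output to many inputs; a one-to-one mapping is just the special case where the window is a single partition.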
Connections
The processing logic that acts upon a DSS dataset is decoupled from its underlying storage infrastructure. The way in which DSS manages connections helps make this possible.
You can import a new dataset in the Flow by uploading your own files or accessing data through any previously established connections, such as SQL databases, cloud storage, or NoSQL sources. You might also have plugins that allow you to import data from other, non-native sources.
While importing a dataset, you can browse connections and available file paths, and preview the dataset and its schema. Once you have done that, the user interface for exploring, visualizing, and preparing the data is the same for all kinds of datasets.
Admin users have the ability to manage connections on an instance from a centralized location. From here, they can control settings such as credentials, security settings, naming rules, and usage parameters. Admins can also establish new connections to SQL and NoSQL databases, cloud storage, and other sources. Many additional connection types are available in the Plugin Store for any non-native connections.
One benefit of this system is a clearer division of labor between those who manage data connections and those who work with data. While having some understanding of a dataset’s storage is often beneficial, particularly in cases of very large datasets, those working with data do not necessarily always need expertise in how their organization warehouses its data.
What’s next?
Now it’s your turn! In the next lesson, you’ll create your first machine learning model!