Explore Your Data
Concept Overview
Schema
When we upload a dataset or connect to a dataset, Dataiku DSS detects the columns with their names and types. While uploading a dataset, we can preview it to see the columns and types. We can find the Schema tab within the Settings tab of a dataset.

When running recipes in the Flow, DSS asks if you want to update the schema. This is because the output dataset’s schema changes whenever a recipe modifies the columns, for example by parsing dates or creating new computed columns. In most cases, you will update the schema.
Storage Type and Meaning
You might be wondering why there are two kinds of “types”.
The storage type indicates how the dataset backend should store the column data, and how many bytes will be allocated to store these values. Common storage types are string, integer, float, boolean, and date.
Meanwhile the meaning gives a “rich” semantic label to the data type. Meanings are automatically detected from the contents of the columns, but you can also define custom meanings. Meanings have high-level definitions such as url, ip address, or country. Each meaning is able to validate a cell value. Therefore each cell can be valid or invalid for a given meaning.
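The idea that a meaning can mark each cell as valid or invalid can be sketched in plain Python. This is only an illustration of the concept, not how DSS implements meanings: the `VALIDATORS` registry, the function names, and the validation rules are all assumptions made for this example.

```python
import ipaddress
from urllib.parse import urlparse


def is_valid_ip(value: str) -> bool:
    """Return True if the cell parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def is_valid_url(value: str) -> bool:
    """Return True if the cell has an http(s) scheme and a host part."""
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


# Hypothetical registry mapping a meaning name to its cell validator.
VALIDATORS = {"ip address": is_valid_ip, "url": is_valid_url}


def validate_column(cells, meaning):
    """Flag every cell as valid (True) or invalid (False) for a meaning."""
    check = VALIDATORS[meaning]
    return [check(cell) for cell in cells]


flags = validate_column(["192.168.0.1", "not-an-ip", "::1"], "ip address")
print(flags)
```

The storage type of all three cells is still string; only the meaning decides which of them count as valid.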

Storage types and meanings are related. Both constrain the values that the column can contain and are useful in managing data in different ways. You can find the storage type and meaning of each column in the Dataset view, when importing a dataset, and in the Explore tab for any dataset in your project.
The storage type of a column affects its ability to serve as a key column when joining two datasets. For example, a string column in one dataset cannot be matched as a join key against an integer column in another dataset; the storage types of the two key columns need to agree.
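The same problem shows up outside DSS whenever keys of different types are compared. The sketch below uses a hand-written hash join over lists of dictionaries; the datasets, column names, and the `inner_join` helper are invented for illustration and say nothing about how DSS executes joins.

```python
# Two toy datasets: the key is stored as a string in one, an integer in the other.
orders = [
    {"customer_id": "42", "amount": 10.0},
    {"customer_id": "7", "amount": 3.5},
]
customers = [
    {"customer_id": 42, "name": "Ada"},
    {"customer_id": 7, "name": "Lin"},
]


def inner_join(left, right, key):
    """Minimal hash join: index the right side by key, then probe with the left."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        for match in index.get(row[key], []):
            joined.append({**match, **row})
    return joined


# "42" and 42 are different values, so nothing matches.
mismatched = inner_join(orders, customers, "customer_id")

# Casting the string key to an integer first makes the keys comparable.
orders_cast = [{**row, "customer_id": int(row["customer_id"])} for row in orders]
aligned = inner_join(orders_cast, customers, "customer_id")

print(len(mismatched), len(aligned))
```

Aligning the type of the key column before the join is the fix in this sketch, and the same reasoning applies to the storage types of key columns.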
In the Explore tab of a dataset, DSS displays a context-sensitive menu for each column, depending on the values it contains. For example, a column of unparsed dates and a column of natural-language text will each offer their own relevant transformation options.

When the DSS-detected meaning does not reflect the values in the column, you might want to select a less restrictive meaning. For example, you could change the meaning from “integer” to “text” when some of the values in the column contain text.

Sampling
Sampling allows for immediate visual feedback while exploring data no matter how large the dataset. There are a number of different sampling methods available, aside from the default first 10,000 rows. The same sampling principle applies to visualization (Charts) and data prep (Prepare recipe).

Exploring very large datasets can be unwieldy, as even simple operations can be expensive, both in terms of computational resources and time. The approach DSS takes to solving this problem is to display only a sample when exploring and preparing data.
The default sample for any dataset is the first 10,000 rows. Although it is the fastest method, the sample may be biased depending on the composition of the dataset. Depending on your needs, many other sampling strategies, such as random, stratified, or class rebalancing, are available. The tradeoff for a potentially more representative sample is the time needed for DSS to make a full pass or sometimes two full passes of the data.
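The tradeoff between a first-rows sample and a random sample can be sketched as follows. Reservoir sampling is used here as a standard stand-in for "random sampling in one full pass"; it is an assumption for illustration and not necessarily the algorithm DSS uses. The function names and the toy dataset are also made up.

```python
import random
from itertools import islice


def head_sample(rows, n):
    """First-n sample: stops reading as soon as n rows have been seen."""
    return list(islice(rows, n))


def reservoir_sample(rows, n, seed=0):
    """Uniform random sample of n rows; has to read every row once."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < n:
            sample.append(row)
        else:
            # Keep the new row with probability n / (i + 1).
            j = rng.randint(0, i)
            if j < n:
                sample[j] = row
    return sample


# Stand-in for a large dataset that happens to be sorted.
rows = range(100_000)
head = head_sample(iter(rows), 10_000)
rand = reservoir_sample(iter(rows), 10_000)

# The head sample never sees anything past row 9,999; the random one does.
print(max(head), max(rand) > 9_999)
```

On sorted data the head sample only covers the smallest values, which is the kind of bias described above, while the random sample pays for its coverage with a full pass over the data.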
The main purpose of sampling is to provide immediate visual feedback while exploring and preparing the dataset, no matter how large it may be. This means that because DSS is only viewing a relatively small sample of the data, you can very quickly sort the sample by a column, apply a filter, display column distributions, color columns by values, and view summary statistics.
Analyze
From the Explore tab of a dataset, you can begin to investigate the values of any column in your dataset using the Analyze window. You can access the Analyze window from the context menu of a column header. By default, DSS calculates statistics shown in the Analyze window using the dataset sample.

Data quality is one area into which the Analyze window provides insight. It reveals the number of valid, invalid, and empty values, as well as those values which appear only once.
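As a rough sketch of the counts described above, the following computes empty, valid, invalid, and appears-only-once tallies for one column. The `quality_summary` helper, the "integer" validity rule, and the treatment of whitespace-only cells as empty are assumptions for this example, not the DSS definitions.

```python
from collections import Counter


def looks_like_integer(cell):
    """Assumed validity rule for an 'integer' meaning."""
    try:
        int(cell)
        return True
    except ValueError:
        return False


def quality_summary(cells, is_valid):
    """Count empty, valid, and invalid cells, plus values that appear only once."""
    non_empty = [c for c in cells if c is not None and c.strip() != ""]
    counts = Counter(non_empty)
    valid = sum(1 for c in non_empty if is_valid(c))
    return {
        "empty": len(cells) - len(non_empty),
        "valid": valid,
        "invalid": len(non_empty) - valid,
        "unique": sum(1 for n in counts.values() if n == 1),
    }


summary = quality_summary(["1", "2", "2", "", "abc", None], looks_like_integer)
print(summary)
```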
For numeric columns, the window plots a histogram and a boxplot of the distribution. For categorical columns, it plots a bar chart sorted by the most frequent values.
The window also provides summary statistics, counts of the most frequent values, and recognition of outliers.
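A minimal sketch of summary statistics with outlier flagging is shown below. It uses the common 1.5 × IQR rule, which is an assumption chosen for illustration; the source does not say which rule DSS applies. The sample values are invented.

```python
import statistics

# Invented numeric column with one obviously extreme value.
values = [10, 12, 11, 13, 12, 11, 10, 95]

mean = statistics.mean(values)
median = statistics.median(values)

# Quartiles via the standard library (default "exclusive" method).
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Assumed rule: anything beyond 1.5 * IQR from the quartiles is an outlier.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < low_fence or v > high_fence]

print(mean, median, outliers)
```

The gap between the mean and the median in this sample is itself a hint that an extreme value is pulling the distribution, which the boxplot in the Analyze window would show visually.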
By default, these statistics are calculated from the current sample configured in the Explore tab. It is also possible, however, to compute them on the whole dataset.
Charts
Visualization is a key tool in the data exploration and discovery process. To meet this need, the Charts tab of a DSS dataset houses a drag-and-drop interface for visual exploration. Many different types of charts are natively available including bar charts, line graphs, pivot tables, and scatterplots.

The Chart builder has many other features to assist in the exploration of your data. For example, with time series, you can zoom in on different periods, change the aggregated date interval, explore multiple series within the same chart, examine them side-by-side in subcharts, or create basic animations.
When working with large numbers of groups of categorical data, you can easily control the number of displayed values by grouping less-prevalent categories into an “other” bucket. You can also drill down into a dataset by adding filters to the chart from a tooltip. By default, charts in DSS use the same sample found in the Explore tab. You can also select an execution engine when working with certain types of datasets, such as those stored in SQL databases. Such a chart can be executed in-database to improve performance.
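Grouping less-prevalent categories into an “other” bucket can be sketched like this. The `bucket_rare` helper and its parameters are hypothetical; in DSS this is a chart setting, not code you write.

```python
from collections import Counter


def bucket_rare(values, max_categories, other_label="other"):
    """Keep the most frequent categories and relabel the rest as 'other'."""
    keep = {value for value, _ in Counter(values).most_common(max_categories)}
    return [value if value in keep else other_label for value in values]


colors = ["red"] * 5 + ["blue"] * 3 + ["green"] + ["teal"]
bucketed = Counter(bucket_rare(colors, max_categories=2))
print(bucketed)
```

With `max_categories=2`, the two rare colors collapse into one bar, so the chart shows three groups instead of four.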