Concept Summary: Prepare the Data
Recipes in DSS
Note
Recipes in DSS contain the transformation steps, or processing logic, that act upon datasets.
In the Flow, blue squares represent datasets. The yellow, orange, and red circles, on the other hand, which connect datasets to one another, represent recipes.
Keeping processing logic separate from datasets has a number of benefits:
- One is that data storage technologies rapidly change. As these winds shift, the underlying storage infrastructure of a dataset can change (for example, switching cloud providers) without impacting the processing logic found in the recipes of a Flow.
- Another is a clear sense of data lineage in a project. By looking at the Flow, you can see all actions that have been applied to the data recorded in recipes – from the raw imported data to the final output dataset.
A circle in the Flow represents a recipe, but its color represents the category of recipe. DSS recipes can be divided into visual, code, or plugin recipes.
Visual recipes (in yellow) accomplish the most common data transformation operations, such as cleaning, grouping, and filtering, through a pre-defined graphical user interface.
Instead of a pre-defined visual recipe, you are free to define your own processing logic in a code recipe (in orange), using a language such as Python, R, or SQL.
The third category of recipe is the plugin recipe (typically in red). A full discussion of plugins within DSS is outside the scope of this section, but know that they are a way for coders to extend the native capabilities of DSS.
If code recipes give you complete freedom to perform any data processing task, and visual recipes can be used and understood by everyone in your team, a plugin recipe combines these benefits by wrapping a visual interface on top of a code recipe.
Prepare Recipe
The Prepare recipe is a visual recipe in DSS that allows you to create data cleansing, normalization, and enrichment scripts in an interactive way.
This is achieved by assembling a series of transformation steps from a library of more than 90 processors. Most processors are designed to handle one specific task, such as filtering rows, rounding numbers, extracting regular expressions, concatenating or splitting columns, and much more.
In addition to directly adding steps from the processor library, you can add steps to the script in a number of other ways.
In the column context menu, DSS will suggest steps to add based on the column’s meaning. For example, DSS will suggest to remove rows with invalid values according to the column meaning.
Another method to add steps to the script is through the Analyze window. Within a Prepare recipe, the Analyze window can guide data preparation, for example merging categorical values.
You can also directly drag columns to adjust their order, or switch from the Table view to the Columns view to apply certain steps to more than one column at a time.
When adding new steps to the script, you’ll notice how the step output is immediately visible. This is possible because the step is being applied to the same sample of the dataset found in the Explore tab. The quick feedback allows you to work incrementally, quickly modifying your transformation steps.
Notice that steps in the script constitute a list of instructions. These instructions are not immediately applied to the dataset itself. For example, adding a “Delete Column” step removes that column from the step preview, but it does not actually delete the column in the dataset, as it would in a spreadsheet. Only when you choose to actually run the recipe will DSS execute the instructions on the full input dataset, and thereby produce a new output dataset.
If a script starts to grow in complexity, a number of features can help you manage them.
- You can disable steps.
- You can organize individual steps into groups of steps.
- You can add colors and comments to steps in order to send reminders to yourself and colleagues.
- You can even copy and paste steps within the same recipe or to another recipe, even if that recipe is in another project or another DSS instance.
Date Handling in DSS
Working with dates poses a number of data cleaning challenges.
There are many date formats, different time zones, and components like “day of the week” which can be difficult to extract. A human might be able to recognize that “1/5/19”, “2019-01-05”, and “1 May, 2019” are all the same date. However, to a computer, these are just three different strings.
Strings representing dates need to be parsed, so that the computer can recognize the true, unambiguous meaning of the Date. The DSS answer to this problem can be found in the Prepare recipe.
When you have a column that appears to be a Date, DSS is able to recognize it as a date. In the example below, the meaning of the first column is an unparsed date.
You could open the processor library, filter for Dates, and search for a step to help in whatever situation you may find yourself. Here, we find the Parse date processor.
You could also take advantage of how DSS suggests transformation steps based on a column’s meaning. Because DSS has identified this column as an unparsed date, it suggests adding the Parse date processor to the script. Both methods achieve the same result.
Once you have chosen the correct processor, it is just a few more clicks to select the correct settings, in this case, the format of the date and the timezone for example.
Once you have a properly parsed date, you’re on your way! DSS will suggest new steps, such as “Compute time since”, “Extract date components”, and “Filter date range”.
Formulas in DSS
Often in a Prepare recipe, you will want to create new columns based on those already present in your dataset. In the world of machine learning, this is called feature generation.
Similar to what you might find in a spreadsheet tool like Excel, DSS has its own Formula language.
It is a powerful expression language to perform calculations, manipulate strings, and much more.
From the processor library, you can add a Formula step and provide the name of the output column.
You could write simple formulas directly in the Expression box. Clicking the Edit button, however, adds a few support measures. The first is code completion. As soon as you start typing, DSS starts suggesting columns from the dataset or functions to apply. The Editor will also alert you if the formula is invalid.
The Formula language allows you to craft expressions of considerable complexity. For example, you can use:
- common mathematical functions, such as round, sum and max
- comparison operators, such as >, <, >=, <=
- logical operators, such as AND and OR
- tests for missing values, such as isBlank() or isNULL()
- string operations with functions like contains(), length(), and startsWith()
- conditional if-then statements
What’s next?
This summary reviewed the concept of recipes in DSS, and more specifically, the Prepare recipe.
Get more practice using the Prepare recipe in the following hands-on section.
网友评论