A brief overview of sampling
When a dataset is first created, a background job generates an initial sample from the first rows of the dataset. This initial sample is usually quick to generate, so you can begin working on your transformations right away. By default, each sample is 10MB in size, or the entire dataset if it is smaller than that.
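As an illustration only, the following Python sketch shows the "first rows" strategy: collect rows from the start of the file until a byte budget is reached. The function name and constant are hypothetical; the actual job runs server-side.

    SAMPLE_BYTE_BUDGET = 10 * 1024 * 1024  # default 10MB cap, per the text above

    def initial_sample(path, budget=SAMPLE_BYTE_BUDGET):
        """Collect the first rows of a dataset until the byte budget is reached."""
        rows, used = [], 0
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                size = len(line.encode("utf-8"))
                if rows and used + size > budget:
                    break  # stop before the sample would exceed the cap
                rows.append(line.rstrip("\n"))
                used += size
        return rows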
As you develop your recipe, you might need to take new samples of the data. Additional samples can be generated from the context panel on the right side of the Transformer page, where you can specify the type of sample you wish to create and initiate the job to create it. Sample jobs are independent executions that run in the background.
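The sketch below illustrates the "independent background execution" idea with a worker thread; generate_sample and its arguments are hypothetical stand-ins for the product's sample job.

    from concurrent.futures import ThreadPoolExecutor

    def generate_sample(dataset_path, sample_type):
        # Stand-in for the server-side sample job described above.
        return f"{sample_type} sample of {dataset_path}"

    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(generate_sample, "orders.csv", "random")
        # ... keep editing the recipe while the job runs in the background ...
        print(future.result())  # collect the sample when it is ready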
Recipe logic and sampling
When a sample is generated from the Samples panel, it is based on the recipe steps leading up to your current location in the recipe. For example, if your recipe joins in other datasets, those steps are executed, and the resulting sample depends on those datasets. As a result, changing a recipe step that occurs before the step where the sample was generated can invalidate your sample.
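One way to picture this invalidation rule is to fingerprint the recipe steps that precede the sampling point: if an earlier step changes, the fingerprint recorded at sampling time no longer matches. This is an illustrative sketch, not the product's internal mechanism.

    import hashlib

    def recipe_fingerprint(steps, upto):
        # Hash the recipe steps that precede the sampling point.
        prefix = "\n".join(steps[:upto])
        return hashlib.sha256(prefix.encode("utf-8")).hexdigest()

    steps = ["rename col1 -> id", "join orders on id", "filter amount > 0"]
    fp_at_sampling = recipe_fingerprint(steps, upto=2)  # sample taken after the join

    steps[0] = "rename col1 -> order_id"  # edit a step *before* the sampling point
    assert recipe_fingerprint(steps, upto=2) != fp_at_sampling  # sample is now stale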
Sample Methodologies
There are six types of samples (two of them are sketched after this list):
First rows/Initial Sample
Random
Filter-based
Anomaly-based
Stratified
Cluster-based
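As a rough illustration (not the product's implementation), here is how two of these methodologies might look in Python: a random sample drawn uniformly, and a stratified sample that draws from each group of a chosen column so that rare groups are still represented.

    import random
    from collections import defaultdict

    def random_sample(rows, k, seed=0):
        # Random: draw k rows uniformly from the scanned data.
        rng = random.Random(seed)
        return rng.sample(rows, min(k, len(rows)))

    def stratified_sample(rows, key, per_group, seed=0):
        # Stratified: sample within each group of a chosen column.
        rng = random.Random(seed)
        groups = defaultdict(list)
        for row in rows:
            groups[row[key]].append(row)
        picked = []
        for members in groups.values():
            picked.extend(rng.sample(members, min(per_group, len(members))))
        return picked

    rows = [{"country": c, "amount": i}
            for i, c in enumerate("US US US FR FR JP".split())]
    print(random_sample(rows, k=3))
    print(stratified_sample(rows, key="country", per_group=1))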
The behavior of certain sample methodologies depends on the sample type (quick scan or full scan), described in the next section.
Sample Type
There are two types of sampling: quick scan and full scan. A quick scan reads only the first 2GB of the dataset and generates samples from that limited set. A full scan samples from the entire dataset.
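A small sketch of the distinction, assuming a local file stands in for the dataset: a quick scan stops reading at the 2GB limit, while a full scan reads everything.

    QUICK_SCAN_LIMIT = 2 * 1024**3  # quick scan reads at most the first 2GB

    def scan_bytes(path, full_scan=False):
        # Quick scan stops after the limit; full scan reads the whole dataset.
        limit = None if full_scan else QUICK_SCAN_LIMIT
        read = 0
        with open(path, "rb") as f:
            while chunk := f.read(1 << 20):  # read in 1 MiB chunks
                read += len(chunk)
                if limit is not None and read >= limit:
                    break
        return read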
More Information
More details about sampling can be found in our product documentation.