Managing samples in complex flows

As your flows grow more complex, you should be aware of how to manage the sampling of your data.


Best practices for complex flows

When building your flows, you should be aware of how to manage the sampling of data from your sources, so that:

  • You are collecting representative sets of your data.

  • Outliers in your data are processed or discarded appropriately.

  • The data in your flows is fresh enough for the development work you need to do.

  • The samples used in a flow do not degrade performance while you are developing it.

Samples and their impact on workflow performance are the topic of this article, particularly as they apply to complex flows. As an example, take a look at this behemoth of a flow:

[Screenshot: a complex flow containing many interconnected datasets and recipes]

When are samples taken?

Whenever you create a new recipe and open it in the Transformer page, a sample is taken. This sample is always collected from the initial rows of the data that feeds into the recipe. Some key characteristics of the initial sample:

  • By default, this sample contains as many rows as can fit into 10 MB of data. If your rows are wide, that means fewer total rows of source data in the initial sample (see the sketch after this list).

  • If the entire dataset is less than 10 MB, then the initial sample is the entire dataset.

  • The job to collect this sample is executed in the web client that is downloaded from Trifacta into the local desktop browser. This is important for later.
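To make the wide-row effect concrete, here is a minimal sketch in plain Python (not part of Trifacta; the 10 MB default comes from the list above, and the average-row-size heuristic is an assumption for illustration) showing how the initial sample's row count shrinks as rows get wider:

    SAMPLE_LIMIT_BYTES = 10 * 1024 * 1024  # default initial-sample size (10 MB)

    def estimated_initial_sample_rows(avg_row_bytes: int, total_rows: int) -> int:
        """Rows that fit in the 10 MB initial sample; the whole dataset if it is smaller."""
        rows_that_fit = SAMPLE_LIMIT_BYTES // max(avg_row_bytes, 1)
        return min(rows_that_fit, total_rows)

    # A wide dataset (5 KB per row) yields far fewer sample rows than a narrow one (200 B per row):
    print(estimated_initial_sample_rows(avg_row_bytes=5_000, total_rows=1_000_000))  # ~2,097
    print(estimated_initial_sample_rows(avg_row_bytes=200, total_rows=1_000_000))    # 52,428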

Trifacta Wrangler is a resource-intensive application running in your local browser. It must compete for resources with any other browser windows and applications currently in use. If you are experiencing performance impacts, consider closing some applications on your local desktop.

For Trifacta Wrangler Enterprise customers, you can adjust the size limits of samples and payloads sent to the desktop browser. For more information, please contact your Trifacta administrator.

After you have begun working with your dataset, you may decide that you need to take a new sample.

Tip: A new sample can be taken at any time from the context menu. Click the Eyedropper icon at the top of the page and select the type of sample. See Samples Panel.

These generated samples are executed in one of the following environments:

Local (Photon) environment

Embedded in the Trifacta web client that you use in your local browser is Photon, an in-memory environment for processing jobs. While processing on the local machine is faster than transmitting and processing on a remote network, local Photon jobs can demand a lot of resources from your local desktop.

Photon is used for initial sample jobs.

Backend environment

For larger jobs, the Trifacta web client is connected to a backend clustered environment, where data and tasks can be distributed across multiple nodes in the network. For generated samples, the sampling technique across a very large dataset can require a significant amount of computational power. While these sampling jobs can take longer to collect, they do permit you to perform statistically relevant sampling of your data.

All other sampling jobs are executed on the backend environment, including any generated First Rows samples.
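As a rough mental model, sample jobs are routed between these two environments as sketched below. This is illustrative Python only; the names (SampleJob, route_sample_job) are hypothetical, not Trifacta APIs:

    from dataclasses import dataclass

    @dataclass
    class SampleJob:
        kind: str  # e.g. "initial", "first_rows", "random"

    def route_sample_job(job: SampleJob) -> str:
        """Only the initial sample runs in the browser (Photon); every
        generated sample, including First Rows, runs on the backend."""
        if job.kind == "initial":
            return "Photon (local browser)"
        return "backend cluster"

    print(route_sample_job(SampleJob("initial")))     # Photon (local browser)
    print(route_sample_job(SampleJob("first_rows")))  # backend cluster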

In some environments, collecting a sample on the backend environment costs money. Customers may be billed for the compute units used to perform sampling jobs, either by Trifacta or by the enterprise that manages the backend environment for you. If you are unsure of the costs of sampling, contact your Trifacta administrator.

Why does this matter?

New generated samples are collected only when you request them. If you do not choose to generate a sample for your recipe, the initial sample is used.

New samples are collected for the currently selected step of the recipe.

When a non-Photon sample is executed for a single dataset-recipe combination, the following steps occur:

  1. All of the recipe's steps, up to the currently selected step, are executed on the dataset on the backend.

  2. The sample is then collected from the resulting state of the dataset.

When your flow contains multiple datasets and recipes, all of the preceding steps leading up to the currently selected step are executed first, which can mean the following (see the sketch after this list):

  • The number of datasets that must be accessed increases.

  • The number of recipe steps that must be executed on the backend increases.

  • The time to process the sampling job increases.
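Conceptually, generating a sample means replaying every upstream step first. The sketch below is a hypothetical Python model (the Recipe class and steps_to_execute function are made up for illustration) of why the work grows with each joined-in recipe:

    from dataclasses import dataclass, field

    @dataclass
    class Recipe:
        name: str
        steps: list                                    # ordered recipe steps
        upstream: list = field(default_factory=list)   # recipes joined or unioned in

    def steps_to_execute(recipe: Recipe, selected_step: int) -> int:
        """Count every step the backend must run before it can sample:
        all upstream recipes in full, plus this recipe up to the selected step."""
        total = selected_step
        for dep in recipe.upstream:
            total += steps_to_execute(dep, len(dep.steps))
        return total

    lookup = Recipe("lookup", steps=list(range(10)))
    main = Recipe("main", steps=list(range(60)), upstream=[lookup])
    print(steps_to_execute(main, selected_step=40))  # 40 + 10 = 50 backend steps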

This sample is valid from the currently selected step and beyond:

  • If you edit a step before the one where the sample was taken, the sample may be invalidated, in which case it cannot be used.

  • When the currently selected sample is invalidated, the Transformer page reverts to the most recently collected sample that is valid for the current context.

  • When a sample is in use and you are working on a step after the one where it was collected, all of the steps in between must be executed in the browser. Details are below.
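A sample's validity can be thought of as a simple rule on step positions. The sketch below is an illustrative Python model, not Trifacta's actual invalidation logic:

    def sample_is_valid(sample_step: int, edited_step: int) -> bool:
        """A sample taken at sample_step stays valid as long as no step
        at or before that position is edited."""
        return edited_step > sample_step

    def pick_active_sample(samples: list[int], edited_step: int) -> int | None:
        """Fall back to the most recent still-valid sample, mirroring how the
        Transformer page reverts when the current sample is invalidated."""
        valid = [s for s in samples if sample_is_valid(s, edited_step)]
        return max(valid) if valid else None

    # Samples taken at steps 3, 20, and 45; editing step 25 invalidates the step-45 sample:
    print(pick_active_sample([3, 20, 45], edited_step=25))  # 20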

Step execution in the browser

Suppose you generate a sample on Step #3 of Recipe #1 in your flow, and load it as your current sample in use. After generating that sample, you write 50 more steps in the recipe, including 3 complex joins of other recipes.
You notice a performance impact in the Transformer page in the following areas:

  • Your recipe is very slow to load in the Transformer page.

  • Adding new steps, particularly multi-dataset steps, takes longer than expected.

These performance issues have the following causes (tallied in the sketch after this list):

  • In the browser, all 50 of those steps must be executed on the collected sample. This execution relies on the memory and computing resources in your local desktop.

  • Since you have joined in other recipes, the steps after the last collected sample for those joined-in objects must also be executed in the browser.

  • If samples from those joined-in objects have not been shared with you, then the Transformer page works based off of the initial sample for those objects and then computes all of the subsequent steps in the browser.
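The browser workload in this example can be tallied directly. The sketch below reuses the hypothetical numbers above (50 local steps, plus joined-in recipes with un-sampled steps of their own); the function name is illustrative:

    def browser_steps_to_replay(steps_after_sample: int,
                                joined_steps_after_their_samples: list[int]) -> int:
        """Everything after each recipe's last collected sample replays in the browser."""
        return steps_after_sample + sum(joined_steps_after_their_samples)

    # 50 steps past your own sample, plus three joined recipes that each have
    # 15 un-sampled steps (for example, only their initial samples are available):
    print(browser_steps_to_replay(50, [15, 15, 15]))  # 95 steps executed locally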

How to optimize

When considering your samples, you must balance between:

  • Performance in the browser

  • Breadth of coverage of samples across recipes

  • Sample invalidation

  • Cost of sample generation (where applicable)

Here are some useful tips to optimize sample usage:

Performance issues are likely to occur when:

  • Your current recipe is dependent on a number of other recipes (complexity).

  • You have not taken a sample in your current recipe.

  • You are working with wide datasets.

After you have created a multi-dataset (MDS) step, such as a join, union, pivot, or lookup, generate a new sample if you notice a browser slow-down and feel confident in the quality of your MDS step.

Avoid creating lengthy recipes based on the single initial sample. When you create a new sample, you might consider inserting a Comment transformation step to mark the location where you created the sample.

Avoid creating and working with datasets wider than 2,500 columns. Wide datasets can cause performance problems, and your samples may not be statistically significant, due to the limited number of rows in the sample. For example, at 2,500 columns averaging 40 bytes each, a row is roughly 100 KB, so a 10 MB sample holds only about 100 rows.

Adjusting the size of your browser cache in Google Chrome may help.

(Wrangler Enterprise) If you notice continual issues with performance in the browser, you can work with your Trifacta administrator to determine if there are adjustments to the browser and sampling limits that could help.

If there are changes in performance in the Transformer page and in the rest of the Trifacta application as well, the issue is unlikely to be related to sampling.

Addressing browser crashes

If you are working with a complex flow and have not optimized samples management, the browser may become overloaded with the in-browser computations and can crash. Typical error messages include:

Error: Transformation engine has crashed. Please reload your browser.

Here are some things you can do to address these crashes:

If the browser crashed in a recipe that depends on earlier recipes, open those recipes and review the samples and flows in them.

  • If the only available sample is the initial sample, you should take another one.

  • A good time to take a sample is right after a multi-dataset operation in your recipe.

Look to narrow your dataset. Early in the process, if you know columns that need to be deleted, perform those deletions in early recipe steps. Note that hiding a column does not delete it; hidden columns are still processed.

Similarly, delete rows that you know are unnecessary.

On your local desktop:

  • Shut down other browser tabs and applications that are not being used.

  • Clear your browser cache.

(Wrangler Enterprise) In your environment:

  • Review the size limits on data that is sent to and from the web client.
