Full Job Execution on Snowflake

Learn about using Spark+Snowflake as an execution engine - Full Pushdown to Snowflake

Written by Thea
Updated over 2 years ago

Snowflake provides cloud-based data storage and analytics as a service.

Among other infrastructures, Snowflake runs on Amazon S3. If all of your source datasets and outputs are in Snowflake locations and other conditions are met, then the entire execution of the transformations can occur in Snowflake.

Transferring the execution steps from the Trifacta node to Snowflake yields the following benefits:

  • Only a minimal amount of data (the recipe steps and associated metadata) is transferred between systems. Everything else remains in Snowflake.

  • Recipe steps are converted into SQL that is understandable and native to Snowflake.

  • Execution times are much faster due to minimal data movement and optimization.

  • Depending on your environment, the total cost of executing the job may be lower in Snowflake.

During pushdown execution, the recipe steps in Trifacta are converted to SQL and sequentially executed against your source data in temporary tables. The results are written directly to the target tables.
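To make this concrete, the following is a hedged sketch of what a single recipe step might look like once expressed as Snowflake SQL. The table, column, and step names are hypothetical, and this is not the exact SQL that the Trifacta application generates:

  -- Hypothetical recipe step: keep rows where qty > 0 and derive total = price * qty.
  CREATE TEMPORARY TABLE recipe_step_1 AS
    SELECT order_id, price, qty, price * qty AS total
    FROM source_orders
    WHERE qty > 0;

  -- The results of the final step are written directly to the target table.
  INSERT INTO target_orders (order_id, price, qty, total)
    SELECT order_id, price, qty, total
    FROM recipe_step_1;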

Enabling the Feature

Snowflake as a running environment requires that pushdowns be enabled for the workspace and for the specific flow for which the job is executed. If the flow and the workspace are properly configured, the job is automatically executed in Snowflake.

Workspace Settings

The following setting must be enabled in the workspace. Toggle User menu > Admin console > Workspace settings > Logical and physical optimization of jobs.

When this is enabled, the Trifacta application attempts to optimize job execution through logical optimizations of your recipe and physical optimizations of your recipe's interactions with data.

For more information, see Workspace Settings Page

Flow Optimizations

By default, all newly created flows will be optimized. For older flows, you must enable the Snowflake optimizations in your flow. In Flow View, select More menu > Optimization settings

All general optimizations must be enabled for your flow, as well as the following optimizations, which are specific to Snowflake:

  • Column pruning from source

  • Snowflake > Filter Pushdown

  • Snowflake > Full Pushdown

Each optimization is described below:

  • Column pruning from source: When enabled, job execution performance is improved by pruning unused or redundant columns from the data read from the source database.

  • Filter pushdown: Optimizes job performance on this flow by pushing data filters directly to the source database.

  • Full pushdown: When this setting is enabled, all supported pushdown operations, including full transformation, sampling, and profiling job execution, are pushed down to Snowflake where possible.

  • Full execution for S3: If requirements are met for data sourced from S3, you can enable execution of your S3 data sources in Snowflake.

For more information, see Flow Optimization Settings

Prerequisites

  • You can create a connection to your Snowflake data warehouse.

    Snowflake is an S3-based data warehouse service hosted in the cloud. Auto-scaling, automatic failover, and other features simplify the deployment and management of your enterprise's data warehouse. For more information, see https://www.snowflake.com.

    This connection type can also be used to connect to Snowflake instances hosted in Azure.

  • The Trifacta application must be integrated with Snowflake. See Snowflake Connections.

    • The permission to execute jobs in Snowflake must be enabled.

  • All sources and outputs must reside in Snowflake.

  • Spark + Snowflake must be selected as the running environment. See Run Job Page.

Jobs are executed in the virtual warehouse that is specified as part of the Snowflake connection.
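As a quick sanity check (optional, and not a Trifacta requirement), you can run Snowflake's standard context functions in a worksheet under the same role that the connection uses; the values returned should match the warehouse, database, and schema configured in the connection:

  SELECT CURRENT_USER(),
         CURRENT_ROLE(),
         CURRENT_WAREHOUSE(),
         CURRENT_DATABASE(),
         CURRENT_SCHEMA();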

Sampling

When the source data is in Snowflake and a sample is requested, the sampling process is pushed down to Snowflake (an illustrative Snowflake sampling statement follows the list below). Sampling in Snowflake is supported with the following limitations:

  • The following sampling methods are not supported:

    • Stratified

    • Cluster-based

  • The following file formats are not supported for sampling in Snowflake when stored in S3:

    • CSV

    • JSON

  • Other JDBC-based sources are not supported.

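The Trifacta application generates the sampling SQL on your behalf. Purely as an illustration of the kind of statement Snowflake executes natively for random sampling (the table name below is hypothetical):

  -- Roughly 10 percent of the rows, chosen at random.
  SELECT * FROM source_orders SAMPLE (10);

  -- A fixed number of rows, chosen at random.
  SELECT * FROM source_orders SAMPLE (1000 ROWS);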
Support for S3 files in Snowflake

When creating a dataset, you can choose to read data from a source stored in S3 or from a local file. For S3 data sources that are written to Snowflake, you may be able to execute the job in Snowflake.

Run Job

To execute a job in Snowflake in the Trifacta application:

  • Your job must meet the requirements listed above.

  • Your job must not include the functions, transformations, or other unsupported elements that are listed below.

  • You must select Spark + Snowflake as your running environment on the Run Job page.

To verify execution in Snowflake, open the Overview tab of the Job summary: the value for Environment under the Execution summary should read Snowflake.

You can also see the Environment listed as Snowflake in the Execution stages.
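If you also have direct access to Snowflake, you can optionally corroborate this from the Snowflake side by inspecting the query history for the warehouse named in your connection. This uses Snowflake's standard QUERY_HISTORY table function and is only a cross-check, not a Trifacta feature; the warehouse name below is a placeholder:

  SELECT query_text, warehouse_name, start_time
  FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY())
  WHERE warehouse_name = 'TRIFACTA_WH'  -- replace with your connection's warehouse
  ORDER BY start_time DESC;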

For more information, see Job Details Page

Limitations

  • All data sources and all outputs specified in a job must be located within Snowflake.

  • All recipe steps, including all Wrangle functions in the recipe, must be translatable to SQL.

  • Some transformations and functions are not currently supported for execution in Snowflake. See the link in More Info.

  • If your recipe includes data quality rules, the job cannot be fully executed in Snowflake.

  • Visual profiling is supported with the following conditions or requirements.

    • Visual profiles are unloaded to a stage in an S3 bucket.

    • If a stage is named in the connection, it is used. This stage must point to the default S3 bucket in use.

    • If no stage is named, a temporary stage is created in the PUBLIC schema. The connecting user must have write-access to PUBLIC.
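If no stage is named in the connection, the connecting user's role therefore needs the privileges required to create that temporary stage in PUBLIC. The following is a hedged sketch only; the database and role names are placeholders, and your security model may differ:

  GRANT USAGE ON DATABASE my_database TO ROLE trifacta_role;
  GRANT USAGE, CREATE STAGE ON SCHEMA my_database.PUBLIC TO ROLE trifacta_role;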

More Info

For more information, see Snowflake Running Environment
