Snowflake provides cloud-based data storage and analytics as a service.
Snowflake runs on Amazon S3, among other cloud infrastructures. If all of your source datasets and outputs are in Snowflake locations and other conditions are met, the entire execution of the transformations can occur in Snowflake.
Transferring the execution steps from the Trifacta node to Snowflake yields the following benefits:
Only a minimum of data (recipe steps and associated metadata) is transferred between systems; everything else remains in Snowflake.
Recipe steps are converted into SQL that is understandable and native to Snowflake.
Execution times are much faster due to minimal data movement and optimization.
Depending on your environment, the total cost of executing the job may be lower in Snowflake.
During pushdown execution, the recipe steps in Trifacta are converted to SQL and sequentially executed against your source data in temporary tables. The results are written directly to the target tables.
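For illustration, the sketch below shows how a pair of simple recipe steps (a row filter and a derived column) might be rendered as Snowflake SQL over temporary tables, with the result written to the target table. The table, column, and step names are placeholders, and the actual SQL generated by the Trifacta application may differ.

```sql
-- Illustrative sketch only: names are placeholders, and the SQL that the
-- Trifacta application actually generates may differ.

-- Recipe step 1: keep only shipped orders (a filter step rendered as SQL).
CREATE TEMPORARY TABLE recipe_step_1 AS
SELECT *
FROM   source_db.source_schema.orders
WHERE  order_status = 'SHIPPED';

-- Recipe step 2: derive a new column (a derive step rendered as SQL).
CREATE TEMPORARY TABLE recipe_step_2 AS
SELECT *,
       UPPER(customer_name) AS customer_name_upper
FROM   recipe_step_1;

-- Results are written directly to the target table.
INSERT INTO target_db.target_schema.orders_clean
SELECT * FROM recipe_step_2;
```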
Enabling the Feature
Snowflake as a running environment requires that pushdowns be enabled for the workspace and for the specific flow for which the job is executed. If the flow and the workspace are properly configured, the job is automatically executed in Snowflake.
Workspace Settings
The following setting must be enabled in the workspace. Toggle User menu > Admin console > Workspace settings > Logical and physical optimization of jobs.
When this is enabled, the Trifacta application attempts to optimize job execution through logical optimizations of your recipe and physical optimizations of your recipe's interactions with data.
For more information, see Workspace Settings Page.
Flow Optimizations
By default, all newly created flows are optimized. For older flows, you must enable the Snowflake optimizations in your flow. In Flow View, select More menu > Optimization settings.
All general optimizations must be enabled for your flow, as well as the following optimizations, which are specific to Snowflake:
Column pruning from source
Snowflake > Filter Pushdown
Snowflake > Full Pushdown
| Optimization | Description |
| --- | --- |
| Column pruning from source | When enabled, job execution performance is improved by removing any unused or redundant columns from the source database. |
| Filter pushdown | When enabled, job performance on this flow is optimized by pushing data filters directly to the source database. |
| Full pushdown | When enabled, all supported pushdown operations, including full transformation, sampling, and profiling job execution, are pushed down to Snowflake where possible. |
| Full execution for S3 | If requirements are met for data sourced from S3, you can enable execution of your S3 data sources in Snowflake. |
For more information, see Flow Optimization Settings.
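As a rough sketch of the first two optimizations, a recipe that keeps only two columns and filters on a third might be pushed down as a single pruned, filtered query instead of a full table read. The table and column names below are placeholders.

```sql
-- Without pruning or filter pushdown (conceptually), the full table is read
-- and the unused columns and rows are discarded outside the database:
--   SELECT * FROM sales_db.public.transactions;

-- With column pruning and filter pushdown enabled, only the needed columns
-- and rows are read inside Snowflake. Names are placeholders.
SELECT customer_id,
       amount
FROM   sales_db.public.transactions
WHERE  region = 'EMEA';
```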
Prerequisites
You can create a connection to your Snowflake data warehouse.
Snowflake is an S3-based data warehouse service hosted in the cloud. Auto-scaling, automatic failover, and other features simplify the deployment and management of your enterprise's data warehouse. For more information, see https://www.snowflake.com.
This connection type can also be used to connect to Snowflake instances hosted in Azure.
The Trifacta application must be integrated with Snowflake. See Snowflake Connections.
The permission to execute jobs in Snowflake must be enabled.
All sources and outputs must reside in Snowflake.
Spark + Snowflake must be selected as the running environment. See Run Job Page.
Jobs are executed in the virtual warehouse that is specified as part of the Snowflake connection.
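If you want to confirm the virtual warehouse, role, and schema that the connecting user would use, you can run a quick check in a Snowflake worksheet. This is a generic Snowflake query, not a Trifacta feature.

```sql
-- Generic Snowflake context check for the connecting user: confirm the
-- active virtual warehouse, role, and schema.
SELECT CURRENT_WAREHOUSE(), CURRENT_ROLE(), CURRENT_SCHEMA();
```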
Sampling
When the source data is in Snowflake and a sample is requested, the sampling process is pushed down to Snowflake; an illustrative query appears after the limitations below. Sampling in Snowflake is supported with the following limitations:
The following sampling methods are not supported:
Stratified
Cluster-based
The following file formats are not supported for sampling in Snowflake when stored in S3:
CSV
JSON
Other JDBC-based sources are not supported.
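As an illustration, a random row sample pushed down to Snowflake might use the SAMPLE clause, as in the sketch below. The table name and sample sizes are placeholders, and the actual query issued by the Trifacta application may differ.

```sql
-- Illustrative only: take an approximate 1% random row sample in Snowflake.
SELECT *
FROM   sales_db.public.transactions
SAMPLE ROW (1);

-- Or request a fixed number of rows instead of a percentage.
SELECT *
FROM   sales_db.public.transactions
SAMPLE (1000 ROWS);
```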
Support for S3 files in Snowflake
When creating a dataset, you can choose to read data from a source stored in S3 or from a local file. For S3 data sources that are written to Snowflake, you may be able to execute the job in Snowflake.
You must enable the Full execution for S3 option and configure the Snowflake flow optimizations. For more information, see Flow Optimization Settings Dialog.
Additional configuration and limitations may apply. For more information, see Snowflake Running Environment.
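For background, Snowflake itself reads S3 files through stages; the sketch below shows that generic mechanism with placeholder bucket, credential, and table names. It illustrates how S3 data becomes queryable inside Snowflake and is not necessarily how the Trifacta application ingests S3 sources.

```sql
-- Background sketch of the generic Snowflake mechanism for reading S3 files;
-- the bucket URL, credentials, and object names are placeholders.
CREATE STAGE my_s3_stage
  URL = 's3://example-bucket/datasets/'
  CREDENTIALS = (AWS_KEY_ID = '<key>' AWS_SECRET_KEY = '<secret>');

-- Load the staged CSV files into a Snowflake table so that subsequent
-- transformations can run entirely inside Snowflake.
COPY INTO analytics_db.public.raw_orders
FROM @my_s3_stage/orders/
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);
```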
Run Job
To execute a job in Snowflake in the Trifacta application:
Your job must meet the requirements listed above.
Your job must not include the functions, transformations, or other unsupported elements that are listed below.
You must select Spark + Snowflake as your running environment in the Run Job page.
To verify execution in Snowflake, in the Overview tab of the Job summary, the value for Environment under the Execution summary should read Snowflake.
You can also see the Environment listed as Snowflake in the Execution stages.
For more information, see Job Details Page.
Limitations
All data sources and all outputs specified in a job must be located within Snowflake.
All recipe steps, including all Wrangle functions in the recipe, must be translatable to SQL.
Some transformations and functions are not currently supported for execution in Snowflake. See the link in More Info.
If your recipe includes data quality rules, the job cannot be fully executed in Snowflake.
Visual profiling is supported with the following conditions and requirements:
Visual profiles are unloaded to a stage in an S3 bucket.
If a stage is named in the connection, it is used. This stage must point to the default S3 bucket in use.
If no stage is named, a temporary stage is created in the PUBLIC schema, and the connecting user must have write access to PUBLIC.
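The sketch below illustrates the temporary-stage case. The stage name, source query, and file format are placeholders, and the Trifacta application may unload profile results differently.

```sql
-- Illustrative sketch only: stage name, source query, and file format are
-- placeholders; the actual unload performed for visual profiles may differ.
CREATE TEMPORARY STAGE public.profile_stage;

-- Unload profiling results to the stage as compressed CSV files.
COPY INTO @public.profile_stage/profile_results/
FROM (SELECT * FROM target_db.target_schema.orders_clean)
FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP);
```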
More Info
For more information, see Snowflake Running Environment.