BigQuery is a cloud-based data warehouse platform that is fully integrated into the Google Cloud Platform. BigQuery supports querying through standard SQL.
For flows that have both the source and output in BigQuery, the Trifacta application can execute the job and necessary transformation steps directly in BigQuery. By doing so, no data needs to be transferred out of the data warehouse, and the performance should be significantly better.
In addition to BigQuery sources, you can execute jobs in BigQuery on source files from Base Storage.
By default, the Full execution for GCS file option is enabled for new flows. For more information, see Flow Optimization Settings Dialog.
For more information, see Google Cloud Storage Access.
BigQuery is not a running environment that you explicitly select or specify as part of a job. If all of the requirements are met, then the job is executed in BigQuery when you select Dataflow. For more information, see Overview of Job Execution.
Supported File Formats from Base Storage
CSV. CSV files that fail to meet the following requirements may cause job failures when executed in BigQuery-
For job execution of CSV files in BigQuery, source CSV files must be well-formatted.
Newlines must be inserted.
Fields must be demarcated with quotes and commas.
Quotes in field value must be escaped with quotes when needed (
""
).
Each row must have the same number of columns.
TSV
JSON (newline-delimited)
TXT
LOG
Compressed Files (gz and bz)
Benefits
Transferring the execution steps from the batch execution to BigQuery provides the following benefits:
Data transfer between systems (Dataprep, Dataflow, GCS, etc.) is minimized. Only recipe and associated metadata is transferred, but everything else remains in BigQuery. When running a job fully in BigQuery, your data never leaves BigQuery
Recipe steps are converted into SQL that is understandable and native to BigQuery. Execution times can be much faster.
Depending on your environment, the total cost of executing the job may be lower in BigQuery.
For jobs that are executed in BigQuery, you can optionally enable the execution of the visual profile in BigQuery, too. This option is enabled for individual flows. For more information, see Flow Optimization Settings Dialog.
Configuration
A project owner must enable the following features in the project:
BigQuery Execution
Logical and physical optimization of jobs
For more information, see Dataprep Project Settings Page.
For individual flows, all general and BigQuery optimizations must be enabled. For more information, see Flow Optimization Settings Dialog.
When executing with BigQuery, recipe steps in Trifacta are converted to SQL, executed on the source data and written into temporary tables, before being output to the defined output table.
You will see a notification on your Run Job page indicating if BigQuery execution is enabled and is being used.
If all recipe steps and Wrangle are translatable to SQL, then Full Execution on BigQuery mode is used.
If some steps are not translatable to SQL, then only the translatable steps are executed on BigQuery, while the rest are executed on Dataflow in a Hybrid Execution mode.
For jobs executed in BigQuery, profiling jobs may also be executed in BigQuery.
An Optimization summary (enabled/disabled) for the job can be viewed on the Job Details page in the right panel.
You can also view the output and executed statements directly on BigQuery by clicking on the button 'View on BigQuery' under Job Details.
Sampling
Beginning v8.11, data sources from BigQuery support sampling via BigQuery execution. BigQuery sampling is limited to full scan samples, as quick scan uses the Trifacta Photon execution.
The following sampling types are supported for execution in BigQuery:
Random sampling
Filter-based sampling
Anomaly-based sampling
Note: Flows with GCS sources and BigQuery targets can use BigQuery execution for job execution, but cannot use BigQuery execution for sampling.
Publishing to BigQuery
Beginning v9.7, You can publish Designer Cloud Powered by Trifacta Objects and arrays of Objects types as complex types in BigQuery.
For details on publishing to BigQuery destinations, refer to this article
Supported Functions
Note that some transformations and functions are not currently supported for execution in BigQuery. Upserts are also supported for full execution in BigQuery. For more information on limitations and the unsupported transformations, read this documentation guide.
Beginning v8.9, imported datasets created with customer SQL are supported for execution in the BigQuery running environment. For more information, see BigQuery Running Environment.
Beginning v9.4, JavaScript User-Defined Functions are supported for execution in the BigQuery running environment. For more information, see JavaScript UDFs.
Full execution for GCS files
When enabled, jobs that are sourced from files on Google Cloud Storage can be executed in BigQuery when all of the requirements for BigQuery execution are met. By default, this setting is enabled.