The efficiencies of automation and hosted solutions are compelling for many enterprises.
Google Cloud Functions constitute an event-driven serverless compute platform that gives you the ability to run your code in the cloud without worrying about the underlying infrastructure.
Cloud Functions provide the following fundamental advantages:
In our use case we’re going to use the event-based trigger in Cloud Functions to:
So, why not just use Dataprep to run BigQuery views and schedule a Dataflow job to fire at 12:05 am after the GA table is created, then save the results to Cloud Storage with no coding (except the SQL queries of course)? The issue here is that if for any reason the GA BQ export got delayed the BQ views will fail causing your job to fail. Using a Stackdriver trigger is a more failsafe approach. The trigger will only fire once the table is created, eliminating the timing dependency and ensuring that the Cloud Function will find the table when executing the queries.
Actually no, Cloud Functions can’t be triggered with BigQuery. Still, there is a way to work around this using Stackdriver and Pub/Sub (which is one of the source triggers of Cloud Functions). In our use case we’ll be using more than one product from the GCP family like Stackdriver, Pub/Sub, Cloud Functions (Python), Cloud Storage and of course BigQuery (not necessarily in that given order).
Well, it may look like there are a lot of products used here, but really almost all the work is done in Cloud Functions and BigQuery, the rest of the products are just helping to close the loop with minimal configurations.
So, let’s list all the steps needed to get the job done, then get into details one by one:
We’ll not go deep into the functionality of each GCP product. We’ll just review what we need to configure to have our use case working.
The help docs provide additional discussion on the Google Cloud Platform components used in this solution.
First we create a new Pub/Sub topic (which can be done implicitly while creating the Stackdriver export, as shown below). Then we’ll make use of the fact that BigQuery logs are active by default, and define an export in Stackdriver with a filter for only when the GA table is created in BigQuery and configure the Stackdriver export to send the body of this log (which is in JSON format) to our Pub/Sub sink.
In this step we prepare the BQ queries that will be used to produce the needed reports. Without getting into too much explanation about how to write the BigQuery queries, we’ll use the query below, which retrieves all sessions from the day before that included Add to cart eCommerce action, with all details about the products returned in the query.
SELECT<br />CONCAT(fullVisitorId,'.', CAST(visitId AS string)) AS sessionId,<br />hit.page.pageTitle AS pageTitle,<br />CONCAT(hit.page.hostname, hit.page.pagePath) AS pageURL,<br />hit.page.hostname AS hostname,<br />product.productSKU AS productSKU,<br />product.v2ProductName AS productName,<br />product.v2ProductCategory AS productCategory,<br />product.productPrice/1000000 AS productPrice,<br />product.productQuantity AS productQuantity<br />FROM<br />`..ga_sessions_*`,<br />UNNEST(hits) AS hit,<br />UNNEST(hit.product) AS product<br />WHERE<br />hit.eCommerceAction.action_type = '3'<br />AND _TABLE_SUFFIX = FORMAT_DATETIME("%Y%m%d",<br />DATETIME_ADD(CURRENT_DATETIME(),INTERVAL -1 DAY))
Now, on to creating our Cloud Function. We need to define the function name, memory to allocate, trigger (Pub/Sub in our case), topic (our topic name), and for the runtime we’ll use Python 3 (which is currently is in beta). There is another option to use Node.JS, but we’ll stick to Python for now.
Below is a snippet of the Cloud Function python code used to run, execute and export the BigQuery’s results into a CSV file into a Cloud Storage Bucket.
from google.cloud import bigquery<br />from google.cloud import storage<p>def export_to_gcs():<br /># BQ Query to get add to cart sessions<br />QUERY = "SELECT<br />CONCAT(fullVisitorId,'.', CAST(visitId AS string)) AS sessionId,<br />hit.page.pageTitle AS pageTitle,<br />CONCAT(hit.page.hostname, hit.page.pagePath) AS pageURL,<br />hit.page.hostname AS hostname,<br />product.productSKU AS productSKU,<br />product.v2ProductName AS productName,<br />product.v2ProductCategory AS productCategory,<br />product.productPrice/1000000 AS productPrice,<br />product.productQuantity AS productQuantity<br />FROM<br />`..ga_sessions_*`,<br />UNNEST(hits) AS hit,<br />UNNEST(hit.product) AS product<br />WHERE<br />hit.eCommerceAction.action_type = '3'<br />AND _TABLE_SUFFIX = FORMAT_DATETIME('%Y%m%d',DATETIME_ADD(CURRENT_DATETIME(),INTERVAL -1 DAY))"<br />bq_client = bigquery.Client()<br />query_job = bq_client.query(QUERY) # API request<br />rows_df = query_job.result().to_dataframe() # Waits for query to finish<br />storage_client = storage.Client()<br />bucket = storage_client.get_bucket('BucketName')<br />blob = bucket.blob('Add_to_Cart.csv')<br />blob.upload_from_string(rows_df.to_csv(sep=';',index=False,<br />encoding='utf-8'),content_type='application/octet-stream')<br />
Here, we are using google.cloud.bigquery and google.cloud.storage packages to:
The final step is to set our Python function export_to_gcs() as “Function to execute” when the Cloud Function is triggered.
Say our data is in the volume of millions of records; we can always extend the memory allocated for our Cloud Function up to 2GB, but this comes with a higher price, and what if even 2GB is not enough?
Another workaround for this is not using Pandas to save query results. Instead, we can save the results to a BigQuery intermediate table, then export this table directly to Cloud Storage, letting BigQuery do all the heavy lifting for us.
Below are two functions to do so. The save_to_bq_table() function runs a query and saves the results to a BigQuery table, here we are setting allow_large_results = True to avoid job crashing if the result set is huge.
def save_to_bq_table():<br />bq_client = bigquery.Client()<br /># Saving data to a intermediate table then export it to GCS<br />query = "##Query with millions of records results##"<br />job_config = bigquery.QueryJobConfig()<br /># Set the destination table<br />table_ref = bq_client.dataset(dataset_id).table('TableID')<br />job_config.destination = table_ref<br />job_config.allow_large_results = True<br /># Start the query, passing in the extra configuration.<br />query_job = bq_client.query(<br />query,<br />location='US', # Location must match that of the source table<br />job_config=job_config) # API request - starts the query<br />query_job.result() # Waits for the query to finish<br />
The export_bq_table() function, exports the table to Cloud Storage CSV file(s), then deletes the table.
def export_bq_table():<br />client = bigquery.Client()<br />destination_uri = 'gs://{}/{}'.format('BucketName','ExportFileName_*.csv')<br />dataset_ref = client.dataset(dataset_id, project=project_id)<br />table_ref = dataset_ref.table(tableId)<br />extract_job = client.extract_table(<br />table_ref,<br />destination_uri,<br />location='US') # API request # Location must match that of the source table<br />extract_job.result() # Waits for job to complete.<br />client.delete_table(table_ref) # API request<br />
The exported files will have a limit of 1GB per file, so adding an asterisk * somewhere in the file name in the URI will generate multiple files with incremental files names, FileName-000000000000.csv, FileName-000000000001.csv, FileName-000000000002.csv and so on.
Although Cloud Functions can’t be used for complex transformations which is a task for Dataflow, Cloud Functions are also a very powerful tool that can be used alongside other GCP products to automate quick tasks with little code writing effort. They can be used for exporting data from BigQuery, writing data from Cloud Storage into BigQuery once files are put into a GS Bucket, reacting to a specific HTTP request, monitor Pub/Sub topics to parse and process different messages, and so much more.
Some other use cases of Google Cloud Functions include:
Data Engineering Team Lead Mahmoud is all about facing challenges, the harder the challenge the sweeter the victory. After receiving his Bachelor degree in Computer Science and a diploma in Software Development he worked as a software developer. Later switching careers to the Business Intelligence and Data Analysis field in which he found his passion working with data, analysis and revealing hidden insights to help companies make better business decisions.
In today’s hyper-competitive digital market, businesses need more than just data; they need actionable insights…
As consumers become increasingly digitally savvy, and more and more brand touchpoints take place online,…
Marketers are on a constant journey to optimize the efficiency of paid search advertising. In…
This website uses cookies.