bigquery

Getting Your Feet Wet in the Data Lake: Analytics 360 in BigQuery

I am really fascinated by BigQuery. The ease of use and the speed it offers to handle really large amounts of data is just amazing. In this post, I’ll highlight the operational benefit of BigQuery over the on-premise Big Data platforms. We’ll then begin filling our data lake with Google Analytics 360 data as just one of the many data sets for storage in BigQuery as a first step in integration, visualization, and advanced reporting.

As an editorial note, Google Analytics Premium is now called Analytics 360 within the Google Analytics 360 Suite. We’ll alternatively refer to it as Analytics 360 or Google Analytics as the context warrants.

Benefits for Data Engineers, Analysts and Marketers

As a Big Data platform, BigQuery offers benefits for multiple stages and roles in the Big Data process:

For marketers and analysts, you can run ad hoc queries and get the results within minutes or seconds. The elusive quest for understanding online and offline attribution, user funnels, and long-term customer value comes within reach.

For data engineers, BigQuery offers a tremendous operational benefit, as outlined in the next section.

No-ops Big Data Platform

With BigQuery, there’s no need to worry about all the infrastructure preparation and the dedicated resources that on-premise solutions would require. In contrast to on-premise platforms such as Hadoop and Spark, BigQuery is a truly no-ops cloud-based solution that requires no resources for maintaining hardware infrastructure, software updates, and automatic adding and removing servers according to usage.

The no-ops benefit of BigQuery by itself makes BigQuery a great choice as a proof of concept for your Big Data platform that can likely evolve into a solution for all your long-term Big Data needs.

More than Buzz in the New Buzzword: Data Lake

For E-Nor, the export of Google Analytics data into BigQuery was a game changer.
Google Analytics was already supporting imported data from other sources, but an inverse approach–exporting Google Analytics data into BigQuery–offers great advantages:

Since BigQuery is designed to house data sets from many different sources and can pull in any CSV or JSON data, it facilitates integration of Google Analytics with countless marketing and customer data sources. Having large diverse data sets in a raw format changes the game entirely for any future exploration, visualization, or advanced/automated reporting. Imagine all of the correlations that one may find once all the enterprise data is in one place and unified tools are used for the exploratory data analysis.
Your Google Analytics data is more easily used for advanced visualizations and automated reporting in Google’s Data Studio 360 or third-party tools such as Tableau.

Once you have incorporated all your enterprise data in one repository and you begin using unified tools for exploratory data analysis, the correlations and the insights begin to emerge. If you have a business that could be affected by weather, you could integrate publicly available meteorological data sets with your Google Analytics data sets to correlate weather with purchases and thereby identify the most opportune times to activate marketing campaigns and special offers.

Setting Up a Google Cloud Platform Project

BigQuery is not an add-on to Analytics 360; it’s a separate Google product. More accurately, it’s a component of Google Cloud Platform (GCP). That’s why we start by creating a GCP project and enabling BigQuery API and billing. BigQuery uses a different billing model from Analytics 360. All GCP components follow the pay-as-you-go model. That is, your consumption is measured and billed at the end of every month. BigQuery costs depend on 3 factors: storage, queries, and direct insertion of data into BigQuery.

Google has provided a great tool to estimate and plan your usage and cost: https://cloud.google.com/products/calculator/

As an Analytics 360 user, you get $500 credit per month to use against BigQuery usage. Since BigQuery storage is priced so competitively, this $500 can accommodate a great deal of your enterprise data.

Analytics 360 Export

Before we can analyze Google Analytics data in BigQuery, we must get the data into the platform. Fortunately, Analytics 360 provides a very straightforward set-it-and-forget-it nightly export from Google Analytics to BigQuery (Linking between standard Analytics and BigQuery is not supported.) As BigQuery adoption grows in the analytics community, we expect more Google, and non-Google, products will support linking to BigQuery. Google’s Firebase Analytics for mobile app measurement, as one example, has already taken this step of automated export to BigQuery.

For a step-by-step guide to configuring the Analytics 360 export to Big Query, see: https://support.google.com/analytics/answer/3437618

Viewing Your Google Analytics Data in BigQuery

Once you have completed the linking procedure, you should start seeing Google Analytics data in your BigQuery project within 24 hours. Google will also provide a historical export of analytics data. As of the publication of this post, the historical export limit is the smaller between 10 billion hits and 13 months of data. Historical export will happen in parallel with the daily export.

Hit-level Granularity

Once you start seeing GA data in BigQuery, you will notice that it is the raw analytics data, at the hit level (that is, at pageview, event, and e-commerce transaction level rather than session or user level).

This granularity is a big advantage if you’re investigating an issue or if you want to build a report that’s totally different than Google’s standard or custom reports.
On the flip side, it also means you have to write queries to calculate some session- and user-scope metrics that are available directly in Google Analytics interface.

In an example of issue investigation, BigQuery allowed us to quickly get to the bottom of an issue we observed in a particular Google Analytics property: many sessions with (not set) as the Landing Page. Experimentation with a few queries revealed that video events were being sent to Google Analytics after session timeout and without a new pageview due to long-duration videos and the session reset at midnight.

No Sampling

In addition to the benefits outlined in previous sections, another reason for exporting Analytics 360 data to BigQuery is to avoid sampling altogether. All hits will be grouped into sessions and exported to BigQuery, so you can be confident that all analysis and visualizations, for any combination of dimensions and metrics, will be based on your complete, unsampled Google Analytics data set.

Google Analytics Schema in BigQuery

First, the schema is always evolving and improving, with new fields being added and old fields being deprecated, so be sure to bookmark the link below and keep checking the schema for updates.

Here is a reference to the schema for exported Google Analytics data within BigQuery: https://support.google.com/analytics/answer/3437719

Second, we should note the differences between the data in Google Analytics Interface and BigQuery: some of these differences are minor, such as the hit type is called PAGE in BigQuery and not pageview as in Google Analytics, while others are bigger than expected, as described below.

It may be helpful to start by highlighting some key differences between BigQuery and traditional databases, and why some core concepts and best practices apply to RDBMSs and not BigQuery:

Select * is NOT recommended:In BigQuery, tables are divided into columns, which allows for better compression. Think of it as if each column is stored in a separate file, maybe even a separate machine. Running a select on a smaller number of columns uses fewer resources and will perform faster.
GA data is partitioned over tables by date. Data for each day is stored in a separate table under the same dataset. The dataset uses the Google Analytics view id as its name. To query several tables and combine the result as if it were one table, BigQuery provides the table_date_range function.
In BigQuery’s Google Analytics schema, each record represents a session. The fullVisitorId and visitId, together, form a unique key for each session.
In BigQuery, a field data type can be a RECORD. That is, you can have a whole record, with fields and values, within a field. In BigQuery’s Google Analytics schema, a good example of this is trafficSource. Within this field a record containing all kinds of details of trafficSource, such as trafficSource.source, trafficSource.medium, trafficSource.campaign, trafficSource.keyword, or trafficSource.referralPath are stored. Using the dot notation, you can retrieve any of these inner fields.

In BigQuery, a field can be REPEATED, in addition to being NULLABLE and REQUIRED available in traditional databases. These serve as the data mode of the field. In a REPEATED field, you can have a collection of values, rather than a single value, stored inside one field. In BigQuery’s Google Analytics schema, that’s how customDimensions are stored, as REPEATED RECORDs inside the main record which represents a session.

An example Query

Here is one example query that covers a lot:

SELECT
  trafficSource.source + ' / ' + trafficSource.medium AS source_medium,
  COUNT(DISTINCT CONCAT(fullVisitorId, STRING(visitId)), 100000) AS sessions,
  SUM(totals.bounces) AS bounces,
  100 * SUM(totals.bounces) / COUNT(DISTINCT CONCAT(fullVisitorId, STRING(visitId)), 100000) AS bounce_rate,
FROM (TABLE_DATE_RANGE([XXXYYYZZZ.ga_sessions_], TIMESTAMP('2016-05-01'), TIMESTAMP('2016-05-31')))
GROUP BY
  source_medium
ORDER BY
  sessions DESC

In the query above, replace XXXYYYZZZ with the ID of the Google Analytics view that you are exporting to BigQuery.

A few other points to note about the query syntax:

The source and medium will be combined into one field.
The fullVisitorId and the visitId together represent a unique session and that what we are counting.
The field totals.bounces will tell us if this session has only a single pageview.
The TABLE_DATE_RANGE function allows us to query over 31 tables, to cover all GA data during the month of May. The TIMESTAMP function will convert the date to a string that can be appended to the table name.
The grouping is part of the aggregation/calculation and sorting is controlled by the order by phrase.

Here are 2 useful references for the query language:

BigQuery Cookbook:https://support.google.com/analytics/answer/4419694
BigQuery Reference:https://cloud.google.com/bigquery/query-reference

What results does the query return?

Next Steps

In this post, we’ve only dipped our toe into the BigQuery data lake. We’ve set up a BigQuery project, configured the export of Analytics 360 data to BigQuery, reviewed the Google Analytics schema within BigQuery, and run a basic query on our data..
There are so many other ways to enjoy the BigQuery data lake. In our next posts, we’ll go in up to our knees with more advanced analysis of Google Analytics content data within BigQuery, and then we’ll begin filling up the data lake with additional data sets and really start splashing around.