
Modeling Google Analytics and Back-End Retail Data on Google Cloud Platform

The ultimate objectives of data capture are, of course, insight and action. For many organizations, however, the gulf remains wide between data capture and well-informed, data-driven decisions.

As a discipline, data modeling is intended to foster a conscious, outcome-oriented approach to data capture and enable meaningful reporting, visualization, and analysis. More specifically, data modeling can help developers and data professionals to:

  • fully inventory and understand the available and needed data in their target domains
  • anticipate the types of analysis to be performed on the data
  • facilitate presentation and reporting

Data modeling can apply to a wide range of domains. In this blog post, we concentrate on modeling Google Analytics e-commerce data integrated with other back-end retail data.

We approach the retail data model in four phases:

  1. Integrating online and offline data sources, we map out a normalized schema in BigQuery.
  2. We populate the normalized schema for staging in BigQuery.
  3. We replicate the normalized data into a denormalized schema, also in BigQuery, optimized for fast reads as a presentation layer.
  4. We then outline several types of analysis that the data model will facilitate:
    • Customer Sentiment Analysis
    • Customer Service Performance
    • Online Promotion Influence on Offline Sales (and Vice Versa)
    • Churn Prevention for Marketing
    • Determining Real Campaign Impact
    • Customer Segmentation
    • Customer Lifetime Value Prediction

Along the way, we take advantage of Google Cloud Platform’s data modeling services to generate additional dimensions against the data already included in the model.

Data Lake, Data Warehouse, Data Mart, Data Model
Before we dive in, let’s review some essential data management concepts:
Data Lake: fairly unstructured, comprehensive data repository. Your data lake might contain files in Cloud Storage or Google Drive or transactional data in Cloud Bigtable.
Data Warehouse: structured, transformed data repository, designed with defined use cases in mind.
Data Mart: subset of the data warehouse typically oriented towards a specific team, business function, or KPI.
Data Model: model that organizes elements of data and their interrelationships.

A data model can help to clarify requirements for a Data Warehouse or a Data Mart.

Unified, Efficient Presentation Layer

The retail data model that we define in this post will enable a unified presentation layer in BigQuery. We can aggregate and transform the external data sources using a range of Google Cloud Platform components, including:

  • Cloud Functions to transport data to Google Cloud Storage for staging
  • Cloud Dataflow to move data from Google Cloud Storage to BigQuery
  • Cloud Composer to orchestrate data pipeline jobs

The specific architecture is guided by the pipelines that the data model will reference.

The resulting data model, and the physical data warehouse or data marts that it can help to shape, will offer the following benefits:

  • unification: the data model will consist of online, in-store, and back-end data.
  • efficiency: since the data model is itself built around different reporting requirements, consumers of the corresponding data marts will be able to efficiently query and present only the data that is needed for reporting and analysis.

Normalized-to-Denormalized Structure

We’ll start the data model in a normalized form but then move to denormalization:

  • Normalized form represents a systematic approach of decomposing tables to eliminate data redundancy (repetition) and anomalies due to record insertion, update, and deletion. To minimize redundancy, normalized form relies on strong relationships between tables.
  • Denormalized form represents a strategy applied to previously normalized database tables in order to increase performance. With some redundancy and decreased write performance, read performance is improved.

So normalized form favors storage efficiency and relations between entities, while denormalized form favors read efficiency. The first layer of our data model will be normalized; the second layer will be denormalized but will also take advantage of BigQuery’s nested and repeated field structure to preserve some of the benefits of normalization.

Normalized Model Structure
We design the first layer of the model in normalized form in a staging area of BigQuery to organize and cleanse the data. It will encompass the following main areas of data:

  • Customer
  • Products
  • Order Transactions
  • Customer Service
  • Loyalty Program
  • Click Stream
  • Promotions
  • Staff

The physical tables for this layer of our data model will reside in BigQuery and will be accessed only by the developers rather than used directly for presentation. While BigQuery is not considered an RDBMS, we’ll be able to take advantage of some relational concepts such as logical joins and primary key enforcement to map the relations between tables and ensure a great degree of data integrity and data quality in this first layer of the retail data model.
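
BigQuery does not enforce primary keys the way a traditional RDBMS does, so one way to approximate primary-key enforcement in the staging layer is a pipeline-side uniqueness check. Below is a minimal sketch using the BigQuery Python client; the project, dataset, table, and column names are hypothetical.

from google.cloud import bigquery

client = bigquery.Client()

# Fail the load if any customer_id appears more than once in the staging
# (normalized) customer table. All names below are placeholders.
query = """
    SELECT customer_id, COUNT(*) AS n
    FROM `my-project.staging.customer`
    GROUP BY customer_id
    HAVING n > 1
"""
duplicates = list(client.query(query).result())
if duplicates:
    raise ValueError('Duplicate customer_id values found: {}'.format(duplicates[:5]))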

Normalized layer of the retail data model (view live version)
Portions of the data model highlighted by function.


Note that this model represents a subset of a comprehensive retail data model, which could also encompass inventory, finance, and other aspects of data capture for a retail organization.

The diagram above shows the initial design of the retail relational data model in normalized form.

BigQuery is designed to perform best with denormalized data. The query-speed gains that a denormalized data model offers generally outweigh the storage savings of normalization, which is not as high a priority in modern systems.


Can I ever report from the normalized layer?
There are two main considerations when deciding whether to maintain and report from the normalized form, without replicating it in denormalized form, as you design your schema for BigQuery:
  • denormalize if large: if a dimension table is larger than 10 gigabytes, it is a good indicator that you should denormalize the dimension table.
  • keep normalized if you have frequent UPDATEs and DELETEs: if these operations are frequent, keep the dimension tables normalized and the dimension table smaller than 10 gigabytes.

Breaking out Reference Fields in Normalized Form

Whether you’re creating a normalized layer that precedes the denormalized presentation layer or you’re using the normalized data layer directly for presentation, you can enhance performance by breaking out frequently updated fields into their own entity within the normalized schema.

Entity containing fields that are updated frequently and fields that are static.
Frequently updated fields have been broken out to a separate entity on the left, joinable on customer_id.
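
To illustrate how the split entities come back together for reporting, here is a hedged sketch of the join on customer_id, run through the BigQuery Python client; the table and column names (customer, customer_activity, and their fields) are hypothetical.

from google.cloud import bigquery

client = bigquery.Client()

# Recombine the static customer attributes with the frequently updated ones
# (split out here into a customer_activity entity). Placeholder names.
query = """
    SELECT c.customer_id, c.name, c.signup_date,
           a.last_login, a.reward_points_balance
    FROM `my-project.staging.customer` AS c
    LEFT JOIN `my-project.staging.customer_activity` AS a
      USING (customer_id)
"""
rows = client.query(query).result()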

Denormalization for the Presentation Layer:
Based on the best-practice considerations outlined above, we’ll design the presentation layer to be denormalized. Below is a denormalized BigQuery schema corresponding to the normalized schema above.

Product:

  • Product
  • Product price
  • Product discount
  • Product category
  • Product category discount

Customer and Contract:

  • Customer
  • Customer Reward Point Log
  • Membership type

Service Request:

  • Service Request
  • Agent
  • Service Request Feedback

Sales and Inventory:

  • Order
  • Order Feedback (separate data source because of the 150-character limit on Google Analytics custom dimensions)
  • Order Items
  • E-commerce (Click Stream)
    • Session
    • Hits
    • Checkout Step
  • Promotions
  • Sales Performance (online or offline channel)
  • Marketing Campaign
  • Customer Journey (first activity date, last activity date – its own data mart)

Note that within the tables above, the e-commerce data originates from Google Analytics and is stored in nested, repeated form.
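
As a rough illustration of nested, repeated fields, the sketch below declares a GA-style session table whose hits are stored as a repeated RECORD, using the BigQuery Python client. The field names are illustrative and not the exact schema of the model above.

from google.cloud import bigquery

# Each session row carries its hits as a nested, repeated record,
# mirroring the way Google Analytics exports e-commerce data.
session_schema = [
    bigquery.SchemaField('session_id', 'STRING', mode='REQUIRED'),
    bigquery.SchemaField('customer_id', 'STRING'),
    bigquery.SchemaField('session_date', 'DATE'),
    bigquery.SchemaField(
        'hits', 'RECORD', mode='REPEATED',
        fields=[
            bigquery.SchemaField('hit_number', 'INTEGER'),
            bigquery.SchemaField('page_path', 'STRING'),
            bigquery.SchemaField('event_action', 'STRING'),
            bigquery.SchemaField('checkout_step', 'INTEGER'),
        ],
    ),
]

client = bigquery.Client()
table = bigquery.Table('my-project.presentation.session', schema=session_schema)
client.create_table(table, exists_ok=True)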

Workflow: Customer Sentiment Analysis

One of the types of analysis that our data model will enable is the analysis of customer sentiment based on survey inputs.

The end-to-end workflow mapped below will include the transfer of data to Google Cloud Storage, the population of the normalized and denormalized layers of the data model in BigQuery, and the generation of additional fields through Google’s Natural Language API.


Step 1: Preparing Files

The source data will be either files or external databases. In the case of an external database, you can export the tables and compress them as tar.gz files in preparation for transfer to Google Cloud Storage.

Step 2: Transferring Files to Google Cloud Storage

There are several ways to transfer data to a Google Cloud Storage bucket, including the two options below:

  1. ETL Job using any data integration tool, such as Talend
  2. Batch/shell scripts using the gsutil command in the Google Cloud SDK

$ gsutil cp [LOCAL_FILE] gs://[BUCKET_NAME]/[OBJECT_PATH]

Step 3: Organizing Transferred Files

The Cloud Storage data lake for the retail data will consist of two buckets:

  • Staging: a regional Standard-class bucket where we receive the tar.gz files and decompress them
  • Archive: a regional bucket whose storage class depends on how often you expect to access the archived data for disaster recovery or to calculate new KPIs:
    • Nearline bucket: choose if you expect to access the data about once per month
    • Coldline bucket: choose if you expect to access the data about once per year
Suggested settings for the archive bucket in Google Cloud Storage. For storage class, select Nearline or Coldline depending on expected frequency of access to the archived bucket.
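
As one possible way to apply these settings programmatically, here is a minimal sketch using the google-cloud-storage Python client; the bucket name and region are placeholders, and the storage class can be switched to COLDLINE as described above.

from google.cloud import storage

client = storage.Client()

# Archive bucket: regional location, Nearline or Coldline depending on how
# often you expect to read the archived files. Placeholder name and region.
bucket = client.bucket('retail-archive-bucket')
bucket.storage_class = 'NEARLINE'  # or 'COLDLINE' for roughly once-a-year access
client.create_bucket(bucket, location='us-central1')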

Step 4: Reading Data from GCS Bucket

You can set up a Cloud Function to fire based on a Cloud Storage trigger for when a file is uploaded for processing. The Cloud Function itself can then trigger an Airflow workflow in Cloud Composer.
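
A minimal sketch of such a Cloud Function is shown below, assuming a Cloud Composer 1 environment (Airflow 1.10) whose web server sits behind Identity-Aware Proxy; the IAP client ID, web server URL, and DAG name are placeholders you would look up in your own environment.

import requests
from google.auth.transport.requests import Request
from google.oauth2 import id_token

# Placeholders: copy these from your Cloud Composer environment details.
IAP_CLIENT_ID = 'YOUR_IAP_CLIENT_ID.apps.googleusercontent.com'
WEBSERVER_URL = 'https://YOUR_AIRFLOW_WEBSERVER_ID.appspot.com'
DAG_NAME = 'retail_sentiment_dag'  # hypothetical DAG name


def trigger_dag(event, context):
    """Background Cloud Function fired by a Cloud Storage object-finalize event.

    Passes the uploaded object's bucket and name in the DAG run configuration
    so the Airflow workflow knows which file to process.
    """
    token = id_token.fetch_id_token(Request(), IAP_CLIENT_ID)
    response = requests.post(
        '{}/api/experimental/dags/{}/dag_runs'.format(WEBSERVER_URL, DAG_NAME),
        headers={'Authorization': 'Bearer {}'.format(token)},
        json={'conf': {'bucket': event['bucket'], 'name': event['name']}},
    )
    response.raise_for_status()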

You can configure a Cloud Composer workflow to perform the following steps in parallel, as shown in the diagram below:

  • copy data from Google Cloud Storage buckets to normalized tables in BigQuery
  • generate additional fields to capture sentiment analysis
Cloud Composer workflow that copies data from GCS to BigQuery and in parallel generates sentiment scoring through the Natural Language API.

Cloud Composer Workflow

The Cloud Composer workflow includes the following specific steps:

  1. Extract the input file that triggered the workflow.
  2. The Airflow microservice within Cloud Composer will manage the following steps:
    1. Decompress the input.
    2. Execute a Cloud Dataflow job that processes the file containing customer reviews to be sent to the Natural Language API.
    3. Send the results to BigQuery as the normalized layer of the data model.
    4. Process the other files to be sent to BigQuery, also within the normalized layer.

The Dataflow job includes the following Python code for Apache Beam:

Create Pipeline:

import apache_beam as beam

# Pipeline options for running on Cloud Dataflow. PROJECT, BUCKET, YESTERDAY,
# and src_path are defined elsewhere in the Composer DAG.
argv = [
    '--project={0}'.format(PROJECT),
    '--staging_location=gs://{0}/staging/'.format(BUCKET),
    '--temp_location=gs://{0}/staging/'.format(BUCKET),
    '--runner=DataflowRunner',
    '--job_name=dataextraction{0}'.format(YESTERDAY),
    '--save_main_session',
    '--requirements_file=/home/airflow/gcs/dags/requirements.txt',
    '--num_workers=1',
    '--max_num_workers=1'
]
p = beam.Pipeline(argv=argv)

(p
    | 'ReadData' >> beam.io.textio.ReadFromText(src_path)
    | 'ParseCSV' >> beam.ParDo(Sentiment())
    | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
          '[PROJECT-ID]:[DATASET-ID].[TARGET-TABLE]',
          write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
)
p.run()

Send a request to the Natural Language API and retrieve the sentiment score and magnitude fields:

from google.cloud import language
from google.cloud.language import enums, types

# Score the review text (content) with the Natural Language API.
client = language.LanguageServiceClient()
document = types.Document(content=content,
                          type=enums.Document.Type.PLAIN_TEXT)
sentiment = client.analyze_sentiment(document).document_sentiment
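
For completeness, here is a hedged sketch of what the Sentiment DoFn referenced in the pipeline above might look like, assuming the input CSV carries a customer ID in the first column and the review text in the second; the column layout and output field names are illustrative.

import apache_beam as beam


class Sentiment(beam.DoFn):
    """Parses a CSV line, scores the review text with the Natural Language API,
    and emits a row dict for the WriteToBigQuery sink."""

    def process(self, line):
        # Import inside process() so Dataflow workers pick up the dependency.
        from google.cloud import language
        from google.cloud.language import enums, types

        # Assumed layout: customer_id,review_text (adjust to the real file).
        customer_id, content = line.split(',', 1)

        client = language.LanguageServiceClient()
        document = types.Document(content=content,
                                  type=enums.Document.Type.PLAIN_TEXT)
        sentiment = client.analyze_sentiment(document).document_sentiment

        yield {
            'customer_id': customer_id,
            'review_text': content,
            'score': sentiment.score,
            'magnitude': sentiment.magnitude,
        }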

Customer Sentiment Analysis: measure customer sentiment based on any text input received from the customer. Source: Kaggle Datasets.

  • score: ranges between -1.0 (negative) and 1.0 (positive)
  • magnitude: indicates the perceived strength of emotion expressed; longer text responses may generate greater magnitudes.
Sentiment            Score    Magnitude
Clearly Positive*     0.8     > 3.0
Clearly Negative*    -0.6     > 4.0
Neutral               0.1       0.0
Mixed                 0.0       4.0

The Score and Magnitude fields are returned by the Natural Language API. Sentiment is a calculated field that will vary by use case.

Further analysis can be done using the sentiment results. Below is a sample Data Studio dashboard that visualizes a BigQuery data source that includes sentiment data generated from the Natural Language API.

Additional Use Cases

The presentation layer of the data model that we built above will facilitate the following additional analysis use cases.

Online Promotion Influence on Offline Sales
Understand the influence of online promotions on offline sales and vice versa

Inputs

  • Channel (Online/Offline)
  • Total amount
  • Total number of transactions
  • Number of visits

Customer Segmentation

Segment customers based on behavior (and then use the segments for website personalization or targeted marketing). A hedged modeling sketch follows the input and target lists below.

Input

  • Customer Number
  • Annual_Revenue
  • timeOnScreen
  • UniqueScreenViews
  • Number of Visits
  • Loyalty_Program
  • Lifetime_Value
  • Age

Target

  • Customer Number
  • Cluster ID
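
This post leans on Google Cloud services such as AutoML Tables for predictive modeling; as one hedged alternative for the segmentation use case, the sketch below runs BigQuery ML k-means clustering from Python over a hypothetical presentation-layer customer table whose columns are based on the inputs listed above.

from google.cloud import bigquery

client = bigquery.Client()

# Cluster customers on behavioral features. Project, dataset, table, and
# column names are placeholders.
client.query("""
    CREATE OR REPLACE MODEL `my-project.analytics.customer_segments`
    OPTIONS (model_type = 'kmeans', num_clusters = 5) AS
    SELECT annual_revenue, time_on_screen, unique_screen_views,
           number_of_visits, loyalty_program, lifetime_value, age
    FROM `my-project.presentation.customer`
""").result()

# Assign each customer to a cluster (the target: customer number + cluster ID).
segments = client.query("""
    SELECT customer_number, centroid_id AS cluster_id
    FROM ML.PREDICT(MODEL `my-project.analytics.customer_segments`,
                    (SELECT * FROM `my-project.presentation.customer`))
""").result()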

Determining Real Campaign Impact
Measure the real impact of a campaign by comparing actual results against a prediction of what would have happened without the campaign, when the campaign is run for a specific customer segment.

We select a segment of customers similar to the segment who were exposed to a campaign.

We select them from a period before the campaign, taking care to avoid seasonal periods, and use several features from the data to identify a similar segment:

  • Gender
  • Age (Segmented)
  • Total Purchase amount (Segmented)
  • Number of Visits

Predictive Modeling

Taking advantage of other Google Cloud Platform services such as AutoML Tables, we can generate additional data for analysis based on the original retail fields included in our model.

Customer Service Performance
Identify top- and bottom-performing customer service representatives and understand how customer service drives retention revenue:

  • Input
    • customer representative ID
    • channel (online chat, phone, email, other)
    • cases resolved
    • average time to resolution
    • average time to first response
    • satisfaction score
    • sales/order
    • revenue
  • Target
    • customer representative ID
    • predicted sales/orders
    • predicted revenue

Churn Prevention for Marketing
Identify users who are most likely to churn and target them with suitable messaging. A hedged modeling sketch follows the input and target lists below.

  • Input
    • Transactions
      • Payment data
      • Credit card expiration date
    • Product/website usage
      • Daily/Weekly/Monthly Average Users per account
      • Power Users
      • Login frequency
      • Number of features used
      • Number of high value / sticky features used
    • Sentiment Result
      • Customer
      • Sentiment Analysis Result
      • Number of Transactions
      • Total Revenue
  • Target
    • Customer ID
    • Prediction date
    • Probability (NUM)
    • Prediction of churn (BOOL)
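
As a hedged sketch of one way to produce these targets, the example below trains a BigQuery ML logistic regression model from Python instead of AutoML Tables; the table and column names are placeholders for a churn data mart built from the inputs listed above.

from google.cloud import bigquery

client = bigquery.Client()

# Train on historical customers labelled with whether they churned.
client.query("""
    CREATE OR REPLACE MODEL `my-project.analytics.churn_model`
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
    SELECT login_frequency, features_used, sentiment_score,
           number_of_transactions, total_revenue, churned
    FROM `my-project.presentation.churn_training`
""").result()

# Score current customers: a churn prediction plus class probabilities.
predictions = client.query("""
    SELECT customer_id,
           CURRENT_DATE() AS prediction_date,
           predicted_churned AS churn_prediction,
           predicted_churned_probs
    FROM ML.PREDICT(MODEL `my-project.analytics.churn_model`,
                    (SELECT * FROM `my-project.presentation.churn_current`))
""").result()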

Customer Lifetime Value Prediction

Determine characteristics of users most likely to generate high lifetime value after initial conversion and target marketing to other users who demonstrate similar characteristics. A hedged modeling sketch follows the input and target lists below.

Input

  • Customer Number
  • Annual_Revenue
  • timeOnScreen
  • UniqueScreenViews
  • Number of Visits
  • Loyalty_Program
  • Age

Target

  • Customer Number
  • Predicted Lifetime value
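
As with the churn use case, a hedged sketch of one way to produce this target is shown below, this time with a BigQuery ML linear regression model trained from Python rather than AutoML Tables; the table and column names are placeholders based on the inputs listed above.

from google.cloud import bigquery

client = bigquery.Client()

# Train a regression model on customers whose lifetime value is already known.
client.query("""
    CREATE OR REPLACE MODEL `my-project.analytics.clv_model`
    OPTIONS (model_type = 'linear_reg', input_label_cols = ['lifetime_value']) AS
    SELECT annual_revenue, time_on_screen, unique_screen_views,
           number_of_visits, loyalty_program, age, lifetime_value
    FROM `my-project.presentation.clv_training`
""").result()

# Predict lifetime value for newly converted customers.
predictions = client.query("""
    SELECT customer_number, predicted_lifetime_value
    FROM ML.PREDICT(MODEL `my-project.analytics.clv_model`,
                    (SELECT * FROM `my-project.presentation.clv_new_customers`))
""").result()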

Build Your Foundation

Data modeling is foundational work that encourages cleaner, more deliberate data capture and facilitates better analysis. Using the capabilities of BigQuery and other components of the Google Cloud Platform, we’re able not only to build robust physical data architectures but also to enrich our data model with additional fields output from predictive modeling and to target the most relevant users with the most relevant messages.

For any questions about data modeling, data architecture, and advanced analysis, please contact us today with no obligation.