The ultimate objectives of data capture are, of course, insight and action. For many organizations, however, the gulf between data capture and well-informed, data-driven decisions remains wide.
As a discipline, data modeling is intended to foster a conscious, outcome-oriented approach to data capture and enable meaningful reporting, visualization, and analysis. More specifically, data modeling can help developers and data professionals to:
- fully inventory and understand the available and needed data in their target domains
- anticipate the types of analysis to be performed on the data
- facilitate presentation and reporting
Data modeling can apply to a wide range of domains. In this blog post, we concentrate on modeling Google Analytics e-commerce data integrated with other back-end retail data.
We approach the retail data model in four phases:
- We map out a normalized schema in BigQuery that integrates online and offline data sources.
- We populate the normalized schema in a BigQuery staging area.
- We replicate the normalized data into a denormalized schema, also within BigQuery, optimized for fast reads as a presentation layer.
- We then outline several types of analysis that the data model will facilitate:
- Customer Sentiment Analysis
- Customer Service Performance
- Online Promotion Influence on Offline Sales (and Vice Versa)
- Churn Prevention for Marketing
- Determining Real Campaign Impact
- Customer Segmentation
- Customer Lifetime Value Prediction
Along the way, we take advantage of Google Cloud Platform’s machine learning services to generate additional dimensions from the data already included in the model.
Before we dive in, let’s review some essential data management concepts:
Data Lake: fairly unstructured, comprehensive data repository. Your data lake might contain files in Cloud Storage or Google Drive or transactional data in Cloud Bigtable.
Data Warehouse: structured, transformed data repository, designed with defined use cases in mind.
Data Mart: subset of the data warehouse typically oriented towards a specific team, business function, or KPI.
Data Model: model that organizes elements of data and their interrelationships.
A data model can help to clarify requirements for a Data Warehouse or a Data Mart.
Unified, Efficient Presentation Layer
The retail data model that we define in this post will enable a unified presentation layer in BigQuery. We can aggregate and transform the external data sources using a range of Google Cloud Platform components, including:
- Cloud Functions to transport data to Google Cloud Storage for staging
- Cloud Dataflow to move data from Google Cloud Storage to BigQuery
- Cloud Composer to orchestrate data pipelines jobs
The specific architecture is guided by the pipelines that the data model will reference.
The resulting data model, and the physical data warehouse or data marts that it can help to shape, will offer the following benefits:
- unification: the data model will bring together online, in-store, and back-end data.
- efficiency: since the data model is itself built around different reporting requirements, consumers of the corresponding data marts will be able to efficiently query and present only the data that is needed for reporting and analysis.
Normalized-to-Denormalized Structure
We’ll start the data model in a normalized form but then move to denormalization:
- Normalized form represents a systematic approach of decomposing tables to eliminate data redundancy (repetition) and anomalies due to record insertion, update, and deletion. To minimize redundancy, normalized form relies on strong relationships between tables.
- Denormalized form represents a strategy applied to previously normalized database tables in order to increase performance. With some redundancy and decreased write performance, read performance is improved.
So normalized form favors storage efficiency and explicit relations between entities, while denormalized form favors read efficiency. The first layer of our data model will be normalized; the second layer will be denormalized but will also take advantage of BigQuery’s nested and repeated field structure to preserve some of the benefits of normalization.
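As a brief illustration of that nested and repeated structure, the sketch below defines a denormalized orders table whose line items live inside each order row as a repeated RECORD field, using the BigQuery Python client library. The project, dataset, table, and column names are illustrative rather than part of the model itself.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Line items are stored inside the order row as a repeated RECORD (nested) field,
# so order-level reads do not require a join back to an order-items table.
schema = [
    bigquery.SchemaField("order_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("order_date", "DATE"),
    bigquery.SchemaField(
        "order_items", "RECORD", mode="REPEATED",
        fields=[
            bigquery.SchemaField("product_id", "STRING"),
            bigquery.SchemaField("quantity", "INTEGER"),
            bigquery.SchemaField("unit_price", "NUMERIC"),
        ],
    ),
]

table = bigquery.Table("my-project.retail_presentation.orders", schema=schema)
client.create_table(table)
```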
Normalized Model Structure
We design the first layer of the model in normalized form in a staging area of BigQuery to organize and cleanse the data. It will encompass the following main areas of data:
- Customer
- Products
- Order Transactions
- Customer Service
- Loyalty Program
- Click Stream
- Promotions
- Staff
The physical tables for this layer of our data model will reside in BigQuery and will be accessed only by developers rather than used directly for presentation. While BigQuery is not an RDBMS, we can still apply some relational concepts, such as logical joins and primary-key conventions (enforced by our pipelines rather than by BigQuery itself), to map the relations between tables and ensure a high degree of data integrity and data quality in this first layer of the retail data model.
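Since BigQuery does not enforce primary keys, one lightweight way to uphold that convention is a data-quality check in the load pipeline. The sketch below, with illustrative project, dataset, table, and column names, flags duplicate keys in a staging table before the data moves downstream.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Fail the load if the logical primary key is not unique in the staging table.
query = """
SELECT customer_id, COUNT(*) AS dup_count
FROM `my-project.retail_staging.customer`
GROUP BY customer_id
HAVING COUNT(*) > 1
"""
duplicates = list(client.query(query).result())
if duplicates:
    raise ValueError("Duplicate customer_id values found: {}".format(len(duplicates)))
```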
(Diagram: retail data model, normalized form)
Note that this model represents a subset of a comprehensive retail data model, which could also encompass inventory, finance, and other aspects of data capture for a retail organization.
The diagram above shows the initial design of the retail relational data model in normalized form.
BigQuery is designed to support denormalized data more than normalized data. The increase in query speed that a denormalized data model offers outweighs the storage savings of normalized data, and storage is generally a lower priority in modern systems.
There are two main considerations when deciding whether to keep a table in normalized form or to replicate it in denormalized form as you design your schema for BigQuery:
- denormalize if large: if a dimension table is larger than 10 gigabytes, it is a good indicator that you should denormalize the dimension table.
- keep normalized if you have frequent UPDATEs and DELETEs: if these operations are frequent, keep the dimension table normalized (and smaller than 10 gigabytes).
Breaking out Reference Fields in Normalized Form
Whether you’re creating a normalized layer that precedes the denormalized presentation layer or you’re using the normalized data layer directly for presentation, you can enhance performance by breaking out frequently updated fields into their own entity within the normalized schema.
Denormalization for the Presentation Layer
Based on the best-practice considerations outlined above, we’ll design the presentation layer to be denormalized. Below is a denormalized BigQuery schema corresponding to the normalized schema above.
Product:
- Product
- Product price
- Product discount
- Product category
- Product category discount
Customer and Contract:
- Customer
- Customer Reward Point Log
- Membership type
Service Request:
- Service Request
- Agent
- Service Request Feedback
Sales and Inventory:
- Order
- Order Feedback (separate data source, due to the 150-character limit on Google Analytics custom dimension values)
- Order Items
- E-commerce (Click Stream)
- Session
- Hits
- Checkout Step
- Promotions
- Sales Performance (online or offline channel)
- Marketing Campaign
- Customer Journey (first activity date, last activity date – its own data mart)
Note that within the tables above, the e-commerce data originates from Google Analytics and is stored in nested, repeated form.
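To give a sense of how that nested, repeated Google Analytics data is queried, the sketch below counts pageviews per page in the public Google Analytics sample dataset in BigQuery by unnesting the repeated hits field; in practice you would point the same pattern at your own export tables.

```python
from google.cloud import bigquery

client = bigquery.Client()

# UNNEST flattens the repeated `hits` records so page-level fields can be aggregated.
query = """
SELECT
  hits.page.pagePath AS page_path,
  COUNT(*) AS pageviews
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`,
  UNNEST(hits) AS hits
WHERE hits.type = 'PAGE'
GROUP BY page_path
ORDER BY pageviews DESC
LIMIT 10
"""
for row in client.query(query).result():
    print(row.page_path, row.pageviews)
```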
Workflow: Customer Sentiment Analysis
One of the types of analysis that our data model will enable is the analysis of customer sentiment based on survey inputs.
The end-to-end workflow mapped below will include the transfer of data to Google Cloud Storage, the population of the normalized and denormalized layers of the data model in BigQuery, and the generation of additional fields through Google’s Natural Language API.
Step 1: Preparing Files
The source data will be either files or external databases. In the case of an external database, you can export tables as compressed tar.gz files in preparation for transfer to Google Cloud Storage.
Step 2: Transferring Files to Google Cloud Storage
There are several ways to transfer data to a Google Cloud Storage bucket, including the two options below:
- ETL Job using any data integration tool, such as Talend
- Batch / Shell scripts using gsutil command in Google Cloud SDK
$ gsutil cp [LOCAL-FILE] gs://[BUCKET-NAME]/[OBJECT-PATH]
Step 3: Organizing Transferred Files
The Cloud Storage data lake for the retail data will consist of two buckets (a bucket-creation sketch follows the list below):
- Staging: a regional Standard storage class bucket where we receive the tar.gz files and decompress them
- Archive: depending on the expected frequency of disaster recovery or the need to query for new KPIs, you can configure the archive as either a regional Nearline or Coldline bucket:
- Nearline bucket: choose if access is expected roughly once per month
- Coldline bucket: choose if access is expected roughly once per year
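As a minimal sketch, the two buckets could be created with the Cloud Storage Python client as shown below; the bucket names and region are illustrative.

```python
from google.cloud import storage

client = storage.Client()

# Regional Standard bucket for receiving and decompressing the tar.gz files.
staging = client.bucket("retail-data-staging")
staging.storage_class = "STANDARD"
client.create_bucket(staging, location="us-central1")

# Archive bucket; switch to "NEARLINE" if roughly monthly access is expected.
archive = client.bucket("retail-data-archive")
archive.storage_class = "COLDLINE"
client.create_bucket(archive, location="us-central1")
```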
Step 4: Reading Data from GCS Bucket
You can set up a Cloud Function to fire based on a Cloud Storage trigger for when a file is uploaded for processing. The Cloud Function itself can then trigger an Airflow workflow in Cloud Composer.
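Below is a minimal sketch of such a Cloud Function, assuming an Airflow 1.x web server behind Identity-Aware Proxy and the experimental REST API; the environment variable names and DAG id are hypothetical.

```python
# Hypothetical Cloud Function (Python) triggered by a Cloud Storage "finalize" event.
import os

import google.auth.transport.requests
import google.oauth2.id_token
import requests

AIRFLOW_WEB_URL = os.environ["AIRFLOW_WEB_URL"]  # Composer Airflow web server URL (assumed)
IAP_CLIENT_ID = os.environ["IAP_CLIENT_ID"]      # OAuth client ID protecting the web server (assumed)
DAG_ID = "retail_sentiment_pipeline"             # illustrative DAG name


def gcs_trigger(event, context):
    """Starts the Airflow DAG, passing along the bucket and file that arrived."""
    auth_request = google.auth.transport.requests.Request()
    id_token = google.oauth2.id_token.fetch_id_token(auth_request, IAP_CLIENT_ID)

    response = requests.post(
        "{}/api/experimental/dags/{}/dag_runs".format(AIRFLOW_WEB_URL, DAG_ID),
        headers={"Authorization": "Bearer {}".format(id_token)},
        json={"conf": {"bucket": event["bucket"], "name": event["name"]}},
    )
    response.raise_for_status()
```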
You can configure a Cloud Composer workflow to perform the following steps in parallel, as shown in the diagram below:
- copy data from Google Cloud Storage buckets to normalized tables in BigQuery
- generate additional fields to capture sentiment analysis
Cloud Composer Workflow
The Cloud Composer workflow includes the following specific steps (a minimal DAG sketch follows this list):
- Extract the input file that triggered the workflow.
- Let the Airflow service within Cloud Composer manage the following steps:
- Decompress the input
- Execute a Cloud Dataflow job that processes the file containing customer reviews to be sent to the Natural Language API
- Send the results to BigQuery, into the normalized layer of the data model
- Process the other files to be sent to BigQuery, also within the normalized layer
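Here is the minimal DAG sketch referenced above, using Airflow 1.10-style imports as found in Cloud Composer at the time; the DAG id, bucket names, and file paths are illustrative.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {"start_date": datetime(2020, 1, 1), "retries": 1}

with DAG("retail_sentiment_pipeline", schedule_interval=None,
         default_args=default_args, catchup=False) as dag:

    # 1. Copy the uploaded archive locally, decompress it, and stage the extracted files.
    decompress = BashOperator(
        task_id="decompress_input",
        bash_command=(
            "mkdir -p /tmp/extracted && "
            "gsutil cp gs://retail-data-staging/{{ dag_run.conf['name'] }} /tmp/input.tar.gz && "
            "tar -xzf /tmp/input.tar.gz -C /tmp/extracted && "
            "gsutil -m cp /tmp/extracted/* gs://retail-data-staging/extracted/"
        ),
    )

    # 2. Launch the Beam pipeline (shown below) that calls the Natural Language API
    #    and writes the results to the normalized layer in BigQuery.
    run_sentiment_dataflow = BashOperator(
        task_id="run_sentiment_dataflow",
        bash_command="python /home/airflow/gcs/dags/sentiment_pipeline.py",
    )

    decompress >> run_sentiment_dataflow
```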
The Dataflow job includes the following Python code for Apache Beam:
Create the pipeline:

```python
# Assumes PROJECT, BUCKET, YESTERDAY, and src_path are defined earlier in the script,
# and that Sentiment is the beam.DoFn (shown below) that calls the Natural Language API.
import apache_beam as beam

argv = [
    '--project={0}'.format(PROJECT),
    '--staging_location=gs://{0}/staging/'.format(BUCKET),
    '--temp_location=gs://{0}/staging/'.format(BUCKET),
    '--runner=DataflowRunner',
    '--job_name=dataextraction{0}'.format(YESTERDAY),
    '--save_main_session',
    '--requirements_file=/home/airflow/gcs/dags/requirements.txt',
    '--num_workers=1',
    '--max_num_workers=1',
]

p = beam.Pipeline(argv=argv)
(p
 | 'ReadData' >> beam.io.textio.ReadFromText(src_path)
 | 'ParseCSV' >> beam.ParDo(Sentiment())
 | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
       '[PROJECT-ID]:[DATASET-ID].[TARGET-TABLE]',
       write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
)
p.run()
```
Send a request to the Natural Language API and retrieve the sentiment score and magnitude:

```python
# Uses the Natural Language API client library (pre-2.0 interface, matching the enums/types usage);
# `content` holds the customer review text read from the input file.
from google.cloud import language
from google.cloud.language import enums, types

client = language.LanguageServiceClient()
document = types.Document(content=content, type=enums.Document.Type.PLAIN_TEXT)
sentiment = client.analyze_sentiment(document).document_sentiment  # exposes .score and .magnitude
```
Customer Sentiment Analysis: measure customer sentiment based on any text input received from the customer. Source: Kaggle Datasets.
- the score of the sentiment ranges between -1.0 (negative) and 1.0 (positive)
- the magnitude indicates the perceived strength of emotion expressed; longer text responses may generate greater magnitudes
Sentiment | Score | Magnitude |
---|---|---|
Clearly Positive* | 0.8 | > 3.0 |
Clearly Negative* | -0.6 | > 4.0 |
Neutral | 0.1 | 0.0 |
Mixed | 0.0 | 4.0 |
The Score and Magnitude fields are returned by the Natural Language API. Sentiment is a calculated field that will vary by use case.
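One possible way to derive that calculated Sentiment field is a small labeling function like the sketch below; the thresholds are illustrative, loosely based on the sample values above, and should be tuned to your own use case.

```python
def label_sentiment(score, magnitude):
    """Map the Natural Language API's score and magnitude to an illustrative sentiment label."""
    if score >= 0.5 and magnitude >= 3.0:
        return "Clearly Positive"
    if score <= -0.5 and magnitude >= 3.0:
        return "Clearly Negative"
    if abs(score) <= 0.2 and magnitude < 1.0:
        return "Neutral"
    return "Mixed"
```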
Further analysis can be done using the sentiment results. Below is a sample Data Studio dashboard that visualizes a BigQuery data source that includes sentiment data generated from the Natural Language API.
Additional Use Cases
The presentation layer of the data model that we built above will facilitate the following additional analysis use cases.
Online Promotion Influence on Offline Sales
Understand the influence of online promotions on offline sales and vice versa. A sample query sketch follows the input list below.
Inputs
- Channel (Online/Offline)
- Total amount
- Total number of transactions
- Number of visits
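Here is the query sketch referenced above, run against the presentation layer; the project, dataset, table, and column names are illustrative stand-ins for the sales performance and promotions tables in the model.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Compare transactions and revenue by channel during each promotion window.
query = """
SELECT
  p.promotion_id,
  s.channel,                          -- 'online' or 'offline'
  COUNT(DISTINCT s.order_id) AS transactions,
  SUM(s.total_amount)        AS total_amount
FROM `my-project.retail_presentation.sales_performance` AS s
JOIN `my-project.retail_presentation.promotions` AS p
  ON s.order_date BETWEEN p.start_date AND p.end_date
GROUP BY p.promotion_id, s.channel
ORDER BY p.promotion_id, s.channel
"""
for row in client.query(query).result():
    print(row.promotion_id, row.channel, row.transactions, row.total_amount)
```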
Customer Segmentation
Segment customers based on behavior (and then use the segments for website personalization or targeted marketing). A BigQuery ML clustering sketch follows the input and target lists below.
Input
- Customer Number
- Annual_Revenue
- timeOnScreen
- UniqueScreenViews
- Number of Visits
- Loyalty_Program
- Lifetime_Value
- Age
Target
- Customer Number
- Cluster ID
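Here is the clustering sketch referenced above, training a BigQuery ML k-means model on the inputs and returning a cluster ID per customer; the model, dataset, table, and column names (and the choice of five clusters) are illustrative.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Train a k-means model on the customer behavior features.
create_model = """
CREATE OR REPLACE MODEL `my-project.retail_analytics.customer_segments`
OPTIONS (model_type = 'kmeans', num_clusters = 5) AS
SELECT
  annual_revenue, time_on_screen, unique_screen_views,
  number_of_visits, loyalty_program, lifetime_value, age
FROM `my-project.retail_presentation.customer_features`
"""
client.query(create_model).result()

# Assign each customer to a cluster (the target: customer number plus cluster ID).
assign_clusters = """
SELECT customer_number, CENTROID_ID AS cluster_id
FROM ML.PREDICT(
  MODEL `my-project.retail_analytics.customer_segments`,
  (SELECT * FROM `my-project.retail_presentation.customer_features`))
"""
for row in client.query(assign_clusters).result():
    print(row.customer_number, row.cluster_id)
```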
Determining Real Campaign Impact
Measure the real impact of a campaign by comparing actual results against a prediction of what would have happened without the campaign, when the campaign targets a specific segment of customers.
We select a segment of customers similar to the segment that was exposed to the campaign.
We select them from a period before the campaign, taking care to avoid seasonal periods, and use the following features from the data to identify a similar segment:
- Gender
- Age (Segmented)
- Total Purchase amount (Segmented)
- Number of Visits
Predictive Modeling
Taking advantage of other Google Cloud Platform services such as AutoML Tables, we can generate additional data for analysis based on the original retail fields included in our model.
Customer Service Performance
Understand top- and bottom-performing customer service representatives, and understand how customer service drives retention revenue:
- Input
- customer representative ID
- channel (online chat, phone, email, other)
- cases resolved
- average time to resolution
- average time to first response
- satisfaction score
- sales/order
- revenue
- Target
- customer representative ID
- predicted sales/orders
- predicted revenue
Churn Prevention for Marketing
Identify users who are most likely to churn and target them with suitable messaging. A BigQuery ML classification sketch follows the input and target lists below.
- Input
- Transactions
- Payment data
- Credit card expiration date
- Product/website usage
- Daily/Weekly/Monthly Average Users per account
- Power Users
- Login frequency
- Number of features used
- Number of high value / sticky features used
- Sentiment Result
- Customer
- Sentiment Analysis Result
- Number of Transactions
- Total Revenue
- Transactions
- Target
- Customer ID
- Prediction date
- Probability (NUM)
- Prediction of churn (BOOL)
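Below is the classification sketch referenced above, assuming a training table with a labeled BOOL churned column derived from historical behavior; all project, dataset, table, and column names are illustrative.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Train a logistic regression model to predict churn from behavioral and sentiment features.
train = """
CREATE OR REPLACE MODEL `my-project.retail_analytics.churn_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT
  number_of_transactions, total_revenue, login_frequency,
  features_used, sentiment_score, churned
FROM `my-project.retail_presentation.customer_churn_features`
"""
client.query(train).result()

# Score current customers; ML.PREDICT returns the predicted label and class probabilities.
predict = """
SELECT
  customer_id,
  CURRENT_DATE() AS prediction_date,
  predicted_churned,
  predicted_churned_probs
FROM ML.PREDICT(
  MODEL `my-project.retail_analytics.churn_model`,
  (SELECT * FROM `my-project.retail_presentation.customer_churn_features`))
"""
predictions = client.query(predict).result()
```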
Customer Lifetime Value Prediction
Determine the characteristics of users most likely to generate high lifetime value after initial conversion, and target marketing to other users who demonstrate similar characteristics. A prediction sketch follows the input and target lists below.
Input
- Customer Number
- Annual_Revenue
- timeOnScreen
- UniqueScreenViews
- Number of Visits
- Loyalty_Program
- Age
Target
- Customer Number
- Predicted Lifetime value
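Assuming a regression model (for example, BigQuery ML with model_type = 'linear_reg' and lifetime_value as the label) trained in the same way as the churn model above, scoring customers for predicted lifetime value could look like the sketch below; all names are illustrative.

```python
from google.cloud import bigquery

client = bigquery.Client()

# ML.PREDICT on a regression model returns predicted_<label>, here predicted_lifetime_value.
query = """
SELECT
  customer_number,
  predicted_lifetime_value
FROM ML.PREDICT(
  MODEL `my-project.retail_analytics.clv_model`,
  (SELECT customer_number, annual_revenue, time_on_screen, unique_screen_views,
          number_of_visits, loyalty_program, age
   FROM `my-project.retail_presentation.customer_features`))
ORDER BY predicted_lifetime_value DESC
"""
for row in client.query(query).result():
    print(row.customer_number, row.predicted_lifetime_value)
```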
Build Your Foundation
Data modeling is foundational work that encourages cleaner, more deliberate data capture and facilitates better analysis. Using the capabilities of BigQuery and other components of the Google Cloud Platform, we’re able not only to build robust physical data architectures, but also to enrich our data model with additional fields output from predictive modeling and to target the most relevant users with the most relevant messages.