The ultimate objectives of data capture are, of course, insight and action. For many organizations, however, the gulf between data capture and well-informed, data-driven decisions remains wide.
As a discipline, data modeling is intended to foster a conscious, outcome-oriented approach to data capture and enable meaningful reporting, visualization, and analysis. More specifically, data modeling can help developers and data professionals to:
Data modeling can apply to a wide range of domains. In this blog post, we concentrate on modeling Google Analytics e-commerce data integrated with other back-end retail data.
We approach the retail data model in four phases:
Along the way, we take advantage of Google Cloud Platform’s data modeling services to generate additional dimensions against the data already included in the model.
The retail data model that we define in this post will enable a unified presentation layer in BigQuery. We can aggregate and transform the external data sources using a range of Google Cloud Platform components, including:
The specific architecture is guided by the pipelines that the data model will reference.
The resulting data model, and the physical data warehouse or data marts that it can help to shape, will offer the following benefits:
We’ll start the data model in a normalized form but then move to denormalization:
In short, the normalized form favors storage efficiency and explicit relations between entities, while the denormalized form favors read efficiency. The first layer of our data model will be normalized; the second layer will be denormalized but will also take advantage of BigQuery’s nested and repeated field structure to preserve some of the benefits of normalization.
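To illustrate what that nested, repeated structure can look like, the sketch below defines a hypothetical denormalized order table with the BigQuery Python client; the project, dataset, and field names are assumptions for illustration, not the exact tables defined later in this post.

from google.cloud import bigquery

client = bigquery.Client()

# A repeated RECORD field keeps each order's line items nested inside the
# parent row, so reads avoid a join while the line-item grouping is preserved.
schema = [
    bigquery.SchemaField("order_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("order_date", "DATE"),
    bigquery.SchemaField(
        "line_items", "RECORD", mode="REPEATED",
        fields=[
            bigquery.SchemaField("product_id", "STRING"),
            bigquery.SchemaField("quantity", "INTEGER"),
            bigquery.SchemaField("unit_price", "NUMERIC"),
        ],
    ),
]

table = bigquery.Table("my-project.retail_presentation.orders", schema=schema)
client.create_table(table)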
Normalized Model Structure
We design the first layer of the model in normalized form in a staging area of BigQuery to organize and cleanse the data. It will encompass the following main areas of data:
The physical tables for this layer of our data model will reside in BigQuery and will be accessed only by the developers rather than used directly for presentation. While BigQuery is not considered an RDBMS, we’ll be able to take advantage of some relational concepts such as logical joins and primary key enforcement to map the relations between tables and ensure a high degree of data integrity and data quality in this first layer of the retail data model.
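Because BigQuery does not enforce key constraints natively, one way to back up that enforcement, sketched below with assumed project, dataset, and column names, is a simple data-quality query run from Python before data is promoted out of the staging layer.

from google.cloud import bigquery

client = bigquery.Client()

# Fail fast if the staging customer table contains duplicate primary keys.
sql = """
    SELECT customer_id, COUNT(*) AS cnt
    FROM `my-project.retail_staging.customer`
    GROUP BY customer_id
    HAVING cnt > 1
"""

duplicates = list(client.query(sql).result())
if duplicates:
    raise ValueError(
        "{} duplicate customer_id values found in staging".format(len(duplicates)))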
Note that this model represents a subset of a comprehensive retail data model, which could also encompass inventory, finance, and other aspects of data capture for a retail organization.
The diagram above shows the initial design of the retail relational data model in normalized form.
BigQuery is designed to work with denormalized data more readily than normalized data. The increase in query speed that a denormalized data model offers generally outweighs the storage savings of normalization, which is not as high a priority in modern systems.
Whether you’re creating a normalized layer that precedes the denormalized presentation layer or you’re using the normalized data layer directly for presentation, you can enhance performance by breaking out frequently updated fields into their own entity within the normalized schema.
Denormalization for the Presentation Layer:
Based on the best-practice considerations outlined above, we’ll design the presentation layer to be denormalized. Below is a denormalized BigQuery schema corresponding to the normalized schema above.
Product
Customer and Contract
Service Request
Sales and Inventory
Note that within the tables above, the e-commerce data originates from Google Analytics and is stored in nested, repeated form.
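As an example of reading that nested Google Analytics data, the query below flattens the hit-level product records of the standard ga_sessions_ export with UNNEST; the project and dataset names are placeholders.

from google.cloud import bigquery

client = bigquery.Client()

# Flatten the repeated hits and hits.product records of the GA export schema.
sql = """
    SELECT
      fullVisitorId,
      h.transaction.transactionId AS transaction_id,
      p.productSKU AS product_sku,
      p.productQuantity AS quantity
    FROM `my-project.ga_export.ga_sessions_20200101`,
      UNNEST(hits) AS h,
      UNNEST(h.product) AS p
    WHERE h.eCommerceAction.action_type = '6'  -- '6' = completed purchase
"""

for row in client.query(sql).result():
    print(row.product_sku, row.quantity)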
One of the types of analysis that our data model will enable is the analysis of customer sentiment based on survey inputs.
The end-to-end workflow mapped below will include the transfer of data to Google Cloud Storage, the population of the normalized and denormalized layers of the data model in BigQuery, and the generation of additional fields through Google’s Natural Language API.
The source data will be either files or external databases. In the case of an external database, you can export its tables as compressed tar.gz files in preparation for transfer to Google Cloud Storage.
There are several ways to transfer data to a Google Cloud Storage bucket, including the two options below:
$ gsutil cp [LOCAL_OBJECT_LOCATION] gs://[DESTINATION_BUCKET_NAME]/
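For scripted transfers, the Cloud Storage Python client library can perform the same upload as the gsutil command above; the bucket and object names in this sketch are placeholders.

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("retail-source-data")              # placeholder bucket name
blob = bucket.blob("exports/customer_20200101.tar.gz")    # destination object path
blob.upload_from_filename("customer_20200101.tar.gz")     # local export file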
The Cloud Storage data lake for the retail data will consist of two buckets:
You can set up a Cloud Function to fire based on a Cloud Storage trigger for when a file is uploaded for processing. The Cloud Function itself can then trigger an Airflow workflow in Cloud Composer.
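Here is a minimal sketch of such a Cloud Function, assuming an Airflow 1.10 environment with the experimental REST API available; the web-server URL and DAG id are placeholders, and the IAP authentication step that Cloud Composer requires is omitted for brevity.

import requests

AIRFLOW_WEB_SERVER = "https://YOUR-COMPOSER-WEBSERVER-URL"  # placeholder
DAG_ID = "retail_load_dag"                                  # hypothetical DAG id

def trigger_dag(data, context):
    """Background Cloud Function fired by a Cloud Storage finalize event."""
    # Pass the uploaded object's location to the DAG run configuration.
    payload = {"conf": {"bucket": data["bucket"], "name": data["name"]}}
    # In a real deployment this request must carry an IAP identity token.
    resp = requests.post(
        "{}/api/experimental/dags/{}/dag_runs".format(AIRFLOW_WEB_SERVER, DAG_ID),
        json=payload,
    )
    resp.raise_for_status()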
You can configure a Cloud Composer workflow to perform the following steps in parallel, as shown in the diagram below:
Cloud Composer Workflow
The Cloud Composer workflow includes the following specific steps:
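As a rough sketch of how such a workflow could be laid out (task names, commands, and file paths below are assumptions rather than the exact DAG behind the diagram), two loading branches run in parallel before the presentation layer is built:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {"start_date": datetime(2020, 1, 1)}

with DAG("retail_load_dag", default_args=default_args,
         schedule_interval=None, catchup=False) as dag:

    # Load the relational exports from Cloud Storage into the staging dataset.
    load_relational = BashOperator(
        task_id="load_relational_exports",
        bash_command="bq load --autodetect retail_staging.customer "
                     "gs://retail-source-data/exports/customer_*.csv",
    )

    # Launch the Dataflow sentiment pipeline shown in the next section.
    run_sentiment_dataflow = BashOperator(
        task_id="run_sentiment_dataflow",
        bash_command="python /home/airflow/gcs/dags/sentiment_pipeline.py",
    )

    # Build the denormalized presentation tables once both branches finish.
    build_presentation = BashOperator(
        task_id="build_presentation_layer",
        bash_command="bq query --use_legacy_sql=false "
                     "< /home/airflow/gcs/dags/sql/build_presentation.sql",
    )

    [load_relational, run_sentiment_dataflow] >> build_presentation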
The Dataflow job includes the following Python code for Apache Beam:
Create Pipeline:
import apache_beam as beam

# PROJECT, BUCKET, YESTERDAY, src_path, and the Sentiment DoFn are defined
# elsewhere in the job script.
argv = [
    '--project={0}'.format(PROJECT),
    '--staging_location=gs://{0}/staging/'.format(BUCKET),
    '--temp_location=gs://{0}/staging/'.format(BUCKET),
    '--runner=DataflowRunner',
    '--job_name=dataextraction{0}'.format(YESTERDAY),
    '--save_main_session',
    '--requirements_file=/home/airflow/gcs/dags/requirements.txt',
    '--num_workers=1',
    '--max_num_workers=1'
]

p = beam.Pipeline(argv=argv)
(p
 | 'ReadData' >> beam.io.textio.ReadFromText(src_path)
 | 'ParseCSV' >> beam.ParDo(Sentiment())
 | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
        '[PROJECT-ID]:[DATASET-ID].[TARGET-TABLE]',
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
)
p.run()
Send a request to the Natural Language API and retrieve the document sentiment with its Score and Magnitude fields:
from google.cloud import language
from google.cloud.language import enums, types

# `content` holds the customer text to be analyzed.
client = language.LanguageServiceClient()
document = types.Document(content=content, type=enums.Document.Type.PLAIN_TEXT)
sentiment = client.analyze_sentiment(document).document_sentiment
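Putting the two snippets together, the Sentiment transform referenced in the pipeline above could look roughly like the DoFn below; the CSV layout and output column names are assumptions.

import apache_beam as beam

class Sentiment(beam.DoFn):
    def process(self, element):
        # Import inside process() so the dependencies resolve on Dataflow workers.
        from google.cloud import language
        from google.cloud.language import enums, types

        # Assumed CSV layout: customer_id,survey_text
        customer_id, survey_text = element.split(',', 1)

        client = language.LanguageServiceClient()
        document = types.Document(
            content=survey_text, type=enums.Document.Type.PLAIN_TEXT)
        sentiment = client.analyze_sentiment(document).document_sentiment

        # One output row per input line, matching the BigQuery target schema.
        yield {
            'customer_id': customer_id,
            'survey_text': survey_text,
            'score': sentiment.score,
            'magnitude': sentiment.magnitude,
        }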
Customer Sentiment Analysis: measure customer sentiment based on any text input received from the customer. Source: Kaggle Datasets.
Sentiment | Score | Magnitude |
---|---|---|
Clearly Positive* | 0.8 | > 3.0 |
Clearly Negative* | -0.6 | > 4.0 |
Neutral | 0.1 | 0.0 |
Mixed | 0.0 | 4.0 |
The Score and Magnitude fields are returned by the Natural Language API. Sentiment is a calculated field that will vary by use case.
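One possible way to derive that calculated Sentiment field from the returned Score and Magnitude values is sketched below; the thresholds are illustrative and should be tuned per use case.

def categorize_sentiment(score, magnitude):
    """Map Natural Language API score/magnitude values to a sentiment label."""
    if score >= 0.5 and magnitude > 1.0:
        return 'Clearly Positive'
    if score <= -0.5 and magnitude > 1.0:
        return 'Clearly Negative'
    if abs(score) < 0.2 and magnitude < 1.0:
        return 'Neutral'
    return 'Mixed'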
Further analysis can be done using the sentiment results. Below is a sample Data Studio dashboard that visualizes a BigQuery data source that includes sentiment data generated from the Natural Language API.
The presentation layer of the data model that we built above will facilitate the following additional analysis use cases.
Online Promotion Influence on Offline Sales
Understand the influence of online promotions on offline sales and vice versa
Inputs
Customer Segmentation
Segment customers based on behavior (and then use the segments for website personalization or targeted marketing).
Input
Target
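As an illustration of how the segmentation use case above might be produced, the sketch below clusters customers on a few assumed behavioral features with k-means; the table, feature names, and cluster count are placeholders.

from google.cloud import bigquery
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

client = bigquery.Client()

# Pull assumed behavioral features from the presentation layer.
features = client.query("""
    SELECT customer_id, total_orders, total_revenue, days_since_last_order
    FROM `my-project.retail_presentation.customer_features`
""").to_dataframe()

# Standardize the features and assign each customer to one of four segments.
X = StandardScaler().fit_transform(
    features[['total_orders', 'total_revenue', 'days_since_last_order']])
features['segment'] = KMeans(n_clusters=4, random_state=42).fit_predict(X)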
Determining Real Campaign Impact
Measure the real impact of a campaign that targets a specific customer segment by comparing actual results against a prediction of what would have happened without the campaign.
We select a segment of customers similar to the segment that was exposed to the campaign, drawn from a period before the campaign and avoiding seasonal periods where possible. We use several features from the data to identify a similar segment:
Taking advantage of other Google Cloud Platform services such as AutoML Tables, we can generate additional data for analysis based on the original retail fields included in our model.
Customer Service Performance
Understand which customer service representatives perform best and worst, and how customer service drives retention revenue:
Churn Prevention for Marketing
Identify users who are most likely to churn and target them with suitable messaging.
Customer Lifetime Value Prediction
Determine characteristics of users most likely to generate high lifetime value after initial conversion and target marketing to other users who demonstrate similar characteristics.
Input
Target
Data modeling is foundational work that encourages cleaner, more deliberate data capture and facilitates better analysis. Using the capabilities of BigQuery and other components of the Google Cloud Platform, we can not only build robust physical data architectures; we can also enrich our data model with fields output from predictive modeling and target the most relevant users with the most relevant messages.