Google’s policies require that no data passed to Google be recognizable as personally identifiable. This post provides an easy-to-follow, structured approach to identifying Personally Identifiable Information (PII) that might exist in your or your client’s Google Analytics account, as well as methods for preventing further collection of such information. I will outline what constitutes PII and how to avoid passing this information to Google when implementing Analytics on a property.
The approaches outlined below aim to help alert you that PII is being captured. Ultimately, however, Google requires that:
“You will not and will not assist or permit any third party to, pass information to Google that Google could use or recognize as personally identifiable information.”
This means that if you find PII in your data collection, simply filtering out the data from your Google Analytics property is only half the battle. Ultimately no PII should make it into Google Analytics at all.
What constitutes PII according to Google?
Any name, email address, billing information, social security numbers, or other data which can be reasonably linked to such information by Google, or data that permanently identifies a particular device (such as a mobile phone’s unique device identifier), even in hashed form.
“The Google Analytics terms of service, which all Google Analytics customers must adhere to, prohibits sending personally identifiable information (PII) to Google Analytics … Your Google Analytics account could be terminated and your data destroyed if you use any of this information.”
Possible trouble areas
So you suspect that you might be collecting PII, but are not sure where to look or what to look for? Then this post is for you! Below are some of the major areas where users can run into trouble with PII within their Google Analytics data. Often, the inclusion of PII in any of these areas is unintentional, which is why performing a PII audit is so important.
Looking for PII during the setup and testing phase of your Google Analytics implementation is recommended as a best practice in order to avoid running into any PII collection issues further down the line.
Places where PII can be found
- Query string parameters located in URLs
- Data imports
- Event parameters (category, action, label)
- Custom dimensions
- Social event dimensions
- Campaign tags
Common PII types (as defined by Google)
- Email address
- First name / last name
- Billing Information
- Social security number
- Credit card number
- IP address
- Device ID
- Any other information that would identify a specific individual
Common Regular Expressions
So, now we know where and what to look for in our Google Analytics reporting interface. But before we dive into the various auditing methods, I wanted to take a moment to highlight one of the techniques we will use to assist us in our task. According to Jan Goyvaerts over at www.regular-expressions.info:
“A regular expression (regex or regexp for short) is a special text string for describing a search pattern. You can think of regular expressions as wildcards on steroids.”
Below you can view an assortment of regular expressions for matching some of the different types of PII. These expressions will allow you to search for some common PII types. There are probably many other variations of these regular expressions or even regular expression types that would fit in here and essentially do the same thing, but these are some of the more common ones:
*Caveat: not every type of PII can be searched for in this way due to the complexity of the text (e.g. a physical home address, or first/last name).
PII Type | RegEx |
---|---|
Email address | ([a-zA-Z0-9_\.-]+)@([\da-zA-Z\.-]+)\.([a-zA-Z\.]{2,6}) |
Social security number | ^\d{3}-?\d{2}-?\d{4}$ |
IP address | ^(?:[0-9]{1,3}\.){3}[0-9]{1,3}$ |
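To try these expressions outside the Google Analytics interface, they can be run against a few sample values. The snippet below is a minimal sketch in Python, not part of any Google tooling. The name `find_pii_types` and the sample URLs are made up for illustration. One assumption to note: the social security number and IP address patterns in the table are anchored with `^` and `$`, so they only match when the whole value is the number. The sketch swaps the anchors for word boundaries (`\b`) so the patterns can also match inside a longer string such as a URL.

```python
import re

# Patterns adapted from the table above. The SSN and IP patterns drop the
# ^ and $ anchors in favour of \b word boundaries (an assumption of this
# sketch) so they can match inside a longer string such as a page URL.
PII_PATTERNS = {
    "email": re.compile(r"([a-zA-Z0-9_\.-]+)@([\da-zA-Z\.-]+)\.([a-zA-Z\.]{2,6})"),
    "ssn": re.compile(r"\b\d{3}-?\d{2}-?\d{4}\b"),
    "ip": re.compile(r"\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b"),
}


def find_pii_types(value):
    """Return the sorted names of the patterns that match anywhere in value."""
    return sorted(
        name for name, pattern in PII_PATTERNS.items() if pattern.search(value)
    )


# Hypothetical page URLs of the kind that might show up in a report.
for sample in [
    "/thanks?email=jane.doe@example.com",
    "/account?ssn=123-45-6789",
    "/products/shoes",
]:
    print(sample, find_pii_types(sample))
```

Loosening the anchors trades precision for recall: the unanchored social security number pattern will also flag any other nine-digit number, so treat matches as candidates to review rather than confirmed PII.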
Auditing Methods
This is an overview of the two main methods you will be using to identify potential PII within the common trouble areas, and their limitations. Here you can use the regular expressions listed above, as well as your own personal sleuthing skills to look for PII. Since regular expressions won’t help you when it comes to things like physical address or first/last name combinations, you will need to manually scan the different reports for those types of PII.
Inline Filter
The inline filter method will be your first, and likely best, approach for identifying PII in your data. It allows you to quickly scan your standard reports for the presence of the most common types of PII. As previously mentioned, some of the most common places where PII lives include query string and event parameters. The most common reports where this auditing technique can be used are:
- Reporting > Behavior > Site Content
- Reporting > Behavior > Events
- Reporting > Behavior > Site Search
The process is simple, and consists of four easy steps:
- Click on the “Advanced” button next to the inline filter input box at the top of your chosen report
- From the filter type drop-down, select the “Matching RegExp” option
- In the input field, copy and paste your desired regular expression from the table above (or use a custom one designed by you)
- Click on “Apply”
Your chosen report will now be filtered to show only the rows that match the regular expression you chose. If you don’t see any records this is GREAT NEWS! It means that this report does not contain the type of PII you are searching for, at least in the form the expression matches. If you do see results, then this means that your data contains PII and you will need to take some action to address the issue (more on this later).
Figure 1.0
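If you prefer to work from an exported copy of a report, the same kind of filtering can be reproduced offline. The following is a rough sketch in Python, assuming the report has been exported as CSV. The column names (`Page`, `Pageviews`), the sample rows, and the helper `rows_matching` are all hypothetical and only stand in for whatever your own export contains.

```python
import csv
import io
import re

# Email pattern taken from the table of common regular expressions.
EMAIL_RE = re.compile(r"([a-zA-Z0-9_\.-]+)@([\da-zA-Z\.-]+)\.([a-zA-Z\.]{2,6})")


def rows_matching(csv_text, column, pattern):
    """Return the rows of csv_text whose `column` value matches `pattern`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if pattern.search(row.get(column) or "")]


# Hypothetical export: column names and rows are made up for illustration.
SAMPLE_EXPORT = """Page,Pageviews
/home,1200
/signup/thanks?email=jane.doe@example.com,3
/search?q=red+shoes,87
"""

flagged = rows_matching(SAMPLE_EXPORT, "Page", EMAIL_RE)
for row in flagged:
    print(row["Page"], row["Pageviews"])
```

As with the inline filter, an empty result only tells you that this pattern found nothing in this column. Names and physical addresses still need a manual scan.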
Advanced Segment
The advanced segment method is similar to the inline filter method with the major difference being that the segment applies to all reports automatically once it is created. We will be using the Regular Expressions listed above to create a segment which will identify any sessions which contained different types of PII.
The example segment setup below (Figure 2.0) looks for sessions which contained pageviews with PII in the URL; however, this approach could also be applied to event parameters (event category, event action, event label), as well as custom dimensions, site search terms, or social events.
Using this approach also displays the number of users and the number of sessions (Figure 2.1) as a percentage of the total.
As with the inline filter approach, the most common reports where your newly created segment will identify PII are:
- Reporting > Behavior > Site Content
- Reporting > Behavior > Events
- Reporting > Behavior > Site Search
Figure 2.0
Figure 2.1
Conclusion
If you’ve gone through and checked for PII and haven’t found anything, then congratulations, you can stop reading here!
If you have found some form of PII, don’t panic. You will just need to take the following steps:
- Work with your implementation team and stop the collection of PII (simply filtering out PII in the Google Analytics interface will not be sufficient, as Google requires that you stop sending any PII to their servers, even if it is being filtered out)
- Once PII collection has ceased, back up your data (Analytics 360 customers can export unsampled reports to an Excel spreadsheet or Google Sheets. They can also migrate their data into Google BigQuery, a service which does not have PII limitations)
- Create copies of the views in which you found PII (copy over all configuration settings: filters, goals, view settings, etc.), and start collecting fresh, PII-free data.
- Work with Google Support and inform them that your web property has been collecting PII.
  - It is better to be proactive here, as Google Support is much more likely to remove only the offending data if they are informed ahead of time.
  - Should the Google Support team discover PII in your account on their own, they are much more likely to purge the entire account of all data.
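For the first step above, stopping collection usually means cleaning values before they are ever sent. The sketch below illustrates one possible approach for page URLs: replace the value of any query string parameter that looks like PII with a placeholder. It is written in Python purely to show the logic. In practice the equivalent clean-up would live in your site’s tracking code or tag management setup. The function name `redact_url`, the `SENSITIVE_KEYS` list, and the `REDACTED` placeholder are all assumptions made for this example, not anything Google prescribes.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Email pattern taken from the table of common regular expressions.
EMAIL_RE = re.compile(r"([a-zA-Z0-9_\.-]+)@([\da-zA-Z\.-]+)\.([a-zA-Z\.]{2,6})")

# Hypothetical parameter names that should never be sent; adjust to your site.
SENSITIVE_KEYS = {"email", "name", "ssn"}


def redact_url(url, placeholder="REDACTED"):
    """Return url with PII-looking query string values replaced by placeholder."""
    parts = urlsplit(url)
    cleaned = []
    # parse_qsl decodes percent-encoding, so "%40" is seen as "@" here.
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if key.lower() in SENSITIVE_KEYS or EMAIL_RE.search(value):
            value = placeholder
        cleaned.append((key, value))
    return urlunsplit(parts._replace(query=urlencode(cleaned)))


print(redact_url("https://example.com/thanks?email=jane.doe%40example.com&plan=pro"))
```

Checking both the parameter name and the value is a deliberate choice: a name list catches PII that no regular expression can recognize (such as first and last names), while the pattern catches email addresses passed under unexpected parameter names.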