Aggregate data
What Is Aggregate Data?
Aggregate data is data related to a collective or category of consumers in which data has been grouped and summed or averaged across multiple consumers.
For example, in a survey comparing people’s preferences for different political candidates, aggregate data of the results would present the overall popularity of each candidate without revealing individual voting details. Additional filters may be applied to obtain more specific information, such as regional preferences or voting patterns based on gender.
Notably, aggregate data must be broad enough that it cannot inadvertently lead to the identification of a particular person.
Third-party definition
Aggregate data refers to data collected from a group of individuals that do not contain any personally identifiable information. An example might be the number of website visitors in a day — this metric cannot be used to identify a single individual. – Osano
Difference Between Aggregate, De-Identified, and Anonymized Data
Look into aggregate data, and you’ll come across related terms like “de-identified data” and “anonymized data.” Here’s what they mean.
Aggregate data
Aggregate data combines and summarizes information from multiple individual data points to provide a broader overview or statistical summary.
For example, instead of looking at the ages of individual people, with aggregate data, you might see the average age of a group.
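As a minimal sketch of that idea, the snippet below collapses individual ages into one average per group. The records, the field names, and the helper `average_age_by_group` are all invented for illustration; they are not taken from any real dataset or library.

```python
from statistics import mean

# Hypothetical individual-level records (invented for illustration).
records = [
    {"name": "Person A", "region": "North", "age": 34},
    {"name": "Person B", "region": "North", "age": 26},
    {"name": "Person C", "region": "South", "age": 45},
    {"name": "Person D", "region": "South", "age": 55},
]


def average_age_by_group(rows, key):
    """Return the mean age per group, discarding every individual-level field."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row["age"])
    return {group: mean(ages) for group, ages in groups.items()}


summary = average_age_by_group(records, "region")
print(summary)
```

The output contains only group-level averages; names and individual ages do not appear in it.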
De-identified data
De-identified data is information that has been stripped of personally identifiable details but might still retain some characteristics that could be used to re-identify individuals in certain situations.
For example, removing names and addresses from a dataset but leaving other information like age, gender, and ZIP code would create de-identified data.
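A rough sketch of that kind of de-identification is shown below. The field names and the `DIRECT_IDENTIFIERS` set are assumptions made for this example, not a legal or regulatory definition of what counts as an identifier.

```python
# Assumed set of direct identifiers for this example only.
DIRECT_IDENTIFIERS = {"name", "address"}

# Hypothetical record (invented for illustration).
record = {
    "name": "Person A",
    "address": "1 Example Street",
    "age": 34,
    "gender": "F",
    "zip": "02139",
}


def de_identify(row, identifiers=DIRECT_IDENTIFIERS):
    """Return a copy of the record without the direct identifiers."""
    return {k: v for k, v in row.items() if k not in identifiers}


print(de_identify(record))
```

The remaining fields (age, gender, ZIP code) are exactly the kind of characteristics that could still help re-identify someone when combined.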
Anonymized data
Anonymized data goes a step further than de-identified data. It is modified to ensure that there is no way to link the information back to individual people, even if other external data sources are considered.
For example, changing or scrambling the values in a dataset so that, even with additional information, it’s impossible to identify specific individuals creates anonymized data.
How Aggregate Data Is Used
Aggregate data is used in various fields and industries to analyze trends, make informed decisions, and derive general insights without focusing on individual details.
For instance, companies aggregate personal information about customers for internal use and analysis, or they sell it for profit. When aggregating data, companies typically say they’ve anonymized it so it’s no longer “personal information.”
Aggregate Data and State Laws
Several US state data privacy laws exclude aggregate data from their scope.
Under the California Consumer Privacy Act and the California Privacy Rights Act, businesses can collect, retain, use, share, sell, or disclose consumers’ personal information as long as it’s de-identified or aggregate data. If it is possible to link data to a device, it is not considered aggregate data in California.
Utah (under the Utah Consumer Privacy Act) and Iowa (under the Iowa Act Relating to Consumer Data Protection) specifically exclude aggregated or de-identified data from their definition of personal data.
Other state data privacy laws, such as the Virginia Consumer Data Protection Act (VCDPA), do not expressly exempt aggregate data from their scope, but they do so implicitly. Aggregated information is unlikely to be considered “linked or reasonably linkable to an identified or identifiable natural person” under the VCDPA, so it is unlikely to qualify as “personal data” in Virginia.
Is Aggregate Data Actually Anonymous?
Not necessarily. While many see data aggregation as a safeguard for personal information since it presents information in collective categories, the reality is somewhat more nuanced.
When analyzing aggregate data, it is possible to uncover personal details about individuals.
For instance, imagine using aggregated data from a health and fitness app to understand users’ most popular exercise routines. Now, envision querying for the preferred workout routine, during a particular week, of users in a specific age group who achieved notable fitness milestones.
By applying specific filters and queries, unintentional identification of individual customers becomes possible. The more granular and specific the requests made on aggregate data, the higher the risk of reconstructing detailed individual-level information.
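The effect of narrowing filters can be sketched in a few lines. Everything here is hypothetical: the user records, the `most_popular_routine` helper, and the minimum group size of 5 are assumptions chosen for illustration. Refusing to answer when too few people match is one common mitigation, not something the fitness-app scenario above prescribes.

```python
from collections import Counter

MIN_GROUP_SIZE = 5  # assumed threshold; real systems choose their own

# Hypothetical app users (invented for illustration).
users = (
    [{"age_group": "30-39", "routine": "running"} for _ in range(5)]
    + [{"age_group": "30-39", "routine": "yoga"}]
    + [{"age_group": "60-69", "routine": "swimming"}]
)


def most_popular_routine(rows, min_group_size=MIN_GROUP_SIZE, **filters):
    """Return the most common routine among matching users.

    Returns None when fewer than min_group_size users match, because an
    "aggregate" over so few people effectively describes individuals.
    """
    matching = [r for r in rows if all(r[k] == v for k, v in filters.items())]
    if len(matching) < min_group_size:
        return None
    counts = Counter(r["routine"] for r in matching)
    return counts.most_common(1)[0][0]


print(most_popular_routine(users))                     # broad query
print(most_popular_routine(users, age_group="60-69"))  # suppressed
# With no threshold, the narrow query reveals one person's routine.
print(most_popular_routine(users, min_group_size=1, age_group="60-69"))
```

In this toy data only one user falls in the 60-69 group, so the unsuppressed "aggregate" answer is simply that person's routine.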
For a real-world example, consider Strava, an app for tracking physical exercise. Researchers at NC State University found a loophole in its heatmap feature, which uses aggregate user data.
The researchers said, “In a densely populated area, with lots of routes and lots of users, there is so much data that it would be extremely difficult to track any specific person. However, in areas where there are few users and/or few routes, it becomes a simple process of elimination – particularly if the person someone is looking for is a highly active Strava user.”
To preserve consumer privacy, it should not be possible to issue an unlimited number of queries on aggregate data. Instead, as the International Association of Privacy Professionals notes, there needs to be a “privacy budget” that sets a limit on the number of queries that can be made.
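One way to picture a privacy budget is the sketch below. It borrows the epsilon-style budget used in differential privacy, which is an assumption on our part: the text above only describes a limit on queries. The `PrivacyBudget` class and `noisy_count` function are hypothetical names, and the Laplace noise is sampled as the difference of two exponential draws.

```python
import random


class PrivacyBudget:
    """Tracks how much of a total epsilon allowance is left (hypothetical)."""

    def __init__(self, total_epsilon):
        self.remaining = total_epsilon

    def spend(self, epsilon):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon


def noisy_count(true_count, epsilon, budget, rng=random):
    """Answer a counting query with Laplace noise, charging the budget first."""
    budget.spend(epsilon)
    scale = 1.0 / epsilon  # a count changes by at most 1 per person
    # The difference of two exponential samples follows a Laplace distribution.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


budget = PrivacyBudget(total_epsilon=1.0)
print(noisy_count(120, 0.5, budget))
print(noisy_count(120, 0.5, budget))
try:
    noisy_count(120, 0.5, budget)
except RuntimeError as err:
    print("refused:", err)
```

Each query costs part of the allowance, so once the budget is spent no further answers are released, however the questions are phrased.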
Previous research has also shown that individuals can be re-identified in aggregate, de-identified, and anonymized data.
One study demonstrated that, with just 15 characteristics (like gender, age, and marital status), it is possible to re-identify Americans in anonymized data sets 99.98% of the time. Even before that, in 2000, Dr. Latanya Sweeney showed that 87% of the US population could likely be uniquely identified using only their date of birth, assigned gender, and ZIP code.
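The intuition behind such findings can be illustrated by counting how many records are unique on a given combination of attributes. The four records and the `fraction_unique` helper below are invented for illustration and do not reproduce the methodology of either study.

```python
from collections import Counter

# Hypothetical de-identified records (invented for illustration).
rows = [
    {"dob": "1980-01-01", "gender": "F", "zip": "02139"},
    {"dob": "1980-01-01", "gender": "F", "zip": "02139"},
    {"dob": "1975-05-20", "gender": "M", "zip": "02139"},
    {"dob": "1990-09-09", "gender": "F", "zip": "94105"},
]


def fraction_unique(records, quasi_identifiers):
    """Share of records whose combination of the given fields is unique."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(keys)


print(fraction_unique(rows, ["gender"]))
print(fraction_unique(rows, ["dob", "gender", "zip"]))
```

A record that is unique on these fields can be matched against any outside source that lists the same fields alongside a name, which is why adding more attributes raises the re-identification risk.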