Organizer: David Choffnes (Northeastern University)
Date: Wednesday, October 9, 2024
On October 9, 2024, David Choffnes (Northeastern University) organized a roundtable with leading researchers on commercial surveillance. The one-hour event started with brief introductions, followed by an engaging discussion of topics including web tracking, algorithmic targeting, AI (particularly generative AI), fingerprinting, YouTube recommendations, custom/lookalike audiences, data brokers, and pricing. We also discussed harms to vulnerable populations such as children, teens, individuals experiencing major life events (such as birth, marriage, divorce, or death), those struggling with addiction, and protected groups, as well as various disparate impacts.
The researchers included:
- David Choffnes, Associate Professor, Computer Science, and Executive Director of the Cybersecurity and Privacy Institute at Northeastern University
- Umar Iqbal, Assistant Professor, Computer Science and Engineering, Washington University in St. Louis
- Sebastian Zimmeck, Assistant Professor of Computer Science, Wesleyan University
- Piotr Sapiezynski, Research Scientist, Northeastern University
- Athina Markopoulou, Professor in EECS, Associate Dean for Graduate & Professional Studies in the School of Engineering, University of California, Irvine
- Christo Wilson, Professor, Computer Science, Northeastern University
- Alan Mislove, Professor, Computer Science, Northeastern University
- Rishab Nithyanand, Assistant Professor, Computer Science, University of Iowa
In more detail, the following topics were discussed:
Online tracking
We discussed recent trends in online tracking, i.e., how companies track and link our online activities together. Traditionally, data has been collected via HTTP cookies, tracking pixels, browser fingerprinting, etc., and that is still very much the case for many websites and mobile apps. Some platforms that are not party to data transactions, such as OSes and browsers, can help mitigate the root cause of the harm. There have been recent changes in browser and mobile-app platforms that have generated much discussion on privacy, as well as some changes in the commercial surveillance industry. These include generally positive developments such as Safari’s third-party cookie policies, and others where the outcome is less clear (e.g., Google’s Privacy Sandbox in Chrome and Android).
For example, there has been an effort to deprecate third-party (3P) cookies in Chrome and other browsers, though the transition away from 3P cookies continues to be delayed, perhaps indefinitely. In parallel, many online tracking vendors have moved their cookies into the first-party context to avoid any disruption if 3P cookies are banned (e.g., https://arxiv.org/pdf/2208.12370). Ironically, this negates many of the privacy benefits of banning 3P cookies, and potentially puts consumers at even greater privacy risk due to the information available to tracking scripts placed in the first-party context. In addition, the Privacy Sandbox has been proposed by Google as an alternative model for supporting commercial surveillance without the same privacy risks as 3P cookies. However, researchers have not been convinced that this approach is sufficiently privacy-preserving.
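To make the first-party workaround concrete, here is a minimal sketch (TypeScript for the browser, with a hypothetical tracker.example.com endpoint and cookie name) of how a third-party script embedded by a publisher can persist an identifier as a first-party cookie and report it back to the tracker; actual vendor scripts vary and often also use CNAME-cloaked subdomains.

```typescript
// Minimal sketch (hypothetical tracker.example.com script): a third-party
// script, once embedded by a publisher, persists an identifier as a
// *first-party* cookie, outside the scope of third-party cookie blocking.

function getOrCreateFirstPartyId(cookieName = "_trk_id"): string {
  // Look for an existing identifier in the publisher's own cookie jar.
  const match = document.cookie.match(new RegExp(`(?:^|; )${cookieName}=([^;]*)`));
  if (match) return decodeURIComponent(match[1]);

  // Otherwise mint a new identifier and store it under the first-party domain.
  const id = crypto.randomUUID();
  document.cookie =
    `${cookieName}=${encodeURIComponent(id)}; path=/; max-age=${60 * 60 * 24 * 365}; SameSite=Lax`;
  return id;
}

// The identifier is then attached to requests back to the tracker,
// re-linking the user's activity across visits to this site.
const userId = getOrCreateFirstPartyId();
new Image().src =
  `https://tracker.example.com/pixel?uid=${encodeURIComponent(userId)}&page=${encodeURIComponent(location.href)}`;
```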
Some platforms, e.g., mobile OSes, have placed restrictions on the kinds of identifiers that can be used by advertisers to target individuals. Instead of using their own randomly generated unique identifiers, advertisers are increasingly turning to identity graphs, which link information such as phone numbers, email addresses, or other PII (and hashes of those PII) to the same individual.
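As an illustration of why hashed PII works as a linking key, here is a minimal sketch (Node.js TypeScript, with hypothetical data and person IDs); real identity-graph vendors operate at far larger scale and with many more identifier types.

```typescript
import { createHash } from "node:crypto";

// Minimal sketch of hashed-PII matching in an identity graph: the same
// normalized email or phone number hashes to the same value wherever it is
// collected, so records from different sources can be linked to one person
// without exchanging the raw PII.

const sha256 = (v: string) => createHash("sha256").update(v).digest("hex");

// Normalization matters: "Jane.Doe@Example.COM " and "jane.doe@example.com"
// must hash identically for the match to work.
const normalizeEmail = (e: string) => e.trim().toLowerCase();
const normalizePhone = (p: string) => p.replace(/\D/g, "");

// Identity graph: hashed identifier -> person id.
const hashToPerson = new Map<string, string>();
let nextPerson = 0;

function resolve(hashedIds: string[]): string {
  // If any identifier is already known, reuse that person; otherwise mint one.
  const known = hashedIds.map((h) => hashToPerson.get(h)).find((p) => p !== undefined);
  const personId = known ?? `person-${nextPerson++}`;
  for (const h of hashedIds) hashToPerson.set(h, personId);
  return personId;
}

// A retailer record (email only) and an app record (same email plus a phone
// number) resolve to the same person.
const a = resolve([sha256(normalizeEmail("Jane.Doe@Example.COM "))]);
const b = resolve([
  sha256(normalizeEmail("jane.doe@example.com")),
  sha256(normalizePhone("+1 (617) 555-0123")),
]);
console.log(a === b); // true
```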
As platforms give consumers more control over their privacy and protections against tracking, new tracking technologies have emerged that are harder to evade. For instance, trackers can use device “fingerprints” (e.g., characteristics of a browser, mobile device, or household device) to link online activities to individuals or families. Such fingerprinting is particularly problematic because it is difficult to avoid, and there is no way to “reset” a fingerprint the way one can delete cookies.
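As a rough sketch of the mechanism (browser-side TypeScript, using only a handful of illustrative attributes; real fingerprinters combine many more signals), a script can derive a stable identifier without storing anything on the device:

```typescript
// Minimal sketch of a device "fingerprint" derived from characteristics the
// browser already exposes -- no cookie or stored identifier is needed, so
// clearing cookies does not reset it.

async function computeFingerprint(): Promise<string> {
  const attributes = [
    navigator.userAgent,                                      // browser + OS version
    navigator.language,                                       // locale
    `${screen.width}x${screen.height}x${screen.colorDepth}`,  // display
    Intl.DateTimeFormat().resolvedOptions().timeZone,         // time zone
    String(navigator.hardwareConcurrency),                    // CPU core count
  ].join("|");

  // Hash the concatenated attributes into a compact, stable identifier.
  const bytes = new TextEncoder().encode(attributes);
  const digest = await crypto.subtle.digest("SHA-256", bytes);
  return Array.from(new Uint8Array(digest))
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
}

// Real fingerprinters also use canvas rendering, installed fonts, audio-stack
// quirks, etc., which make the identifier far more distinctive.
computeFingerprint().then((fp) => console.log(fp));
```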
There is a clear need to give consumers a better understanding of online tracking behavior as they use the Internet. To that end, a privacy-focused web browser extension called Privacy Pioneer (paper; extension code) identifies how data is being shared so that people can see it for themselves.
Data use
The discussion covered in substantial detail how companies use data about consumers for various purposes, including targeted advertising, content suggestions, and generative AI.
How consumers are targeted
Participants identified the various ways that consumers are targeted based on their data. Generally, attributes about individuals are gathered or inferred, and those individuals are then grouped into “audiences” (also called “segments”) that contain consumers with similar attributes/interests. Example types of audiences include people of certain ages, likely voters, gamblers, and the like. They also include “life events” (e.g., birth, death, marriage, divorce), and numerous other categories. One participant described finding hundreds of pages of attributes and audiences attributed to them personally in disclosures from Oracle’s data broker business.
In addition to audiences provided by platforms like Facebook, Google, and others, “custom audiences” (supported by nearly every platform) allow advertisers to upload consumer PII (usually hashed), which the platform then matches to individuals it knows. Those matches become the custom audience and allow arbitrary targeting by advertisers. Further, Facebook supports “lookalike audiences”, where an advertiser can extend their reach by targeting ads toward individuals who are not in their custom audience (i.e., individuals for whom they have no PII) but who have attributes similar to those in the custom audience. Combined, these features allow advertisers to target protected groups (potentially in violation of regulations). Such approaches have been found to be used for ads for alcohol, casinos, and other vices.
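To illustrate the lookalike idea (not Facebook’s actual algorithm, which is proprietary), here is a minimal sketch in TypeScript with hypothetical attribute vectors: the platform summarizes the matched custom audience and ranks its other users by similarity, so the advertiser reaches people for whom it holds no PII at all.

```typescript
// Minimal sketch of lookalike-audience expansion: score every other user on
// the platform by similarity to the seed (custom) audience and target the
// closest matches.

type UserVector = { id: string; features: number[] }; // e.g., interest/behavior scores

const cosine = (a: number[], b: number[]) => {
  const dot = a.reduce((s, v, i) => s + v * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
};

function lookalikeAudience(seed: UserVector[], candidates: UserVector[], size: number): UserVector[] {
  // Represent the seed audience by its centroid (average feature vector).
  const dims = seed[0].features.length;
  const centroid = Array.from({ length: dims }, (_, i) =>
    seed.reduce((s, u) => s + u.features[i], 0) / seed.length,
  );
  // Rank everyone else by similarity to the centroid and keep the top matches.
  return candidates
    .map((u) => ({ user: u, score: cosine(u.features, centroid) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, size)
    .map((x) => x.user);
}
```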
These audiences are fed by attributes about individuals, some of which are inferred. Attributes such as age, gender, and the like can be inferred based on online behavior. Interestingly, there is a segment called “Unknown” that is highly correlated with teenagers, potentially opening up a way to target vulnerable youth populations with ads simply based on their attributes being unknown.
Advertisers generally aren’t able to target individuals, but rather audiences/segments. However, this targeting is further complicated by the algorithms that platforms use to deliver ads. For example, Facebook runs auctions to determine who in the advertiser’s target audience is shown the ad. This used to be based entirely on budgets. Today, however, Facebook predicts who is more likely to engage, so whether an ad is displayed depends on predicted engagement and not just the auction. This can lead to dramatic skews, as demonstrated in an IMC paper. In short, the platform wants advertisers to do less explicit targeting because Facebook prefers to do the targeting instead. And while this can increase engagement (i.e., clicks on ads), it does not necessarily optimize revenue for the company doing the advertising (i.e., the segments that result are not necessarily better than those chosen without the algorithm). In related work, there is evidence of similar behavior on the YouTube platform.
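A purely illustrative sketch (TypeScript, hypothetical bids and engagement scores; not the platform’s actual auction) of how folding predicted engagement into the ranking changes who wins, and therefore which users see which ads:

```typescript
// Minimal sketch: ranking by bid alone picks one winner, while ranking by
// bid x predicted engagement can pick another, so delivery skews toward
// users the platform predicts will click.

type Bid = { advertiser: string; bidUsd: number; predictedEngagement: number }; // engagement in [0, 1]

function winner(bids: Bid[], useEngagement: boolean): Bid {
  const score = (b: Bid) => (useEngagement ? b.bidUsd * b.predictedEngagement : b.bidUsd);
  return bids.reduce((best, b) => (score(b) > score(best) ? b : best));
}

const auctionForOneUser: Bid[] = [
  { advertiser: "A", bidUsd: 2.0, predictedEngagement: 0.1 },
  { advertiser: "B", bidUsd: 1.0, predictedEngagement: 0.5 },
];

console.log(winner(auctionForOneUser, false).advertiser); // "A" -- highest bid wins
console.log(winner(auctionForOneUser, true).advertiser);  // "B" -- highest bid x engagement wins
```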
Use of data for generative AI
Another concern was raised about the use of personal data by OpenAI and other generative AI platforms. These companies are gathering massive volumes of publicly available and private data about individuals, using it for training without express consent, and essentially “asking for forgiveness later.” On OpenAI’s platform, it was observed that custom GPTs and their privacy risks aren’t reviewed by the platform, which can lead to substantial harms, as discussed in a recent publication: https://www.arxiv.org/pdf/2408.13247
Surveillance pricing
The topic of using data about individuals to inform individualized pricing was discussed. While disparate impacts can be measured, and information about individuals that might lead to such impacts can be accessed via data subject requests, there are key challenges that need to be addressed to identify the use of data for pricing. Historically, researchers have identified numerous ways that consumers can be squeezed for more revenue, including product and price steering (e.g., steering toward more expensive products) and charging different consumers different prices for the same product. While empirical methods can reveal trends in pricing, it remains an open challenge to use such measurements to identify individualized harms.
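As a sketch of the measurement idea (TypeScript, hypothetical observations; real studies must carefully control for confounds such as location, timing, and A/B testing noise), one can compare the prices that different controlled profiles observe for the same product:

```typescript
// Minimal sketch of detecting price steering/discrimination: collect the
// price each controlled profile sees for the same product and flag products
// where the profiles diverge.

type Observation = { product: string; profile: string; priceUsd: number };

function priceSpreads(observations: Observation[]): Map<string, number> {
  const byProduct = new Map<string, number[]>();
  for (const o of observations) {
    byProduct.set(o.product, [...(byProduct.get(o.product) ?? []), o.priceUsd]);
  }
  // Report the spread (max - min) per product; a nonzero spread under
  // otherwise identical conditions suggests the price is being personalized.
  const spread = new Map<string, number>();
  for (const [product, prices] of byProduct) {
    spread.set(product, Math.max(...prices) - Math.min(...prices));
  }
  return spread;
}

const observed: Observation[] = [
  { product: "hotel-123", profile: "fresh-browser", priceUsd: 149 },
  { product: "hotel-123", profile: "frequent-shopper", priceUsd: 171 },
];
console.log(priceSpreads(observed)); // Map { "hotel-123" => 22 }
```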
Harms and other concerns
The group discussed numerous harms and other concerns stemming from commercial surveillance and the use of the corresponding data. These include a lack of transparency in how data is collected and used, and the targeting of vulnerable groups such as children and teenagers (as reported in an IMC 2024 paper), the elderly, those struggling with addiction, and protected groups. These groups are often targeted in harmful ways (e.g., via scams or ads for gambling).
When vulnerable groups are targeted in inappropriate or illegal ways, it can be challenging to know why (e.g., is it mistargeting or incorrect inferences?). However, the responsibility for these harms still rests with the platforms that support the targeting and inferences. It is hard to disambiguate whether the algorithm or the advertiser is at fault when groups are targeted, though an area for future exploration is how to measure this, e.g., by becoming an advertiser.
Several other questions for exploration were raised, including: What are users’ experiences? Do people who fall for scams automatically get targeted with more scams?
Summary
The roundtable event was a great opportunity for researchers to convey their policy-relevant findings and to learn which topics are of interest for future research. It is clear that commercial surveillance continues to evolve with technology and platform changes, and there is a need for continuous evaluation of how data about consumers is collected and used, and of what harms result. We identified many areas of shared interest for future exploration and recommended holding focused roundtables in the future to encourage exchanges of ideas, interests, and opportunities to make an impact.
We would like to thank Stephanie Nguyen and Amritha Jayanti for their work at the Office of Technology.