Bringing transparency to personalized services through statistical inference
Topic proposed by
Thesis advisor:
Research unit
EURECOM research laboratory
Field: Information and communication sciences and technologies
Personalized services are online services that use information about their users to offer
each user a service better adapted to her. With the proliferation of personal data over
the Internet, personalized services have become omnipresent in our daily life, including for
instance all services offering recommendations. Although this data-based personalization
has increased the utility of services for users and for service providers, it has also raised
privacy concerns that have become increasingly serious in recent years. One example of a
personalized service for which this issue is particularly acute is targeted advertising.
Advertisement is the main source of revenue for many free web services such as Facebook
and Google. The ad ecosystem is complex and can be composed of many actors; here we
abstract away this complexity and we refer to the whole chain of organizations that are
responsible for sending an ad (e.g., companies that want to advertise, data brokers,
advertising platforms) as the ad engine. The dominant advertising model today is pay-per-click,
which has led to an increasing amount of targeted advertising to increase the
likelihood that a user clicks on an ad. Targeted advertising has increased advertisement
revenues significantly. However, targeted advertising has also raised more and more
concerns from users who often feel that it constitutes an invasion of their private sphere. In
particular, users often wonder “what data do advertisers have about me?” or “why am I being
shown this ad?”. In a nutshell, users’ concerns are mainly kindled by the lack of transparency
of current targeted advertising systems.
The main objective of this thesis is to increase the transparency of targeted
advertising by providing users with tools and methods to understand why they are
targeted with a particular ad, to infer what information the ad engines possibly have
about them, and ultimately to control it. Concretely, we propose to build a browser plugin
that collects the ads shown to a user and provides her with analytics about these ads and
tools to control them. The browser plugin can either give information for a particular ad such
as “you are being shown this ad because the ad engine likely thinks that you are a student”
or give analytics on a longer term such as “given the ads you have been shown in the last 3
months, the ad engine likely thinks that you earn less than $50k per year”.
One of the main challenges in building such a tool is to infer the information that the ad engine
knows about a user from the ads received. To explain our approach we abstract the system
into three components: the information the ad engine collects about a user either online from
tracking, or offline from data brokers (inputs), the ad engine that processes the inputs to put
users in certain marketing categories (the black box), and the ads sent to the user (outputs).
In this thesis, we propose to observe only the outputs and to infer the categories the user
was put in by the ad engine, regardless of whether this was due to a particular input or not. In
order to do that, we will simply collect the ads users receive, then group together all the
users that received the same ad, and look at the most common demographics and interests
of users in the group. We detail in Section B.1.b. the methods that we propose to develop to
do this statistical inference task. The main novelty of our technique is that it relies only on the
output, i.e., the ads observed by users and not on any input data the users may have
Athanasios Andreou (EURECOM), directeur: Patrick Loiseau (EURECOM)
explicitly given. This makes our approach more realistic to deploy, since it requires less
invasive monitoring of users. Then, we propose ways to control the information services have
about a user by adding noise rather than by trying to directly block leakage of information,
which is also a more realistic process.
2. History and related work
Previous work has made a number of contributions, either by discovering problems or by
proposing methods to bring more transparency to the ad ecosystem [1, 2, 3, 4]. We focus on
the studies that are the closest to our proposal and refer the reader to  for an overview.
Two studies [1, 4] proposed techniques to detect whether an ad is contextual, re-targeted or
behavioral. While this is an important first step for transparency, the studies did not take the
next step to detect why the ads are being targeted. Towards this direction, two studies
proposed techniques to see how the activities of a user influence the ads she receives [3, 2].
At a high level, these approaches monitor the input of users (e.g., the emails users receive
and send, the videos users see on YouTube, the sites users visit) and they propose methods
to estimate the likelihood that a given ad was shown due to a given input. Thus, these
studies look at the inputs and outputs of the ad engine and infer which inputs triggered which
outputs. In contrast, our goal is to look at the outputs of the ad engine and infer what the
ad engine knows about the user regardless of whether this was due to a particular input. This
has numerous advantages: it requires less invasive monitoring, it has a much lower
overhead and it captures ads that are not triggered by any particular input.
B. Scientific content
1. Approach, detailed content and expected results
a. System architecture: the three main components
Browser plugin: The browser plugin has two functionalities: collect ads and present ads
analytics to users. First, the browser plugin parses the web pages a user is browsing and
collects all the ads the user receives, and sends the ads to the storage server. We plan to
build this functionality based on an existing open-source, low-overhead ad blocker plugin
that has already been extensively tested. Second, the browser
plugin provides analytics to the user about the ads she receives, for a particular ad or over a
longer period. Finally, the plugin will include a webpage where users can optionally provide
personal information such as demographics, and include popup functionalities for active
labeling (see Method 1 below).
Data storage server: All the ads parsed by the browser plugin are sent and stored in an
SQL database. The server will be placed behind a firewall to secure the data from unwanted
access. The users will be tracked on our server by a user ID that will be randomly generated
for each plugin installation and we will not collect and store any identifying data about the
user (except the demographic and interest data the user willingly provides us).
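The storage scheme described above can be sketched as follows. This is a minimal illustration, not the project's actual schema: it uses SQLite as a stand-in for the SQL database, and the table name, column names and helper functions (ads, new_installation_id, open_store, record_ad) are assumptions introduced for the example. The only identifier stored is a random ID generated once per plugin installation.

```python
import sqlite3
import uuid


def new_installation_id() -> str:
    """Return a random identifier, generated once per plugin installation."""
    return uuid.uuid4().hex


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the (hypothetical) table of collected ads."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS ads (
               user_id     TEXT NOT NULL,
               ad_url      TEXT NOT NULL,
               observed_at TEXT NOT NULL
           )"""
    )
    return conn


def record_ad(conn: sqlite3.Connection, user_id: str, ad_url: str, observed_at: str) -> None:
    """Store one observed ad; no name, email or IP address is recorded."""
    conn.execute(
        "INSERT INTO ads (user_id, ad_url, observed_at) VALUES (?, ?, ?)",
        (user_id, ad_url, observed_at),
    )
    conn.commit()
```

In this sketch the random ID only links the ads of one installation together; it carries no information about the person behind it.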
Data analysis server: The server will run all data analysis scripts that infer why a given ad is
being shown to a particular user at a particular time. The scripts will infer, for each ad, what
are the likely marketing categories to which the ad is targeted by analyzing the ad itself and
all the users that received the ad. The output results will be stored in an SQL database.
The main methodological challenge of this thesis is to infer the reasons why a given ad was
shown to a given user. We propose three methods to solve this challenge. Method 1
corresponds to the approach already described, and we envision using it in the long run once the
tool has sufficiently many users. However, to bootstrap the tool and not let the project rely
entirely on getting a large deployment, we propose Methods 2 and 3 that can provide
analytics starting from day one. At a high level, all three methods rely on defining a
probabilistic identity (i.e., users are associated with distributions over demographics rather
than a specific one) for each user and inferring these probabilities from the ads received.
Method 1: Ask users. We can simply ask a subsample of users that installed the tool to
provide us with their demographics and interests and use this as training data to infer why
other users are being targeted with a particular ad. To collect the data we will include in the
browser plugin an opt-in option that will trigger an initial questionnaire that will ask users
about their general demographics and interests. In addition, to improve efficiency even with
few labeled samples, we plan to use active learning. Active learning is a set of techniques
that combine machine learning algorithms with real-time input from users to optimize the
accuracy of the overall process by selecting the examples to label that will be the most
informative for the learning algorithm. Concretely, for users and categories optimally selected
using active learning techniques, we will trigger quick questionnaires where we just ask users
to confirm whether they have a particular interest or demographics.
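One simple way to choose which users and categories to ask about is uncertainty sampling, a standard active learning strategy: query the pairs for which the current estimate is closest to 0.5. The sketch below is only an illustration under that assumption; the proposal does not commit to a specific selection rule, and the function name and data layout are hypothetical.

```python
def most_uncertain_queries(identity, budget):
    """Select the (user, category) pairs whose estimate is closest to 0.5.

    identity: dict mapping user_id -> {category: estimated probability in [0, 1]}
              (the probabilistic identity of each user).
    budget:   maximum number of quick questions to trigger.
    """
    candidates = [
        (abs(prob - 0.5), user, category)
        for user, categories in identity.items()
        for category, prob in categories.items()
    ]
    candidates.sort()  # smallest distance to 0.5 first, i.e. most uncertain first
    return [(user, category) for _, user, category in candidates[:budget]]
```

With estimates of 0.95 and 0.55 for one user and 0.4 for another, a budget of two would select the 0.55 and 0.4 entries and skip the one the model is already confident about.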
Once we have labeled examples, the process of inferring why a given ad is targeted consists
essentially in grouping users that received the particular ad and analyzing the demographics
and interests of the labeled examples in this group. The confidence we have in the prediction
will depend on the number of labeled examples we have in the group and how homogeneous
the examples are with respect to a particular category. Developing this method will require
advanced research in statistical methods to measure similarity between ads, optimally
group users to maximize the information inferred, and evaluate the estimation confidence.
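The grouping step can be sketched in its most basic form as follows. This is a deliberately naive baseline under stated assumptions, not the statistical method the thesis aims to develop: the function name and data layout are hypothetical, and the raw shares ignore how common each category is in the overall user population.

```python
from collections import Counter


def infer_targeting(ad_receivers, labels):
    """Estimate which categories an ad targets from the labeled users who received it.

    ad_receivers: iterable of user IDs that were shown the ad.
    labels:       dict mapping user_id -> set of self-reported categories
                  (only labeled users appear in this dict).
    Returns (shares, n_labeled), where shares maps each category to the
    fraction of labeled receivers that belong to it.
    """
    labeled = [user for user in ad_receivers if user in labels]
    counts = Counter(category for user in labeled for category in labels[user])
    shares = {category: n / len(labeled) for category, n in counts.items()}
    return shares, len(labeled)
```

The two returned values correspond to the two ingredients of confidence named above: the number of labeled examples in the group, and how concentrated the shares are on a particular category.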
Method 2: Analyze the ads. A different technique to infer why an ad is being targeted is to
simply analyze the ad. We can do so with multiple sources. We can use sites such as Alexa
or Web of Trust to infer the categories of the ad’s landing page. We can additionally use
natural language processing tools (such as Mashape, CoreNLP, AlchemyAPI, OpenCalais,
Semantria, TAGME) to analyze the content of the landing page and infer entities, context,
topics or sentiment related to the ad. (To avoid spending the advertiser’s budget, we will not
click on the ads; we will copy the URLs, remove any user identifier and paste them in another
browser.) Finally, when available, we can use information provided by Quantcast.
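The URL-cleaning step mentioned in parentheses above could look like the following sketch. The list of query parameters treated as identifiers is an assumption made for illustration only; real ad URLs may embed identifiers elsewhere (for example in the path), which this simple version does not handle.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Illustrative (assumed) list of query parameters that may carry a user or click identifier.
TRACKING_PARAMS = {"uid", "user_id", "click_id", "gclid", "fbclid"}


def strip_identifiers(url: str) -> str:
    """Return the URL without the listed identifier parameters and without its fragment."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```

The cleaned URL can then be opened in a separate browser and passed to the categorization and language processing services.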
Method 3: Infer from controlled experiments. Finally, the last method is to build controlled
experiments to do the mapping between ads and interests/demographics. We will create
different browsing profiles that reflect particular demographics and interests using the
techniques in [1, 2], and monitor what ads are shown to these browsing profiles. To compute
the probability that a given ad was targeted due to a certain demographic or interest we will
build on the technique proposed by  extended to our setting.
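As a rough illustration of how controlled profiles can be compared, the sketch below computes how often an ad is shown to profiles with a given attribute versus profiles without it. It is a plain frequency comparison, not the technique referred to above, and the function name and data layout are hypothetical.

```python
def exposure_rates(observations, ad, attribute):
    """Compare how often an ad appears for profiles with and without an attribute.

    observations: list of (profile_attributes, ads_seen) pairs, both sets,
                  one pair per controlled browsing profile.
    Returns (rate_with, rate_without): the fraction of profiles that saw the ad
    among those having the attribute and among those lacking it.
    """
    with_attr = [ads for attrs, ads in observations if attribute in attrs]
    without_attr = [ads for attrs, ads in observations if attribute not in attrs]
    if not with_attr or not without_attr:
        raise ValueError("need profiles both with and without the attribute")
    rate_with = sum(ad in ads for ads in with_attr) / len(with_attr)
    rate_without = sum(ad in ads for ads in without_attr) / len(without_attr)
    return rate_with, rate_without
```

A large gap between the two rates suggests that the ad is targeted at the attribute; similar rates suggest that it is not. With few profiles the rates are noisy, so a real analysis would need a significance test on top of this comparison.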
Strategy to evaluate the accuracy of our inferences. To evaluate the accuracy of our
results, we plan to collect data from the new “Why am I seeing this” functionality on
Facebook. This functionality will provide ground truth data to evaluate our tool and methods.
Yet, our tool goes well beyond this functionality for several reasons: (i) Facebook does not always give all
the reasons why an ad is targeted; (ii) companies can come with a list of contacts (emails,
cookies or phone numbers) and ask Facebook to send ads to the users in their list, in which
case Facebook simply says that “you were in the list” whereas our tool might be able to infer
why the user is in the list; and (iii) we analyze ads on all websites and not just Facebook.
Methods to control the information known about a user. Lastly, we will investigate
methods for users to act on the information that is known about them. Since controlling the
information gathered by services is almost impossible, we propose to instead add noisy
information to obfuscate the real information. We will investigate methods that add noise in
order to achieve a given wanted probabilistic identity for the user. Our tool to infer this
probabilistic identity will make it possible to verify the effectiveness of our method.
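To make the idea of reaching a wanted probabilistic identity concrete, here is a toy calculation under a strong simplifying assumption that is ours, not the proposal's: the identity inferred by the ad engine is a linear mixture of the user's real activity and the injected noise. Under that assumption, the sketch finds the smallest fraction alpha of noisy activity, and the noise distribution, such that (1 - alpha) * real + alpha * noise equals the target. All names are hypothetical.

```python
def noise_mix(real, target):
    """Return (alpha, noise) with (1 - alpha) * real + alpha * noise == target.

    real, target: dicts mapping category -> probability, with the same keys,
                  each summing to 1.
    alpha is the smallest noise fraction for which a valid noise distribution exists.
    """
    # The noise must stay non-negative, so (1 - alpha) <= target[c] / real[c] for all c.
    ratios = [target[c] / p for c, p in real.items() if p > 0]
    alpha = 1.0 - min(ratios + [1.0])
    if alpha == 0.0:
        return 0.0, dict(target)  # real already matches the target, no noise needed
    noise = {
        c: max(0.0, (target[c] - (1.0 - alpha) * real[c]) / alpha)  # clamp rounding error
        for c in real
    }
    return alpha, noise
```

For a real identity of 80% student and 20% retired and a target of 50/50, the sketch gives alpha = 0.375 with all noise placed on the retired category. Whether real ad engines respond in this linear way is precisely what the inference tool would be used to verify.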
c. Deployment strategy and risks
Incentives for users to install the tool. As evidenced by the success of other similar
projects such as Ghostery (with > 3.5 million adopters), many users are interested in
transparency. Still, to further reduce the risk of low adoption, we will take the following actions. To increase
the tool’s utility, we will package it with an ad blocker (Adblock Plus by itself has more
than 50 million adopters on Chrome alone). To incentivize users to provide their
demographics and interests we will investigate different incentive techniques based on
lotteries and gift certificates proposed in our prior work [6, 7].
Privacy risks. To use our tool, users will need to donate the ads they see when browsing
the Internet. Even if such data does not include any PII, some users might feel that ads could
reveal information that is personal and the data collection might therefore entail privacy
concerns. Users installing the plugin will be provided guarantees about the treatment of their
data. In particular: no information will be collected beyond their ads (unless they voluntarily
consent to providing demographics), all information will be stored and communicated
securely and the data will be used solely for the purpose of providing ads analytics. We
believe that these guarantees will be sufficient for users to confidently adopt our plugin.
2. Qualifications involved and collaborations
The main qualifications needed for this thesis are network measurement, statistical inference
and incentives design, which exactly correspond to the director’s expertise. The student
also has excellent qualifications in these areas and excellent potential for the topic. In
addition, the thesis will be performed in collaboration with Prof. Krishna Gummadi and
Dr. Oana Goga from the Max Planck Institute for Software Systems. They have
expertise in systems building and in online social systems that will be useful for the thesis
and this collaboration with a top EU institution will strengthen the student’s education.