EDITE doctoral thesis topics

Bringing transparency to personalized services through statistical inference

Topic proposed by
Thesis director:
Supervised by
Doctoral student: Athanasios ANDREOU
Research unit: UMR 7102 Laboratoire de recherche d'EURECOM

Field: Information and communication sciences and technologies


Personalized services are online services that use information about their users to offer each user a service better adapted to her. With the proliferation of personal data on the Internet, personalized services have become omnipresent in our daily lives; they include, for instance, all services that offer recommendations. Although this data-based personalization has increased the utility of services for both users and service providers, it has also raised privacy concerns that have become increasingly serious in recent years. One personalized service for which this issue is particularly acute is targeted advertising. Advertising is the main source of revenue for many free web services such as Facebook and Google. The ad ecosystem is complex and can involve many actors; here we abstract away this complexity and refer to the whole chain of organizations responsible for sending an ad (e.g., companies that want to advertise, data brokers, advertising platforms) as the ad engine. The prominent advertising model today is pay-per-click, which has led to a growing amount of targeted advertising intended to increase the likelihood that a user clicks on an ad. Targeted advertising has increased advertising revenues significantly. However, it has also raised growing concerns among users, who often feel that it invades their private sphere. In particular, users often wonder “what data do advertisers have about me?” or “why am I being shown this ad?”. In a nutshell, users’ concerns are mainly kindled by the lack of transparency of current targeted advertising systems. The main objective of this thesis is to increase the transparency of targeted advertising by providing users with tools and methods to understand why they are targeted with a particular ad, to infer what information the ad engines possibly have about them, and ultimately to control it.
Thesis description: Athanasios Andreou (EURECOM), director: Patrick Loiseau (EURECOM)

Concretely, we propose to build a browser plugin that collects the ads shown to a user and provides her with analytics about these ads and tools to control them. The browser plugin can either give information about a particular ad, such as “you are being shown this ad because the ad engine likely thinks that you are a student”, or give longer-term analytics, such as “given the ads you have been shown in the last 3 months, the ad engine likely thinks that you earn less than $50k per year”. One of the main challenges in building such a tool is to infer the information that the ad engine knows about a user from the ads she receives. To explain our approach, we abstract the system into three components: the information the ad engine collects about a user, either online from tracking or offline from data brokers (the inputs); the ad engine that processes the inputs to place users in certain marketing categories (the black box); and the ads sent to the user (the outputs). In this thesis, we propose to observe only the outputs and to infer the categories in which the ad engine placed the user, regardless of whether this was due to a particular input. To do so, we will collect the ads users receive, group together all the users that received the same ad, and look at the most common demographics and interests of the users in each group. We detail in Section B.1.b the methods that we propose to develop for this statistical inference task. The main novelty of our technique is that it relies only on the outputs, i.e., the ads observed by users, and not on any input data the users may have explicitly given. This makes our approach much more realistic.
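The grouping step described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the method the thesis will develop: the input format (pairs of user ID and ad ID), the attribute strings, and the function name `infer_ad_categories` are all hypothetical, and the raw share of labeled users is a deliberately naive statistic that ignores how common each attribute is in the overall population.

```python
from collections import Counter, defaultdict


def infer_ad_categories(impressions, user_labels):
    """Group users by ad and rank the attributes observed in each group.

    impressions: iterable of (user_id, ad_id) pairs (hypothetical format).
    user_labels: dict mapping user_id to a set of self-reported attributes;
                 users absent from this dict are treated as unlabeled.
    Returns a dict mapping ad_id to a list of (attribute, share) pairs,
    where share is the fraction of labeled users in the group having it.
    """
    # Step 1: collect the set of users that received each ad.
    users_per_ad = defaultdict(set)
    for user_id, ad_id in impressions:
        users_per_ad[ad_id].add(user_id)

    # Step 2: count attributes among the labeled users of each group.
    result = {}
    for ad_id, users in users_per_ad.items():
        labeled = [u for u in users if u in user_labels]
        counts = Counter(attr for u in labeled for attr in user_labels[u])
        # If no user in the group is labeled, counts is empty and so is the list.
        result[ad_id] = [(attr, n / len(labeled)) for attr, n in counts.most_common()]
    return result
```

For example, if three labeled users received an ad and two of them reported being students, the top entry for that ad would be the attribute "student" with a share of 2/3. A realistic estimator would also need to compare this share with the attribute's frequency among all users.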
Then, we propose ways to control the information that services have about a user by adding noise rather than by trying to directly block the leakage of information, which is also a much more realistic process.

2. History and related work

Previous works made a number of contributions, either by discovering problems [2] or by proposing methods to bring more transparency to the ad ecosystem [1, 2, 3, 4]. We focus on the studies that are closest to our proposal and refer the reader to [5] for an overview. Two studies [1, 4] proposed techniques to detect whether an ad is contextual, re-targeted or behavioral. While this is an important first step for transparency, these studies did not take the next step of detecting why the ads are being targeted. In this direction, two studies proposed techniques to see how the activities of a user influence the ads she receives [3, 2]. At a high level, these approaches monitor the inputs of users (e.g., the emails users receive and send, the videos users watch on YouTube, the sites users visit) and propose methods to estimate the likelihood that a given ad was shown due to a given input. Thus, these studies look at both the inputs and the outputs of the ad engine and infer which inputs triggered which outputs. In contrast, our goal is to look only at the outputs of the ad engine and infer what the ad engine knows about the user, regardless of whether this was due to a particular input. This has several advantages: it requires less invasive monitoring, it has a much lower overhead, and it captures ads that are not triggered by any particular input.

B. Scientific content

1. Approach, detailed content and expected results

a. System architecture: the three main components

Browser plugin: The browser plugin has two functionalities: collecting ads and presenting ad analytics to users. First, the browser plugin parses the web pages a user is browsing, collects all the ads the user receives, and sends the ads to the storage server.
We plan to build this functionality on an existing open-source, low-overhead ad blocker plugin (e.g., https://adblockplus.org/) that has already been extensively tested. Second, the browser plugin provides analytics to the user about the ads she receives, either for a particular ad or over a longer period. Finally, the plugin will include a web page where users can optionally provide personal information such as demographics, as well as popup functionality for active labeling (see Method 1 below).

Data storage server: All the ads parsed by the browser plugin are sent to the storage server and stored in an SQL database. The server will be placed behind a firewall to protect the data from unwanted access. Users will be identified on our server by a user ID that is randomly generated for each plugin installation, and we will not collect or store any identifying data about the user (except the demographic and interest data the user willingly provides).

Data analysis server: This server will run all the data analysis scripts that infer why a given ad is being shown to a particular user at a particular time. The scripts will infer, for each ad, the marketing categories it likely targets by analyzing the ad itself and all the users that received it. The results will be stored in an SQL database.
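As an illustration of the storage design described above, the following Python sketch stores ad impressions keyed by a random per-installation identifier. It is a sketch under assumptions: the proposal only states that an SQL database will be used, so SQLite stands in for it here, and the table names, column names and helper functions are hypothetical.

```python
import sqlite3
import uuid

# Hypothetical schema: no name, email or IP address is stored, only a
# random identifier generated once per plugin installation.
SCHEMA = """
CREATE TABLE installations (
    install_id TEXT PRIMARY KEY
);
CREATE TABLE ad_impressions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    install_id TEXT NOT NULL REFERENCES installations(install_id),
    ad_url TEXT NOT NULL,        -- ad URL with user identifiers removed
    page_domain TEXT NOT NULL,   -- site on which the ad was displayed
    seen_at TEXT NOT NULL        -- ISO 8601 timestamp
);
"""


def new_installation(conn):
    """Register a plugin installation under a fresh random ID and return it."""
    install_id = uuid.uuid4().hex  # random, not derived from any user data
    conn.execute("INSERT INTO installations (install_id) VALUES (?)", (install_id,))
    return install_id


def record_ad(conn, install_id, ad_url, page_domain, seen_at):
    """Store one ad impression reported by the plugin."""
    conn.execute(
        "INSERT INTO ad_impressions (install_id, ad_url, page_domain, seen_at) "
        "VALUES (?, ?, ?, ?)",
        (install_id, ad_url, page_domain, seen_at),
    )


def ads_for_installation(conn, install_id):
    """Return the ad URLs recorded for one installation, oldest first."""
    rows = conn.execute(
        "SELECT ad_url FROM ad_impressions WHERE install_id = ? ORDER BY seen_at",
        (install_id,),
    )
    return [row[0] for row in rows]
```

A typical session would open a connection with `sqlite3.connect(":memory:")`, create the tables with `conn.executescript(SCHEMA)`, and then call `new_installation` once and `record_ad` for each collected ad. Parameterized queries are used throughout so that ad URLs taken from arbitrary web pages cannot alter the SQL statements.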


The main methodological challenge of this thesis is to infer the reasons why a given ad was shown to a given user. We propose three methods to address this challenge. Method 1 corresponds to the approach we have already described, and we envision using it in the long run once the tool has sufficiently many users. However, to bootstrap the tool and not let the project rely entirely on a large deployment, we propose Methods 2 and 3, which can provide analytics from day one. At a high level, all three methods rely on defining a probabilistic identity for each user (i.e., users are associated with distributions over demographics rather than with a specific one) and inferring these probabilities from the ads received.

Method 1: Ask users. We can simply ask a subsample of the users who installed the tool to provide us with their demographics and interests, and use this as training data to infer why other users are being targeted with a particular ad. To collect the data, we will include in the browser plugin an opt-in option that triggers an initial questionnaire asking users about their general demographics and interests. In addition, to improve efficiency even with few labeled samples, we plan to use active learning. Active learning is a set of techniques that combine machine learning algorithms with real-time input from users; it improves the accuracy of the overall process by selecting for labeling the examples that will be the most informative for the learning algorithm. Concretely, for users and categories selected using active learning techniques, we will trigger quick questionnaires in which we simply ask users to confirm whether they have a particular interest or demographic attribute.
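The selection step of active learning mentioned above can be sketched with uncertainty sampling, one common active learning strategy. The proposal does not say which strategy will be used, so this choice is an assumption, as are the input format (current probability estimates per user and category) and the function names.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a yes/no question answered 'yes' with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def select_questions(estimates, budget):
    """Pick the (user, category) pairs whose current estimate is most uncertain.

    estimates: dict mapping (user_id, category) to the current estimated
               probability that the user belongs to the category.
    budget:    maximum number of quick questionnaires to trigger.
    """
    ranked = sorted(estimates, key=lambda pair: binary_entropy(estimates[pair]), reverse=True)
    return ranked[:budget]
```

With this rule, a pair whose estimate is close to 0.5 is asked about before a pair whose estimate is close to 0 or 1, since the answer to the former is expected to be more informative. More refined criteria, for instance ones that account for how many other users a label would help classify, would fit the same interface.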
Once we have labeled examples, inferring why a given ad is targeted consists essentially in grouping the users that received that ad and analyzing the demographics and interests of the labeled examples in this group. The confidence we have in the prediction will depend on the number of labeled examples in the group and on how homogeneous the examples are with respect to a particular category. Developing this method will require advanced research in statistical methods to measure similarity between ads, to optimally group users so as to maximize the information inferred, and to evaluate the estimation confidence.

Method 2: Analyze the ads. A different technique to infer why an ad is being targeted is to analyze the ad itself. We can do so using multiple sources. We can use sites such as Alexa or Web of Trust to infer the categories of the ad’s landing page. We can additionally use natural language processing tools (such as Mashape, CoreNLP, AlchemyAPI, OpenCalais, Semantria, TAGME) to analyze the content of the landing page and infer entities, context, topics or sentiment related to the ad. (To avoid spending the advertiser’s budget, we will not click on the ads; we will copy the URLs, remove any user identifier and paste them into another browser.) Finally, when available, we can use information provided by Quantcast.

Method 3: Infer from controlled experiments. The last method is to build controlled experiments to map ads to interests and demographics. We will create different browsing profiles that reflect particular demographics and interests using the techniques in [1, 2], and monitor which ads are shown to these browsing profiles. To compute the probability that a given ad was targeted due to a certain demographic or interest, we will build on the technique proposed in [3], extended to our setting.

Strategy to evaluate the accuracy of our inferences.
To evaluate the accuracy of our results, we plan to collect data from the new “Why am I seeing this” functionality on Facebook. This functionality will provide ground truth data to evaluate our tool and methods. Yet, our tool goes well beyond it for several reasons: (i) Facebook does not always give all the reasons why an ad is targeted; (ii) companies can come with a list of contacts (emails, cookies or phone numbers) and ask Facebook to send ads to the users in their list, in which case Facebook simply says that “you were in the list”, whereas our tool might be able to infer why the user is in the list; and (iii) we analyze ads on all websites and not just on Facebook.

Methods to control the information known about a user. Lastly, we will investigate methods for users to act on the information that is known about them. Since controlling the information gathered by services is almost impossible, we propose to instead add noisy information to obfuscate the real information. We will investigate methods that add noise in order to achieve a given desired probabilistic identity for the user. Our tool to infer this probabilistic identity will make it possible to verify the effectiveness of our method.

c. Deployment strategy and risks

Incentives for users to install the tool. As evidenced by the success of similar projects such as Ghostery (with > 3.5 million adopters), many users are interested in transparency. Still, to further reduce the risk of low adoption, we will take the following actions. To increase the tool’s utility, we will package it with an ad blocker (Adblock Plus alone has more than 50 million adopters on Chrome).
To incentivize users to provide their demographics and interests, we will investigate different incentive techniques based on lotteries and gift certificates proposed in our prior work [6, 7].

Privacy risks. To use our tool, users will need to donate the ads they see when browsing the Internet. Even if such data does not include any personally identifiable information (PII), some users might feel that ads could reveal personal information, and the data collection might therefore raise privacy concerns. Users installing the plugin will be given guarantees about the treatment of their data. In particular: no information will be collected beyond their ads (unless they voluntarily consent to providing demographics), all information will be stored and communicated securely, and the data will be used solely for the purpose of providing ad analytics. We believe that these guarantees will be sufficient for users to confidently adopt our plugin.

2. Qualifications involved and collaborations

The main qualifications needed for this thesis are network measurement, statistical inference and incentive design, which correspond exactly to the director’s expertise. The student also has excellent qualifications in these areas and excellent potential for the topic. In addition, the thesis will be carried out in collaboration with Prof. Krishna Gummadi and Dr. Oana Goga from the Max Planck Institute for Software Systems. They have expertise in systems building and in online social systems that will be useful for the thesis, and this collaboration with a top EU institution will strengthen the student’s education.