EDITE Doctoral Thesis Topics

A Generic Framework for Functional Fraud Detection

Topic proposed by
Thesis supervisor:
Doctoral student: Remi DOMINGUES
Research unit: UMR 7102, Laboratoire de recherche d'EURECOM (EURECOM research laboratory)

Field: Information and communication sciences and technologies

Project

The airline industry is exposed to numerous external factors, such as the global economy, exchange rate fluctuations and fuel costs. Beyond these factors, fraud is a persistent threat that causes significant financial losses. In 2008, for example, airlines lost around USD 1.4 billion to fraud [1], representing 1.3% of total worldwide airline revenues. The average attack rate is about 1% to 1.5% of revenue. In some regions, including the Middle East and Latin America, this rate reaches 3% to 4% of revenue. Among the different kinds of fraud, the following are the most common in the airline industry:

• Payment Frauds: these are the classical frauds, which aim to subvert payment systems and can affect both end-users and service providers. For this family of frauds, known techniques from the vast literature on financial fraud detection can be applied, so they are not the main focus of this Thesis proposal.
• Booking Frauds: this family of frauds aims to misuse booking systems by altering passenger name records or related information. As the name implies, these frauds target a very specific kind of information, which could limit the applicability of detection techniques in a general context. As a consequence, this kind of fraud is not the main focus of this Thesis proposal either.
• Functional Frauds: this family of frauds is the most complex and general, as it derives from improper use of the service APIs exposed by potentially all components of a system. For example, frauds can target authentication and security services or corporate booking services, and can even originate from bot traffic that scrapes financially sensitive data in order to sell it on the black market.

Most previous work on fraud detection relies on rule-based systems. Typically, a static set of predefined rules is specified by an application expert, based on domain knowledge or on information about how fraud has been committed in the past. The main drawback of the rule-based approach is that static rules require manual updates: whenever a business identifies a new feature that would help catch future fraud attempts, someone has to take extra steps to create new rules.
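To make this drawback concrete, the following is a minimal sketch of a static rule-based detector. The rule names, event fields and thresholds (`failed_logins`, `requests_per_minute`, 5, 600) are purely illustrative assumptions and do not come from any real system described in this proposal.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Rule:
    """A hand-written rule: a name and a predicate over one event."""
    name: str
    predicate: Callable[[Dict], bool]


# Hypothetical rules written by a domain expert (illustrative values only).
RULES: List[Rule] = [
    Rule("too_many_failed_logins", lambda e: e.get("failed_logins", 0) > 5),
    Rule("high_request_rate", lambda e: e.get("requests_per_minute", 0) > 600),
]


def triggered_rules(event: Dict, rules: List[Rule] = RULES) -> List[str]:
    """Return the names of all rules that fire for the given event."""
    return [rule.name for rule in rules if rule.predicate(event)]
```

An event exhibiting a fraud pattern that no rule describes returns an empty list, so the detector stays silent until someone adds a new `Rule` by hand.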

The objective of this Thesis is to design, analyse and implement a functional fraud-detection framework capable of raising alerts on suspicious connections and activity, and of taking appropriate corrective actions in time. The main challenges identified in building the framework are the following:

1) General nature of the framework. The framework should be general, allowing users to specify a handful of working parameters and letting the system adapt to the data and to the objective functions describing anomalies.

2) Scalability problems. In this work, the fraud detection framework learns about user behaviour from historical data and highlights anomalies as outliers with respect to a model of normal behaviour. Historical data takes the form of functional log files, which are generated in massive volumes by all connected services of an infrastructure. As a consequence, scalability problems related to the sheer scale of log data are immediately apparent: for example, at Amadeus, functional logs account for roughly 10 GB of data generated each second.
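As an illustration of learning a normal behaviour model at this scale, the sketch below keeps a running mean and variance of a single numeric feature in one pass and constant memory (Welford's algorithm), then flags values that deviate strongly from that baseline. It is a deliberately simple stand-in, not the method proposed in this Thesis; the class name `RunningBaseline`, the z-score criterion and the threshold of 3.0 are assumptions made for the example.

```python
import math


class RunningBaseline:
    """One-pass mean/variance estimate (Welford's algorithm), O(1) memory."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        """Fold one historical observation into the baseline."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    def std(self) -> float:
        """Sample standard deviation; 0.0 until two values have been seen."""
        if self.count < 2:
            return 0.0
        return math.sqrt(self._m2 / (self.count - 1))

    def score(self, value: float) -> float:
        """Absolute z-score of a value with respect to the baseline."""
        std = self.std()
        if std == 0.0:
            return 0.0
        return abs(value - self.mean) / std


def is_outlier(baseline: RunningBaseline, value: float,
               threshold: float = 3.0) -> bool:
    """Flag a value whose z-score exceeds the (assumed) threshold."""
    return baseline.score(value) > threshold
```

Because each log record is consumed once and then discarded, a baseline of this kind could in principle be maintained per user or per service without storing the raw logs.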

3) Characteristics of the training data. In addition, major challenges of this Thesis proposal relate to the characteristics of the training data, that is, the functional logs generated by system components. Such data is heterogeneous in nature, blending numerical, categorical and text-based information. This makes comparison functions (e.g., distance functions, similarity functions) difficult to define, which hinders data analysis. Furthermore, the kinds of anomalies targeted by this Thesis are hard to know and specify in advance: there is therefore little scope for supervised fraud detection techniques, as frauds and system misuse are largely unknown and can evolve over time. A specific example of functional misuse that results in fraud is unusual API calls to system components: API calls can be treated as sequences of actions involving two or more parties (including end-users and internal services). As such, the relevant literature on pattern matching will constitute the starting point for addressing the problems outlined above.
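To illustrate why comparison functions are awkward for heterogeneous records, here is a minimal sketch of a Gower-style distance that averages one per-field distance for each field type. The choice of per-field distances (range-normalised absolute difference, exact match, Jaccard distance on whitespace tokens), the equal weighting and the field names used in the usage example are all assumptions made for illustration, not choices stated in this proposal.

```python
from typing import Iterable, Mapping


def numeric_distance(a: float, b: float, value_range: float) -> float:
    """Absolute difference scaled by an assumed known range, capped at 1."""
    if value_range <= 0:
        return 0.0
    return min(abs(a - b) / value_range, 1.0)


def categorical_distance(a: str, b: str) -> float:
    """0 if the categories match, 1 otherwise."""
    return 0.0 if a == b else 1.0


def text_distance(a: str, b: str) -> float:
    """Jaccard distance between lowercase whitespace-token sets."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return 1.0 - len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def record_distance(x: Mapping, y: Mapping,
                    numeric_ranges: Mapping[str, float],
                    categorical_fields: Iterable[str],
                    text_fields: Iterable[str]) -> float:
    """Unweighted mean of per-field distances, each in [0, 1]."""
    parts = [numeric_distance(x[f], y[f], r) for f, r in numeric_ranges.items()]
    parts += [categorical_distance(x[f], y[f]) for f in categorical_fields]
    parts += [text_distance(x[f], y[f]) for f in text_fields]
    return sum(parts) / len(parts) if parts else 0.0
```

For instance, with two hypothetical log records:

```python
a = {"latency_ms": 100.0, "service": "auth", "message": "login ok"}
b = {"latency_ms": 300.0, "service": "booking", "message": "login failed"}
d = record_distance(a, b, {"latency_ms": 1000.0}, ["service"], ["message"])
```

Every choice here (the ranges, the weights, the tokenisation) changes which records count as "close", which is precisely the difficulty noted above.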
