logo EDITE Thibault DEBATTY
Identité
Thibault DEBATTY
État académique
Thèse en cours...
Sujet: Theory and practice of scalable machine learning algorithms
Direction de thèse:
Laboratoire:
Voisinage
Ellipse bleue: doctorant, ellipse jaune: docteur, rectangle vert: permanent, rectangle jaune: HDR. Trait vert: encadrant de thèse, trait bleu: directeur de thèse, pointillé: jury d'évaluation à mi-parcours ou jury de thèse.
Productions scientifiques
oai:hal.archives-ouvertes.fr:hal-01133324
Building k-nn graphs from large text data
International audience
In this paper we present our new design of NNCTPH, a scalable algorithm to build an approximate k-NN graph from large text datasets. The algorithm uses a modified version of Context Triggered Piecewise Hashing to bin the input data into buckets, and uses NN-Descent, a versatile graph-building algorithm , inside each bucket. We use datasets consisting of the subject of spam emails to experimentally test the influence of the different parameters of the algorithm on the number of computed similarities, on processing time, and on the quality of the final graph. We also compare the algorithm with a sequential and a MapRe-duce implementation of NN-Descent. For our datasets, the algorithm proved to be up to ten times faster than NN-Descent, for the same quality of produced graph. Moreover, the speedup increased with the size of the dataset, making NNCTPH a sensible choice for very large text datasets.
IEEE International Conference on Big Data (Big Data), 2014 https://hal.archives-ouvertes.fr/hal-01133324 IEEE International Conference on Big Data (Big Data), 2014, Oct 2014, Washington DC, United States. 2014, <10.1109/BigData.2014.7004276>Conference papers 2014-10-27
oai:hal.archives-ouvertes.fr:hal-01525743
Scalable Graph Building from Text Data
International audience
In this paper we propose NNCTPH, a new MapReduce algorithm that is able to build an approximate k-NN graph from large text datasets. The algorithm uses a modified version of Context Triggered Piecewise Hashing to bin the input data into buckets, and uses an exhaustive search inside the buckets to build the graph. It also uses multiple stages to join the different unconnected subgraphs. We experimentally test the algorithm on different datasets consisting of the subject of spam emails. Although the algorithm is still at an early development stage, it already proves to be four times faster than a MapReduce implementation of NN-Descent, for the same quality of produced graph.
ISSN: 1938-7228 Proceedings of Machine Learning Research https://hal.archives-ouvertes.fr/hal-01525743 Proceedings of Machine Learning Research, 2014, Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, 36, pp.120-132ARRAY(0x7f03fe5be4c8) 2014
oai:hal.archives-ouvertes.fr:hal-01525732
Multi-agent System for APT Detection
International audience
Advanced Persistent Threats (APTs) are targeted cyber attacks committed over a long period of time by highly skilled attackers. The ever increasing number of successful attacks indicates that classical network protection solutions (firewalls, Intrusion Detections Systems, proxies etc.) are no longer sufficient. Therefore, in this paper we propose a new system that combines multiples approaches using advanced aggregation techniques to achieve a better detection performance. We also test the system on real data from a small corporate network, and show that our system is able to attain a high probability of detection to probability of false alarm ratio.
2014 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW) 2014 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW 2014) https://hal.archives-ouvertes.fr/hal-01525732 2014 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW 2014), Nov 2014, Naples, Italy. 2014 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW), pp.401-406, 2014, <10.1109/ISSREW.2014.86>ARRAY(0x7f03fe5bef30) 2014-11-03
oai:hal.archives-ouvertes.fr:hal-01525708
Determining the k in k-means with MapReduce
International audience
In this paper we propose a MapReduce implementation of G-means, a variant of k-means that is able to automatically determine k, the number of clusters. We show that our implementation scales to very large datasets and very large values of k, as the computation cost is proportional to nk. Other techniques that run a clustering algorithm with different values of k and choose the value of k that provides the " best " results have a computation cost that is proportional to nk 2. We run experiments that confirm that the processing time is proportional to k. These experiments also show that, because G-means adds new centers progressively, if and where they are needed, it reduces the probability to fall into a local minimum, and finally finds better centers than classical k-means processing.
EDBT/ICDT 2014 Joint Conference https://hal.archives-ouvertes.fr/hal-01525708 EDBT/ICDT 2014 Joint Conference, Mar 2014, Athènes, Greece. 2014, Proceedings of the Workshops of the EDBT/ICDT 2014 Joint Conference. <http://www.edbticdt2014.gr/> http://www.edbticdt2014.gr/ARRAY(0x7f03fee71f50) 2014-03-24
oai:hal.archives-ouvertes.fr:hal-01525701
Scalable k-NN based text clustering
International audience
Clustering items using textual features is an important problem with many applications, such as root-cause analysis of spam campaigns, as well as identifying common topics in social media. Due to the sheer size of such data, algorithmic scalability becomes a major concern. In this work, we present our approach for text clustering that builds an approximate k-NN graph, which is then used to compute connected components representing clusters. Our focus is to understand the scalability / accuracy tradeoff that underlies our method: we do so through an extensive experimental campaign, where we use real-life datasets, and show that even rough approximations of k-NN graphs are sufficient to identify valid clusters. Our method is scalable and can be easily tuned to meet requirements stemming from different application domains.
2015 IEEE International Conference on Big Data https://hal.archives-ouvertes.fr/hal-01525701 2015 IEEE International Conference on Big Data, Oct 2015, Santa Clara, United States. pp.958 - 963, 2015, Big Data (Big Data), 2015 IEEE International Conference on. <10.1109/BigData.2015.7363845>ARRAY(0x7f03ff188b30) 2015-10-29
oai:hal.archives-ouvertes.fr:hal-01525697
Fast distributed k-nn graph update
International audience
In this paper, we present an approximate algorithm that is able to quickly modify a large distributed k-nn graph by adding or removing nodes. The algorithm produces an approximate graph that is highly similar to the graph computed using a naïve approach, although it requires the computation of far fewer similarities. To achieve this goal, it relies on a novel, distributed graph based search procedure. All these algorithms are also experimentally evaluated, using both euclidean and non-euclidean datasets.
2016 IEEE International Conference on Big Data (Big Data) 2016 IEEE International Conference on Big Data https://hal.archives-ouvertes.fr/hal-01525697 2016 IEEE International Conference on Big Data, Dec 2016, Washington, DC, United States. 2016 IEEE International Conference on Big Data (Big Data), pp.3308-3317, 2016, <10.1109/BigData.2016.7840990>ARRAY(0x7f03fee6ef38) 2016-12-05