Academic status
Thesis defended on 2000-07-01
Holder of an HDR (or equivalent): 2010-05-01
Laboratory: permanent staff
Thesis supervision (since 2007)
Proposed thesis topics
Blue ellipse: PhD student, yellow ellipse: PhD graduate, green rectangle: permanent staff, yellow rectangle: HDR holder. Green line: thesis advisor, blue line: thesis director, dotted line: mid-term evaluation committee or thesis jury.
Scientific output
Batched Cholesky Factorization for tiny matrices
International audience
Many linear algebra libraries, such as the Intel MKL, Magma or Eigen, provide fast Cholesky factorization. These libraries are suited for big matrices but perform slowly on small ones. Even though State-of-the-Art studies are beginning to take an interest in small matrices, these usually feature a few hundred rows. Fields like Computer Vision or High Energy Physics use tiny matrices. In this paper we show that it is possible to speed up the Cholesky factorization for tiny matrices by grouping them in batches and using highly specialized code. We provide High Level Transformations that accelerate the factorization for current Intel SIMD architectures (SSE, AVX2, KNC, AVX512). Combined with SIMD, these transformations achieve a speedup of 13 to 31 for the whole resolution compared to the naive code on a single core AVX2 machine, and a speedup of 15 to 33 with multithreading compared to the multithreaded naive code.
Design and Architectures for Signal and Image Processing (DASIP), Oct 2016, Rennes, France. pp. 1-8, 2016. https://hal.archives-ouvertes.fr/hal-01361204 https://ecsi.org/dasip 2016-10-12
Parallel Light Speed Labeling: an efficient connected component algorithm for labeling and analysis on multi-core processors
International audience
In the last decade, many papers have been published presenting sequential connected component labeling (CCL) algorithms. As modern processors are multi-core and trend toward many cores, the design of a CCL algorithm should address parallelism and multithreading. After a review of sequential CCL algorithms and a study of their variations, this paper presents the parallel version of the Light Speed Labeling (LSL) for Connected Component Analysis (CCA) and compares it to our parallelized implementations of State-of-the-Art sequential algorithms. We provide benchmarks that help to identify the intrinsic differences between these parallel algorithms. We show that, thanks to its run-based processing, the LSL is intrinsically more efficient and faster than all pixel-based algorithms. We also show that all the pixel-based algorithms are memory-bound on multi-socket machines, and so are inefficient and do not scale, whereas LSL, thanks to its RLE compression, can scale on such high-end machines. On a 4×15-core machine, and for 8192×8192 images, LSL outperforms its best competitor by a factor of 10.8 and achieves a throughput of 42.4 gigapixels labeled per second.
Journal of Real-Time Image Processing, Springer Verlag, 2016. ISSN: 1861-8200, EISSN: 1861-8219. DOI: 10.1007/s11554-016-0574-2. https://hal.archives-ouvertes.fr/hal-01361188 http://link.springer.com/article/10.1007/s11554-016-0574-2 2016-03-24
A new SIMD iterative connected component labeling algorithm
International audience
This paper presents a new multi-pass iterative algorithm for Connected Component Labeling. The performance of this algorithm is compared to that of State-of-the-Art two-pass direct algorithms. We show that, thanks to the parallelism of SIMD multi-core processors and an activity matrix that avoids useless memory accesses, the performance of this kind of algorithm comes closer and closer to that of direct ones. This new tiled iterative algorithm has been benchmarked on four generations of Intel Xeon processors: 2×4-core Nehalem, 2×12-core Ivy-Bridge, 2×14-core Haswell and 57-core Knights Corner. Macro meta-programming was used to design a single code base for the SSE, AVX2 and KNC SIMD instruction sets.
WPMVP '16: Proceedings of the 3rd Workshop on Programming Models for SIMD/Vector Processing, Principles and Practice of Parallel Programming / WPMVP, Mar 2016, Barcelona, Spain. DOI: 10.1145/2870650.2870652. https://hal.archives-ouvertes.fr/hal-01361101 http://conf.researchr.org/home/ppopp-2016 2016-03-12