Does Shared-Memory, Highly Multi-Threaded, Single-Application Scale on Many-Cores?
Single-chip cache-coherent multi-cores with up to 100 cores are now a reality, and many-cores with hundreds of cores are planned for the near future. Because of the large number of cores and for power-efficiency reasons (performance per watt), cores are becoming simpler, with small caches. To use the parallelism offered by these architectures efficiently, applications must be multi-threaded. The POSIX Threads (PThreads) standard is the most portable way to use threads across operating systems. It is also used as a low-level layer to support other portable, shared-memory, parallel environments such as OpenMP. In this paper, we experimentally verify the scalability of shared-memory, PThreads-based applications on a Cycle-Accurate-Bit-Accurate (CABA) simulated 512-core architecture. Using two unmodified, highly multi-threaded applications, SPLASH-2 FFT and EPFilter (a medical-image noise-filtering application provided by Philips), our study shows a scalability limitation beyond 64 cores for FFT and 256 cores for EPFilter. Based on hardware event counters, our analysis shows that (i) the detected scalability limitation is a conceptual problem related to the notion of thread and process; and (ii) the small per-core caches found in many-cores exacerbate the problem. Finally, we present our solution in principle and future work.
4th USENIX Workshop on Hot Topics in Parallelism, conference proceeding, 2012-06-07