In this work we study the I/O performance of long, sequential workloads that mimic those of Big Data applications, to understand the implications of system virtualization for data-intensive frameworks such as Apache Hadoop and Spark, which are frequently run in clusters of Virtual Machines (VMs). We do so through an experimental measurement campaign that collects low-level traces and metrics, to show the role played by important parameters: the I/O schedulers and caching mechanisms involved in the I/O path, and the VM configuration in terms of dedicated resources. Our findings matter in particular for determining appropriate deployment strategies for today's emerging Analytics Services, hosted on both public and private clouds.
The 2nd International Conference on Cloud Computing Technologies and Applications (CloudTech), May 2016, Marrakesh, Morocco, pp. 31-38, 2016. <http://www.macc.ma/cloudtech16/>. <10.1109/CloudTech.2016.7847722>. http://hal.upmc.fr/hal-01513070. 2016-05-24