As cloud computing has become integral to almost every sector of the IT industry, organizations have begun setting up private clouds that serve resource demands from their different offices. When demand grows beyond what the private cloud can handle, these organizations shift the excess processing to a public cloud. This kind of setup is called hybrid cloud computing, as shown in Figure 1.
Figure 1. Hybrid Cloud Setup
Shifting jobs from the private cloud to the public cloud can be based on multiple factors. Deadline-sensitive jobs move to the public cloud when the scheduler predicts the deadline cannot be met locally, while in other cases cost decides which jobs to send. Using the public cloud increases the cost of running applications, because usage is charged by running time. Organizations with a limited budget therefore avoid sending jobs to the public cloud unless it is truly urgent.
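The offload decision described above can be sketched as a simple policy function. This is a minimal illustration, not the paper's actual scheduler: the job fields (`deadline`, `predicted_private_finish`, `public_cost`) and the budget threshold are hypothetical names chosen for the example.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    deadline: float                  # seconds from now
    predicted_private_finish: float  # estimated completion time on the private cloud
    public_cost: float               # estimated cost of running on the public cloud

def should_offload(job: Job, remaining_budget: float) -> bool:
    """Offload a job only when the private cloud cannot meet its
    deadline AND the organization's budget can absorb the cost."""
    deadline_at_risk = job.predicted_private_finish > job.deadline
    affordable = job.public_cost <= remaining_budget
    return deadline_at_risk and affordable

job = Job("mine-relations", deadline=600, predicted_private_finish=900, public_cost=2.5)
print(should_offload(job, remaining_budget=10.0))  # True: deadline at risk, cost is affordable
```

A real scheduler would estimate `predicted_private_finish` from queue lengths and job profiles, but the two-condition structure (deadline pressure gated by budget) matches the policy described above.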
One of the issues we investigated in this type of scenario is how to bundle applications with their data when sending them to the public cloud for processing. Jobs sent to the public cloud do not start the instant we submit them, because it takes time to start a virtual machine and transfer the required data. For big data applications, the time needed to ship the computation logic to the virtual machine is negligible compared to the data transfer time. As most organizations are moving toward in-house data mining and analysis, data transfer cost and time cannot be ignored.
Public cloud providers offer machine types ranging from 1 vCPU to 64 vCPUs and beyond. Many data mining and analysis applications mine different relations from the same dataset, so it is more effective to bundle jobs that require the same data and send them to a machine with an appropriate number of vCPUs. This reduces data transfer time considerably. For example, if the scheduler must send 8 jobs that share 300 MB of data to the public cloud and it provisions a 1-vCPU machine for each job, it sends 300 × 8 = 2400 MB of data. But if the scheduler is shared-file aware and sends all 8 jobs to one machine with 8 vCPUs, the total transfer is only 300 MB.
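The savings in the 8-job example can be checked with a small sketch. Here each job is represented simply by the name of the dataset it needs; the function names and data layout are illustrative assumptions, not the paper's implementation.

```python
def naive_transfer_mb(job_datasets, dataset_size_mb):
    """One 1-vCPU VM per job: each job ships its own copy of its dataset."""
    return sum(dataset_size_mb[ds] for ds in job_datasets)

def shared_aware_transfer_mb(job_datasets, dataset_size_mb):
    """Bundle jobs sharing a dataset onto one multi-vCPU VM:
    each distinct dataset is transferred exactly once."""
    return sum(dataset_size_mb[ds] for ds in set(job_datasets))

# 8 jobs all mining the same 300 MB dataset, as in the example above
jobs = ["shared.csv"] * 8
sizes = {"shared.csv": 300}
print(naive_transfer_mb(jobs, sizes))         # 2400
print(shared_aware_transfer_mb(jobs, sizes))  # 300
```

The shared-aware total depends only on the number of distinct datasets, not the number of jobs, which is why the savings grow with the degree of sharing.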
Making the scheduler aware of shared files, and of the total number of jobs that require each shared file, reduces transfer time many-fold, making the user experience more seamless and effective.
By - Dr. Rajinder Sandhu, CURIN, Chitkara University, Punjab