We show that distributed Infrastructure-as-a-Service (IaaS) compute clouds can be effectively used for the analysis of high energy physics data. We have designed a distributed cloud system that works with any application that uses large input data sets and requires a high-throughput computing environment. The system uses IaaS-enabled science and commercial clusters situated at multiple sites. We describe the process in which a user prepares an analysis virtual machine (VM) and submits batch jobs to a central scheduler. The system boots the user-specific VM on one of the IaaS clouds, runs the jobs and returns the output to the user. During execution, the user application accesses a central database for calibration data. Similarly, the event data are stored at a central location and streamed to the running application. The system runs one hundred simultaneous jobs efficiently and is scalable to many hundreds, and possibly thousands, of user jobs.
Infrastructure as a Service (IaaS) cloud computing is emerging as a new and efficient way to provide computing to the research community. The growing interest in clouds can be attributed, in part, to the ease of encapsulating complex research applications in Virtual Machines (VMs) with little or no performance degradation. Studies have shown that high energy physics application code runs equally well in a VM. Virtualization not only offers advantages such as abstraction from the underlying hardware and simplified application deployment; in situations where traditional computing clusters have hardware and software configurations that are incompatible with the scientific application's requirements, it is the only option available. A key question is how to manage large data sets in a cloud or distributed cloud environment.

We have developed a system for running high throughput batch processing applications using any number of IaaS clouds. This system uses software such as Nebula and Nimbus, in addition to custom components such as a cloud scheduling element and a VM image repository. The results presented in this work use IaaS clouds together with Amazon EC2. The total memory and CPU of each computational cluster in the clouds are divided evenly into what we call VM slots, and each slot can be assigned to run a VM. When a VM has finished running, the slot's resources are released and become available to run another VM. The input data and analysis software are located on one of the clouds, and the VM images are stored in a repository on the other cloud. The sites are connected by a research network, while a commodity network connects the clouds to Amazon EC2. Users are provided with a set of VMs that are configured with the application software. The user submits their jobs to a scheduler, where the job script contains a link to the required VM.
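The slot model described above can be sketched as follows. This is a minimal illustration of the idea, not the system's actual code; all names and the 8-slot configuration in the usage are assumptions for the example.

```python
from dataclasses import dataclass, field


@dataclass
class SlotPool:
    """A cluster's CPU and memory divided evenly into VM slots (illustrative)."""
    total_cpus: int
    total_mem_gb: int
    n_slots: int
    in_use: set = field(default_factory=set)

    @property
    def cpus_per_slot(self) -> int:
        return self.total_cpus // self.n_slots

    @property
    def mem_per_slot_gb(self) -> int:
        return self.total_mem_gb // self.n_slots

    def acquire(self):
        """Assign a free slot for a new VM; return its id, or None if the cluster is full."""
        for s in range(self.n_slots):
            if s not in self.in_use:
                self.in_use.add(s)
                return s
        return None

    def release(self, slot_id: int):
        """Called when a VM finishes: the slot's resources become available again."""
        self.in_use.discard(slot_id)
```

For example, a 16-core, 64 GB cluster divided into 8 slots yields 2 cores and 8 GB per slot; releasing a slot immediately makes it schedulable again.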
A cloud scheduling component, called Cloud Scheduler, searches the job queue, identifies the VM required for each queued job, and sends a request to one of the clouds to boot the user-specific VM. Once the VM is booted, the scheduler submits the user job to the running VM. The job runs and returns any output to a user-specified location. If there are no further jobs requiring that specific VM, Cloud Scheduler shuts it down. The system has been demonstrated to work well for applications with modest I/O requirements, such as the production of simulated data. The input files for this type of application are small and the rate of production of the output data is modest (though the files can be large). In this work, we focus on data intensive high energy physics applications in which the job reads large sets of input data at higher rates. In particular, we use the analysis application of the BaBar experiment, which recorded electron-positron collisions at the SLAC National Accelerator Laboratory from 2000 to 2008. We show that the data can be quickly and efficiently streamed from a single data storage location to each of the clouds. We describe the issues that have arisen and the potential for scaling the system to many hundreds or thousands of simultaneous user jobs.
Analysis jobs in high energy physics typically require two inputs: event data and configuration data. The configuration data include a BaBar conditions database, which contains time-dependent information about the conditions under which the events were recorded. The event data can be real data recorded by the detector or simulated data. Each event contains information about the particles seen in the detector, such as their trajectories and energies. The real and simulated data are nearly identical in format; the simulated data contain additional information describing how they were generated. The user analysis code analyzes one event at a time. In the BaBar experiment the total size of the real and simulated data is approximately 2 PB, but users typically read a small fraction of this sample. In this work we use a subset containing approximately 8 TB of simulated and real data. The event data for this analysis were stored in a distributed file system at one cloud. The file system is hosted on a cluster of six nodes, consisting of a Management/Metadata server (MGS/MDS) and five Object Storage Servers (OSS). It uses a single gigabit interface/VLAN to communicate both internally and externally. This is an important consideration for the test results presented, because these same nodes also host the IaaS frontend (on the MGS/MDS server) and the Virtual Machine Monitors (on the OSS servers) for the cloud.
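The event-at-a-time analysis pattern described above can be sketched as a simple loop. The event fields and the selection function here are illustrative placeholders, not BaBar analysis code.

```python
def analyze(events, select, fill):
    """Loop over events one at a time; process and count those passing the selection."""
    n_selected = 0
    for event in events:          # each event: particle trajectories, energies, ...
        if select(event):         # the user's physics selection
            fill(event)           # e.g. fill histograms, write ntuple rows
            n_selected += 1
    return n_selected
```

Because each event is independent, a data set can be split across many such jobs running in parallel, which is what makes the workload well suited to a distributed cloud.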
The jobs use Xrootd to read the data. Xrootd is a file server providing byte-level access and is used by many high energy physics experiments. Here Xrootd provides read-only access to the distributed data (read/write access is also possible). Though deploying Xrootd is fairly straightforward, some tuning was necessary to achieve good performance across the network: a read-ahead value of 1 MB and a read-ahead cache size of 10 MB were set on each Xrootd client.
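The benefit of those settings is generic to any remote-read client: fetching 1 MB per network request and serving subsequent small event reads from a local cache amortizes the wide-area round-trip latency. The following is a sketch of that read-ahead idea in general terms, not of Xrootd's internal implementation; the eviction policy and interfaces are assumptions for the example.

```python
READ_AHEAD = 1 << 20            # fetch 1 MB per remote request
CACHE_SIZE = 10 << 20           # keep up to 10 MB of fetched blocks client-side


class ReadAheadFile:
    """Serve small reads from locally cached 1 MB blocks (illustrative sketch)."""

    def __init__(self, fetch):
        self.fetch = fetch       # fetch(offset, length) -> bytes: one network round trip
        self.cache = {}          # block start offset -> bytes
        self.requests = 0        # network round trips actually made

    def read(self, offset, length):
        out = bytearray()
        while length > 0:
            block = (offset // READ_AHEAD) * READ_AHEAD
            if block not in self.cache:
                if len(self.cache) * READ_AHEAD >= CACHE_SIZE:
                    self.cache.pop(next(iter(self.cache)))  # evict the oldest block
                self.cache[block] = self.fetch(block, READ_AHEAD)
                self.requests += 1
            chunk = self.cache[block][offset - block : offset - block + length]
            out += chunk
            offset += len(chunk)
            length -= len(chunk)
        return bytes(out)
```

Many kilobyte-scale event reads within the same megabyte then cost a single wide-area request instead of one request per read.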
The VM images are stored at the other cloud and propagated to the worker nodes by HTTP. For analysis runs that include the Amazon EC2 cloud, we store another copy of the VM images on Amazon EC2.
In addition to transferring the input data on demand using Xrootd, the BaBar software is staged to the VMs on demand using a specialized network file system. This reduces the size of the VM images transferred from the image repository to each cloud site, and therefore the amount of data transferred when a VM starts. It not only makes the VM start faster, but also helps mitigate network saturation after job submission by postponing part of the data transfer until after the job has started.
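The staging idea can be sketched as follows. This illustrates the general pattern in the spirit of the network file system mentioned above, not its actual implementation; `fetch_remote` and the cache layout are assumptions for the example.

```python
import hashlib
import os


def open_staged(path, fetch_remote, cache_dir):
    """Open a software file, downloading it from the central repository only on first use.

    fetch_remote(path) -> bytes is an assumed probe that pulls one file
    from the repository; later opens are served from the local cache.
    """
    local = os.path.join(cache_dir, hashlib.sha1(path.encode()).hexdigest())
    if not os.path.exists(local):            # first access: one download
        os.makedirs(cache_dir, exist_ok=True)
        with open(local, "wb") as f:
            f.write(fetch_remote(path))
    return open(local, "rb")                 # repeated accesses never touch the network
```

Because only the files a job actually opens are transferred, the VM image can ship without the full software release, which is what shortens the boot-time transfer.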
A typical user job in high energy physics reads one event at a time, where an event contains the information from a single particle collision. Electrons and positrons circulate in opposite directions in a storage ring and are made to collide millions of times per second in the center of the BaBar detector. The BaBar detector is cylindrical, with a size of approximately 5 meters in each dimension. It measures the trajectories of charged particles and the energy of both neutral and charged particles. A fraction of the events are considered interesting from a scientific standpoint, and the information in the detector is written to a storage medium. The size of an event in BaBar is a few kilobytes, depending on the number of particles produced in the collision.

One of the features of the system is its ability to recover from faults arising either from local system problems at each of the clouds or from network issues. We list some of the situations we identified in the processing of the jobs. For example, cloud resources can be taken down for maintenance and brought back up again. In our test, the NRC cloud resources were added to the pool of resources after the set of jobs was submitted. Cloud Scheduler automatically detected the newly available resources and successfully scheduled jobs to them without affecting already running jobs.
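The behaviour just described amounts to rebuilding the schedulable resource pool from periodic status polls, so that a cloud returning from maintenance simply reappears in the pool while running VMs are tracked separately and left untouched. The sketch below illustrates that pattern only; it is not Cloud Scheduler's code, and the cloud names and `query_status` probe are assumptions for the example.

```python
def refresh_pool(known_clouds, query_status):
    """Rebuild the pool of schedulable capacity from a status poll.

    query_status(name) -> {'up': bool, 'free_slots': int} is an assumed probe.
    Running VMs are tracked elsewhere, so a cloud joining or leaving the pool
    does not affect jobs already in flight.
    """
    pool = {}
    for name in known_clouds:
        status = query_status(name)
        if status["up"]:                    # e.g. cloud back from maintenance
            pool[name] = status["free_slots"]
    return pool                             # schedulable slots per cloud
```

A cloud that comes online between two polls is picked up on the next cycle with no operator intervention, which matches the NRC case observed in the test.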
Conclusions: From the users' perspective, the system is robust and handles intermittent network issues gracefully. We have shown that the use of distributed compute clouds can be an effective way of analyzing large research data sets. This is made possible by the combination of cloud computing and distributed file systems.