SciSpark is implemented in a Java and Scala Spark environment. Although Spark also offers a Python API, and we recognize that most scientists are comfortable programming in Python, the JVM environment was chosen to avoid the known latency of PySpark, which stems from the communication overhead of copying data from the worker JVMs to Python daemon processes. Furthermore, we wish to maximize in-memory computation, but in the PySpark environment the driver JVM writes results to local disk that are then read by the Python process.
For more details, please see our latest paper: Palamuttam, Rahul, Renato Marroquín Mogrovejo, Chris Mattmann, Brian Wilson, Kim Whitehall, Rishi Verma, Lewis McGibbney, and Paul Ramirez. "SciSpark: Applying In-memory Distributed Computing to Weather Event Detection and Tracking."
Scientific array-based data from Network Common Data Form (netCDF) and Hierarchical Data Format (HDF) files on local disks, or from remote sources via protocols like OPeNDAP, is loaded into SciSpark using a Partition, Extract, Transform and Load (PETaL) process. More specifically, the PETaL layer first partitions the files by time and/or by space (the latter to be added), then distributes the extraction of the data, transforms it into a data type usable in SciSpark, and loads it into the SciSpark engine on the compute nodes.
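The partition step of PETaL can be illustrated in isolation. The sketch below groups NetCDF file paths by a timestamp embedded in their names, so each group can later be handed to a worker for extraction; the filename convention (`<dataset>_<YYYYMM>.nc`) and all names here are hypothetical, not SciSpark's actual API.

```java
import java.util.*;
import java.util.regex.*;
import java.util.stream.*;

public class PartitionByTime {
    // Hypothetical filename convention: <dataset>_<YYYYMM>.nc
    private static final Pattern TIME_TAG = Pattern.compile("_(\\d{6})\\.nc$");

    /** Groups NetCDF file paths by the YYYYMM tag in their names. */
    static Map<String, List<String>> partition(List<String> paths) {
        return paths.stream().collect(Collectors.groupingBy(p -> {
            Matcher m = TIME_TAG.matcher(p);
            return m.find() ? m.group(1) : "unknown";
        }));
    }

    public static void main(String[] args) {
        // Two files fall in 1998-01, one in 1998-02.
        Map<String, List<String>> groups = partition(
                List.of("trmm_199801.nc", "precip_199801.nc", "trmm_199802.nc"));
        System.out.println(groups.keySet());
    }
}
```

Each resulting group is a natural unit of work: the distributed extraction step can then open only the files belonging to its time partition.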
In the same way Spark exploits Resilient Distributed Datasets (RDDs), SciSpark contributes and exploits the Scientific RDD (sRDD), which corresponds to a multi-dimensional array representing a scientific measurement (grid) subset by time or by space. The RDD abstraction directly enables the reuse of array data across multi-stage operations, and it ensures data can be replicated, distributed, and easily reconstructed in different storage tiers, e.g., memory for fast interactivity, SSDs for near real time, and spinning disk for later operations.
SciSpark defines the Scientific Resilient Distributed Dataset (sRDD), a distributed-computing array structure that supports multidimensional data and the processing of scientific algorithms in the MapReduce paradigm. This is currently achieved through a self-documented array class called the sciTensor, which presents the data in the multi-dimensional format that scientists are accustomed to.
SciSpark currently provides methods to create sRDDs that (1) load data from Network Common Data Form (netCDF) and Hierarchical Data Format (HDF) files into the Hadoop Distributed File System (HDFS); (2) preserve the logical representation of structured, dimensional data; and (3) create a partition function that divides the multidimensional array by time (to be expanded to space as well). sRDDs are cached in memory in the SciSpark engine to support data reuse between multi-staged analytics.
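Point (3), dividing a multidimensional array by time, amounts to slicing a (time, lat, lon) cube into one 2-D grid per time step, each of which can become a separate sRDD partition. The following is a plain-Java sketch of that split under assumed row-major layout; it is not SciSpark's partition function.

```java
import java.util.*;

public class SplitByTimeAxis {
    /**
     * Splits a row-major (time, lat, lon) cube into one flat
     * (lat x lon) slice per time step.
     */
    static double[][] split(double[] cube, int nTime, int nLat, int nLon) {
        int sliceLen = nLat * nLon;
        double[][] slices = new double[nTime][];
        for (int t = 0; t < nTime; t++) {
            slices[t] = Arrays.copyOfRange(cube, t * sliceLen, (t + 1) * sliceLen);
        }
        return slices;
    }

    public static void main(String[] args) {
        // A 2 x 2 x 3 cube holding the values 0..11.
        double[] cube = new double[12];
        for (int i = 0; i < 12; i++) cube[i] = i;
        double[][] slices = split(cube, 2, 2, 3);
        System.out.println(slices.length + " time slices");
    }
}
```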
One of the key components of SciSpark is interactive sRDD visualization. To accomplish this, SciSpark delivers a user interface through Apache Zeppelin. Zeppelin provides a notebook-style interface with an interpreter architecture that allows any language or data-processing backend to be plugged in.
Our current development can be followed at the SciSpark GitHub repo. SciSpark will eventually be delivered as an open-source project under the Apache License, version 2 ("ALv2"), to the Apache Software Foundation (ASF).