Apache Spark is an open source engine for data processing on clusters. It is a general-purpose processing engine that handles a variety of workloads by extending Hadoop's MapReduce programming model with primitives for data sharing. Spark also carries all the desirable properties of that model, such as distribution, scheduling, and fault tolerance. It is one of the fastest-growing open source projects, with a very large community of users, developers, and contributors.

Why Spark when you have MapReduce?

Spark does not provide a file system of its own, such as Hadoop's HDFS. Rather, Spark's execution engine lets you leverage all the benefits that the MapReduce model offers. Spark's core engine runs some batch workloads 10x to 100x faster than traditional MapReduce. Moreover, its high-level functional programming style makes it much quicker to put code together than with MapReduce.

For better control over memory

Spark gives you better control over memory distributed across the cluster. You can cache huge datasets in the memory of a group of nodes and keep them there across queries, making repeated access fast while fault tolerance is handled automatically.
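
As a minimal sketch of this pattern (the HDFS path and log format are placeholders, not from the original), caching a dataset in cluster memory and reusing it across queries might look like this in Scala:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CachingExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("caching-example"))

    // Load a (hypothetical) large log file and keep it in cluster memory.
    val logs = sc.textFile("hdfs:///data/access.log").cache()

    // Both queries below reuse the cached partitions instead of re-reading
    // the file; lost partitions are recomputed from lineage automatically.
    val errorCount = logs.filter(_.contains("ERROR")).count()
    val warnCount  = logs.filter(_.contains("WARN")).count()

    println(s"errors=$errorCount, warnings=$warnCount")
    sc.stop()
  }
}
```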

To reduce operational overhead

Applications that chain multiple MapReduce jobs pay read/write overhead to the file system at every step, and those reads and writes must be replicated across the cluster, making them even more expensive. In Spark, a computation is a general graph with multiple stages, so by merging several such jobs into one execution graph you avoid the intermediate read/write overhead entirely without compromising fault tolerance; the sketch below makes this concrete.
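
Here is a hedged sketch of that idea (input and output paths are hypothetical): several chained transformations that would each be a separate MapReduce job run as stages of a single Spark job, with nothing written to the file system until the final action:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SingleGraphExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("single-graph"))

    // Each step below would roughly be its own MapReduce job, with an
    // HDFS write in between. In Spark they are stages of one execution
    // graph; intermediate results never touch the file system.
    val counts = sc.textFile("hdfs:///data/input.txt")
      .flatMap(_.split("\\s+"))          // tokenize
      .map(word => (word, 1))
      .reduceByKey(_ + _)                // count occurrences
      .filter { case (_, n) => n > 10 }  // keep frequent words
      .sortBy { case (_, n) => -n }      // order by frequency

    counts.saveAsTextFile("hdfs:///data/output") // the only write, at the end
    sc.stop()
  }
}
```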

When to use Spark?

Spark is much talked about in the big data landscape, but whether an organization needs it depends on its workloads. Most enterprises run several types of workloads, which at times results in complex data pipelines. Apache Spark reduces that operational overhead and lets users write functions locally and run them on clusters (a minimal sketch after the list shows the same code running locally and on a cluster). Spark's rich standard library makes it most suitable when you want to combine several of the following workloads:

  • Iterative machine learning algorithms
  • Interactive graph computations
  • Stream processing
  • Standard batch processing
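
As referenced above, here is a minimal hedged sketch of writing a function locally and running it unchanged on a cluster; the SPARK_MASTER environment variable and the toy computation are illustrative assumptions, not part of Spark itself:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalThenCluster {
  def main(args: Array[String]): Unit = {
    // "local[*]" runs on all cores of the developer's machine; pointing
    // the (hypothetical) SPARK_MASTER variable at a cluster URL, or
    // submitting the same jar with spark-submit, runs the identical code
    // on a cluster with no changes.
    val master = sys.env.getOrElse("SPARK_MASTER", "local[*]")
    val sc = new SparkContext(
      new SparkConf().setAppName("dev-to-cluster").setMaster(master))

    // A toy distributed computation: sum of squares of 1..1000.
    val total = sc.parallelize(1 to 1000).map(n => n * n).reduce(_ + _)
    println(s"sum of squares = $total")
    sc.stop()
  }
}
```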

Spark provides a unified framework for managing a variety of workloads, which not only reduces operational overhead but also flattens the learning curve for whoever manages those workloads. For example, once a stream processing pipeline is set up and working, the user can quickly define a time window and run interactive queries over that data, as the sketch below suggests.
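
As a hedged sketch of that time-window pattern (the socket source, port, and checkpoint path are placeholders), with Spark Streaming you might define a sliding window over incoming events like this:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedStream {
  def main(args: Array[String]): Unit = {
    // Micro-batches every 5 seconds.
    val ssc = new StreamingContext(
      new SparkConf().setAppName("windowed-stream"), Seconds(5))
    ssc.checkpoint("hdfs:///tmp/checkpoints") // hypothetical checkpoint dir

    // Hypothetical source: text events arriving on a socket.
    val events = ssc.socketTextStream("localhost", 9999)

    // Sliding window: word counts over the last 60 seconds of data,
    // recomputed every 10 seconds.
    val windowed = events
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(10))

    windowed.print() // stand-in for querying the windowed data
    ssc.start()
    ssc.awaitTermination()
  }
}
```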

Spark offers concise APIs in Python, Scala, or Java for processing data at scale. Moreover, the upcoming Spark 2.0 release will be the first version with a focus on ETL, which should make it easier to modernize a data warehouse with Spark and overcome long-running ETL jobs. At Brevitaz, our Databricks-certified developers have broad experience in managing a variety of workloads and customizing Spark-based advanced analytics solutions, so that even non-technical members of your organization can take advantage of the insights.