Apache Spark’s momentum is unstoppable!

Spark is one of the fastest-growing open source projects, with more than 600 contributors. Because of Spark’s proven performance gains, almost all Hadoop vendors now ship Spark alongside Hadoop.

Spark can run a variety of batch-processing jobs 10 to 100 times faster than disk-based MapReduce, largely by reducing the number of reads and writes to disk. At the core of Spark’s computational engine is the RDD, or Resilient Distributed Dataset. RDDs can be built from, and saved to, any supported storage system, including Hadoop’s HDFS, MongoDB or Amazon’s S3. These fault-tolerant, parallel data structures are well suited to in-memory cluster computing because they let users explicitly persist intermediate results in memory, without writing them to external storage. Each RDD carries enough information about how it was derived from other datasets (its lineage) for Spark to recompute lost data when needed. As a result, you can work with data at a fine level of granularity and run parallel computations on the fly in a distributed manner.
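To make the lineage idea concrete, here is a minimal single-machine sketch in plain Python. It is not Spark's API: `ToyRDD`, `persist`, `evict` and `lineage` are hypothetical names invented for this illustration. Each dataset only remembers its parent and the function that derives it, so a persisted result that is lost can be rebuilt from the chain.

```python
class ToyRDD:
    """A toy, single-machine stand-in for an RDD (not Spark's real API)."""

    def __init__(self, source=None, parent=None, fn=None, name="source"):
        self.source = source   # only set for the root dataset
        self.parent = parent   # the dataset this one was derived from
        self.fn = fn           # how to derive this dataset from its parent
        self.name = name
        self._cache = None     # filled in by persist()

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda rows: [f(r) for r in rows], name="map")

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda rows: [r for r in rows if pred(r)], name="filter")

    def persist(self):
        # Keep the computed result in memory for later reuse.
        self._cache = self._compute()
        return self

    def evict(self):
        # Simulate losing the in-memory copy.
        self._cache = None

    def _compute(self):
        if self.parent is None:
            return list(self.source)
        return self.fn(self.parent.collect())

    def collect(self):
        # Use the in-memory copy if present, otherwise replay the lineage.
        if self._cache is not None:
            return self._cache
        return self._compute()

    def lineage(self):
        chain, node = [], self
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return list(reversed(chain))


numbers = ToyRDD(source=range(1, 6))
squares = numbers.map(lambda x: x * x).persist()   # intermediate result kept in memory
evens = squares.filter(lambda x: x % 2 == 0)

from_cache = evens.collect()    # reads the persisted squares
squares.evict()                 # the in-memory copy is gone
recomputed = evens.collect()    # rebuilt from source -> map -> filter
```

Both calls to `collect()` return the same rows. The only difference is whether the intermediate `squares` came from memory or was recomputed from its lineage.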

This computing power makes Spark well suited to large-scale, data-intensive applications. With the growing need for interactive, real-time analytics on streaming data, Spark looks like a practical way for organisations to get more out of their existing systems.

One of the most popular modules driving Spark’s enterprise adoption is Spark SQL. It lets you work with DataFrames, a higher-level abstraction built on RDDs. From Python, Scala or Java, or with HiveQL queries, users can call the Spark SQL APIs, which come with broad support for structured data sources such as Parquet, JSON and Cassandra. DataFrames, the core of Spark SQL, make it possible to run declarative queries and to optimize how the data is stored. This lets Spark manage in-memory caching efficiently. Unlike plain RDDs, which keep JVM objects in memory, Spark SQL can cut the memory footprint drastically by storing cached data column by column and automatically selecting the best compression for each column.
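Why does a columnar layout compress so well? The plain-Python sketch below pivots a few rows into columns and run-length encodes each one. It is only an illustration: the helper names and sample rows are made up, and run-length encoding is just one simple scheme chosen here, not a description of what Spark SQL actually picks for a given column.

```python
from itertools import groupby


def to_columns(rows):
    """Pivot a list of row dicts (all with the same keys) into column lists."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse runs of equal neighbouring values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, count) pairs back into the original column."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out


# Hypothetical sample data: row-oriented records, one dict per row.
rows = [
    {"country": "IN", "status": "ok"},
    {"country": "IN", "status": "ok"},
    {"country": "IN", "status": "error"},
    {"country": "US", "status": "ok"},
    {"country": "US", "status": "ok"},
]

columns = to_columns(rows)
encoded = {name: run_length_encode(values) for name, values in columns.items()}
```

Values of the same column sit next to each other, so repeated values collapse into a handful of pairs. The same data stored as one object per row gives the encoder nothing to collapse.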

Spark Streaming offers another great advantage: it merges two stacks that used to be separate, batch processing and real-time processing. Spark treats streaming data as a sequence of small batches, each an ordinary dataset covering a short interval (around 0.5 seconds), and offers a stateful, distributed stream-processing system on top of them. So users can run batch and streaming workloads on the same setup, without managing a completely different framework for each job.
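The micro-batch idea can be sketched in a few lines of plain Python. This is a toy model, not Spark Streaming's API: the function names, the sample events and the 0.5 second interval are assumptions for illustration. The point is that the same batch function (`count_words`) is reused unchanged on every small batch, while a running total is kept as state across batches.

```python
from collections import Counter


def count_words(lines):
    """An ordinary batch job: count the words in a list of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def micro_batches(events, interval):
    """Group (timestamp, line) events into consecutive batches of `interval` seconds.

    Intervals with no events are simply skipped in this sketch.
    """
    batches = {}
    for timestamp, line in events:
        batches.setdefault(int(timestamp // interval), []).append(line)
    return [batches[key] for key in sorted(batches)]


# Hypothetical stream: (arrival time in seconds, text line).
events = [
    (0.1, "spark streaming"),
    (0.3, "spark"),
    (0.6, "batch spark"),
    (1.2, "streaming"),
]

state = Counter()   # running totals carried across batches
per_batch = []      # what each individual batch produced
for batch in micro_batches(events, interval=0.5):
    counts = count_words(batch)   # same code a pure batch job would run
    per_batch.append(dict(counts))
    state.update(counts)
```

Running `count_words` over all the lines at once gives the same totals as the accumulated `state`, which is why one code path can serve both batch and streaming workloads.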

A prominent example of Spark adoption is SAP HANA, which is tuned to leverage the mature capabilities of Spark’s in-memory engine and libraries. Many tools in the Hadoop ecosystem, such as Hive, Pig and Apache Solr, are also moving towards Spark’s architecture to offer applications lower latency.

Apart from that, Spark’s machine learning library, MLlib, ships as a core module and lets users set up complex machine learning pipelines. Users can quickly wire together out-of-the-box algorithms or apply custom algorithms to different data sets, and run them 10 to 100 times faster than on earlier big data machine learning architectures.
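A machine learning pipeline here means a chain of stages, where each stage is fitted on the output of the previous one and the fitted chain can then be applied to new data. The plain-Python sketch below shows that pattern only. The classes `MeanCenter`, `SignLabel` and `ToyPipeline` are hypothetical and far simpler than anything in MLlib.

```python
class MeanCenter:
    """Learns the mean of the training values and subtracts it."""

    def fit(self, values):
        self.mean = sum(values) / len(values)
        return self

    def transform(self, values):
        return [v - self.mean for v in values]


class SignLabel:
    """A stand-in 'model': labels a value 1 if it is positive, else 0."""

    def fit(self, values):
        return self   # nothing to learn in this toy stage

    def transform(self, values):
        return [1 if v > 0 else 0 for v in values]


class ToyPipeline:
    """Fits each stage on the output of the previous stage, in order."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, values):
        for stage in self.stages:
            values = stage.fit(values).transform(values)
        return self

    def transform(self, values):
        for stage in self.stages:
            values = stage.transform(values)
        return values


pipeline = ToyPipeline([MeanCenter(), SignLabel()])
pipeline.fit([1.0, 2.0, 3.0, 6.0])                 # the mean learned here is 3.0
predictions = pipeline.transform([2.0, 3.0, 5.0])  # above-average values get label 1
```

Swapping in a custom algorithm means adding another object with the same `fit` and `transform` methods. The rest of the pipeline does not change.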

To take it further, another Spark module, GraphX, lets users represent data as graphs. Because Spark’s modules share RDDs as a common core, they interoperate, which makes it possible to set up complex pipelines that merge results from machine learning, graph processing and relational tables. Users can therefore run sophisticated graph analysis on top of the machine learning algorithms in Spark’s MLlib and build unified Spark applications, such as product ratings and recommendations. Some interesting examples and use cases are discussed here.
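As a rough illustration of how a relational step and a graph step can feed one recommendation result, consider the plain-Python sketch below. It does not use GraphX or MLlib: the ratings data and function names are invented, and a simple co-occurrence score stands in for a learned model.

```python
from collections import defaultdict

# Hypothetical ratings table: (user, product, rating).
ratings = [
    ("alice", "book", 5),
    ("alice", "lamp", 4),
    ("bob", "book", 5),
    ("bob", "desk", 4),
    ("carol", "lamp", 2),
]


def build_graph(rows, min_rating):
    """Relational step: keep well-rated rows, then link users to products they liked."""
    likes = defaultdict(set)
    for user, product, rating in rows:
        if rating >= min_rating:
            likes[user].add(product)
    return likes


def recommend(likes, user):
    """Graph step: follow user -> product -> other user -> product edges.

    Each unseen product is scored by how many liked products the two users share.
    """
    mine = likes.get(user, set())
    scores = defaultdict(int)
    for other, products in likes.items():
        if other == user:
            continue
        shared = mine & products
        if shared:
            for product in products - mine:
                scores[product] += len(shared)
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda p: (-scores[p], p))


likes = build_graph(ratings, min_rating=4)
for_alice = recommend(likes, "alice")
for_bob = recommend(likes, "bob")
for_carol = recommend(likes, "carol")   # carol liked nothing, so there is no path to follow
```

In a real Spark application each step would run on distributed data, but the shape is the same: one stage's output becomes the next stage's input without leaving the framework.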