Apache Spark, the in-memory, real-time data processing framework for Hadoop, turned heads and opened eyes after version 1.0 debuted. The feature changes in 1.2 show Spark working not only to improve but to become the go-to framework for large-scale data processing in Hadoop.
Among the changes in Spark 1.2, the biggest items broaden Spark’s usefulness in multiple ways. A new elastic scaling system allows Spark to better use cluster nodes during long-running jobs, which has apparently been requested often for multitenant environments. Spark’s streaming functionality, a major reason why it’s on the map in the first place, now has a Python API and a write-ahead log to support high-availability scenarios.
Spark SQL, which allows Spark jobs to perform Apache Hive-style queries against data, can now work with external data sources via a new API. Machine learning, all the rage outside of Hadoop as well, gets a boost in Spark thanks to a new package of APIs and algorithms, with better support for Python as a bonus. Finally, Spark's graph-computing API, GraphX, has graduated from alpha to stable.
Spark’s efforts to ramp up and expand speak to two ongoing efforts within the Hadoop world at large. The first is to shed the straitjacket created by legacy dependencies on the MapReduce framework and move processing to YARN, Tez, and Spark. Gary Nakamura, CEO of data-application infrastructure outfit Concurrent, believes the “proven and reliable” MapReduce will continue to dominate production over Spark (and Tez) in the coming year. However, MapReduce’s limitations are hard to ignore, and they put real constraints on the work that can be done with it.
Another development worth noting is the expanding support for Python in Spark, and in Hadoop generally. Python remains popular with number-crunchers and is a natural fit for Hadoop and Spark, but most of its support there has been confined to MapReduce jobs. Bolstering Spark’s support for Python broadens the framework’s appeal beyond the typical enterprise Java crowd, as it does for Hadoop in general.
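To see what “confined to MapReduce jobs” has meant in practice: Python code has typically entered Hadoop through the Hadoop Streaming interface, where mapper and reducer scripts read lines on stdin and emit tab-separated key/value pairs on stdout. A minimal word-count mapper in that style might look like the following sketch (the function name and structure are illustrative, not taken from any particular codebase):

```python
"""Sketch of a Hadoop Streaming-style word-count mapper in pure Python.

Hadoop Streaming runs a script like this once per input split, feeding
lines on stdin and collecting tab-separated key/value pairs from stdout;
the framework then shuffles the pairs to reducers keyed on the word.
"""
import sys


def map_words(lines):
    """Yield (word, 1) pairs for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


if __name__ == "__main__":
    # Under Hadoop Streaming, stdin carries the input split.
    for word, count in map_words(sys.stdin):
        sys.stdout.write("%s\t%d\n" % (word, count))
```

The appeal of Spark’s Python APIs is precisely that they replace this stringly-typed stdin/stdout contract with in-process function calls over distributed datasets.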
Much of Spark’s continued development has come through contributions from Hadoop shop Hortonworks. The company has deeply integrated Spark with YARN, is adding security and governance by way of the Apache Argus project, and is improving debugging.
This last issue has been a focus of criticism in the past; programmer Alex Rubinsteyn, for one, has faulted Spark for being difficult to debug: “Spark’s lazy evaluation,” he wrote, “makes it hard to know which parts of your program are the bottleneck and, even if you can identify a particularly slow expression, it’s not always obvious why it’s slow or how to make it faster.”
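The lazy-evaluation problem Rubinsteyn describes can be reproduced in plain Python with generators, which, like Spark’s transformations, defer all work until a result is demanded. In this hypothetical pipeline (all names are illustrative, and the generator is only an analogy for a Spark transformation, not PySpark code), the cost of the slow step is charged not to the line that defines it but to the line that finally consumes it:

```python
import time


def slow_transform(records):
    """A deliberately slow 'transformation', analogous to a Spark map()."""
    for r in records:
        time.sleep(0.01)  # stand-in for expensive per-record work
        yield r * 2


# Building the pipeline is essentially instant: nothing has run yet.
start = time.time()
pipeline = slow_transform(range(50))
build_time = time.time() - start

# Only the terminal 'action' (sum) forces the work, so the half second
# spent inside slow_transform shows up here, far from where the slow
# code was written -- exactly the profiling confusion Rubinsteyn cites.
start = time.time()
total = sum(pipeline)
action_time = time.time() - start

print("build: %.3fs, action: %.3fs, total=%d" % (build_time, action_time, total))
```

A profiler pointed at this program blames `sum` (the action), not `slow_transform` (the transformation), which is why identifying the real bottleneck in a lazily evaluated job takes extra detective work.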