Apache Spark Continues to Forge Ahead with Help from IBM’s Spark Technology Center

Apache Spark™ puts deep and broad advanced analytics capabilities in the hands of the masses. Whether you are a data scientist, data engineer, analytics app developer or citizen analyst, Spark delivers sophisticated analytics more simply, quickly and efficiently than ever before.

Spark is currently one of the most active open source projects for big data. The latest release, Spark 2.0, is the result of nearly 2,500 contributions, with consistently more than 100 contributors per month. The new release is a significant milestone and builds on the input of the rapidly growing user and developer community. The Spark 2.0 release has been summarized as easier, faster, and smarter [1]. Notable improvements include streamlined APIs, expanded SQL capabilities, improved performance, and structured streaming. The release solidifies Spark’s leadership position as the premier big data platform.

IBM has a long tradition of supporting open source projects, and was the very first supporter of the Apache Software Foundation from its inception in 1999. Through its flagship organization, the Spark Technology Center (STC), IBM is expected to have a massive impact on industry-wide adoption of the technology, which has added to the big data community’s excitement about the project.

The STC’s core mission is to contribute to the Spark community and to expand the core technology to make it enterprise- and cloud-ready. It is also fueling the adoption of Spark in the business community. Through education and outreach, IBM is building data science skills and driving intelligence into business applications.

IBM’s commitment to open source is especially evident in this latest Spark release. Nearly 18% of all non-trivial features, improvements and bug fixes, and 16% of all JIRAs, were contributed by the Spark Technology Center, making IBM the number two contributor to the Apache Spark Project.

In machine learning, the Spark Technology Center contributed no less than 42% of the new features and 24% of the enhancements. The STC has contributed 44% of all lines of code (LOC) worldwide to the PySpark component and over 25% of LOC in Spark ML. Significant code contributions were also made in SparkR, the Web UI and many other components. In Spark SQL, Spark’s most active component, IBM drew on its long-standing SQL experience to contribute 25% of all bug fixes for the new release.

All of this makes the commitment and contribution of IBM’s Spark Technology Center to Spark undeniable. The future is bright for Apache Spark, and IBM is proud to be an active contributor and looks forward to continuing the tradition of excellence.

To find out more about the work of the Spark Technology Center, visit spark.tc and follow us at @ibmcodait.

Dinesh Nirmal – Vice President, IBM Analytics Development

Follow me on Twitter: @dineshknirmal

TRADEMARK DISCLAIMER: Apache®, Apache Spark™, and Spark™ are trademarks of the Apache Software Foundation in the United States and/or other countries.

[1] http://goo.gl/zSbFq9