IBM stole the day-one headlines at Spark Summit 2015 in San Francisco with a big endorsement of the open-source, big-data-analysis platform. But it’s sure to be a selective embrace, as IBM, like other commercial vendors, plans to offer its own software and services on top of Spark.
IBM threw its significant weight behind Apache Spark on Monday, calling the open-source, in-memory platform “potentially the most significant open-source project of the next decade.”
Among the moves announced, IBM will offer Spark as a service on its BlueMix cloud, open a Spark development center in San Francisco and redirect more than 3,500 IBM researchers and developers to work on Spark-related projects. It also promised to educate more than 1 million data scientists and data engineers on Spark through community partnerships and support for online courses.
All of the above is great news for the Spark community. But is Databricks, the Spark development, certification and support firm, in danger of being eclipsed by big companies embracing the platform? Spark is the darling of the conference circuit this year, with Databricks executives often showing up at Informatica World, Alteryx Inspire15 and other events as keynote speakers. Even when official representatives aren’t there, Spark is often mentioned as a “Spark inside” enabler of new big data initiatives, as was the case at the Teradata Influencers’ Summit.
But the embrace of Spark isn’t always wholehearted. That’s because the platform supports multiple modes of analysis, including machine learning, SQL, R, graph and streaming. Hadoop distributor Cloudera, for example, was early to jump on the Spark bandwagon, but it touts the platform’s machine learning capabilities, not Spark SQL, which presents a threat to Cloudera’s Impala SQL-on-Hadoop component. Hortonworks and MapR also support Spark, but they give equal billing to Hive and Drill, their favored SQL-on-Hadoop options, while invariably showing Apache Storm in architectural diagrams as the streaming option instead of (or in addition to) Spark Streaming.
I’m set to hear more about IBM’s specific Spark plans here in San Francisco this week, but at last week’s Hadoop Summit in San Jose, a few IBMers informally told me the company is mostly interested in using the Spark in-memory platform and machine learning options. As for Spark SQL and Spark Streaming? These are two areas where IBM can offer its own technologies. What’s more, IBM is contributing its own SystemML machine learning software to the Spark community, building influence in this core area.
With a Spark service now available on BlueMix and thousands of IBMers now working on Spark-based applications, Databricks will see new competition to its eponymous Databricks platform (formerly called Databricks Cloud), which runs on Amazon Web Services. IBM’s move is also a challenge to analytics leader SAS, which has spent the last three years developing SAS Visual Analytics and Visual Statistics as its choice for in-memory big-data analysis (either on top of Hadoop or on a dedicated distributed cluster).
Even if commercial plans lie behind IBM’s embrace of Spark, Databricks executives weren’t about to throw cold water on any endorsements of the platform. “It’s great to see some of the large vendors in the community throwing their weight behind Spark,” Databricks executive Arsalan Tavakoli-Shiraji told me last week. “SAP is integrating Hana with Spark, IBM is embracing it, and Intel is also making a lot of contributions, so it’s great to see the community growing.”
Stay tuned for more from me this week from IBM, SAS and the Spark Summit as the fast-moving big-data analysis world moves even faster.