Are there tangible benefits of integrating Apache Spark with SAP HANA, or is it just spin?
SAP and Databricks recently announced integration between SAP's HANA platform and the Apache Spark ecosystem. HANA is SAP's in-memory database and application platform. Spark is an Apache project that provides a memory-centric alternative to the Apache Hadoop data management toolset. Spark focuses on processing data in RAM, but can also process disk-based data. It was designed partly to facilitate interactive analysis of data in HDFS (the Hadoop Distributed File System), which has historically been a very weak point for the Hadoop ecosystem. Databricks was founded by the creators of Apache Spark and employs many of the main Spark developers.
According to SAP, the HANA-Spark integration is via HANA's Smart Data Access (SDA) mechanism. This same mechanism is used for the HANA-Hive integration, and is effectively a way of making data in another database available in HANA in a federated manner. Spark supports the Hive Query Language (HQL -- a close variant of SQL) through Shark (Hive on Spark) and through Spark SQL. It appears that SAP accomplishes this integration via an interface included in SAP's Spark distribution that sits on top of Spark SQL.
This could allow HANA to access Hive through Spark, but it's unclear what integration is offered beyond that. If only Hive integration is offered, it would be a stretch to call this "Spark integration," just as it is a stretch to describe integration with Hive as "Hadoop integration." We'll have to wait to hear more from SAP on the exact access scenarios that are supported under SAP's Spark distribution.
Databricks' blog post on the topic gives a bit more detail on accessing HANA from the Spark side. According to Databricks, Spark will support push-down of SQL operators into HANA. This plan has not yet been realized, but I take it to mean that it will be possible to use Spark to analyze HANA data, with operations such as filters and aggregations offloaded to HANA where appropriate. This may be accomplished by adding HANA as a data source for Spark SQL, just as Hive is supported currently, but details are even slimmer here, so we'll have to see what develops on the Spark end.
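To make the idea of push-down concrete, here is a minimal sketch in Python. It is not SAP's or Databricks' implementation, and it uses neither Spark nor HANA: the standard-library sqlite3 module stands in for the remote database, and the sales table, its rows, and both function names are invented for illustration. It only contrasts fetching every row and computing in the client with sending the filter and aggregation to the database.

```python
import sqlite3


def make_remote():
    """Build a stand-in for the remote database (HANA in the article).

    sqlite3 is an assumption made so the sketch runs anywhere; the table
    and its contents are hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("EMEA", 100), ("EMEA", 250), ("APAC", 75), ("AMER", 300), ("EMEA", 50)],
    )
    return conn


def total_without_pushdown(conn, region):
    """Fetch every row, then filter and aggregate on the client side."""
    rows = conn.execute("SELECT region, amount FROM sales").fetchall()
    total = sum(amount for row_region, amount in rows if row_region == region)
    return total, len(rows)  # (result, rows transferred)


def total_with_pushdown(conn, region):
    """Let the database filter and aggregate; fetch only the result row."""
    rows = conn.execute(
        "SELECT SUM(amount) FROM sales WHERE region = ?", (region,)
    ).fetchall()
    return rows[0][0], len(rows)  # (result, rows transferred)


remote = make_remote()
print(total_without_pushdown(remote, "EMEA"))  # (400, 5)
print(total_with_pushdown(remote, "EMEA"))     # (400, 1)
```

Both calls return the same total, but the second moves one row instead of five. On a table with billions of rows that difference is the whole point of push-down, and it is presumably what the Databricks plan is aiming for.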
The meager details seem to be the primary feature of this announcement. For now, we need to wait and see how things play out. The biggest value here may just be introducing the Spark and SAP HANA communities to each other. Both systems have important strengths that are worth looking at carefully.
About the author:
Ethan Jewett is an independent consultant and SAP Mentor. He focuses on business intelligence, information management and performance management, and works with clients on data management and performance management tools. For more information, check out his blog.