Hadoop is hot. But what is Hadoop? Actually, it is not a specific piece of software. Hadoop is an umbrella project under the Apache Software Foundation (ASF) that includes several core tools for handling data processing on large computing clusters. There is also a large ecosystem of related tools around the core Hadoop project, and there are multiple "Hadoop distributions" from companies like Cloudera, Hortonworks, IBM, Intel and MapR. Each distribution offers some combination of the core tools, ecosystem tools and (often) proprietary replacements for other pieces of the Hadoop pie that the distribution packager considers better in some way.
There is no one tool or set of tools called "Hadoop," so it is wise to react cautiously when vendors claim to offer "Hadoop integration." The vendor may integrate with a single tool in the Hadoop core or ecosystem, or with several, or with none at all. SAP's Hadoop integration suffers from this confusion as much as any vendor's, so I thought it would be worthwhile to dig into exactly how SAP's software integrates with the various Hadoop tools.
First, let's define Hadoop. As I mentioned, Hadoop includes a few core tools. Those are:
- Hadoop Distributed File System (HDFS) is a distributed file system that runs on a large cluster of computers to store huge amounts of data. Other Hadoop tools tend to be set up to use data stored on HDFS.
- YARN (Yet Another Resource Negotiator) is the core cluster resource management framework. Most (but certainly not all) Hadoop ecosystem tools run on a YARN cluster.
- MapReduce is a system for doing parallel processing of large data sets, based on a Google research paper from 2004. This was the original Hadoop but, interestingly, few vendors that offer "Hadoop integration" actually use MapReduce directly.
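To make the MapReduce model concrete, here is a minimal single-machine sketch of its three steps (map, shuffle, reduce) applied to a word count. It is plain Python, not the Hadoop API: the function names are illustrative, and a real cluster would spread each phase across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values collected for one key into a total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["Hadoop is hot", "Hadoop is not one tool"]))
```

Because each call to map_phase and reduce_phase depends only on its own input, the work can be split across machines. That independence is what allows the approach to scale to large data sets.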
Hadoop also has a massive ecosystem of tools built around or on top of these core tools. Some ecosystem projects are also hosted at the ASF. Others live elsewhere. The following are a few key projects hosted at the ASF:
- Hive -- Billed as the Hadoop data warehouse, Hive is actually a distributed database with a Data Definition and Query Language (called HQL) that is similar to standard SQL. Hive tables can be completely managed by Hive, or they can be defined as "external" tables on top of files on HDFS, HBase tables and many other data sources. In this way, Hive is often a gateway to data stored in Hadoop ecosystem tools.
- Pig -- A language and execution platform for creating data analysis programs.
- HBase -- A massively parallel, short-request database, originally modeled on Google's Bigtable research paper.
- Other projects include Spark (in-memory cluster computing and streaming framework), Shark (Hive on Spark), Mahout (analytics algorithms library), ZooKeeper (a centralized service for maintaining information on configuration and other factors), and Cassandra (similar to HBase).
So how do SAP's products integrate with Hadoop tools? They integrate in a variety of ways depending on the product. At the moment, SAP offers what it calls Hadoop integration in SAP HANA, Sybase IQ, SAP Data Services, and SAP BusinessObjects Business Intelligence (BI). Each of these integrates with Hadoop tools differently.
SAP HANA and Sybase IQ both support forwarding queries and other operations to a remote Apache Hive system as if the Hive tables were local tables. In Sybase IQ, this setup is called a "remote database," and in HANA it is configured through the Smart Data Access mechanism. IQ also supports a type of user-defined function, which SAP calls a MapReduce API, for processing data on the database server. Although SAP lumps this API under its Hadoop integration marketing, it has nothing to do with Hadoop.
SAP BusinessObjects BI supports access to Apache Hive schemas through the Universe concept, much like you might connect to any other database. It's worth noting that this type of connection theoretically allows access to data in many different storage systems through Hive's external table concept, including HBase, Cassandra and MongoDB, to name just a few.
So far we've seen that SAP's Hadoop integration is usually just Hive integration. Integrating with Hive via HQL is great and is actually what most vendors mean when they claim Hadoop integration. But it's a bit different from the image of deep integration across the varied Hadoop ecosystem tools that these vendors want to project.
SAP Data Services actually starts to deliver on the Hadoop integration promise a bit more. In addition to the ability to load data to and from Hive, Data Services can create and read HDFS files directly and do some transformation push-down operations using Pig scripts. This means that data can be joined and filtered directly in the Hadoop cluster rather than needing to move to the Data Services server to be processed. Data Services also is able to offload its text data processing onto a Hadoop cluster as MapReduce jobs. So here, SAP is justified in implying deeper integration across multiple Hadoop tools.
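The benefit of push-down can be shown with a toy model. The sketch below is plain Python, not Data Services or Pig code. The rows, column names and function names are all made up for illustration. It only counts how many rows would have to leave the cluster in each approach.

```python
# Hypothetical rows standing in for a large table stored on the cluster.
CLUSTER_ROWS = [
    {"order_id": 1, "region": "EMEA", "amount": 120},
    {"order_id": 2, "region": "APJ", "amount": 75},
    {"order_id": 3, "region": "EMEA", "amount": 40},
    {"order_id": 4, "region": "AMER", "amount": 310},
]


def pull_then_filter(rows, region):
    """No push-down: every row is transferred, then filtered on the ETL server."""
    transferred = list(rows)
    result = [row for row in transferred if row["region"] == region]
    return result, len(transferred)


def push_down_filter(rows, region):
    """Push-down: the cluster filters first, so only matching rows are transferred."""
    transferred = [row for row in rows if row["region"] == region]
    return transferred, len(transferred)


if __name__ == "__main__":
    pulled, pulled_count = pull_then_filter(CLUSTER_ROWS, "EMEA")
    pushed, pushed_count = push_down_filter(CLUSTER_ROWS, "EMEA")
    print("same result:", pulled == pushed)
    print("rows moved without push-down:", pulled_count)
    print("rows moved with push-down:", pushed_count)
```

Both approaches return the same rows. The difference is how much data crosses the network, and that gap grows with the size of the table.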
Lastly, a word of warning: The Hadoop ecosystem moves fast, and enterprise software often lags painfully behind it. According to SAP's product availability matrix, support for Hive, Pig and HDFS is limited to fairly old versions that don't support the latest improvements in performance, high availability and cluster capacity. Check vendor claims of support for your versions of specific Hadoop tools carefully, because Hadoop versioning is confusing and enterprise software vendor representatives may not understand it fully.