Apache Parquet, Apache Hive, SAP Vora and Exasol are the big data solutions commonly found in the SAP ecosystem.
From a technical point of view, I would like to group these storage options into three categories:
- Files. All the data is stored as simple files and used like tables. These files are by no means trivial comma-separated text files but binary files with structured information – columns and data types. They include indexes and often compression as well. The most prominent format in the big data space is Parquet, a column-oriented, compressed file format that has become the de facto standard. In fact, it mirrors how Hana organizes its data.
- Database Process. Instead of reading the files directly, the user interacts with an active database server process. This process can cache frequently used data, maintains a data dictionary of all tables, and allows reading data via JDBC/ODBC. Apache Hive is one popular example.
- In-Memory. For maximum performance the data is copied into memory and indexed; in the end, the result is essentially a database like Hana. Exasol and SAP Vora are such in-memory big data storage systems.
Option #3 sounds like the perfect solution as it would mean Hana performance and unlimited scaling in compute and storage power. Who needs Hana, then? Or in other words, what’s the catch?
The essence of big data
At its core, the big data world rests on a single principle: build a powerful system out of many small and inexpensive servers. Such a system can scale without any limits while hardware costs grow only linearly.
The problem with this approach is any kind of synchronization between the nodes. A query like “count the number of records” does not need much synchronization. Every node counts the records of its local data and produces a single value. These intermediate results are then summed up to get the overall count. This will indeed work nicely.
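The count example above can be sketched in a few lines – each "node" here is simulated as a plain Python list, so the only cross-node traffic is one integer per node:

```python
# Each "node" holds a local partition of the data (a minimal simulation;
# the partition contents are invented for illustration)
node_partitions = [
    [("A", 10), ("B", 20)],          # node 1
    [("A", 5)],                      # node 2
    [("C", 7), ("B", 1), ("A", 2)],  # node 3
]

# Step 1: every node counts its local records independently - no synchronization needed
local_counts = [len(partition) for partition in node_partitions]

# Step 2: a coordinator sums the intermediate results - one integer per node travels
total = sum(local_counts)
print(total)  # 6
```

This is the pattern that scales beautifully: the work is embarrassingly parallel and the merge step is trivial.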
A join of three or more tables requires deep synchronization – and multiple rounds of it. To join the local partition of table #1 with table #2, each node needs to collect the entire table #2 data from all other nodes. Best case: each node ends up with an equally sized join result and then collects the table #3 partitions from all nodes. Worst case: the intermediate result must be redistributed across the cluster for the next join. This operation is called a “reshuffle”. Even a simple query like “revenue per region and material type” is inefficient in such a case! Keeping the data in memory does not help with this redistribution one bit.
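The reshuffle step can be illustrated with a hash-partitioned join. This is a simplified sketch, not any particular engine's implementation; the two tables and their rows are invented. The key point is that both tables must be redistributed over the network before any node can join locally:

```python
NODES = 2

# Rows of two tables, initially spread across nodes in arbitrary arrival order
orders = [(1, "laptop"), (2, "phone"), (3, "tablet"), (1, "mouse")]  # (customer_id, item)
customers = [(1, "EMEA"), (2, "APJ"), (3, "EMEA")]                   # (customer_id, region)

def reshuffle(rows, key_index):
    """Redistribute rows so that equal join keys land on the same node."""
    partitions = [[] for _ in range(NODES)]
    for row in rows:
        partitions[hash(row[key_index]) % NODES].append(row)  # network transfer!
    return partitions

# Both tables are moved across the cluster before the join can run
order_parts = reshuffle(orders, 0)
customer_parts = reshuffle(customers, 0)

# Only now can each node join its local partitions independently
joined = []
for o_part, c_part in zip(order_parts, customer_parts):
    regions = dict(c_part)
    joined.extend((cid, item, regions[cid]) for cid, item in o_part)

print(sorted(joined))
```

Every additional join in the query may trigger another such redistribution of the intermediate result – that is the cost the prose above describes.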
In contrast, SAP Hana is a real database: single records can be changed, searching is blazing fast, join performance is brilliant, and it offers full transactional consistency for reads and writes. All of these capabilities require a higher degree of synchronization. Hence Hana cannot scale indefinitely.
Optimizing the data model for queries
A common solution to the reshuffle problem is to align the data partitioning with the queries. As there are different queries, the same data is stored multiple times, each copy with different partitioning rules. This in turn increases the costs and limits the flexibility – the very reason the data lake was built in the first place. That does not sound appealing.
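The duplication cost is easy to see in a sketch. Here the same (invented) fact table is stored twice with different partition keys – one layout serves per-region queries locally, the other serves per-material queries, and the storage bill doubles:

```python
# The same fact table, stored twice with different partitioning rules
# (hypothetical layout, purely to illustrate the duplication cost)
sales = [("EMEA", "steel", 100), ("APJ", "steel", 50), ("EMEA", "plastic", 30)]

def partition_by(rows, key_index, num_nodes=2):
    """Assign each row to a node based on the chosen partition key."""
    parts = [[] for _ in range(num_nodes)]
    for row in rows:
        parts[hash(row[key_index]) % num_nodes].append(row)
    return parts

by_region = partition_by(sales, 0)    # fast for "revenue per region" queries
by_material = partition_by(sales, 1)  # fast for "revenue per material" queries

# Both copies together hold every row twice - storage grows with each query pattern
total_rows_stored = sum(map(len, by_region)) + sum(map(len, by_material))
print(total_rows_stored)  # 6 rows stored for a 3-row table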
Transactional consistency between the nodes obviously requires synchronization as well. Since synchronization hinders indefinite scaling, the only way out is to loosen the consistency guarantees. Terms like “eventual consistency“ are used in that context. In the end, such systems resemble less and less what we would call a database.
This trade-off is described by the CAP theorem. Of the three properties – Consistency, Availability, Partition tolerance – it is impossible to achieve all, especially in case of an error. A highly distributed system (= many partitions) will not be perfect in terms of consistency. A system that is perfect in terms of consistency will not rank high on availability and/or scaling.
Hana + big data
Putting technology questions aside for a minute: what data is stored in the database, and what kind of data will be put into the big data storage? In most cases, the big data store will contain raw data – the free-form text of a support message, an image of the faulty part, billions of repeating temperature readings, you name it.
There is interesting information buried within the raw data, but it needs to be extracted; extracted with rather complex algorithms and certainly not via SQL queries.
But if that is the case, an SQL-based big data solution – and therefore SAP Vora – does not make sense. The combination of the data lake and the data warehouse does.
The data lake stores the massive amounts of raw data in Parquet files, which is the cheapest option and fast enough. The raw data is turned into information and added to the data warehouse, which is best suited for sub-second queries with joins, comparisons and other SQL operations. The combination of the two brings the advantages of both and eliminates each system’s disadvantages.
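The extraction step in this architecture can be sketched in a few lines. The sensor readings below are invented sample data standing in for the billions of rows in the lake; only the condensed per-sensor, per-day aggregates move on to the warehouse:

```python
from collections import defaultdict

# Raw data in the lake: repeating temperature readings (tiny invented sample)
raw_readings = [
    ("sensor-1", "2024-05-01", 21.4),
    ("sensor-1", "2024-05-01", 22.0),
    ("sensor-1", "2024-05-02", 19.8),
    ("sensor-2", "2024-05-01", 30.1),
]

# Extraction: condense the raw readings per sensor and day
buckets = defaultdict(list)
for sensor, day, temp in raw_readings:
    buckets[(sensor, day)].append(temp)

# Only the condensed information reaches the warehouse, ready for fast SQL
warehouse_rows = sorted(
    (sensor, day, min(temps), max(temps))
    for (sensor, day), temps in buckets.items()
)
print(warehouse_rows)
```

Real pipelines would of course run far richer algorithms (text mining, image analysis), but the division of labor is the same: cheap bulk storage in the lake, refined information in the warehouse.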