One of these big data rules is to transform the data at query time instead of preparing the key performance indicators (KPIs) upfront and allowing only those to be analyzed.
All too often the rule is executed by dumping all data as-is into the data lake: CSV files, database exports, spreadsheets, text documents, … it is chaos. As a result the data is physically present but essentially inaccessible. "We are drowning in information but starved for knowledge," as John Naisbitt put it.
A data warehouse is supposed to be the single point of truth: everybody can query the "Revenue" and everybody shares the same definition of what "Revenue" does and does not contain. While this is a good idea for many measures, the definition of others is not always as clear.
In weblog analysis the KPI "reading duration" is defined as "the time between accessing one page and accessing another within 10 minutes". Why 10 minutes? What would the impact on the calculated duration be if 9 minutes were used instead? While it is definitely a good idea to provide a KPI with the official "reading duration" definition for correlations, the data scientist will want to do more.
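How sensitive the KPI is to that threshold is easy to probe once the raw timestamps are accessible. A minimal sketch, assuming per-user page hit timestamps and treating any gap below the session timeout as reading time (the function name and sample data are made up for illustration):

```python
from datetime import datetime, timedelta

def reading_duration(timestamps, timeout=timedelta(minutes=10)):
    """Sum the gaps between consecutive page hits that are shorter
    than the session timeout; longer gaps start a new session."""
    ts = sorted(timestamps)
    total = timedelta()
    for prev, curr in zip(ts, ts[1:]):
        gap = curr - prev
        if gap < timeout:
            total += gap
    return total

hits = [datetime(2024, 1, 1, 9, 0),
        datetime(2024, 1, 1, 9, 4),    # 4-minute gap
        datetime(2024, 1, 1, 9, 13),   # 9-minute gap
        datetime(2024, 1, 1, 9, 30)]   # 17-minute gap: new session

print(reading_duration(hits))                               # → 0:13:00
print(reading_duration(hits, timeout=timedelta(minutes=9))) # → 0:04:00
```

With the 10-minute definition the user read for 13 minutes; with 9 minutes, only 4. This is exactly the kind of what-if analysis a pre-aggregated KPI cannot answer.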
Hence the requirement to make the raw data (and consequently big data) accessible. But does that also mean keeping the original source format, i.e. a 1:1 copy of the webserver logs? What happens if that data gets converted from the text format into a compressed binary format that is easier to read? Would that violate one of the data lake rules if 100 percent of the information is preserved, just in a different format?
The practical advantages are huge. What would usually take hours now executes in minutes. Queries no longer fail constantly because some rows contain unexpected characters in the text file. The source file format might have changed over time, but the data lake has dealt with that already (schema evolution). Simply put, the user experience is snappier. It is fun processing the data.
Because storing the data in databases was the only option in the past, many data sources could not be added to the data warehouse at all, or only in a pre-aggregated fashion. Today, I would use the data lake as a data warehouse staging area on steroids.
All data is added in a binary, compressed, columnar table format. The Parquet format* is the de facto standard. This makes it easier for both the data scientist and the data warehouse to consume the data. The data warehouse still contains the aggregated data, using the official definition of each KPI.
I use Parquet files even for binary data like images, as an image carries a lot of metadata: when was it taken, where, at what resolution, what file size? The image itself is just one more binary field within the row.
Another point to consider during the conversion is personal data under the GDPR. For example, I would convert the IP addresses into anonymized values.
The reason the data warehouse plays a significant role in this context is its ease of use and speed. Most users only care about the predefined KPIs; their typical analysis is to create correlations between them. And that is the sweet spot for databases like SAP Hana: smaller volumes, fast aggregations, fast search, many joins.
A data lake is the exact opposite: it can store unlimited amounts of data cheaply, can parallelize processing on a single table efficiently, and provides complex processing far beyond SQL queries.
The combination of the two, data lake and data warehouse, is the winning proposition.
*What is the Parquet format?
Parquet stores relational and nested data in a columnar, dictionary-compressed, partitioned table format. The format looks very much like Hana database files after the delta merge, but it is not in memory and not a database: just queryable tables as files.
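The two core ideas, column-wise storage and dictionary compression, can be illustrated in a few lines of plain Python. This is a toy sketch of the concepts, not the actual Parquet encoding:

```python
# Four weblog rows in row format: column values interleaved per row.
rows = [("/home", 200), ("/blog", 200), ("/home", 404), ("/home", 200)]

# Columnar format: one array per column, so a query touching only
# "status" never reads the "page" values at all.
pages  = [r[0] for r in rows]
status = [r[1] for r in rows]

# Dictionary encoding: each distinct value is stored once and the
# rows hold small integer indices instead of repeated strings.
dictionary = sorted(set(pages))                   # ['/blog', '/home']
encoded = [dictionary.index(p) for p in pages]    # [1, 0, 1, 1]

assert [dictionary[i] for i in encoded] == pages  # lossless round trip
```

Repeated values compress to small integers, and because each column is stored contiguously, those integers also compress extremely well on disk.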