The Past, Present And Future Of Big Data

I was once asked to help a team build a transactional, scalable, low-latency service. It has been mathematically proven, however, that these three qualities cannot all be achieved at the same time.

This result is known as the CAP theorem or, more applicable to this situation, its extension, the PACELC theorem. The project team had already spent a lot of money trying to do the logically impossible.

Immutable data

The reason is obvious: Scaling means many servers. Consistency requires tight synchronization and hence communication between the servers. Communication takes time, which adds latency and reduces speed. More servers mean more communication.
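To put a rough number on "more servers mean more communication", here is a minimal sketch. It assumes the simplest possible model, in which every server synchronizes directly with every other server; real systems use smarter protocols, so the function name and the model are illustrative only.

```python
def pairwise_links(servers: int) -> int:
    """Number of direct server-to-server channels in an all-to-all setup."""
    if servers < 0:
        raise ValueError("servers must not be negative")
    # Each of the n servers talks to the n - 1 others; every link is counted once.
    return servers * (servers - 1) // 2


# Doubling the cluster size roughly quadruples the number of channels.
links = {n: pairwise_links(n) for n in (2, 4, 8, 16)}
print(links)
```

Under this assumption the number of channels grows quadratically while the number of servers grows only linearly.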

All big data solutions must deal with this dilemma one way or the other. In the Hadoop days, the solution was to analyze static data. No changes meant no need for synchronization.

That evolved quickly into the concept of immutable files. Files can be added but not changed. Whether a process takes a recently added file into account or not is undefined. This is obviously far from databases and their consistency guarantees, but it works in most batch-oriented scenarios.

When moving towards real time and low latency, more and smaller files are created. Theoretically, every change could be a single file – which would put us back at square one. We would just have replaced the term row with the term file.

The only solution is to keep streaming the data and to periodically merge the historical data and the change data into a common structure. Building such a structure in a transparent way is harder than it looks. Thankfully, Databricks provides Delta Lake as an open-source solution.
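The merge step can be illustrated with a minimal sketch. It is not how Delta Lake works internally; it only shows the idea of folding a stream of change records into an immutable base snapshot. The record layout, the operation names "upsert" and "delete", and the function name compact are assumptions made for this example.

```python
from typing import Dict, Iterable, Optional, Tuple

Row = Dict[str, object]
Change = Tuple[str, str, Optional[Row]]  # (operation, key, new row or None)


def compact(base: Dict[str, Row], changes: Iterable[Change]) -> Dict[str, Row]:
    """Return a new snapshot; the base snapshot itself is never modified."""
    merged = dict(base)
    for operation, key, row in changes:
        if operation == "upsert":
            merged[key] = row
        elif operation == "delete":
            merged.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {operation}")
    return merged


base = {"a": {"qty": 1}, "b": {"qty": 2}}
changes = [
    ("upsert", "b", {"qty": 5}),
    ("delete", "a", None),
    ("upsert", "c", {"qty": 7}),
]
snapshot = compact(base, changes)
print(snapshot)
```

Readers keep using the old snapshot until the new one is complete, which is why no locking between writers and readers is needed in this model.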

Processing engine

The next required component is a query engine to utilize the data. In the Hadoop days, that was MapReduce, which got generalized into the functional programming paradigm. Apache Spark is a prime example. With Spark, the chain of transformations is defined as a set of functions where each transformation function works on the results of the previous function. Because of the functional programming style, the work can be parallelized automatically. Spark provides the programming environment and the engine to distribute subtasks across the entire cluster.
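The following sketch shows the principle in plain Python rather than with the Spark API. A chain of side-effect-free functions is applied to each partition of the data independently, and only the small partial results are combined at the end. The sample data, the function names and the use of a thread pool as a stand-in for a cluster are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def parse(line):
    """Turn 'sensor,value' into a (sensor, value) tuple."""
    sensor, value = line.split(",")
    return sensor, float(value)


def is_valid(record):
    """Keep only non-negative readings."""
    return record[1] >= 0.0


def process_partition(lines):
    """Chain of transformations: parse, then filter, then sum the values."""
    records = map(parse, lines)
    valid = filter(is_valid, records)
    return reduce(lambda total, record: total + record[1], valid, 0.0)


# Two partitions that could live on two different servers.
partitions = [["s1,1.5", "s2,-1.0"], ["s1,2.5", "s3,4.0"]]

with ThreadPoolExecutor() as pool:
    partial_sums = list(pool.map(process_partition, partitions))

total = sum(partial_sums)
print(partial_sums, total)
```

Because no function changes shared state, the partitions can be processed in any order and on any machine without changing the result.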

Interestingly enough, the most frequently used Spark transformations are SQL-style transformations, which are translated into the actual functions under the covers.
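To make the relationship between SQL and function chains tangible, the sketch below runs the same aggregation twice: once as a SQL statement and once as a chain of functions. It uses Python's built-in sqlite3 module as a stand-in for a SQL engine, not Spark SQL itself; the table name, columns and sample rows are made up for the example.

```python
import sqlite3
from itertools import groupby

rows = [("s1", 1.5), ("s2", 3.0), ("s1", 2.5)]

# Variant 1: declarative SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
sql_result = conn.execute(
    "SELECT sensor, SUM(value) FROM readings "
    "WHERE value > 2 GROUP BY sensor ORDER BY sensor"
).fetchall()
conn.close()

# Variant 2: the same logic as a chain of functions (filter, sort, group, sum).
filtered = filter(lambda row: row[1] > 2, rows)
ordered = sorted(filtered, key=lambda row: row[0])
func_result = [
    (sensor, sum(value for _, value in group))
    for sensor, group in groupby(ordered, key=lambda row: row[0])
]

print(sql_result, func_result)
```

Both variants return the same rows, which is the reason an engine can accept SQL from the user and still execute function chains internally.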

Enterprise readiness

To recap, big data needs cheap storage, a Delta-Lake-like feature and a query engine, preferably with the focus on SQL. Using the above components, such a system is possible, but it would be far from enterprise ready. Stability, response times, TCO – hardly anything is optimal. Looking at these requirements, the best option would be a database with cheap storage that combines everything into one product.

There have been various attempts to create this solution, but Snowflake is the first company that has achieved it and is successful with it. The company stores the data in its own file format. Because the files are immutable, storage is cheap. It also has its own version of Delta Lake. As a user, one just executes a load command, and everything happens under the covers. Querying the data can be done via SQL. It runs exclusively in the cloud, with all the pros and cons that entails.

Productivity

While the costs of big data systems are going down, using the data is a tough business. It requires way more than a typical Business Intelligence tool requires from its users: statistical methods, machine learning algorithms, cluster analysis, discrete simulation, finding the limits the model works within – to name a few.

This costs time and money, and the financial impact is unknown. The economist Robert Solow formulated the productivity paradox, stating, “You can see the computer age everywhere but in the productivity statistics.” While there is some controversy about his theory, companies should keep it in mind during big data projects. They should start a project not because they think that “data is the new oil” or that “everybody is doing big data”, but to create financial value. It might easily be the case that, while big data succeeds in creating meaningful insight and thus reduces costs, the project costs are still higher than the savings.

The solution is, as I wrote almost two years ago, to keep the costs of such projects low. Worst-case scenario, the company only loses a little bit of money; best-case scenario, the ROI is huge. Here are some more guidelines:

Stop throwing away data. If a manufacturing system creates sensor data and the data is visualized in a central dashboard, can it be stored permanently at essentially no cost?
Collect more data. At least in cases where it can be done easily.
Integrate data to see the full picture. For example, combining SAP data, sensor data, and website statistics gives a more holistic overview of business.
Don’t add more sensors just to be on the safe side. Take a good, hard look at the business case and decide if more sensors are truly necessary.
If nothing else can be done, at least get your feet wet. For example, Google Colaboratory or similar tools from other cloud vendors offer the opportunity to get acquainted with the topic.
Look for solutions that have a concrete use case. Look for use cases that can be tried in a proof-of-concept together with the vendor. Fraud analysis and process automation are good examples.

As harsh as it may sound, especially coming from me with a background in math and data science, it might be too early to invest in big data analytics and machine learning projects today, simply from a price vs. benefit point of view. The required manpower and skillsets in particular, albeit necessary, are expensive.

Future

The systems will get better, easier to use, smarter and cheaper. TensorFlow as a technology package is already going in that direction, for example. It provides low-level functions to build everything, as well as ready-to-run solutions for very specific projects.

The real value will be prepackaged use cases, something SAP with all its business knowledge is well-suited to provide. I would look out for products in this space and evaluate them individually. Start collecting data for later use, but don’t rely on a long-term big data strategy just yet.

In the first paragraph, I said no system can be database-consistent, scale in a linear way, and have sub-second query times simultaneously. Consequently, there are still a lot of uncertainties. HANA, HANA Cloud, Snowflake and distributed databases all have a lot to improve within the physical boundaries. Every long-term investment made now will likely have to be revisited in the next few years.

About the author

Werner Daehn, rtdi.io
