The biggest obstacle is the term ‘big data’ itself. It suggests nothing more than mass data, yet the data in an ERP system and other databases are mass data as well. Big data means quantities too big for traditional databases, either in absolute terms or in terms of cost-effectiveness.
Data structuring presents another obstacle. In an ERP system, 99 percent of data are structured. The remaining one percent are texts like orders and invoices. With big data, it’s the other way around. All important information is unstructured. Of course, it’s interesting to know when and where a picture was taken, but it’s more interesting to know what’s on it.
In my opinion, the most important definition of big data is ‘all data which cannot yet be used to generate value’.
Here’s an example of what I mean. Purchases are always documented. What isn’t documented, however, is everything else. How did the customer notice the product? Did they see an ad for a specific product? Do customers only skim the product details and buy right away? Or do they meticulously read through the technical details and still not buy the product?
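To make the example concrete, here is a minimal sketch in Python of how such undocumented behavior could be turned into answers once the raw click events are captured. The event names (`ad_view`, `detail_view`, `purchase`), the sample data, the two-minute threshold and the helper `classify_sessions` are all assumptions made for illustration; they are not taken from any particular shop or ERP system.

```python
from collections import defaultdict

# Hypothetical click events: (customer_id, event_type, seconds_spent).
EVENTS = [
    ("c1", "ad_view", 3),
    ("c1", "detail_view", 5),
    ("c1", "purchase", 0),
    ("c2", "detail_view", 240),
    ("c3", "detail_view", 8),
    ("c3", "purchase", 0),
]


def classify_sessions(events, long_read_seconds=120):
    """Group events per customer and label each customer's behavior."""
    per_customer = defaultdict(list)
    for customer, event_type, seconds in events:
        per_customer[customer].append((event_type, seconds))

    labels = {}
    for customer, items in per_customer.items():
        types = {event_type for event_type, _ in items}
        # Total time spent on product detail pages (assumed threshold above).
        read_time = sum(s for t, s in items if t == "detail_view")
        bought = "purchase" in types
        if bought and read_time < long_read_seconds:
            labels[customer] = "skimmed and bought"
        elif bought:
            labels[customer] = "read carefully and bought"
        elif read_time >= long_read_seconds:
            labels[customer] = "read carefully, did not buy"
        else:
            labels[customer] = "browsed, did not buy"
        if "ad_view" in types:
            labels[customer] += " (saw ad)"
    return labels


if __name__ == "__main__":
    for customer, label in sorted(classify_sessions(EVENTS).items()):
        print(customer, "->", label)
```

In this sketch, only the two `purchase` events would ever show up in the ERP system. Customer `c2`, who studied the details for four minutes and then left, would leave no trace there at all.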
Now that we’ve discussed what big data is, we can turn to the question of the right big data architecture.
Especially in big data, innovations come and go. A few years ago, MapReduce on Hadoop was a must-have; now we have Apache Spark, which offers better performance. Some time ago, Apache Hive was the way to go; now, it’s Parquet files. This dynamic environment makes cost-efficiency and flexibility imperative.
Apache Spark offers great performance while still offering the desired flexibility, which is why so many projects worldwide rely on it. The installation is easy, complex transformations need only a few lines of code, and the software is free of charge.
By adding Apache Spark to their existing data warehouses, customers avoid having to install expensive new BI systems and can offer users new key figures in their tried-and-tested tools.
The future of big data
Up until now, storing and analyzing the information in big data was simply not worth the cost. The only way to process big data was with database tools, which could not deal effectively with so much unstructured data.
Now, new tools are leveling the playing field. With the Hadoop Distributed File System (HDFS), cheap computer components can be combined into large file systems, making expensive disk arrays obsolete. Apache Spark is able to process big data with complex algorithms, statistical methods and machine learning.
Data warehouse tools, including those from SAP, have adapted to big data: they offer direct access to Hadoop files or hand off transformation tasks to connected Spark clusters. One of these solutions is the SAP HANA Spark Connector.