There are, however, important differences, and imminent enhancements to Snowflake’s capabilities, to be aware of.
Both systems use the SQL standard to read data with all its bells and whistles. Lots of scalar functions, aggregations, nested queries, window functions – everything you need to analyze business data stored in a data warehouse. (If you’re curious, Snowflake’s functions can be found here, Hana’s functions here.)
However, Hana is an HTAP database, a hybrid transactional and analytical processing database. It provides excellent performance for lookups based on a table’s primary key and for joins based on key relationships. In Snowflake, a separate engine, the search optimization service, is required for that. Even then, Snowflake does not enforce primary key constraints the way a relational database must; Snowflake is certainly no OLTP database.
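As a sketch of how that looks in practice (table and column names here are hypothetical, and search optimization is a separately billed Snowflake feature):

```sql
-- Enable the search optimization service for fast point lookups
ALTER TABLE orders ADD SEARCH OPTIMIZATION;

-- This key lookup can now use the search access path
-- instead of scanning every micropartition
SELECT * FROM orders WHERE order_id = 4711;
```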
A huge difference between the two is in the area of data manipulation via SQL statements. Again, both support the usual commands, but while Hana has no issue modifying single rows or large numbers of rows, Snowflake handles data manipulation more like moving a file. Whether the file is large or holds only a single row, the amount of work remains the same. Consequently, modifying large numbers of rows is fine; modifying single rows, not so much.
Does a data warehouse even need single-row updates? A data warehouse is, by definition, loaded by a batch process once every 24 hours, and from that point of view single-row changes are not needed. However, today’s business users want data in near real time and would thus favor loading changes with sub-second latency, which Hana supports. Snowflake has acknowledged that limitation by providing continuous-ingestion functionality via Snowpipe. Such single-row operations are considered one of the most difficult tasks in database design because of locking, concurrency, and data consistency guarantees. If there is just a single session making the single-row changes, as in a data warehouse, the task becomes significantly simpler. Snowflake will surely get better in this area soon.
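To illustrate the consequence (table names are hypothetical), the idiomatic Snowflake pattern is to collect changes and apply them in one batch rather than row by row:

```sql
-- A single-row update forces Snowflake to rewrite the entire
-- micropartition that contains the row
UPDATE orders SET status = 'SHIPPED' WHERE order_id = 4711;

-- Preferred: stage the changes and apply them in one MERGE,
-- so each micropartition is rewritten once instead of repeatedly
MERGE INTO orders t
USING order_changes s
  ON t.order_id = s.order_id
WHEN MATCHED THEN
  UPDATE SET t.status = s.status
WHEN NOT MATCHED THEN
  INSERT (order_id, status) VALUES (s.order_id, s.status);
```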
A data warehouse is loaded using either products provided by the database vendor or third-party tools, and both platforms cover all of these options. Every major data integration tool can load data into both Snowflake and Hana; even SAP Data Services, SAP’s own ETL tool, provides Snowflake connectivity. Both also provide web-based frontends to load data, with Snowflake taking a more scripted approach while SAP provides a graphical data flow editor. Since there are so many options, I’d call it a draw here.
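Snowflake’s scripted approach typically looks like this (the stage and table names are hypothetical):

```sql
-- Bulk-load CSV files from a named stage into a target table
COPY INTO orders
  FROM @order_stage/daily/
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);
```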
End users query the data through business intelligence tools, because writing a SQL statement that returns the desired data correctly is harder than it looks. A simple example: select sum(amount) from orders; returns the total order value, doesn’t it? Wrong! It would add up $5 plus €5 to 10 of something. A currency conversion is needed before summing up the data.
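A corrected version might look like this (the exchange_rates table and its columns are hypothetical):

```sql
-- Convert every order into a common currency before aggregating
SELECT SUM(o.amount * r.rate_to_usd) AS total_usd
FROM orders o
JOIN exchange_rates r
  ON r.currency = o.currency;
```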
Usually, the BI tool vendors feel responsible for this functionality – good old SAP Business Objects with its Information Design Tool is a prime example, SAP BW another.
The Hana team felt that such semantics belong in the database and suggests building calculation views. Frankly, I prefer having the semantic layer in SQL as well, but in the end it’s a personal choice. Snowflake, like most other databases, has no semantic layer and relies on the BI tools. Nothing wrong with that.
That’s a tough one. No matter what I say here, others will have valid counter arguments. I’d still like to try to give an overview.
Snowflake is a big-data tool. It has been designed from scratch to run on a large cluster and integrates well with the big-data world of Hadoop, blob storage, Parquet files, and all the other popular technologies and formats. The main property of a big-data tool is that it scales (linearly) with every additional cluster node, of which there will be many. In Snowflake’s case, the secret is micropartitions: lots of small files, each holding a few hundred thousand rows. When a user executes a query, each cluster node looks at a file’s metadata to decide whether it can contain matching data; if it can, the file is scanned for the matching records. Double the number of nodes and each node only has to look at half of the micropartition files.
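For example (assuming a hypothetical orders table whose micropartitions carry min/max metadata for order_date):

```sql
-- Micropartitions whose min/max metadata for order_date fall outside
-- this range are pruned, i.e. never read from storage at all
SELECT COUNT(*) AS january_orders
FROM orders
WHERE order_date BETWEEN '2021-01-01' AND '2021-01-31';
```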
This concept is also called the separation of storage and compute. All files (storage) can be read by all nodes (compute), and the relationship between the two is completely flexible: one large node doing all the work, for example, or many small nodes working together.
Another requirement for clusters is auto-scaling, the ability to start and stop nodes on demand. In an ideal world, a query would be executed and, while it was running, more nodes would be added to keep the response time in the sub-second area, regardless of how difficult the query was. If the system was idle, nodes would shut down automatically to save costs. In the most extreme situation, not a single node would be active if nobody was using the system at the time.
Snowflake cannot automatically scale within a single query, only across queries. If more queries are executed, more nodes are dynamically added and later stopped, all within a configured maximum size.
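This behavior is configured on the virtual warehouse; a sketch with hypothetical names and sizes (multi-cluster warehouses require Snowflake’s Enterprise edition):

```sql
-- Scale out to up to four clusters under query load,
-- suspend after 60 idle seconds, resume on the next query
CREATE WAREHOUSE report_wh
  WAREHOUSE_SIZE    = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4
  SCALING_POLICY    = 'STANDARD'
  AUTO_SUSPEND      = 60
  AUTO_RESUME       = TRUE;
```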
Hana is a database. It was designed from scratch to run on a single server, or perhaps as a scale-out cluster of two servers. Initially, it did not even offer NUMA support, a hard requirement for larger servers; that came on Hana’s fifth birthday, with Hana 1.0 SPS 10.
But there is nothing wrong with that. Instead of using lots of CPU power across many nodes, Hana uses clever algorithms and the speed of RAM to achieve excellent performance. That limits its size to the multiple terabytes of data that fit into main memory, but by connecting Hana with other solutions it can deal with cases where even that is not sufficient: it supports warm storage via the Native Storage Extension (NSE) feature, and external solutions can be attached via Smart Data Access. It is mostly SAP’s marketing strategy that creates confusion here.
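A sketch of both options (all table and remote-source names are hypothetical):

```sql
-- NSE: move a rarely queried table to warm storage; its pages are
-- loaded into memory on demand instead of keeping the whole table hot
ALTER TABLE sales_history PAGE LOADABLE CASCADE;

-- Smart Data Access: expose a remote system's table as a
-- virtual table that can be joined like a local one
CREATE VIRTUAL TABLE v_external_sales
  AT "REMOTE_SRC"."<NULL>"."PUBLIC"."SALES";
```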
SAP is using smoke and mirrors to deflect from the fact that it has no big data solution. For example, here’s an article about separation of storage and compute in SAP Hana Cloud. SAP does not have a data lake as big data storage, though; it only has an old Sybase IQ disk database rebranded as a data lake. All connections of Hana to the big data world, like Hana Spark Controller, are dying on the vine.
Just imagine Hana and Snowflake working together, wouldn’t that be awesome? Combining the advantages of both worlds without the disadvantages? However, with this approach, SAP would provide a solution for customers that is simple and cheap, and its own offerings would become even harder to sell than they are today. Anyway, how to interact with customers is for SAP to decide.
The obvious point is that Hana Cloud does not support auto-scaling. It cannot resize itself to zero nodes, it cannot scale based on the current workload, and it certainly cannot scale dynamically to speed up a currently running query.
The reason is that Hana’s data lives in main memory. Adding a new node would mean removing some data from the other nodes and adding it to the new one, and this rebalancing operation consumes time and resources. To resize, you would essentially have to stop the database and start it on a different server.
Costs, costs, costs
The impact of all the technology decisions listed above usually isn’t the top priority; the total cost of ownership is. Hana Data Warehouse Cloud is built on the solutions of Hana Cloud and Data Intelligence Cloud, and both are expensive from a licensing and an operational point of view.
Snowflake is so much cheaper, and as it catches up on the aforementioned limitations, lots of SAP customers are moving to Snowflake. Interestingly enough, even Snowflake’s brute-force method of scanning micropartitions can be optimized with indexing, so other cloud data warehouse solutions have a chance to compete in this space. One thing is clear, however: the time of Hadoop, MapReduce, and Apache Spark is essentially over.