
Essentially, with Databricks making Delta open source, you can move away from Databricks to EMR or Presto (with their own optimizations) without incurring a data tax. You're also able to move between cloud providers with ease, as the data sits in low-cost buckets.


They must do. But if you've been in this area long enough, I'd put my money on Databricks, if anything, because of their open-source integrity.


Photon, which was used in their benchmark, is not open source. Don't be fooled by DB.


Apache Spark is an open API. You can build your ETL with it and run it on an open source Spark cluster, an AWS EMR cluster, or a Databricks cluster. It will work across all three (and others) because the API is open.

Vendors can implement that API with their own optimizations. EMR makes optimizations in their implementation and so does Databricks. Photon is a new engine, but it implements the Apache Spark API for better performance. There's nothing to stop EMR or any other Apache Spark vendor from undertaking the same strategy.
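
As a rough, hypothetical sketch (the bucket paths and column names are invented), a PySpark job like this runs unchanged whether you submit it to a self-hosted cluster, EMR, or Databricks, because it only touches the open Spark API:

    # Hypothetical PySpark ETL job; paths and columns are illustrative only.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("portable-etl").getOrCreate()

    # Read raw events from cloud storage.
    events = spark.read.json("s3://example-raw-bucket/events/")

    # A trivial transformation: daily counts per event type.
    daily_counts = (
        events
        .withColumn("day", F.to_date("timestamp"))
        .groupBy("day", "event_type")
        .count()
    )

    # Write the result back out in an open format.
    daily_counts.write.mode("overwrite").parquet("s3://example-curated-bucket/daily_counts/")

Swapping vendors means changing where you submit the job and little else.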

This openness has allowed customers of Hortonworks and Cloudera to migrate their workloads to the cloud more easily than if they had to refactor from something completely different, like Oracle PL/SQL routines.

Snowflake does not have an open ETL API. If you write stored procedures in Snowflake, you can only run them on Snowflake. This is one of the reasons people choose to use dbt with Snowflake. It gives them an open ETL layer to provide future optionality.

There's no reason why you couldn't use Snowflake as the datastore and Spark as the ETL. However, it would be prohibitively expensive to do so. You would need to pay not only for the Spark cluster but also for a Snowflake cluster to export and import the data. Exporting a handful of terabytes from Snowflake can also take hours depending on your cluster configuration.

By storing your data on S3 in an open format, like Apache Parquet or Delta Lake, you can just use a different engine on it without needing to export / import it. In addition to Spark, Presto & Trino are popular engines to use when querying a data lake.
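
Something like this (paths invented, and it assumes your Spark session has the open source delta-spark package configured) reads a Delta table straight off S3; Trino or Presto can point a catalog at the same files with no export step:

    # Hypothetical example: query a Delta Lake table in S3 in place.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("read-lakehouse")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    # No export/import: the engine reads the open-format files where they sit.
    orders = spark.read.format("delta").load("s3://example-lake/orders/")
    orders.createOrReplaceTempView("orders")
    spark.sql("SELECT count(*) FROM orders WHERE status = 'shipped'").show()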

This optionality is ultimately good for customers. If Apache Spark is best for your use case, you can host Spark yourself or choose EMR, Databricks, Cloudera, etc. If Presto is best for your use case, you can choose AWS Athena, Starburst, Ahana, etc. Once you pick the best tech for your use case, you have several vendors to compare against for the best deal.

If I want to move off Snowflake to Firebolt or some other data warehouse, I need to pay both vendors to get my data out and get my data in. Snowflake wasn't around 10 years ago, and if they are not still a good option 10 years from now, I don't want to have to pay them for the privilege of exporting my data. I could rectify that by keeping all my data in a data lake, but now I'm paying to store the data twice.

Open APIs enable an open ecosystem, which encourages competition.


Databricks isn't open source, as they keep hold of all the IP that makes it much better than open-source Spark. Whether you buy Snowflake or Databricks, you're buying proprietary software.


With Snowflake, data is locked away in a proprietary format not accessible by other compute platforms. You need to export/copy your data to a different system to train an ML model in Python or R. With Databricks, you can use Python, R, and Scala (not just SQL) to interface with your data. You can use multiple compute engines (Spark, Presto, and other engines that support Delta), so you are not locked into one compute engine.


This is very true. They make the lowest-common-denominator parts "open source" but control all of the commits. Also, the query engine used for this benchmark, Photon, is proprietary and closed source.


The 'open' here refers to the data. Delta Lake can be read/written by multiple open source engines, not just Spark. Not to mention, if you want, you can use Databricks with Parquet, though the experience won't be as good.

But with Snowflake, the data never comes out. Can't use Spark/Trino/Flink... on data in SF.


Do you have to pay to export data out of Snowflake? Yes. They have a nice guide on how to spend money doing it (https://docs.snowflake.com/en/user-guide/data-unload-overvie...).

Do you have to pay to export data out of Databricks? No, it's already sitting where you want it.

Which one is open? I wonder.


I used Snowflake in my previous company. When we loaded data into Snowflake, we loaded it FROM S3/Blob where we also kept it.


So you were paying to store the same data twice. Once in S3 and once in Snowflake. Why not just purge it from S3 and only keep it in Snowflake?


Not entirely true. There is a bi-directional Spark connector for Snowflake written by Databricks. And exporting your data in bulk out of Snowflake into any number of open formats is incredibly easy using the COPY INTO command. You can also use Snowflake on top of Parquet and even Delta Lake.
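
For what it's worth, a bulk unload is a few lines with the Python connector (the stage, table, and connection details below are made up; see Snowflake's data-unload docs for the real options):

    # Hypothetical unload of a Snowflake table to an S3 stage in Parquet.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="...",
        warehouse="my_wh", database="my_db", schema="public",
    )
    conn.cursor().execute("""
        COPY INTO @my_s3_stage/exports/orders/
        FROM orders
        FILE_FORMAT = (TYPE = PARQUET)
    """)

You pay compute and egress for it, but it's an ordinary, supported operation, not a locked door.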

This is the problem. Both Snowflake and Databricks are spreading FUD and otherwise smart people are falling for it.


It is not a "small" cost. The cost is proportional to the size of the data exported.

For all intents and purposes, large amounts of data are locked into Snowflake. Is it theoretically possible to export a petabyte out of SF? Sure.

Do I want to spend money on it? Not really. That is what I mean by the "data doesn't come out".

"Exporting" a petabyte out of Databricks is a no-op. I can already read Delta Lake from other open source tools.


"Exporting PB from Snowflake" is only ever relevant if you want to move from Snowflake to something else. In that case, all other migration costs (recoding, redocumenting and especially revalidating everything, if in regulated environment) are going to make any cost of data movement irrelevant.

This is just FUD.


I think it's important to understand how this kind of scenario comes up. It's unusual to want to move a whole PB at one time, and yeah in that case these other costs would come up. Problem is, the cost is more insidious than that.

Consider a scenario where data is coming in periodically, say daily, from some source: server logs, sensor data, whatever. The user wants to train models daily on the data, and they also want to do some SQL. Maybe they ingest the data directly into SF and copy it out for training, or they do it the other way round and land it in an object store, then ingest into SF. This is unlikely to be a humongous amount of data; it's probably not a PB. However, it adds up: maybe for some use cases it becomes a PB in a month, maybe in a quarter, maybe it only adds up to a PB in a year.

Thing is, without a Lakehouse architecture, the user will pay to store and copy that data multiple times (at least twice) no. matter. what. They may not pay for a PB in one shot, but you can bet that eventually they'll pay multiple times to store and copy that PB.
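
A loose sketch of the lakehouse version (all names invented): each day's batch is appended to one Delta table in object storage, and both the SQL reporting and the model training read that single copy, with no second warehouse copy and no export/import hop:

    # Hypothetical daily pipeline: one copy of the data serves both SQL and ML.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("daily-ingest").getOrCreate()

    # Append today's raw batch to a single Delta table in object storage.
    batch = spark.read.json("s3://example-landing/2021-11-05/")
    batch.write.format("delta").mode("append").save("s3://example-lake/sensor_events/")

    # SQL reporting reads that table...
    events = spark.read.format("delta").load("s3://example-lake/sensor_events/")
    events.createOrReplaceTempView("sensor_events")
    spark.sql("SELECT device_id, avg(reading) FROM sensor_events GROUP BY device_id").show()

    # ...and so does model training, against the same files.
    training_df = events.select("reading", "label").toPandas()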


It's very relevant if you ever want to do serious ML or anything other than SQL. Of course Snowflake wants you to think that you never need another platform. Every customer knows that's not the case.


So if I stop paying Databricks, I can no longer use their proprietary query engine (Photon), right? I have to use something else, like open-source Spark SQL, which is slower and will cost a lot more money.

There are different ways to lock customers in and both Databricks and Snowflake are playing the game.


I’m not sure this locks anyone in. The APIs are open, and Spark code will run on, say, EMR just fine.

Every vendor, be it Snowflake, Databricks, EMR, Athena, BQ, … charges for use of the engine. The difference with a Lakehouse is that one doesn’t have to pay the vendor for the simple ability to use the data with another offering. That’s what you have to pay for with closed systems, whether it’s data on the way in or data on the way out.


Agreed there is a small cost, but it is possible, which is at odds with your statement "with Snowflake, the data never comes out".


They could in principle. GCP, for instance, does do that. So does HP. And Databricks don't mind that as they have a strong open source legacy. But that takes away the proprietary lock-in strategy of Snowflake.


So much to read. TL;DR: Databricks still holds the world record, and they beat us on price/performance.


You'll need to revisit this. In the last two years, Databricks has built a lead and a bigger moat. They're essentially nice chaps with a huge community backing them. And we all love their open-source tools, which power not only their own big data platform but everyone else's too (AWS, GCP).

