Moving product recommendations from Hadoop to Redshift saves us time and money (monetate.com)
84 points by jpat on June 19, 2014 | hide | past | favorite | 62 comments


These types of articles baffle me: you're comparing a high-performance analytical database to a batch-oriented SQL engine.

The whole point behind these query engines on Hadoop (Hive, Presto, Impala, etc.) is to separate the database from the query engine. With these engines you can project schemas over raw data in its original form, without having to load it into a table. With Redshift, or other similar analytical databases, you're forced to define a schema and then load the data in row by row; bulk inserts are very slow in comparison to Hadoop technologies.
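As a rough sketch of what "projecting a schema over raw data" means in practice (the field names and the tab-separated layout here are invented for illustration, not anything from the thread):

```python
from datetime import datetime

# Hypothetical schema projected over raw tab-separated log lines at read
# time -- the raw file is never loaded into a database table first.
SCHEMA = [("ts", datetime.fromisoformat), ("user_id", int), ("url", str)]

def project(raw_line):
    """Apply the projected schema to one raw record as it is read."""
    fields = raw_line.rstrip("\n").split("\t")
    return {name: cast(value) for (name, cast), value in zip(SCHEMA, fields)}

row = project("2014-06-19T12:00:00\t42\t/products/123")
# row["user_id"] is now a real int, row["ts"] a real datetime
```

Engines like Hive do essentially this behind a SerDe, which is why the same files can later feed Spark, Mahout, or anything else without a reload.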

Regardless, Hive in general should never be used for interactive analytics; that's not what it's intended for. Where Hive shines is when you can dump 250TB of raw text data into a folder and then run a SQL query to extract useful information from it. The extracted data can then be loaded into an RDBMS like Redshift for real-time reporting.

With all that being said, if you want to run SQL queries on data in Hadoop at the speeds of Redshift, you should have used Impala with Parquet, which is known to be even faster than Redshift in many cases, and is based on the same technology Google uses (Dremel and F1). The benefits of keeping your data in Hadoop are enormous; not every problem can be solved using SQL. The same data you're querying with Impala could also be used for machine learning with Spark or Mahout. Maybe you want to start indexing one of your tables into Solr to provide search capabilities on a subset of your columns to your users, or maybe you want to use Giraph or Spark's GraphX to do parallel graph computation. The data never moves; there's still only ONE copy of that data in Hadoop, and you can bring any kind of workload to it.


So we recently did this same switch. I can say that in our experience data pulls that were taking (on a good day) 6 hours now take under 10 seconds. There are some jobs we had that literally took days to run that now complete in minutes. Redshift is amazingly simple to use, crazy fast and so very cost effective.

Generally we're dealing with datasets that are around 1-3TB, and pretty well organized. It's just amazing how forgiving Redshift is when it comes to unusually written SQL, and how useful it is to us as a business. Extracting serious insights was once a job that only a few people could do; now it's something that anyone with a SQL background can manage.


Redshift is an especially limited SQL engine considering it doesn't support UDFs. It is wicked fast, but what you gain in speed you lose in flexibility. Current (well, February, but fairly current) benchmarks[0] place Impala and Shark (SQL on top of Spark) within grasp of Redshift while pulling data from disk and, for certain workloads, on par with or faster than Redshift. This is without using a columnar file format.

Impala is impressive technology, but it does require you to run dedicated Impala daemons, as it doesn't use MapReduce under the hood. Shark is especially interesting, however, because it is fast AND built on top of Spark, so you can run raw Spark jobs, SQL queries, graph processing, and ML all on the same cluster. Shark currently uses Hive to generate its query plans, but the Spark project is working on implementing its own SQL engine called Catalyst[1] that promises to be a significant improvement.

[0] https://amplab.cs.berkeley.edu/benchmark/

[1] https://spark-summit.org/talk/armbrust-catalyst-a-query-opti...


Although I have a lot of respect for the AMPLab, they did not do their due diligence with that benchmark, for a few reasons: they didn't test using columnar storage in Hadoop (ORC / Parquet), which is what Redshift is using underneath (a proprietary columnar store); the most complicated query they ran was a two-table join; and from what I can tell, there wasn't any concurrent workload testing.

(Disclaimer: I'm a Cloudera employee.)

I recommend checking out the following blog post, not because my employer wrote it, but because the guys behind the benchmark did an incredible job making it competitive. They also show metrics that a lot of other people are not showing, for example concurrent workload capabilities and CPU efficiency. It covers Impala, Hive (on Tez), Shark, and Presto:

http://blog.cloudera.com/blog/2014/05/new-sql-choices-in-the...


Impala did not support SerDes last time I checked, which limits its usage for certain cases. And I would not treat any benchmark too seriously, since each vendor probably only knows how (or is willing) to tune its own products. Check the latest Spark SQL benchmark. http://databricks.com/blog/2014/06/02/exciting-performance-i...


You pay a significant resource penalty when using SerDes, and since performance is one of the biggest priorities for the Impala team, we decided to leave this out for now. A very common workaround is to use Hive to generate Parquet data from your custom data (using SerDes), and then use Impala to query the Parquet data.

I disagree with your statement about not treating benchmarks from vendors seriously. As the article mentions, we made an effort to make these queries run as efficiently as possible, even going so far as rewriting queries on competing engines to make them run faster. In fact, Databricks' engineers assisted us in making the Shark benchmarks as good as they could possibly get. The benchmark that I linked is very thorough, and even supplies the exact queries / scripts we used to perform the tests so you can run them yourself.


Much like the Clouderan commenter, I wouldn't put a lot of stock in Berkeley's Big Data Benchmark. I reran a similar test with columnar storage and found Impala handily beats Shark. Operationally it's also much easier to deploy (provided you're on EMR or CDH). The "dedicated nodes" argument is kind of FUD: you can use Llama for resource sharing, and you need to colocate impalad with DataNodes to achieve decent performance anyway. So YARN, Spark, and Impala can all play nicely together on the same cluster.

The queries and data set Berkeley chose are bizarre. TPC-DS or TPC-H is much more representative of real-world performance, and the differences become more pronounced as the queries get more complex.

edit: I also don't understand why the Spark team is reinventing the wheel with Spark SQL when Hive running on Tez will produce very similar query plans. The two projects are converging on the same place, but they insist on having divergent code bases ;)


Llama+Impala isn't quite ready for prime time in my experience. The biggest issue is the reliance on Impala's query size estimates to determine how many resources to request from YARN. We find that these estimates are frequently an order of magnitude or so away from reality.


Agreed, and Llama also doesn't support high availability at the moment (soon to be fixed). We rely heavily on up-to-date table/column statistics to accurately determine resource consumption, and unfortunately Impala doesn't currently have incremental/background stats, something that should be in the 2.0 release.


Actian Matrix (formerly ParAccel) does support UDFs, but you'd have to run your own cluster on premises.


I agree with almost everything you said, but I disagree about Impala, right now.

It has great potential, but I don't think it's prod-ready yet.

Also, why no mention of HBase?


Although Impala is still a fairly new product, my team has been using it internally at Cloudera in production for over a year for real-time log analysis for our support engineers (http://bit.ly/USFQdh), among other ad-hoc BI analytics. We also have a bunch of customers who are using Impala to power very critical interactive workloads. What about Impala makes you feel like it's not production ready?

Good question about HBase. I didn't mention HBase because although it's a super fast NoSQL database, it's a lousy analytical database. Sure, it's great at doing really fast scans over small slices of data (and/or updating data), but full table scans are extremely slow compared to analyzing flat files in HDFS. For doing analytics in Hadoop, the format you almost always want is Parquet. Not only is reading files directly from HDFS faster, but Parquet is a true columnar store, so you pay a minimal IO penalty for queries since you only read the necessary data. Also, Parquet uses some really efficient column encodings (dictionary, delta, run-length, ...) to reduce IO and to increase the effectiveness of compression.
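To make the encoding point concrete, here's a toy sketch of run-length and dictionary encoding (illustrative only, not Parquet's actual wire format):

```python
def run_length_encode(column):
    """Collapse runs of repeated values into [value, count] pairs."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs

def dictionary_encode(column):
    """Replace repeated strings with small integer codes plus a dictionary."""
    dictionary, codes, index = [], [], {}
    for value in column:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes

# Sorted, low-cardinality columns (common in analytics) collapse dramatically.
country = ["US"] * 5 + ["DE"] * 3
# run_length_encode(country) -> [["US", 5], ["DE", 3]]
```

Because each column is stored and encoded separately, a query touching two of fifty columns reads a tiny fraction of the bytes a row store would.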


I knew I smelled me some Cloudera... :)

HBase:

I think HBase (based on the sorting of qualifiers within rows) would be suited to the "ranking" problem; that's why I brought it up. I see this as being a map-only job (and possibly suited to streaming, or not even using Hadoop at all). It would just be a quick scan/filter/pagination and then a quick ranking algo in some sort of API middle layer (that's how I envision it).

Impala:

I started using Impala around the 1.2.(don't remember) version, which was at the tail end of CDH4. I found that minimal increments (for instance from 1.2.1 to 1.2.2) would change query behavior and results. We were also using Impala with its HBase connectivity, which I found to be very poor and about 100x slower than Hive+HBase. If I wanted parallelism in my queries against HBase tables, I had to run my queries between row keys for each region and use some sort of "union all", which would increase performance and parallelize the query. Honestly, I'd consider dropping HBase from Impala until it can be made more stable and consistent with what you'd expect from SQL queries. Some of the results from Impala+HBase didn't make any sense (it's just a storage engine for Impala, right?), like joins and null handling. If I created these tables as Parquet (or even MySQL) with the same data and ran the same queries, Parquet and MySQL would agree, but Impala+HBase would diverge.

I think that Impala really kicks ass for ad-hoc and infrequently run queries, but if you have a lot of concurrent queries, I don't think it handles the load very well (compared with something like Vertica). Perhaps this could be improved upon? We'd love to replace Vertica, and it seems that the only other product in its class is Impala.

I tried to use Parquet, but Parquet is really only suitable for bulk loads (not trickle loading). I was impressed with Parquet's query speed, but I had hard requirements preventing me from doing bulk loads. Impala+Parquet does deliver real-time queries/results, but the data can't be put in there in real time, so I think this deserves a little asterisk.

BTWs:

BTW #1, do you have any metrics/data for the newer HBase (0.96.1.1+) and table scans? I find that I can table scan pretty well with a POC I put together on EC2: I can scan ~3bn records (about 500m rows) per hour on an 8-node (7 active) cluster with 30.5GB RAM and 800GB SSD (i2.xl) on EC2. The company I'm currently at may be taking up some serious HBase. After pre-splitting my regions and disabling region splitting, I was able to keep it very stable under concurrent reads and writes, without doing batched mutations. Before I disabled splitting, I was having split/compaction storms that kept downing HBase. I use Snappy compression on all CFs and row-level Bloom filters.

BTW #2, your Cloudera ad retargeting at me is wasting your money. We're already among Cloudera's paying customers. Just an FYI. :)

BTW #3, certain Cloudera-managed services (like HBase region servers) get a "kill -9" during long GC pauses (this may just be CDH 4-specific); it would be nice if we could turn that off. Sometimes I don't mind some GC, but region servers getting "kill -9"ed one after another just causes a cascade of badness.

Please don't think I'm shitting on you. I love Cloudera. As far as the Hadoop ecosystem goes, Cloudera is my _only_ choice. I cringe when people say MapR (very pushy inside sales, a pain to install) or Hortonworks (too young). I've been using Hadoop since 2007, if it matters.


Yes, the HBase scanners in Impala are not very fast, and we know that. This is an area that needs improvement to maximize parallelism, but right now there are a bunch of things on the Impala roadmap that take priority (disk-based aggregations/joins, window functions, nested data, order by without limit), to name a few.

As for Parquet, that file format is not designed for streaming; like you mentioned, it's meant for converting large datasets that you plan on running analytics on. Queries against data in Parquet are _fast_, like really fast... I've seen queries go from 200 seconds down to 5 seconds just by converting the dataset from text to Parquet.

Concurrency in Impala is actually pretty good, and has always been a design goal from the beginning. I wouldn't compare Impala to Vertica or other analytical databases just yet; there's still a lot of room for improvement. But concurrency in Impala is much better than in the other SQL-on-Hadoop engines (Hive, Presto, etc.), and we've demonstrated that in our latest round of benchmarks.

BTW #1 - As I mentioned, HBase support in Impala is pretty minimal at the moment, but it still works fine for ad-hoc queries over small key spaces.

BTW #2 - hehe! I'll let our marketing team know :P

BTW #3 - I'm not sure what you mean here, I'll ask around to see if someone knows.

Thanks for your kind words! Sorry for the late response, I just saw that I had a response :)


I chuckled when I read "We have a legacy data warehouse based in Hive and Elastic MapReduce, with backing data stores in S3.". I guess things have come full circle. It wasn't long ago that a relational database solution would have been "legacy".


But "legacy" doesn't only mean the technology got old; it can also mean the solution got old. Maybe it was just a bad solution, and they're moving to a new solution, not necessarily a new platform.


I did too, mainly because S3 + Hadoop isn't going to provide the locality and speed that HDFS would. I don't think it's a fair comparison.


I've gone through a similar transition (Hive to Redshift) in a very large-scale data environment. Raw Hadoop / Cascading is still very useful for more complicated workflows, but Redshift is so vastly superior to Hive it's not even funny. I thought I would miss adding my own UDFs, but this hasn't been an issue at all. I'm under the impression Presto is a similar improvement, but I haven't spent any time with it.

One huge advantage of Redshift over Hive: you can connect with plain old Postgres libraries, so you can build Redshift results into your admin interfaces, one-off scripts, and anywhere else you're fine trading a few seconds of latency for extra data.


Just as a quick note: you can use Postgres libraries because Redshift is a slightly modified version of Postgres 8.1 under the covers. In fact, almost all massively-parallel-processing (MPP) databases are Postgres under the covers (including Microsoft's PDW). It really speaks to how impressive Postgres is at scaling. Even old releases, like 8.1!


I work with the folks who built PDW. It's all SQL Server now. That said, I'm often amazed at how many commercial DB products are based on Postgres and other open-source DBs. Postgres has a nice page on its site showing all the products derived from it, for example Netezza and Pivotal's Greenplum. Relational DBs and SQL (especially SQL) are far from dead.

Link: https://wiki.postgresql.org/wiki/PostgreSQL_derived_database...


Anyone who likes Postgres and is looking for a good analytics DB should check out the cstore_fdw Postgres extension. [1] It allows Postgres to create and query files in the Optimized Row Columnar (ORC) format [2] from Hive.

It was created and recently open-sourced by Citus Data (YC S11), who've made it a key component of their MPP Postgres offering.

I don't work there. I'm just a fan.

[1] https://github.com/citusdata/cstore_fdw [2] http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds...


There may have been Postgres inside when Microsoft purchased the company whose technology ended up in PDW, but the last two versions shipped by Microsoft have definitely been based on SQL Server.


Yup! My experience with Redshift has actually made me curious to try out Postgres (I've always used MySQL before this). The stricter SQL dialect was a little odd at first, but I've become more comfortable with it over a few months.


Do it. I switched to Postgres for pretty much everything last year, and I absolutely adore it: it just works, no stupid edge cases, excellent documentation, nice tooling (pgAdmin III is better than commercial products I've seen costing hundreds if not thousands).

I'm barely scratching the surface of what it's capable of as well (mostly because I'm in ORM land most of the time).


Postgres is an amazing database, and has some great features. Sadly, you don't get the full power of it in Redshift, but man, are some of the datatypes and functions just so useful, especially in a warehousing environment!


I'm not surprised, given that my experiences with Hive are that it's extremely quirky and hardly ever the fastest way to do anything. Given the fact that people seem to be falling over themselves to reinvent better solutions to the same sorts of problems in the Hadoop space (see: Impala, Shark), I don't think I'm alone on that.


A teammate of mine wrote a post about our redshift setup a few months ago with some more details: http://engineering.pinterest.com/post/75186894499/powering-i...


Is it me, or are people switching to non-relational data warehouse architectures simply because it's en vogue? How many companies do you know that have enough data for a non-relational DW to actually make sense? I wonder, have we really pushed relational databases to their breaking point?


I've looked at and avoided doing anything serious with HDFS/MR for 6 years now. I'm glad some people are starting to realize that re-processing your entire dataset every single time you want to do something isn't very efficient. I'm still waiting for the lightbulb moment where the usefulness of it really makes sense to me.

Can anyone point me to a book or blog that discusses good uses of hadoop/map-reduce?


I'm waiting for the day people realize that materialized views in databases are awesome and decide to incorporate them into a framework.


At least they are if you're using Oracle, which supports auto-refreshing. Postgres has only had them since 9.3 (and they have to be manually refreshed). Meanwhile, MySQL is still struggling with regular views.


Simplistically speaking, you don't always have to do table scans. I run into this every day: "Let's use Hadoop and keep doing full table scans! It's scalable! We just add more machines!" Yeah, except continuing to scan all of your growing data each time you need it is inherently unscalable. :(


It's often cheaper to use the "wrong" architecture than to optimise the right architecture. I know I could have a CouchDB datastore searching a few GB with an afternoon of work. I imagine I could get MySQL fast enough with a few days of optimisation. In terms of time, which is by far the biggest cost in most development, CouchDB is the better option.


"In terms of time, which is by far the biggest cost in most development, CouchDB is the better option."

For a single, one-off utility, sure. For anything that you ever planned for production, that would be crazy.

Just to be clear, the mentality that onion proposes (at least by my interpretation; I apologize if I'm misunderstanding), usually justified under a gross misinterpretation of the "premature optimization" warning, is exactly how you end up with disaster implementations that fail or require enormous amounts of engineering time to triage and bandage into something usable.


At least in the startup world, it's about prioritisation of concerns. Will that disaster implementation take me to my next (or first) round of funding? If yes, I'll happily go with it. After that, I can throw money at the problem.


Disaster recovery is easy to put off forever because you don't need it until you do. When it happens it can also kill off your company.

I've been involved in companies that went 14 years without a disaster. Another company I was involved with had 2 in a span of 2 months, each taking between 2 and 3 days to recover from.

Regardless of whether I need it or not, I sleep better at night knowing a decent plan is in place. Which means I can perform better during the day.


Yeah, but was it your choice of DB that killed you, or something else? That something else is always more likely to happen and more dangerous than "oh noes, all my data is gone, stupid Mongo/Couch!", as if that ever really happens.


Granted, I was making some assumptions about the original post. For me it's not about the choice of DB, it's about having the proper knowledge, time, and team to be able to setup a production environment that isn't seriously flawed in one or more ways.

I would need a hell of a lot more than a time savings of 1-2 days to add a whole new database technology to my production environment. Even if I know the tech, the installation, configuration, automated backups, and automated validation of backups will likely consume more than 1-2 days to set up. Then add the learning curve if I've never used it in a real production environment. Then add the learning curve for any team members who might not be familiar with it.

Weeks, maybe. A day or two: not worth it.


Well, what about value other than time, which is pretty subjective to begin with and doesn't really matter anyway? Like maybe more robustness, bigger dev communities, or one of your team members being an expert who can do this quickly.

I think if your architecture is likely to collapse when you shop around for databases, then you built the whole thing wrong to begin with.


That's just like your opinion though. You ever used CouchDB in production before?


I passed zero judgment on CouchDB; I was responding specifically to the notion that doing something "wrong" is fine if it saves you a small amount of development time at the outset. When it comes to foundational things like your data tier, that attitude is a primary ingredient in project failure.


Nah, dude, you're saying that it's OK for one-off projects, but you'd be crazy to use it in production because it's gonna blow up on you eventually. That's not true at all. Plus, y'all are arguing about conjectures with that catastrophic-failure stuff.


We switched to Redshift for our data warehouse because it was MUCH cheaper, declarative, allowed us to retain the relational model, and abstracted away most of the admin work. Very happy so far: ~5,000 tables, 7-10TB, all maintained by one guy in his spare time.

The workbench options weren't amazing so we made this for our BI team: https://github.com/zalora/redsift/


A lot are. I used to want to switch just because it was cool. Now that I actually have, my main use case is sifting through hundreds of GB of unstructured data. I use Hadoop to get the data into a structured form that I can then load into Redshift. It's awesome knowing that Redshift can handle just about anything I throw at it.


I have another question: Why are people still doing joins in this day and age? Big data + joins = teh suck.

I'm a big fan of compressed denormalized data.


"I wonder, have we really pushed relational databases to their breaking point?"

The primary limitation of relational databases was traditionally that, without expert-level optimizations (which, realistically, data-focused organizations should have, but very seldom do, especially in the start-up space), many queries would generate large numbers of effectively random IOs. When you're rolling with magnetic drives, each drive offers maybe 60-150 IOPS, so this quickly becomes an enormous scaling problem. A large storage array offered maybe 2,000 IOPS. Scaling becomes entirely about scaling IOPS, as CPU is seldom the limitation in databases.

Add to that that many firms were starting on EC2, which not only gave you minimal memory but offered absolutely miserable IOPS performance.
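The IOPS arithmetic is worth spelling out (the figures below are the rough numbers quoted in this comment, not measurements):

```python
# Rough figures from above: a large magnetic array serves ~2,000 random
# IOPS, while a commodity flash card serves hundreds of thousands.
def seconds_for_random_reads(num_reads, iops):
    """Time to satisfy a batch of effectively random reads at a given IOPS."""
    return num_reads / iops

magnetic_array = seconds_for_random_reads(1_000_000, 2_000)    # ~8 minutes
flash_card = seconds_for_random_reads(1_000_000, 500_000)      # ~2 seconds
```

Same workload, roughly a 250x difference just from the storage medium.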

Digg famously, and disastrously, solved this problem by essentially "denormalizing" every bit of data, enormously exploding the raw data they stored but allowing individual queries to be entirely localized, often served in a single large IO: instead of looking up all of your friends and finding the things they dug, the system would proactively push every bit of data to containers for every possible user. This is the model promoted by many advocates of alternative storage (e.g. the advantage of MongoDB is always "pull a single giant data bag versus pulling it together from various places").

If Kevin Rose dug something, it would update the "things my friends liked" containers for 40,000 or so of his friends, rather than having those 40,000 users check on-demand to see what each of their friends liked.
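A minimal sketch of the two models (all names here are invented; this is just the shape of the trade-off, not Digg's actual code):

```python
from collections import defaultdict

dugg_by_user = defaultdict(list)   # user -> items they dug
friends_feed = defaultdict(list)   # user -> precomputed "friends liked" container

def dig(user, item, followers):
    """Fan-out-on-write: push the digg into every follower's container."""
    dugg_by_user[user].append(item)
    for follower in followers:     # one write per follower; reads become one lookup
        friends_feed[follower].append((user, item))

def feed_on_demand(user, friends):
    """Pull model: assemble the feed from each friend's list at read time."""
    return [(f, item) for f in friends for item in dugg_by_user[f]]

dig("kevin", "story-1", followers=["alice", "bob"])
# friends_feed["alice"] and feed_on_demand("alice", ["kevin"]) now agree
```

The push model turns one dig into 40,000 writes but makes every read a single localized IO; the pull model writes once and pays the random IO on every read.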

But they did that right when flash storage was coming into the mainstream: a technology that offers, on simple, inexpensive cards, hundreds of thousands to millions of IOPS. Add that RAM has exploded, such that servers with 256GB of memory are very affordable (that was enough to put the entire universe of Digg's data in memory, where of course random IO is in the tens to hundreds of millions).

So now we're in a situation where a non-duplicated, highly relational database is often the highest-performance option, on top of all of its other advantages, because it fits in memory and fits on economical flash storage. It has completely flipped the equation.

http://www.commitstrip.com/en/2014/06/03/the-problem-is-not-...


The Digg thing is interesting...they pretty much took the complete opposite approach of Reddit, who basically store everything in two big SQL tables.


If you optimize for latency, relational databases won't cut it.


I'm the author of the article. At Monetate, we've chosen our data warehouses to maximize throughput rather than minimize latency. That's where something like Redshift really shines: it's great at large bulk ingests and running large queries relatively quickly, but awful at running lots of small queries quickly.

On our busiest day last year, we ingested over a quarter billion page views across all of our clients' websites. I'm sure someone has made MySQL scale to that volume, but for us Redshift has been working great for a relatively low price point.


Thank you for sharing your experience! It's always inspiring to read well-written articles like yours!


This might be the most inaccurate statement on the internet.


OK, I should have written "if you optimize for read-access latency". Better?


Another commenter pointed this out, but what you're trying to compute is cosine similarity, in which case you're missing the normalizing part in the denominator (the product of the magnitudes of both vectors). In other words, two items which both occur frequently will score higher than two items which occur infrequently but co-occur more often than usual. This leads to a tendency to over-recommend popular items.

When you were on EMR, you could have used Mahout's distributed collaborative filtering, which has the benefits of being correct and requiring zero coding.

Wikipedia explains here: http://en.wikipedia.org/wiki/Cosine_similarity
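For reference, the full formula is straightforward (a generic sketch with made-up co-occurrence vectors, not the article's actual query):

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical per-user purchase indicators for two items.
item_a = [1, 0, 1, 1]
item_b = [1, 1, 1, 0]
score = cosine_similarity(item_a, item_b)  # 2 / (sqrt(3) * sqrt(3)) = 2/3
```

Dropping the denominator is exactly what lets two merely popular items outrank a genuinely correlated pair.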


Thanks for the tip, I'll look into this more.


You should use something like tf-idf to normalize your co-occurrence counts for your recommender; otherwise you'll just end up recommending the most globally popular products.
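One way to sketch that idea (the idf-style weighting and all the item names here are hypothetical, not something from the article):

```python
import math

def idf_weighted(cooccur, item_count, total_baskets):
    """Scale raw co-occurrence counts by each item's inverse popularity,
    tf-idf style, so rare-but-correlated pairs outrank ubiquitous ones."""
    scores = {}
    for (a, b), count in cooccur.items():
        idf_a = math.log(total_baskets / item_count[a])
        idf_b = math.log(total_baskets / item_count[b])
        scores[(a, b)] = count * idf_a * idf_b
    return scores

# Hypothetical counts: "milk" appears in 90 of 100 baskets, "saffron" in 5.
counts = {"milk": 90, "bread": 50, "saffron": 5}
pairs = {("milk", "bread"): 40, ("saffron", "bread"): 5}
scores = idf_weighted(pairs, counts, total_baskets=100)
# The rare saffron-bread pair now outscores the merely popular milk-bread pair.
```

Without the idf factors, milk-bread's raw count of 40 would dominate every recommendation list.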


It's also unclear how many rows they're trying to do this on, and at what frequency; that's the crux of what turns this from a small-to-medium-data problem, which you can easily solve on a large box with 10 lines of code, into a big-data problem, which requires completely different tooling.


In my testing of this query, I ran it against a time range that included over 40 million purchase lines, and our configuration of Redshift returned the result in ~6 minutes. That was much quicker than our legacy EMR implementation.

Currently, we update our product recommendations nightly. However, the speed up we see here from this reimplementation may allow us to update product recommendations more frequently.


I generally love SQL and use it in more or less all my stuff.

The one thing I fucking hate is how hard it is to get the database to execute queries efficiently on anything that is more than a simple join and select.


Wasn't Hadoop the first of its kind in Big Data?


Hadoop was the first (major) open-source implementation of Google's MapReduce framework. http://research.google.com/archive/mapreduce.html

In terms of data warehousing and near-real-time query over Big Data, Google's framework for that is called "Dremel", http://research.google.com/pubs/pub36632.html

Google offers Dremel as a service known as BigQuery.


Those who want on-premises Redshift should try Actian Matrix. Redshift is just Amazon's version of Actian Matrix (formerly ParAccel).


Aside from the example in the original article, what sorts of questions are these tools used to answer?



