Hacker News

A few things have caught my attention in your post.

Your biggest problem was that your services were not sized/tuned for the hardware resources you have. As a result, your servers became unresponsive, and instead of being able to fix the problem you had to wait 30+ minutes until they recovered.

In your case you should have limited Solr's JVM memory size to the amount of RAM that your server can actually allocate to it (check your heap settings and possibly the PermGen space allocation).
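As a rough sketch of what that sizing might look like (the 4 GB figure and the one-quarter rule here are assumptions for illustration, not numbers from the post):

```shell
# Hypothetical sizing for a shared 4 GB host: cap Solr's heap at roughly a
# quarter of physical RAM, leaving the rest for the OS page cache and the
# other daemons (Apache, Nginx, Varnish) on the box.
total_mb=4096                 # assumed physical RAM
heap_mb=$(( total_mb / 4 ))   # 1024 MB heap cap
# Fixed min == max heap avoids resize pauses; MaxPermSize only applies to
# Java 7 and earlier, where PermGen still exists.
echo "-Xms${heap_mb}m -Xmx${heap_mb}m -XX:MaxPermSize=256m"
```

The exact fraction is a judgment call; the point is that the heap cap plus everything else on the host must fit comfortably in physical RAM.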

If all services are sized properly, under no circumstance should your server become completely unresponsive; only the overloaded services would be affected. That would let you or your system administrator log in and fix the root cause, instead of waiting 30+ minutes for the server to recover or be rebooted. In short, you stay able to observe and interact with the system.

The basic principle is that your production servers should never swap (which is why setting the vm.swappiness=0 sysctl is so important). The moment your services start swapping, performance degrades so badly that the server can no longer keep up with requests, and they pile up until a total meltdown.
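Concretely, that sysctl looks like this (a sketch, assuming root on a typical Linux host):

```shell
# Tell the kernel to avoid swapping application memory (run as root).
sysctl -w vm.swappiness=0                     # takes effect immediately
echo 'vm.swappiness = 0' >> /etc/sysctl.conf  # survives a reboot
```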

In your case, the OOM killer terminating the Java process actually saved you by allowing you to log in to the server. I wouldn't consider setting the OOM reaction to "panic" a good approach: if a similar problem occurs and the server reboots itself, you'll have no idea what caused the memory usage to grow in the first place.
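For reference, the "panic" reaction discussed above maps to a kernel sysctl (shown only to make the option concrete; the whole point of the comment is that you probably should not enable it):

```shell
# What "setting the OOM reaction to panic" means in practice (run as root).
# The comment argues AGAINST this: rebooting destroys the evidence of what
# ate the memory.
sysctl -w vm.panic_on_oom=1   # kernel panics instead of killing a process
sysctl -w kernel.panic=10     # ...and reboots 10 seconds after the panic
```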



Appengine.

You're a development shop, not scalable system builders. Deciding to build your own systems has already potentially cost you the success of this product - I doubt you'll get a second chance on HN now. If you were on appengine, you'd be popping champagne corks instead of blood vessels, and capitalising on the momentum instead of writing a sad post-mortem.

I'd recommend you put away all the Solr, Apache, Nginx, and Varnish manuals you were planning to study for the next month, and check out appengine. Get Google's finest to run your platform for you, and concentrate on what you do best.


I wish I could vote this comment up 10 times over.

I know that I know little to nothing about sysadmin, so when I built a recent app I used AppEngine for this very reason. And when it got onto the HN front page, it scaled ridiculously easily without any configuration changes. (No extra dynos, no changes at all.)

And when I've occasionally screwed up and done stupid stuff, it still doesn't go down. (To be honest, I first saw the problem when I noticed my weekly bill was ~$5 instead of the baseline $2.10. It helped that being a paid app pushed the limits up a lot higher.)


Any PaaS would do; AppFog, for example.


I'd say that the biggest problem is that they tried to launch their product on what appears[1] to be a 4 GB host, representing maybe $300-400 of hardware cost (maybe more if you buy premium hardware, which I doubt Linode does).

I mean, careful configuration and capacity planning is important. But what happened to straightforward conservative hardware purchasing where you get a much bigger system than you think you need? It's not like bigger hosts are that expensive: splurge for an EC2 2XL ($30/day I think) or three for the week you launch and have a simple plan in place for bringing up more to handle bursty load.
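Taking the commenter's $30/day figure at face value (it's explicitly a guess, not a quoted price), the launch-week splurge works out to:

```shell
# Back-of-envelope cost of three EC2 2XL instances for a one-week launch,
# using the assumed $30/day rate from the comment above.
rate_per_day=30
instances=3
days=7
total=$(( rate_per_day * instances * days ))
echo "\$${total} for launch week"   # 30 * 3 * 7 = 630
```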

[1] The OOM killer picked a 2.7G Java process to kill. It usually picks the biggest thing available, so I'm guessing at 4G total.
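One way to sanity-check this kind of guess on a live Linux box (a sketch, assuming a readable /proc filesystem):

```shell
# List the five processes the OOM killer considers the worst offenders.
# /proc/<pid>/oom_score is the kernel's "badness" value: higher means more
# likely to be chosen, and it roughly tracks memory footprint.
for pid in /proc/[0-9]*; do
  score=$(cat "$pid/oom_score" 2>/dev/null) || continue
  name=$(cat "$pid/comm" 2>/dev/null) || continue
  printf '%s\t%s\n' "$score" "$name"
done | sort -rn | head -5
```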



