
I had 48 hours of downtime on a DigitalOcean node last week. All events on the node were stalled, so I could not shut it down or take an image to spin up a new instance. I had to hammer their support with a dozen tickets before someone gave me more than a canned reply. Of course, they did not acknowledge this very long outage on their status page. I like DO, but stuff like this just can't happen without anyone checking on it stat. I have become very wary about using them for critical infrastructure since then.


Using a budget IaaS for "critical infrastructure" is your mistake.


You seem to think every business has a budget for a big contract with DO or Amazon or Google.

Some people run their email servers on these budget IaaS providers. Some prefer to host outside of Google's or Amazon's control. So where else should they host their own server? At home?

Ideally, with continuous streaming backup of a node, a second machine could pick up and serve the last backup whenever the host machine fails (a rough sketch of the idea follows the conclusion below). That is of course expensive for a provider to offer every customer. But asking DO to actually report the status of the node, its host machine, and the region is the right thing to ask for.

Customers don't need the full technical details, but even a short, friendly message (email, SMS, or a note on the status page) would ease the tension: "Your droplet appears offline because its host machine is offline. Don't worry! Your data is safe in our backups! If you have any concerns, please contact XXXXX@digitalocean.com or xxx-xxx-xxx."

Conclusion:

* Report the status of the droplet on the customer's personal dashboard.

* For non-isolated incidents, report them on both the personal dashboard and the public status page.
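
For what it's worth, here is a rough sketch of that streaming-backup idea, only to show the mechanism, not the economics. Everything in it is hypothetical: the hostnames, paths, and interval are placeholders, and a real provider would use block-level replication (DRBD or similar) rather than an rsync loop:

    #!/usr/bin/env python3
    """Hypothetical sketch of a warm-standby backup loop (all names are placeholders)."""
    import subprocess
    import time

    PRIMARY_DATA = "/var/www/"              # data directory on the primary node
    STANDBY = "backup@standby.example.com"  # hypothetical warm-standby host
    STANDBY_DATA = "/var/www/"
    INTERVAL_SECONDS = 60

    def sync_once() -> bool:
        """Mirror the data directory to the standby; True on success."""
        result = subprocess.run(
            ["rsync", "-az", "--delete", PRIMARY_DATA, f"{STANDBY}:{STANDBY_DATA}"],
            capture_output=True,
        )
        return result.returncode == 0

    if __name__ == "__main__":
        while True:
            if not sync_once():
                print("sync failed; standby may be serving stale data")
            time.sleep(INTERVAL_SECONDS)

The point is just that a warm standby holding the last synced copy can serve reads while the primary host is down; nothing about it requires exotic infrastructure.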


Digital Ocean is absolutely not meant for critical infrastructure, nor is it meant for running a production mail server (there's a good chance the IP has already been flagged somewhere for spam, as in any shared cloud IP space). You're paying for a low-budget VPS with no phone support. Yes, they have a 99.9% SLA, but the penalty to them if they miss it is minimal.


Defending a company by contradicting its own marketing is not much of a defense.


Whether this is for critical infrastructure or not, the provider should tell the customer about the problem automatically via the dashboard.

It would take some engineering work, but not a whole lot.

How is that too much to ask? Should we discard the demand just because DO is a low-budget VPS? A provider that truly values its customers would take the suggestion seriously. I don't have millions to employ someone to manage an AWS farm for me. Instead of me asking DO why my nodes are down every time it happens, I want DO to tell me when it happens. It's a simple customer demand.
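
Until DO pushes this to the dashboard, a customer can approximate it by polling the API themselves. A minimal sketch, assuming the v2 droplets endpoint (GET /v2/droplets/{id} returns a JSON body with a "status" field) and an API token in the DO_TOKEN environment variable; the droplet ID and the alert (a bare print here) are placeholders:

    #!/usr/bin/env python3
    """Poll a droplet's status and complain when it is not active (sketch)."""
    import json
    import os
    import time
    import urllib.request

    DROPLET_ID = 123456  # hypothetical droplet ID
    TOKEN = os.environ["DO_TOKEN"]

    def droplet_status() -> str:
        req = urllib.request.Request(
            f"https://api.digitalocean.com/v2/droplets/{DROPLET_ID}",
            headers={"Authorization": f"Bearer {TOKEN}"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["droplet"]["status"]

    if __name__ == "__main__":
        while True:
            status = droplet_status()
            if status != "active":  # e.g. "off", "new", or "archive"
                print(f"droplet {DROPLET_ID} is {status}; time to open a ticket")
            time.sleep(300)

Wire that print up to email or SMS and you have roughly the notification DO should be sending in the first place.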


As others have pointed out, that $5 box might actually be a $20 or $40 box, so I disagree with your notion of not putting critical infrastructure on DO.

That being said, their penalties are ridiculous. I got a $10 "SLA credit" for 2 days of downtime. So I agree their SLA is useless.


There are many other hosting providers that offer virtual private servers with a high percentage of guaranteed uptime.

For example, with Dreamhost you can get a VPS for ~$15 a month. If your e-mail isn't worth that much to you, then why are you bothering to self-host your e-mail anyway?


> If your e-mail isn't worth that much to you, then why are you bothering to self-host your e-mail anyway?

Because Google Apps no longer offers free accounts?


On this note, what is the easiest way to transfer my free Google Apps account from a .co.cc domain to a real domain and keep my free status? Can I change domains within Google Apps itself?


AFAIK you can't, but you can add the real domain as an alias instead.


> Using *aaS for "critical infrastructure" is your mistake.

Fixed that for you.


> All events on the node were stalled, so I could not shut it down or take an image to spin up a new instance.

This should not surprise you; it is a common failure mode for VMs. Say a host goes down. Depending on the provider's storage setup, that can mean all the images on its disks are inaccessible. You can't interact with them, which is why you couldn't take snapshots.

> I had to hammer their support with a dozen tickets before someone gave me more than a canned reply.

Abusing support is never the right answer; I'm not surprised you only got canned replies.

> I like DO, but stuff like this just can't happen without anyone checking on it stat.

It's almost like every problem cannot be solved instantly.


This wasn't a very nice or helpful response. The OP relayed their experience, and you picked it apart as if they were uninformed and incompetent. Pointing out a common failure mode doesn't mean the service shouldn't handle it quickly and keep its status page updated. Filing a new support ticket after getting a canned response that doesn't address the issue is reasonable. Finally, 48 hours is a long time. The post sounds like they weren't looking for instant service -- just some kind of service.


48 hours is nuts! How did you get to sleep?


After the first 24 hours I started to get a bit twitchy. Most of all I was worried about data loss. (RAID degradation turned out to be the core issue, but I managed to boot the server after 48 hours and copy everything over before spinning up a new instance, so no data loss occurred.)



