
Anyone want to guess how much data they actually have stored?

Backblaze's pods are a data-loss nightmare -- lots of single points of failure which will wipe out many TB of data at a time -- and Backblaze has stated that they replicate data across multiple pods. Given that the 10 PB seems to be the amount of raw storage Backblaze has, I'm guessing that the amount of actual data stored is much less -- depending on what sort of erasure correction scheme they're using, of course. (They're still much bigger than Tarsnap, of course!)



At SpiderOak we get a 3x replication equivalent for about 35% overhead, using Reed-Solomon at the cluster level (on top of RAID6 at the machine level.) Not nearly as expensive as outright replication.
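A quick sketch of how a Reed-Solomon layer can match 3x replication's durability at a fraction of the cost. SpiderOak hasn't published its exact parameters, so the (k=20, m=7) split below is just one hypothetical choice that happens to produce the 35% figure:

```python
# Hypothetical Reed-Solomon parameters: k data shares + m parity shares.
# Any (k, m) with m/k = 0.35 gives the 35% overhead quoted above.
def rs_overhead(k: int, m: int) -> float:
    """Extra storage per byte of user data for an RS(k+m, k) code."""
    return m / k

def survivable_losses(k: int, m: int) -> int:
    """An MDS code like Reed-Solomon survives the loss of any m shares."""
    return m

k, m = 20, 7
print(f"RS({k+m},{k}) overhead: {rs_overhead(k, m):.0%}, "
      f"survives any {survivable_losses(k, m)} share losses")
print(f"3x replication overhead: 200%, survives any 2 copy losses")
```

The trade-off: reads and rebuilds touch k machines instead of one, which is why erasure coding is usually paired with local RAID for the common single-disk failure case.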

Agree those SATA port multipliers are worrisome. In the beginning, our prototype machines used them to squeeze as many drives into a single machine as possible. They have unusually low tolerance for electrical interference and make it possible for one badly malfunctioning drive to take an entire array offline until manually serviced. We've seen occasions where just touching a cable attached to a port multiplier caused the Linux kernel to emit "dazed and confused" NMI events. I am not brave enough to try them again, even in a redundant setup.


3x replication equivalent for about 35% overhead

How did you compute this "replication equivalent"?


Picked a number from thin air? RAID 6 requires 2 drives for parity and is normally used in sets of 8 or 16 drives, but it looks like they are using 45 drives. So 2/43 ≈ 4.65% overhead from using RAID.

Now if they lose 35% on top of that, they are at around 41% overhead. But they are taking a huge hit on write speeds, network traffic and reliability for doing so.

Edit: Looks like they have 10,058 TB before partitioning the drives, so my guess is ~3-6 PB of actual user data.
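Checking that arithmetic. The RAID6-over-45-drives layout is an assumption (Backblaze's actual stripe geometry may differ), and the 10,058 TB raw figure is from the article; the usable estimates just divide raw capacity by each scheme's expansion factor:

```python
# Back-of-envelope overhead and usable-capacity estimate.
# Assumption: one RAID6 stripe across all 45 drives, 2 parity drives.
drives, parity = 45, 2
raid_overhead = parity / (drives - parity)          # 2/43 ~ 4.65%

# Hypothetical cluster-level erasure-coding layer at 35%, as in the
# parent comment; overheads compound multiplicatively.
cluster_overhead = 0.35
total_overhead = (1 + raid_overhead) * (1 + cluster_overhead) - 1

raw_tb = 10_058
print(f"RAID overhead:   {raid_overhead:.2%}")
print(f"total overhead:  {total_overhead:.1%}")     # ~41%
print(f"usable, erasure-coded:  ~{raw_tb / (1 + total_overhead) / 1000:.1f} PB")
print(f"usable, 3x replicated:  ~{raw_tb / 3 / 1000:.1f} PB")
```

Erasure coding would leave roughly 7 PB usable, while straight 3x replication leaves about 3.4 PB, which brackets the ~3-6 PB guess depending on which scheme Backblaze actually runs.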


Does SpiderOak only provide backup service? Erasure encoding is efficient for cold data. Do you use erasure encoding to distribute the hot data across clusters?



