
Anyone want to guess how much data they actually have stored?

Backblaze's pods are a data-loss nightmare -- lots of single points of failure which will wipe out many TB of data at a time -- and Backblaze has stated that they replicate data across multiple pods. Given that the 10 PB seems to be the amount of raw storage Backblaze has, I'm guessing that the amount of actual data stored is much less -- depending on what sort of erasure correction scheme they're using, of course. (They're still much bigger than Tarsnap, of course!)



At SpiderOak we get a 3x replication equivalent for about 35% overhead, using Reed-Solomon at the cluster level (on top of RAID6 at the machine level.) Not nearly as expensive as outright replication.
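A quick sketch of how a Reed-Solomon layer can match 3x replication's durability at a fraction of the cost. SpiderOak hasn't published its exact parameters, so the (k=20, m=7) split below is just one hypothetical choice that happens to produce the 35% figure:

```python
# Hypothetical Reed-Solomon parameters: k data shares + m parity shares.
# Any (k, m) with m/k = 0.35 gives the 35% overhead quoted above.
def rs_overhead(k: int, m: int) -> float:
    """Extra storage per byte of user data for an RS(k+m, k) code."""
    return m / k

def survivable_losses(k: int, m: int) -> int:
    """An MDS code like Reed-Solomon survives the loss of any m shares."""
    return m

k, m = 20, 7
print(f"RS({k+m},{k}) overhead: {rs_overhead(k, m):.0%}, "
      f"survives any {survivable_losses(k, m)} share losses")
print(f"3x replication overhead: 200%, survives any 2 copy losses")
```

The trade-off: reads and rebuilds touch k machines instead of one, which is why erasure coding is usually paired with local RAID for the common single-disk failure case.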

Agree those SATA port multipliers are worrisome. In the beginning, our prototype machines used them to squeeze as many drives into a single machine as possible. They have unusually low tolerance for electrical interference and make it possible for one badly malfunctioning drive to take an entire array offline until manually serviced. We've seen occasions where just touching a cable attached to a port multiplier caused the Linux kernel to emit "dazed and confused" NMI events. I am not brave enough to try them again, even in a redundant setup.


3x replication equivalent for about 35% overhead

How did you compute this "replication equivalent"?


Picked a number from thin air? RAID 6 requires 2 drives for parity and is normally used in sets of 8 or 16 drives, but it looks like they are using 45 drives. So 2/43 ≈ 4.65% overhead from using RAID.

Now if they lose 35% on top of that, they are at around 41% overhead. But they are taking a huge hit on write speeds, network traffic and reliability for doing so.

Edit: Looks like they have 10,058 TB before partitioning the drives, so my guess is ~3-6 PB of actual user data.
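Checking that arithmetic. The RAID6-over-45-drives layout is an assumption (Backblaze's actual stripe geometry may differ), and the 10,058 TB raw figure is from the article; the usable estimates just divide raw capacity by each scheme's expansion factor:

```python
# Back-of-envelope overhead and usable-capacity estimate.
# Assumption: one RAID6 stripe across all 45 drives, 2 parity drives.
drives, parity = 45, 2
raid_overhead = parity / (drives - parity)          # 2/43 ~ 4.65%

# Hypothetical cluster-level erasure-coding layer at 35%, as in the
# parent comment; overheads compound multiplicatively.
cluster_overhead = 0.35
total_overhead = (1 + raid_overhead) * (1 + cluster_overhead) - 1

raw_tb = 10_058
print(f"RAID overhead:   {raid_overhead:.2%}")
print(f"total overhead:  {total_overhead:.1%}")     # ~41%
print(f"usable, erasure-coded:  ~{raw_tb / (1 + total_overhead) / 1000:.1f} PB")
print(f"usable, 3x replicated:  ~{raw_tb / 3 / 1000:.1f} PB")
```

Erasure coding would leave roughly 7 PB usable, while straight 3x replication leaves about 3.4 PB, which brackets the ~3-6 PB guess depending on which scheme Backblaze actually runs.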


Does SpiderOak only provide backup service? Erasure encoding is efficient for cold data. Do you use erasure encoding to distribute the hot data across clusters?



