
As so many people learned over the years in the storage org, the essential modus operandi was "just do your current shit; this new shit you are proposing is too hard and we don't want to do it". I lost count of the improvement proposals that ultimately did not get supported (there was a funny one with the (first) hosted NFS prototype: when they tried to turn it off, it turned out there were pissed-off customers already running business workloads on it). Instead we tried to fit square pegs into round holes "by using existing technology".


Oh, you can certainly do big projects. My project[1] spanned 3 departments, involved dozens of engineers, and required that we work with multiple hard drive vendors (our first two partners for Hybrid SMR were Seagate and WDC) on an entirely new type of HDD, as well as with the T10/T13 standards committees so we could standardize the commands we needed to send to these HDDs[2]. So this was all a huge amount of "new shit" that was not only new to Google, it was new to the HDD industry. You just have to have a really strong business case that shows how you can save Google a large amount of money.

[1] https://blog.google/products/google-cloud/dynamic-hybrid-smr...

[2] https://www.t10.org/pipermail/t10/2018-September/018566.html

On the production kernel team, colleagues of mine worked on some really cool and new shit: ghOSt, which delegates scheduling decisions to userspace in a highly efficient manner[3]. It was published at SOSP 2021[4][5], so peer reviewers thought it was a pretty big deal. I wasn't involved in it, but I'm in awe of this cool new work that my peers on the prodkernel team created, all of which was not only described in detail in peer-reviewed papers, but also published as Open Source.

[3] https://research.google/pubs/pub50833/

[4] https://www.youtube.com/watch?v=j4ABe4dsbIY

[5] https://dl.acm.org/doi/10.1145/3477132.3483542

We have some really top-notch engineers on our production kernel team, and I'm very proud to be part of an organization that has this kind of talent.


I'm not saying storage didn't do big projects. I'm saying that over time it got calcified, and instead of doing proper stack refactoring and delivering features beneficial to customers, it sadly continued to chug along within team boundaries.

For example:

RePD sits at just the wrong level. It should have been at the CFS/chunk level and thus benefited other teams as well.

The BigStore stack is beyond bizarre. For years there were no object-level SLOs (not sure if there are now), which meant that sometimes your object disappeared and BigStore SREs were "la-la-la, we are fully within SLO for your project". Or you would delete something and your quota would not come back, and they would say "oh, the Flume job got stuck in this cell, for a week...".

Not a single cloud (or internal, for that matter) customer asked for a "block device"; they all just want to store files. Which means that cloud posix/nfs/smb should have been worked on from day 1 (of Cloud); we all know how that went.


No one asked for a "block device"? Um, that's table stakes, because every single OS in the world needs to be able to boot, and that requires a block device. Every single cloud system provides a block device because if it weren't there, customers wouldn't be able to use their VM, and you can be sure they would be asking for it. Every single cloud system has also provided from day one something like AWS S3 or GCE's GCS so users can store files. So I'm pretty sure you don't know what you are talking about.

As far as "proper stack refactoring" is concerned, again, the key is to make a business case for why that work is necessary. Tech debt can be a good reason, but doing massive refactoring just because it _could_ help other teams requires much more justification than "it could be beneficial". Google has plenty of storage solutions which work across multiple datacenters / GCE zones, including Google Cloud Storage, Cloud Spanner and Cloud Bigtable. These solutions or their equivalents were available and used internally by teams long before they were available as public offerings for Cloud customers. So "we could have done it a different way because it might benefit other teams" is an extraordinary claim which requires extraordinary evidence. Speaking as someone who has worked in storage infrastructure for over a decade, I don't see the calcification you refer to, and there are good reasons why things are done the way they are which go far beyond the current org chart. There has been a huge amount of innovative work done in the storage infrastructure teams.

I will say that the posix/nfs/smb way of doing things is not necessarily the best way to provide the lowest possible storage TCO. It may be the most convenient way if you need to lift and shift enterprise workloads into the cloud, sure. But if you are writing software from scratch, or if you are an internal Google product team using internal storage solutions such as Colossus, BigTable, Spanner, etc., it is much cheaper, especially if you are writing software that must be highly scalable, to use these technologies as opposed to posix/nfs/smb. All cloud providers, Google Cloud included, will provide multiple storage solutions to meet the customer where they are. But would I recommend that a greenfield application start by relying on NFS or SMB today? Hell, no! There are much better 21st-century technologies available today. Why start a new project by tying yourself to such legacy systems, with all of their attendant limitations and costs?


> So I'm pretty sure you don't know what you are talking about.

Trust me, I intimately know what I’m talking about.

Without personal jabs, let me explain in a bit more detail:

App in VM (kinda posix) -> ext4 (repackaging of data to fit into "blocks") -> NVMe driver -> (Google's virtualization/block device stack, aka Vanadium/PD) -> CFS. The moment data gets into ext4, it goes through a legacy stack that only exists because, many years ago, there were hardware devices with 512-byte sectors (as an illustration, the upgrade to 4K took forever). All the repackaging and IO scheduling to work with the 4 KB block abstraction is wasted performance and cycles.
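To make the repackaging cost concrete, here is a minimal sketch (a hypothetical helper for illustration, not any real PD/CFS code) of the read-modify-write amplification a sub-block write incurs once everything below the file system speaks in 4 KB blocks:

```python
BLOCK = 4096  # assumed physical block size of the layer below

def rmw_bytes(offset, length, block=BLOCK):
    """Bytes actually moved when a write of `length` bytes at `offset`
    must be performed in whole blocks (read-modify-write)."""
    first = offset - (offset % block)             # round start down
    last = ((offset + length + block - 1) // block) * block  # round end up
    return last - first

# A 512-byte write aligned to an old 512 B sector, but not to 4 KB:
# the stack must move a full 4096 bytes -- 8x write amplification.
print(rmw_bytes(512, 512))
```

Every layer in the chain above that re-blocks the data pays some version of this cost, which is the "wasted performance and cycles" point.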

From the customer's perspective, all they want is a VM with a scalable file system. With Kubernetes, etc., they don't ever want to think about volume size, which is a major hurdle to size correctly and provision. BTW, both small and large customers run into volume sizing issues all the time.

There are also internal customers that need posix-compliant storage “on borg” because they run oss lib/software.

Anyway, the optimal stack in this case is to plug into the VM at the file-system level. Now, is it a hard problem to solve? Yes. Would it eliminate PD? No, that's still required for legacy cases. Would it be enormously beneficial for modern containerized cloud workloads? Absolutely.


As I said, there are apps that need a Posix interface, although my contention is that the vast majority of them are "lift and shift" from customer data centers into the cloud. Sure, they exist. But from the standpoint of cost, efficiency, and ease of supporting cross-data-center reliability and robustness, the Posix file system interface was designed in the 1970s, and it shows.

If you have an app which needs a NoSQL interface, you can do much better by using a cloud-native NoSQL service, as opposed to running Cassandra on your VM and then hoping you can get cross-zone reliability by using something like a Regional Persistent Disk. And sure, you could run Cassandra on top of cifs/smbfs or nfs, but the results will be disappointing. These are 20th-century tools, and it shows.

If customers want Posix because they don't want to update their application to use Spanner, or BigTable, or GCS, they certainly have every right to make that choice. But they will get worse price/performance/reliability as a result. You keep talking about ossification and people refusing to refactor the storage stack. Well, I'd like to submit to you that being wedded to a "posix file system" as the one true storage interface is another form of ossification. Storage stacks that feature NoSQL, relational database, and object storage WITHOUT an underlying Posix file system might be a much more radical, and ultimately the "proper", stack refactoring. A "modern containerized cloud workload" is better off using Cloud Spanner, Cloud BigTable, or Cloud Storage, depending on the application and use case. Why stick with a 1970s posix file system with all of its limitations? (And I say this as an ext4 maintainer who knows all of the warts and limitations of the Posix file interface.)

Of course, for customers who insist on a Posix file system, they can use GCE PD or Amazon EBS for local file systems, or they can use GCE Cloud Filestore or Amazon EFS if they want an NFS solution. But it will not be as cost effective, or performant as other cloud native alternatives.

Finally, just because you are using "oss lib/software" does not mean that you need "Posix-compliant storage". Especially inside Google, while those internal customers do exist, they are a super-tiny minority. Most internal teams use a much smarter approach, even if that means an adaptation layer is needed between some particular piece of OSS software and a more modern, scalable storage infrastructure. (And many OSS libraries don't need a Posix-compliant interface at all!)

Posix-compliant means sticking with an interface invented 50 years ago, with technological assumptions which may not be true today. Sometimes you might need to fall back to Posix for legacy software --- but we're talking about "modern containerized cloud workloads", remember?


Don't get me started on how many times Google Cloud started and then killed cloud NFS (I see they now have an EFS-like product). Or on how hard it was to buy a spindle.



