
I see many debatable points:

- Blood is far from being that dense, but I'll assume you mean to calculate a theoretical limit.

- If you use the whole of a cell's DNA to encode information, the cell will die. It can't be used to freely encode information like a hard disk.

- There is no way to retrieve information in this system. This is a bit like packing a large number of extremely high-density magnetic platters in a box and calling it a storage system, which is very different from using actual hard drives with their whole read/write mechanism (heads, space between the platters, magnets...). Even if those parts aren't connected, they still take up room.

- Reliability of hard disks is far, far better than this. Encoding information as a single copy in a single cell is wildly unsafe. Even using multiple copies, it's likely to degrade with mutations.

Edit: I was under the impression that a base pair could only encode one bit of information, since only A-T and C-G are valid combinations. Wikipedia appears to disagree with me, but I see no source; does anyone know why Wiki says a base pair can encode 2 bits?



> - Blood is far from being that dense, but I'll assume you mean to calculate a theoretical limit.

Cells can be arranged in other ways. I explicitly mentioned red blood cells because they are exceptional in not containing any DNA, but any chunk of tissue with that volume would do.

> If you use the whole of a cell's DNA to encode information, the cell will die. It can't be used to freely encode information like a hard disk.

Yes, that's obvious, but a cell's DNA does hold that much information; it's just not our information.

> - There is no way to retrieve information in this system.

There actually is, the information retrieval mechanism that is used to 'express' the DNA (actually, the RNA, an 'unzipped' strand of DNA, but who's counting) is a wonderful little nano machine called a ribosome, it's probably the most amazing structure that I know of outside of the DNA itself.

They're in the volume quoted, the DNA only occupies about 25% of that volume iirc.

> Reliability of hard disks is far, far better than this.

The error correction mechanism that allows your cells to be copied through very large numbers of generations is actually pretty good. Most 'mutations' are lethal, and only very few actually result in viable copies passing their changes on to newer generations. Mutations are also pretty rare on the whole.

You are right that only two pairings are valid, A-T and C-G, but each can be attached 'in reverse' as well (T-A / G-C), so that makes four possible combinations in all.

If it weren't for that, the movie 'GATTACA' would have been unpronounceable :)


> There actually is, the information retrieval mechanism that is used to 'express' the DNA (actually, the RNA, an 'unzipped' strand of DNA, but who's counting) is a wonderful little nano machine called a ribosome, it's probably the most amazing structure that I know of outside of the DNA itself.

Actually I was going to say that one way to think of using that information is to compute with it, in other words, an organism is simply the result of a computation on its DNA.


> - There is no way to retrieve information in this system.

DNA is natively a content-addressable storage system, due to natural base-pairing. But to first address the question in your edit: think of DNA as a pair of singly-linked lists, each over an alphabet of four characters. Each singly-linked list is the "reverse complement" of the other: the reverse sequence with an A-T swap and a C-G swap. At each position there are 2 bits of information, and the other linked list allows for some redundancy.
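To make the "2 bits per position" and "reverse complement" ideas concrete, here is a minimal sketch. The particular bit-to-base assignment (`BASES = "ACGT"`) and the function names are my own assumptions for illustration; any fixed mapping of the four bases to two bits would do.

```python
# Hypothetical 2-bit mapping: index in this string is the 2-bit value.
BASES = "ACGT"
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def encode(data: bytes) -> str:
    """Turn each byte into four bases, two bits per base, high bits first."""
    return "".join(
        BASES[(byte >> shift) & 0b11]
        for byte in data
        for shift in (6, 4, 2, 0)
    )

def decode(strand: str) -> bytes:
    """Inverse of encode: every four bases become one byte."""
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)

def reverse_complement(strand: str) -> str:
    """The sequence of the opposite strand, read in its own direction."""
    return "".join(COMPLEMENT[b] for b in reversed(strand))

strand = encode(b"HN")                # 2 bytes -> 8 bases
other = reverse_complement(strand)    # fully determined by `strand`
```

The second strand carries no extra bits since it is computed from the first, which is why it counts as redundancy rather than capacity.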

To probe for information in a DNA database, you construct the reverse-complement of the desired bit of information, attach a marker to your probe (such as a fluorescent dye, biotin, or a magnetic bead), then physically mix it into your DNA database. A couple of cycles of melting and cooling, and your probe will eventually find its target DNA.

Of course, the thermodynamics of a physical database like this aren't particularly great. I'm not sure of the asymptotic behavior; my intuitive guess is lg(N) just like in B-trees or what have you, but I've never run the numbers or heard of anyone else running it. Also, the constant in front may be just a few orders of magnitude larger than our current systems :)

Reading DNA is getting super cheap these days, and the pace of DNA sequencing technology makes Moore's law look positively wimpy. There are about 30 serious startups working on technologies that fall into a few broad categories, and some, like PacBio, had an IPO this year. Writing DNA is a much more difficult challenge, and I don't know of many people looking into it yet. The market for writing DNA isn't nearly as obvious as it is for sequencing. Of course, if it becomes feasible to write your own pets/plants/children instead of breeding them, the market may explode.


There are several companies that will "write" arbitrary sequences of DNA, to order. One is http://www.mrgene.com , but at $0.39/bp it is still much more economical to isolate and amplify desired sequences using PCR.
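For a rough sense of what that price means for data storage, here is a back-of-the-envelope sketch. It assumes 2 bits per base pair and ignores any overhead for addressing, redundancy or error correction, so treat it as a lower bound on cost rather than a real quote.

```python
PRICE_PER_BP = 0.39   # USD per base pair, the figure quoted above
BITS_PER_BP = 2       # assumption: four bases, no redundancy or addressing

def synthesis_cost(num_bytes: int) -> float:
    """Cost in USD to synthesize num_bytes of data at the quoted price."""
    base_pairs = num_bytes * 8 / BITS_PER_BP
    return base_pairs * PRICE_PER_BP

cost_per_byte = synthesis_cost(1)
cost_per_megabyte = synthesis_cost(10**6)   # 1 MB taken as 10^6 bytes
```

That works out to about $1.56 per byte, or roughly $1.56 million per megabyte, under those assumptions.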


> To probe for information in a DNA database, you construct the reverse-complement of the desired bit of information

To build the reverse-complement of the information, don't you need to have the information in the first place?

I know it's possible to store information in DNA through various means, but I don't believe it can be done at the density the OP calculated. If we're going to take into account only information storage while ignoring retrieval considerations, then we shouldn't compare naked cells with no DNA duplication to reliable whole hard drives.

Call it human pride, but I think we've beaten mother nature in several aspects ;)


Think of it as a massive key-value store: you construct your query off the key, use that to pull out the key-value pair, and when you sequence your key you continue to sequence more in order to pull out the value. If you prefer sequential addresses, your key could be just that.
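A toy model of that key-value lookup, purely as a sketch: the pool contents, the 8-base key length and the function names are made up for illustration, and exact string matching stands in for hybridization (real probes tolerate mismatches and depend on temperature).

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(strand: str) -> str:
    return "".join(COMPLEMENT[b] for b in reversed(strand))

def make_probe(key: str) -> str:
    """The probe is the reverse complement of the key we want to find."""
    return reverse_complement(key)

def lookup(pool: list[str], probe: str, value_len: int) -> list[str]:
    """Return the bases following the key on every strand the probe binds."""
    key = reverse_complement(probe)   # the sequence the probe pairs with
    hits = []
    for strand in pool:
        pos = strand.find(key)
        if pos != -1:
            start = pos + len(key)
            hits.append(strand[start:start + value_len])   # "keep sequencing"
    return hits

# Hypothetical pool: each strand is an 8-base key followed by an 8-base value.
pool = [
    "ACGTACGT" + "GGGGCCCC",
    "TTGACCAA" + "ATATATAT",
    "CCATGGTA" + "CGCGAATT",
]
probe = make_probe("TTGACCAA")
result = lookup(pool, probe, value_len=8)
```

The loop here visits strands one at a time; in the tube every copy of the probe tests strands simultaneously, which is where the parallelism comes from.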

And actually, this could be done at a much higher density than what the original poster described, as he's counting the full cell in the density calculation, and DNA is only a small fraction of the cellular volume. You could duplicate all the DNA 10-100 times in the same amount of space once you take out all the ribosomes, proteins and extra water. And as long as it's not stored in direct sunlight or next to your pile of plutonium, DNA is going to be much, much more stable than aligned magnetic fields. We're still getting good DNA sequence out of bones that are tens of thousands of years old.

When you think of nanotechnology and miniaturization, think of biology, because that's where all the real nanotechnology is going on. We've not done any better than nature when it comes to making small machinery. Nature has already invented the commodity interchangeable parts (amino acids and nucleic acids) that can self-assemble into rather fantastic machines.

However, we have beaten mother nature on latency: as I alluded to, a DNA database like this would have latency on the order of days for a lookup. On the other hand, as much parallel access as you can imagine is built in, without additional volume. And this isn't a system that has been engineered at all; I'm just talking about the fundamental properties of a little puddle of DNA and water. If half the engineering that went into modern computer hardware were put into a DNA database, it could be quite competitive with our electronic systems.


Sure, a key-value store would work. My point is that the OP's system is not such a store. He just stores 10 PB of raw data with no indexing and no duplication, so there is no way to retrieve data and the comparison with hard disks is meaningless. My post was an answer to his "please correct my math".


The reason one base pair can encode 2 bits is that A-T and T-A are different valid base pairs. The same is true for C-G and G-C as well. A "single" DNA molecule (the famous double helix) actually contains 2 copies of the genetic information, one on each strand.


If I had to guess, it's because you read DNA along only one of the strands (starting at the 5' end, IIRC), in which case you've still got four possible "characters" to use in your representation: A-T is different from T-A, etc. People have thought up all sorts of clever ways to encode data using DNA base pairs, but it's been too long since I did any bioinformatics for me to remember exactly how they calculated it all out.


Re: Number of bits per pair: AT, TA, CG and GC? I.e. composition (1 bit) and direction (1 bit)? Not sure, but that seems to be the most logical way.
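The counting argument is easy to check by enumeration. This small sketch assumes standard Watson-Crick pairing (A with T, C with G) and reads each position along one strand, so the base on that strand fixes its partner:

```python
import math

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Reading along one strand: the base seen there determines its partner,
# so an oriented pair is fully described by the base on the read strand.
oriented_pairs = [(base, COMPLEMENT[base]) for base in "ACGT"]

# Composition: which unordered pairing is it (A/T or C/G)?
compositions = {frozenset(pair) for pair in oriented_pairs}

bits_per_position = math.log2(len(oriented_pairs))
bits_for_composition = math.log2(len(compositions))
bits_for_direction = bits_per_position - bits_for_composition
```

Four oriented pairs give 2 bits per position, which splits into 1 bit for which pairing it is and 1 bit for which way round it sits.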



