DNA is an Ancient Form of Data Storage. Is it Also a Radical New Alternative?
We are reaching the limits of our data storage capabilities. The global demand for data storage is enormous, with data centers popping up like mushrooms in nearly every corner of the globe. There are some 11,000 data centers worldwide, and 3,077.8 MW worth of data center space was under construction in primary markets in 2023.
With the advent of AI technology, which has worked its way into nearly every aspect of the digital world, the demand for data storage will only increase — likely at an exponential rate.
Thus, we must turn to novel technologies that allow us to store data in more efficient, cost-effective ways. One of the most revolutionary approaches involves the code that enables life itself: deoxyribonucleic acid, or DNA.
DNA is the original IT. Humans have been recording our existence in written form since around 3,400 BC, but life itself is the physical expression of data, and may date back 4.1 billion years.
This most ancient of codes may allow us to archive information for thousands of years or even longer. DNA is amazingly persistent. An entire ecosystem was reconstructed from 2-million-year-old DNA in Greenland, revealing the plants and animals that once inhabited the region. DNA has been sequenced from 700,000-year-old horse bones and recovered from million-year-old mammoth remains as well.
The double helix structure of DNA was described in 1953 by James Watson and Francis Crick, drawing on X-ray diffraction work by Rosalind Franklin and Maurice Wilkins. Since then, we have learned to synthesize artificial DNA to serve various purposes in biology and medicine. Sequences of up to 500 nucleotides can be created, but they must then be assembled into larger strands, which must also be indexed so they can later be retrieved.
A book and several images, as well as a JavaScript program, were encoded in DNA in 2012. The word ‘Hello’ was written in DNA in 2018. A range of techniques has been developed for converting the binary code that powers our current digital data systems into organic molecules, and for retrieving it again. They remain, however, highly inefficient. A chip developed in 2021 to control DNA synthesis relies on the toxic chemical solvent acetonitrile, which could damage other server components, making it undesirable as a large-scale solution.
Still, the technology remains promising. While traditional storage holds a single bit, a zero or a one, in each position, DNA can in principle store two bits in a single base: adenine (A), cytosine (C), thymine (T), or guanine (G). In practice, however, most encoding schemes store closer to one bit per base. In a double-stranded helix, A pairs with T and C pairs with G. Both double strands and single strands have been proposed for data storage.
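To make the two-bits-per-base idea concrete, here is a minimal sketch in Python. The particular bit-to-base mapping is an arbitrary illustration, not any standard scheme, and it ignores the practical constraints discussed later in this article.

```python
# A minimal sketch of the two-bits-per-base idea. The mapping below is
# an arbitrary illustration, not a standard encoding scheme.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def encode(bits: str) -> str:
    """Map a binary string (even length) to a DNA sequence."""
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(sequence: str) -> str:
    """Map a DNA sequence back to the original binary string."""
    return "".join(BASE_TO_BITS[base] for base in sequence)

print(encode("0110001111"))                      # -> CGATT
assert decode(encode("0110001111")) == "0110001111"
```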
The DNA in a single human body, with the genome copied into trillions of cells, represents on the order of tens of zettabytes of information, not far off all the data currently online. The density of this storage method thus has significant appeal, and it is a major target for technologists who wish to optimize our current, inefficient data storage methods.
Here, InformationWeek delves into the technology behind DNA storage, with insights from Erfane Arwani, CEO and co-founder of digital data storage startup Biomemory, and Emily Leproust, CEO and co-founder of synthetic DNA manufacturer Twist Bioscience.
Reasons for DNA Storage
The exploration of DNA as a storage medium may seem like a novelty, a scientific indulgence pursued for its own sake. But in fact, it addresses some very real concerns.
Beginning in 2005, more data was stored in digital formats than in analog formats such as books or film. According to Gartner, demand for storage may exceed capacity by some two thirds by 2030. To meet that demand, we will need enormous numbers of additional data centers, and the manufacturing capacity to supply the technology that powers them. By one estimate, demand for silicon may exceed supply by 2040. Looking at other, more efficient technologies is not just prudent but urgent.
DNA storage could potentially hold the roughly 147 zettabytes of information currently on the Internet in a space the size of a sugar cube. Of course, even far less efficient versions of the technology would still be incredibly helpful.
Current storage methods have their own vulnerabilities. For example, the magnetic tape that stores much of our data requires very specific conditions, including low temperature and humidity. This necessitates significant energy expenditure for climate control. While storage capacity has increased, it is not keeping pace with demand. And the lifespan of these tapes is limited. While some tapes may last up to a century, most will degrade after 10–20 years. They must also be replaced as new generations of hardware are produced.
DNA is a thousand times denser than even our most efficient flash memory technology and, depending on the technology used, can store data without sophisticated, energy-intensive controls once it is encoded. This is a significant benefit given that climate control may account for close to 50% of the cost of running a data center. And unlike current storage media, DNA is data.
Properly stored, data encoded in DNA can last as long as, if not longer than, the DNA recovered from mammoth remains and other natural sources. The DNA Storage Alliance brings together a wide array of organizations working to optimize conditions for storing data in DNA.
Encouragingly, the cost of DNA sequencing has dropped significantly. For example, the cost of the project that sequenced the first human genome was nearly $3 billion. Now, a single human genome can be sequenced for around $600.
Types of DNA Storage
Many techniques have emerged for storing data written as DNA. It can be stored in test tubes, either dried or in liquid. It can be dried and stored on pieces of paper or on glass. It can be encapsulated in particles of sugar, silica, or other substances. And it can even be encoded in the genomes of living microorganisms.
“One of the advantages of DNA as a storage system is that it’s immutable. It’s nature’s storage system and it’s designed to store crucial data for as long as life exists,” Leproust enthuses.
Emily Leproust, Twist Bioscience
Storage in test tubes, however, has proven less than ideal. Light and moisture can quickly degrade the sequences. One experiment demonstrated that DNA stored at ambient temperature in moisture-free test tubes degraded to 10% of its original concentration within a year.
Other forms of storage have proven more effective. “There isn’t anything special that needs to be done to store it once it’s in its proper container,” Leproust says. “It consumes virtually no energy once it’s stored. It doesn’t need to be kept cool and the data doesn’t need to be migrated, unlike hard disc drives or tape.”
Storage in dehydrated form on paper is common practice in DNA research laboratories, and chemicals have been developed to stabilize it. One company, Catalog, is developing a system that uses inkjet printing to deposit nucleotides and enzymes onto a sheet of film. An incubation process then facilitates the assembly of these components into DNA.
DNA encased in protective substances can be labeled with short DNA sequences that identify the contents so they can later be extracted. Silica appears to be a leading candidate for encapsulation: sequences are sealed in silica particles and tagged with short strands of DNA for later retrieval. Experiments that simulate degradation over time have shown that silica is highly protective.
Sugars and silk protein are also used. Salts such as calcium phosphate mimic the bone that has preserved DNA later extracted from long-dead animals, suggesting that they may offer similar long-term protection. While not all these techniques have been deployed in data-to-DNA applications, they show potential. As some researchers put it, DNA preserved this way is artificially fossilized.
As Arwani explains, multiple technologies may be employed. “We concentrate on a combination of encapsulation, desiccation, error correction codes (ECC), and nano-coating,” he says of Biomemory’s approach. “Encapsulation encloses DNA within protective materials, safeguarding it from environmental degradants and significantly enhancing its stability. Desiccation and cooling further contribute to preserving DNA by reducing degradation rates in dry and cool environments. Error correction codes are crucial for ensuring data integrity, encoding data in ways that allow for the detection and correction of errors, thereby preventing data loss despite DNA degradation. Nano-coating offers additional protection against physical and chemical damage at the nanoscale, ensuring the long-term stability and integrity of DNA data storage.”
Bacterial and yeast cells can be programmed to carry DNA data as well. In one experimental technique, data is stored in plasmids, minuscule rings of DNA carried by bacteria. Because bacteria exchange this information in a primitive form of mating called conjugation, other bacteria can be recruited to retrieve the DNA from these living databases. The bacteria that house the DNA are held in place by antibiotics, while the bacteria that pick up the information they carry are resistant and can thus cross the boundary. While this method is not yet practical at scale, it suggests what might be possible with minor modifications to current technology.
Some organisms can be induced to sporulate, entering a dormant form that can later be reawakened under the right conditions. This is an appealing approach because the normal processes of living organisms lead to mutations, and thus to corruption of the data. Storing the information in dormant organisms may reduce these effects.
Cellular machinery can also be adjusted to reduce mutation rates, and multiple backup copies can be created for comparison and the later reassembly of a relatively unaltered dataset. Yeast has proven useful in this regard because its nucleosomes, which package DNA, have inherent mechanisms that repair mutations.
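The comparison-and-reassembly step can be illustrated with a simple majority vote across redundant copies. This is a toy sketch that assumes the sequenced copies are already aligned and of equal length; real pipelines use far more sophisticated alignment and consensus algorithms.

```python
from collections import Counter

def consensus(copies: list[str]) -> str:
    """Recover a sequence by majority vote at each position across
    aligned copies of equal length."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*copies)
    )

# Three reads of the same strand, each with one independent error:
reads = ["ACGTAC", "ACGTTC", "ACCTAC"]
print(consensus(reads))  # -> ACGTAC
```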
Transferring Data to DNA
For DNA to be a viable means of storing data, the 0s and 1s used to encode it must be translated into nucleobases: A, G, C, and T. Metadata must be included to ensure that it can later be decoded. Repeating the same nucleobase more than three times in a row (a homopolymer) can lead to errors, so such runs must be avoided. Once the code has been translated, the bases themselves must be synthesized.
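One common trick for avoiding homopolymer runs, used in rotating codes along the lines of the scheme described by Goldman and colleagues in 2013, is to encode the data in base 3 and map each trit to one of the three bases that differ from the previously written base. A minimal sketch, with the trit values assumed to come from an earlier binary-to-ternary conversion:

```python
# Rotating code sketch: each trit (0, 1, or 2) selects one of the
# three bases that differ from the base just written, so no base
# can ever repeat and homopolymers are avoided by construction.
BASES = "ACGT"

def trits_to_dna(trits: list[int], prev: str = "A") -> str:
    out = []
    for t in trits:
        choices = [b for b in BASES if b != prev]  # three legal next bases
        prev = choices[t]
        out.append(prev)
    return "".join(out)

def dna_to_trits(seq: str, prev: str = "A") -> list[int]:
    trits = []
    for base in seq:
        choices = [b for b in BASES if b != prev]
        trits.append(choices.index(base))
        prev = base
    return trits

data = [0, 2, 2, 1, 0, 2]
dna = trits_to_dna(data)
print(dna)                          # -> CTGCAT, no repeated bases
assert dna_to_trits(dna) == data
```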
“Naturally, DNA occurs as a double strand, which, due to covalent bonds in this conformation, is incredibly durable and can degrade very slowly — ranging from several decades to several million years. In contrast, aging experiments with single-stranded DNA have shown significant fragility,” Arwani explains.
Chemical synthesis of DNA relies on the phosphoramidite method, which builds sequences from modified nucleosides (essentially nucleotides minus a phosphate group). In addition to producing toxic byproducts, chemical synthesis is fairly slow: writing a single base may take as long as six minutes.
Current chemical synthesis technologies can create stretches of DNA only some 200-300 bases long, limiting how much data any one strand can carry.
More sustainable methods that use enzymes, notably terminal deoxynucleotidyl transferase, have been developed. Enzymatic methods are also faster and can potentially create longer strands. Their byproducts are less toxic. However, they have only been able to encode relatively small amounts of data.
Both methods can be combined with ligation, which joins pre-assembled DNA sequences to encode data more efficiently.
Once the data has been encoded in DNA sequences, the sequences must be replicated using the polymerase chain reaction (PCR), which applies repeated cycles of heating to separate DNA strands, followed by exposure to enzymes that replicate them. Fortunately, this is an efficient process. The duplicates serve as backups: retrieving DNA-encoded data is destructive, so backups are necessary to ensure its persistence.
“It’s fast, easy, and very inexpensive to make thousands of copies of DNA using PCR. You can store different copies in different locations for safety,” Leproust says.
Tagging and mapping are essential as well: segments are labeled with DNA or other chemicals to enable efficient retrieval. Some researchers have pointed out that artificially created DNA should be tagged as such, so that hypothetical analysts in the distant future do not mistake it for biologically created DNA.
“Incorporating a tailored error-correcting code is crucial for further reducing inaccuracies,” Arwani adds.
Reading DNA
The polymerase chain reaction is also used to read DNA. A primer, or short sequence of DNA, locates a particular tagged sequence, binds to it, and amplifies it so it can be read. This is known as sequencing by synthesis (SBS). The process typically pulls in irrelevant stretches of DNA and destroys them as well, hence the need for multiple copies of each dataset encoded in the DNA.
“When retrieving DNA data, current technologies may require 40 to 100 DNA molecules of the same data for reading, and these molecules are destroyed in the process. However, at Biomemory, we view this not as a limitation but as a manageable aspect of our technology,” Arwani says. “When synthesizing DNA, we produce a controlled quantity of identical molecules. Depending on the anticipated number of reads, we can generate anywhere from a few hundred to several million molecules. This ability ensures that sufficient data copies are available for multiple retrieval processes without risking data loss.”
Erfane Arwani, Biomemory
DNA that has been encapsulated and labeled with specific terms may allow for more efficient retrieval. The primers bind to the tags that correlate to those terms and thus only pull those DNA “files” while leaving other files out of the process.
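Conceptually, this kind of random access works like a key-value lookup, with the primer acting as the key. The sketch below is a purely in-silico toy: the tag sequences are invented for illustration, and real primer design must contend with binding chemistry that simple string matching ignores.

```python
# Toy model of primer-based random access: each strand carries a short
# address tag, and a query "amplifies" (here: selects) only the strands
# whose tag matches. Tag and payload sequences are made up.
pool = [
    {"tag": "ACGTCA", "payload": "TTGACC"},   # strand 1 of file A
    {"tag": "GGATCC", "payload": "CAGTTA"},   # strand 1 of file B
    {"tag": "ACGTCA", "payload": "GGCATA"},   # strand 2 of file A
]

def retrieve(pool: list[dict], primer: str) -> list[str]:
    """Return the payloads of all strands whose tag matches the primer."""
    return [strand["payload"] for strand in pool if strand["tag"] == primer]

print(retrieve(pool, "ACGTCA"))  # -> ['TTGACC', 'GGCATA']
```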
DNA can also be sequenced by passing strands through microscopic holes, called nanopores, in an electrically resistant membrane. As each base passes through a nanopore, it produces a distinctive electrical signal, which is recorded by a chip connected to the membrane.
In both technologies, the sequences of bases are then converted back into 0s and 1s so the data can be read in digital form.
SBS sequencing is cheaper and more accurate, but it is also more destructive and less efficient. Nanopore sequencing has higher error rates and requires longer DNA sequences, making it a poor fit for current DNA storage technologies, which use short sequences.
Complications
While DNA storage is promising, it is not yet practical for storing large amounts of information, nor for retrieving it at the speeds needed to support current data demands.
“Currently, DNA is utilized for the robust and long-term storage of small data quantities (a few kilobytes) because we are still in the process of scaling our machinery for data centers, which will store vast amounts of data,” Arwani reports.
“Where we see the most useful application of DNA data storage today is in storage of archival data, also known as cold data,” Leproust adds. “DNA data storage could work alongside current storage applications to store archival data and meet the very real need for more storage options, and more sustainable storage options.”
According to one estimate, it would cost up to $1 trillion to write 1 million gigabytes of data in DNA format. There are roughly 147 trillion gigabytes of data on the Internet. Enzyme solutions, while not toxic, are also extremely expensive. Recycling these solutions may help to reduce costs in the future. One paper estimates that using a fresh enzyme solution for each round of synthesizing 1,000 strands of 1,000 nucleotides would cost $136,000, while recycling it would reduce the cost to $136.
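Some quick arithmetic puts those figures in perspective; the numbers below simply restate the estimates quoted above.

```python
# Back-of-the-envelope arithmetic on the cost estimates quoted above.
total_write_cost = 1e12        # up to $1 trillion...
gigabytes_written = 1e6        # ...to write 1 million gigabytes
print(total_write_cost / gigabytes_written)  # -> $1,000,000 per gigabyte

fresh, recycled = 136_000, 136  # enzyme cost per synthesis round, in dollars
print(fresh / recycled)         # -> 1000.0, a thousandfold reduction
```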
“To overcome these challenges, laboratories and companies have been focusing on developing technologies that are more tolerant of errors and, crucially, rely on far less expensive consumables,” Arwani relates. “At Biomemory, we’ve taken a pioneering approach by producing our consumables in bioreactors using sugar, eliminating the need for heavy chemical processes, akin to methodologies found in the food industry.”
Error rates are also a concern. Bases may be erroneously inserted or deleted, much as 0s and 1s may be accidentally flipped or dropped in digital data. Avoiding repetitive sequences is essential, as is keeping the proportion of G and C bases at or below 50%, since sequences rich in G and C are more difficult to sequence.
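These two constraints, no long homopolymer runs and a bounded G+C fraction, are easy to express as a screening check. A minimal sketch, with illustrative thresholds:

```python
import re

def passes_constraints(seq: str, max_run: int = 3, max_gc: float = 0.5) -> bool:
    """Screen a candidate strand: no base repeated more than max_run
    times in a row, and a G+C fraction no higher than max_gc.
    Thresholds here are illustrative; real systems tune them."""
    if re.search(r"(.)\1{%d,}" % max_run, seq):  # run longer than max_run
        return False
    gc_fraction = (seq.count("G") + seq.count("C")) / len(seq)
    return gc_fraction <= max_gc

print(passes_constraints("ACGTACGT"))  # True: balanced, no long runs
print(passes_constraints("AAAAGTCT"))  # False: run of four A's
print(passes_constraints("GCGCGCAT"))  # False: 75% G+C
```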
While error rates for SBS sequencing are fairly low, errors do occur. And they may be as high as 10% for nanopore sequencing.
Errors can also be introduced during storage if the DNA degrades or comes into contact with other substances, underscoring the need for redundant sets of DNA data. UV radiation can damage DNA, as can exposure to water and oxygen.
Still, Leproust thinks this may be of lesser concern than other issues. “Error correction algorithms are used routinely and are a part of all workflows. In fact, the DNA writing used for storage does not need to be on par with medical grade DNA writing and can tolerate fairly substantive errors given the correction algorithms,” she says.
Balancing proprietary technologies against standardized protocols will become increasingly important as DNA storage comes to scale. The long-term storage potential will be useless if obscure technologies render the data inaccessible decades or centuries down the line.
Security protocols also need to be developed to ensure that malicious code cannot be introduced into DNA data and that existing data is properly encrypted.
Still, for all its challenges, DNA data storage appears to be developing at a rapid pace and may soon offer highly efficient solutions to our looming storage challenges.