Hamming distance is frequently used as a metric in the design of primers to support random access of files in DNA-based storage systems because a high Hamming distance is an effective proxy for low hybridization likelihood. Hamming distance measures how many symbols differ between two strings of equal length; in our case, the strings of interest are two DNA primer sequences.
When analyzing two primers, p1 and p2, we computed their Hamming distance directly by aligning the sequences and counting the positions at which they differed. A Hamming distance of 0 therefore means the two primers were the same sequence, and a Hamming distance equal to the primer length means that every position differed. In terms of hybridization, however, we were interested in whether p1 would bind to the reverse complement of p2, as that binding site was present on the data strand. For convenience, we report the Hamming distance between the two coding primers, but the hybridization likelihood was always analyzed for p1 against the reverse complement of p2. Hence, a Hamming distance of 0 implies that hybridization was guaranteed, whereas a high Hamming distance implies that hybridization was unlikely, although caveats existed. For example, if p1 was identical to p2 but shifted by one position, the pair had a high Hamming distance yet a low edit distance, and such a primer would almost certainly still bind. To ensure that low edit distances did not skew the findings, primers with an edit distance much lower than their Hamming distance were screened out.
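As a concrete illustration of the two metrics, the following minimal Python sketch (ours for illustration, not the screening code from the repository32) computes both distances for a shifted primer.

```python
# Minimal illustration of the two metrics used when screening primers:
# Hamming distance and edit (Levenshtein) distance.

def hamming(p1: str, p2: str) -> int:
    """Number of positions at which two equal-length primers differ."""
    assert len(p1) == len(p2)
    return sum(a != b for a, b in zip(p1, p2))

def edit_distance(p1: str, p2: str) -> int:
    """Levenshtein distance, which also captures shifted (indel) similarity."""
    prev = list(range(len(p2) + 1))
    for i, a in enumerate(p1, 1):
        curr = [i]
        for j, b in enumerate(p2, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (a != b)))   # substitution
        prev = curr
    return prev[-1]

# A primer identical to another but shifted by one position has a maximal
# Hamming distance yet an edit distance of only 2, so it would be screened out.
p1 = "ACGTACGTACGTACGTACGT"
p2 = "CGTACGTACGTACGTACGTA"   # p1 shifted by one position
print(hamming(p1, p2), edit_distance(p1, p2))   # 20 2
```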
While high Hamming distances of 10 or more were common in past literature, low Hamming distances and their relationship to hybridization were of particular interest to our design. To better understand the potential of exploiting primer binding among similar but non-complementary primers, an in silico analysis was used to predict the likelihood of primer hybridization as a function of Hamming distance. Our approach was based on a Monte Carlo simulation that considered the likelihood of hybridization among many primer pairs. One primer was designed specifically for data storage using procedures common in the field: it had a GC content between 40 and 60%, a melting temperature between 50 and 55 °C, and no long homopolymers. It was then randomly mutated into new primers at varying Hamming distances, from 1 to N, where N was the length of the primer. Each mutated primer was produced by generating a random index from 1 to N with equal likelihood and randomly picking a new base for that position from the other three bases with equal probability; the mutation process was repeated until a primer at the desired distance was achieved. Primers with a much lower edit distance were screened out in this step; such primers arose with very low probability given the probabilistic nature of the mutation step, and only a handful were observed over all trials. Using NUPACK’s complex tool, the ΔG of the complex formed by the original primer binding to the reverse complement of the mutated primer was estimated23. ΔG values below a threshold of −10 kcal/mol were interpreted as binding in our analysis. The Monte Carlo simulation included at least 10,000 trials for each Hamming distance to estimate the likelihood of hybridization. The percentage of mutated primers with a high chance of hybridizing at each Hamming distance is reported as the hybridization % in Fig. 1b.
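The mutate step of the simulation can be sketched as follows. This is a simplified illustration with our own naming, not the analysis code itself; the edit-distance screen and the NUPACK ΔG estimation are indicated only as comments.

```python
import random

BASES = "ACGT"

def mutate_to_hamming(primer: str, target_hd: int, rng: random.Random) -> str:
    """Substitute a uniformly random position with one of the other three
    bases, repeating until the mutant reaches the target Hamming distance."""
    mutant = list(primer)
    while sum(a != b for a, b in zip(primer, mutant)) != target_hd:
        i = rng.randrange(len(mutant))
        mutant[i] = rng.choice([b for b in BASES if b != mutant[i]])
    return "".join(mutant)

rng = random.Random(0)
original = "ATGCGTACCGGATTAGCCAA"   # illustrative primer, not one from the study
for hd in range(1, 11):
    mutant = mutate_to_hamming(original, hd, rng)
    # In the actual analysis, mutants with an edit distance much lower than
    # their Hamming distance were screened out, and NUPACK's complex tool was
    # then used to estimate the dG of the original primer bound to the reverse
    # complement of the mutant (dG below -10 kcal/mol counted as hybridization).
    print(hd, mutant)
```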
The Python program that performed this analysis is included in our code repository as part of the Supplementary Material32.
Primers were selected for use in File Preview using a screening process similar to that used for the Hybridization Model. However, instead of generating many trials, only a handful of primers were produced at each desired Hamming distance. These primers were then subjected to additional experimental screening.
Using one primer sequence as the 0 Hamming distance amplifying primer, 10 variable strand addresses at each even-numbered Hamming distance were used as template strands for qPCR amplification (Supplementary Table 1). All strands were amplified using the same primer pair since they contained the same forward primer binding sequence while varying the reverse primer binding sequence. Reactions were performed in 6 μL format using SsoAdvanced Universal SYBR Green Supermix (BioRad). A range of primer concentrations (125–500 nM), template strand concentrations (2E3–2E6 strands/μL), and annealing temperatures (40–60 °C) were tested. Thermocycler protocols were as follows: 95 °C for 2 min and then 50 cycles of 95 °C for 10 s, 40–60 °C for 20 s, and 60 °C for 20 s, followed by a melt curve increasing from 60 °C to 95 °C in 0.5 °C increments held for 5 s each. Data were analyzed using BioRad CFX Maestro. Cq values (i.e., the cycle number at which a sample reached a defined fluorescence threshold) and melt curve data (i.e., the temperature at which a sample denatured during heating) were used for analysis. Successful amplifications were defined as crossing the Cq threshold before the negative control while also producing an individual peak (i.e., a single product) on the melt curve.
Using four distinct primer pairs as the 0 Hamming distance amplifying primers, 5–30 unique strands (60 bp) containing variable address pairs at Hamming distance 2, 3, 4, or 6 were tested as template strands alongside 0 HD strands (200 bp) in competitive qPCR amplifications (Supplementary Table 2). All strands designed from the same original primers were amplified using the 0 HD primer pair. Reactions were performed in 6 μL format using SsoAdvanced Universal SYBR Green Supermix (BioRad). Template strands were supplied at equal copy number concentrations for the 0 HD and variable HD strands (1.67E5 strands/μL). A range of primer concentrations (125–500 nM) and annealing temperatures (40–60 °C) were tested. Thermocycler protocols were as follows: 95 °C for 2 min and then 50 cycles of 95 °C for 10 s, 40–60 °C for 20 s, and 60 °C for 20 s. Final products were diluted 1:60 in 1×TE before analysis using high-sensitivity DNA fragment electrophoresis (Agilent DNF-474; Advanced Analytical 5200 Fragment Analyzer System; data analysis using ProSize 3.0). The ability of a primer to variably amplify a strand with a nonspecific primer binding site under different PCR conditions, or amplification tunability, was calculated using the following equation (concentrations in nmol/L):
File Preview was performed on JPEG images due to their widespread popularity, their small storage footprint, and their support for organizing data within a file in a way that works synergistically with the goals of File Preview in this work. In particular, JPEG’s progressive encoding33 allowed image data to be partitioned into scans by color band and by frequency. Through careful organization of the file, a low-resolution grayscale image was constructed from a small percentage of the file’s data, or an increasingly higher-resolution image was obtained by reading a greater percentage of the file32. For the File Preview operations, the JPEG information was arranged such that a 0 HD access pulled out a small amount of data and produced a low-resolution image. By tuning the access conditions as described, more of the file was accessed and a higher-resolution image was produced.
The most important aspects of the JPEG format are described here to explain how Preview works. JPEG holds information in three color bands, known as Y, Cb, and Cr, that together encode information for all pixels in an image. Y represents luminosity, Cb is a blue color band, and Cr is a red color band. Together, these components can represent any conventional RGB color. Each pixel of an image can be thought of as having a tuple of Y, Cb, and Cr components, although they are not actually stored that way.
JPEG does not store images as a naive matrix of (Y, Cb, Cr) pixel values, since doing so would waste storage when many pixels share the same color. Instead, each 8 × 8 block of pixels from each color band is converted into a frequency domain representation using the 2-D Discrete Cosine Transform (DCT). The 2-D DCT has the useful effect of partitioning the data into low-frequency and high-frequency components. Each 8 × 8 block becomes a linearized vector of 64 values ordered from low frequency to high frequency. The first value in the vector is known as the DC value because it represents an average value across the original 8 × 8 pixel block. For example, if an original 8 × 8 block were entirely white, the Y band would have a DC value of 255, indicating that the average value over the block was white. The remaining 63 entries represent the higher-frequency components, known as the AC band. For an all-white block, the rest of the vector would be 0, indicating no other content.
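The following sketch, assuming NumPy and SciPy, illustrates the transform and the low-to-high-frequency linearization for an all-white block; the zigzag ordering is the conventional JPEG ordering, and the code is illustrative rather than taken from the encoder used here.

```python
import numpy as np
from scipy.fft import dctn

# An all-white 8x8 block from the Y band (pixel values 0-255).
block = np.full((8, 8), 255.0)

# The 2-D DCT separates the block into one DC (average) term and 63 AC terms.
coeffs = dctn(block, norm="ortho")

# Conventional JPEG zigzag order walks the coefficient matrix from the
# low-frequency corner (the DC value) toward the high-frequency corner.
zigzag = sorted(((i, j) for i in range(8) for j in range(8)),
                key=lambda p: (p[0] + p[1],
                               p[0] if (p[0] + p[1]) % 2 else p[1]))
vector = [coeffs[i, j] for i, j in zigzag]

print(vector[0] / 8)                    # DC term / 8 = block average = 255.0
print(max(abs(v) for v in vector[1:]))  # AC terms are ~0 for a uniform block
```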
In a progressive encoding, each color band is encoded in scans. A scan is the aggregation of all values from a given position in the linearized vector across all 8 × 8 blocks. For example, the first scan of a file includes all of the DC values from the Y band across all 8 × 8 blocks. The scan of DC values for a given band is denoted Y[0], Cb[0], or Cr[0]. The Y[0] scan by itself is essentially a low-resolution grayscale image; Cb[0] and Cr[0] add low-resolution color information.
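A minimal sketch of why the Y[0] scan acts as a low-resolution grayscale image: collecting the DC value of every 8 × 8 block and rescaling yields an 8×-downscaled thumbnail. The function below is an illustration, not the decoder used in this work.

```python
import numpy as np
from scipy.fft import dctn

def y_dc_thumbnail(y_band: np.ndarray) -> np.ndarray:
    """Collect the DC value of every 8x8 block of the Y band (the Y[0] scan).
    Dividing by 8 recovers each block's average intensity, i.e. an
    8x-downscaled grayscale preview of the image."""
    h, w = y_band.shape
    dc = np.empty((h // 8, w // 8))
    for bi in range(h // 8):
        for bj in range(w // 8):
            block = y_band[bi * 8:(bi + 1) * 8, bj * 8:(bj + 1) * 8]
            dc[bi, bj] = dctn(block, norm="ortho")[0, 0]
    return dc / 8.0
```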
The DC scans precede the AC scans. The AC scans group subsequent AC components together; each scan may include a single value from the linearized vector or a range of values. For example, Y[1:5] includes indices 1 through 5 of the linearized vectors taken from all 8 × 8 blocks in the Y band. All indices from 1 through 63 must be included in at least one scan, and this is repeated for all bands. The JPEG standard additionally compresses each scan to save storage space; the details of that mechanism are not pertinent to Preview and are omitted, but the scans used here follow the standard and are stored in compressed form.
The JPEG files were first encoded into 42 scans: Y[0], Y[1:5], Cb[0], Cr[0], Y[6:10], Y[11:15], Y[16:20], Y[21:25], Y[26:30], Y[31:35], Y[36:40], Y[41:45], Y[46:50], Y[51:55], Y[56:60], Y[61:63], Cb[1:5], Cr[1:5], Cb[6:10], Cr[6:10], Cb[11:15], Cr[11:15], Cb[16:20], Cr[16:20], Cb[21:25], Cr[21:25], Cb[26:30], Cr[26:30], Cb[31:35], Cr[31:35], Cb[36:40], Cr[36:40], Cb[41:45], Cr[41:45], Cb[46:50], Cr[46:50], Cb[51:55], Cr[51:55], Cb[56:60], Cr[56:60], Cb[61:63], Cr[61:63].
The scans were grouped into partitions. Wuflab logo and Wright Glider 2 had four partitions, and Wright Glider 1 and Earth had three partitions. In all cases, the first and second partitions, if accessed alone, provided low-resolution images that were recognizable as the image. For the Wuflab logo and Wright Glider 2 files, the third partition contained all remaining scans. For the others, the third partition added DC color information and some higher frequencies of the Y band to improve image quality, and the fourth partition contained all remaining scans.
Each partition was treated as a block of data and encoded into DNA as a unit, and each partition was tagged with its own primers; higher-numbered partitions were given primers with a greater Hamming distance.
The encoding process is described in Supplementary Fig. 7. Each partition was encoded into DNA using a multilevel approach. First, the JPEG file was partitioned into scans, which were grouped into partitions as described above. Then, each partition was divided into blocks of 1665 bytes, each interpreted as a matrix of 185 rows and 9 columns with one byte per position. Blocks smaller than 1665 bytes at the end of a file or partition were padded out with zeros. A Reed-Solomon (RS) outer code with parameters [n = 255, k = 185, d = 71] added rows to each block to compensate for the possible loss of strands within a block. Each row was given a unique two-byte index and then appended with error-correction symbols from an RS inner code [n = 14, k = 11, d = 4] that protected both the data and index bytes.
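For illustration, the inner-code step on a single row can be sketched with the open-source reedsolo package standing in for the RS implementation in our repository32; the ordering of index, data, and parity bytes shown here is an assumption.

```python
from reedsolo import RSCodec   # stand-in for the RS implementation used here

# Inner code [n=14, k=11, d=4]: 2 index bytes + 9 data bytes per row, plus
# n - k = 3 parity bytes, giving minimum distance d = n - k + 1 = 4.
inner_rs = RSCodec(3)

row_index = (0).to_bytes(2, "big")    # unique two-byte row index
row_data = bytes(range(9))            # 9 payload bytes from one matrix row
codeword = inner_rs.encode(row_index + row_data)
assert len(codeword) == 14

# The outer code works analogously down the columns of each 185-row block:
# RS [n=255, k=185, d=71] appends 70 parity rows, so up to 70 lost (erased)
# rows per block can be recovered.
```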
Each row of bytes was converted into a DNA sequence using a comma-free code that mapped each byte to a unique codeword sequence. The codewords were designed using a greedy algorithm to be GC-balanced and to have an edit distance of at least two to all other codewords. Each codeword had a length of 8 nts. The last step appended primers to each end of the sequence and inserted a restriction enzyme cut site in the middle of the strand. Each partition of the JPEG file used different primer binding sites, so these primer sequences were given as inputs for each partition as it was encoded.
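A simplified sketch of the greedy codeword construction is shown below; it enforces only the GC-balance and minimum-edit-distance constraints named above, whereas the actual codebook also satisfies the comma-free property and other sequence constraints.

```python
from itertools import product

def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance between two codewords."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def gc_balanced(word: str) -> bool:
    return sum(c in "GC" for c in word) == len(word) // 2   # exactly 50% GC

# Greedily accept 8-nt codewords that are GC-balanced and at least 2 edits
# from every previously accepted codeword, until 256 (one per byte value)
# have been collected.
codewords = []
for cand in ("".join(p) for p in product("ACGT", repeat=8)):
    if gc_balanced(cand) and all(edit_distance(cand, cw) >= 2 for cw in codewords):
        codewords.append(cand)
        if len(codewords) == 256:
            break

byte_to_codeword = {byte: cw for byte, cw in enumerate(codewords)}
```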
An additional set of flanking primers was added to each strand to enable the entire library to be amplified at once using a single common primer. The final sets of strands for all files were synthesized as a DNA library.
The four-file synthetic DNA library was ordered from Twist Biosciences. Flanking primer amplifications unique to each subset of strands (Supplementary Table 3) were optimized, and the resulting products were used in screening and preview reactions. Each subset of strands within a file encoded an increasing percentage of the stored image and contained a unique restriction enzyme cut site to allow for rapid sample analysis. It was determined that each block of data encoded in strands with increasing Hamming distance binding sites (2, 4, and 6 HD) needed to be physically stored with extra copies of the nonspecific strands: 10×, 100×, and 1000×, respectively. A screen of variable PCR conditions was conducted on files from the library prior to preview and full-access reactions. Reactions were performed in 6–50 μL format using SsoAdvanced Universal SYBR Green Supermix (BioRad) or Taq polymerase (Invitrogen). Conditions varied during testing included annealing temperature (40–60 °C), annealing and extension timing (20–90 s), number of cycles (25–40), primer concentration (62.5–1000 nM), polymerase concentration (0.5–2× recommended units), dNTP concentration (200–800 μM), MgCl2 concentration (0.75–3 mM), KCl concentration (50–200 mM), and the absence or presence of 0.1–1% Triton X-100, 0.1–1% BSA, 0.1–1% Tween-20, 2–8% DMSO, 0.1–3.5 mM betaine, or 2% DMSO plus 0.1 mM betaine. Reaction products (1 μL) were added to restriction enzyme reactions to cut the 0, 2, 4, or 6 HD sections of the products. Digestion products were diluted 1:3 in 1×TE for analysis using high-sensitivity DNA fragment electrophoresis (Agilent DNF-474; Advanced Analytical 5200 Fragment Analyzer System; data analysis using ProSize 3.0). Quantification data were taken directly from the Fragment Analyzer. Undigested preview, full-access, and intermediate samples were then analyzed via Illumina next-generation sequencing (Genewiz Amplicon-EZ).
Template DNA was amplified using 0.5 μL of Taq DNA polymerase (5 units/μL, Invitrogen) in a 50 μL reaction containing 1× Taq polymerase Rxn Buffer (Invitrogen), 2 mM MgCl2 (Invitrogen), the sense and antisense primers at 1E13 strands each, and dATP, dCTP, dGTP, dTTP (NEB), dPTP (TriLink), and 8-oxo-dGTP (TriLink), each at 400 μM. PCR conditions were 95 °C for 30 s, 50 °C for 30 s, and 72 °C for 30 s for 35 cycles, with a final 72 °C extension step for 30 s.
In Fig. 3h, i and Supplementary Fig. 4, we refer to background size (GB). For clarity and ease of comparison, this value was calculated based on the total number of DNA strands. Each strand comprises 200 nts, of which 20 are used for each primer sequence, 16 for the index, and 8 for the checksum. Eight nts comprise each 1-byte codeword, so each strand addressed with a single primer pair contains 17 bytes of data. We assumed a 10-copy physical redundancy per unique strand to provide a conservative estimate for a realistic system, where multiple copies of each strand would likely be needed to avoid strand losses and inhomogeneous strand distributions. Thus, the total background size is divided by 10.
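The per-strand payload and background-size accounting follow directly from these numbers, as in the short sketch below (all values are from the text; the helper name is ours).

```python
# Worked example of the background-size accounting described above.
NT_PER_STRAND = 200
PRIMER_NT = 2 * 20        # forward and reverse primer binding sites
INDEX_NT = 16
CHECKSUM_NT = 8
NT_PER_BYTE = 8           # one 8-nt codeword per byte
PHYSICAL_REDUNDANCY = 10  # assumed copies of each unique strand

payload_nt = NT_PER_STRAND - PRIMER_NT - INDEX_NT - CHECKSUM_NT   # 136 nt
bytes_per_strand = payload_nt // NT_PER_BYTE                      # 17 bytes

def background_gb(total_strands: float) -> float:
    """Data represented by a pool of physical strands, assuming 10-fold
    physical redundancy per unique sequence."""
    return (total_strands / PHYSICAL_REDUNDANCY) * bytes_per_strand / 1e9

print(bytes_per_strand)       # 17
print(background_gb(1e9))     # 1.7 GB for 1e9 physical strands
```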
FASTQ files obtained from sequencing were all decoded successfully into images. Decoding proceeded in the reverse order of the steps shown in Supplementary Fig. 7. Files were reconstructed by placing all data blocks and JPEG file partitions into the correct order based on their indices. Since error correction was applied separately to each partition, decoding succeeded or failed at partition boundaries. If a partition was incomplete, it was omitted from the JPEG image. As long as the omitted partitions were the latter partitions taken from AC scans, their absence only reduced the quality of the JPEG image, making it appear lower resolution or grayscale depending on the scans lost in the partition. However, if the first partition in the file was missing or too erroneous to decode, the image would be unreadable. No experiment yielded an undecodable or unreadable image. The successfully decoded images are shown in Fig. 3b, f, and h.
To gain deeper insight into which strands were sequenced and their relative abundance, a clustering analysis was performed on all sequenced reads32. The Starcode algorithm is an open-source, efficient algorithm for performing an all-pairs search on a set of sequencing data to find all sequences within a given Levenshtein distance of each other34. To derive the number of reads for each encoded strand in the library, the algorithm was seeded with 20 copies of each strand from the library. Starcode was additionally given the following parameters: Levenshtein distance set to 8 edits, the clustering algorithm set to message passing, and the cluster ratio set to 5. The Levenshtein distance parameter defines the maximum edit distance allowed when determining whether a strand belongs to a cluster. The clustering algorithm attributed all reads for a given strand S to another strand V provided that the ratio of V’s reads to S’s reads was at least the cluster ratio. Hence, providing 20 copies of each expected strand ensured that each was well represented during clustering and was therefore treated as a centroid. With the clusters formed, each centroid was interrogated to make sure that it was a strand from the library and not an unexpected strand present during sequencing. If the centroid matched an expected strand defined by the encoded file(s), the number of reads for that strand was adjusted to the size of the cluster less the 20 seeded copies. These results are reported in Fig. 3c.
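A sketch of the read-count adjustment is shown below; it assumes Starcode's default tab-separated output of one centroid sequence and cluster size per line, which may differ in detail from the post-processing in our repository32.

```python
# Post-processing sketch for the clustering results, assuming Starcode's
# default tab-separated output of one "centroid<TAB>cluster size" per line.

SEED_COPIES = 20   # copies of each expected strand added before clustering

def library_read_counts(starcode_output_path: str,
                        library_strands: set) -> dict:
    """Return reads per expected strand, with the seeded copies removed."""
    counts = {}
    with open(starcode_output_path) as fh:
        for line in fh:
            centroid, size = line.rstrip("\n").split("\t")[:2]
            if centroid in library_strands:
                counts[centroid] = int(size) - SEED_COPIES
            # Centroids not in the library correspond to unexpected sequences
            # and are not counted toward the encoded strands.
    return counts
```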
Further information on research design is available in the Nature Research Reporting Summary linked to this article.