Researchers encode malware in DNA, compromise DNA sequencing software
With everyone from academics to Microsoft looking at the prospect of storing data using DNA, it
was probably inevitable that someone would start looking at the security implications.
Apparently, they're worse than most people might have expected. It turns out it's possible
to encode computer malware in DNA and use it to attack vulnerabilities on the computer that
analyzes the sequence of that DNA.
The researchers didn't find an actual vulnerability in DNA analysis software—instead, they
specifically made a version of some software with an exploitable vulnerability to show that the
risk is more than hypothetical. Still, an audit of some open source DNA analysis software shows
that the academics who have been writing it haven't been paying much attention to security best
practices.
More like a virus than most
DNA sequencing involves determining the precise order of the bases that make up a DNA strand.
While the process that generates the sequence is generally some combination of biology and/or
chemistry, once it's read, the sequence is typically stored as an ASCII string of As, Ts, Cs, and
Gs. If handled improperly, that chunk of data could exploit vulnerable software to get it to
execute arbitrary code. And DNA sequences tend to pass through a lot of software, which finds
overlapping sequences, aligns them to known genomes, looks for key differences, and more.
To see whether this threat was more than hypothetical, the researchers started with a really
simple exploit: store more data than a chunk of memory was intended to hold, and redirect program
execution to the excess. In this case, said excess contained an exploit that would use a feature
of the shell to connect to a remote server that the researchers controlled. If it worked, the
server would then have full shell access to the machine running the DNA analysis software.
Actually implementing that in DNA, however, turned out to be challenging. DNA with Gs and Cs
forms a stronger double-helix. Too many of them, and the strand won't open up easily for
sequencing. Too few, and it'll pop open when you don't want it to. Repetitive DNA can form
complex structures that get in the way of all the enzymes we normally use to manipulate DNA. The
computer code they wanted to use, however, had lots of long runs of the same character, which
made for a repetitive sequence that was very low in Gs and Cs. The company they were ordering DNA
from couldn't even synthesize it.
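The two constraints described above, G/C balance and repetitiveness, are easy to measure on a candidate sequence. The sketch below is only an illustration: the function names and the thresholds (40 to 60 percent G/C, runs of at most five identical bases) are assumptions made for the example, not the rules used by the researchers or by their DNA supplier.

```python
def gc_fraction(seq: str) -> float:
    """Return the fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def longest_run(seq: str) -> int:
    """Return the length of the longest run of a single repeated base."""
    best = current = 0
    previous = None
    for base in seq.upper():
        current = current + 1 if base == previous else 1
        best = max(best, current)
        previous = base
    return best


def looks_synthesizable(seq: str, gc_min: float = 0.4, gc_max: float = 0.6,
                        max_run: int = 5) -> bool:
    """Crude screen; the default thresholds are illustrative assumptions."""
    return gc_min <= gc_fraction(seq) <= gc_max and longest_run(seq) <= max_run
```

A sequence such as "AAAAAAAAAT" fails this screen on both counts, which is roughly the situation the first version of the malware was in.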
In the end, they had to completely redesign their malware so that its translation into nucleic
acids produced a DNA strand that could be synthesized and sequenced. The latter created another
problem. The most common method of sequencing is currently limited to reading a few hundred bases
at a time. Since each base carries two bits of information, that means the malware has to be
incredibly compact. That limits what can be done, and it explains why all this particular payload
did was open up a connection to a remote server.
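To make the two-bits-per-base arithmetic concrete, here is a minimal sketch of packing bytes into bases and back. The mapping (A=00, C=01, G=10, T=11) and the function names are assumptions chosen for the example; the article does not say which mapping the researchers used.

```python
BASES = "ACGT"  # assumed mapping: A=00, C=01, G=10, T=11


def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four bases, most significant bit pair first."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def dna_to_bytes(seq: str) -> bytes:
    """Decode a sequence produced by bytes_to_dna back into bytes."""
    if len(seq) % 4:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(seq), 4):
        value = 0
        for base in seq[i:i + 4]:
            value = (value << 2) | BASES.index(base)
        out.append(value)
    return bytes(out)


def max_payload_bytes(read_length_bases: int) -> int:
    """Whole bytes that fit in one read at two bits per base."""
    return read_length_bases * 2 // 8
```

Under this mapping, `bytes_to_dna(b"A")` gives "CAAC", and a 300-base read holds at most 75 bytes. A run of zero bytes turns into a long string of As, which shows why ordinary machine code tends to produce the repetitive DNA that is hard to synthesize.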
Then, there was the matter of getting the malware executed. Since this was a proof of concept,
the researchers made it easy on themselves: they modified an existing tool to create an
exploitable vulnerability. They also made some changes to the system's configuration to make the
execution of code at arbitrary memory locations easier (made the stack executable and turned off
memory address
randomization). While that makes the test environment less realistic, the goal was simply to
demonstrate that DNA-delivered malware was possible.
With everything in place, they ordered some DNA online then sent it off to a facility for
sequencing. When their sequences came back, they sent them through a software pipeline that
included their vulnerable utility. Almost immediately, the computer running the software
connected to their host, providing them with access to the machine. The malware worked.
Given how easy the authors made things—a known vulnerability and a number of safeguards turned
off—does this really pose a threat? There's good news and bad news here.
On the good side, there are the complications of translating computer instructions into DNA that
can be synthesized and sequenced. Plus there's the issue that most sequencing machines are
limited in how long a sequence they can read. The machine used in this work maxes out at 300
bases, which is the equivalent of 600 bits, and most facilities keep things shorter than that.
Longer reads are available, but they're also error prone, and any errors will typically disable
the malware.
But it's also common for the software used to analyze DNA to look for places where two short
sequences overlap and use that to build up longer sequences. This has the potential to expand
the size of the malware considerably, although less of the analysis software pipeline will be
exposed to these longer, assembled sequences.
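The overlap step can be sketched in a few lines. This is a toy version that only handles exact overlaps between two reads; real assemblers tolerate sequencing errors and work with many reads at once. The function names and the minimum overlap of three bases are assumptions made for the example.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of left that is also a prefix of right."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Join two reads at their overlap, or return None if they do not overlap."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]
```

For instance, `merge_reads("ACGTAC", "TACGGA")` joins the reads at the shared "TAC" and returns "ACGTACGGA", a sequence longer than either read on its own.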
Similar issues exist with how the malware is encoded. While the authors used each base to encode
two bits, DNA analysis software handles DNA in various ways internally. For example, if
sequencing doesn't provide a clear indication of what a base is, other characters may be used
(for example, N for any base, or R for G or A). Any software that handles these ambiguous bases
has to have a more complex encoding scheme; many simply use ASCII characters.
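A short sketch shows why ambiguous bases break a two-bit scheme. The table below lists the standard IUPAC nucleotide codes, of which N and R are the two mentioned above; the helper names are made up for this example.

```python
# Standard IUPAC nucleotide codes and the bases each one can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def possible_bases(symbol: str) -> str:
    """Return the concrete bases that a single IUPAC symbol may represent."""
    return IUPAC[symbol.upper()]


def is_unambiguous(seq: str) -> bool:
    """True if every symbol in seq is a plain A, C, G, or T."""
    return all(symbol in "ACGT" for symbol in seq.upper())


def bits_per_symbol(alphabet_size: int) -> int:
    """Minimum number of whole bits needed to distinguish alphabet_size symbols."""
    return (alphabet_size - 1).bit_length()
```

Four plain bases fit in two bits each, but the 15 symbols in this table need four bits, so a packed two-bit representation cannot hold them. Storing each symbol as an ASCII character sidesteps the problem at the cost of eight bits per base.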
As a result, different pieces of software will be vulnerable to different malware encodings.
While that means some software will be immune, the size of DNA analysis pipelines typically
means a dozen or more pieces of software will be run in succession. Chances are good that at
least one of them will use the same encoding as the malware.
The research community's habits are also a major point of vulnerability. The analysis software
is generally not written with security in mind. Using the Clang compiler's analysis tools and
HP's Fortify static analyzer, the authors searched a collection of open source DNA analysis
software for potential vulnerabilities. They found widespread use of functions that are prone to
buffer overflows (strcat, strcpy, sprintf, vsprintf, gets, and scanf), about two instances for
every 1,000 lines of code. "Our research suggests that DNA sequencing and analysis have not to
date received
significant—if any—adversarial pressure," they conclude.
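To give a feel for what counting those calls involves, here is a rough sketch that scans C source text for the six functions named above. It is a plain text search, not what Clang's analysis tools or Fortify actually do, so it will also count matches inside comments and string literals. The function names are assumptions made for the example.

```python
import re

# The overflow-prone C functions named in the article.
UNSAFE_FUNCTIONS = ("strcat", "strcpy", "sprintf", "vsprintf", "gets", "scanf")

# Match a whole function name followed by "(", so that safer relatives such as
# fgets or snprintf are not counted.
CALL_PATTERN = re.compile(r"\b(" + "|".join(UNSAFE_FUNCTIONS) + r")\s*\(")


def count_unsafe_calls(source: str) -> dict:
    """Count textual calls to each listed function in a C source string."""
    counts = {name: 0 for name in UNSAFE_FUNCTIONS}
    for match in CALL_PATTERN.finditer(source):
        counts[match.group(1)] += 1
    return counts


def calls_per_thousand_lines(source: str) -> float:
    """Listed calls per 1,000 lines of source text."""
    line_count = len(source.splitlines())
    if line_count == 0:
        return 0.0
    return 1000 * sum(count_unsafe_calls(source).values()) / line_count
```

Run over a whole source tree, a count like this yields a calls-per-1,000-lines figure comparable in kind to the one the authors report, though a real static analyzer also reasons about how each call is used.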
The second issue is how easy it is to infiltrate malicious code onto other machines via DNA.
Because sequencing machines have such a high capacity, work from several different labs is run
on a single machine at the same time. As a result, some of the sequences returned from the
machine will end up mixed into an unrelated sample. When the researchers checked with another
group that had their
sequencing performed at the same time, they found that the other group's results contained 27
instances of the malware.
Separately, lots of services simply allow you to send in any DNA for sequencing, putting their
software at risk. And many public repositories allow people to upload their sequences for use by
others. So, you wouldn't even have to synthesize any DNA to have your exploit analyzed; you can
simply upload the text of the sequence you've designed to someone else's data repository.
None of this means that a DNA-based exploit is around the corner. But it's a healthy warning that
the research community and commercial DNA companies should look to improve their practices
before this does become a problem.