Researchers encode malware in DNA, compromise DNA sequencing software

With everyone from academics to Microsoft looking at the prospect of storing data using DNA, it was probably inevitable that someone would start looking at the security implications. Apparently, they're worse than most people might have expected. It turns out it's possible to encode computer malware in DNA and use it to attack vulnerabilities on the computer that analyzes the sequence of that DNA.

The researchers didn't find an actual vulnerability in DNA analysis software—instead, they specifically made a version of some software with an exploitable vulnerability to show that the risk is more than hypothetical. Still, an audit of some open source DNA analysis software shows that the academics who have been writing it haven't been paying much attention to security best practices.

More like a virus than most

DNA sequencing involves determining the precise order of the bases that make up a DNA strand. While the process that generates the sequence is generally some combination of biology and/or chemistry, once it's read, the sequence is typically stored as an ASCII string of As, Ts, Cs, and Gs. If handled improperly, that chunk of data could exploit vulnerable software to get it to execute arbitrary code. And DNA sequences tend to see a lot of software, which find overlapping sequences, align it to known genomes, look for key differences, and more.

To see whether this threat was more than hypothetical, the researchers started with a really simple exploit: store more data than a chunk of memory was intended to hold, and redirect program execution to the excess. In this case, said excess contained an exploit that would use a feature of the bash shell to connect into a remote server that the researchers controlled. If it worked, the server would then have full shell access to the machine running the DNA analysis software.

Actually implementing that in DNA, however, turned out to be challenging. DNA with Gs and Cs forms a stronger double-helix. Too many of them, and the strand won't open up easily for sequencing. Too few, and it'll pop open when you don't want it to. Repetitive DNA can form complex structures that get in the way of all the enzymes we normally use to manipulate DNA. The computer code they wanted to use, however, had lots of long runs of the same character, which made for a repetitive sequence that was very low in Gs and Cs. The company they were ordering DNA from couldn't even synthesize it.

In the end, they had to completely redesign their malware so that its translation into nucleic acids produced a DNA strand that could be synthesized and sequenced. The latter created another hurdle. The most common method of sequencing is currently limited to reading a few hundred bases at a time. Since each base has two bits of information, that means the malware has to be incredibly compact. That limits what can be done, and it explains why all this particular payload did was open up a remote connection.

Then, there was the matter of getting the malware executed. Since this was a proof of concept, the researchers made it easy on themselves: the modified an existing tool to create an exploitable vulnerability. They also made some changes to the system's configuration to make the execution of random memory locations easier (made the stack executable and turned off memory address randomization). While that makes the test environment less realistic, the goal was simply to demonstrate that DNA-delivered malware was possible.

With everything in place, they ordered some DNA online then sent it off to a facility for sequencing. When their sequences came back, they sent them through a software pipeline that included their vulnerable utility. Almost immediately, the computer running the software connected into their host, providing them with access to the machine. The malware worked.

Semi-realism

Given how easy the authors made things—a known vulnerability and a number of safeguards turned off—does this really pose a threat? There's good news and bad news here.

On the good side, there's the complications of translating computer instructions into DNA that can be synthesized and sequenced. Plus there's the issue that most sequencing machines are limited in how long a sequence they can read. The machine used in this work maxes out at 300 bases, which is the equivalent of 600 bits, and most facilities keep things shorter than that. Longer read machines are available, but they're also error prone, and any errors will typically disable the malware.

But it's also common for the software used to analyze DNA to look for places where two short sequences overlap and use that to build up longer sequences. This has the potential to expand the size of the malware considerably, although less of the analysis software pipeline will be exposed to these longer, assembled sequences.

Similar issues exist with how the malware is encoded. While the authors used each base to encode two bits, DNA analysis software handles DNA in various ways internally. For example, if sequencing doesn't provide a clear indication of what a base is, other characters may be used (for example, N for any base, or R for G or A). Any software that handles these ambiguous bases has to have a more complex encoding scheme; many simply use ASCII characters.

As a result, different pieces of software will be vulnerable to different malware encodings. While that means some software will be immune, the size of the DNA analysis pipelines typically means that a dozen or more pieces of software will be run in succession. Chances are good that at least one of them will use he same encoding as the malware.

Bad habits

The research community's habits are also a major point of vulnerability. The analysis software was generally not written with security in mind. Using the Clang compiler's analysis tools and HP's Fortify compiler, the authors searched a collection of open source DNA analysis software for potential vulnerabilities. They found widespread use of functions that are prone to buffer overflows (strcat, strcpy, sprintf, vsprintf, gets, and scanf)—about two instances for every 1,000 lines of code. "Our research suggests that DNA sequencing and analysis have not to date received significant—if any—adversarial pressure," they conclude.

The second issue is how easy it is to infiltrate malicious code onto other machines via DNA. The sequencing machines have such a high capacity, work from several different labs is run on a single machine at the same time. As a result, some of the sequences returned from the machine will end up mixed into an unrelated sample. When the researchers checked with another group that had their sequencing performed at the same time, they found that the other group's results contained 27 instances of the malware.

Separately, lots of services simply allow you to send in any DNA for sequencing, putting their software at risk. And many public repositories allow people to upload their sequence for analysis by others. So, you wouldn't even have to synthesize any DNA to have your exploit analyzed—you can simply upload the text of the sequence you've designed to someone else's data repository.

None of this means that a DNA-based exploit is around the corner. But it's a healthy warning that the research community and commercial DNA companies should look to improve their practices before this does become a problem.

Digital Flow s.r.o.