Researchers eye machines to analyze malware
Robert Lemos, SecurityFocus 2006-06-08
The reliance on humans for analyzing malware bothers Thomas Dullien.
The reverse engineer--better known amongst security researchers by
his nom de plume, Halvar Flake--created an automated system for
classifying software into groups, a process for which he believes
machines are much better suited. Research using the system has
underscored the sometimes-arbitrary decisions humans make in
classifying malicious programs, he said. Among other anomalies, he
found that Sasser.D has only a 69 percent correlation to previous
members of the Sasser family, while two examples of bot software,
Gobot and Ghostbot, are more similar to each other.
"It's like putting donkeys and bunnies in the same class because they
both have long ears," Dullien, the founder and CEO of reverse-
engineering tool maker Sabre Security, said in a recent interview.
The current problems with classifying and naming viruses are among
the reasons that automated classification technology has once again
become a focus of research. The plethora of names for specific
malicious programs has caused confusion amongst consumers, despite a
project that seeks to provide guidance to software analysts and
incident responders, if not to consumers. In January, when a new
computer virus appeared on the Internet, antivirus companies rushed
to issue alerts and inundated consumers with a confusing array of
names: Blackmal, Nyxem, MyWife, KamaSutra, Blackworm, Tearec and
Worm_Grew all describe the same mass-mailing computer virus.
Several research projects hope to improve upon that record.
Last month, at the annual conference of the European Institute for
Computer Anti-Virus Research (EICAR), Microsoft released early
results of its development of a system to automate classification of
malicious software based on the actions performed by the code at
runtime.
"A significant challenge we have today is the large number of active
malware samples, totaling on the order of tens of thousands, and
increasing rapidly," Tony Lee, a virus researcher at Microsoft, said
in a recent blog posting following the conference. "It has become
apparent to us that the traditional manual analysis process is not
adequate in dealing with malware of this order of magnitude, and that
we should seek automation technologies to aid human analysts."
The researchers modeled a piece of malicious software as the series
of actions that the software takes at the operating system level.
Referred to as "events" in a paper written by Lee and anti-malware
program team manager Jigar Mody, the actions can include data
copying, changing registry keys and opening network connections.
The researchers then trained a recognition engine using an adaptive
clustering algorithm--similar to self-organizing maps--and classified
a previously unseen subset of malware using the trained system. Using
more clusters typically resulted in better classification. When the
software samples were classified based on 100 events, accuracy fell
below 80 percent, while classification based on 500 and 1,000 events
typically had accuracy rates above 90 percent.
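
As a rough illustration of the idea, and not the algorithm described in
the Microsoft paper, the sketch below treats each sample as a sequence
of event names, truncates it to an event limit, and assigns an unseen
sample to the family of the closest labeled trace by edit distance.
The event names, the family labels and the nearest-neighbor rule are
all assumptions made for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two event sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute x with y
        prev = curr
    return prev[-1]


def classify(trace, labeled_traces, event_limit):
    """Label a trace by its closest known trace, using the first event_limit events."""
    head = trace[:event_limit]
    best_label, best_dist = None, None
    for label, known in labeled_traces:
        dist = edit_distance(head, known[:event_limit])
        if best_dist is None or dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Hypothetical traces; the event and family names are invented.
KNOWN = [
    ("family_a", ["copy_file", "set_registry_key",
                  "open_connection", "send_data"]),
    ("family_b", ["open_connection", "download_file",
                  "copy_file", "set_registry_key"]),
]
unseen = ["copy_file", "set_registry_key",
          "open_connection", "close_connection"]
result = classify(unseen, KNOWN, event_limit=1000)
```

The event_limit parameter stands in for the 100, 500 and 1,000 event
settings mentioned above: a smaller limit means less of each trace is
compared.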
Reverse engineer Dullien takes a different approach. Working with
other researchers at Sabre Security, he used automated tools to
deconstruct the actual code of virus and bot software, removing any
common libraries that the code might use and then comparing the
relationships between functions to characterize the software.
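
A minimal sketch of that kind of comparison, under simplifying
assumptions, might look like the following. It is not Sabre Security's
method: it assumes the functions in both samples have already been
matched to shared labels, represents each program as a set of
caller-to-callee edges, and scores the overlap with Jaccard
similarity. All function names and the library list are hypothetical.

```python
def strip_library_functions(call_edges, library_functions):
    """Drop every call edge that touches a known library function."""
    return {
        (caller, callee)
        for caller, callee in call_edges
        if caller not in library_functions
        and callee not in library_functions
    }


def call_graph_similarity(edges_a, edges_b):
    """Jaccard similarity: shared call edges divided by all call edges."""
    if not edges_a and not edges_b:
        return 1.0
    return len(edges_a & edges_b) / len(edges_a | edges_b)


# Hypothetical call graphs for two samples.
LIBRARY = {"memcpy", "strlen"}
sample_a = {
    ("main", "connect_irc"),
    ("connect_irc", "parse_command"),
    ("parse_command", "memcpy"),
    ("parse_command", "flood"),
}
sample_b = {
    ("main", "connect_irc"),
    ("connect_irc", "parse_command"),
    ("parse_command", "strlen"),
    ("parse_command", "scan"),
}
score = call_graph_similarity(
    strip_library_functions(sample_a, LIBRARY),
    strip_library_functions(sample_b, LIBRARY),
)
```

Removing the library calls first keeps shared runtime code from
inflating the score; here the two samples share two of four remaining
edges.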
Using a database of 200 samples of bot software, a test case for the
automated process resulted in two major families of code, three
smaller groups, and several pairs and singletons. The system also
identified variants of bot software not recognized by a signature-
based antivirus system.
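
One simple way to turn pairwise similarity scores into families, pairs
and singletons is to link any two samples whose score clears a
threshold and then collect the connected groups. The sketch below
does that with a small union-find structure. The grouping rule, the
threshold and the scores are assumptions for illustration, not
details of Dullien's system.

```python
def group_samples(names, similarity, threshold):
    """Group samples linked by pairwise scores at or above the threshold."""
    parent = {name: name for name in names}

    def find(x):
        # Walk up to the group's root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in similarity.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    # Largest groups first; equal-sized groups keep input order.
    return sorted(groups.values(), key=len, reverse=True)


# Hypothetical samples and similarity scores.
names = ["bot1", "bot2", "bot3", "bot4", "bot5", "bot6"]
similarity = {
    ("bot1", "bot2"): 0.90,
    ("bot2", "bot3"): 0.85,
    ("bot4", "bot5"): 0.80,
    ("bot1", "bot4"): 0.30,
    ("bot5", "bot6"): 0.20,
}
groups = group_samples(names, similarity, threshold=0.7)
```

With these made-up scores the result is one group of three, one pair
and one singleton.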
Dullien believes that static analysis is a better approach to malware
classification than Microsoft's runtime analysis. Actions that a
malicious program does not perform right away--known as time-delayed
triggers--can foil runtime analysis, he said. And virus and attack-
tool writers could add a few lines of code to a program to confuse
runtime analysis, he added.
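
The time-delay objection can be shown with a toy model. The sketch
below assumes a monitor that records events only during a fixed
observation window; the timeline, event names and window lengths are
invented for the example.

```python
def observed_events(timeline, window_seconds):
    """Events a monitor records if it stops watching after window_seconds."""
    return [event for at, event in timeline if at <= window_seconds]


# Hypothetical timeline: (seconds after launch, event name).
timeline = [
    (1, "copy_file"),
    (2, "set_registry_key"),
    (86400, "open_connection"),  # fires a day after launch
]
short_run = observed_events(timeline, window_seconds=300)
long_run = observed_events(timeline, window_seconds=90000)
```

A five-minute run never sees the network connection, so any
classification built on that trace is missing part of the program's
behavior.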
"The approach presented in the paper can be trivially foiled with
very minor high-level-language modifications in the source of the
program," he stated in a blog entry analyzing Microsoft's system.
Microsoft declined to make its researchers available for interviews.
However, in the paper, the authors argued that a combination of both
static analysis and runtime analysis would likely perform best. For
example, static analysis appears to deliver results more quickly;
Microsoft's behavioral classification requires 3 hours to cluster 400
files at the 1,000-event limit, according to the paper.
In some ways, software classification resembles the state of
biological classification back in the time of Carl Linnaeus. The 18th
century botanist pushed the scientific community of his day into
accepting a hierarchical classification system for plants and
animals. However, early classifications relied on external
similarities, much in the way that many of today's classifications
rely on external attributes of programs rather than their internal
processes.
At least one other project hopes to help human analysts do a better
job of classification.
OffensiveComputing.net, a project founded by researchers Val Smith
and Danny Quist, aims to create a database of malware that records a
number of basic attributes of the code, including checksums,
antivirus scanner results, and what type of packer the malware uses
to compress itself. The project started in response to the increase
in code sharing amongst virus and attack-tool writers, the faster
development of exploits, and the faster incorporation of those
exploits into existing malicious software, OffensiveComputing's Smith
said.
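
A record of that kind could be sketched as follows. The field names,
the choice of MD5 and SHA-256 as checksums, and the scanner and packer
values are assumptions for illustration; the article does not
describe OffensiveComputing's actual schema.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class SampleRecord:
    """Basic attributes of one sample (hypothetical schema)."""
    md5: str
    sha256: str
    size_bytes: int
    packer: str = "unknown"
    scanner_results: dict = field(default_factory=dict)


def make_record(data, packer="unknown", scanner_results=None):
    """Build a record from a sample's raw bytes."""
    return SampleRecord(
        md5=hashlib.md5(data).hexdigest(),
        sha256=hashlib.sha256(data).hexdigest(),
        size_bytes=len(data),
        packer=packer,
        scanner_results=dict(scanner_results or {}),
    )


# Placeholder bytes and invented scanner names, not real malware data.
record = make_record(
    b"placeholder bytes",
    packer="UPX",
    scanner_results={"scanner_a": "Example.A", "scanner_b": "Example.B"},
)
```

Keeping each scanner's name for the same checksum in one record is
what lets an analyst see that differently named detections refer to
the same file.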
"The biggest benefit is more rapid response to complex threats,"
Smith said. "As the synergy between viruses, Trojans, worms, rootkits
and exploits grows, waiting for a solution becomes more dangerous."
OffensiveComputing's database gives incident response workers and
analysts access to meaningful data about malicious software, which is
especially necessary until automated analysis programs, such as
Microsoft's and Dullien's classification systems, mature. The project
strives to be adaptable, involve the community, have measurable
results, and remain open, Smith said.
"There is an arms race going on between analysts and malware authors,
so any solution will have to keep pace with advances on both sides,"
Smith said.