TESTING INTRUSION DETECTION SYSTEMS

Elizabeth B. Lennon, Editor
Information Technology Laboratory
National Institute of Standards and Technology

Introduction

In government and industry, intrusion detection systems (IDSs) are now standard equipment for large networks. IDSs are software or hardware systems that automate the process of monitoring the events occurring in a computer system or network, analyzing them for signs of security problems. Despite the expansion of IDS technology in recent years, the accuracy, performance, and effectiveness of these systems are largely untested, due to the lack of a comprehensive and scientifically rigorous testing methodology.

This ITL Bulletin summarizes NISTIR 7007, An Overview of Issues in Testing Intrusion Detection Systems, by Peter Mell and Vincent Hu of NIST's Information Technology Laboratory, and Richard Lippmann, Josh Haines, and Marc Zissman of the Massachusetts Institute of Technology Lincoln Laboratory. The Defense Advanced Research Projects Agency (DARPA) sponsored the work.

The lack of quantitative IDS performance measurements can be attributed to challenging research barriers that must be overcome before the necessary tests can be created. NISTIR 7007 outlines the quantitative measurements that are needed, discusses the obstacles to developing those measurements, and presents ideas for research in IDS performance measurement methodology to overcome the obstacles. NISTIR 7007 is available online at http://csrc.nist.gov/publications/nistir/index.html.

Who Needs Quantitative Evaluations?

The results of quantitative evaluations of IDS performance and effectiveness would benefit many potential customers. Acquisition managers need this information to improve the process of system selection, which is often based only on vendor claims and limited-scope reviews in trade magazines. Security analysts who review the output of IDSs would like to know the likelihood that alerts will result when particular kinds of attacks are initiated. Finally, R&D program managers need to understand the strengths and weaknesses of currently available systems so that they can focus research efforts effectively on improving systems and measure their progress.

Measurable IDS Characteristics

Listed below is a partial set of measurements that can be made on IDSs. These measurements are quantitative and relate to performance accuracy.

Coverage. This measurement determines which attacks an IDS can detect under ideal conditions. For signature-based systems, this would simply consist of counting the number of signatures and mapping them to a standard naming scheme. For non-signature-based systems, one would need to determine which attacks, out of the set of all known attacks, could be detected by a particular methodology. The number of dimensions that make up each attack makes this measurement difficult. Another problem with assessing attack coverage is determining the importance of different attack types. In addition, most sites are unable to detect failed attacks that seek vulnerabilities no longer present on the site.
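For a signature-based system, the signature-counting approach described above can be made concrete with a small script. The following is a minimal sketch, assuming a hypothetical export of a product's signatures with CVE mappings and a hypothetical reference list of known attacks; real products expose this information in vendor-specific formats, and the identifiers shown are illustrative only.

```python
# Sketch: estimating signature coverage against a set of known attacks.
# The signature-to-CVE mapping and the reference attack set are hypothetical
# inputs, not data from any particular product or from NISTIR 7007.

# Signatures exported from a (hypothetical) signature-based IDS, each mapped
# to an identifier in a standard naming scheme such as CVE.
ids_signatures = {
    "sig-0001": "CVE-1999-0153",   # e.g., WinNuke out-of-band data DoS
    "sig-0002": "CVE-1999-0016",   # e.g., Land IP denial of service
    "sig-0003": None,              # signature with no standard mapping
}

# Reference set of known attacks the test cares about (also hypothetical).
known_attacks = {"CVE-1999-0153", "CVE-1999-0016", "CVE-2000-0884"}

covered = {cve for cve in ids_signatures.values() if cve in known_attacks}
coverage = len(covered) / len(known_attacks)

print(f"Coverage: {len(covered)}/{len(known_attacks)} attacks ({coverage:.0%})")
unmapped = [sig for sig, cve in ids_signatures.items() if cve is None]
print(f"Signatures without a standard name: {unmapped}")
```

Signatures that cannot be mapped to a standard name are reported separately, since they are exactly the cases that make cross-product coverage comparison difficult.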
Probability of False Alarms. This measurement determines the rate of false positives produced by an IDS in a given environment during a particular time frame. A false positive, or false alarm, is an alert caused by normal, non-malicious background traffic. For a network IDS (NIDS), causes include weak signatures that alert on all traffic to a high-numbered port used by a backdoor, that search for the occurrence of a common word such as "help" in the first 100 bytes of SMTP or other TCP connections, or that detect common violations of the TCP protocol. False alarms can also be caused by normal network monitoring and maintenance traffic generated by network management tools. False alarm rates are difficult to measure because an IDS may have a different false positive rate in each network environment, and there is no such thing as a standard network.

Also important to IDS testing is the receiver operating characteristic (ROC) curve, which aggregates the probability of false alarms and the probability of detection measurements. This curve summarizes the relationship between two of the most important IDS characteristics: false positive probability and detection probability.

Probability of Detection. This measurement determines the rate of attacks detected correctly by an IDS in a given environment during a particular time frame. The difficulty in measuring the detection rate is that the success of an IDS is largely dependent upon the set of attacks used during the test. Also, the probability of detection varies with the false positive rate, since an IDS can be configured or tuned to favor either the ability to detect attacks or the minimization of false positives. One must therefore be careful to use the same configuration when testing for false positives and for hit rates.
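To make the relationship between these two measurements concrete, the short sketch below derives ROC-style operating points from a labeled alert stream. The alert scores, labels, and attack count are invented illustrative data, not output from any particular IDS; a real evaluation would take them from the labeled ground truth of the test dataset.

```python
# Sketch: deriving ROC-style operating points from labeled IDS output.
# All data below is made up for illustration.

# (score, is_true_detection) pairs: score is the IDS confidence for an
# alert; is_true_detection is True when the alert matches a real attack
# in the ground truth and False when it is a false alarm.
alerts = [(0.95, True), (0.90, True), (0.85, False), (0.70, True),
          (0.60, False), (0.40, False), (0.30, True), (0.10, False)]
total_attacks = 5   # attacks present in the test data (assumed known)

# Sweep the alerting threshold; each setting yields one operating point.
for threshold in (0.9, 0.7, 0.5, 0.3):
    fired = [(s, hit) for s, hit in alerts if s >= threshold]
    detections = sum(1 for _, hit in fired if hit)
    false_alarms = sum(1 for _, hit in fired if not hit)
    p_detect = detections / total_attacks
    print(f"threshold={threshold:.1f}  P(detect)={p_detect:.2f}  "
          f"false alarms={false_alarms}")
```

Plotting detection probability against false alarms across thresholds yields the ROC curve; the same configuration must be held fixed across the sweep, for the reason noted above.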
Resistance to Attacks Directed at the IDS. This measurement demonstrates how resistant an IDS is to an attacker's attempt to disrupt its correct operation. One example is sending a large amount of non-attack traffic with volume exceeding the processing capability of the IDS; with too much traffic to process, the IDS may drop packets and be unable to detect attacks. Another example is sending non-attack packets that are specially crafted to trigger many signatures within the IDS, thereby overwhelming its human operator with false positives or crashing the alert processing or display tools.

Ability to Handle High Bandwidth Traffic. This measurement demonstrates how well an IDS functions when presented with a large volume of traffic. Most network-based IDSs begin to drop packets as the traffic volume increases, causing them to miss a percentage of the attacks; at a certain threshold, most IDSs stop detecting any attacks. (A test harness along these lines is sketched at the end of this section.)

Ability to Correlate Events. This measurement demonstrates how well an IDS correlates attack events. These events may be gathered from IDSs, routers, firewalls, application logs, or a wide variety of other devices. One of the primary goals of this correlation is to identify staged penetration attacks. Currently, IDSs have only limited capabilities in this area.

Ability to Detect Never-Before-Seen Attacks. This measurement demonstrates how well an IDS can detect attacks that have not occurred before. For commercial systems, this measurement is generally not useful, since their signature-based technology can detect only attacks that have occurred previously (with a few exceptions). However, research systems based on anomaly detection or specification-based approaches may be suitable for this type of measurement.

Ability to Identify an Attack. This measurement demonstrates how well an IDS can identify the attack that it has detected, by labeling each attack with a common name or vulnerability name or by assigning the attack to a category.

Ability to Determine Attack Success. This measurement demonstrates whether the IDS can determine the success of attacks from remote sites that give the attacker higher-level privileges on the attacked system. In current network environments, many remote privilege-gaining attacks (or probes) fail and do not damage the attacked system. Many IDSs, however, do not distinguish failed from successful attacks.

Capacity Verification for NIDS. A NIDS demands higher-level protocol awareness than other network devices such as switches and routers, because it must inspect deeper into network packets. It is therefore important to measure the ability of a NIDS to capture, process, and perform at the same level of accuracy under a given network load as it does on a quiescent network.

Other Measurements. Other measurements include ease of use, ease of maintenance, deployment issues, resource requirements, and availability and quality of support. These measurements are not directly related to IDS performance but may be more significant in many commercial situations.
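As noted under Ability to Handle High Bandwidth Traffic above, one way to quantify load handling is to replay the same labeled capture at increasing rates and observe the detection rate fall. The following is a minimal harness sketch: the pcap file, alert log path, interface name, and ground-truth attack count are all assumptions made for illustration, and it naively counts every new alert line rather than matching alerts to ground truth, as a careful test would.

```python
# Sketch: measuring detection rate as offered load increases. Assumes a
# recorded pcap with a known number of embedded attacks, the tcpreplay
# utility on the test host, and an IDS writing one alert per line to a
# log file -- all assumptions for illustration, not part of NISTIR 7007.
import subprocess

PCAP = "background_with_attacks.pcap"   # hypothetical labeled capture
ALERT_LOG = "/var/log/ids/alerts.log"   # hypothetical alert log path
ATTACKS_IN_PCAP = 20                    # ground truth for this capture

def count_alerts() -> int:
    """Count alert lines written so far (naive: includes false alarms)."""
    with open(ALERT_LOG) as f:
        return sum(1 for _ in f)

for mbps in (10, 50, 100, 500, 1000):
    before = count_alerts()
    # Replay the same capture at an increasing rate on the monitored link.
    subprocess.run(
        ["tcpreplay", "--intf1=eth1", f"--mbps={mbps}", PCAP],
        check=True,
    )
    detected = count_alerts() - before
    print(f"{mbps:5d} Mbps: {detected}/{ATTACKS_IN_PCAP} attacks alerted on")
```

The point at which the alert count collapses approximates the threshold, described above, beyond which the sensor stops detecting attacks.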
IDS Testing Efforts to Date

IDS testing efforts vary significantly in their depth, scope, methodology, and focus. Evaluations have increased in complexity over time to include more IDSs and more attack types, such as stealthy and denial of service (DoS) attacks. Only research evaluations have included novel attacks designed specifically for the evaluation and assessed the performance of anomaly detection systems. Evaluations of commercial systems have included measurements of performance under high traffic loads; these loads were generated both from real high-volume background traffic mirrored from a live network and with commercial load-testing tools.

Academic, research laboratory, and commercial organizations have all been active in IDS testing. The University of California at Davis and IBM Zurich developed prototype IDS testing platforms. MIT Lincoln Laboratory performed the most extensive quantitative IDS testing to date, developing an intrusion detection corpus that is used extensively by researchers. The Air Force Research Laboratory focused on testing IDSs in real time in a more complex, hierarchical network environment. The MITRE Corporation investigated the characteristics and capabilities of network-based IDSs. The Neohapsis Laboratories/Network Computing magazine collaboration evaluated commercial systems. The NSS Group evaluated 15 commercial IDSs and one open-source IDS in 2000 and 2001 and issued a detailed report and analysis. Lastly, Network World Fusion magazine reported a more limited review of five commercial IDSs. See NISTIR 7007 for a complete description of these testing efforts.

IDS Testing Issues

Difficulties in Collecting Attack Scripts and Victim Software. The difficulty of collecting attack scripts and victim software hinders progress in developing tests. Collecting a large number of attack scripts is difficult and expensive: while such scripts are widely available on the Internet, it takes time to find scripts relevant to a particular testing environment. Once a script is identified, our experience is that it takes roughly one person-week to review the code, test the exploit, determine where the attack leaves evidence, automate the attack, and integrate it into a testing environment.

Differing Requirements for Testing Signature-Based vs. Anomaly-Based IDSs. Although most commercial IDSs are signature-based, many research systems are anomaly-based, and it would be ideal if a single IDS testing methodology worked for both. This is especially important for comparing the performance of upcoming research systems to existing commercial ones. However, creating a single test to cover both types of systems presents some problems.

Differing Requirements for Testing Network-Based vs. Host-Based IDSs. Testing host-based IDSs presents difficulties not present when testing network-based IDSs. In particular, network-based IDSs can be tested off-line by creating a log file containing TCP traffic and then replaying that traffic to the IDSs (a minimal replay sketch appears at the end of this section). Since it is difficult to test a host-based IDS off-line, researchers must explore more difficult real-time testing, which presents problems of repeatability and consistency between runs.

Four Approaches to Using Background Traffic in IDS Tests. Most IDS testing approaches can be classified into one of four categories with regard to their use of background traffic: testing using no background traffic/logs, testing using real traffic/logs, testing using sanitized traffic/logs, and testing using simulated traffic/logs. While there may be other valid approaches, most researchers find it necessary to choose among these categories when designing their experiments. Furthermore, it is unclear which approach is most effective for testing IDSs, since each has unique advantages and disadvantages. See NISTIR 7007 for a complete discussion of these issues.
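As mentioned above under network-based vs. host-based testing, a recorded traffic log can be replayed toward a NIDS for repeatable off-line tests. The sketch below uses the scapy packet library to re-send packets from a capture while roughly preserving their original inter-arrival times; the interface name and capture filename are assumptions for illustration, and sending raw packets typically requires administrative privileges.

```python
# Sketch: off-line testing of a network IDS by replaying recorded traffic.
# The capture filename and interface are hypothetical; timing is only
# approximately preserved because of sleep/send overhead.
import time
from scapy.all import rdpcap, sendp

packets = rdpcap("recorded_session.pcap")   # hypothetical capture file

previous = None
for pkt in packets:
    if previous is not None:
        # pkt.time is the original capture timestamp; sleep the same gap.
        time.sleep(max(0.0, float(pkt.time) - float(previous)))
    sendp(pkt, iface="eth1", verbose=False)  # replay toward the IDS sensor
    previous = pkt.time
```

Because the same capture can be replayed identically on every run, this style of test gives the repeatability that real-time host-based testing lacks.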
Recommendations for IDS Testing Research

Research recommendations for IDS testing focus on two areas: improving datasets and enhancing metrics.

Shared Datasets. There is a great need for IDS testing datasets that can be shared openly among multiple organizations. Few existing datasets contain even semi-realistic data or label the attacks within the background traffic. Without shareable datasets, IDS researchers must either expend enormous resources creating proprietary datasets or use fairly simplistic data for their testing.

Attack Traces. Since it is difficult and expensive to collect a large set of attack scripts for IDS testing, a possible alternative is to use attack traces instead of real attacks. Attack traces are the log files produced when an attack is launched, specifying exactly what happened during the attack; they usually consist of files containing network packets or system logs that correspond to an instance of an attack. Researchers need a better understanding of the advantages and disadvantages of replaying such traces as part of an IDS test. In addition, there is a great need to provide the security community with a large set of attack traces. Such information could easily be added to, and would greatly augment, existing vulnerability databases; the resulting vulnerability/attack trace databases would aid IDS testing researchers and provide valuable data for IDS developers.

Cleansing Real Data. Real data generally cannot be distributed because of privacy and sensitivity issues. Research into methods that remove the confidential data within background traffic while preserving the traffic's essential features could enable the use of such data within IDS tests. Such an advance would alleviate the need for researchers to expend additional effort creating expensive simulated environments. Another problem with real background data is that it may contain attacks about which nothing is known; it is possible, however, that such attacks could be removed automatically. One idea is to collect a trace of events in the real world and use a simulation system to produce data similar to those in the collected trace.

Sensor and Detector Alert Datasets. Some intrusion correlation systems do not take a raw data stream (such as network or audit data) as input, but instead rely on alerts and aggregated information reports from IDSs and other sensors. Researchers need to develop systems that can generate realistic alert log files for testing correlation systems. One solution is to deploy real sensors and sanitize the resulting alert stream by replacing IP addresses (see the sketch at the end of this section). Sanitization is difficult for network activity traces in general, but it is relatively easy in this special case, since alert streams use well-defined formats and generally contain little sensitive data (the exceptions being IP addresses and possibly passwords).

Real-Life Performance Metrics. Receiver operating characteristic (ROC) curves are created by stepping through the alerts emitted by a detector in order of confidence or severity. The goal is to show how many alerts must be analyzed to achieve a certain level of performance and, by applying costs, to determine an optimal operating point. The confidence- or severity-based ROC curve, however, is not a good indicator of how the IDS will perform with an intelligent human administrator at the console. The administrator does not consider the IDS alerts alone, but uses additional information such as network maps, user trouble reports, and learned knowledge of common false alarms when deciding which alerts to analyze first. Thus the alert ordering used as the basis of the ROC is often not realistic. A further problem is that few current detection systems output a continuous range of scores; most output only a few priorities (low/medium/high), so the ROC consists of only a few very coarse points. It might be useful to use alert type, source, and/or destination IP address, along with severity or confidence, to order a set of IDS alerts for the purpose of estimating the cost and performance of a detector. This technique could produce a curve that provides a much more realistic basis for comparing attack detection and false alarm performance, and for estimating the cost of using an intrusion detection product at various levels of performance.

New Technologies. Newly evolving IDS technologies include meta-IDS technologies that attempt to ease the burden of cross-vendor data management; IDS appliances that promise increased processing power and more robust remote management capabilities; and application-layer technologies that filter potential attack traffic to downstream scanners on dedicated network segments. These new directions focus on technologies for enterprises or service providers, and they represent examples of research efforts to address false positives, traffic bottlenecks, and the problem of distinguishing serious attacks from nuisance alarms.
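As suggested under Sensor and Detector Alert Datasets above, alert streams can be sanitized by consistently replacing IP addresses. The sketch below uses a keyed hash to map each address into the 10.0.0.0/8 range; the log format, secret key, and mapping scheme are illustrative assumptions, not a method prescribed by NISTIR 7007.

```python
# Sketch: sanitizing an IDS alert stream by consistently replacing IP
# addresses so the resulting dataset can be shared. The keyed-hash
# mapping into 10.0.0.0/8 is one possible scheme, chosen for brevity.
import hashlib
import re

SECRET = b"per-dataset secret key"   # keeps the mapping hard to reverse
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def pseudonymize(ip: str) -> str:
    # The same real address always maps to the same pseudonym, so the
    # cross-alert correlation structure survives sanitization.
    digest = hashlib.sha256(SECRET + ip.encode()).digest()
    return "10.{}.{}.{}".format(digest[0], digest[1], digest[2])

def sanitize_line(line: str) -> str:
    return IP_RE.sub(lambda m: pseudonymize(m.group(0)), line)

# Hypothetical alert line, loosely modeled on common IDS log formats.
alert = "2003-06-12 10:04:31 [**] Probe from 192.0.2.44 to 198.51.100.7"
print(sanitize_line(alert))
```

Consistency of the mapping is the essential property here: a correlation system under test still sees the same source address across related alerts, even though the real address is gone.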
Conclusion

While IDS testing efforts to date vary significantly and have become increasingly complex, the lack of a comprehensive and scientifically rigorous testing methodology to quantify IDS performance has hindered the development of needed tests. NIST believes that a periodic, comprehensive evaluation of IDSs could be valuable for acquisition managers, security analysts, and R&D program managers. However, because both normal and attack traffic vary widely from site to site, and because normal and attack traffic evolve over time, these evaluations will likely be complex and expensive. To enable evaluations to be conducted more efficiently, NIST recommends that the community find ways to create, label, share, and update relevant datasets containing normal and attack activity.

Disclaimer

Any mention of commercial products or reference to commercial organizations is for information only; it does not imply recommendation or endorsement by NIST, nor does it imply that the products mentioned are necessarily the best available for the purpose.