This Work Was Funded by the U.S. Department of Energy - Defense Programs Organization
and Sandia National Laboratories
The history of warfare is fraught with examples of fixed defenses that failed. The so-called 'Great Wall' of China is a prime example. Over a period of several thousand years, walls were built to protect Chinese leaders from attacks by warriors from the Steppe, and over the entire period, those same warriors penetrated those walls, took over the defending kingdoms, became the new rulers, built longer, higher, stronger walls, and were themselves attacked and defeated in turn despite their improvements. Eventually more than 3,000 miles of walls were built, but they never really provided adequate protection.
In the end, it seems that there were four major reasons for the failures of the 'Great Wall':
The 'Great Wall' is not the only example of this phenomenon. The Maginot Line, the Berlin Wall, and most national borders have had the same problems. The Maginot Line, for example, was a very well fortified set of fixed defenses positioned along a line and supported with underground bunkers capable of housing and feeding all of the people required to defend the line for a long period of time. It was bypassed by Germany in the Blitz when its forces went through the Ardennes and took France by storm. The Berlin Wall was originally put up by the Russians to starve West Berlin into submission after the end of World War II, but the Berlin airlift bypassed the wall. Maintaining this relatively short wall eventually cost so much and was so ineffective that the Soviet Union abandoned it. The southern border of the United States, like many fixed borders, is also ineffective at keeping people out. Even the wealthiest nation in the history of humanity, with all of its high-tech wizardry, cannot afford to make a fixed defense of this sort work.
As with most ancient military problems, the limits of fixed defenses have been addressed in a variety of ways over the past several thousand years. Some of the techniques that have emerged in physical defense include:
Each of these defenses has its limitations and weaknesses, just as fixed defenses have theirs, and thus we have a game of attack and defense that is, in essence, unending.
By this point, the reader is probably wondering why we are spending so much time discussing defensive warfare over the centuries. This becomes far more clear when we start to look at the limitations we have today with information system defenses in networked environments. In particular, and not too surprisingly, there are four major reasons that current cyber-defenses fail.
The reader may notice that these are the same four reasons that fixed physical defenses have failed over the last several thousand years. The notional background of this paper is that fixed defenses always have these limitations and that the same notions used in the physical domain to avoid the limitations of fixed defenses may apply by analogy to the information protection arena. That then is the basic idea behind our research and concept behind this paper. Further observation may lead to the conclusion that defenses based on flexibility are not perfect, but are rather based on increasing the complexity and resource requirements for successful attack. That is also a notion inherent in the approach we describe. Just as complexity-based integrity maintenance is effective in cryptography and virus defense, so complexity-based network protection may be effective in flexible protection of computer networks.
In the rest of this paper, we will
As the title states, we are discussing notions surrounding automated, dynamic, flexible, distributed, scalable defenses for computer networks. We pause here to clarify our intent when we use these words.
As an overall concept then, we wish to move from a defense that consists of a hard outer shell and a gooey center to a defense that is crunchy everywhere and at every granularity.
Notionally, and in implementation, our design is quite simple. It is based on the idea that each system should, in the words of today's environmental movement, think globally - act locally. Each networked computer participating in the defense controls only its own configuration, but may request situational information from other computers. While the requesting computer may, at its sole option, use this outside information in any way it sees fit, there is no guarantee that the information is accurate, timely, or correct. Again, notionally, a computer that is in a hostile environment may get lies from all other computers it can communicate with. It is then the task of each computer to do its best to offer its users the optimal tradeoff between security and functionality in what it perceives to be the current situation.
Tradeoffs and Optimization
Some readers may ask why, if we know the best tradeoff, we do not just provide it all the time and be done with it. The reason is that the tradeoff between functionality and security is not static. For example, during periods of high threat, and in situations in which few of the providers of a vital service remain available, systems capable of providing that service should probably put forth great effort to provide it even when the service is directly under attack, perhaps at the expense of cutting off other services or operating the service in a more secure mode. In periods of low threat, when many service providers are available, even a mild attack against the same service might reasonably be met with a temporary shut-down of that service.
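The shifting tradeoff described above can be illustrated with a small decision function. This is only a sketch under stated assumptions: the function name, the 0-to-1 scales, the product used as an "importance" measure, and both thresholds are hypothetical and are not taken from our implementation.

```python
def respond_to_attack(threat_level: float, providers_available: int,
                      providers_total: int, attack_severity: float) -> str:
    """Pick a response for one attacked service (illustrative thresholds only).

    threat_level and attack_severity are assumed to lie in [0, 1].
    """
    # Scarcity is near 1 when few providers of the service remain.
    scarcity = 1.0 - providers_available / providers_total
    # Hypothetical measure: the service matters most under high threat
    # when few alternative providers exist.
    importance = threat_level * scarcity
    if importance >= 0.5:
        return "keep service up in hardened mode"
    if attack_severity >= 0.2:
        return "shut service down temporarily"
    return "keep service up normally"
```

With these assumed thresholds, the same service is kept running through a severe attack when it is scarce and the threat is high, and is shut down for a milder attack when it is plentiful and the threat is low.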
Issues of Time
With this notion comes another one. Defenses in such a system will automatically scale up and down, balancing the importance of providing services against the risks associated with providing them. In most attacks against large networks, as in most conflicts of all kinds, we don't go from complete peace to full scale war and back to complete peace in an instant, and even if we did, it would probably not be prudent for a large network to respond instantly in scaling its defenses either up or down, because an attacker could then cause the network to become unstable through the use of pulsed attacks. For substantial networks, there are practical limits on the ability of an attacker with a given amount of resources to have more than a certain amount of effect over a small period of time. The rate and severity of the response should then somehow be related to the scale of the attack. This implies that the times associated with scaling up and down are key factors in making decisions about what to do when. It also implies that defenders do not have to act instantly and universally in order to be successful.
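One simple way to keep pulsed attacks from driving the defenses into oscillation is to limit how far the defensive level may move in each time step. The sketch below assumes a single numeric defense level, and the names and step limits are hypothetical. The choice to rise faster than it falls is our illustrative assumption, not a measured result.

```python
def step_defense_level(current: float, target: float,
                       max_rise: float, max_fall: float) -> float:
    """Move the defense level toward the target, limited per time step."""
    if target > current:
        return min(target, current + max_rise)
    return max(target, current - max_fall)


def run(levels_observed, max_rise=0.25, max_fall=0.05, start=0.0):
    """Apply the rate limit to a sequence of observed (target) levels."""
    level = start
    history = []
    for observed in levels_observed:
        level = step_defense_level(level, observed, max_rise, max_fall)
        history.append(round(level, 2))
    return history
```

Fed an attack that pulses between 1 and 0 on alternate steps, `run([1, 0, 1, 0])` yields `[0.25, 0.2, 0.45, 0.4]`: the level climbs gradually rather than swinging between extremes.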
If human defenders could handle automated attacks in time to mitigate harm effectively, we would not need automated defenses - even though we might still choose to have them as labor-saving devices. Since automated attacks happen relatively quickly and can intentionally or accidentally produce cascading effects, the rate at which effective defenses must operate is bounded by the consequences of slow action. While many who speak and write on this subject seem to think that attack propagation rates are at or near the speed of light as a function of the wire distance between systems, with the exception of electromagnetic attacks on the underlying hardware and propagation of waves in systems like electrical power grids, this is not generally the case.
For example, even the most simple and rapid of automated denial of service attacks on undefended computer systems take on the order of hundreds of milliseconds to be effective, and even mildly sophisticated attacks that corrupt or leak information take several seconds if everything attempted works perfectly. Realistic large-scale attacks usually take on the order of hours to be widely effective, and the most widespread outages spread over global networks on time frames of hours to days. A key element in our understanding is that defenses don't have to be instantaneous or even operate at a very high rate in most cases. One key to designing a successful system of this sort is understanding how quickly which things must happen.
Another time issue that is commonly a source of problems in network-based protective systems is that absolute time is non-existent and relative time may be complex to get right. Even if we all agree on GMT as the 'standard', we still have problems of time delays, mis-set clocks, date-specific problems with particular systems, and attackers who cause erroneous times to be set. Furthermore, the requirement for accurate time information depends on the bounds on action as a function of time.
As a basis for our current work, we use relative time, where each request for information from another system results in a response that includes the current time from the perspective of that other system and where all measures of time are relative to that current time. One of the effects of this is that time can move backwards or be dilated (e.g., a clock is changed or overflows) without serious impact on operations. Because time is critical to delay calculations and the way in which defenses are ramped up and down, correlation of time data over time and across systems can be used to reliably determine if and when systems are acting strangely with respect to time, and this may be used as a basis for altering the level of trust placed in the data provided by them.
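The relative-time idea can be sketched as follows. We assume, for illustration only, that a reply carries the remote clock reading plus the remote clock readings of each reported event. The class, field names, and tolerance are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Report:
    remote_now: float                                 # remote clock when the reply was built
    event_times: list = field(default_factory=list)   # remote clock reading of each event


def event_ages(report: Report) -> list:
    """Age of each event in seconds, measured entirely on the remote clock."""
    return [report.remote_now - t for t in report.event_times]


def clock_looks_strange(prev_remote_now: float, prev_local: float,
                        remote_now: float, local: float,
                        tolerance: float = 5.0) -> bool:
    """Compare elapsed remote time with elapsed local time between two replies.

    Only differences are used, so neither clock needs to be set correctly.
    """
    remote_elapsed = remote_now - prev_remote_now
    local_elapsed = local - prev_local
    return abs(remote_elapsed - local_elapsed) > tolerance
```

Because only differences are used, a remote clock that is simply wrong still yields correct event ages, while a clock that jumps between two replies is flagged and could be used to lower the trust placed in that source.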
Searching for Commonalities
The vast majority of substantially harmful attacks do not appear as random uncorrelated events that by some marvelous coincidence of design result in high consequences. Rather, there are commonalities associated with operating system types, machine types, vulnerabilities, locations, business function, and so forth. Cost effective mitigation then typically involves identifying commonalities and defending in such a way as to continue unrelated processing unhindered. If a denial of service attack is underway against PCs running the Windows NT operating system, we don't normally limit users on Unix systems from doing file transfers.
The ability to differentiate in this way would seem to imply some general purpose correlation capability that is capable of correlating all the information related to an incident so as to identify commonalities. But in a fully distributed system, this cannot be done. Our approach to correlation is to have the participants gather samples along each dimension by their selection of what other systems to request information from. Similarly, the rate at which requests are sampled across each of these dimensions relates to the time scale requirements associated with the attacks that affect those commonalities. So if an attacker could reasonably be expected to break into X Web servers per minute and there are Y Web servers being sampled, a distributed sampling theory would tell us how often each Web server should make requests of other Web servers to assure that an attack of a particular ferocity will not affect more than a given portion of those machines before detection. This applies across all dimensions of interest and produces sampling rates and other similar parameters.
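As a rough illustration of the kind of calculation involved, the sketch below uses one deliberately simple model, which is an assumption of this example and not the sampling theory itself. It assumes:

- every server polls uniformly random peers at the same rate,
- polls arriving at any one server are treated as a Poisson process,
- a poll of a compromised server reveals the compromise.

Under these assumptions, an attacker who takes X servers per minute reaches a fraction f of Y servers after T = fY/X minutes, and each server must be polled at least once within T with probability at least 1 - p.

```python
import math


def time_budget_minutes(x_per_min: float, y_servers: int, max_fraction: float) -> float:
    """Minutes until the attacker reaches the allowed fraction of servers."""
    return max_fraction * y_servers / x_per_min


def min_poll_rate(x_per_min: float, y_servers: int,
                  max_fraction: float, miss_prob: float) -> float:
    """Polls per minute each server should issue under the assumed model.

    A server goes unpolled for t minutes with probability exp(-rate * t),
    so we need rate >= ln(1 / miss_prob) / t.
    """
    t = time_budget_minutes(x_per_min, y_servers, max_fraction)
    return math.log(1.0 / miss_prob) / t
```

For example, with X = 2, Y = 200, f = 0.05 and p = 0.01, the time budget is 5 minutes and each server would need to issue a little under one poll per minute.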
Detectability Issues
You can't see what you can't see. If an attacker is able to successfully attack systems without being detected, our defenses will not be effective against them, and this can reasonably be said for any defensive scheme. The underlying issue, however, is one of detectability. Given that we have imperfect detection schemes, which we do, the issue is then how we compensate for these imperfections so as to be reasonably effective despite them.
While we don't have any special preference for the sensors used with our protective scheme, having better sensors will help you do a better job of detection and response. Current sensors come in several varieties.
In our initial experiments, we have used sensors based on deceptions because they were available and easy to adapt. In practice, any sensor can be used, but clearly, better sensors will tend to yield better results.
Push or Pull
Many people we have discussed this idea with ask us how we prevent an attacker from creating large quantities of false attack data and pushing it into all of our systems to cause us to become highly defensive without ever launching a successful attack. The answer is that machines don't push data out when an attack is detected - rather machines pull data from other machines when they see fit to ask for it.
The advantage of pushing data is that the system can be more responsive while saving bandwidth. This is essentially an interrupt driven scheme in which a sensed attack generates immediate notification of other parties. Every detected step of the attack is then pushed out to a set of other machines that can act on it as they see fit. Bandwidth is only used when necessary and no performance impacts are felt except when there is an attack. One of the problems with this is that an attack that destroys the push capability will cause the rest of the systems to believe that the system under attack is actually safe from attack. The counter to this is a keep-alive signal sent if no attack has been detected over a given time period. This of course eliminates some of the performance advantages and we then need to determine an appropriate keep-alive time based on parameters similar to those used to determine how often to request data on a pull-basis. The other question is where we should send any sensed attack data, and that can be analyzed as well.
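The keep-alive counter can be sketched from the receiver's side: a peer that has been silent for too long is treated as suspect rather than safe. The message kinds, status strings, and grace factor below are hypothetical.

```python
def peer_status(last_message_kind: str, seconds_since_last: float,
                keepalive_interval: float, grace: float = 2.0) -> str:
    """Classify a pushing peer from its most recent message.

    last_message_kind is assumed to be "attack" or "keepalive".
    """
    if seconds_since_last > grace * keepalive_interval:
        # The push path may have been destroyed: do not assume safety.
        return "silent"
    if last_message_kind == "attack":
        return "under attack"
    return "quiet"
```

The choice of `keepalive_interval` carries the same tradeoff as a pull schedule: shorter intervals detect a silenced peer sooner but consume more bandwidth.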
The advantage of pulling data is that each system can decide what it is interested in finding out when and where from as a function of its perception of the situation. Under attack, a machine may decide to apply more resources, query more remote systems of more types, investigate a particular aspect of the incident in depth, and so forth. This can all be done easily under a pull system by an individual computer, but is very complex to do under a push system. As in a push system, the schedule for pulling data is important to responsiveness and the ability to have an accurate picture of the situation. A further advantage of a pull system is that through the use of stochastic processes, undetected subversion of the system by non-local attackers can be made quite difficult. For example, unless we use a highly predictable schedule to request information from a highly predictable set of sources (something we make substantial efforts to avoid), an attacker who is not in the path between a system and each of its remote data sources cannot get the information required to forge responses to pulled requests, and thus cannot reliably forge those responses without being detected. In a push system, forgery is far easier within the infrastructures used for networking today.
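A stochastic pull schedule of the sort described above might look like the following sketch. The function names and the dictionary layout are hypothetical. The random per-request nonce is an added assumption of this example, included to show one piece of information that an off-path attacker would lack.

```python
import secrets


def plan_pull(peers, mean_interval: float, rng=None) -> dict:
    """Choose an unpredictable peer and an unpredictable delay for the next pull."""
    if rng is None:
        rng = secrets.SystemRandom()  # operating-system randomness source
    return {
        "peer": rng.choice(peers),
        # Exponentially distributed delay with the requested mean.
        "delay": rng.expovariate(1.0 / mean_interval),
        # Hypothetical addition: a nonce the reply must echo back.
        "nonce": secrets.token_hex(16),
    }


def reply_is_acceptable(request: dict, reply: dict) -> bool:
    """Accept a reply only if it names the polled peer and echoes the nonce."""
    return (reply.get("peer") == request["peer"]
            and reply.get("nonce") == request["nonce"])
```

An attacker who cannot observe the request knows neither when it was sent, nor to whom, nor the nonce, so a forged reply is unlikely to pass `reply_is_acceptable`.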
Another push-pull issue relates to the ability of an attacker to exploit the shared information about attacks to determine whether their efforts are being detected and to measure the effects of their attacks. In a pull system with unrestricted responses, an attacker can find out as much as all the defenders about what has been detected, the defensive posture of the network, and so forth. In an unrestricted push system, the attacker can do much the same thing by observing the responses of systems to stimuli. Both of these can be addressed to some extent by restricting locations of pushes and pulls and by encrypting, authenticating, and delaying behaviors so as to reduce unauthorized correlation, but even then, insiders will have access to some of this data and potentially be able to exploit more and more of it as they expand their attack, using the new information to attack still further.
We do not believe that a definitive argument can be made for push or pull systems at this time. While we have implemented a restricted pull system, we believe that elements of push are appropriate to this and related designs. For example, one day soon, we may create pushes to alert select other machines of severe attacks as they are detected so that a rapid attack that incapacitates a machine will not cause the data on the process of the attack to be lost to the rest of the network.
Degrees of Trust and Historical Data
Since systems under attack do not always provide accurate data, a model of trust may be helpful in assessing information derived from remote sources. The issue of how we do this and the bounds of such trust models are important to the success of our defense.
As a basic principle, we must admit that a successful attacker might alter the functioning of local defensive systems. If they do so, we cannot rely on these systems for correct behaviors, and thus, the accuracy of remote data notwithstanding, we may get wrong answers without recourse. This is the case regardless of any coding, assurance, or other protective measures we may take. In essence, such an alteration is the equivalent of a design flaw and unmitigatable on a local level. This does not, however, mean that a local corruption is not mitigatable on a network-wide level.
When we rely on data from remote systems, if a sufficient number of those remote systems (perhaps all of them in some schemes) provide false or misleading information, then to the extent that we use their information to make decisions, we may make less than optimal decisions. Again, there is no way to avoid this consequence in the sort of defensive scheme we are discussing, but there may be ways to make a successful attack infeasible in practice by the manner in which we use this method and by the way in which we choose to use remote information and limit its effect on local decisions.
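One generic way to limit the effect of remote information on local decisions is to summarize remote reports with a statistic that a minority of liars cannot move far, and then to cap the weight that summary receives. The sketch below is an illustration of that idea and not a description of our implementation. The function name and the default weight are hypothetical.

```python
from statistics import median


def combined_level(local: float, remote_reports, remote_weight: float = 0.5) -> float:
    """Blend a local threat level with the median of remote reports.

    remote_weight caps how much remote data can move the result.
    """
    if not remote_reports:
        return local
    # The median ignores extreme values reported by a minority of sources.
    remote = median(remote_reports)
    return (1.0 - remote_weight) * local + remote_weight * remote
```

Here two sources falsely reporting a level of 9 among five reports leave the median at 1. If a majority of sources lie, the median is wrong as well, which is the unavoidable consequence noted above.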
Despite these limitations, our situation is far from hopeless. Some of the techniques we apply to mitigate this are:
All of this leads to the issue of how hard it is to lie successfully. While only a little research has been done on the subject to date, there are results indicating that the use of redundant audit sources from within the same computer can make undetected lying quite complex. The complexity of applying this sort of result across infrastructures has not yet been studied.
There are also ways of enhancing specific systems to provide higher levels of assurance. Some of these techniques include cryptographic authentication and obfuscation, remote detection of physical characteristics, and the use of differentiated signaling. While these techniques are not easily scaled to millions of systems, they may be valuable for providing small numbers of high assurance systems which can be used for sanity checks on other operations.
The Attacker Wins Some
This is a network defense intended to provide increasing assurance against increasing losses in networks where the value of assets is diffused. In cases where a single high-valued asset exists, higher levels of assurance are appropriate, and less scalable custom or semi-custom defenses are probably more appropriate. In cases where any loss is a major loss, high assurance protection is key and this highly scalable and flexible method of defense may not be as appropriate as other alternatives.
In fact, with this defense, we expect that some systems will be broken into. As such, the attacker who goes up against such a defensive system is likely to attain some success. The goal of our defense is not to prevent all attacks, it is only to make it harder and harder to attack more and more systems. Successful attack depends, as it does in other domains, on knowing how much to take and how often to take it. If you take too little, it's not worth the risk of getting caught, and if you take too much, your odds of getting caught become very high. From the sophisticated attacker's perspective, the issue to be addressed is how much can be taken how often without a substantial chance of being caught. One of the key goals of this defensive scheme is to make that determination difficult and unreliable for the attacker.
The Insider Is At Risk
The real difference between an authorized user and an unauthorized user from a computer security standpoint is that the authorized user does authorized things while the unauthorized user doesn't always know what is and is not authorized and thus does some unauthorized things. Now this may seem a bit strange at first, but from the perspective of a computer security system, the only difference between two users comes in the form of how they behave at the interfaces between the system and the user. Even a thumbprint and retinal scan combined with a password and possession of a smart card are no different to a computer than good forgeries of each. If an attacker behaves in every detectable way like a legitimate user, the computer system will not be able to tell them apart.
Because differentiation is the only way to detect attack, any computer security system must be able to distinguish between insiders and outsiders based on some differences in their manner of use. People differ significantly and so do certain aspects of their manner of use, such as typing speeds and types of errors they make. This makes sensitive detection difficult to build and creates a tradeoff between false positives and false negatives. The better tuned the detection system is, the better discrimination we can get.
This use of complex and time-variant sensors has two effects in this regard. One effect is that some legitimate users may sometimes be detected as attackers. The other is that some attackers may sometimes be treated as insiders. The key issue is that, while legitimate users really have nothing to fear, attackers do. When and if a detection is eventually investigated by people, the attacker has a chance of being differentiated. **** work on this *** Since the defenses change with time, only the insider with detailed knowledge
Using the mechanism against itself
covert channels
reflexive control
and so forth
The Distributed Assessment and REsponse (DARE) system is a first attempt at implementing these notions. DARE is currently operating on the HEAT network - a collection of more than 50 computers of different sorts dedicated to cyber attack and defense experiments.
Each computer contains some number of sensors that fuse detailed input sequences into a set of InfoCon values at a set of times. The values are determined by the design of sensors and response system with the intent that higher values and more recent events will tend to cause the system to become more protective.
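The intent that higher values and more recent events make a system more protective can be illustrated with a recency-weighted score. The exponential decay, the half-life, and the use of the maximum are assumptions chosen for this sketch. The actual mapping is determined by the design of the sensors and response system.

```python
def protectiveness(events, half_life: float = 600.0) -> float:
    """Score a set of (infocon_value, age_seconds) events.

    Each value is halved for every half_life seconds of age, and the
    strongest remaining value determines the score.
    """
    return max((value * 0.5 ** (age / half_life) for value, age in events),
               default=0.0)
```

Under these assumptions a level 4 event seen just now outweighs a level 8 event seen twenty minutes ago, since the latter has decayed to 2.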
Each computer produces and follows a schedule for requesting InfoCon information (values and relative times) from other systems. These values and times are combined with historical values and times and the computer's local InfoCon levels and times to produce the computer's assessment of the True InfoCon and a measure of certainty (i.e., a confidence level) with which the system's assessment of the value at the time of the assessment reflects the underlying reality. The systems and times selected for the schedule depend on what sort of information is being sought by the computer making its schedule, but it generally reflects an inherent goal of increasing the confidence level and the tradeoffs between achieving increased confidence and decreases in overall system performance. Generally, the user of a system can specify an allowable level of performance reduction associated with system protection and the
Each computer has a set of models that anticipate behaviors associated with different threat profiles in terms of the distributions of InfoCon levels over time and space. These models can be compared against actual behaviors to yield fitness measures reflecting the similarity of each model to observed behaviors. If observed behaviors fit very closely to a model, this increases the certainty with which the computer will act in accordance with the predictions of that model in scaling its defenses and increases the extent to which it will put forth effort in probing remote system histories and its own histories to try to refute the stronger and stronger assumption that the observed behavior is being properly modeled.
The notion here is that the more certain a system is of the threat profile assumption it is making for defending itself, the harder it will try to refute the assumption by searching in more depth. The less certain a system is of the threat profile assumptions it is making to defend itself, the broader it will search in order to find a closer match to one of its known behavioral patterns. Ultimately, it may be unable to search any more broadly or find any very close matches between its models and observed behaviors. In this case, it must create a new model, do detailed analysis of histories, compute anticipated consequences associated with the behaviors observed, and determine an appropriate set of responses to mitigate this type of threat. In reality, such occurrences trigger human investigation and the adoption of new models. Inconsistent data from different sources is part of the modeling process and provides for the possibility that some of the sources of information are in error or intentionally trying to subvert the protective system.
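The depth-versus-breadth behavior can be sketched with a toy fitness measure. The assumptions, all made for illustration, are that each model is reduced to a predicted sequence of InfoCon levels of the same length as the observations, that fitness is derived from mean squared error, and that two thresholds separate the three behaviors. The actual models cover distributions over time and space.

```python
def fitness(observed, predicted) -> float:
    """Return 1.0 for a perfect match, falling toward 0 as the error grows."""
    err = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    return 1.0 / (1.0 + err)


def next_action(observed, models: dict, high: float = 0.8, low: float = 0.3):
    """Pick the best-fitting model and decide how to search next."""
    name, fit = max(((n, fitness(observed, p)) for n, p in models.items()),
                    key=lambda pair: pair[1])
    if fit >= high:
        return name, "probe in depth to try to refute"
    if fit <= low:
        return name, "no close match: refer to human investigation"
    return name, "sample more broadly"
```

A close fit therefore triggers deeper probing of the leading hypothesis, a middling fit triggers broader sampling, and the absence of any close fit is handed to people, as described above.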
A particularly important notion is the correlation of information about data sources along different dimensions of similarity and difference that we care about. For example, scheduling rates could be based on total similarities of business function, infrastructure dependencies (e.g., power supply, physical building location, same local network), corporation, department, ISP, machine type, OS version, same class-A, class-B, or class-C network, and so forth.
The sensors currently used by DARE come from the Deception ToolKit (DTK). This was selected for ease of use in prototype implementation, ready availability in source form, the fact that it is fully distributed, and ease of use across platforms. These sensors already produce InfoCon levels and relative times (by providing the current system time and the system times of each event, time differentials can be used instead of absolute times). InfoCon levels are based on user supplied (or default) response profiles and retain audit trails in formats amenable to use. DTK also provides a Deception Port which we use for exchanging information on InfoCons and has rudimentary authentication and encryption capabilities to allow for its use in authenticated encrypted exchange between hosts.
The scheduler was designed in a few hours and is called SIM because it is basically an event driven simulation engine where events trigger the execution of user specified scripts. SIM implements automatic and process-specified requeues of processes and is timer driven using system interrupts. It is largely platform independent and was designed to integrate with DTK so that it can easily pull InfoCon levels from other systems.
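As one hypothetical way to turn similarity into a scheduling rate, the sketch below counts the dimensions on which two hosts match and polls more similar hosts more often. The attribute names, the base interval, and the inverse relationship are assumptions of this example.

```python
def shared_dimensions(a: dict, b: dict) -> int:
    """Count the attributes on which two host descriptions agree."""
    return sum(1 for key in a if key in b and a[key] == b[key])


def poll_interval(a: dict, b: dict, base_interval: float = 3600.0) -> float:
    """Seconds between polls of host b by host a: shorter when more is shared."""
    return base_interval / (1 + shared_dimensions(a, b))
```

For instance, two hosts that share an operating system and an ISP but not a department would poll each other every 1200 seconds under these assumptions, against 3600 seconds for hosts with nothing in common.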
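The general shape of an event driven engine with automatic and process-specified requeues can be sketched as follows. This is not the SIM code: the class and parameter names are hypothetical, the actions are Python callables instead of user specified scripts, and time is simulated where SIM is driven by system timer interrupts.

```python
import heapq
import itertools


class Scheduler:
    """Minimal event queue ordered by simulated time."""

    def __init__(self):
        self._queue = []
        self._counter = itertools.count()  # tie-breaker for events at equal times
        self.now = 0.0

    def schedule(self, delay, action, requeue_every=None):
        """Queue action(scheduler) to run after delay; optionally repeat it."""
        entry = (self.now + delay, next(self._counter), action, requeue_every)
        heapq.heappush(self._queue, entry)

    def run(self, until):
        """Run queued events in time order up to and including time until."""
        while self._queue and self._queue[0][0] <= until:
            when, _, action, requeue_every = heapq.heappop(self._queue)
            self.now = when
            action(self)  # the action may call schedule() itself (process-specified requeue)
            if requeue_every is not None:
                self.schedule(requeue_every, action, requeue_every)  # automatic requeue
        self.now = until
```

A pull of InfoCon levels from another system would be one such action, requeued either at a fixed period or by the action itself when it picks its next, possibly randomized, delay.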
The modeling system is based on the model-based anticipation and constraint engine provided in the CID project. This system uses (threat, mechanism) pairs to derive match proximity information from indicators (in the form of data fused to mechanisms) and can produce predictions of future behavior based on match proximity of other mechanisms to previously detected mechanisms.
Consequence analysis is performed by a linear algebra solve ...