[iwar] [fc:Global.Routing.Instabilities.during.Code.Red.II.and.Nimda.Worm]

From: Fred Cohen (fc@all.net)
Date: 2001-09-28 16:07:11



Global Routing Instabilities during Code Red II and Nimda Worm
Propagation Preliminary Report 19 
James Cowie, Andy Ogielski, BJ Premore and Yougu Yuan, Renesys
Corporation, 9/28/2001
<http://www.renesys.com/projects/bgp_instability/>

SUMMARY

As a part of an ongoing project to develop practical and scalable tools
for analysis of very large, high-dimensional Internet behavior datasets,
researchers from Renesys Corporation have been studying RIPE NCC's large
repository of raw BGP message data. 

In this online note, we summarize our preliminary analysis of the
surprisingly strong impact of the Internet propagation of Microsoft
worms (such as Code Red and Nimda) on the stability of the global
routing system.  The data exhibit strong correlations between BGP
message storms and worm propagation periods. 

This note will continue to evolve as our analytical tools and techniques
evolve, and will reflect the results of ongoing experiments and the
daily arrival of new BGP message traffic. 

INTRODUCTION

Many successful academic and commercial projects use direct traffic
measurements (such as ping, traceroute, and web page access data) to
study the structure and dynamics of the Internet.  Such efforts are
inherently limited by the locations of probe points required to 'cover'
the Internet meaningfully.  Compounding the problem, there are no
effective shortcuts - simply placing agents throughout the Internet's
core, as done by several commercial services, only builds up a picture
of core-to-core traffic latencies and losses that has no power to
predict the true "Internet weather" that end users actually experience
at the network edge. 

Studying global routing data provides one of the few alternatives to
traffic-based analysis of the Internet's dynamics.  By the very nature
of globally distributed BGP routing processes, a listener at any
well-connected point has the opportunity to obtain a very accurate
picture of the evolution of best routes to every prefix in the Internet,
delayed only by seconds to minutes.  In particular, research has begun
to focus on the dynamics of streams of BGP route update messages as a
tool to identify the emergence of long-lived routing instability events.

DO MICROSOFT WORMS CAUSE GLOBAL ROUTING INSTABILITY?

On multiple occasions, we have detected hours-long periods of
exponential growth and decay in the route change rates, across all
sampling points and most prefixes, indicating significant widespread
degradation in the end-to-end utility of the global Internet.  To our
very great surprise, these events did not correlate with point failures
in the core Internet infrastructure, such as power outages in telco
facilities or fiber cuts. 

Instead, we have documented a compelling connection between global
routing instability and the propagation phase of Microsoft worms such as
Code Red and Nimda.  Contrary to conventional wisdom, what were thought
to be purely traffic-based denials of service in fact are seen to
generate widespread end-to-end routing instability originating at the
Internet's edge. 

We speculate that, although most of the traffic in the Internet
continued to flow normally through the small fraction of links that make
up the global backbones, most of the links at the Internet edge had
serious performance problems during the worms' probing and propagation
phases.  A complete list of reasons still needs to be documented, but we
suspect i) congestion-induced failures of BGP sessions due to timeouts;
ii) flow-diversity induced failures of BGP sessions due to router CPU
overloads; iii) proactive disconnection of certain networks; and iv)
failures of other equipment at the Internet edge such as DSL routers and
other devices. 

METHODOLOGICAL BACKGROUND

When a BGP router's "best route" to a given network prefix has changed
(for better or worse), it sends out a BGP UPDATE message to each
connected peer router.  By establishing BGP peering connections with a
large number of BGP routers from well-connected organizations, analysis
of traffic gathered at a single BGP monitoring point can provide a great
deal of information about the way those organizations view the Internet,
and about the dynamics of how paths change over a wide range of
timescales. 

The RIPE routing information project maintains several such collection
points across Europe; they peer with many of the so-called global tier-1
providers, plus very many smaller regional European networks.  Access to
multiple BGP monitoring points provides additional opportunities to
filter the effects of infrastructure failures that are "close to"
individual collection points, clearing the way to unambiguously identify
and study routing instability features that affect large portions of the
Internet simultaneously. 

OVERVIEW OF CONCERNS

We are performing multiresolution analyses of tens of gigabytes of
archived BGP message data from the RIPE collection points, seeking to
learn what we can about the origins and mechanisms of global routing
instability. 

There are two predominant strategies for using routing statistics to
measure global Internet instability:

Reachability; that is, measuring the number of prefixes that appear in a
particular organization's routing tables at a given time. 

Rates of change; that is, measuring the number of prefix announcements
and withdrawals in BGP UPDATE messages sent out by a particular
organization per unit time. 
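
The two strategies above can be sketched in a few lines of Python.  This
is only an illustration under stated assumptions: the record layout
(timestamp, peer, kind, prefix), the peer name and the sample prefixes
are hypothetical and do not reflect the actual RIPE data format.

```python
from collections import Counter

# Each record is (timestamp_seconds, peer, kind, prefix); kind is "A"
# for an announcement or "W" for a withdrawal.  Hypothetical layout.
updates = [
    (0, "peer1", "A", "10.0.0.0/8"),
    (5, "peer1", "A", "192.0.2.0/24"),
    (40, "peer1", "W", "192.0.2.0/24"),
    (70, "peer1", "A", "192.0.2.0/24"),
    (75, "peer1", "W", "10.0.0.0/8"),
]

def reachable_prefixes(updates, peer, at_time):
    """Reachability: prefixes announced (and not withdrawn) by `peer`
    up to and including `at_time`."""
    table = set()
    for ts, p, kind, prefix in sorted(updates):
        if p != peer or ts > at_time:
            continue
        if kind == "A":
            table.add(prefix)
        else:
            table.discard(prefix)
    return table

def change_rate(updates, peer, window=60):
    """Rate of change: announcements plus withdrawals per time window,
    keyed by window index."""
    counts = Counter()
    for ts, p, kind, prefix in updates:
        if p == peer:
            counts[ts // window] += 1
    return dict(counts)

reach_at_60 = reachable_prefixes(updates, "peer1", 60)
rates = change_rate(updates, "peer1")
```

Reachability looks at the state of the table at one instant, while the
rate of change counts every message regardless of the resulting state,
which is why the two measures can diverge during a storm.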

The BGP protocol contains dampening features that prevent a BGP router
from exchanging "too many" messages about a given prefix with a given
peer.  As a result, one never sees information about route changes to a
given network prefix more frequently than once every 30 seconds per
peer.  If we see large increases in the number of BGP update messages,
therefore, it's an unambiguous sign that the diversity among network
prefixes under discussion is rising. 
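
The counting argument above can be made concrete with a small sketch.
It assumes, as the text states, that a given prefix appears at most once
every 30 seconds per peer; the function name and the sample message
counts are hypothetical.

```python
import math

def min_distinct_prefixes(messages, window_seconds, min_interval=30):
    """Lower bound on the number of distinct prefixes needed to explain
    `messages` per-prefix updates from one peer in one time window,
    assuming each prefix is updated at most once per `min_interval`."""
    # A single prefix can show up at most this many times in the window
    # (one at the start, then one more per elapsed interval).
    per_prefix_max = window_seconds // min_interval + 1
    return math.ceil(messages / per_prefix_max)

quiet = min_distinct_prefixes(400, 60)
storm = min_distinct_prefixes(10_000, 60)
```

Under this assumption, a jump in message volume from one peer cannot be
explained by a handful of flapping prefixes; the lower bound on the
number of prefixes involved rises in proportion.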

DISTINGUISHING FEATURES

The duration of these BGP message surges, and the nature of the growth
(linear or exponential, for example) are what distinguish truly global
Internet instability from simple background noise (which is pervasive). 
Very short, high spikes in announcement rates are very common --
whenever a peer BGP session undergoes a hard reset, for example, a full
table dump will follow.  More surprisingly, our examination of the data
indicates that failures in the core internet infrastructure (fiber cuts,
flooding, generator failures, building collapses, train wrecks) tend to
generate only short-term increases in the BGP prefix announcement rate,
which revert to the mean in a matter of seconds or minutes as the highly
redundant core Internet topology routes around the damage.  Specific
networks may remain unreachable until the damage is repaired, but
because content networks are so vastly outnumbered by access networks,
the "average" network prefix presumably adds very little in the way of
marginal utility to the "average" Internet user. 

Of far greater concern is the appearance of sustained exponential rises
in BGP message rates that last for hours.  That is what the
worm-triggered traffic causes. 

GLOBAL ROUTING STABILITY -- SUMMER OF 2001

The results described here are based on analysis of time-stamped BGP
messages collected at the RIPE NCC site rrc00 in Amsterdam, the
Netherlands, for the period of June - September, 2001.  We also analyzed
BGP traffic from other Internet exchanges that host RIPE BGP collection
sites, including LINX (London), SFINX (Paris), AMS-IX (Amsterdam), CIXP
(Geneva), and VIX (Vienna).  The extended analysis results will be
presented in the coming updates to this note, and in separate
publications. 

The RIPE NCC collection facility is particularly interesting, as it
collects BGP routing updates from several large Internet providers, and
thus provides a good and fairly complete dynamic view - second by second
- of the evolving state of global routing. 

The following Autonomous Systems have BGP peering routers at the RIPE
NCC location:

AS286, KPNQwest Backbone
AS513, CERN
AS1103, SURFnet
AS2914, Verio
AS3257, Tiscali Global ASN
AS3549, Global Crossing (two separate peering sessions)
AS4608, Telstra Internet
AS4777, APNIC Pty Ltd - Tokyo
AS7018, AT&T
AS9177, Nextra (Schweiz)
AS13129, Global Access Telecommunications

TRENDS IN THE AGGREGATE RATE OF BGP ANNOUNCEMENTS


The wide graphic plot above shows the trends in the aggregated rate of
BGP prefix announcements received from all of the above peers at the
RIPE NCC data collection points from 1 June through 24 September 2001. 
The X-axis is time, and the Y-axis is the log of the total number of
network addresses (prefixes) advertised in consecutive 30-second
windows.  In other words, each dot in the plot represents a 30-second
count of the number of prefixes advertised in BGP Update messages
received in Amsterdam. 
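
As a rough sketch of how such a series could be built, the following
Python bins advertised prefixes into 30-second windows and takes the
logarithm of each count.  The input format (one timestamp per advertised
prefix) and the choice of base 10 are assumptions; the note does not
state which logarithm base the plot uses.

```python
import math
from collections import Counter

def log_counts(timestamps, window=30):
    """Map each window start time (seconds) to log10 of the number of
    prefixes advertised in that window.  Empty windows are omitted."""
    counts = Counter(int(ts) // window for ts in timestamps)
    return {w * window: math.log10(n) for w, n in sorted(counts.items())}

# Hypothetical input: one timestamp per advertised prefix.
sample = [0] * 10 + [31] * 100 + [65] * 1000
series = log_counts(sample)
```

On a log scale, a run of windows whose counts grow by a constant factor
appears as a straight rising line, which makes sustained exponential
growth easy to tell apart from isolated spikes.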

What this plot shows:

These data provide a coarse indicator of Global BGP Instability, because
they sum over all of the advertisements of all of the prefixes by all of
the autonomous systems that peer with RIPE NCC.  The timeseries
represents a measure of the gross route change activity for the entire
Internet. 

Any patterns and features observable at this high level might tell us
something about the global state of BGP routing, as it fluctuates from
day to day.  Surprisingly, there are several strong patterns and
features that emerge from the background of noise.  For example:

One can observe strong weekly and daily trends in the rate of route
advertisements, an effect which may be due to interactions with either
the diurnal patterns of traffic, or the diurnal patterns of activity by
network operators performing routine maintenance on BGP routers
(interestingly, if the latter were the case, it looks like network
operators ramp up their work from Monday through Wednesday, and then
steadily wind down towards the weekend). 

Two strange non-periodic features jump out when the baseline is
examined:

1. an order of magnitude higher BGP message storm on July 19th,
2. a more rapidly rising and longer-lasting BGP message storm on
   September 18th.

What this plot does not show:

The above aggregated timeseries does not serve as a measurement of
"reachability" over time --- this would be achieved by plotting the
number of prefixes in the routing tables, and watching for dips. 
Because the data aggregate across all ASs and prefixes, there are few
"interesting" features when network infrastructure is broken at
localized geographic points.  Interestingly, neither the Baltimore
tunnel train wreck of 18 July nor the attacks of 11 September appear as
features in this plot.  These tragic events did not destabilize the
global Internet.  In general, the high levels of routing activity
following fiber cuts between tier-1 and other major providers remain
localized within the immediately affected Autonomous Systems, and do not
create message storms that are highly visible worldwide. 

But something else did create a long-lasting instability of BGP routes
on July 19th and September 18th.  In particular, we are concerned with
the two non-periodic features visible on 19-20 July and 18-19 September. 
These two "storms" in BGP update rates correlate with the propagation
phases of the Microsoft worms known as Code Red 2 (in July) and Nimda
(in September). 


BGP STORM #1: JULY 19, 2001 (CODE RED II)

On July 19th, we observed an exponentially growing eight-fold increase
in the advertisement rate, over a period of about eight hours (all times
are in GMT; subtract 4 hours for EDT).  This BGP surge faded over the
same time scale as it arrived.  When one considers the conventional
wisdom about BGP convergence times (seconds to minutes), it is more than
a little disturbing to see a fundamental quantity like BGP advertisement
rate exhibiting exponential growth for eight hours. 

One initial guess was a delayed effect from the Baltimore train wreck,
whose impact was highly visible in the discussions on various mailing
lists such as NANOG, as network operators "tweaked" routing for the next
day or so.  But this does not appear to have caused the BGP storm. 


A zoom-in on the BGP message storm of July 19. 

In order to gain a better understanding of the mechanism driving this
BGP storm, we conducted a finer analysis.  We began by separating the
contributions of individual BGP peering sessions to the total BGP update
message traffic, and by separately following the time courses of BGP
route announcement and withdrawal messages. 

BGP prefix announcements in 60 min periods.  Each row is one BGP peer in
RIPE NCC.  Note a wave of announcements on July 19. 

Consider the plot above, in red.  In this plot, BGP prefix announcements
are counted in 60 minute bins, and graphed along the Z-axis as impulses. 
The X-axis is time, from July 1st through July 31st.  The Y-axis (going
into the screen) separates the contributions of the 13 individual peer
AS's at rrc00 (each data row parallel to the X-axis represents
announcements from one BGP router peering with the RIPE NCC message
collecting router).  On other days, it is common for individual peers to
contribute "spikes" of high advertisement volume to the mix, presumably
reflecting BGP sessions closing and opening close to the collection
point.  On July 19th, however, all peers experience a "wave" of smoothly
increasing traffic, sustained for many hours. 
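
One way to separate a single-peer "spike" from an all-peer "wave" is
sketched below.  The record layout, the peer names and the threshold are
hypothetical; this is an illustration of the idea, not the procedure
used to produce the plots.

```python
from collections import defaultdict

def bin_by_peer(records, bin_seconds=3600):
    """records: iterable of (timestamp_seconds, peer, prefix_count).
    Returns {peer: {bin_index: total_prefix_count}}."""
    bins = defaultdict(lambda: defaultdict(int))
    for ts, peer, count in records:
        bins[peer][int(ts) // bin_seconds] += count
    return bins

def common_surge_bins(bins, threshold):
    """Bins in which every peer exceeds `threshold` at the same time."""
    all_bins = set()
    for per_peer in bins.values():
        all_bins.update(per_peer)
    return sorted(
        b for b in all_bins
        if all(per_peer.get(b, 0) > threshold for per_peer in bins.values())
    )

# Hypothetical data: peerA spikes alone in hour 1, both surge in hour 2.
records = [
    (10, "peerA", 100), (3700, "peerA", 5000), (7300, "peerA", 6000),
    (20, "peerB", 120), (3800, "peerB", 90), (7400, "peerB", 7000),
]
surges = common_surge_bins(bin_by_peer(records), threshold=1000)
```

Requiring every peer to exceed the threshold in the same bin filters out
session resets near the collection point, which affect one row only.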


BGP prefix withdrawals in 60 min periods.  Each row is one BGP peer in
RIPE NCC.  Note a wave of withdrawals on July 19. 

Similarly, the above plot (in blue) shows the hourly rate of BGP
withdrawal messages across all peers at rrc00.  The prefix withdrawal
count on the Z-axis indicates the number of network prefixes that are
no longer reachable via the given peer.  The July 19th surge is the
only occasion in July when all 13 peers register a significant
simultaneous surge in withdrawal rates, lasting for approximately
eight hours. 

Further analysis of the BGP message traffic has since indicated that no
specific autonomous system or set of autonomous systems seems to be
generating the traffic surge, and that no specific IP prefix or set of
prefixes was flapping significantly more than before the onset of the
surge. 

Instead, the net effect was that routes to most of the 110,000 prefixes
in the Internet were changing a few more times than normal.  In other
words, the data reflect a broad-based BGP storm with no single-point
cause. 

CORRELATION WITH CODE RED II ATTACK PHASE

The time course of the July 19 BGP storm suggests that it was
triggered by the sudden spread of a new variant of the Microsoft worm
known as Code Red 2. 

Our analysis has been materially aided by data collected by the network
security community and announced on the incidents.org, jammed.com and
neohapsis.com mailing lists.  One can find there a higher than usual
number of anecdotal reports of sudden connectivity losses, ARP storms,
and similar localized worm effects.  In addition, however, several alert
people presented quantitative data supporting the assertion that a new
type of Code Red worm started a very rapid propagation phase at a time
coinciding with the onset of the BGP storm that we have shown above. 
Ideal data on the worm-generated traffic storm would show the time
series of worm activity for a good statistical sample of networks of
known size (i.e.  prefix length), so that the global activity levels
could be inferred by extrapolation. 

Here we plot the Code Red 2 propagation data collected independently on
two class B networks (i.e.  each nominally containing 2^16 = 64k IP
addresses) during the entire day of July 19.  The original data were
collected by Ken Eichman and Dave Goldsmith, respectively; the relevant
posts are linked from the online version of this note, and summarized
on the incidents.org mailing list "Handler's Diaries" on July 20 and
July 21. 

The time series shown above (in red) plot the number of HTTP requests
(or TCP SYN packets) received in the two distinct /16 networks hour
after hour.  Note the virtually identical time course of these attacks
as seen from different networks.  These plots give a measure of the
intensity of worm scanning traffic at all affected networks - that is,
knowing the probability of target IP address generation by the worm,
plots like the one above allow one to estimate the global level of
worm-induced traffic.  The analysis of this is continuing as we receive
more such data. 
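
The extrapolation step can be illustrated as follows.  This is a
minimal sketch that assumes, unless a different probability is supplied,
that the worm picks targets uniformly at random over the whole IPv4
address space; real worms may bias their target selection, so the
default is an assumption and the sample count is hypothetical.

```python
def estimate_global_probes(observed, prefix_length, hit_probability=None):
    """Scale probes observed in one network up to a global estimate.

    hit_probability is the chance that a single probe lands in the
    monitored network.  If omitted, uniform random targeting over the
    2**32 IPv4 addresses is assumed.
    """
    if hit_probability is None:
        hit_probability = 2 ** (32 - prefix_length) / 2 ** 32
    return observed / hit_probability

# Hypothetical: 1,000 probes seen in one hour inside a /16 network.
global_per_hour = estimate_global_probes(1000, 16)
```

With a known, non-uniform target generation algorithm, the same scaling
applies once `hit_probability` is replaced by the actual probability for
the monitored network.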

A thorough analysis of the host infection rate by Code Red 2 is
available from CAIDA Analysis of Code-Red and in particular David
Moore's analysis The Spread of the Code-Red Worm (CRv2). 

NETWORK REACHABILITY FAILURES DURING CODE RED II

In the following figure we show that all prefixes were similarly
affected by the impact of the Code Red 2 worm attack on the stability
of global routing.  Only the classful networks (/8, /16 and /24) are
shown for clarity, but similar behavior has been seen for all
intermediate prefix lengths. 

Rate of prefix withdrawals in 30-sec intervals for selected prefix
lengths

Further analysis will be presented to demonstrate that no particular AS,
or prefix, and no particular set of ASs or prefixes, were to blame for
this instability. 

BGP STORM #2: SEPTEMBER 18-19, 2001 (NIMDA)

On Tuesday, September 18, simultaneous with the onset of the propagation
phase of the Nimda worm, we observed another BGP storm.  This one came
on faster, rode the trend higher, and then, just as mysteriously, turned
itself off, though much more slowly.  Over a period of roughly two
hours, starting at about 13:00 GMT (9am EDT), rrc00 aggregate BGP
announcement rates exponentially ramped up by a factor of 25, from 400
per minute to 10,000 per minute, with sustained "gusts" to more than
200,000 per minute.  The advertisement rate then decayed gradually over
many days, reaching pre-Nimda levels by September 24th. 
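
To compare the two storms, the growth figures quoted in this note can
be converted to doubling times.  The sketch below assumes purely
exponential growth over the whole quoted interval, which is a
simplification; the function name is hypothetical.

```python
import math

def doubling_time_hours(growth_factor, duration_hours):
    """Doubling time implied by growing `growth_factor`-fold in
    `duration_hours`, assuming pure exponential growth."""
    return duration_hours * math.log(2) / math.log(growth_factor)

# July 19: eight-fold increase over about eight hours.
july = doubling_time_hours(8, 8)
# September 18: 400 -> 10,000 per minute over roughly two hours.
september = doubling_time_hours(10_000 / 400, 2)
```

Under these assumptions the July storm doubled roughly every 2.7 hours,
while the September storm doubled in about 0.43 hours (around 26
minutes), consistent with the description of Nimda as "coming on
faster".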

A zoom-in on the BGP message storm of September 18 - 19.

The analysis of the BGP storm triggered by the Nimda worm followed a
similar course as the analysis of Code Red presented above.  Analysis of
finer details continues, but as in July, there does not seem to be a
single AS or prefix, or group of ASs or prefixes, whose BGP
advertisement rate increased disproportionately during the storm. 

BGP prefix announcements in 15 min periods.  Each row is one BGP peer in
RIPE NCC.  Note steep onset of a wave of announcements on September 18. 

In this plot, prefix announcements in 15 min periods are separated by
contributing BGP peers at RIPE NCC as in the July worm analysis.  Note
the heavy announcement surge across all peers, with a long-tailed finish
that continues past September 20. 

BGP prefix withdrawals in 15 min periods.  Each row is one BGP peer in
RIPE NCC.  Note steep onset of a wave of withdrawals on September 18. 

And as in July, this plot of 15-minute September withdrawal data shows
strong correlation of withdrawals among all peers in the September 18-19
event.

CORRELATION WITH THE NIMDA ATTACK

The steep exponential growth of the September 18 BGP storm is aligned
with the exponential spread of Nimda, the most virulent Microsoft worm
seen to date.  The Nimda worm exhibits extremely high scan rates and
multiple attack modes generating very heavy traffic, and has been much
more damaging than the July Code Red worm. 

Preliminary analyses describing the Nimda spread and attack modes are
available from SecurityFocus and SANS Institute. 

As with the Code Red 2 worm in July, the security mailing lists contain
a huge number of non-quantitative reports of a stunningly rapid increase
of HTTP probing and of various network slowdowns and connectivity
failures, with scan activity jumping dramatically at approximately 13:00
GMT.  Edge network administrators were reporting "insane ARP storms",
router failures, congestion slowdowns and connectivity problems.  The
SANS report shows the typical growth trend in the rate of HTTP scans; it
correlates precisely with the onset of global routing instability, as
measured by increases in the BGP prefix announcement rate. 

The time series in the image above (source: SANS) illustrates the trend
in the worm scanning rate.  Notice a faster onset than with the July 19
Code Red 2 attack.  The data shown here do not contain any information about
the size of a network (or multiple networks) where the measurements were
obtained.  As such data become available, the analysis of the worm's
target IP address generation algorithm will allow us to estimate the
global traffic intensity created by the worm spread.  See also CAIDA's
estimates of the total number of Nimda infected hosts. 

NETWORK REACHABILITY FAILURES DURING THE NIMDA ATTACK

In the figure below, we show again that all prefixes were similarly
affected by the impact of the Nimda worm attack on global routing
stability.  Only the classful networks (/8, /16 and /24) are shown for
clarity, but similar behavior is seen for all prefix lengths. 

Rate of prefix withdrawals in 30-sec intervals for selected prefix
lengths. 

However, the very rapid onset of Nimda-induced routing instability,
compared with the slower onset of the Code Red induced instability on
July 19, supports several preliminary observations:

The rate of withdrawals of smaller networks, such as /24, is the first
to begin to rise rapidly. 

The log slope of the rate of withdrawals of smaller networks, such as
/24, is higher. 
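
The "log slope" comparison can be sketched as a least-squares fit of
the logarithm of the withdrawal counts against time.  The sample series
below are synthetic and hypothetical; they only illustrate how a steeper
slope for /24 prefixes than for /8 prefixes would show up.

```python
import math

def log_slope(counts, interval_seconds=30):
    """Least-squares slope of ln(count) versus time, in 1/second.
    Zero counts are skipped; at least two positive counts are needed."""
    points = [
        (i * interval_seconds, math.log(c))
        for i, c in enumerate(counts)
        if c > 0
    ]
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    num = sum((t - mean_t) * (y - mean_y) for t, y in points)
    den = sum((t - mean_t) ** 2 for t, _ in points)
    return num / den

# Synthetic withdrawal counts per 30-second interval.
slope_24 = log_slope([10, 20, 40, 80])   # doubles every interval
slope_8 = log_slope([10, 11, 12, 13])    # grows much more slowly
```

A constant positive slope corresponds to exponential growth, and the
ratio of two slopes gives the ratio of the growth rates.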

At this time, we take this to indicate that the worm-induced routing
instabilities begin to propagate from the Internet edge, while the
Internet core remains stable (to the extent that the data shows it).  A
preliminary comparative analysis of the BGP routing tables a few hours
before and after the onset of the Nimda storm shows that the increase in
prefix withdrawals during the period, while significant, did not result
in long-lasting losses of reachability to the edge networks.  These
were, by and large, transient failures.

PRELIMINARY CONCLUSIONS

It would be premature to draw definitive conclusions at this point about
the exact causal relationship between worm propagation and global
internet routing instability.  However, we are accumulating strong
evidence of at least a strong correlation. 

The first hypothesis is that high rates of worm-related traffic are
resulting in significant traffic surges near the edges of the Internet,
causing a large number of BGP sessions to time out, close down, and
reopen.  The reasons may be either congestion losses, or router CPU
overload due to surges in the number of flows.  One would expect BGP
messages to be very high priority traffic, and thus not subject to
congestion-related loss until the situation were dire indeed (such
prioritization is a good thing, as shown in a recent SIGCOMM paper).
But it's less clear that network operators routinely enable this kind
of prioritization. 

Another hypothesis is that explosive growth of worm traffic at the
Internet's edge causes a large number of network operators from
corporations and small ISPs to independently shut down, or reboot or
attempt to reconfigure their border routers; and the total amount of BGP
message traffic grows exponentially with the number of edge domains
feeling the effects of the worm. 

Overall, we tentatively conclude that the worm-induced routing
instability results from the combination of effects such as these two
listed above, plus the failures of other network components.  We will
revise these speculations as more data becomes available.

Copyright © 2001 Renesys Corporation. 

Contact James Cowie <cowie@renesys.com> or Andy Ogielski
<ato@renesys.com> for further information.  Thanks to Henk Uijterwaal
and his group for the RIPE RIS data, and to Tim Griffin, Dave Donoho and
others at the 2001 Leiden Workshop on Multiresolution Analysis of Global
Internet Measurements for fruitful discussions. 

The motivation for this study arose from work partially supported by the
Defense Advanced Research Projects Agency (DARPA), under grant
N66001-00-8065 from the U.S.  Department of Defense.  Its contents are
solely the responsibility of the authors and do not necessarily
represent the official views of the Department of Defense.  This note
should be cited as follows:

Cowie, J., Ogielski, A., Premore, B., and Yuan, Y.  (2001) "Global
Routing Instabilities during Code Red II and Nimda Worm Propagation."
<http://www.renesys.com/projects/bgp_instability>, {access date}. 





This archive was generated by hypermail 2.1.2 : 2001-09-29 21:08:51 PDT