Explanation and analysis of today’s internet outage. What happened, and what could’ve been done to prevent it?

Background

We can see that $clueless_company (AS396531) gets internet connectivity through two providers (called upstreams in network speak): Verizon (AS701) and DQE Communications (AS33154).

So what does “get connectivity” exactly entail? $clueless_company receives a bunch of routes – in fact, probably the whole internet routing table (which is around 800k routes at the moment) – from both providers. Their router imports the routing table from both of these providers and then chooses the best route for each destination based on several factors.
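
To make this concrete, here is a minimal sketch of the customer’s side of those two BGP sessions, in Cisco IOS-style syntax. The neighbor addresses are documentation examples, not the real ones:

router bgp 396531
 ! eBGP session to Verizon (AS701) - sends us a full table (~800k routes)
 neighbor 203.0.113.1 remote-as 701
 ! eBGP session to DQE Communications (AS33154) - also sends us a full table
 neighbor 198.51.100.1 remote-as 33154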

Now, this communication/exchange of routes (which happens over a protocol called BGP) goes both ways. Basically, as a company, you want to have some IP space of your own too. In order for the world to be able to reach your IP space, you need to export a route for that IP space to your upstreams.
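
Continuing the sketch above, originating your own IP space is typically a single extra statement – here 192.0.2.0/24 is a documentation prefix standing in for the company’s real block:

router bgp 396531
 ! originate our own prefix; without export filters it gets advertised to both upstreams
 network 192.0.2.0 mask 255.255.255.0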

Thanks to the magic of routing, every router on the internet knows that in order to reach your IP space, they need to go either through Verizon or DQE. Great!

Where’s the problem?

You are supposed to FILTER what you export!!! The stability and reliability of the internet rely on networks filtering the prefixes they announce. For example, Google shouldn’t export AT&T’s IP space to the internet, and CloudFlare shouldn’t export Google’s. And $clueless_company shouldn’t export large portions of the internet to Verizon… Which is exactly what happened. Instead of exporting only their one prefix (IP block/range), $clueless_company did this:

Illustration of the BGP leak (bgp.he.net)

To put it in words – they took everything DQE provided them and swiftly exported it to Verizon. Verizon decided to play exhibitionist and propagated this to the rest of their peers – basically all other Tier 1 ISPs* (some of which accepted it – I know at least TATA, Cogent and Telia did) and their customers.

Which meant that half of the internet now learned that in order to reach networks like CloudFlare or OVH, they could go through Verizon -> $clueless_company -> DQE.
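
For contrast, a correct customer-side export policy is tiny. Sticking with the same sketch (made-up addresses, 192.0.2.0/24 standing in for their real block), it boils down to “announce our own prefix and nothing else”:

ip prefix-list OUR-PREFIX seq 5 permit 192.0.2.0/24
!
router bgp 396531
 ! only our own block may be announced, towards either upstream
 neighbor 203.0.113.1 prefix-list OUR-PREFIX out
 neighbor 198.51.100.1 prefix-list OUR-PREFIX out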

How routing to OVH looked in Cogent’s Looking Glass. Notice the AS path: 701 396531 33154 3356 16276 – Verizon, then $clueless_company, then DQE, then Level 3 (3356), and finally OVH (16276) as the origin:

Paths: (1 available, best #1)
  Advertised IPv4 Unicast paths to peers (in unique update groups):
    38.5.4.117      
  Path #1: Received by speaker 0
  Advertised IPv4 Unicast paths to peers (in unique update groups):
    38.5.4.117      
  701 396531 33154 3356 16276
    66.28.1.152 (metric 64070) from 154.54.66.21 (66.28.1.152)
      Origin IGP, metric 4294967294, localpref 99, valid, internal, best, group-best, import-candidate
      Received Path ID 0, Local Path ID 1, version 437096364
      Community: 174:10017 174:20666 174:21000 174:22013
      Originator: 66.28.1.152, Cluster list: 154.54.66.21, 66.28.1.9

Why did this have such a large impact?

Normally a leak like this would be a fairly regional affair. Leaked routes usually carry much longer AS paths than the originals, and routers generally prefer shorter ones, so most of the internet keeps using the legitimate routes. What happened here is that DQE runs a routing optimizer, which generates more-specific prefixes (longer prefixes carved out of the aggregates, e.g. splitting a /20 into two /21s) that don’t exist in the global routing table at all.

And routers prefer a longer (more specific) prefix with a long AS path over a shorter prefix with a short AS path – in other words, prefix length carries more weight than AS path length when deciding where to send traffic, because longest-prefix match happens before any BGP tie-breakers are considered. Since these leaked more-specifics had no legitimate counterpart anywhere, every router that accepted them had only one place to send the traffic: through Verizon -> $clueless_company -> DQE.

Who is responsible?

To be honest, I’m not even mad at the $clueless_company. They made a mistake, and mistakes happen. Especially when you’re a smaller company whose primary focus is not IT, this can be forgiven. The one ultimately responsible is Verizon:

  • They didn’t filter $clueless_company
    There should’ve been prefix filters in place. $clueless_company only advertises one prefix and has no downstreams, so this is not hard to do (see the sketch after this list).
  • They didn’t put a prefix count limit on $clueless_company
    Even a generous limit of 100 prefixes would’ve prevented this disaster. The limit would’ve been tripped, the BGP session shut down, and everything would’ve been fine.
  • Their NOC did exactly nothing to mitigate the issue
    The route leak lasted for literally hours. Verizon’s NOC should’ve noticed that something was amiss and dropped the customer’s BGP session.
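
Both of the first two points amount to a handful of lines on Verizon’s edge router. A sketch in the same Cisco IOS-style syntax (the neighbor address and the customer’s prefix are again made up for illustration):

ip prefix-list AS396531-IN seq 5 permit 192.0.2.0/24
!
router bgp 701
 neighbor 203.0.113.2 remote-as 396531
 ! accept only the customer's registered prefix
 neighbor 203.0.113.2 prefix-list AS396531-IN in
 ! safety net: shut the session down if the customer ever sends more than 100 prefixes
 neighbor 203.0.113.2 maximum-prefix 100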

Can the routing optimizer be blamed?

No. A routing optimizer is a piece of software which does exactly what it is supposed to do. The problem is not with the technology, but rather with people operating networks.


*Tier 1 ISPs form the backbone of the internet. They’re very large service providers who freely exchange traffic between each other and sell IP transit to others.