Graceful Restart in NSX

In this article we discuss how Graceful Restart is relevant to your design, and the considerations for ECMP-based designs versus HA-based designs.

I recently had an interesting discussion about Graceful Restart, prompted by some confusion between GR/NSF/NSR in traditional networking and NSX’s implementation, specifically around deploying ESGs in HA mode versus ECMP.

Firstly a quick recap on Graceful Restart, courtesy of Cisco:

When Graceful Restart is used, peer networking devices are informed, via protocol extensions prior to the event, of the SSO capable routers ability to perform graceful restart. The peer device must have the ability to understand this messaging. When a switchover occurs, the peer will continue to forward to the switching over router as instructed by the GR process for each particular protocol, even though in most cases the peering relationship needs to be rebuilt.

Reference Link

Now if we take an extract from the NSX Reference Design Guide, p128:

With GR, the NSX Edge can refresh adjacency with the physical router and the DLR Control VM while requesting them to continue using the old adjacencies. Without GR, these adjacencies would be brought down and renegotiated on reception of the first hello from the Edge and this would ultimately lead to a secondary traffic outage.

So here is the thing: with the DLR Control-VM and ESGs in HA mode, HA is the key element. Remember that the Control-VM is like an ESG from a virtual appliance perspective. So when you deploy a Control-VM or an ESG in HA mode, you have an Active and a Standby appliance. When the Active fails and the Standby takes over, we want to avoid tearing down learnt routes during the failover, otherwise we will black hole traffic.
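To make the peer's side of this concrete, here is a minimal Python sketch of how a neighbour might treat learnt routes when its session to the Active appliance drops. It is a toy model, not NSX or vendor code: the `Peer` class, its method names and the 120 second restart time are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Peer:
    """Toy model of a routing neighbour of the Edge (hypothetical, not NSX code)."""
    graceful_restart: bool
    restart_time: float = 120.0  # seconds; illustrative value only
    routes: Dict[str, str] = field(default_factory=dict)  # prefix -> next hop
    stale_since: Optional[float] = None

    def on_session_lost(self, now: float) -> None:
        if self.graceful_restart:
            # Keep forwarding on the old routes, but remember they are stale.
            self.stale_since = now
        else:
            # No GR: routes are withdrawn immediately and traffic is black holed.
            self.routes.clear()

    def on_session_restored(self, refreshed: Dict[str, str]) -> None:
        # The Standby has taken over and re-advertised its routes.
        self.routes = dict(refreshed)
        self.stale_since = None

    def tick(self, now: float) -> None:
        # If the restarting router never comes back, stale routes are flushed.
        if self.stale_since is not None and now - self.stale_since >= self.restart_time:
            self.routes.clear()
            self.stale_since = None


gr_peer = Peer(graceful_restart=True, routes={"10.0.0.0/24": "192.168.10.2"})
plain_peer = Peer(graceful_restart=False, routes={"10.0.0.0/24": "192.168.10.2"})
gr_peer.on_session_lost(now=0.0)
plain_peer.on_session_lost(now=0.0)
```

After the session loss, `gr_peer` still holds its route while `plain_peer` has none, which is the difference between forwarding through the failover and black holing traffic.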

From a DLR Control-VM perspective, we don’t want routes removed from the VDR instance on each of the hosts whilst the Standby Control-VM is taking over. From an ESG perspective, we don’t want the physical estate or the DLR to tear down routes whilst the Standby ESG is taking over. Keep in mind that routing protocol timers may be left at their defaults with HA-based ESGs/Control-VMs, but to mitigate any additional risk of black holing we can introduce a floating static route, that is, a static route with a higher administrative distance than the dynamically learnt routes. It will only be used in the event that all other routes are lost.
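The floating static route idea can be sketched in a few lines of Python. This is only an illustration of route selection by administrative distance: the `Route` type, the `best_route` helper, the addresses and the distance values (20 for the dynamic route, 240 for the static) are assumptions, not values taken from an NSX configuration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Route:
    prefix: str
    next_hop: str
    source: str          # e.g. "bgp" or "static"
    admin_distance: int  # lower value wins


def best_route(routes: List[Route], prefix: str) -> Optional[Route]:
    """Return the route for `prefix` with the lowest administrative distance."""
    candidates = [r for r in routes if r.prefix == prefix]
    if not candidates:
        return None
    return min(candidates, key=lambda r: r.admin_distance)


rib = [
    Route("0.0.0.0/0", "192.168.10.1", "bgp", 20),
    # Floating static: same prefix, deliberately higher distance.
    Route("0.0.0.0/0", "192.168.10.1", "static", 240),
]

normal = best_route(rib, "0.0.0.0/0")

# Simulate the dynamic routes being lost during a failover.
rib_after_loss = [r for r in rib if r.source == "static"]
fallback = best_route(rib_after_loss, "0.0.0.0/0")
```

While the dynamic route is present it is preferred (`normal.source` is `"bgp"`); once it disappears, the static route takes over (`fallback.source` is `"static"`) and traffic still has somewhere to go.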

Where this differs with ECMP ESGs is that Graceful Restart is not required, full stop! With ECMP, if an ESG fails, traffic uses an existing learnt path via a second (or further) ECMP ESG. In addition, these ESGs can support tuned hello and dead timers for the dynamic routing protocol. If an ECMP ESG fails, we want to tell the world as quickly as possible to use the existing active redundant path. Graceful Restart would work against this, because peers would keep forwarding to the failed ESG instead of moving to the surviving paths.
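As a rough illustration of the ECMP case, the Python sketch below drops a peer once its dead interval has expired and hashes flows over whatever paths remain. The function names, the ESG labels, the CRC32-based hash and the 3 second dead interval are assumptions made for the example; real devices use their own hashing and timer values.

```python
import zlib
from typing import Dict, Iterable, Set, Tuple


def alive_peers(last_hello: Dict[str, float], now: float, dead_interval: float) -> Set[str]:
    """Peers whose most recent hello arrived within the dead interval."""
    return {peer for peer, seen in last_hello.items() if now - seen < dead_interval}


def pick_next_hop(flow: Tuple, next_hops: Iterable[str]) -> str:
    """Hash a flow tuple onto one of the currently usable next hops."""
    ordered = sorted(next_hops)
    if not ordered:
        raise ValueError("no usable next hop")
    digest = zlib.crc32("|".join(map(str, flow)).encode())
    return ordered[digest % len(ordered)]


# Time (in seconds) at which each ECMP ESG was last heard from.
last_hello = {"esg-1": 10.0, "esg-2": 6.5}
flow = ("10.0.0.5", "172.16.0.9", 6, 49152, 443)

# With a tuned 3 second dead interval, esg-2 is already considered down at t=10.
usable = alive_peers(last_hello, now=10.0, dead_interval=3.0)
chosen = pick_next_hop(flow, usable)
```

Here only `esg-1` is still usable, so every flow lands on it straight away. The shorter the dead interval, the sooner the failed ESG drops out of the set, which is why tuned timers matter more than Graceful Restart in this design.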

Hopefully that clarifies a few things.

Bal Birdy
