Thursday, January 9, 2020

Using BGP to announce cloud-native workloads

To dynamically obtain and announce layer-3 network identity (IPv4 addresses) for cloud-native workloads, we propose an approach based on the BGP routing protocol.

In a typical network design, these two steps are necessary:

1. Add an EIGRP “network” statement at the access layer for the wider "service" subnet (e.g., 192.168.0.0/24). Note that "access-to-distribution" dynamic routing is achieved using EIGRP in our datacenter.

2. Because a cloud-native load balancer such as MetalLB announces the /32 service identity with the primary interface's address as the next hop, there is no need to add a network statement. Instead, a secondary address/mask is configured on the default gateway's layer-3 interface for the access VLAN, so that the gateway sends ARP requests on the access network when it receives traffic destined to the service subnet.

With eBGP multi-hop between MetalLB and the “distribution” switch, the distribution switch receives each /32 with the new AS_PATH and, optionally, community attributes. Because the next hop is the node's primary address, which the IGP already carries, the distribution switch can resolve it. Note that there is no need to establish a BGP relationship with the “access” layer. Additionally, we must set up the "service subnet" as a secondary IP address/mask on the gateway's layer-3 interface. We typically employ a pair of switches with HSRP enabled for high availability.
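
A simplified Python model of why the secondary address matters, using only the standard `ipaddress` module. It assumes hypothetical gateway addresses (10.0.32.2/22 as the primary and 192.168.0.1/24 as the secondary) and answers a single question: is a destination on a connected subnet, so the gateway would ARP for it on the access VLAN, or would it need a route? It does not model HSRP or real switch forwarding.

```python
import ipaddress
from typing import Optional

# Hypothetical addresses on the access VLAN's layer-3 interface:
# the primary subnet plus the "service" subnet added as a secondary.
GATEWAY_INTERFACES = [
    ipaddress.ip_interface("10.0.32.2/22"),    # primary
    ipaddress.ip_interface("192.168.0.1/24"),  # secondary (service subnet)
]


def connected_subnet(destination: str) -> Optional[ipaddress.IPv4Network]:
    """Return the connected subnet that contains destination, or None.

    A destination on a connected subnet is resolved with ARP on the
    access VLAN; any other destination would need a route.
    """
    addr = ipaddress.ip_address(destination)
    for iface in GATEWAY_INTERFACES:
        if addr in iface.network:
            return iface.network
    return None


if __name__ == "__main__":
    for dst in ("192.168.0.11", "10.0.33.40", "172.16.1.1"):
        subnet = connected_subnet(dst)
        if subnet is not None:
            print(f"{dst}: ARP on the access VLAN ({subnet})")
        else:
            print(f"{dst}: needs a route")
```

Without the secondary entry, a service IP such as 192.168.0.11 would fall through to "needs a route", which is exactly the gap the secondary address closes.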

This method requires us to aggregate Linux BGP speakers with MetalLB at the rack or server-room level. This represents a significant new capability for publishing services from cloud-native environments such as Kubernetes.

Typically, when a router receives an eBGP route whose next hop is unknown, the route is neither installed in the routing table nor propagated.

NEXT_HOP is inaccessible

We solved this problem by using an EIGRP network statement for the service prefix, which can be obtained through the APIs of a Network Identity Manager such as Infoblox.

If the router doesn't know how to reach a route's next hop, the recursive lookup fails and BGP cannot use the route. For example, if a BGP router receives a route for 192.168.0.11/32 with a NEXT_HOP attribute of 192.168.0.1, but has no entry in its routing table for a subnet containing 192.168.0.1, the received route for 192.168.0.11 is useless and won't be installed in the routing table.

If, however, MetalLB announces the prefix with the next-hop value of its primary interface (10.205.32.15 in this example), then the IGP (EIGRP) between the "access" and "distribution" layers ensures that the next hop is reachable, and the route is successfully installed.
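
The recursive next-hop check can be sketched in Python as a longest-prefix match against the routes the distribution switch already knows. The routing table contents are hypothetical (a 10.0.0.0/8 summary and 10.205.32.0/22, both assumed to be learned through EIGRP), and real BGP implementations apply further checks; the sketch only illustrates why 192.168.0.1 fails and 10.205.32.15 succeeds.

```python
import ipaddress
from typing import Optional

# Hypothetical routes the distribution switch already has from EIGRP.
# Nothing here covers the service subnet 192.168.0.0/24.
IGP_ROUTES = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("10.205.32.0/22"),
]


def resolve_next_hop(next_hop: str) -> Optional[ipaddress.IPv4Network]:
    """Longest-prefix match of next_hop against the IGP routes."""
    addr = ipaddress.ip_address(next_hop)
    matches = [net for net in IGP_ROUTES if addr in net]
    return max(matches, key=lambda net: net.prefixlen, default=None)


def bgp_route_status(prefix: str, next_hop: str) -> str:
    """Describe whether a received BGP route could be installed."""
    via = resolve_next_hop(next_hop)
    if via is None:
        return f"{prefix} via {next_hop}: NEXT_HOP inaccessible, not installed"
    return f"{prefix} via {next_hop}: installed (next hop resolved via {via})"


if __name__ == "__main__":
    # Next hop inside the service subnet, which the IGP does not carry.
    print(bgp_route_status("192.168.0.11/32", "192.168.0.1"))
    # Next hop is the MetalLB node's primary address, reachable via EIGRP.
    print(bgp_route_status("192.168.0.11/32", "10.205.32.15"))
```

10.205.32.15 matches both IGP entries; the longest match (/22) wins, which is why `max` over `prefixlen` is used rather than the first match.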

BGP configuration example on the distribution-layer router:

    router bgp 65001
      neighbor 10.205.32.15 remote-as 65002
      neighbor 10.205.32.15 ebgp-multihop

      address-family ipv4 unicast
        neighbor 10.205.32.15 activate
        neighbor 10.205.32.15 prefix-list service-subnets in

Prefix-list applied inbound, accepting only /32 host routes from the service subnet:

    ip prefix-list service-subnets seq 100 permit 192.168.0.0/24 ge 32
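
To make the filter's behavior concrete, here is a small Python model of standard prefix-list matching: a route matches an entry when it falls inside the entry's prefix and its length lies between `ge` and `le`; `ge` without `le` implies `le 32` for IPv4, and a route that matches no entry is implicitly denied. The `PrefixListEntry` and `evaluate` names are hypothetical, and the model ignores details such as validating ge/le combinations.

```python
import ipaddress
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class PrefixListEntry:
    seq: int
    action: str                      # "permit" or "deny"
    prefix: ipaddress.IPv4Network
    ge: Optional[int] = None
    le: Optional[int] = None

    def matches(self, route: ipaddress.IPv4Network) -> bool:
        # The route must lie inside the entry's prefix.
        if not route.subnet_of(self.prefix):
            return False
        # With neither ge nor le, the prefix length must match exactly.
        if self.ge is None and self.le is None:
            return route.prefixlen == self.prefix.prefixlen
        # Otherwise the length must fall in [ge, le]; ge alone implies le 32.
        low = self.ge if self.ge is not None else self.prefix.prefixlen
        high = self.le if self.le is not None else 32
        return low <= route.prefixlen <= high


def evaluate(entries: Iterable[PrefixListEntry], route: str) -> str:
    """Return the action of the first matching entry (lowest seq first)."""
    net = ipaddress.ip_network(route)
    for entry in sorted(entries, key=lambda e: e.seq):
        if entry.matches(net):
            return entry.action
    return "deny"  # implicit deny when no entry matches


# ip prefix-list service-subnets seq 100 permit 192.168.0.0/24 ge 32
SERVICE_SUBNETS = [
    PrefixListEntry(100, "permit", ipaddress.ip_network("192.168.0.0/24"), ge=32),
]

if __name__ == "__main__":
    for route in ("192.168.0.11/32", "192.168.0.0/24", "10.1.2.3/32"):
        print(route, evaluate(SERVICE_SUBNETS, route))
```

With this single entry, 192.168.0.11/32 is permitted, while a 192.168.0.0/24 aggregate is denied because its length (24) is below `ge 32`, and anything outside 192.168.0.0/24 hits the implicit deny.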

EIGRP configuration on the "access" layer switches:

    router eigrp 100
      network 192.168.0.0/24

Layer-3 gateway on the "access" layer:

    interface vlan 3032
      ip address 10.0.32.2/22
      ip address 192.168.0.1/24 secondary
      hsrp 32
        ip 10.0.32.1


MetalLB configuration

    apiVersion: v1
    kind: ConfigMap
    metadata:
      namespace: metallb-system
      name: config
    data:
      config: |
        peers:
        - peer-address: 10.0.0.1
          peer-asn: 65001
          my-asn: 65002
        address-pools:
        - name: default
          protocol: bgp
          addresses:
          - 192.168.0.0/24
          bgp-advertisements:
          - aggregation-length: 32
            localpref: 100
            communities:
            - no-advertise
          - aggregation-length: 24
        bgp-communities:
          no-advertise: 65535:65282
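
The two `bgp-advertisements` entries above can be read as follows: each service IP is announced once as a /32 tagged with the `no-advertise` community, and once as part of its covering /24. The Python sketch below is a simplified model of that derivation, not MetalLB's code; it assumes the `no-advertise` alias stands for the well-known NO_ADVERTISE community 65535:65282 (RFC 1997), and the `Advertisement` class and `announcements` helper are hypothetical.

```python
import ipaddress
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed meaning of the "no-advertise" alias: the well-known
# NO_ADVERTISE community from RFC 1997.
COMMUNITY_ALIASES = {"no-advertise": "65535:65282"}


@dataclass
class Advertisement:
    aggregation_length: int
    communities: List[str] = field(default_factory=list)


# Simplified mirror of the two bgp-advertisements entries.
ADVERTISEMENTS = [
    Advertisement(32, ["no-advertise"]),
    Advertisement(24),
]


def announcements(service_ip: str) -> List[Tuple[str, List[str]]]:
    """Return (prefix, communities) pairs announced for one service IP."""
    routes = []
    for adv in ADVERTISEMENTS:
        # Mask the service IP down to the aggregation length.
        prefix = ipaddress.ip_network(
            f"{service_ip}/{adv.aggregation_length}", strict=False
        )
        values = [COMMUNITY_ALIASES.get(c, c) for c in adv.communities]
        routes.append((str(prefix), values))
    return routes


if __name__ == "__main__":
    for prefix, communities in announcements("192.168.0.11"):
        print(prefix, communities or "(no communities)")
```

For 192.168.0.11 this yields 192.168.0.11/32 carrying 65535:65282 and 192.168.0.0/24 with no communities. The `ge 32` prefix-list on the distribution switch would accept only the first of these.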