Thursday, January 9, 2020

Using BGP to announce cloud-native workloads

To realize the goal of dynamically obtaining and announcing layer-3 network identity (IPv4 addresses) for cloud-native workloads, we propose an approach that utilizes the BGP routing protocol.

In a typical network design, these two steps are necessary:

1. Add an EIGRP "network" statement at the access layer for the wider "service" subnet (e.g. 192.168.0.0/24). Note that "access-to-distribution" dynamic routing is achieved using EIGRP in our datacenter.

2. Considering that a cloud-native load-balancer such as MetalLB announces the /32 service identity with a next-hop value of the primary interface, there is no need to add a network statement.  Instead, a secondary address/mask is specified on the default gateway, in the access VLAN's layer-3 interface configuration, so that this gateway can send ARP requests on the access network upon receiving traffic destined to the service subnet.

With eBGP multi-hop between MetalLB and the "distribution" switch, the switch receives the /32 with the new AS_PATH and, optionally, extended-community attributes.  The distribution switch can now resolve the next hop.  Note that there is no need to establish a BGP relationship with the "access" layer.  Additionally, we must set up the "service subnet" as a "secondary" IP address/mask specification on the gateway's layer-3 interface. We typically employ a pair of switches with HSRP enabled for high availability.

This method requires us to aggregate Linux BGP speakers with MetalLB at rack or server-room level. This represents a significant new capability for publishing services from cloud-native environments such as Kubernetes.

Typically, when a router receives an eBGP route where the next-hop is unknown, the route is not installed in the table, or propagated.

NEXT_HOP is inaccessible

We solved this problem by using an EIGRP network statement for the service prefix, one that can be obtained using APIs from a Network Identity Manager such as Infoblox.

If the router doesn't know how to reach a route's next hop, a recursive lookup will fail, and the route can't be added to BGP. For example, if a BGP router receives a route for 192.168.0.11/32 with a NEXT_HOP attribute of 192.168.0.1, but doesn't have an entry in its routing table for a subnet containing 192.168.0.1, the received route for 192.168.0.11 is useless and won't be installed in the routing table.

If, however, MetalLB announces the prefix with the next-hop value of the primary interface (10.205.32.15 in this example), then the IGP (EIGRP) between the "access" and "distribution" layers will ensure that the next hop is reachable, and the route is successfully installed.
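
The recursive next-hop check described above can be sketched in a few lines of Python. This is only an illustration: the function name `next_hop_resolves` and the contents of `rib` (including the 10.205.32.0/22 access subnet) are assumptions made for the example, not values taken from a real router.

```python
import ipaddress

def next_hop_resolves(next_hop, routing_table):
    """Return True if some prefix in the routing table covers the next hop."""
    hop = ipaddress.ip_address(next_hop)
    return any(hop in ipaddress.ip_network(prefix) for prefix in routing_table)

# Hypothetical routing table on the distribution switch: it learns the
# access subnet through EIGRP but has no route for the service subnet.
rib = ["10.205.32.0/22", "10.0.0.0/24"]

# Next hop inside the service subnet: nothing covers it, so the /32 is unusable.
print(next_hop_resolves("192.168.0.1", rib))    # False

# Next hop is the primary interface: the EIGRP-learned subnet covers it.
print(next_hop_resolves("10.205.32.15", rib))   # True
```

A real router performs a longest-prefix match and may recurse more than once; the sketch only captures the "is the next hop covered by anything" question.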

BGP configuration example at the network layer router:

    router bgp 65001
    neighbor 10.205.32.15 remote-as 65002

    address-family ipv4 unicast

      neighbor 10.205.32.15 ebgp-multihop
      neighbor 10.205.32.15 activate
      neighbor 10.205.32.15 prefix-list service-subnets in

Prefix-list
`ip prefix-list service-subnets seq 100 permit 192.168.0.0/24 ge 32`
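
To see what this prefix-list lets through, here is a rough Python model of the match. The function name `prefix_list_permits` is made up for the example, and the sketch assumes the usual behaviour where `ge 32` without `le` allows lengths from 32 up to 32.

```python
import ipaddress

def prefix_list_permits(route, base="192.168.0.0/24", ge=32, le=32):
    """Model of 'permit <base> ge <ge>': the route must fall inside the base
    prefix and its length must be between ge and le (inclusive)."""
    net = ipaddress.ip_network(route)
    base_net = ipaddress.ip_network(base)
    return net.subnet_of(base_net) and ge <= net.prefixlen <= le

print(prefix_list_permits("192.168.0.11/32"))  # True: host route inside the service subnet
print(prefix_list_permits("192.168.0.0/24"))   # False: the aggregate itself is not a /32
print(prefix_list_permits("192.168.1.5/32"))   # False: outside 192.168.0.0/24
```

In other words, only the /32 service identities are accepted from MetalLB; anything shorter is filtered.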

EIGRP configuration at the "access" layer switches

    router eigrp 100
      network 192.168.0.0/24

Layer-3 Gateway at "access" layer

    interface vlan 3032
      ip address 10.0.32.2/22
      ip address 192.168.0.1/24 secondary
      hsrp 32
        ip 10.0.32.1


MetalLB configuration

    apiVersion: v1
    kind: ConfigMap
    metadata:
      namespace: metallb-system
      name: config
    data:
      config: |
        peers:
        - peer-address: 10.0.0.1
          peer-asn: 65001
          my-asn: 65002
        address-pools:
        - name: default
          protocol: bgp
          addresses:
          - 192.168.0.0/24
          bgp-advertisements:
          - aggregation-length: 32
            localpref: 100
            communities:
            - no-advertise
          - aggregation-length: 24
        bgp-communities:
          no-advertise: 65535:65282
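
The two `aggregation-length` entries mean each service IP is announced twice, once as a host route and once as the covering /24. A small sketch of that arithmetic follows; the function name `advertised_prefixes` is hypothetical and this is not MetalLB code.

```python
import ipaddress

def advertised_prefixes(service_ip, aggregation_lengths):
    """Return the prefixes announced for one service IP, one per aggregation length."""
    return [
        # strict=False masks off the host bits for us
        str(ipaddress.ip_network(f"{service_ip}/{length}", strict=False))
        for length in aggregation_lengths
    ]

print(advertised_prefixes("192.168.0.11", [32, 24]))
# ['192.168.0.11/32', '192.168.0.0/24']
```

With the `no-advertise` community attached to the /32, the peer uses the host route locally but only the /24 would be passed along.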

Tuesday, November 11, 2014

Fibre Channel switching demystified at a basic level

Fibre Channel has been compared to Ethernet, and for good reasons. Both are OSI L2 protocols, and both use globally unique physical addresses, but not in the same way! While MAC addresses are always present in Ethernet traffic, FC WWNs are quite a bit different.  They never appear in the FC headers.

Let's understand the underlying FC switching fabric first to put things in perspective. When multiple FC switches are connected together using E_ports, they form a switch fabric. The domain ID is unique to each switch in the fabric, and therein lies the first problem with scalability: only 239 are available. The switch with the lowest [PS_priority+WWN] becomes the PS (principal switch). The PS selection process occurs when the E_ports are first connected and the BF (the mundane-sounding "build fabric") frames are exchanged. If two switches contain the same domain ID, then the link between the two is "isolated". The domain ID can be chosen at random, unless it is administratively set. This is required to generate the initial discovery traffic such as BF, EFP, SW_ACC frames. If you are really curious, the S_id and D_id (FCIDs) in these frames are always set to 0xFFFFFD, which is the fabric controller address.
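
The "lowest [PS_priority+WWN] wins" rule is easy to picture as a comparison of tuples. The sketch below assumes priority is compared first and the WWN breaks ties; the switch names, priorities and WWNs are invented for illustration.

```python
def elect_principal(switches):
    """switches: list of (name, priority, wwn) tuples.
    The lowest priority wins; the numerically lowest WWN breaks ties."""
    return min(switches, key=lambda s: (s[1], int(s[2].replace(":", ""), 16)))

# Hypothetical three-switch fabric
fabric = [
    ("sw-a", 128, "20:00:00:0d:ec:aa:00:01"),
    ("sw-b", 2,   "20:00:00:0d:ec:bb:00:02"),
    ("sw-c", 2,   "20:00:00:0d:ec:ab:00:03"),
]

print(elect_principal(fabric)[0])  # sw-c: ties sw-b on priority, but has the lower WWN
```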

So, given the 239 domain IDs in any given fabric, how do we break through the limit to scale up? Enter NPV, or N_port virtualization. An NPV-enabled switch does not take up a domain ID; instead, it relays the FLOGIs received on its F_ports (from N_ports on hosts) up to the core switch via NP_ports. In other words, ports on the NPV edge switch that connect to the F_ports on the core are always set to type "NP". NPV mode and Fabric mode are mutually exclusive, and a reboot is necessary when switching between them.

What is NPIV? N_port ID virtualization is a feature that allows F_ports to accept multiple FLOGI requests from the same N_port and assign FCIDs accordingly.  Ask: are NPIV and NPV mutually exclusive on any given device? Hint: NPIV is a server feature, while NPV is a switch feature, typically. Whether it is an NPV-enabled switch or an NPIV-enabled host, the F_port essentially behaves the same and accepts multiple FLOGIs. So, an NPV-enabled switch looks exactly like an NPIV-enabled host to an F_port. Phew! Enough already, right?

So, how does FLOGI work? What is the initial FCID in the frame that comes out of an N_port going toward the F_port? Ans: 0x000000 - this is the initial value of the FCID, and this frame is sent to 0xFFFFFE, which represents the FLOGI server. The payload contains the WWNN and service parameters. The FLOGI server assigns the FCID (N_port ID) and BB_Credit.

Next, the host can use its newly acquired N_port ID to continue with the PLOGI process, where it sends its WWPN-to-FCID mapping. The destination FCID is that of the FCNS: 0xFFFFFC. The FCNS now registers this information in its database and exposes it to other devices according to the zoning that has been configured.
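
To keep the addresses straight, here is a small Python helper that labels the well-known FCIDs mentioned in this post and splits any other FCID into its three bytes. The domain/area/port byte layout is the standard FC addressing scheme rather than something spelled out above, and the sample FCID 0x0A01EF is made up.

```python
# Well-known addresses mentioned in this post
WELL_KNOWN = {
    0xFFFFFC: "name server (FCNS)",
    0xFFFFFD: "fabric controller",
    0xFFFFFE: "fabric login (FLOGI) server",
}

def describe_fcid(fcid):
    """Label a well-known address, or split a 24-bit FCID into domain/area/port."""
    if fcid in WELL_KNOWN:
        return WELL_KNOWN[fcid]
    domain = (fcid >> 16) & 0xFF   # top byte: the switch's domain ID
    area = (fcid >> 8) & 0xFF      # middle byte
    port = fcid & 0xFF             # low byte
    return f"domain=0x{domain:02x} area=0x{area:02x} port=0x{port:02x}"

print(describe_fcid(0xFFFFFE))  # fabric login (FLOGI) server
print(describe_fcid(0x0A01EF))  # domain=0x0a area=0x01 port=0xef
```

The top byte being the domain ID is why the limited domain ID space caps the number of switches in a fabric.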

Try the command: "show fcns database" if your SAN uses a Cisco switch.

Monday, September 8, 2014

Troubleshooting CRC and input errors with Nexus 5000 in transit path

When you encounter CRC (input) errors on an upstream switch interface, and you have a Nexus 5000 in the transit path (downstream), MTU Stomping by the latter is likely a factor.

On a Nexus 5000 switch, when a frame is received on a 10 Gb/s interface, it is considered to be in the cut-through path.  If there is no "network qos" type policy with jumbo frame support, and the hosts connected to this switch generate jumbo frames, we will encounter the following situation:

There are logical and physical causes for the Nexus 5000 to drop a frame. There are also situations when a frame cannot be dropped because of the cut-through nature of the switch architecture. If a drop is necessary, but the frame is being switched in a cut-through path, then the only option is to stomp the Ethernet frame check sequence (FCS). Stomping a frame involves setting the FCS to a known value that does not pass a CRC check. This causes subsequent CRC checks to fail later in the path for this frame. A downstream store-and-forward device, or a host, will be able to drop this frame.
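
Stomping is easier to picture with a toy example. The Python sketch below appends a CRC-32 to a payload, then overwrites it with a value that cannot pass the check. Two assumptions to note: `zlib.crc32` is used as a stand-in for the Ethernet FCS calculation, and inverting the correct FCS is just one convenient "known bad" value; the value the Nexus hardware actually writes is not covered here.

```python
import struct
import zlib

def add_fcs(payload):
    """Append a CRC-32 trailer to the payload (stand-in for the Ethernet FCS)."""
    return payload + struct.pack("<I", zlib.crc32(payload))

def fcs_ok(frame):
    """Recompute the CRC over the payload and compare it with the trailer."""
    payload, fcs = frame[:-4], frame[-4:]
    return struct.unpack("<I", fcs)[0] == zlib.crc32(payload)

def stomp(frame):
    """Replace the FCS with its bitwise inverse so every later check fails."""
    payload, fcs = frame[:-4], frame[-4:]
    bad = struct.unpack("<I", fcs)[0] ^ 0xFFFFFFFF
    return payload + struct.pack("<I", bad)

frame = add_fcs(b"\x00" * 60)
print(fcs_ok(frame))         # True: the frame arrives intact
print(fcs_ok(stomp(frame)))  # False: the next store-and-forward hop counts a CRC error
```

This is why the CRC counter increments on the neighbouring device rather than on the Nexus 5000 that made the decision to drop.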

"show queueing interface ethernet x/y" will reveal the hardware MTU.  This setting is enforced by the application of a "network qos" type policy-map under "system qos" section of running configuration.

Monday, April 21, 2014

IP subnetting unraveled

Consider the following address in CIDR notation:

172.19.235.75 /13

1) What is the network address?

2) What is the broadcast address?

First, determine the subnet mask in dotted decimal notation: 255.248.0.0

Next, calculate: IP_Address AND Mask.  This reveals the network address as 172.16.0.0 (NET_address)

Then, determine the wildcard mask (WC_mask) by inverting 1's and 0's in the subnet mask: 0.7.255.255

Finally, calculate: NET_address OR WC_mask, and this reveals the broadcast address: 172.23.255.255
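
The same four steps translate directly into bitwise operations. Here is a short Python sketch (the function name `network_and_broadcast` is mine) that follows the method above literally instead of leaning on a library shortcut:

```python
import ipaddress

def network_and_broadcast(address, prefix_len):
    """Return (mask, network, wildcard, broadcast) as dotted-decimal strings."""
    ip = int(ipaddress.IPv4Address(address))
    mask = (0xFFFFFFFF << (32 - prefix_len)) & 0xFFFFFFFF  # step 1: subnet mask
    network = ip & mask                                    # step 2: IP AND mask
    wildcard = mask ^ 0xFFFFFFFF                           # step 3: invert the mask
    broadcast = network | wildcard                         # step 4: network OR wildcard
    return tuple(str(ipaddress.IPv4Address(v))
                 for v in (mask, network, wildcard, broadcast))

print(network_and_broadcast("172.19.235.75", 13))
# ('255.248.0.0', '172.16.0.0', '0.7.255.255', '172.23.255.255')
```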

Now, practice this method on 10.0.0.100 /28 and, if you are feeling adventurous, try 192.168.255.254 /22

Monday, April 15, 2013

Wireless AP 1142 - Convert to controller based or standalone mode

If you are faced with the task of installing 1100 series APs into a controller environment but somehow "standalone" APs were procured, don't panic! Here is the fix:

1. Note the AP's IP address by "show cdp neigh detail"
2. Browse to http:// followed by that address to manage the AP (Cisco/Cisco... the ID and password are case sensitive!)
3. Go to the software tab and update to the correct image

If your controller is a hop or two or more away, don't forget to adjust DHCP option 43 and option 60 values!
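
For option 43, Cisco lightweight APs are commonly documented as expecting a small TLV in hex: type `f1`, a length byte equal to four times the number of controllers, then the controller management IPs. The sketch below builds that string in Python. The controller addresses are hypothetical, and you should check the format against the documentation for your controller and DHCP server before relying on it.

```python
import ipaddress

def cisco_option43_hex(controller_ips):
    """Build the option 43 hex string: type f1, length byte, controller IPs."""
    value = b"".join(ipaddress.IPv4Address(ip).packed for ip in controller_ips)
    return "f1" + format(len(value), "02x") + value.hex()

# Two hypothetical controller management addresses
print(cisco_option43_hex(["192.168.10.5", "192.168.10.20"]))
# f108c0a80a05c0a80a14
```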

Cisco 2600 AP - Convert to Stand-Alone mode

Follow these simple steps:

1. Download the stand-alone image from Cisco.com (.tar)
2. Rename the file to ...tar.default
3. Make sure you have a console connection to the 2600 AP - this will let you verify the name of the file the AP expects to boot from using TFTP.
4. Change your IP address to 10.0.0.2 /8
5. Run a TFTP  server on your laptop or PC
6. Plug the AP and your PC into two ports on a switch that are in the same VLAN (no layer-3 SVI necessary; I recommend creating a temporary VLAN such as 99, for example)
7. Make sure that the AP is connected to a PoE port if you don't have a power brick, BUT before you do that, press the MODE button and hold it down as the AP boots!!
8. Watch the console messages for the correct filename - ...tar.default should match exactly.
9. AP's LED will go blue and then after 15 seconds will go RED - you can let go of the MODE button now!
10. AP LED will blink green and it will initialize in stand-alone mode, and obtain a DHCP address.
11. You can manage it using http://

Friday, January 4, 2013

Cisco ASA: "LAND" attack - track false positives

A LAND attack is characterized by an IP packet whose source and destination IP addresses are the same. However, on the ASA platform it is sometimes possible to see false positives.


Consider the following example:

Jan 05 2012 16:21:36: %ASA-2-106017: Deny IP due to Land Attack from 192.168.2.71 to 192.168.2.71

Jan 05 2012 16:21:37: %ASA-2-106017: Deny IP due to Land Attack from 192.168.2.71 to 192.168.2.71

What is really happening?

Upon closer examination, you notice that you have a static translation set up:

static (inside, outside) 192.168.2.71 172.16.2.71

Now let's capture traffic between the private address and the translated IP, and lo and behold, the mystery is solved! The server inside is trying to access "itself via its public IP address" (perhaps via a script that is running).

3 packets captured

1: 16:20:37.032469 172.16.2.71.58126 > 192.168.2.71.80: S 1304500266:1304500266(0) win 5840

2: 16:21:36.938168 172.16.2.71.58128 > 192.168.2.71.80: S 4035860468:4035860468(0) win 5840

3: 16:21:37.173559 172.16.2.71.58129 > 192.168.2.71.80: S 4123968769:4123968769(0) win 5840

3 packets shown

Notice how the time-stamps match!
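
The false positive can be modelled in a few lines of Python. This is a simplified picture of the idea, not a description of ASA internals: once the static translation rewrites the source, the packet ends up with identical source and destination addresses. The function name `is_land_after_nat` and the second test address are made up.

```python
# Static translation from the example: inside 172.16.2.71 appears as 192.168.2.71
STATIC_NAT = {"172.16.2.71": "192.168.2.71"}

def is_land_after_nat(src, dst, nat=STATIC_NAT):
    """Translate the source address, then apply the LAND test (source == destination)."""
    translated_src = nat.get(src, src)
    return translated_src == dst

# The server reaching for its own public address trips the check
print(is_land_after_nat("172.16.2.71", "192.168.2.71"))  # True

# Another inside host (hypothetical) talking to the same public address does not
print(is_land_after_nat("172.16.2.80", "192.168.2.71"))  # False
```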