Sunday, September 14, 2014

A Quick Look at the VXLAN Specification - RFC 7348



VXLAN is being deployed in numerous production environments and is supported by quite a few networking equipment vendors as well as software vendors such as VMware and Red Hat.  Yet it was not until August 2014 that the specification moved from IETF draft status to RFC 7348.

RFC stands for Request for Comments. As a software developer for networking equipment, I have to know the RFC inside and out.  The RFC is the functional specification for the feature that I develop.  The test group tests the feature to make sure it functions as the RFC describes.

Today, let’s take a look at what is in RFC 7348.

The title of RFC 7348 is “Virtual eXtensible Local Area Network (VXLAN): A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks.”

RFC 7348 is a relatively short document compared to RFCs that specify other communication protocols.  The table of contents has 10 sections, 6 of which are standard sections for all RFCs.  To break this RFC down to its core, we can look at the following sections:
  • VXLAN Problem Statement
  • VXLAN Operations
  • VXLAN Frame Format
  • VXLAN Security Considerations

VXLAN Problem Statement
Basically, RFC 7348 identifies 3 problem areas in the data center networking infrastructure that VXLAN is designed to resolve.  Server virtualization in the data center works best in a flat Layer 2 network.  Only at VMworld 2014 did VMware announce that vSphere 6 (still in beta as of this writing) would support vMotion across vCenter instances and over long distances.

Limitations Imposed by Spanning Tree and VLAN Ranges
Spanning Tree Protocol is designed and used to guard against loops in a Layer 2 network.  A loop in a Layer 2 network causes a broadcast storm and makes the network inoperable.  While Spanning Tree does protect the Layer 2 network, the price to pay is that some links in the network go unused and multipathing is not an option in the network design.  Both of these problems mean the ROI (Return On Investment) falls short of where it should be.

Another problem is that the VLAN ID is a 12-bit field, which limits a data center with a flat Layer 2 network to only 4094 VLANs.

Multi-tenant Environment
In a virtualized data center, multi-tenancy is a definite requirement, and tenant isolation is an absolute requirement.

In a Layer 2 network, VLANs are a popular way to achieve tenant isolation.  The limit of 4094 VLANs described in the previous section is a major stumbling block to data center design.  When each tenant requires more than one VLAN, as in the example of a 3-tier web application, the VLANs available in a flat Layer 2 network become an even scarcer yet still required resource.
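To put a number on this, here is a small back-of-the-envelope sketch.  The function name max_tenants is my own, and the assumption that every tenant needs exactly 3 VLANs (one per tier) is only for illustration.

```python
VLAN_ID_BITS = 12
RESERVED_VLAN_IDS = 2  # VLAN IDs 0 and 4095 are reserved by IEEE 802.1Q

# Number of VLAN IDs that can actually be assigned to tenants.
usable_vlans = 2 ** VLAN_ID_BITS - RESERVED_VLAN_IDS


def max_tenants(vlans_per_tenant: int) -> int:
    """Upper bound on tenants when each one consumes a fixed number of VLANs."""
    return usable_vlans // vlans_per_tenant


print(usable_vlans)    # 4094
print(max_tenants(3))  # 1364
```

With one VLAN per tier, a flat Layer 2 network tops out at roughly 1,364 three-tier tenants, before any VLANs are set aside for infrastructure.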

A Layer 3 network is another possible way to isolate tenants, but it has its limitations.  It requires each tenant to use unique IP subnets.  Also, Layer 3 isolation prevents tenants from using Layer 2 or other non-IP protocols for inter-VM communication.

Inadequate Table Sizes at ToR Switch
Using a ToR (Top of Rack) switch to connect the servers on a rack is a common data center design.  With multiple virtual machines running on each virtualized server, the number of MAC addresses that a ToR switch has to learn grows accordingly.  As the number of virtual machines that a virtualized server can host keeps increasing, the MAC address table of the ToR switch becomes inadequate.
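A rough sketch of how quickly the table fills up.  The rack size and VM density below are hypothetical numbers that I picked for illustration, and the count covers only the MAC addresses local to one rack; a real ToR switch also learns remote VMs that its local VMs talk to.

```python
def mac_entries_needed(servers_per_rack: int, vms_per_server: int,
                       vnics_per_vm: int = 1) -> int:
    """MAC addresses a ToR switch sees from its own rack.

    Counts one MAC per physical server NIC plus one MAC per virtual NIC.
    """
    return servers_per_rack * (1 + vms_per_server * vnics_per_vm)


# Hypothetical rack: 48 servers, each hosting 100 single-vNIC VMs.
print(mac_entries_needed(48, 100))  # 4848
```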

VXLAN Operations
Section 4 of RFC 7348 describes the operation of VXLAN.  As indicated in the title of this RFC, VXLAN is a framework for overlaying virtualized Layer 2 networks over Layer 3 networks.  It is designed specifically to overcome the problems that we face in the data center.
VXLAN is meant to extend a Layer 2 network over a Layer 3 network by the use of tunneling technology: the original Layer 2 frame is encapsulated inside a UDP packet.

RFC 7348 outlines the following rules:
  • Each overlay is termed a VXLAN segment.
  • Only VMs within the same VXLAN segment can communicate with each other.
  • Each VXLAN segment is identified by a 24-bit segment ID, the VXLAN Network Identifier (VNI).
  • The VNI identifies the scope of the inner MAC frame originated by the individual VM.
  • The VNI is carried in an outer header that encapsulates the inner MAC frame originated by the individual VM.
  • The terms VXLAN segment and VXLAN overlay network are used interchangeably in the RFC.
  • VXLAN tunnels are stateless connections between 2 end points.
  • Each end point is called a VXLAN Tunnel End Point (VTEP).
  • A VTEP can be implemented on a virtual switch, physical switch or physical server, in either hardware or software.
  • Data plane learning is used.
  • Multicast is used for carrying broadcast, unknown destination and multicast frames (BUM traffic).
  • VTEPs MUST NOT fragment VXLAN packets.
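The 24-bit VNI in the list above is what lifts the VLAN ceiling.  A quick comparison of the two ID spaces (plain arithmetic, the variable names are mine):

```python
VLAN_ID_BITS = 12
VNI_BITS = 24

usable_vlans = 2 ** VLAN_ID_BITS - 2  # IDs 0 and 4095 are reserved
vxlan_id_space = 2 ** VNI_BITS        # size of the 24-bit VNI space

print(usable_vlans)                    # 4094
print(vxlan_id_space)                  # 16777216
print(vxlan_id_space // usable_vlans)  # 4098
```

In other words, the VNI space is about 4,000 times larger than the VLAN ID space, roughly 16 million segments.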

The last 3 points are worth looking into a little more.

Data Plane Learning
Data plane learning means that RFC 7348 defines no control plane for VXLAN.  It was not until now that I realized I never truly understood what a control plane is.  People often say generically that SDN is the separation of the control plane and the data plane.  I have worked in networking for almost 20 years, mostly on Layer 2 networking features.  I say mostly because I worked on UDP Relay Agent and DHCP Snooping, for which I had to know a little bit about IP forwarding.  The data plane is the forwarding operation of networking equipment.

So what exactly is a control plane?  To network engineers working on Layer 3 networks, the answer is obvious.  The control plane is a concept most familiar from Layer 3 networking.  BGP is an example of a control plane protocol.  Routers exchange routing information via the control plane.

VXLAN uses data plane learning, just like source learning on Ethernet (Layer 2) switches.  The VTEP is responsible for learning each virtual machine's MAC address and associating it with the VXLAN segment/VXLAN Network Identifier and the remote VTEP behind which the virtual machine sits.  This learning process is very important to the efficiency of VXLAN networks.
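To make the idea concrete, here is a minimal sketch of the kind of table a VTEP could keep.  The class and method names are my own invention, not anything defined by the RFC, and a real implementation would also age out stale entries.

```python
from typing import Dict, Optional, Tuple


class VtepLearningTable:
    """Maps (VNI, inner MAC) to the IP address of the remote VTEP."""

    def __init__(self) -> None:
        self._table: Dict[Tuple[int, str], str] = {}

    def learn(self, vni: int, inner_src_mac: str, outer_src_ip: str) -> None:
        # Called for every decapsulated packet: the inner source MAC is
        # reachable through the VTEP that sent the outer packet.
        self._table[(vni, inner_src_mac.lower())] = outer_src_ip

    def lookup(self, vni: int, inner_dst_mac: str) -> Optional[str]:
        # Returns None when the destination has not been learned yet.
        return self._table.get((vni, inner_dst_mac.lower()))


# Hypothetical addresses, for illustration only.
table = VtepLearningTable()
table.learn(5000, "00:50:56:AA:BB:CC", "10.0.0.2")
print(table.lookup(5000, "00:50:56:aa:bb:cc"))  # 10.0.0.2
print(table.lookup(5001, "00:50:56:aa:bb:cc"))  # None
```

The second lookup returns None because the same MAC address in a different VNI is a different entry, which is how overlapping addresses between tenants stay isolated.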

Multicast for BUM traffic
VXLAN is meant to extend Layer 2 segments, including to other data centers.  In a Layer 2 network, unknown destination, broadcast and multicast frames are flooded to all the devices in the same broadcast domain.  With VXLAN, RFC 7348 specifically spells out that IP multicast is used for sending BUM traffic to the other VTEPs of the same VXLAN segment.
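A minimal sketch of that forwarding decision, assuming each VNI has been mapped to an IP multicast group by the management layer.  The function names, the dictionaries and all addresses are hypothetical.

```python
from typing import Dict, Tuple


def is_group_mac(mac: str) -> bool:
    """True for broadcast and multicast MACs (group bit of first octet set)."""
    first_octet = int(mac.split(":")[0], 16)
    return bool(first_octet & 0x01)


def outer_destination(vni: int, inner_dst_mac: str,
                      learned: Dict[Tuple[int, str], str],
                      vni_to_group: Dict[int, str]) -> str:
    """Pick the outer destination IP for a frame entering the tunnel."""
    key = (vni, inner_dst_mac.lower())
    if not is_group_mac(inner_dst_mac) and key in learned:
        return learned[key]   # known unicast: send to the remote VTEP
    return vni_to_group[vni]  # BUM traffic: send to the segment's group


learned = {(5000, "00:50:56:aa:bb:cc"): "10.0.0.2"}
vni_to_group = {5000: "239.1.1.1"}

print(outer_destination(5000, "00:50:56:aa:bb:cc", learned, vni_to_group))  # 10.0.0.2
print(outer_destination(5000, "ff:ff:ff:ff:ff:ff", learned, vni_to_group))  # 239.1.1.1
print(outer_destination(5000, "00:50:56:00:00:01", learned, vni_to_group))  # 239.1.1.1
```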

While this works perfectly at the functional level, Layer 3 network engineers generally try to avoid IP multicast.

Cisco Nexus 1000V has Unicast-ONLY VXLAN and MAC Distribution Mode.  In MAC Distribution Mode, there is a centralized controller.  This is the same as introducing a control plane for VXLAN.

VMware NSX has its NSX Controller.  When a VM is provisioned, it is registered with the NSX Controller.  ARP requests from the VMs are sent to the controller, which is aware of the MAC address and VTEP/VNI associations.  This again introduces a control plane for VXLAN.

To work with other VTEPs, IP multicast still needs to be used.

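RFC 7348 suggests using a multicast routing protocol such as PIM-SM to build efficient multicast forwarding trees, and notes that bidirectional PIM (BIDIR-PIM) would be more efficient because each VTEP can act as both a source and a destination of multicast packets.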

VTEPs MUST NOT fragment VXLAN packets
Due to encapsulation, VXLAN adds an extra 50 bytes of overhead.  The “MUST NOT” is written in capital letters in the RFC, which marks it as an absolute prohibition.  This requirement has huge implications for the MTU of the underlay Layer 3 network.  The MTU for Ethernet v2 is 1500 bytes.  VMware recommends setting the MTU to 1600 or using jumbo frames end to end.

I know of an installation that ran into MTU problems and ended up not deploying VXLAN.
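Here is where the 50 bytes come from, as a small sketch.  The function name and its parameters are mine.  Note that from the underlay's point of view the inner Ethernet header is part of the payload, so it counts toward the underlay IP MTU.

```python
INNER_ETHERNET_HEADER = 14  # inner MAC header, carried as payload
VLAN_TAG = 4                # optional 802.1Q tag on the inner frame
VXLAN_HEADER = 8
UDP_HEADER = 8
IPV4_HEADER = 20
IPV6_HEADER = 40


def required_underlay_mtu(inner_mtu: int = 1500, outer_ipv6: bool = False,
                          inner_vlan_tag: bool = False) -> int:
    """Smallest underlay IP MTU that carries a full-size inner frame unfragmented."""
    inner_frame = inner_mtu + INNER_ETHERNET_HEADER
    if inner_vlan_tag:
        inner_frame += VLAN_TAG
    outer_ip = IPV6_HEADER if outer_ipv6 else IPV4_HEADER
    return inner_frame + VXLAN_HEADER + UDP_HEADER + outer_ip


print(required_underlay_mtu())                  # 1550
print(required_underlay_mtu(outer_ipv6=True,
                            inner_vlan_tag=True))  # 1574
```

Both results fit within an MTU of 1600, which leaves some headroom for an IPv6 underlay or a tagged inner frame.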

VXLAN Frame Format
Section 5 of RFC 7348 details the frame format of VXLAN.  A developer or network engineer who needs to troubleshoot VXLAN problems needs to know this frame format very well.

Wireshark is able to decode VXLAN traffic.  UDP port 4789 is assigned for VXLAN traffic.
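For a feel of the 8-byte VXLAN header itself (8 bits of flags, 24 reserved bits, the 24-bit VNI, then 8 more reserved bits), here is a minimal sketch using only the Python standard library.  The function names are my own; the field layout follows Section 5 of the RFC.

```python
import struct

VXLAN_UDP_PORT = 4789        # IANA-assigned destination port
VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag: the VNI field is valid


def pack_vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header for the given VNI."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags in the top byte, 24 reserved bits.
    # Word 2: VNI in the top 24 bits, 8 reserved bits.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def unpack_vxlan_header(data: bytes) -> int:
    """Return the VNI from the start of a UDP payload sent to port 4789."""
    if len(data) < 8:
        raise ValueError("VXLAN header is 8 bytes")
    word1, word2 = struct.unpack("!II", data[:8])
    if not (word1 >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("I flag not set, VNI is not valid")
    return word2 >> 8


header = pack_vxlan_header(5000)
print(header.hex())                 # 0800000000138800
print(unpack_vxlan_header(header))  # 5000
```

The original Ethernet frame follows immediately after these 8 bytes.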

VXLAN Security Considerations
While Security Considerations is a standard section for RFCs, it is still worth looking into.

Quoting directly from the RFC:

Traditionally, Layer 2 networks can only be attacked from 'within' by rogue end points -- either
  • by having inappropriate access to a LAN and snooping on traffic,
  • by injecting spoofed packets to 'take over' another MAC address, or
  • by flooding and causing denial of service. 
VXLAN increases the attack surface for these kinds of attacks.

Without going into detail, the Security Considerations section of this RFC suggests the following ways to safeguard VXLAN networks.

  • Continue to mitigate rogue end point attacks in the traditional way, by limiting the management and administrative scope of who deploys and manages VMs/gateways in a VXLAN environment.
  • Use of 802.1X for admission control for individual end points.
  • Use of 5-tuple-based ACLs.
  • Use of IPsec to authenticate and optionally encrypt VXLAN traffic.
  • Use of a designated VLAN for VXLAN traffic.
  • Use of secure methods on the management plane of the VTEP.

This summarizes RFC 7348.
