NSX-T and Kubernetes Services – east-west load-balancing with ClusterIP

This is the first blog in a series that tries to demystify Kubernetes Services and show how they are implemented when NSX-T is used as the container networking solution.

First things first: What is a Kubernetes Service?

A K8s Service connects an application to other applications and to its users. It binds pods that provide one and the same service into a single logical construct and exposes it to the Kubernetes cluster and/or to the outside world. The pods are grouped into an Endpoints list based on a selector criterion (a label) that they carry as an attribute. The Service is also represented by a DNS name in the Kubernetes dynamic DNS system (CoreDNS) or through environment variable injection.
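For reference, a Service of this kind can be declared with a short manifest. The sketch below is illustrative only – the name web-svc and the label app=web are hypothetical and are not used in the examples later in this post:

Bash
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: web-svc            # hypothetical Service name
spec:                      # type defaults to ClusterIP when omitted
  selector:
    app: web               # pods carrying this label are added to the Endpoints list
  ports:
  - port: 80               # port exposed on the ClusterIP
    targetPort: 80         # port the container listens on
EOF
$ kubectl get endpoints web-svc   # shows the pod IPs grouped behind the Service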

There are several Service types. In this blog we will discuss the basic one – ClusterIP.

This is the default Kubernetes Service type. It provides a service within the cluster that other applications (pods) that are part of the same cluster can access; it is not accessible from outside. It gives a distributed east-west load-balancing system that lives on all of the K8s Nodes.

[Diagram: ClusterIP Service – distributed east-west load-balancing across the K8s Nodes]
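Because the ClusterIP is only reachable inside the cluster, a simple way to test it is from a throwaway pod. The sketch below assumes the livefire-svc Service and livefire namespace used later in this post, and a busybox image that provides wget:

Bash
# Resolve and call the Service by its DNS name from inside the cluster;
# the same ClusterIP is not routable from outside the cluster.
$ kubectl run tmp --rm -it --image=busybox --restart=Never -- \
    wget -qO- http://livefire-svc.livefire.svc.cluster.local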

How is it natively implemented?

Every K8s Node runs a “kube-proxy” process, which is responsible for implementing Services. “kube-proxy” registers a “watch” for the addition and removal of Service and Endpoints objects. When a new Service is created, or an existing one is modified, the API Server notifies “kube-proxy”, which then programs Linux iptables (or netlink IPVS) rules on the Node to capture traffic to the Service’s ClusterIP address and port and redirect it to one of the Service’s backend sets. For each Endpoints object, separate iptables rules exist, which select a backend Pod. By default, the choice of backend is random.
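If you want to confirm which mode kube-proxy is running in on a given Node, it reports the active proxy mode over its metrics endpoint. This is a quick check, assuming kube-proxy’s default metrics port (10249):

Bash
# Run on a K8s Node: prints the active proxy mode, e.g. "iptables" or "ipvs"
$ curl -s http://localhost:10249/proxyMode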

Below is the CLI output for a K8s ClusterIP Service. The name of the Service is livefire-svc, its IP address is 10.107.211.73 and its port is 80 (http). TargetPort is the port on which the container (application) accepts traffic, and in our case it is the same – port 80 (http). There are three pods in the Endpoints list (10.244.1.4, 10.244.2.2, 10.244.2.3), which have been selected based on the Selector – app=livefire-demo.

Bash
$ kubectl describe svc livefire-svc
Name:              livefire-svc
Namespace:         livefire
Labels:            app=livefire-demo
Annotations:       <none>
Selector:          app=livefire-demo
Type:              ClusterIP
IP:                10.107.211.73
Port:              80/TCP
TargetPort:        80/TCP
Endpoints:         10.244.1.4:80,10.244.2.2:80,10.244.2.3:80

If we check the iptables ruleset on one of the Nodes, we will see three chains related to the Service:

Bash
$ iptables -t nat -L
Chain KUBE-SERVICES (2 references)
target     prot opt source               destination
...
KUBE-SVC-PUXF6NQGEM6L36VE  tcp  --  anywhere             10.107.211.73        /* livefire/livefire-demo-rc: cluster IP */ tcp dpt:http

Chain KUBE-SVC-PUXF6NQGEM6L36VE (1 references)
target     prot opt source               destination
KUBE-SEP-IE63CKU5OUEDJOFZ  all  --  anywhere             anywhere             /* livefire/livefire-demo-rc: */ statistic mode random probability 0.33332999982
KUBE-SEP-G5KIHW44JZP547LH  all  --  anywhere             anywhere             /* livefire/livefire-demo-rc: */ statistic mode random probability 0.50000000000
KUBE-SEP-RULIYA7EFF24T6SQ  all  --  anywhere             anywhere             /* livefire/livefire-demo-rc: */

Chain KUBE-SEP-RULIYA7EFF24T6SQ (1 references)
target     prot opt source               destination
DNAT       tcp  --  anywhere             anywhere             /* livefire/livefire-demo-rc: */ tcp to:10.244.2.3:80

Traffic destined to the Service IP and port (10.107.211.73:80) matches the rule in the KUBE-SERVICES chain and is sent to the Service-specific load-balancing chain KUBE-SVC-PUXF6NQGEM6L36VE. There, a random target chain is selected: the rules are evaluated in order, so the first one matches with probability 1/3, the second with probability 1/2 of the remaining traffic (again 1/3 overall), and the last one catches everything else, giving each backend roughly a third of the connections. In the selected chain (for instance KUBE-SEP-RULIYA7EFF24T6SQ) the traffic is DNAT’ed to the pod IP address (10.244.2.3) on the original port 80.
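A quick, informal way to see this distribution in action is to hit the ClusterIP repeatedly and count the answers. This assumes each backend pod returns something that identifies it (for example its hostname), which may not be true of every demo application:

Bash
# Send 100 requests to the Service and count the distinct replies;
# run from a Node or pod that can reach the ClusterIP.
$ for i in $(seq 1 100); do curl -s http://10.107.211.73:80; done | sort | uniq -c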

Now that we know how the native ClusterIP Service is implemented, let’s see how this is done when NSX-T is used as a container networking solution.

The NSX-T Kubernetes integration leverages Open vSwitch (OVS) to provide network plumbing for the pods. OVS traffic bypasses the Node’s TCP/IP stack, which means that the native Kubernetes “kube-proxy” and the Linux iptables rules on the Nodes can’t be used to provide connectivity. Because of that, the integration introduces a new “NSX kube-proxy” agent, which is delivered as a container. It is part of the NSX Node Agent pod, which runs on each K8s Node. The task of the “NSX kube-proxy” is to “watch” for Services (creation/modification) and to program the corresponding OVS flows.
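The NSX Node Agent pods and the containers packaged inside them (one of which carries the “NSX kube-proxy”) can be inspected with standard kubectl commands. The nsx-system namespace below is the usual default for the NCP integration – adjust it if your deployment uses a different one:

Bash
# One NSX Node Agent pod per K8s Node
$ kubectl -n nsx-system get pods -o wide | grep nsx-node-agent
# List the containers inside one of the pods (replace the placeholder with a real pod name)
$ kubectl -n nsx-system get pod <nsx-node-agent-pod-name> \
    -o jsonpath='{.spec.containers[*].name}'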

From a Kubernetes perspective the Service looks the same:

Bash
$ kubectl describe svc livefire-nsx-svc
Name:              livefire-nsx-svc
Namespace:         livefire
Labels:            app=livefire-demo
Annotations:       <none>
Selector:          app=livefire-demo
Type:              ClusterIP
IP:                10.108.213.27
Port:              80/TCP
TargetPort:        80/TCP
Endpoints:         10.4.2.2:80,10.4.2.3:80,10.4.2.4:80

Checking the OVS flow table, we can see a flow whose destination is the Service IP and port (10.108.213.27:80) and whose action is “group:5”.

Bash
$ sudo ovs-ofctl dump-flows br-int
...
cookie=0x7, duration=170.052s, table=1, n_packets=0, n_bytes=0, priority=100,ct_state=+new+trk,tcp,nw_dst=10.108.213.27,tp_dst=80 actions=group:5

Group tables enable OpenFlow to make forwarding decisions across multiple links (load-balancing, multicast and active/standby). If we dump the group table, we can verify the load-balancing backend objects (the pods in the Service’s Endpoints list).

Bash
$ sudo ovs-ofctl -O OpenFlow13 dump-groups br-int
...
group_id=5,type=select,bucket=weight:100,actions=ct(commit,table=2,zone=65312,nat(dst=10.4.2.3:80)),bucket=weight:100,actions=ct(commit,table=2,zone=65312,nat(dst=10.4.2.2:80)),bucket=weight:100,actions=ct(commit,table=2,zone=65312,nat(dst=10.4.2.4:80))
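To narrow the dumps down to a single Service, the output can simply be filtered, and the per-bucket counters show how many packets each backend has received. dump-group-stats is a standard ovs-ofctl sub-command; the group id (5) is the one referenced by the flow above:

Bash
# Only the flows that match this Service IP
$ sudo ovs-ofctl dump-flows br-int | grep 10.108.213.27
# Per-bucket packet/byte counters for load-balancing group 5
$ sudo ovs-ofctl -O OpenFlow13 dump-group-stats br-int | grep group_id=5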

As I mentioned above, the ClusterIP Service is accessible only within the Kubernetes cluster. In my next blog we will discuss the options for exposing a containerised application outside of the cluster.

