Kubenet fetches gateway from CNI result instead of calculating gateway from pod cidr #85993
Description
This write-up covers PR #85993, which fixed a regression in kubenet that prevented pods from obtaining IP addresses.
The goal of this PR was to refactor the way kubenet fetches the gateway. The problem was that kubenet was failing to allocate addresses for pods, with a node spec in 1.16, as described in Issue #84541.
Brief Overview of Kubenet
Kubenet is a Linux-only network plugin, meant to be basic and simple. It is not expected to implement things such as cross-node networking or network policy, and it is typically used in conjunction with a cloud provider that sets up routing rules for the node(s).
Here are some things that kubenet will do:
- Create a Linux bridge named cbr0
- Create a veth pair for each pod connected to cbr0
- Assign an IP address to the pod end of the veth pair
  - This IP address comes from a range that has been assigned to the node through configuration or by the controller-manager
- Assign an MTU to the cbr0
  - This MTU matches the smallest MTU of an enabled normal interface on the host
More information can be found in the k8s.io docs, which are also the source from which this overview was derived.
Old Logic Breakdown
Before the PR, the gateway was derived from the pod cidr by ranging over the list of current pod cidrs:
for idx, currentPodCIDR := range podCIDRs {
	_, cidr, err := net.ParseCIDR(currentPodCIDR)
	if nil != err {
		klog.Warningf("Failed to generate CNI network config with cidr %s at index:%v: %v", currentPodCIDR, idx, err)
		return
	}
	// create list of ips and gateways
	cidr.IP[len(cidr.IP)-1] += 1 // Set bridge address to first address in IPNet
	plugin.podCIDRs = append(plugin.podCIDRs, cidr)
	plugin.podGateways = append(plugin.podGateways, cidr.IP)
}
Notice how the above logic builds two lists: for each pod cidr, it computes the bridge (gateway) address, then appends the cidr to plugin.podCIDRs and the gateway to plugin.podGateways. Let’s break it down a bit.
First, we range over the pod cidrs to get their IP network value, a.k.a. IPNet, but in this context we call it the cidr.
for idx, currentPodCIDR := range podCIDRs {
_, cidr, err := net.ParseCIDR(currentPodCIDR)
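To see what net.ParseCIDR hands back, here is a small standalone sketch. The helper name splitCIDR and the sample addresses are made up for illustration and are not part of kubenet. It shows that the second return value is always the network, even if the input string carries a host address, which is why the old code could discard the first return value with `_`.

```go
package main

import (
	"fmt"
	"net"
)

// splitCIDR parses a CIDR string and returns the address part and the
// network part as strings. The helper name is hypothetical.
func splitCIDR(s string) (addr string, network string, err error) {
	ip, ipNet, err := net.ParseCIDR(s)
	if err != nil {
		return "", "", err
	}
	return ip.String(), ipNet.String(), nil
}

func main() {
	for _, s := range []string{"10.244.1.0/24", "10.244.1.7/24", "not-a-cidr"} {
		addr, network, err := splitCIDR(s)
		if err != nil {
			// Malformed input is reported instead of being used.
			fmt.Printf("%q: %v\n", s, err)
			continue
		}
		fmt.Printf("%q: address=%s network=%s\n", s, addr, network)
	}
}
```

In kubenet’s Event method a parse failure is logged with klog.Warningf and the method returns early; the sketch simply prints the error.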
Now that we have the cidr value, let’s create the list of ips and gateways. We first need to set the bridge address. This is done by setting it to the first address in the IPNet, a.k.a. that cidr value we’ve just mentioned:
cidr.IP[len(cidr.IP)-1] += 1
What that line does is take the IPNet.IP value (e.g. 10.0.0.0) and increment its last byte by 1 (e.g. 10.0.0.1). Notice that we are mutating the network number stored in the IPNet itself when we write to cidr.IP. If you look at how IPNet works, it’s a struct containing the IP (network number) and the Mask (of type IPMask). This is how that struct looks in Go’s net package:
type IPNet struct {
IP IP // network number
Mask IPMask // network mask
}
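The effect of that in-place increment is easy to miss, so here is a minimal sketch of it using only the standard library. The function name firstAddress is an assumption made for this example; the increment itself is the same expression the old kubenet code used. Note how the IPNet no longer prints as a clean network after the mutation.

```go
package main

import (
	"fmt"
	"net"
)

// firstAddress mimics the old kubenet step: parse the CIDR, then bump the
// last byte of the network number in place. It returns the resulting
// gateway and how the IPNet prints after the mutation.
// The function name is hypothetical.
func firstAddress(podCIDR string) (gateway string, mutated string, err error) {
	_, cidr, err := net.ParseCIDR(podCIDR)
	if err != nil {
		return "", "", err
	}
	// Only the final byte changes; there is no carry into earlier bytes.
	cidr.IP[len(cidr.IP)-1]++
	return cidr.IP.String(), cidr.String(), nil
}

func main() {
	gw, mutated, err := firstAddress("10.0.0.0/24")
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	fmt.Println("gateway:", gw)                // 10.0.0.1
	fmt.Println("IPNet after mutation:", mutated) // 10.0.0.1/24
}
```

Because cidr is a pointer to the IPNet, anything that later reads the same value (such as a list of pod cidrs) sees 10.0.0.1/24 rather than 10.0.0.0/24.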
Now we move on to this line of code:
plugin.podCIDRs = append(plugin.podCIDRs, cidr)
Here, plugin is the method receiver for the Event method that we are currently in, referencing the kubenetNetworkPlugin defined earlier in the file as:
type kubenetNetworkPlugin struct {
network.NoopNetworkPlugin
host network.Host
netConfig *libcni.NetworkConfig
loConfig *libcni.NetworkConfig
cniConfig libcni.CNI
bandwidthShaper bandwidth.Shaper
mu sync.Mutex //Mutex for protecting podIPs map, netConfig, and shaper initialization
podIPs map[kubecontainer.ContainerID]utilsets.String
mtu int
execer utilexec.Interface
nsenterPath string
hairpinMode kubeletconfig.HairpinMode
// kubenet can use either hostportSyncer and hostportManager to implement hostports
// Currently, if network host supports legacy features, hostportSyncer will be used,
// otherwise, hostportManager will be used.
hostportSyncer hostport.HostportSyncer
hostportSyncerv6 hostport.HostportSyncer
hostportManager hostport.HostPortManager
hostportManagerv6 hostport.HostPortManager
iptables utiliptables.Interface
iptablesv6 utiliptables.Interface
sysctl utilsysctl.Interface
ebtables utilebtables.Interface
// binDirs is passed by kubelet cni-bin-dir parameter.
// kubenet will search for CNI binaries in DefaultCNIDir first, then continue to binDirs.
binDirs []string
nonMasqueradeCIDR string
cacheDir string
podCIDRs []*net.IPNet
podGateways []net.IP
}
So, when we want to append the updated cidr to plugin.podCIDRs we are referring to a list of type *net.IPNet.
The next line also does some appending,
plugin.podGateways = append(plugin.podGateways, cidr.IP)
But, instead of the whole IPNet, we are appending cidr.IP to a list of type net.IP.
New Logic Breakdown
The logic described above was replaced with changes to the same kubenet_linux.go file. Changes were made to the kubenetNetworkPlugin struct and to some of the methods on that struct:
- Event
- setup
- syncEbtablesDedupRules
- getRangesConfig
First, we removed podGateways:
podGateways []net.IP
from the kubenetNetworkPlugin struct.
Next, we updated the Event method’s logic:
for idx, currentPodCIDR := range podCIDRs {
	_, cidr, err := net.ParseCIDR(currentPodCIDR)
	if nil != err {
		klog.Warningf("Failed to generate CNI network config with cidr %s at index:%v: %v", currentPodCIDR, idx, err)
		return
	}
	// create list of ips
	plugin.podCIDRs = append(plugin.podCIDRs, cidr)
}
What should stand out is that we are no longer computing the bridge address here or appending gateway values to podGateways in the kubenetNetworkPlugin struct. We’ve removed that altogether. This removes the dependency on pod cidrs to derive a gateway value. So, how do we get the gateway now?
That’s where the next steps come in.
Now, let’s update the setup method. This method is responsible for setting up networking through CNI using the given ns/name and sandbox ID. Let’s start off by declaring two local slices, podGateways of type []net.IP and podCIDRs of type []net.IPNet:
var podGateways []net.IP
var podCIDRs []net.IPNet
We append to these lists depending on whether the CNI result contains an IPv4 address, an IPv6 address, or both:
//TODO: v1.16 (khenidak) update NET_CONFIG_TEMPLATE to CNI version 0.3.0 or later so
// that we get multiple IP addresses in the returned Result structure
if res.IP4 != nil {
	ipv4 = res.IP4.IP.IP.To4()
	podGateways = append(podGateways, res.IP4.Gateway)
	podCIDRs = append(podCIDRs, net.IPNet{IP: ipv4.Mask(res.IP4.IP.Mask), Mask: res.IP4.IP.Mask})
}
if res.IP6 != nil {
	ipv6 = res.IP6.IP.IP
	podGateways = append(podGateways, res.IP6.Gateway)
	podCIDRs = append(podCIDRs, net.IPNet{IP: ipv6.Mask(res.IP6.IP.Mask), Mask: res.IP6.IP.Mask})
}
For reference, res is a variable defined earlier in this setup method as:
// Coerce the CNI result version
res, err := cnitypes020.GetResult(resT)
So, when we say
if res.IP4 != nil
or
if res.IP6 != nil
we are checking the CNI result to see which IP address families were returned.
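To make the masking step concrete, here is a sketch that rebuilds a pod cidr from a pod address and takes the gateway as given, the way the new setup code does with the CNI result. The ipConfig struct is a hypothetical stand-in for one address entry of the result, and describe, podNetwork and the sample addresses are assumptions for this example only.

```go
package main

import (
	"fmt"
	"net"
)

// ipConfig is a hypothetical stand-in for one address entry of a CNI
// result: the pod address with its mask, plus the gateway chosen by the
// plugin that produced the result.
type ipConfig struct {
	IP      net.IPNet
	Gateway net.IP
}

// podNetwork masks the pod address down to its network number, mirroring
// the net.IPNet{IP: ip.Mask(mask), Mask: mask} expression in setup.
func podNetwork(cfg ipConfig) net.IPNet {
	return net.IPNet{IP: cfg.IP.IP.Mask(cfg.IP.Mask), Mask: cfg.IP.Mask}
}

// describe builds an ipConfig from strings and reports the gateway and
// pod cidr that would be collected for it.
func describe(podAddr, gateway string) (gw string, cidr string, err error) {
	ip, ipNet, err := net.ParseCIDR(podAddr)
	if err != nil {
		return "", "", err
	}
	cfg := ipConfig{
		IP:      net.IPNet{IP: ip, Mask: ipNet.Mask},
		Gateway: net.ParseIP(gateway),
	}
	if cfg.Gateway == nil {
		return "", "", fmt.Errorf("invalid gateway %q", gateway)
	}
	n := podNetwork(cfg)
	return cfg.Gateway.String(), n.String(), nil
}

func main() {
	gw, cidr, err := describe("10.244.1.5/24", "10.244.1.1")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("gateway=%s podCIDR=%s\n", gw, cidr)
}
```

The point of the design is that the gateway is read from the result rather than computed, so nothing needs to mutate the pod cidr any more.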
Then, at the bottom of this setup method, we make a call to a method that eliminates duplicate packets by configuring the rules for ebtables:
// configure the ebtables rules to eliminate duplicate packets by best effort
plugin.syncEbtablesDedupRules(link.Attrs().HardwareAddr, podCIDRs, podGateways)
If you have a sharp eye, you may have noticed the change in the syncEbtablesDedupRules method signature. Here’s how it was called previously:
plugin.syncEbtablesDedupRules(link.Attrs().HardwareAddr)
The extra arguments are there because syncEbtablesDedupRules now needs podCIDRs and podGateways when we do the following:
// per gateway rule
for idx, gw := range podGateways {
	klog.V(3).Infof("Filtering packets with ebtables on mac address: %v, gateway: %v, pod CIDR: %v", macAddr.String(), gw.String(), podCIDRs[idx].String())

	bIsV6 := netutils.IsIPv6(gw)
	IPFamily := "IPv4"
	ipSrc := "--ip-src"
	if bIsV6 {
		IPFamily = "IPv6"
		ipSrc = "--ip6-src"
	}
	commonArgs := []string{"-p", IPFamily, "-s", macAddr.String(), "-o", "veth+"}
	_, err = plugin.ebtables.EnsureRule(utilebtables.Prepend, utilebtables.TableFilter, dedupChain, append(commonArgs, ipSrc, gw.String(), "-j", "ACCEPT")...)
	if err != nil {
		klog.Errorf("Failed to ensure packets from cbr0 gateway:%v to be accepted with error:%v", gw.String(), err)
		return
	}
	_, err = plugin.ebtables.EnsureRule(utilebtables.Append, utilebtables.TableFilter, dedupChain, append(commonArgs, ipSrc, podCIDRs[idx].String(), "-j", "DROP")...)
	if err != nil {
		klog.Errorf("Failed to ensure packets from podCidr[%v] but has mac address of cbr0 to get dropped. err:%v", podCIDRs[idx].String(), err)
		return
	}
}
The changes above reflect the change in setup, where we build a local podCIDRs from the CNI result instead of reading plugin.podCIDRs, the field on the kubenetNetworkPlugin struct. Here’s the diff for that change:
klog.V(3).Infof("Filtering packets with ebtables on mac address: %v, gateway: %v, pod CIDR: %v", macAddr.String(), gw.String(), plugin.podCIDRs[idx].String())
to
klog.V(3).Infof("Filtering packets with ebtables on mac address: %v, gateway: %v, pod CIDR: %v", macAddr.String(), gw.String(), podCIDRs[idx].String())
Similarly, the DROP rule and its error message changed. We’ve gone from:
_, err = plugin.ebtables.EnsureRule(utilebtables.Append, utilebtables.TableFilter, dedupChain, append(commonArgs, ipSrc, plugin.podCIDRs[idx].String(), "-j", "DROP")...)
if err != nil {
klog.Errorf("Failed to ensure packets from podCidr[%v] but has mac address of cbr0 to get dropped. err:%v", plugin.podCIDRs[idx].String(), err)
return
}
to
_, err = plugin.ebtables.EnsureRule(utilebtables.Append, utilebtables.TableFilter, dedupChain, append(commonArgs, ipSrc, podCIDRs[idx].String(), "-j", "DROP")...)
if err != nil {
klog.Errorf("Failed to ensure packets from podCidr[%v] but has mac address of cbr0 to get dropped. err:%v", podCIDRs[idx].String(), err)
return
}
Again, updating plugin.podCIDRs to be podCIDRs instead.
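To get a feel for what the two rules per gateway look like, here is a sketch that only assembles the argument lists and never runs ebtables. The function names, the MAC address and the sample addresses are made up for this example, and gw.To4() == nil is used as a standard-library stand-in for netutils.IsIPv6.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// dedupRuleArgs builds the argument lists for the two rules ensured per
// gateway: ACCEPT traffic sourced from the gateway address, and DROP other
// traffic from the pod cidr that carries the bridge's MAC address.
// Nothing is executed; the function only assembles arguments.
func dedupRuleArgs(mac, gateway, podCIDR string) (accept, drop []string, err error) {
	gw := net.ParseIP(gateway)
	if gw == nil {
		return nil, nil, fmt.Errorf("invalid gateway %q", gateway)
	}
	family, ipSrc := "IPv4", "--ip-src"
	if gw.To4() == nil { // assumption: stand-in for netutils.IsIPv6
		family, ipSrc = "IPv6", "--ip6-src"
	}
	common := []string{"-p", family, "-s", mac, "-o", "veth+"}
	// Copy before appending so the two rules never share a backing array.
	accept = append(append([]string{}, common...), ipSrc, gw.String(), "-j", "ACCEPT")
	drop = append(append([]string{}, common...), ipSrc, podCIDR, "-j", "DROP")
	return accept, drop, nil
}

// joined renders an argument list as one line for printing or comparing.
func joined(args []string) string { return strings.Join(args, " ") }

func main() {
	accept, drop, err := dedupRuleArgs("0a:58:0a:f4:01:01", "10.244.1.1", "10.244.1.0/24")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(joined(accept))
	fmt.Println(joined(drop))
}
```

In kubenet the ACCEPT rule is prepended and the DROP rule is appended to the dedup chain, so traffic from the gateway address is matched before the broader pod cidr rule.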
Finally, let’s hop out of the setup method and jump to getRangesConfig. This was a small method that was updated as well. It gets referenced in the Event method that we’ve talked about earlier. In Event, it is used to make the json output for the CNI network config:
json := fmt.Sprintf(NET_CONFIG_TEMPLATE, BridgeName, plugin.mtu, network.DefaultInterfaceName, setHairpin, plugin.getRangesConfig(), plugin.getRoutesConfig())
klog.V(4).Infof("CNI network config set to %v", json)
plugin.netConfig, err = libcni.ConfFromBytes([]byte(json))
That getRangesConfig method went from this:
// given a n cidrs assigned to nodes,
// create bridge configuration that conforms to them
func (plugin *kubenetNetworkPlugin) getRangesConfig() string {
createRange := func(thisNet *net.IPNet) string {
template := `
[{
"subnet": "%s",
"gateway": "%s"
}]`
return fmt.Sprintf(template, thisNet.String(), thisNet.IP.String())
}
ranges := make([]string, len(plugin.podCIDRs))
for idx, thisCIDR := range plugin.podCIDRs {
ranges[idx] = createRange(thisCIDR)
}
//[{range}], [{range}]
// each range is a subnet and a gateway
return strings.Join(ranges[:], ",")
}
to
func (plugin *kubenetNetworkPlugin) getRangesConfig() string {
createRange := func(thisNet *net.IPNet) string {
template := `
[{
"subnet": "%s"
}]`
return fmt.Sprintf(template, thisNet.String())
}
ranges := make([]string, len(plugin.podCIDRs))
for idx, thisCIDR := range plugin.podCIDRs {
ranges[idx] = createRange(thisCIDR)
}
//[{range}], [{range}]
// each range contains a subnet. gateway will be fetched from cni result
return strings.Join(ranges[:], ",")
}
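Here is a small standalone sketch of what the subnet-only ranges fragment looks like for a dual-stack node. It uses a compact template rather than the multi-line one above, and rangesFor, validAsArray and the sample cidrs are hypothetical names and values for this example. Wrapping the fragment in brackets to check it assumes the fragment ends up inside an enclosing JSON array in the network config.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
	"strings"
)

// rangesConfig renders one [{"subnet": ...}] entry per pod cidr and joins
// the entries with commas, mirroring the shape of the updated method.
func rangesConfig(podCIDRs []*net.IPNet) string {
	ranges := make([]string, len(podCIDRs))
	for i, cidr := range podCIDRs {
		ranges[i] = fmt.Sprintf(`[{"subnet": "%s"}]`, cidr.String())
	}
	return strings.Join(ranges, ",")
}

// rangesFor parses cidr strings and renders them (hypothetical helper).
func rangesFor(cidrs ...string) (string, error) {
	parsed := make([]*net.IPNet, 0, len(cidrs))
	for _, s := range cidrs {
		_, ipNet, err := net.ParseCIDR(s)
		if err != nil {
			return "", err
		}
		parsed = append(parsed, ipNet)
	}
	return rangesConfig(parsed), nil
}

// validAsArray reports whether the fragment is valid JSON once it is
// wrapped in brackets.
func validAsArray(fragment string) bool {
	return json.Valid([]byte("[" + fragment + "]"))
}

func main() {
	out, err := rangesFor("10.244.0.0/24", "fd00:10:244::/64")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
	fmt.Println("valid inside an array:", validAsArray(out))
}
```

With no "gateway" key in the range, choosing the gateway is left to whatever consumes the config, and kubenet reads the chosen value back from the CNI result.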
Conclusion
We’ve just learned a bit about how kubenet works, how it fetched the gateway in the past, and how it fetches it as of this PR. Bugs can be very interesting and lead to interesting solutions and insight. The hope is that you walk away from this PR breakdown with a little more knowledge of the inner workings of Kubernetes, Go, and networking. As a community, it is important to keep learning and sharing the knowledge we gain from our experiences.
"We are only as strong as we are united, as weak as we are divided."
― J.K. Rowling, Harry Potter and the Goblet of Fire