How to decently configure net.ipv4.neigh.default.gc_thresh_n

vinni_f asked:

We recently had a storm of "kernel: neighbour: arp_cache: neighbor table overflow!" errors on some nodes of our Kubernetes cluster.

After some research we figured out that we needed to bump our servers’ net.ipv4.neigh.default.gc_thresh{1,2,3} to match the needs of some of the new applications that have recently been added to our cluster.

While doing that research I was not able to find any decent way to calculate the best-suited values to put there.

How does one figure out what should be set there?

My answer:

Let’s look at the documentation:

neigh/default/gc_thresh1 - INTEGER
    Minimum number of entries to keep.  Garbage collector will not
    purge entries if there are fewer than this number.
    Default: 128

neigh/default/gc_thresh2 - INTEGER
    Threshold when garbage collector becomes more aggressive about
    purging entries. Entries older than 5 seconds will be cleared
    when over this number.
    Default: 512

neigh/default/gc_thresh3 - INTEGER
    Maximum number of non-PERMANENT neighbor entries allowed.  Increase
    this when using large numbers of interfaces and when communicating
    with large numbers of directly-connected peers.
    Default: 1024

In order to overflow the neighbor table, you have to have more than gc_thresh3 neighbor table entries. In this case, those entries most likely come from Kubernetes pods, as each pod has its own network namespace with a unique interface and unique MAC address, and so takes up its own entry in the table.

That’s a lot of containers!
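To see how close a node is to those limits, something like the following rough sketch could help. It assumes a Linux host where /proc/net/arp and /proc/sys/net/ipv4/neigh/default are readable; the function names are mine, not part of any tool. Note that /proc/net/arp only lists IPv4 entries visible from the network namespace you run it in, so treat the count as a lower bound rather than the kernel's exact tally.

```python
from pathlib import Path

THRESHOLD_NAMES = ("gc_thresh1", "gc_thresh2", "gc_thresh3")


def count_arp_entries(arp_text: str) -> int:
    """Count entries in text laid out like /proc/net/arp (first line is a header)."""
    lines = [line for line in arp_text.splitlines() if line.strip()]
    return max(len(lines) - 1, 0)


def read_thresholds(base: str = "/proc/sys/net/ipv4/neigh/default") -> dict:
    """Read gc_thresh1..3 as integers from the sysctl directory `base`."""
    return {
        name: int((Path(base) / name).read_text().strip())
        for name in THRESHOLD_NAMES
    }


def describe(entries: int, thresholds: dict) -> str:
    """Say which band of the documented thresholds the entry count falls in."""
    if entries >= thresholds["gc_thresh3"]:
        return "at or over gc_thresh3: expect neighbor table overflow"
    if entries > thresholds["gc_thresh2"]:
        return "over gc_thresh2: garbage collector purges aggressively"
    if entries < thresholds["gc_thresh1"]:
        return "under gc_thresh1: garbage collector will not purge"
    return "between gc_thresh1 and gc_thresh2: entries are subject to garbage collection"


def check_node() -> str:
    """Hypothetical helper: inspect the live node (Linux only)."""
    entries = count_arp_entries(Path("/proc/net/arp").read_text())
    return f"{entries} entries, {describe(entries, read_thresholds())}"
```

Calling check_node() on a busy node every so often gives you an idea of how much headroom is left before the overflow messages start.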

How you tune these values depends on the workload you expect to serve. Was it a one-off? Do nothing. Not sure what to do? Double everything and wait to see what happens. Do you know you have 5000 pods? Set the values so that you don’t run out of space.
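As a minimal sketch of the last two options, here is one way to turn "double everything" or "I know I have N pods" into numbers. The 2x headroom and the choice to keep roughly the same 1:4:8 spacing as the defaults are my assumptions, not anything the kernel documentation prescribes, and the function names are made up for illustration.

```python
import math

# Defaults as quoted from the kernel documentation above.
DEFAULTS = {"gc_thresh1": 128, "gc_thresh2": 512, "gc_thresh3": 1024}


def doubled(current: dict) -> dict:
    """The 'double everything and see what happens' option."""
    return {name: value * 2 for name, value in current.items()}


def sized_for(expected_peers: int, headroom: float = 2.0) -> dict:
    """Size gc_thresh3 above the expected peer count (assumed 2x headroom).

    gc_thresh2 and gc_thresh1 are set to 1/2 and 1/8 of gc_thresh3, which
    mirrors the spacing of the defaults. No value drops below its default.
    """
    thresh3 = max(DEFAULTS["gc_thresh3"], math.ceil(expected_peers * headroom))
    return {
        "gc_thresh1": max(DEFAULTS["gc_thresh1"], thresh3 // 8),
        "gc_thresh2": max(DEFAULTS["gc_thresh2"], thresh3 // 2),
        "gc_thresh3": thresh3,
    }


def to_sysctl_conf(values: dict) -> str:
    """Render the values as lines suitable for a sysctl configuration file."""
    return "\n".join(
        f"net.ipv4.neigh.default.{name} = {value}" for name, value in values.items()
    )


print(to_sysctl_conf(sized_for(5000)))
```

For 5000 pods this suggests gc_thresh3 = 10000. Whatever you end up with, watch the node afterwards and adjust if the overflow errors come back.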

View the full question and any other answers on Server Fault.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.