vinni_f asked:
We recently had a storm of kernel: neighbour: arp_cache: neighbor table overflow!
errors on some nodes of our kubernetes cluster.
After research we’ve figured out that we needed to bump our servers’ net.ipv4.neigh.default.gc_thresh{1,2,3}
to match the needs of some of the new applications that have recently been added to our cluster.
While doing some research I was not able to find any decent way to calculate what would be the best suited values to put there.
How does one figure out what should be set there?
My answer:
Let’s look at the documentation:
neigh/default/gc_thresh1 - INTEGER
    Minimum number of entries to keep. Garbage collector will not
    purge entries if there are fewer than this number.
    Default: 128
neigh/default/gc_thresh2 - INTEGER
    Threshold when garbage collector becomes more aggressive about
    purging entries. Entries older than 5 seconds will be cleared
    when over this number.
    Default: 512
neigh/default/gc_thresh3 - INTEGER
    Maximum number of non-PERMANENT neighbor entries allowed. Increase
    this when using large numbers of interfaces and when communicating
    with large numbers of directly-connected peers.
    Default: 1024
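To see where a node currently sits relative to these thresholds, something like the following rough sketch may help. It assumes a Linux host where /proc/net/arp and /proc/sys/net/ipv4/neigh/default/ are readable, it only counts IPv4 entries visible in the current network namespace, and the function names are made up for illustration.

```python
from pathlib import Path

# Standard procfs locations on Linux; adjust if your system differs.
ARP_PATH = Path("/proc/net/arp")
NEIGH_DIR = Path("/proc/sys/net/ipv4/neigh/default")


def count_arp_entries(arp_text: str) -> int:
    """Count data rows in /proc/net/arp-style text (the first line is a header)."""
    lines = [line for line in arp_text.splitlines() if line.strip()]
    return max(len(lines) - 1, 0)


def classify(entries: int, thresh1: int, thresh2: int, thresh3: int) -> str:
    """Describe the table state using the semantics quoted from the kernel docs."""
    if entries >= thresh3:
        return "at or over gc_thresh3: table is full, expect overflow errors"
    if entries > thresh2:
        return "over gc_thresh2: aggressive garbage collection"
    if entries >= thresh1:
        return "at or over gc_thresh1: entries are eligible for garbage collection"
    return "below gc_thresh1: garbage collector will not purge entries"


if __name__ == "__main__":
    if ARP_PATH.exists() and NEIGH_DIR.exists():
        entries = count_arp_entries(ARP_PATH.read_text())
        t1, t2, t3 = (
            int((NEIGH_DIR / f"gc_thresh{i}").read_text()) for i in (1, 2, 3)
        )
        print(f"{entries} entries, thresholds {t1}/{t2}/{t3}")
        print(classify(entries, t1, t2, t3))
    else:
        print("procfs neighbor files not found; is this a Linux host?")
```

Running this periodically during normal load gives a real number to size against, rather than a guess.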
To overflow the neighbor table, a node has to need more than gc_thresh3
neighbor table entries. In this case the entries come from Kubernetes pods, as each pod has its own network namespace with a unique interface and unique MAC address.
That’s a lot of containers!
How you tune these values depends on the workload you expect to serve. Was it a one-off? Do nothing. Not sure what to do? Double everything and wait to see what happens. Do you know you have 5000 pods? Set gc_thresh3 comfortably above that number, and raise gc_thresh1 and gc_thresh2 along with it, so that the table doesn’t fill up.
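As a rough illustration of the "you know you have 5000 pods" case, here is one possible sizing heuristic. It is an assumption of mine and not an official formula: take the expected number of neighbors, apply a headroom factor (2x by default), round gc_thresh3 up to a power of two, and keep the 1:4:8 ratio of the kernel defaults (128/512/1024). The function name and the headroom parameter are hypothetical.

```python
def suggest_thresholds(expected_neighbors: int, headroom: float = 2.0) -> dict[str, int]:
    """Suggest gc_thresh values; a heuristic sketch, not an official formula."""
    if expected_neighbors <= 0:
        raise ValueError("expected_neighbors must be positive")
    target = int(expected_neighbors * headroom)
    hard_limit = 1024  # never go below the kernel default for gc_thresh3
    while hard_limit < target:
        hard_limit *= 2
    prefix = "net.ipv4.neigh.default"
    return {
        f"{prefix}.gc_thresh1": hard_limit // 8,  # same 1:4:8 ratio as the defaults
        f"{prefix}.gc_thresh2": hard_limit // 2,
        f"{prefix}.gc_thresh3": hard_limit,
    }


def as_sysctl_conf(values: dict[str, int]) -> str:
    """Render the values as lines suitable for a sysctl.conf-style file."""
    return "\n".join(f"{key} = {value}" for key, value in values.items())


if __name__ == "__main__":
    print(as_sysctl_conf(suggest_thresholds(5000)))
    # net.ipv4.neigh.default.gc_thresh1 = 2048
    # net.ipv4.neigh.default.gc_thresh2 = 8192
    # net.ipv4.neigh.default.gc_thresh3 = 16384
```

The output can be dropped into a file under /etc/sysctl.d/ and loaded with sysctl --system. Treat the numbers as a starting point and check them against what the nodes actually hold under load.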
View the full question and any other answers on Server Fault.
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.