Whether or not it has kernel expertise, a company of Twitter's size is going to regularly run into kernel issues, ranging from major production incidents to papercuts. Without a kernel team or equivalent expertise, the company will muddle through these issues, hitting unnecessary problems and taking unnecessarily long to mitigate incidents. As an example of a critical production incident, chosen only because it's already been written up publicly, I'll cite this post, which dryly notes:

Earlier last year, we identified a firewall misconfiguration which accidentally dropped most network traffic. We expected resetting the firewall configuration to fix the issue, but resetting the firewall configuration exposed a kernel bug

What this implies but doesn't explicitly say is that this firewall misconfiguration was the most severe incident that's occurred during my time at Twitter, and I believe it's actually the most severe outage that Twitter has had since 2013 or so. As a company, we would've still been able to mitigate the issue without a kernel team or another team with deep Linux expertise, but it would've taken longer to understand why the initial fix didn't work, which is the last thing you want when you're debugging a serious outage.

Another reason to have in-house expertise in various areas is that specialist teams easily pay for themselves, which is a special case of the generic argument that large companies should be larger than most people expect because tiny percentage gains are worth a large amount in absolute dollars. If, in the lifetime of a specialist team like the kernel team, a single person found something that persistently reduced TCO by 0.5%, that would pay for the team in perpetuity, and Twitter’s kernel team has found many such changes. In addition to kernel patches that sometimes have that kind of impact, people will also find configuration issues, etc., that have that kind of impact.
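To make the arithmetic behind the 0.5% claim concrete, here's a rough sketch in Python. Every dollar figure and headcount in it is a made-up placeholder, not Twitter's actual TCO, team size, or compensation; only the 0.5% reduction comes from the text above. The function names are also just illustrative.

```python
def annual_savings(tco_per_year: float, reduction: float) -> float:
    """Dollars saved per year by a persistent fractional reduction in TCO."""
    return tco_per_year * reduction


def team_cost(engineers: int, cost_per_engineer: float) -> float:
    """Yearly cost of a team, given a fully loaded cost per engineer."""
    return engineers * cost_per_engineer


def break_even_tco(engineers: int, cost_per_engineer: float, reduction: float) -> float:
    """Smallest yearly TCO at which one persistent reduction covers the team."""
    return team_cost(engineers, cost_per_engineer) / reduction


if __name__ == "__main__":
    # HYPOTHETICAL inputs, chosen only to illustrate the shape of the argument.
    tco = 1_000_000_000          # assumed yearly infrastructure TCO, in dollars
    engineers = 10               # assumed team size
    cost_per_engineer = 400_000  # assumed fully loaded yearly cost per engineer
    reduction = 0.005            # the 0.5% persistent reduction from the text

    saved = annual_savings(tco, reduction)
    cost = team_cost(engineers, cost_per_engineer)
    print(f"yearly savings: ${saved:,.0f}")
    print(f"yearly team cost: ${cost:,.0f}")
    print(f"pays for itself: {saved >= cost}")
    print(f"break-even TCO: ${break_even_tco(engineers, cost_per_engineer, reduction):,.0f}")
```

Because the saving recurs every year while the finding only has to happen once, any TCO above the break-even figure means a single such change keeps covering the team's cost indefinitely, and each additional finding is pure gain.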



