It is often said that the more things change, the more they stay the same. That adage is especially true in the world of IT support.
Folks living the fast-paced DevOps lifestyle think they are revolutionizing how the core fundamentals of technology work, but their hubris leaves them blind to the simple truths of electronic devices. A simple reset, in the form of cutting the power, solves nearly all your problems.
Memory leaks, buffer overflows, race conditions, and the like usually derive from two key factors: uptime duration and reliance on other components.
The longer something runs, embedded devices included, the greater the risk that something breaks its logical operations:
- The data for a given pointer eventually gets corrupted or overwritten.
- The logical order of operations gets out of sync, especially when external systems are relied upon for base functionality (e.g. seemingly random 500 or 400 level errors).
- There is a network failure and BGP can’t seem to figure out the route because the RIB never refreshed.
All in all, to truly build HA systems that are fault tolerant, reliable, and have a minimal level of resiliency, I always recommend another adage: reboot early and often.
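
For the curious, here is a minimal sketch of what "reboot early and often" might look like for a single process instead of a whole box. It assumes a hypothetical supervisor that restarts a child command once it has been up longer than a fixed limit. The names `should_recycle`, `run_with_recycling`, and `MAX_UPTIME_SECONDS` are made up for illustration, and the limit is set to seconds only so the sketch finishes quickly.

```python
import subprocess
import sys
import time

# Hypothetical limit; a real service would more likely use hours or days.
MAX_UPTIME_SECONDS = 2.0


def should_recycle(started_at: float, now: float, max_uptime: float) -> bool:
    """Return True once the process has been up for at least max_uptime seconds."""
    return (now - started_at) >= max_uptime


def run_with_recycling(command, max_uptime, cycles, poll_interval=0.1):
    """Start `command` and restart it when it exits or exceeds max_uptime.

    Stops after `cycles` starts so the sketch terminates. Returns how many
    restarts were forced by the uptime limit.
    """
    uptime_restarts = 0
    for _ in range(cycles):
        proc = subprocess.Popen(command)
        started_at = time.monotonic()
        while proc.poll() is None:  # child is still running
            if should_recycle(started_at, time.monotonic(), max_uptime):
                proc.terminate()  # ask politely first
                try:
                    proc.wait(timeout=5)
                except subprocess.TimeoutExpired:
                    proc.kill()  # then cut the power
                    proc.wait()
                uptime_restarts += 1
                break
            time.sleep(poll_interval)
    return uptime_restarts


if __name__ == "__main__":
    # A stand-in long-running child: a Python process that just sleeps.
    child = [sys.executable, "-c", "import time; time.sleep(60)"]
    forced = run_with_recycling(child, MAX_UPTIME_SECONDS, cycles=3)
    print(f"forced restarts: {forced}")
```

The uptime check uses `time.monotonic()` so a wall-clock adjustment cannot trick the supervisor into skipping or repeating a restart. Whether a fixed uptime limit is the right trigger is a judgment call; the same loop could just as easily watch memory use or error rates.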