Categories: Engineering Manager, Tech Lead, Star Engineer
How strongly do I recommend Release It!?
7 / 10
Based on the title, I expected Release It! to be about increasing deployment frequency – a primary interest of mine in building high-performance engineering teams. It is not, but that turned out to be a pleasant surprise.
Release It! is a good read for both engineers and engineering managers, particularly product engineers who also take on systems and DevOps responsibilities. The author smartly ties tactical recommendations to broader concepts and themes, which will help you communicate with other software engineers.
Top Ideas in This Book
Restore service first. Then worry about deep diving into the problem.
However, there’s a catch. Restoring service often comes at the cost of not understanding the problem.
Your systems and machines are in a bad state. When restoring them to a good state, you often inhibit your ability to debug the issue. But that’s a trade-off you sometimes need to make.
Bugs cannot be eliminated, but they can be survived by preventing propagation.
Cascading failures occur when one system experiences an issue, subsequently causing issues in another system, potentially causing issues in yet another system. On and on we go.
In other words, cascading failures exist because of relationships between systems.
Nygard provides a lot of commentary on this topic throughout the book. Some of my favorite thoughts are:
Chaos Engineering is a good way to understand how resilient your systems are against cascading failures. Netflix’s Chaos Monkey famously shuts down random services and servers in production to test how dependent systems respond.
You can load balance, govern requests, shed load, fail fast, and do plenty more to mitigate risk. But fundamentally, you can’t control the volume of requests your system receives or the nature of those requests.
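One of the stability patterns the book leans on for containing that risk is the circuit breaker: stop calling a dependency that keeps failing, fail fast instead, and try again only after a cooldown. Here is a minimal sketch in Python – my own illustration with arbitrary thresholds, not code from the book:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch: fail fast once a dependency has
    failed repeatedly, instead of letting every caller wait on it."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds before a retry
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Fail fast: don't even attempt the downstream call.
                raise RuntimeError("circuit open: downstream dependency unavailable")
            # Cooldown elapsed: allow one trial call ("half-open").
            self.opened_at = None

        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # open the circuit
            raise
        else:
            self.failures = 0                       # success closes the circuit
            return result
```

Wrapping each integration-point call in something like `breaker.call(...)` turns a struggling dependency into an immediate, handleable error instead of a pile-up of blocked threads that drags your own system down with it.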
Temporary fixes often arise in two situations:
I give my engineers the benefit of the doubt. In my experience, engineers usually just forget about that temporary fix. They’re constantly bombarded with new problems consuming their attention.
Your engineers usually just need a reminder. As a manager, when my team is done firefighting or prototyping, I like to ask what needs to be done to make the code production-worthy.
Then comes the hard part. You personally need to value the process of transforming a temporary fix into a production-worthy fix, and support that value in the face of mounting pressure from your product roadmap.
Rather than focusing on how efficient your developers are, focus on how efficiently work moves through your process.
Focusing on process and not people can feel counter-intuitive for managers. Aren’t we managers of people? Yes, but your job is to make work possible and that often means addressing process failures.
Diagramming your value delivery chain is a good place to start. The DevOps Handbook provides insight on how exactly to map your value delivery chain, but the basic idea is to list every step in your process and how long each step takes – both in hands-on execution time and in total elapsed time. For instance, code review may only require 5 minutes of execution, but often takes hours or days of elapsed time.
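As a toy illustration of that mapping – the step names and times below are made up, not from either book – you can lay each step’s hands-on time next to its elapsed time and see how little of the lead time is active work:

```python
# Hypothetical value delivery chain: (step, execution_minutes, elapsed_minutes)
steps = [
    ("write code",       240, 480),
    ("code review",        5, 1440),  # 5 minutes of work, often a day of waiting
    ("CI build + tests",  20, 60),
    ("deploy to prod",    10, 120),
]

execution = sum(e for _, e, _ in steps)
elapsed = sum(t for _, _, t in steps)

for name, e, t in steps:
    print(f"{name:18} work: {e:4} min   elapsed: {t:5} min")

print(f"\nflow efficiency: {execution / elapsed:.0%} of lead time is active work")
```

The limiting factors usually jump out of a table like this: the steps where elapsed time dwarfs execution time are where your releases are waiting.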
Value delivery chain diagrams usually help identify a few limiting factors that slow your release frequency and increase lead times – two of the key engineering performance metrics tracked in Accelerate. Some common areas I’ve seen slowing teams from releasing are:
Here are some good questions to ask when confronted with a system diagram:
Focus on the arrows more than the boxes.
Andy Grove said of leadership that only the paranoid survive, and the same is true of software engineering. Evaluate your systems with a cynical eye. Identify your bottlenecks and assume they will be overwhelmed at some point.
View integration points skeptically. Systems that you don’t control will eventually fail you in unexpected ways.
Your application will eventually face a denial of service, whether from a malicious attack or a friendly hug of death. When that happens, you need the ability to shed load so that your system can recover and respond correctly to the requests that do make it through.
Services should monitor their own response times and respond accordingly. In a service-oriented or microservice architecture, the service itself should be able to respond appropriately to the pressure being applied.
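A rough sketch of what that might look like, assuming a generic request handler rather than any particular framework: the service tracks its own recent response times and starts shedding load once it blows past a latency budget.

```python
import collections
import time

class LoadSheddingHandler:
    """Sketch: a service that watches its own response times and sheds
    load (fails fast with a 503) when it is over its latency budget."""

    def __init__(self, latency_budget_s=0.5, window=100):
        self.latency_budget_s = latency_budget_s
        self.recent = collections.deque(maxlen=window)  # last N response times

    def overloaded(self):
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough data yet; don't shed during warm-up
        return sum(self.recent) / len(self.recent) > self.latency_budget_s

    def handle(self, request, do_work):
        if self.overloaded():
            # Shed load: respond immediately so callers can back off or retry
            # elsewhere, instead of queueing work we can't finish in time.
            return {"status": 503, "body": "overloaded, try again later"}

        start = time.monotonic()
        body = do_work(request)
        self.recent.append(time.monotonic() - start)
        return {"status": 200, "body": body}
```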
Humans apply judgment based on context. Automation does not. When your automation is misconfigured – and you often don’t discover the mistake until it is too late – the system reacts quickly and without judgment.
Your job is to make sure that automation does not go horrifically wrong and throw your system into an unrecoverable state.
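One safeguard in that spirit – a sketch of my own, not code from the book – is to put a governor on destructive automated actions: cap how much the automation may do per time window, so a bad configuration degrades the system slowly enough for a human to step in. The `terminate_instance` and `page_oncall` calls below are hypothetical placeholders.

```python
import time

class ActionGovernor:
    """Sketch: limit how many destructive actions automation may take per
    time window, so a bad config degrades slowly instead of instantly."""

    def __init__(self, max_actions=2, window_s=600):
        self.max_actions = max_actions
        self.window_s = window_s
        self.history = []  # timestamps of recent actions

    def allow(self):
        now = time.monotonic()
        self.history = [t for t in self.history if now - t < self.window_s]
        if len(self.history) >= self.max_actions:
            return False  # over the limit: require a human to intervene
        self.history.append(now)
        return True

# Usage: the autoscaler asks the governor before terminating each instance.
# governor = ActionGovernor(max_actions=2, window_s=600)
# if governor.allow():
#     terminate_instance(instance_id)   # hypothetical helper
# else:
#     page_oncall("automation throttled: too many terminations in 10 minutes")
```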
Here are some examples:
These don’t need to live in the same dashboard, but they should be accessible and ideally configured with anomaly detection, so your team receives push alerts when something goes wrong without having to constantly monitor these outcomes.
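For a generic sense of what that anomaly detection might look like – rolling statistics, not any particular monitoring product – a metric can be flagged whenever it drifts several standard deviations from its recent baseline:

```python
import statistics

def is_anomalous(history, latest, sigmas=3.0, min_points=30):
    """Flag `latest` if it sits more than `sigmas` standard deviations from
    the mean of recent history. A crude stand-in for the anomaly detection
    a real monitoring tool would provide."""
    if len(history) < min_points:
        return False  # not enough baseline data yet
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) > sigmas * stdev

# Example: checkout error rate per minute; the spike should trigger an alert.
baseline = [0.2, 0.3, 0.25, 0.2, 0.3] * 6   # 30 quiet minutes
print(is_anomalous(baseline, 0.27))          # False
print(is_anomalous(baseline, 4.0))           # True -> page the team
```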