
Troubleshooting Tiered Tragedy: A Peek Into Failure

Jeff Smith | GOTO Berlin 2019


Failure is complicated. Sometimes an incident reveals latent failures in your systems that have been sitting dormant, waiting for the right combination of factors to activate them. In this talk, Jeff Smith walks through a real failure scenario and the process Centro uses to highlight issues that go beyond the life cycle of an outage. We’ll walk through the importance of looking into signals before they become catastrophic and of ensuring your team has the capacity to do so. We’ll examine how monitoring the same system from multiple vantage points can help avoid confusion and bring clarity during an incident, how the Product organization plays a vital role in protecting system uptime, and lastly how a collaborative culture can decrease your Mean Time to Recovery.

**What will the audience learn from this talk?**

The audience will learn practical troubleshooting steps to take when encountering an issue, along with the pitfalls that exist without a diverse suite of tools at each layer. The audience will also learn tips for reading, prioritizing, and resolving early warning signs to prevent outages before they occur. Lastly, we’ll learn how a company’s workflow can have a hidden impact on accountability and responsibility for system stability.

**Does it feature code examples and/or live coding?**

No

**Prerequisite attendee experience level:**

[Level 100](https://blogs.technet.microsoft.com/ieitpro/2006/09/29/microsofts-standard-level-definitions-100-to-400/)



About the speakers

Jeff Smith

Author of "All Things Dork" Blog, Manager of Production Operations at Centro
