A Field Guide to Reliability Engineering at Zalando
Transcript
We present Zalando's approach to engineering reliability from a very small to a very large scale, and touch on both technological and human angles.
With over 50M customers across 23 countries, Zalando operates one of the largest eCommerce platforms worldwide. Achieving a reliable customer experience requires the intricate collaboration of over 3000 applications and more than 2000 software engineers who constantly seek to improve and extend product capabilities. In this talk, we will walk you through the best practices Zalando has arrived at to consistently achieve high levels of reliability.
- We will start with a simple stand-alone application and cover best practices for instrumentation, monitoring and alerting.
- We then continue the journey to products that span multiple applications operated by different teams. At this scale, methods like tracing and incident management become important.
- Finally, we will present the technologies and processes used to steer reliability at the company level. Here, WORM Cascades and Risk Management have proven highly effective.