A Field Guide to Reliability Engineering at Zalando

Heinrich Hartmann | GOTO Amsterdam 2024

Transcript

We present Zalando's approach to engineering reliability from a very small to a very large scale, and touch on both technological and human angles.

With over 50M customers across 23 countries, Zalando operates one of the largest eCommerce platforms worldwide. Achieving a reliable customer experience requires the intricate collaboration of over 3000 applications and more than 2000 software engineers who constantly seek to improve and extend product capabilities. In this talk we will walk you through the best practices Zalando has arrived at for consistently achieving high levels of reliability.

  • We will start with a simple stand-alone application and cover best practices for instrumentation, monitoring and alerting.
  • We continue the journey to products that span multiple applications operated by different teams. At this scale, methods like tracing and incident management become important.
  • Finally, we will present technologies and processes that are used to steer reliability at the company level. Here WORM Cascades and Risk Management have proven highly effective.

About the speakers

Heinrich Hartmann

Head of Reliability Engineering at Zalando SE
