Crafting Robust Architectures for a Resilient Future
Whether you're building a new system with an established team, trying to tame a legacy ecosystem, or starting from scratch, how you think about security and reliability has a big impact on how hard they are for you to achieve. In a candid conversation between security expert Eleanor Saitta and technology thought leader Jez Humble, the critical role of architectural clarity in ensuring robust security and resilience comes to the forefront. Saitta emphasizes the necessity of understanding and intentionally designing your architecture, highlighting the challenges faced by organizations in adapting to changing ecosystems. They discuss the dual aspects of security – external services and internal IT operations – shedding light on the potential risks associated with Windows and Office usage. Hear in this GOTO Unscripted talk about the significance of architectural awareness and basic IT hygiene in safeguarding organizations against security threats.
The overlap between Continuous Delivery and Secure Infrastructures
Jez Humble: Hello. My name is Jez Humble. I am a site reliability engineer at Google in the USA, working for Google Cloud. And I'm here at GOTO Aarhus with Eleanor Saitta, who is a principal at Systems Structure Limited. And Eleanor, do you wanna introduce yourself?
Eleanor Saitta: Thanks, Jez. It's lovely to be here with you. So, Systems Structure Limited provides fractional chief information security officer services to startups in the kind of pre-A to post-C range, roughly kind of 20 to 100 engineers. I've been a consultant in the security world for about 20 years and have done a bunch of different stuff, threat modeling, code review, you name it. Now I'm mostly working at kind of the architectural and organizational levels.
Jez Humble: So, I guess, I first came to know about you when you were working at Etsy.
Eleanor Saitta: Yes.
Jez Humble: I was talking about continuous delivery. You were talking about the work that you were doing at Etsy. And it's kind of weird, I just watched your talk, which was amazing. So, you know, you should definitely go and watch Eleanor's talk when that comes online.
Eleanor Saitta: Thanks.
Jez Humble: And nodding along vigorously, it's all of that. It turns out that there's kind of a strong connection between those kinds of ideas about continuous delivery and what it enables in terms of being able to build secure infrastructures. So, I was wondering if we could start with you talking a little bit about the overlap between those two different worlds.
Eleanor Saitta: I think in a lot of ways, security and reliability and resilience and all these things are really tied up into each other. There are other ways to get to a secure system, but it certainly feels like the most cost-effective and functional version of that is building a system that's heavily declarative. So, you have infrastructure as code, you have all this stuff that is ephemeral, that is immutable. You roll out containers or serverless jobs, you know, whatever your particular version of that infrastructure is. But you don't have kind of long-lived system images that are kind of configured in the field over time that slowly change and mutate. You just start with a clean slate, etc. All of the same stuff that we do around reliability and, like, observability. Can we verify the state of things in production and ensure that they're kind of continuously checked against what they should be? All of these things are tools that we use for reliability reasons, for velocity reasons, for a lot of other reasons, but they're also really good tools at building out the kinds of secure systems that work with those other goals.
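The idea of continuously checking production against what it should be can be illustrated with a small drift check. This is a minimal Python sketch, not something described in the conversation: the setting names (`image`, `read_only_rootfs`, `ssh_enabled`) and the `find_drift` helper are hypothetical, and a real system would read the declared state from its infrastructure-as-code tooling and the observed state from the running environment.

```python
from dataclasses import dataclass

# Placeholder for a setting that exists on one side only (hypothetical marker).
MISSING = "<missing>"


@dataclass(frozen=True)
class Drift:
    key: str
    declared: object
    observed: object


def find_drift(declared: dict, observed: dict) -> list[Drift]:
    """Return every setting whose observed value differs from the declared one."""
    drifts = []
    # Walk the union of keys so settings present on only one side are reported too.
    for key in sorted(declared.keys() | observed.keys()):
        want = declared.get(key, MISSING)
        have = observed.get(key, MISSING)
        if want != have:
            drifts.append(Drift(key, want, have))
    return drifts


if __name__ == "__main__":
    # Hypothetical example data: what the config says vs. what is running.
    declared = {"image": "web:1.4.2", "read_only_rootfs": True}
    observed = {"image": "web:1.4.2", "read_only_rootfs": False, "ssh_enabled": True}
    for d in find_drift(declared, observed):
        print(f"{d.key}: declared={d.declared!r} observed={d.observed!r}")
```

An empty result means the running system matches its declaration; anything else is either a change made in the field or a gap in the declaration, and both are worth knowing about.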
Jez Humble: Absolutely. I think that's so important. Something that a lot of people miss is that there's so much commonality. What's interesting about Etsy is a lot of the principles you talk about, particularly around ephemeral machines and declarative infrastructure, which are typically applied to the cloud. But at Etsy, of course, you are using bare metal. How do you apply those principles in kind of a bare metal environment?
Eleanor Saitta: It's interesting. It's a lot more complicated. I have another one of my clients right now who is also on bare metal, and there are a lot of new challenges. I think one of the reasons is we've got so used to cloud systems that those parts of the stack have kind of atrophied a lot. So, you end up having to do a certain amount of rolling your own. But, I guess, the goals that I want, like, you still want your workloads sandboxed. You want whatever's actually booting on the machines to be as close to read only as you can. You wanna make sure that it's refreshed, you know, fairly rapidly. So, that means that you still need to have immutable workloads and all the other stuff.
I think the nice thing about building on bare metal is, although you're kind of stuck filling in a lot of the parts of the stack, you can also make your stack a lot flatter. I kind of joke at some point that like the "Hello World" webpage, if you wanna just stand up a webpage that says "Hello World" on it, but you wanna do it properly and scalably and all these things, that's a 200-person engineering team. Because the stacks we've built are... assuming that you need to build it to scale and you need a data pipeline, you need all these things around that "Hello World," which, you know, you don't, but, like, if you did. And because stuff has just accreted over the years, we haven't ever really gone back and sort of squashed the stacks back down, and been like, "Okay. You know, why do we have separate control planes for deploy, cloud-config, job-config, and, like, cluster-config?" Surely that could probably be one control plane. But then, it's either a lot of vendor lock-in or there are reasons some of those splits exist, but they also massively increase complexity, which means they massively hurt security.
Jez Humble: Absolutely. And that was another theme that you talked about in your talk today. So, conversely, it's possible to find people who are all in the cloud but have built a cloud environment that follows old-school data center principles.
Eleanor Saitta: Sometimes it's the 1990s again. It's really weird when you run into one of those. They're pretty rare at this point. I've seen them a couple of times in the past few years. Most people have kind of gotten the memo. Often I will find in smaller companies, you have some, like, long-lived VMs, and then you've got maybe the start of a little Kubernetes deployment over here, but it only actually runs two jobs. There are like three nodes in there. And so, it's often a bit of a mix because they're still sort of stuck partially between, "We have to ship features," and partially in the, like, "Well, we're trying to build out good infrastructure," but often, like, if you're lucky, there's somebody who's a staff or a principal engineer who's worked either at one of the big companies or at kind of a mid-tier, who has a real infrastructure that they're evolving towards. But I find that technical co-founders are often more from the problem domain than they are from, you know, the engineering domain kind of, you know...yeah.
Jez Humble: I mean, and you often come into companies when they've been doing stuff for a while and they haven't really thought about security.
Eleanor Saitta: I mean, that's generally when I come in is they're like, "Oh, right. We don't know where to start, but we know it's time to start doing something."
Jez Humble: When you come into that situation, you take a look around. What is it that you see that makes you think, "Oh, this isn't gonna be so hard after all?" And then conversely, the follow-up question is, what do you see and you're like, "Oh, no, this is gonna be a total nightmare?"
Eleanor Saitta: I think there's three things, maybe four things. On the technical side, it's some platform choices kind of, you know, are they using what you kind of think of as internet standard tooling? You know, if you're on a cloud provider, are you on one of the major cloud providers? You know, do you have some IaC footprint? Where are you on that kind of stuff? And then, you know, do you have a pile of Windows boxes with no central management on your laptop side? Or are you on Macs and maybe you've even already got, you know, one of the management tools in or something like that? And it's also the amount of legacy, right? You know, if it's a 20-person dev team, but they've been running for 2 years and things seem like...and they have documentation.
It's always a bad sign when it's like, "Okay. Can I have a network diagram?" And they walk towards the whiteboard, which is probably two-thirds of the companies I work with, to be clear. It's not necessarily bad, but it's not a good sign. If you've got a lot of legacy stuff and a lot of legacy complexity that's poorly documented, that's a hard position to come back out of. And then it's kind of, do the execs care? Do the execs care and also care enough to get the work done? Because I've seen a bunch of companies where the execs really cared, but they still weren't gonna actually prioritize the work. And sometimes that was the right call because, you know, they had a very complicated market position and they needed some features to keep the company viable and that kind of thing.
One of the things that I will tell, especially small startups, is you can actually kick this can down the road a while, like, the security can. Don't kick the architecture can down the road because that's really expensive to fix later. So, think about how you're building stuff. You can wait a little while to secure it as long as doing that isn't putting your customers at risk, right? If you're putting people in the world at risk, then you need to fix stuff, you have, like, ethical responsibilities there. But if it's not hurting anybody, and you're still trying to find product-market fit, you can wait a while before you build a real security team out. Just make sure you're really sure about the first one.
What makes a good architecture from a security perspective
Jez Humble: Then you talked about architecture. What are you looking for in, like, a good architecture?
Eleanor Saitta: Someone should know what it is. It sounds like a joke, but seriously...
Jez Humble: No, I believe you.
Eleanor Saitta: ...there should be an architecture, things should be intentional. Ideally, everything follows the same architecture. One of the things that ends up being the biggest problem for a lot of my customers is that they simply can't change their ecosystem very quickly, so, you know, we end up spending a lot of time getting them to the point where they can evolve stuff so then they can start securing it. You can do some security improvements along the way, but a lot of it is, you know, you're gonna have to change your architecture over time. And whether it's because you've got a super-tightly tied monolith that has, like, a lot of expectations about the context in which it's deployed, or you have 17 different service architectures for your...you know, there's a lot of ways to do that wrong.
Jez Humble: To screw up.
Eleanor Saitta: Every happy network is the same. Every unhappy network is different.
Jez Humble: Oh, my God.
Eleanor Saitta: A lot of it is really just making sure that you know what the architecture is, that it's sort of sensible, and that you can change it, that you can actually evolve and fix stuff.
Jez Humble: This is where, like, the continuous delivery, kind of linking comes in, right?
Eleanor Saitta: Because if you have stood up machines and then changed them in the field, you can't necessarily regenerate that configuration very easily. Sometimes you can't regenerate it at all. It's just like, if this box dies, that's kind of the company right there, you know. That is not a fun place to be in. But if you're at the place where it's like, "Well, yeah, there's a build script, and the build script, we can point it at an empty GCP project or, you know, an empty AWS account, and it will do the whole thing if it has to." Like, that's kind of the dream. And it's very rare that you're gonna get there, but, at the very least, any given service should be that way. It should be trivial to set up a new version of a service. Ideally, you get it to the point where you've got a whole...like, you know, you can stand up a prod replica just by saying, "Okay. Deploy all of these. You know, deploy another instance of these." Even if some of the larger stuff is more manually tweaked or is, like, you know, a separate kind of thing. Because that's also, like, this makes a big difference for debug-ability and also recoverability and all this other stuff.
Jez Humble: It's like, "Can I build a test environment? But also, if I have a catastrophic failure, can I also, like, restore service?"
Eleanor Saitta: Yes.
Jez Humble: And it's kind of the linkage between these two things.
Eleanor Saitta: How many weeks is it gonna take you to get things back up? And how many weeks of money do you have if everything is down? And ideally, one of those numbers is bigger than the other. And you know what both of those numbers are. Most small companies do not know what those numbers are.
Jez Humble: Well, that's a bit scary.
Eleanor Saitta: Most big companies also don't know what those numbers are, but in a very different way.
Jez Humble: Okay. Say more.
Eleanor Saitta: Well, I mean, does Google know what the number is for, like, you know, if every data center caught fire? I mean, I would argue that supply chains make that unpredictable at best.
Jez Humble: Right. No, that's definitely...
Eleanor Saitta: Or if they had to wipe everything at once.
Jez Humble: That has actually happened.
Eleanor Saitta: I'm sure they've thought about it, but, like, still actually figuring out what that number is becomes very difficult.
Jez Humble: No, that's true because there's so many different layers and so, so much complexity too.
Eleanor Saitta: Yes.
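The comparison of the two numbers from earlier in this exchange, weeks to get things back up and weeks of money if everything is down, comes down to simple arithmetic. Here is a minimal Python sketch of it. The function names and the figures in the example are hypothetical, and it makes the pessimistic assumption that no revenue comes in at all during the outage.

```python
def weeks_of_runway(cash_on_hand: float, weekly_burn: float) -> float:
    """Weeks the company can keep paying its costs with zero revenue (assumption)."""
    if weekly_burn <= 0:
        raise ValueError("weekly_burn must be positive")
    return cash_on_hand / weekly_burn


def survives_outage(recovery_weeks: float, cash_on_hand: float, weekly_burn: float) -> bool:
    """True if the money lasts longer than the estimated time to restore service."""
    return weeks_of_runway(cash_on_hand, weekly_burn) > recovery_weeks


if __name__ == "__main__":
    # Hypothetical figures: 600,000 in the bank, 50,000 spent per week.
    print(weeks_of_runway(600_000, 50_000))      # 12.0 weeks of money
    print(survives_outage(6, 600_000, 50_000))   # rebuild takes 6 weeks
    print(survives_outage(16, 600_000, 50_000))  # rebuild takes 16 weeks
```

The arithmetic is the easy part. The hard part, as the conversation points out, is having a recovery estimate you actually trust, which usually means having rehearsed the rebuild at least once.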
Security hazards – The battle for better IT hygiene
Jez Humble: Sure. So, as a kind of fractional CISO, I think there's two elements to that, and you can correct me if I'm wrong. One is, like, the service that's being provided.
Eleanor Saitta: Yes.
Jez Humble: One is...
Eleanor Saitta: The actual technical architecture, etc.
Jez Humble: Right. And then the other part is, like, the internal IT, like people are doing their work.
Eleanor Saitta: Yes.
Jez Humble: Which keeps you awake at night more?
Eleanor Saitta: It really depends on the company. Most of the time it's gonna be the laptop fleet that gets people owned to start with. It's that a lot of those vulnerabilities are easier to exploit. They're right there and, like, you know, a lot of the time companies in that early phase don't have any of the internal protections. So, once I own an engineer's laptop, that's it. I've got all of prod, and it's gonna be a lot easier to, like, spear-phish that engineer than it is to necessarily be...like, especially if it's, like, even kind of half-assed modern dev, but like, stuff is patched. Stuff is, you know, we don't have random ports open everywhere. Like, it's probably a lot easier to own an engineer's laptop.
I'm sometimes brought in to specifically look at the product side of things, and then I have to kind of push back and be like, "I will totally do this for you, but we're also gonna look in the closet over here and find the skeletons because those are what's gonna actually get you." I don't think it gets more than 50/50 ever. I think half of your risk is always gonna be on the IT side. You know, because it's, like, what was it, the last biggest Twitter breach was also a random engineer laptop. And, I mean, Twitter's maybe not a great example anymore, but...
Jez Humble: The last breach was fucking Elon taking over.
Eleanor Saitta: The last other kind of breach. But I think, you know, you can be a fairly big, pretty well-managed tech company and still be in the position where if you compromise an engineer laptop that can direct...if it's the wrong engineer. Like, Google is probably, I'm guessing, AWS and Apple may also be in a place where there's no single engineer's laptop that you could compromise that would result in a data breach of client data. I'm not sure that there are that many other companies that are in that position, you know.
Security tips and tricks
Jez Humble: What would be your basic advice to people for, like, basic hygiene around that?
Eleanor Saitta: Basic hygiene. Get rid of all the Windows boxes, and get rid of all the Office installs because there's still a ton more malware floating around for Windows than there is for OS X.
Jez Humble: Let me stop you there for a second. My wife works for a nonprofit. It's all Windows. It's all Microsoft. And the pushback you're gonna get is, but that's what people know. How are people supposed to do their work if...?
Eleanor Saitta: Well, it turns out that there's a whole lovely industry of people who will train them how to use the other tools. And literally, that's often one of the things, is that, like, you know, you will have...I don't know, 75% of the staff are like, "Yeah, whatever, I've used both. It doesn't matter." And then there's gonna be a chunk of folks who are, you know, maybe incredibly skilled. I see, you know, whatever. It's not that they're, like, tech illiterate or anything like that, they just don't have the experience and that's, you know, scary, or they don't know that they can use the tools, "Can't do what I need," or, you know, that kind of thing. And for the most part, they can now. You know, I mean, it's just a question of doing the work to be like, "No, this isn't a negative, hostile thing. I'm not taking your stuff away. We're gonna help make sure that you can do exactly what you need to do." And then sometimes, like, I have clients who have to have Windows boxes because they have some ridiculously expensive robot. And there's one company that makes the robot, and that machine, that needs a Windows box. So, sometimes you're stuck with that, but you can still reduce the attack footprint everywhere else.
Jez Humble: Okay. So, get rid of Windows, get rid of Office.
Eleanor Saitta: Get rid of Office because, like, Office files are still the number one malware delivery file. Get rid of Adobe Acrobat because that's the number two malware delivery channel. Like, slightly less now, partially because it's gotten less common and it has improved a lot, so has Office. If you really have to have Office around, you're gonna be buying E5 licenses and turning all the lockdown tools on, and then you end up with an Office environment where it's very difficult for people to get work done, you know. Or you could just use Google Docs. And I'm not saying this because you work at Google, it's because it's genuinely the...
Jez Humble: Well, thank you anyway.
Eleanor Saitta: It's genuinely the least terrible option right now.
Jez Humble: And then so you've done that, what's next?
Eleanor Saitta: You should have backups. The backups should actually work. You should try restoring the backups and make sure they work. This is not a small team problem. Like, there are plenty of hundred-plus engineer companies that have critical data that they cannot replicate, which is not backed up, and/or backups that they've never tested. So, the sooner you do that, the better. And then, you know, make sure that you keep doing it. But, that's probably the next big thing. And kind of once you've got those three...I guess, the other big one is to get YubiKeys in, right? Get some kind of U2F token. We'll see on the passkey stuff. It's new enough and I don't have a good enough handle on, like, the deployment scenarios and the failure modes and that kind of thing. Like, I'm super excited by it, but I don't feel like I have a good instinct for the places it's gonna break yet. But get YubiKeys, and make sure that everyone is using them in U2F mode. If you are using Workspace, maybe consider asking everybody on, like, engineering, exec, legal, and HR to turn on Advanced Protection mode. And then if you do this, it means you just get to stop thinking about credential phishing because, you know, the tools are gonna take care of it for you now. You literally just get to stop worrying about it.
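The advice to try restoring backups and make sure they work can be automated as a routine check. The following is a minimal Python sketch under stated assumptions, not a tool mentioned in the conversation: it treats a backup as a zip archive of a directory, and the helper names (`backup`, `restore_matches_source`, `tree_digests`) are hypothetical. A real setup would restore from wherever its backups actually live, into an isolated environment.

```python
import hashlib
import tempfile
import zipfile
from pathlib import Path


def tree_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def backup(source: Path, archive: Path) -> None:
    """Write every file under source into a zip archive (kept outside source)."""
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for p in sorted(source.rglob("*")):
            if p.is_file():
                zf.write(p, p.relative_to(source).as_posix())


def restore_matches_source(source: Path, archive: Path) -> bool:
    """Restore the archive into a scratch directory and compare it to source."""
    with tempfile.TemporaryDirectory() as scratch:
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(scratch)
        return tree_digests(Path(scratch)) == tree_digests(source)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as data, tempfile.TemporaryDirectory() as store:
        src, archive = Path(data), Path(store) / "backup.zip"
        (src / "orders.csv").write_text("id,total\n1,9.99\n")
        backup(src, archive)
        print("restore OK:", restore_matches_source(src, archive))
```

Comparing against the live source only works for data that has not changed since the backup ran. For data that changes constantly, a scheduled job would more likely compare the restore against digests recorded at backup time.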
Anytime there's that kind of a serious security issue, where you can just get rid of the security issue and take it off your radar, great. If you've stopped everybody from using the file formats that are most likely to contain malware, and you've stopped everybody from being in a situation where they can give their credentials away because their credentials are tied to hardware, now you get to stop thinking about phishing. I mean, like, yeah, you still have to deal with, like, wire fraud scams and stuff, but that's a different problem, right? The phishing problem is you have staff who are paid to click on things because that is literally their job is to click on things in emails and do what emails say. And then you're angry at them because they clicked on things in emails and did what the emails said. You cannot win this one. You cannot train your way out of, like, an inherent security vulnerability in the structure of your ecosystem, and if you try to, you're just gonna piss people off.
The number of, like, phishing training campaigns that I've seen that left people really distrustful of IT and security is huge. You know, this doesn't help you. So, instead, go actually solve the problem, make it safe to do what people's job is. Do not make their jobs inherently dangerous. So, yeah. So, you know, if you get those two things in...like, again, that's most of your compromises are gonna be one of those two things. And it also means you don't have people complaining that they keep getting woken up by, like, 2FA auth spam, you know. Like, that was that Twitter breach. Somebody just 2FA auth spammed an engineer like a hundred thousand times, literally, and eventually, like, emailed them and said, "Well, I'm just gonna keep doing this until you click Yes." And eventually, the guy just clicked Yes to make it stop.
Jez Humble: Wow.
Eleanor Saitta: There's a lot of other failures in that context about security awareness and, you know, why didn't this get...? There were many layers to something like that happening. But small versions of that happen all the time, you know. You don't want it to be, like, the employee's used to always clicking Yes, so they click Yes without thinking about it. They're like, "Oh, I didn't just do a thing." Or they did just do a thing that should have triggered that, but they got the wrong one, you know. Like, we can just not have these problems. We can have nice things.
Jez Humble: Yeah. It's possible.
Eleanor Saitta: Yes.
Jez Humble: So, we talked a bit about the IT side.
Eleanor Saitta: Yes.
Security in early-stage startups – getting security hygiene right
Jez Humble: Earlier on you said that the Hello World takes 100 people or 200 people. Let's talk a bit about, kind of, basic hygiene and, kind of, how you get set up. Say you are starting a startup, and you wanna build something, what are, you know, the top few things you would say that you need to get in place from the beginning that is gonna save you a ton of pain and misery later on?
Eleanor Saitta: I mean, literally just running down the stack. The first thing is how are you gonna deploy stuff? You know, how are you gonna deploy stuff and how are you setting up infrastructure? I'm a fan of Terraform mostly because it seems to be the least platform-locked and kind of the least terrible option. But have some kind of IaC system. Don't just start setting stuff up by hand. Have an actual deploy pipeline. It doesn't matter which one, you know, but have some kind of deploy pipeline. Have some kind of infrastructure automation from ground zero. And get SSO set up for whatever your cloud environment is so you have like some kind of auth structure that makes sense, you know. And then as you're building stuff out, especially if you're actually starting from scratch, start working in containers from the start. You know, don't work with long-lived VMs unless you have specific requirements that require either hardware boxes or long-lived hosts. Like, those are genuinely pretty rare, you know.
"We need a Hadoop cluster, which has to be long-lived." Okay. But can someone else run it? Can you make it... do you need to build that authentication, or can you make it the problem of someone whose job it is to build that authentication who is probably better than you are? Those systems have costs. And if you're in a, like, low-margin per end user, like, you know, the scaling problems are real. However, you're probably not actually hitting those scaling problems until after you've figured out product-market fit, right? So, fine, outsource auth until you figure out if there's a company there. Once there's a company there, if it's a company where Firebase or whatever you're using costs too much per user, then bring it in because, at that point, you're probably at least closer to being able to do it competently.
If it's not your core competency, don't do it, you know. Like, you know, one of the things, at some point you need log centralization, right? You're gonna have to stand up indexers and ingest and all this stuff. Is it your job to run an ELK cluster? No. It's probably not your job to run an ELK cluster, so don't do it because they eat teams. You know, I've seen, literally, a 10-person team, not just, like, you know, entirely consumed with that work, but everybody on the team quit.
Jez Humble: Oh, my God.
Eleanor Saitta: Because it was that much of a fire. So, just let somebody else do it. It's expensive, but you know what you're paying for, you know. Especially on small teams, evaluating the cost of hiring is a real thing, you know. And I'm not like...You know, there is also a time when it is time to start insourcing stuff. You know, there's a time when it doesn't make sense to give AWS $100 million a month or, you know, whatever your AWS bill is. Actually, the cost structure around running your own hardware is more complicated than people think, but also not necessarily as bad as people think. But that's not necessarily when you're starting out. Like, don't start out there unless you're actually starting with 200 engineers, yeah.
Jez Humble: Having a $10 million cloud bill is actually a good problem to have because it means that...
Eleanor Saitta: It means you could pay a $10 million cloud bill. Hopefully, it means you can pay a $10 million cloud bill. If you were expecting it to be $10 million, it's probably a good thing.
Jez Humble: Right.
Eleanor Saitta: Now, you can quit. Now, you're at the point where, okay, you can think about cost savings and structure and all this kind of stuff, but you gotta get there first, you know. But I think building cleanly, documenting as you go along. It's nice if you have business rules about how the company runs that are, like, you know, if you're running a merchant site or something, but probably write those down. In general, you do not...Like, the authoritative reference is always gonna be the code. However, you don't want it to be the only reference. I am a huge fan of orientating documentation that's not trying to capture all the details because, again, the code is always gonna be authoritative, but it will help you bring people up to speed faster. It will tell people what is where in the ecosystem. Like, what is this? What does this service do? I don't know. You know, and I mean, this is possibly also, like, I don't know, trauma from my early days as a code auditor when I would literally get handed a zip file. No docs, no running instance, literally, just a zip file of uncommented code.
Jez Humble: Oh, my God.
Eleanor Saitta: Find bugs. And then you had to eventually learn how to do that. But it's a lot of, like, learning, you know, the kind of CIA, the men who stare at goats kind of like you stare at things then eventually, three years later, patterns start to emerge kind of thing. The first three years are...yeah, but...
Jez Humble: Painful.
Eleanor Saitta: Painful and weird. But you can avoid this by just having some basic documentation. Among other things, it's really good for equity because, you know, if you have employees from underrepresented minorities, you know what they get penalized for a lot more than white dudes? Asking questions, especially questions of things they should know. So, you're literally gonna make it easier for them to succeed in your company, which means that you have a more diverse team, which means that you have a team that's better at solving weird problems. Because, you know, really homogenous teams are not actually very effective a lot of the time.
Jez Humble: Yeah. Although they are very good at agreeing with each other.
Eleanor Saitta: They are, really. And being wrong quickly. I mean, the other, like, it speeds up onboarding, right? If you have a company that you know is gonna grow, why would you not do the things that you can do early on to make it easier for you to grow later? Like documenting stuff.
Jez Humble: That's a commonality with both equity and with kind of getting things right from a process perspective is like...
Eleanor Saitta: Absolutely.
Jez Humble: ...do the things. Don't wait. Like, do...
Eleanor Saitta: Do the things that are cheap to start with and very expensive later, you know, and stuff like...At some point you need to deal with security detection, right? It's expensive to build out a security operations center, you know, so you're probably gonna want managed detection first. And even that, if you're a really early startup and you're not gonna put users at risk, again, that caveat, you know, you can kick the can down a bit, you know. And, like, you don't want AWS's security team to be your notification service, but, you know, it's better than going bankrupt, you know. And you can just sort of bolt that on later, you know.
Jez Humble: I think, you know, the write things down thing as well. So much of auditing is actually just making sure you are doing the things you said you were doing. And so, if you write down the things that you say you are doing, it's gonna make auditing a lot easier.
Eleanor Saitta: Absolutely. And, I mean, I do say that in a lot of contexts, security certifications are primarily a marketing expense because...not that they don't serve well. Okay. If a security certification is the thing that is forcing you to do security work, you should change your fundamental attitude towards security first. Do the stuff that you actually need to do and then worry about compliance. But they're primarily a tool for making the state of your security system legible to customers. And they're super important for that. But that's primarily what they do.
Jez Humble: If you're doing it right, you're gonna find it's actually pretty straightforward. Well, it's been an absolute pleasure as always. Thank you so much for sharing your wisdom. How can people find you if they want to know more?
Eleanor Saitta: Yeah. It's been lovely. If you wanna know more, you can go to structures.systems. I'm ela@structures.systems, and I'd be happy to talk more.
Jez Humble: Fantastic. Well, thanks very much. Make sure you catch Eleanor Saitta's talk when it comes up. And, thanks to GOTO for hosting us.
Eleanor Saitta: Thanks for having me.