Expert Talk: War Stories from Moving to the Cloud
Are you a developer ready to embark on your cloud journey but feeling overwhelmed? Fear not! The benefits of the cloud far outweigh the initial struggles. With automation and proper monitoring, you can avoid sky-high bills while elevating your company and user experience to new heights. Don't miss out on the opportunity to learn from Lorna Jane Mitchell and Holly Cummins as they share their practical war stories from their own cloud migration and operations. Join us and take your development game to the next level!
Cloud Chaos: most frequent issues
Lorna Jane Mitchell: Welcome to "GOTO Unscripted." I am here at GOTO Copenhagen, and I'm happy to be sharing some chats with you today about some of the cloud topics. My name is Lorna Jane Mitchell. I work in developer relations for a company called Aiven. We do cloud databases. And I'm going to be chatting today with Holly Cummins.
Holly Cummins: Thank you. I'm Holly Cummins. I work for Red Hat. I'm a senior principal engineer in the Quarkus team. So, basically, I help build Quarkus, which is kind of awesome.
Lorna Jane Mitchell: That is kind of awesome. I am a big fan. I saw your talk earlier where you were telling us about cloud chaos, and I especially liked your point that a lot of people who start with cloud get some kind of shock or surprise when they start those projects. What sort of things do you see people run into?
Recommended talk: Cloud Chaos & Microservices Mayhem • Holly Cummins • GOTO 2022
Holly Cummins: All sorts. Part of it is the cost. And there are two sorts of challenging aspects of the cloud. One is the fact that it costs money at all, but the second is that the costs are much more difficult to manage than they perhaps were with on-premise where it was a one-time investment, and then you had it, and then you didn't need to worry about it. With the cloud, what we see is sometimes organizations find that their cloud bill is quite large, and they didn't really know where it was going.
I think probably everybody's had the experience that you get the e-mail from the department saying, "We've looked at our cloud bills, and we're a little bit uncomfortable with this. And we're particularly uncomfortable because we don't actually know what any of it is or if any of it is useful. And could you please look at what you've provisioned, and if you're not using it, de-provision it?" And it's quite a manual process, which is sort of surprising because provisioning things is a very well-solved problem. You can automate it. You can do it in just one click. De-provisioning things, we're still kind of figuring out what the right way to do it is. You can do it in one click, but only if you remember to do it.
Lorna Jane Mitchell: That sounds really risky to me. Click here to delete all your servers.
Holly Cummins: Well, exactly. I've seen similar things with auto-scaling, actually, where I think we have a bias toward safety, which is probably quite right and proper. But I was looking at it when I was considering sustainability. And ideally, you want something to be very elastic, and it scales up, and then it will scale down. But most of the auto-scaling algorithms have a tendency to scale up very eagerly because we don't want users to be disappointed, but they have a tendency to be reluctant to scale down because we don't want users to be disappointed. And so that's probably the right side to err on, but it does mean that we have this sort of inflation of costs and we don't really know why. It just sort of ends up being really difficult to manage, and it ends up being really difficult to reason about because, at a high level, you can say, "Okay, well, this department uses this much," but then you suddenly end up in the weeds trying to sift through. I was in a meeting once with a UK bank, and the topic of the meeting was, "Why are we spending so much on the cloud?" And it was a three-hour meeting where we just sort of went through with their...I was a vendor, so they went through with their vendors sort of saying, "And what's this bit, and what are we doing with that?" And honestly, it was so boring. So, we wanted to be polite because we were a vendor, but it was not the most exciting meeting I've ever been to in my life. But we're still trying to find a better way.
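The asymmetry Holly describes, eager to scale up and reluctant to scale down, can be illustrated with a minimal sketch. The class name, thresholds, and tick counts below are hypothetical and chosen only for illustration; they do not come from any particular cloud provider's auto-scaler.

```python
from dataclasses import dataclass


@dataclass
class AsymmetricScaler:
    """Hypothetical policy: add capacity at once, remove it only after sustained low load."""

    scale_up_at: float = 0.75    # utilization above this adds an instance immediately
    scale_down_at: float = 0.30  # utilization below this counts as one "quiet" sample
    quiet_needed: int = 5        # consecutive quiet samples required before removing one
    min_instances: int = 1
    instances: int = 1
    quiet_seen: int = 0

    def observe(self, utilization: float) -> int:
        """Feed one utilization sample (0.0 to 1.0) and return the new instance count."""
        if utilization > self.scale_up_at:
            # Eager path: a single busy sample is enough to grow.
            self.instances += 1
            self.quiet_seen = 0
        elif utilization < self.scale_down_at:
            # Reluctant path: shrink only after a run of quiet samples.
            self.quiet_seen += 1
            if self.quiet_seen >= self.quiet_needed and self.instances > self.min_instances:
                self.instances -= 1
                self.quiet_seen = 0
        else:
            # Moderate load resets the quiet streak without changing capacity.
            self.quiet_seen = 0
        return self.instances


if __name__ == "__main__":
    scaler = AsymmetricScaler()
    samples = [0.9, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1]
    print([scaler.observe(u) for u in samples])
```

With these assumed settings, two busy samples add two instances, while it takes five consecutive quiet samples to remove just one. That gap is where the unexplained cost inflation tends to accumulate.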
How can the cloud help with scaling?
Lorna Jane Mitchell: I do see those adventures of people scaling up and down, and I think there are some really good lessons there. I saw a great talk a few years ago about a company that was spinning up its developer platform just between 8:00 and 6:00 on weekdays. And so that saves a lot of money, right, because your services are not running all the time, but also you learn all the ways in which your cloud services coming up in the wrong order can go wrong, which is brilliant for disaster recovery. And I think that move from "We must keep the fires burning all the time and never allow the flame to go out" to be, like, "Oh, just make another one" is quite a big contrast. Is that something you see as well with the cloud change journey?
Holly Cummins: So, I've seen both sides. So, I've seen, exactly as you say, some companies are getting really great results by... We used to have this mindset of we've gotta get it up, and then once it's up, we're gonna step slowly away and we're not gonna touch it because we know bad things. I think we all still remember that if there was a power cut, then the data center would take days to come back up, and people would be having to sort of caress the servers and give them presents to make them come back up. And I think if you can get away from that mindset of fear to one of...GitOps may be a way to do it, there are other sorts of related techniques, but this idea that we have it as code, we trust the automation, we make sure that, in order to have that disaster recovery where if it goes down, we can come back up, we go back down and come back up as often as possible, because it's sort of the rule for everything...well, not everything, but there's quite a lot of things in IT, like, releasing and, like, spinning up systems. It's like brushing your teeth. If it hurts, that doesn't mean you should brush your teeth less. It means you should brush your teeth much more often. And it's the same for bringing systems up and releasing. If it's kind of painful and kind of horrible, that shows you don't have enough automation, and it shows that there's maybe something that's a little bit broken. You are going to need that, so practice it now when it's not a disaster. And then, in the ideal world, you get this sort of double win where you've got your disaster recovery, you've got your cost reduction, and as well you've just hopefully saved a whole bunch of tedious work for some of your people.
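As a rough illustration of the weekday schedule Lorna mentions, here is a minimal sketch of the check a scheduler might run, plus the arithmetic behind the saving. It assumes "8:00 and 6:00" means 08:00 to 18:00 local time; the function names are hypothetical and not taken from the talk she refers to.

```python
from datetime import datetime


def should_be_running(now: datetime, start_hour: int = 8, end_hour: int = 18) -> bool:
    """Return True if the platform should be up: Monday to Friday, start_hour to end_hour."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and start_hour <= now.hour < end_hour


def weekly_uptime_fraction(start_hour: int = 8, end_hour: int = 18, days: int = 5) -> float:
    """Fraction of the 168 hours in a week during which the platform is running."""
    return (end_hour - start_hour) * days / (24 * 7)


if __name__ == "__main__":
    print(should_be_running(datetime(2024, 1, 1, 9)))  # a Monday morning
    print(should_be_running(datetime(2024, 1, 6, 9)))  # a Saturday morning
    print(round(weekly_uptime_fraction(), 3))
```

Under that assumption the platform runs 50 of the 168 hours in a week, roughly 30% of the time, and it is brought up from scratch five times a week, which is the frequent practice Holly recommends.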
Recommended talk: Architecting For Scale • Lee Atchison & Ken Gavranovic • GOTO 2021
Automation in cloud
Lorna Jane Mitchell: And I think the automation, we learn to trust it, and that does take time, but letting the machines, I would say, handle the boring parts and just keeping us honest, catching things that are difficult for a human to see at review time, but the things that we can check with the machine I think is really valuable, and I think there's a lot that we can do there in terms of the automation and picking the things that we really don't want to go wrong and looking at how do we include them. I think you had a nice comparison of the two systems that worked perfectly, but one had changed in a way that meant it could no longer communicate with the other one. So, is that the thing that you see people...are we getting better at this as an industry, or is it still something we need a bit more education for?
Holly Cummins: I think we'll probably always be pushing at automation. And I'm sort of surprised now that there's... Because computers are becoming smarter. There are things that we couldn't have automated a few years ago that we can now automate, which is nice. Even simple things, like I was impressed in your talk when you were talking about Vale as a way of having linting, which is a concept that's so familiar to developers, but for prose, but having it be on the command line rather than having to fire up a word processor or something like that. Recently, I've been working on a document, and so I sent it out to the team for review. I'm normally a fairly compulsive proofreader, and I was shocked at how many things came back of, "You missed a letter here, and you've got an extra here," and these kinds of things. And I shouldn't be using the time of my teammates who are highly experienced engineers to proofread and find my typos. It should be something that was more automated. And so if that's true for prose, then it's definitely extra, extra true for the problems in your code that you don't want your users to be finding, because if you don't have the automation in place, if you don't have that quality in place, then your users are gonna be the ones that are finding the problems, and in the worst case, they won't tell you. They'll just leave.
Lorna Jane Mitchell: Yes. Or you're seeing it in the logs. You've already deployed it, and then it's like, "Oh, there's a lot of red on the graph now. Okay."
Holly Cummins: And then that's when you need to have that ability to release quickly. So, you don't want to go through...what often happens, in that case, is either you know you can't go through the release process, so you do some sort of horrible hack and you SSH into the machine to do the patch because the release process is too painful, which is really bad, or it takes several days of sort of spinning the wheels and getting the approvals to fix something that ideally should be patchable quite quickly.
Lorna Jane Mitchell: Yes. I think I've been really lucky on the docs platform that we're very just deploying continuously. Just fix it, review it, merge it, it's live, done. So the turnaround time is really quick. I think that's something that has changed over time, though, in the industry where at one time, as I said, we did everything to keep those fires burning, and again, it's that "Just deploy a fresh one." Does this new one work? Great. It's our live build now. And seeing teams adapt to that, I think, has been a big part of the cloud challenge.
Holly Cummins: Yes. We used to do such a lot of work to try and patch things and to try and find a way of applying a change to a live system in a way that wouldn't ruin everything. Now, because we have automation in place to allow us to spin things up so easily, we really can throw the old one away and bring up a new one. It's so much easier, and it's so much nicer, and it's so much safer. And then it can be quite beneficial for security as well because then you have the "I'm a bit nervous about that one. Throw it away, we'll get a new one. It's clean. It's uncontaminated."
Lorna Jane Mitchell: Yes. And if it's just, "Oh, this one seems to be a bit low on memory. This one seems to be a bit unhappy. Let's just replace that node while everything else is running well." But I think that's quite a difficult concept if that's not where you're from, but at the same time, these tools are here to help all of us. So even the people who've been in the industry and are accustomed to doing it one way, I think I am seeing people adapt and find their way, and really embrace some of the new techniques. I think, especially with all the different cloud services, you compose your application, so people are...they have to learn to trust third parties running some of these things, not just, "I can't see and touch the server, it's an imaginary space," but I am trusting someone else to run this service for me and to manage it. So, we're doing managed databases. And I think that can be a leap of faith for modernization projects as well.
Holly Cummins: It can, because it means we need to learn new techniques for things like debugging, for example, because I was talking in my talk about how all of us had the experience when we first ran something on the cloud. And then we wanted to get the logs for an instance that had died and you cannot get those logs. That instance is gone. So, you need to sort of do a little bit of thinking in advance about, "Okay. What's my disaster recovery like? What's my diagnostics like?" It works really well. It's just not the same as it used to be.
Monitoring in the cloud: OpenTelemetry
Lorna Jane Mitchell: And do you see good adoption of that, people having better practices with log shipping or some of the OpenTelemetry type tooling?
Holly Cummins: I see a lot of good intentions around OpenTelemetry, which I guess is better than the alternative of not even having any intentions. But I'm always interested when I talk about OpenTelemetry and I do a show of hands about how many people are using it, because everybody is sort of nodding enthusiastically, "Well, yes, OpenTelemetry, yes, I'm completely on top of it." Am I using it? Well, no. Not yet. Real soon now. But that's the way of these things. There's always a bit of a lag between the standard being established and then the adoption.
Lorna Jane Mitchell: Do you think there are big barriers, or is it just a matter of time?
Holly Cummins: I think perhaps not for OpenTelemetry, in particular, but I think technology changes quickly. People and processes change slowly. And I think that one of the hardest things about the cloud is trying to get those processes to change and to keep up. So, for example, what you're talking about: we deploy without ceremony and without fuss, and we do it all the time because we know we can recover. A lot of institutions, you know, they just sort of start touching the chair when they hear that because there's a whole bunch of processes that were put in place to manage risk. And probably when we were pressing to CDs, they were a great way of managing risk. But now, it's not about how many barriers you put in front of doing the thing, it's how quickly can you recover after a bad thing happens. And that's what improves your quality, that's what improves your user satisfaction.
Recommended talk: Observability Engineering • Charity Majors, Liz Fong-Jones & George Miranda • GOTO 2022
Lorna Jane Mitchell: Yes. I think that's a really good way of thinking about it, like, what's the cost of having to redeploy? Because I remember burning to DVD and putting it on the courier on a Friday, and that was how we delivered the website or whatever it was. And now we would just, you know, small fix, make sure it's good, make sure the tests all run green, and it's live in five minutes. I mean, I click the button in meetings. That's the barrier to release: "Have I realized we need to do this? Okay, cool. Go for it."
Holly Cummins: Yes. I think that's a good thing, really. It should be that you can have someone of fairly low competence and low attention doing your deployment and it's all okay. Again, I think the interesting thing about that is it sounds absolutely reckless and cavalier. How could you possibly have someone who is not particularly on the ball doing it? But actually, it's because it's so idiot-proof that you can have idiots doing it. I think it contrasts with what seemed to be this sort of very risk-averse strategy of, "We're going to have a release, and the release is on this day. And we're going to have all of the things beforehand." But what would happen is that you didn't wanna miss that release date. I remember, I used to work on WebSphere, and we did not want to miss the release date because it meant that the feature you'd worked on for two years would have to wait another two years, which is just heartbreaking, right?
Lorna Jane Mitchell: It's horrible.
Holly Cummins: I mean, what we want as developers, we want to make an impact, we want to have people using our stuff, we wanna make people happy, and so then that feeling of you just wasted it. So, of course, there'd be the push and there'd be the, "Okay. Well, let's work evenings. Let's work weekends. Let's work even more evenings. Maybe we could work in the middle of the night too." And, of course, you know what happens to the quality, right? It's not like people produce better code at 2:00 in the morning after they've been working for two weeks solid. So the feature would go out, and the quality was shocking. All the users would know that, so they definitely wouldn't pick up the first release. It's not a healthy cycle for anybody.
Lorna Jane Mitchell: It's not. And I think moving to the cloud with that lower barrier of release has given us room to get it right. And also, I mean, I remember release checklists that you had to first run the backup and then change the tape and then run this and that. It needed a skilled person with all of their wits about them and a good chunk of time. But now, we've automated the things we care about. And so that release checklist, almost the machine does that, and it just prompts me, like, "This isn't green. You can't deploy this." I was, "Oh, yeah." Whereas, you know, some time ago, I'd have made that mistake on release day. Now, it's like, "Oops."
Holly Cummins: Yes, definitely. If you care about it, automate it.
Lorna Jane Mitchell: I think that's really good advice, and it means automating anything we can. I have the prose linting, I have some link checking. Have we spelled everything correctly? And it's just there in the process. I don't think about it. I push my branch. It almost always fails to build on the first attempt. That's because the machine is on my side and making sure the quality is there.
Holly Cummins: Yeah.
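The automated release checklist described above, where the machine refuses to deploy until every check is green, might look something like the following minimal sketch. The check names and the functions `failing_checks` and `can_deploy` are hypothetical placeholders; in a real pipeline each check would call a tool such as a prose linter or a link checker rather than a stub.

```python
from typing import Callable, Dict, List

# A check is any zero-argument function that returns True when it passes.
Check = Callable[[], bool]


def failing_checks(checks: Dict[str, Check]) -> List[str]:
    """Run every check and return the names of those that did not pass."""
    return [name for name, check in checks.items() if not check()]


def can_deploy(checks: Dict[str, Check]) -> bool:
    """Allow a deploy only when no check fails, reporting each failure by name."""
    failed = failing_checks(checks)
    for name in failed:
        print(f"This isn't green: {name}. You can't deploy this.")
    return not failed


if __name__ == "__main__":
    # Stub checks standing in for real tools (assumed for illustration).
    checks = {
        "prose lint": lambda: True,
        "link check": lambda: False,
        "spelling": lambda: True,
    }
    print("deploy allowed:", can_deploy(checks))
```

The gate holds no release knowledge of its own. Adding a check to the dictionary is all it takes to turn "if you care about it, automate it" into something the pipeline enforces on every push.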
Lorna Jane Mitchell: Thank you so much for the conversation. It was really cool to hang out and talk cloud with you. Enjoy the rest of the event.
Holly Cummins: Thank you so much, Lorna.