200 deployments in production per day, Friday included: lessons learned
A few days ago I read this tweet from Charity Majors on X about the fear of deploying on Friday, and I agreed with her. I’ve had this "deploy on Friday" discussion several times over the years, and even proposed a talk about this topic at some French conferences a while ago (I didn’t pass the CFP :D).
The tweet was on a Friday, so I looked at how many production deployments we did that day, counting only updates on applications: 152. Looking at metrics on one week, I can see peaks at more than 200 deployments per day. Not bad.
"Not deploying on Friday" is something I have always heard in my career. It’s the kind of statement that gets cargo-culted a lot, taught to juniors, and often considered general wisdom: we don’t deploy on Friday so we don’t break production just before the weekend.
We’re in 2025, so it’s not a secret anymore: having dev teams deploy to production often increases productivity. Just read Accelerate if you need data about this.
Not deploying on Friday is an issue, especially on large projects in competitive fields (which is often the case in software engineering): it’s like stopping your factory one more day than your competitor does. And often, "not deploying on Friday" also becomes "not deploying during summer holidays", "not deploying during the Christmas week", or "not deploying before bank holidays"… it’s a symptom of a larger disease.
You want to ship constantly, every day. Push feature improvements and bug fixes quickly to customers. Even unfinished work should be able to land in prod (we’ll talk later in this article about decoupling deploying to production from enabling a feature for end users).
There’s also a psychological aspect to deploying early and often. Large changes are frightening, at least for me. Just look how companies that ship only from time to time in production behave: release date set in stone in the calendar, code freeze, managers ordering pizzas, every team following anxiously the deployment and waiting for customers' feedback… Everyone expects the goddamn thing to explode. Of course, there’s a world between "deploying in production every month" and "not deploying on Friday", but the result is the same, just at another scale.
It’s interesting to ask ourselves "Why shouldn’t we deploy on Friday?". It’s often a lack of trust: in our tests, our CI/CD deployment pipeline, our ability to quickly roll back, our monitoring system, our practices around feature development… And that’s OK. If you don’t deploy on Friday because you lack trust, don’t do it. But ask yourselves why, figure out what you are missing to gain confidence, and iterate on that. Fix your issues one by one. Gaining trust in your deployments will help your company every day of the year, not only on Friday.
There’s always stuff that I recommend not doing on Friday. A few years ago, while working at a cloud provider, I migrated the private network product from one orchestration engine to another (it included migrating all the data from one database to another).
It was an important one-off task that I had to monitor closely while doing it. I didn’t do it on a Friday at 6 PM. But it’s not the kind of task we do every day as software engineers (or SREs); they are, what, 0.1 % of our deployments? The rest, the day-to-day work, should be shippable every day.
This article is my personal opinion on a few topics that are important for developers' productivity.
Good tech
There’s a secret sauce for productivity: good technology.
More and more, I see people saying that tech doesn’t matter: everything is about culture, working together (DevOps, etc), and "product first" (in the sense of: don’t care about tech, just deliver)…
Of course, all of this matters, and I even did several talks and articles about this topic over the years.
But if you have shitty tech (or shitty practices around tech), you’ll lower your productivity. Period.
I liked this talk from Nigel Kersten last year at Flowcon. Tech excellence will help you build better products and will make your customers (and your engineers) happy. Let’s, for once, consider tech as an asset and not just something that doesn’t matter and is interchangeable.
We’re in 2025, we have tons of cool and accessible technologies and practices to help us achieve our goals, let’s use them. And pushing technological boundaries is also a good way to change culture, by showing what’s possible!
Service-oriented architecture
I don’t like huge monoliths, and every time I encountered a large codebase at work, it was a mess and the main company pain point, slowing down everyone. I’m sure good-quality large monolith applications exist, but I never saw one.
I am also not a fan of microservice architectures, especially if badly designed. I even wrote an article (in French but I guess it can be easily translated) about the microservice topic one year ago.
Let’s be pragmatic. I really think that for most organizations there’s a sweet spot in the number of services per team or per developer, especially in large organizations. I don’t know what this number is. One or two services per team (5-10 engineers handling one or several business domains) maybe? What I can say is that I don’t think we could deploy 200 times per day to production with a monolith.
Smaller codebases mean smaller cognitive load when working on them, faster CI/CD pipelines (more about testing in the next section), less blast radius when things fail, and easier rollbacks.
Of course, service-oriented architectures have other challenges (the article I shared before explains a few of them; having strong tech foundations/standards is also important), but it’s a price I would pay because the benefits are huge. Again, I’m not pushing everyone to build crazy microservice architectures, but there is a middle ground between "my 5 million LoC monolith that no one can run or test locally" and "my 900-microservice distributed monolith".
Tests
Good test coverage for software is essential. Note that I didn’t write "100 % test coverage", which is useless. The tests should cover your software’s interfaces. Usually, your codebase has internal interfaces, with components containing business logic split per domain. Software also interacts with the outside world, through APIs for example: starting your software and then testing it as a blackbox also has value. This article is not about testing, but there are a few things you should be careful about.
First, avoid slow tests. Slow tests are a productivity killer. It’s just so frustrating to wait 30 minutes for tests to execute, locally or in your CI, especially if they fail at the end (please, avoid flaky tests too). Tests should give developers a quick feedback loop, and that’s not possible if they are slow. Some stacks (like Golang) automatically cache test results and will only run tests that are impacted by code changes, which is nice for local development, but it’s still important to be able to run all tests quickly if needed (which will probably be the case in CI). If you have a huge monolith, you’ll have to modularize it to avoid running all tests for each commit.
Often, slow tests are a "boiling frog" syndrome: people react when it’s too late, and by then the effort seems too big. Don’t fall into this trap.
Another important thing is keeping your local (and so testing) environment simple. This point is a major problem that is impacting most companies, especially when they start to have several services on the backend side. I wrote an article about this topic two years ago.
I saw (and still see) so many CRAZY systems to run services and/or tests locally in my career. Tons of terrible bash scripts assembled, being forced to run tools like Kubernetes or complex networking stacks locally to run tests, services that can’t be tested without spinning all dependencies (services with which they interact), causing a giant mess. New joiners that can’t run the awful local dev environment after 3 days of trying. It’s ridiculous but not that uncommon. If you’re in this situation, fix it. Having a quick feedback loop is essential for productivity.
If you have a SOA/microservice architecture and can’t test one of the services in isolation without starting its dependencies, you lose. If you can’t just "git clone" a repository, maybe launch a docker compose file to spin up a database/message queue (for integration tests), and immediately run your tests with regular tools (go test, cargo test, mvn test…), you lose. If you need to deploy complex infrastructure components to run tests, you lose.
The key to success is being able to work in isolation on a given service, without caring about the outside world (this is why we have API contracts). Bonus: if it’s simple locally, it will also be simple on your CI pipeline.
What about end-to-end tests, like tests based on emulating browsers (using Selenium for example)? In my experience, they are always slow, flaky, and very hard and costly to maintain. I don’t recommend having them in your CI pipeline. But constantly running a few of them (canary testing) in production, on a few key features (like login), is interesting.
One last thing about tests: avoid on-demand environments, especially for backend services. On-demand environments seem appealing, but they’re false friends: a convenient way to hide deeper issues that will surface in the long run.
I remember an informal discussion about this topic I had a few years ago with a few SREs from one of the most highly valued French scale-ups. They had started splitting their historical monolith into several services and asked for my thoughts (knowing that I had already worked on this topic) about how to provide full-featured on-demand environments for backend engineers, spun up from pull requests in order to manually validate the code before merging. My answer was: DON’T!
It’s a trap: why split a monolith if, in the end, your engineers can’t work without fully rebuilding it (with the accidental complexity it brings to the CI/infrastructure layer) every time they make a change on a given service?
Avoid like hell having to "wait" for a team to deliver changes on one service before being able to ship your part: again, you completely lose the advantage of services-oriented architectures if you have coupling between them when delivering features. I’ll explain later a few techniques to avoid this.
Again, working in isolation is the key to success.
Deployment
We’re running all our production systems on Kubernetes, with ArgoCD as our GitOps engine. I started using Kubernetes in 2017, wasn’t totally convinced at the beginning but now I can say that it’s a game changer for infrastructure management and to build a reliable internal PaaS. And it’s also thanks to Kubernetes that we’re able to be so productive today (and that I can’t remember the last time I was paged while being on-call).
In any case, you’ll need something to handle stuff like application packaging, rollout (and connection draining), rollbacks, firewalling, service discovery, to detect issues on them (health checks). You will also need autoscaling, resources management, and scheduling capabilities. I could add a lot of stuff to this list.
Yes, you can do all of this without Kubernetes, I did it myself in the past. But in reality your homemade assembly of shell scripts/puppet modules/ansible roles/… will just be a much worse, non-standard, bug-ridden, hard-to-maintain version of Kubernetes. Just use Kubernetes, it’s ultra-reliable, even for critical services (we even automatically rebuild the whole infrastructure every 24 or 48 hours thanks to Karpenter running on our clusters for example :D).
As said before, we want to ship early and ship often. We don’t have releases, we just immediately deploy each commit pushed on the main branch. Each commit is a release. It works very well and I encourage you to do the same thing if possible.
Our current CI pipeline for all of our services (they all have continuous deployment enabled without exception, including critical ones) is, for each commit:
- Run linters/tests
- Immediately deploy to our staging environment if the previous phase is successful
- If the rollout is successful in staging (the application is up and running), deploy to production
But what is a "successful deployment"? Usually, applications have several replicas. Kubernetes progressively replaces replicas running the old version with replicas running the new version. If a new replica fails to start, the rolling restart is aborted and the old version continues to run. In our case, this means the production deployment only starts once all replicas are up and running in staging, "up and running" meaning, in the Kubernetes world, "the /healthz HTTP endpoint of the services returns 200".
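As a minimal sketch of what "up and running" means in code, here is a /healthz handler that only returns 200 once the service’s startup checks have passed (the readiness flag and the checks behind it are hypothetical; real services would verify DB connections, migrations, etc.):

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
)

// ready flips to true once startup checks (DB connection, migrations…) pass.
var ready atomic.Bool

// healthStatus maps readiness to the HTTP code Kubernetes acts on:
// 200 lets the rollout proceed, 503 blocks it and keeps old replicas running.
func healthStatus(isReady bool) int {
	if isReady {
		return http.StatusOK
	}
	return http.StatusServiceUnavailable
}

func main() {
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(healthStatus(ready.Load()))
	})
	// Pretend the startup checks succeeded.
	ready.Store(true)
	fmt.Println(healthStatus(ready.Load())) // 200
	// http.ListenAndServe(":8080", nil) // omitted in this sketch
}
```

A liveness or readiness probe pointing at this endpoint is what lets the rolling update abort automatically when a new replica is broken.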
Thanks to Kubernetes and ArgoCD, we can also easily roll back to a previous revision if something fails.
For years, we worked that way and it was enough.
One year ago we introduced Argo Rollouts to perform canary deployments for highly critical services. Indeed, if a service successfully starts but then generates tons of errors 5 minutes later, it will not be detected by Kubernetes, which will happily continue to roll out new replicas of the faulty service.
Argo Rollouts allows fine-grained deployments based on service metrics (we’re using Prometheus/Thanos), for example: "deploy 20 % of replicas with the new version, wait for 5 minutes, and periodically check the application HTTP error rate. If the error rate is below the configured threshold, deploy 70 % of the replicas with the new version for 10 minutes, and if it’s still OK deploy 100 %".
If the error rate increases too much, the application is automatically rolled back to the previous version. Deployments take more time, but we gain confidence. For critical services, it’s worth it, and thanks to our service-oriented architecture we don’t have "CI contention" on them.
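The canary decision logic can be sketched as a small function (in reality Argo Rollouts does this declaratively via analysis against Prometheus; the step weights, thresholds, and the errorRate callback here are hypothetical stand-ins):

```go
package main

import "fmt"

// Step mirrors one canary phase: shift Weight % of traffic, wait,
// then check the error rate against MaxErrorRate before continuing.
type Step struct {
	Weight       int
	MaxErrorRate float64
}

// promote walks the canary steps; errorRate stands in for a metrics
// query. It returns true to fully promote, false to roll back.
func promote(steps []Step, errorRate func() float64) bool {
	for _, s := range steps {
		// (in reality: traffic is shifted to s.Weight % and we wait here)
		if errorRate() > s.MaxErrorRate {
			return false // abort: automatic rollback to the previous version
		}
	}
	return true
}

func main() {
	steps := []Step{{20, 0.01}, {70, 0.01}, {100, 0.01}}
	healthy := func() float64 { return 0.001 } // 0.1 % errors: below threshold
	fmt.Println(promote(steps, healthy))       // true
}
```

The point of the sketch: the rollback decision is made by the system from metrics, not by a human watching dashboards.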
Deploying vs enabling
I talked at the beginning of the article about decoupling deploying to production from enabling a feature for end users, and about shipping unfinished work. It’s a super important topic. A new feature can take weeks to be delivered completely, but you don’t want to wait weeks before being able to merge work; it’s better to ship continuous improvements every day (Friday included). And as I said before, you DON’T want teams waiting for each other when shipping new stuff.
Same thing for smaller changes like improvements to existing features (for example, a new option on an API endpoint). You may want to enable the feature in your staging environment but not in production, or in production but only for some identified users or a subset of customers…
The obvious answer here is feature flags. Hide your new features or improvements behind a flag and enable it only for a subset of users or in specific environments. Most feature flag systems can also be used by non-tech people, which is nice. It looks easy on paper, but feature flags can also become a huge mess:
- Flags created but never cleaned up once useless
- Codebases polluted with if/else conditions, hard to follow
- Teams not thinking about what happens when the feature flag system is down (like: how will the system really react, especially with partial failures and with several services referencing the same flag, potentially evaluated differently)
I saw a lot of cases in my career where feature flags were misused and where they should have been replaced by business logic. One of the most common errors is confusing feature flags with permissions management. If you have a solid permission system (centralized permission store, with tooling and an admin interface, integrated into your services business logic), it’s way easier to use it to enable a feature to specific customers rather than using a feature flag. No conditions in the code to add or remove, just your permission layer dynamically handling everything for you.
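On the "what if the flag system is down" point, a common mitigation is to give every flag lookup a hard-coded default so an outage of the flag backend degrades predictably. A minimal sketch (the interface, store, and flag names are hypothetical, not a specific vendor’s API):

```go
package main

import (
	"errors"
	"fmt"
)

// flagStore abstracts a feature flag backend.
type flagStore interface {
	IsEnabled(flag, userID string) (bool, error)
}

// isEnabled degrades gracefully: if the flag system is unreachable,
// fall back to a per-flag default instead of failing the request.
func isEnabled(store flagStore, flag, userID string, defaultVal bool) bool {
	v, err := store.IsEnabled(flag, userID)
	if err != nil {
		return defaultVal // flag system down: safe, predictable behavior
	}
	return v
}

// downStore simulates an outage of the flag backend.
type downStore struct{}

func (downStore) IsEnabled(string, string) (bool, error) {
	return false, errors.New("flag service unreachable")
}

func main() {
	// A new risky feature stays off if we can’t ask the flag system.
	fmt.Println(isEnabled(downStore{}, "new-checkout", "user-42", false)) // false
}
```

Picking the default per flag (off for risky new features, on for long-stable ones) is the important design decision; several services referencing the same flag should agree on it.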
I like this talk from Dorra Bartaguiz (in French) about this topic.
Another way to hide features in specific environments is to hide new API endpoints, for a new feature for example. It’s easily doable if you have an API gateway in front of your services. It’s also OK to deploy a new service in production but not expose it at all.
Developer autonomy
Again, a topic on which I already wrote and talked a lot and that would deserve another dedicated article. See for example those slides (in French again!) about this topic.
Your SRE/Platform engineer/sysadmin/whatever new trending team name/… team shouldn’t be your organization’s bottleneck. This means that other teams (product teams, data teams…) should be able to manage their services in production by themselves, including (non-exhaustive list):
- Creating new services, including infrastructure dependencies (Git repositories, databases, Kafka topics, S3 buckets…)
- Configuring how services run in production: resources, number of replicas, autoscaling, health checks, ports/paths to expose to the outside world, environment variables and secrets…
- Managing production deployments and rollbacks, receiving alerts (and being on call) for their services, and being able to troubleshoot incidents
In short, the work doesn’t stop when the code is merged on the main branch.
We don’t ask developers to become cloud or Kubernetes experts. It’s our job to abstract the infrastructure complexity and to provide a simpler interface to end users. You don’t want developers to have to write 10 Kubernetes YAML files to deploy a single service for example, you want them to configure a few key settings (as said before: replicas, autoscaling, env vars/secrets, ports, rollout configuration…) in a very simple way. Hide the accidental complexity and only expose the relevant settings.
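That "few key settings" interface could look like a small config struct the platform expands into the underlying Kubernetes objects (all field names here are hypothetical; the actual expansion into Deployment/Service/HPA manifests is elided):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ServiceConfig is the whole interface a developer sees: a handful of
// settings, not 10 YAML files. The platform expands it into the
// Kubernetes resources behind the scenes.
type ServiceConfig struct {
	Name        string            `json:"name"`
	Replicas    int               `json:"replicas"`
	MinReplicas int               `json:"minReplicas,omitempty"` // autoscaling bounds
	MaxReplicas int               `json:"maxReplicas,omitempty"`
	Port        int               `json:"port"`
	HealthPath  string            `json:"healthPath"`
	Env         map[string]string `json:"env,omitempty"`
}

// validate catches mistakes before anything reaches the cluster.
func validate(c ServiceConfig) error {
	if c.Name == "" || c.Port <= 0 {
		return fmt.Errorf("name and port are required")
	}
	return nil
}

func main() {
	cfg := ServiceConfig{
		Name: "billing", Replicas: 3, Port: 8080, HealthPath: "/healthz",
		Env: map[string]string{"LOG_LEVEL": "info"},
	}
	if err := validate(cfg); err != nil {
		panic(err)
	}
	out, _ := json.MarshalIndent(cfg, "", "  ")
	fmt.Println(string(out))
}
```

The value is in what is absent: no node selectors, no network policies, no probe timings; the platform owns those defaults.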
But it’s still not enough. There are tons of actions your developers need to perform in production on their daily work (or during incidents). Here are a few real-world examples:
- Deleting Kafka topics or consumer groups when decommissioning services or doing refactoring
- Killing SQL queries during incidents (for example pg_cancel_backend(pid) on PostgreSQL): who never crashed a database because of a giant query running without a timeout?
- Fetching pprof files from an internal /debug/pprof endpoint on some Go service
- Backend services themselves may have administration endpoints for troubleshooting purposes, or to execute privileged actions (to reconcile data for example)
All of this should be done safely, with authentication, limited scope (you’re only allowed to perform specific actions), audit, and critical actions approvals. There’s no point in putting your developers on call if they can’t perform any action in production when being paged. And even outside of on-call hours, not having tooling immediately available for common situations and having to reach out to another team that may not answer immediately is an issue.
What worked well for us was building our own tooling. We have an API gateway plugged into our central authentication services (with 2FA etc.) and we, SRE, built services to expose high-level features. We then provide an official CLI (installed and upgraded automatically on all computers) to interact with them. Here are some subcommand examples, all of them hitting homemade APIs:
- You want to create a new microservice? Run microservice create go --database --s3 and that’s it: 5 minutes later your service is running in production with some boilerplate code and its infrastructure dependencies (database and S3 bucket)
- You need to terminate a SQL query? Just run rds cancel-query --context staging --identifier my-db --pid 1234 (with a real-time dashboard showing you the queries currently running on databases)
- You need to delete a Kafka topic? Run kafka topics delete --cluster foo --topic my-topic. You can even add extra security layers to our homemade API, like making sure that no consumers are listening to the topic
- You want to profile a Go app using pprof? Run microservice profiling go --profile profile --seconds 30 --file /tmp/result.pprof. Pass --profile heap for a heap dump
- You need a new temporary read replica in production to run some batch processing jobs? Run rds create-read-replica --source my-db. Users don’t (and shouldn’t!) have to specify anything else, like network configuration: all of this is controlled by the SRE team (and taken from the source instance in reality); there’s no reason to expose those settings
We even have an approval system for those actions, where approval should be given by someone else (not necessarily an SRE, it may be someone from your team). You can see examples on these slides where screenshots are in English.
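The three safety rules mentioned above (limited scope, second-person approval for critical actions, audit trail) can be sketched as a single authorization check; everything here (types, action names, the in-memory audit log) is hypothetical, not the author’s actual implementation:

```go
package main

import (
	"errors"
	"fmt"
)

// Request is a privileged action (delete a Kafka topic, kill a query…)
// going through the tooling API.
type Request struct {
	User, Action string
	ApprovedBy   string // someone else must approve critical actions
}

var errDenied = errors.New("denied")

// authorize enforces limited scope per user, second-person approval for
// critical actions, and records an audit trail entry on success.
func authorize(r Request, allowed map[string][]string, critical map[string]bool, audit *[]string) error {
	ok := false
	for _, a := range allowed[r.User] { // limited scope: explicit allow-list
		if a == r.Action {
			ok = true
		}
	}
	if !ok {
		return fmt.Errorf("%w: %s cannot %s", errDenied, r.User, r.Action)
	}
	if critical[r.Action] && (r.ApprovedBy == "" || r.ApprovedBy == r.User) {
		return fmt.Errorf("%w: %s requires approval by someone else", errDenied, r.Action)
	}
	*audit = append(*audit, r.User+" did "+r.Action) // audit every action
	return nil
}

func main() {
	allowed := map[string][]string{"alice": {"kafka.topic.delete"}}
	critical := map[string]bool{"kafka.topic.delete": true}
	var audit []string
	err := authorize(Request{User: "alice", Action: "kafka.topic.delete", ApprovedBy: "bob"}, allowed, critical, &audit)
	fmt.Println(err == nil, len(audit)) // true 1
}
```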
TL;DR on this topic: you should add self-service capabilities to your infrastructure. Just look at your support requests, take the most frequent ones, and build tools that allow teams to perform those actions without having to ask you. Repeat for a few years.
And you will see that building new features becomes easier and easier over time, because you’ll be able to reuse what was built for previous ones (API gateway, authentication, approval systems, tooling…).
What about observability, one of my favorite topics? I could write pages about it (and this article is already way too long), but here are a few important points to keep in mind:
- Observability should be built into applications: if a team creates a new HTTP service interacting with a PostgreSQL database, the service should have by default proper metrics about the language runtime (like Golang internal metrics), about its HTTP server (rate, latency, with proper labels), and about its database (size, IOPS, CPU usage…). It should also have end-to-end tracing with proper semantic conventions for attributes (including your own business attributes).
- Tracing, and more specifically OpenTelemetry traces, is the future (or the present): the open source tooling is still in its early stages on this topic, but I’m convinced that in a few years we’ll just emit spans from services and derive logs (span events) and metrics (by aggregating spans) from them. No more logger or Prometheus metrics endpoint, yay! It’s the direction we’re trying to take.
- Be careful about alert fatigue (alert on symptoms vs causes), and treat alerts paging people outside of business hours as a top-priority issue. If your on-call team is flooded by alerts all the time, first, it’s utterly disrespectful to them and makes you a bad employer, but it’s also a huge red flag about your company’s technical practices.
- I know a lot of people will not agree with this one, but I really think developers shouldn’t follow their production deployments when shipping day-to-day code. The system should immediately react by aborting the deployment and/or generating alerts in case of error. I don’t want developers "following" (what does that even mean, especially on complex systems?) deployments for 15 minutes, 200 times per day. I never really saw a correlation between following deployments and detecting incidents faster. On the other hand, I always saw the value of a good alerting system, with alerts routed to the right persons/teams. Just merge your work, trust your continuous deployment pipeline and alerting system, and start working on something else.
- Handle incidents properly, during the incident itself (how to declare it, who is the incident commander/driving the incident, how to notify customers…) and after (how do we learn and make continuous improvements thanks to incidents? I did a talk a few years ago about incident management and PDCA, in English this time).
Embracing failure
We should expect things to fail. Bad things will always happen and we’ll have production incidents no matter what we do. And it’s OK. It’s not that bad to have incidents from time to time if you’re able to limit the impact, quickly roll back, and if it doesn’t happen too often.
That’s also why I think SLOs are a really nice tool (see my article about them): they will help you measure the quality of your features in the long run, and so help you decide whether or not you should invest more in reliability topics. Business metrics reflecting the user experience have a lot of value and are good candidates for SLOs.
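The arithmetic behind an SLO is simple enough to show inline: the error budget is the share of requests the target still allows you to fail. A quick sketch (the target and request counts are made-up numbers):

```go
package main

import "fmt"

// errorBudget returns how many failed requests an SLO still allows:
// with a 99.9 % target over 1,000,000 requests, the budget is 1,000
// errors; once it's burned, you invest in reliability instead of features.
func errorBudget(sloTarget float64, totalRequests, failedRequests int64) int64 {
	allowed := int64(float64(totalRequests) * (1 - sloTarget))
	return allowed - failedRequests
}

func main() {
	fmt.Println(errorBudget(0.999, 1_000_000, 200)) // 800 errors left
}
```

A negative result means the budget is exhausted, which is the signal to slow feature work down.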
Design systems that tolerate failures. If after an incident you always need to manually perform actions in production to fix customers’ issues, like reconciling data in some services or manually relaunching actions, it’s a sign of an issue with how you design systems.
Just make sure that next time a similar issue happens, what you had to do manually will be done automatically by the system. You should always ask yourself "What will happen if this component fails" when building features, and plan for automatic recovery.
The end.