Translating an AWS EKS Stack to Azure AKS: The Architectural Decisions Behind a Real Migration
Migrating a production EKS stack to AKS is not a service-by-service translation exercise. It is a sequence of architectural decisions that the AWS-to-Azure mapping tables on Microsoft Learn quietly skip over. This post is about the decisions that shaped a real five-week migration of four production apps from eu-west-2 to West Europe, and what I would do differently.
There is a comforting table on Microsoft Learn that maps AWS services to Azure services. RDS to Flexible Server. Cognito to Entra ID B2C. SQS to Service Bus. ECR to ACR. S3+CloudFront to Blob+Front Door. The table is correct. It is also misleading, because it implies the migration is a translation problem.
It is not. The translation is the easy part. The hard part is the sequence of architectural decisions you have to make once you accept that some services do not translate cleanly, that some Azure equivalents are better, that some are worse, and that doing a literal one-to-one mapping is the most expensive way to ship.
This post is about a five-week migration I did from a production EKS stack in eu-west-2 to a production AKS stack. Four apps, 150+ secrets, dual-cloud GitOps. I will not walk through the mapping table. I will walk through the decisions that mattered.
The starting point
The AWS side was a normal EKS production stack. EKS, 23 fixed nodes on a current-generation instance type, RDS PostgreSQL Multi-AZ, ElastiCache Redis, ECR for images, Cognito for B2B auth, SQS and SNS for async messaging, S3 plus CloudFront for static assets, SES for transactional email, KMS for encryption, Parameter Store for secrets. ArgoCD on top of all of it for GitOps. Four production apps: a Node.js API, a worker for asynchronous report generation, a JVM service for ingest, and a static frontend.
The driver for going to Azure was not technology. It was an enterprise customer with a data residency requirement that AWS could not satisfy in the relevant region. We needed an Azure footprint, parallel to the AWS one, deployed from the same Git repos, with a new branch strategy. The product team kept shipping to AWS. The platform had to keep up on both.
This is the part most blog posts about cloud migrations skip. Migrations rarely happen because someone sat down and chose Azure on technical merits. They happen because a customer, a region, a regulator, or a contract forces a decision. The architecture that comes out of that pressure is shaped by the constraint, not by the cloud’s capabilities.
Decision 1: Translate, do not abstract
The first temptation, when you are told you have to deploy the same product to a second cloud, is to introduce a multi-cloud abstraction. Crossplane to manage AWS and Azure resources from one control plane. A custom Terragrunt structure with provider abstractions. A homegrown Helm chart that takes a cloud: value and conditionally renders manifests for either side.
I have seen each of these tried. None of them are wrong. All of them are too expensive to introduce during a five-week migration of a production system that is not your own pet project.
The decision I made: keep AWS Terraform as the AWS Terraform, write Azure Terraform from scratch as the Azure Terraform, and accept the duplication. Two separate state files, two separate provider configs, two separate variable structures. The applications stay one codebase, one Helm chart, with two values files: values-aws-prod.yaml and values-azure-prod.yaml. ArgoCD has three application sets: aws-prod, aws-dev, azure-prod.
This is duplication, and it costs more lines of HCL. It also costs almost nothing to maintain in practice, because each cloud’s infrastructure changes in a different rhythm and the abstraction would have leaked the moment the first Azure-specific feature came up (which it did, in week two).
The shape of the platform after the split:
flowchart LR
subgraph App["Application code (single source of truth)"]
direction TB
A1[api] --- A2[reporter]
A2 --- A3[ingest]
A3 --- A4[frontend]
end
subgraph Helm["Helm chart (shared)"]
direction TB
H1[values.yaml]
H2[values-aws-prod.yaml]
H3[values-azure-prod.yaml]
end
subgraph AWS["AWS stack"]
direction TB
TFA[Terraform AWS]
AC[ArgoCD aws-prod]
end
subgraph Azure["Azure stack"]
direction TB
TFZ[Terraform Azure]
AZ[ArgoCD azure-prod]
end
App --> Helm
Helm --> AC
Helm --> AZ
TFA -.->|provisions| AC
TFZ -.->|provisions| AZ
The Helm values diff between the two clouds is small but precise. The same chart deploys to both; only the cloud-specific glue changes.
# values-aws-prod.yaml (relevant excerpt)
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/api
externalSecrets:
  backend: aws-secrets-manager
ingress:
  className: alb
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
# values-azure-prod.yaml (same fields, different glue)
serviceAccount:
  annotations:
    azure.workload.identity/client-id: 8a7b6c5d-1234-...
externalSecrets:
  backend: azure-keyvault
ingress:
  className: nginx
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
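For completeness, this is roughly how the azure-prod side consumes that values file. The repo URL, paths, and names below are illustrative rather than our real ones, and the real setup uses ApplicationSets, but a single Application is enough to show the wiring.

# Sketch: ArgoCD Application for the api app on the azure-prod cluster.
# Repo URL, revision, paths, and names are illustrative.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-azure-prod
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/platform-charts
    targetRevision: azure
    path: charts/api
    helm:
      valueFiles:
        - values.yaml
        - values-azure-prod.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: apps
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

The aws-prod Application is identical except for the values file and the destination cluster.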
The anti-pattern: building a multi-cloud abstraction during a migration. Build it after, if the second cloud sticks around long enough to justify the investment. Most don’t.
Decision 2: Pick the Azure region for the right reasons
The original plan was to deploy in the smaller Azure region geographically closest to the customer. On paper, that satisfied the data residency requirement.
Two weeks in, we ended up in a major European region instead.
The reasons were operational, not architectural: AKS feature parity in the smaller region was lagging behind the major ones. Some Azure managed services we needed were either not available or available with reduced SKUs. vCPU quota requests took longer to approve. The third-party services we depended on (observability vendors, email reputation providers) had latency or peering gaps to the smaller region that they did not have to the major one.
The legal review came back saying the major region satisfied the customer’s data residency contract, since the customer was operating an EU subsidiary that controlled the data. This is not always true. It was true in this case, and the change saved us.
The general lesson: do not anchor the region choice to “what is geographically closest”. Anchor it to “what region’s feature set, quota policies, and partner network match the workload”. Verify the legal constraint with someone who actually understands the contract, not with the AWS region map.
Decision 3: Service mapping is not equivalence
Here is where the Microsoft Learn table gets you in trouble. The mapping is correct only at the level of “this Azure service occupies the same architectural slot”. It is wrong at the level of “this Azure service behaves the same way”.
The slots we had to fill:
flowchart LR
subgraph AWS_["AWS stack"]
direction TB
A1[EKS]
A2[RDS PostgreSQL]
A3[ElastiCache Redis]
A4[ECR]
A5[S3 + CloudFront]
A6[SQS / SNS]
A7[Cognito]
A8[KMS + Parameter Store]
A9[SES]
end
subgraph Azure_["Azure stack"]
direction TB
Z1[AKS]
Z2[PostgreSQL Flexible]
Z3[Valkey in-cluster]
Z4[ACR]
Z5[Blob + Front Door]
Z6[Service Bus]
Z7[Entra ID B2C]
Z8[Key Vault]
Z9[Communication Services]
end
A1 --> Z1
A2 --> Z2
A3 --> Z3
A4 --> Z4
A5 --> Z5
A6 --> Z6
A7 --> Z7
A8 --> Z8
A9 --> Z9
A few that bit me.
Cognito to Entra ID B2C
On paper, both are managed identity providers. In practice, the developer experience, the customization model, the user flow editor, and the way you integrate with a NestJS app are all different. Migrating users from Cognito to B2C requires a custom flow because there is no clean export-import path. Password hashes are not portable.
The decision the dev team made: do not migrate users at all. The new Azure deployment is a fresh tenant with its own user base. AWS keeps its Cognito for the existing customers. The architectural cost of merging the two later is real but not urgent.
SQS to Service Bus
Both are managed queue services. SQS has visibility timeouts, dead-letter queues, FIFO and standard. Service Bus has sessions, scheduled messages, dead-lettering, and the concept of topics with subscriptions baked into a single namespace.
If you used SQS in the most basic way (queue, consumer, occasional DLQ retry), the translation is clean. If you used SNS-fans-out-to-SQS with multiple consumers, the right Azure equivalent is Service Bus topics, not Service Bus queues. We had to redesign one of the messaging patterns because the original was “SNS to two SQS queues for two different consumers”, and the translation was not “Service Bus to two Service Bus queues” but “Service Bus topic with two subscriptions”.
This is not in the mapping table. It is in the docs, but only if you read both with the implementation in front of you.
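The redesigned pattern is short to express in Terraform. A sketch with illustrative names (the topic and subscription names are placeholders, not ours):

# Sketch: one topic, one subscription per consumer. Names are illustrative.
resource "azurerm_servicebus_topic" "report_events" {
  name         = "report-events"
  namespace_id = azurerm_servicebus_namespace.platform.id
}

resource "azurerm_servicebus_subscription" "reporter" {
  name               = "reporter"
  topic_id           = azurerm_servicebus_topic.report_events.id
  max_delivery_count = 10
}

resource "azurerm_servicebus_subscription" "ingest" {
  name               = "ingest"
  topic_id           = azurerm_servicebus_topic.report_events.id
  max_delivery_count = 10
}

Each consumer reads its own subscription and gets its own copy of every message, which is what the SNS fan-out was doing on the AWS side.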
ElastiCache Redis to… something
This is the one I want to spend a paragraph on.
Azure Cache for Redis is the obvious mapping. We did not pick it. We deployed Valkey in-cluster instead, with replication, persistence on a managed disk, and anti-affinity rules between replicas.
Why? Three reasons. First, the cost ratio. For our scale (a primary plus one replica, used as a cache, not as a primary data store), an in-cluster Valkey deployment was significantly cheaper than the equivalent Azure Cache SKU. Second, the Redis relicensing in 2024 made Valkey the more comfortable long-term choice for our team. Third, Azure Cache for Redis private endpoints came with networking complexity we did not need: the in-cluster pod was already reachable from every other pod that mattered.
The lesson: the obvious managed service is the right answer when you do not care about cost or operational ownership. We cared about both. The architectural slot is “in-memory cache reachable by application pods”. The slot does not require a managed service to fill it.
When this is wrong: when you have stringent HA requirements that an in-cluster Redis fork cannot meet without a meaningful operational investment, when your data plane needs cross-region replication, when your security team will not accept stateful workloads in-cluster. We had none of those.
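For concreteness, the anti-affinity part of the in-cluster deployment looks like this. It is a sketch of the affinity block we pass through the chart's values; the label selector is an assumption, so match it to whatever labels your chart actually sets.

# Sketch: keep the Valkey primary and replica off the same node.
# The label selector is an assumption; adjust it to your chart.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: valkey
        topologyKey: kubernetes.io/hostname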
S3 + CloudFront to Blob Storage + Front Door
The architectural slot here is “static site hosting with CDN”. The translation looks clean. The gotchas are in the details.
CloudFront has Origin Access Identity to keep S3 buckets private. Front Door has a similar concept but the configuration model is different. You will spend an afternoon reading docs and getting it wrong, and another afternoon getting it right. Not a deal-breaker. Plan for it.
The bigger gotcha: CloudFront’s behavior when an origin returns 404 (you can configure custom error pages and rewrites declaratively at the distribution level) is more flexible than Front Door’s equivalent. For a React SPA that depends on serving index.html for any unknown path, the Front Door rule engine works but takes more setup.
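Roughly what the SPA fallback looks like in Terraform: one rule in a Front Door rule set that rewrites extension-less requests to /index.html. Treat this as a sketch against the azurerm provider's Front Door rule resource; the referenced rule set is illustrative, and the operator and block names are worth verifying against the provider docs for your version.

# Sketch: rewrite requests without a file extension to /index.html so the SPA's
# client-side routing works. Rule set and names are illustrative.
resource "azurerm_cdn_frontdoor_rule" "spa_fallback" {
  name                      = "spafallback"
  cdn_frontdoor_rule_set_id = azurerm_cdn_frontdoor_rule_set.frontend.id
  order                     = 1
  behavior_on_match         = "Continue"

  conditions {
    url_file_extension_condition {
      operator     = "LessThan"
      match_values = ["1"]
    }
  }

  actions {
    url_rewrite_action {
      source_pattern          = "/"
      destination             = "/index.html"
      preserve_unmatched_path = false
    }
  }
}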
KMS + Parameter Store to Key Vault
The mapping is clean here. The gotcha is the secret rotation model. Parameter Store SecureString rotation is something you build with Lambda. Key Vault has its own rotation framework and a different mental model. If you wrote rotation logic against Parameter Store, you are rewriting it. If you used the basic “store secret, retrieve secret” pattern, the translation is straightforward.
For our 150+ secrets, we used External Secrets Operator on the Kubernetes side. ESO supports both AWS Secrets Manager (for the AWS cluster) and Key Vault (for the Azure cluster) as backends. The application code does not know which cloud it is on. The Helm values reference an ESO ExternalSecret; the operator pulls from the right backend based on the cluster it runs in. This is the only place where we did introduce an abstraction, and it paid for itself by week three.
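From the chart's point of view, the abstraction is one template. A sketch with illustrative secret names; the secretStoreRef name is the only cloud-specific value, and it comes from the values file.

# Sketch: the same ExternalSecret template renders on both clusters.
# Secret names are illustrative; only the store name is cloud-specific.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: api-db-credentials
  namespace: apps
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: azure-keyvault        # "aws-secrets-manager" on the AWS cluster, via values
  target:
    name: api-db-credentials    # the Kubernetes Secret the pod actually mounts
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: api-database-url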
Decision 4: Identity is the foundation, design it first
On AWS, our pattern was IRSA: Kubernetes service accounts annotated with an IAM role ARN, with the EKS cluster’s OIDC provider trusted by each role. Per-app role, scoped permissions, well-understood.
On Azure, the equivalent is Workload Identity (the modern one, not the deprecated Pod Identity). The mental model is similar but the moving parts are different. You create a managed identity (or, in our case, a federated credential on an existing identity), you trust the AKS cluster’s OIDC issuer URL, and you annotate the service account.
The end-to-end chain looks like this:
flowchart LR
GH["GitHub Actions workflow"] -->|OIDC token| FED["Federated credential on Azure AD app"]
FED -->|assumes| MI["Managed Identity api"]
MI -->|annotated on| SA["Kubernetes ServiceAccount"]
SA -->|projected into| POD["Pod api"]
POD -->|RBAC| KV["Azure Key Vault"]
POD -->|RBAC| SB["Service Bus"]
POD -->|RBAC| BLOB["Blob Storage"]
The federated credential block in Terraform that ties it all together:
resource "azurerm_federated_identity_credential" "api" {
name = "api-federated"
resource_group_name = azurerm_resource_group.platform.name
parent_id = azurerm_user_assigned_identity.api.id
audience = ["api://AzureADTokenExchange"]
issuer = azurerm_kubernetes_cluster.aks.oidc_issuer_url
subject = "system:serviceaccount:apps:api"
}
That subject line is the contract: the AKS service account api in the apps namespace can assume this managed identity. Get the namespace or service account name wrong and the pod silently fails to authenticate, with an error message vague enough to take half a day to diagnose.
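The Kubernetes side of that contract, for reference. The annotation carries the managed identity's client ID (the placeholder from the values excerpt earlier), and the pod label is what opts the pod into token projection; miss either one and you get the same silent authentication failure.

# Sketch: the Kubernetes half of the federation contract.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: api
  namespace: apps
  annotations:
    azure.workload.identity/client-id: 8a7b6c5d-1234-...

# ...and on the Deployment's pod template metadata, the opt-in label:
#   labels:
#     azure.workload.identity/use: "true"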
We hit a problem on day three of the migration that took two days to find. There were three Azure AD app registrations involved in the GitHub Actions OIDC setup, two of which were duplicates created during earlier exploration. Federated credentials had to be added to all three for the deploy pipelines to work consistently, because different parts of the deployment process were using different ones.
The lesson: when you set up identity federation in Azure, audit the app registrations and managed identities first. Delete the duplicates. Pick one canonical identity per role and stick to it. The cost of getting this wrong is hours of “this works locally but not in CI” debugging.
The second lesson: design the identity story end-to-end before you write any Terraform. From “GitHub Actions runs Terraform” to “the pod calls Service Bus”, every step is an identity decision. If you make those decisions reactively, app by app, you will end up with three managed identities for things that should have been one, and you will spend a week figuring out which is which.
Decision 5: Networking and TLS
The traffic shape on the Azure side ended up looking like this:
flowchart LR
USER([User]) --> FD["Front Door app.example.com"]
USER --> APIM["APIM Consumption api.example.com"]
FD -->|static site| BLOB[(Blob Storage)]
APIM -.->|WAF debt| AGW["App Gateway + WAF (OWASP 3.2)"]
APIM ==>|currently bypasses| NGINX["ingress-nginx in AKS"]
AGW -.-> NGINX
NGINX --> APIPOD[api]
NGINX --> ING[ingest]
NGINX --> REP[reporter]
DNS delegation from Route53 to Azure DNS for the customer subdomain was the cleanest part of the migration. Add an NS record in Route53, the new subdomain resolves into Azure, life goes on.
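The record itself is a one-liner in Terraform. A sketch with an illustrative zone name; in practice the NS record lives in the AWS state and reads the Azure zone's name servers from an output, but both sides are shown together here for brevity.

# Sketch: delegate customer.example.com from Route53 to Azure DNS.
# Zone names are illustrative.
resource "azurerm_dns_zone" "customer" {
  name                = "customer.example.com"
  resource_group_name = azurerm_resource_group.platform.name
}

resource "aws_route53_record" "customer_delegation" {
  zone_id = data.aws_route53_zone.example_com.zone_id
  name    = "customer.example.com"
  type    = "NS"
  ttl     = 3600
  records = azurerm_dns_zone.customer.name_servers
}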
TLS was less clean. We picked cert-manager with Let’s Encrypt and DNS-01 challenges (since the apps are behind ingress-nginx and we wanted automated cert issuance). DNS-01 in Azure requires a dedicated service principal or managed identity with DNS Zone Contributor rights on the zone. Setting that up correctly takes longer than the docs imply. Once it is set up, it is hands-off.
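The Azure-side prerequisite is the role assignment on the zone, plus a federated credential for cert-manager's service account with the same shape as the api one from Decision 4. A sketch with illustrative identity names:

# Sketch: cert-manager's identity needs DNS Zone Contributor on the zone it
# writes ACME TXT records into. Identity and zone names are illustrative.
resource "azurerm_user_assigned_identity" "cert_manager" {
  name                = "cert-manager"
  location            = azurerm_resource_group.platform.location
  resource_group_name = azurerm_resource_group.platform.name
}

resource "azurerm_role_assignment" "cert_manager_dns" {
  scope                = azurerm_dns_zone.customer.id
  role_definition_name = "DNS Zone Contributor"
  principal_id         = azurerm_user_assigned_identity.cert_manager.principal_id
}

# Plus a federated credential trusting
# system:serviceaccount:cert-manager:cert-manager, same shape as the api one.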
The cert-manager Issuer that ties cert-manager to Azure DNS via Workload Identity:
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
      - dns01:
          azureDNS:
            subscriptionID: ${SUBSCRIPTION_ID}
            resourceGroupName: rg-platform-prod
            hostedZoneName: customer.example.com
            environment: AzurePublicCloud
            managedIdentity:
              clientID: ${CERT_MANAGER_IDENTITY_CLIENT_ID}
The managedIdentity.clientID field is the bit that took an hour of trial and error to find. The cert-manager docs mention it but bury it under three other auth modes that no longer apply on AKS with Workload Identity.
For the static site, Front Door manages its own TLS certificate. For the API endpoints, cert-manager handles them. Two systems, two trust roots, both working, neither one talking to the other. That is fine. The architectural cost of unifying them was not worth the simplicity gain.
The deferred decision was the App Gateway WAF in front of the AKS ingress. We deployed App Gateway with OWASP 3.2 in Prevention mode, but at the time of writing, the API hostname still bypasses it and goes directly to the nginx LoadBalancer. The TLS handoff between Front Door, App Gateway, and ingress-nginx required certificate sync logic we did not have time to build properly. This is a real architectural debt. We knew it on day one. We shipped without resolving it because the alternative was missing the contractual deadline.
Decision 6: The pipeline strategy
The AWS pipelines were one GitHub Actions workflow per app, triggered by pushes to main. For Azure, we created parallel workflows: pipeline-azure.yml per app, triggered by pushes to a dedicated branch (azure for most apps, azure-no-sql for one that had a longer Azure-specific code path).
flowchart LR
DEV[Developer commit] --> BR{Target?}
BR -->|main| MAIN[("main branch")]
BR -->|azure| AZBR[("azure branch")]
MAIN --> PWS["pipeline-aws.yml"]
AZBR --> PWZ["pipeline-azure.yml"]
PWS --> ECR[ECR push]
ECR --> ARGOAWS[ArgoCD aws-prod]
PWZ --> ACR[ACR push]
ACR --> ARGOAZ[ArgoCD azure-prod]
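The only structurally interesting parts of pipeline-azure.yml are the trigger and the login; the rest mirrors the AWS workflow. A sketch, assuming OIDC federation to the canonical app registration, with illustrative registry and secret names.

# pipeline-azure.yml (sketch; registry and secret names are illustrative)
name: deploy-azure
on:
  push:
    branches: [azure]

permissions:
  id-token: write   # required for OIDC federation to Azure
  contents: read

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      - run: az acr login --name platformacr
      - run: |
          docker build -t platformacr.azurecr.io/api:${{ github.sha }} .
          docker push platformacr.azurecr.io/api:${{ github.sha }}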
The branching feels primitive. It works. The alternative was a single workflow that does both, gated by some target detection, and we did not want the blast radius of “a bug in the AWS deploy step accidentally triggers the Azure deploy”.
The cost is keeping two branches in sync, which is real but cheap because both branches mostly receive the same commits. The benefit is operational isolation: a broken Azure pipeline does not block AWS deploys, and vice versa.
When this would not be the right choice: a team where the application code diverges meaningfully between clouds. We do not have that. The application is portable. The deploy mechanism is what differs.
What I would do differently
Three things, in order of importance.
Design identity first. I would draw the full identity federation diagram on day one (GitHub Actions to Terraform to AKS to managed identity to Service Bus / Key Vault / etc.) and pick the canonical app registrations before writing any Terraform. The two days lost to duplicate AAD apps were two days I will not get back.
Pick the region by feature parity, not geography. The pivot from a smaller Azure region to a major European one was the right call but was made in week two instead of week zero. If I had asked the right questions earlier (does this region have feature X, what is the partner network, what are quota response times), I would have started in the major region.
Defer the WAF decision explicitly. The App Gateway WAF debt is still live. If I were doing this again, I would either commit to making it work in week one, or explicitly defer it with a documented plan for how to ship without it. The half-deployed state we ended up in is the worst of both worlds.
Takeaway
A migration like this is not a translation problem. It is an architectural problem dressed up as a translation problem. The mapping table tells you which slot each service fills. It does not tell you which slots you can leave empty, which ones need a different shape, or which ones are better filled by something the other cloud also has but that the table does not surface.
The translate-do-not-abstract decision was the single biggest payoff. Two clean stacks, two state files, two pipeline trees, one application codebase. The duplication cost was real and small. The complexity savings were real and large.
The migration shipped on time. The product runs in two clouds. The customer is happy. There is real architectural debt left over (the WAF, in particular). All of that is normal. Migrations that ship on time always have debt. The trick is being honest about which debt you are taking on, and writing it down somewhere your future self will find it.
