So, you’re doing the “right thing.” You’re preparing your AKS cluster for the future by enabling the OIDC Issuer and Workload Identity. You haven’t even migrated your apps to use Federated Identity yet—you’re still rocking the classic Azure Pod Identity (or just standard Service Accounts). No harm, no foul, right?
Wrong.
As soon as you flip the switch on OIDC, Kubernetes changes the fundamental way it treats Service Account tokens. If you have long-running batch jobs (like Airflow workers, Spark jobs, or long-polling sensors), you might be walking into a 401 Unauthorized trap.
The “Gotcha”: Token Lifespan
Before OIDC enablement, your pods likely used legacy tokens. These were static, long-lived (often valid for ~1 year), and lived as simple secrets. They were the “set it and forget it” of the auth world.
How do you know if you are using the OIDC tokens? Inspect the token in your container at /var/run/secrets/kubernetes.io/serviceaccount/token.
If the audience contains xyz.oic.<env>-aks.azure.com, it's an OIDC-issued token, even though you have not implemented Workload Identity yet.
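You can eyeball this from inside the container with a generic JWT decode (no Azure tooling required); the aud and exp claims tell you what kind of token you are holding:

# Decode the payload (second JWT segment) of the mounted token.
# base64 may grumble about missing padding; the JSON is still readable.
cat /var/run/secrets/kubernetes.io/serviceaccount/token \
  | cut -d '.' -f2 | tr '_-' '/+' | base64 -d 2>/dev/null; echo

A bound projected token will typically show an exp claim roughly an hour after iat, which is the whole problem described below.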
The Moment You Enable OIDC/Workload Identity: AKS shifts to Bound Projected Tokens. These are significantly more secure but come with a strict catch: The default expiration is 1 hour (3600 seconds).
If your app starts a session and doesn’t explicitly refresh that token, it will expire 60 minutes later. For a 4-hour batch job or a persistent sensor, this means your app will work perfectly… until it suddenly doesn’t.
Why It’s Sneaky
Azure Identity Still Works: Your connection to Key Vault or Storage via Pod Identity stays up.
The K8s API Fails: Only the calls within the cluster (like checking the status of another pod or a SparkApplication CRD) start throwing 401s.
It’s a Time Bomb: Everything looks fine in your 10-minute dev test. The failure only triggers in Production when the job hits the 61st minute or the token expires mid-process.
The Quick Fix: The 24-Hour Band-Aid
If you aren’t ready to refactor your code to handle token rotation (which is the “real” fix), you can manually override the token lifespan using a Projected Volume in your Deployment or StatefulSet.
By mounting a custom token, you can extend that 1-hour window to something more batch-friendly, like 24 hours.
The Workaround YAML
You need to disable the automatic token mount and provide your own via volumes and volumeMounts.
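A minimal sketch of what that can look like in a Deployment pod template, assuming your API server accepts a 24-hour expirationSeconds (the volume name kube-api-access and container name worker are placeholders):

spec:
  template:
    spec:
      automountServiceAccountToken: false   # stop the default 1h token mount
      containers:
        - name: worker
          # ... your existing container spec ...
          volumeMounts:
            - name: kube-api-access
              mountPath: /var/run/secrets/kubernetes.io/serviceaccount
              readOnly: true
      volumes:
        - name: kube-api-access
          projected:
            sources:
              - serviceAccountToken:
                  path: token
                  expirationSeconds: 86400   # 24h instead of the default 3600
              - configMap:                   # keep the API server CA available
                  name: kube-root-ca.crt
                  items:
                    - key: ca.crt
                      path: ca.crt
              - downwardAPI:                 # keep the namespace file available
                  items:
                    - path: namespace
                      fieldRef:
                        fieldPath: metadata.namespace

The configMap and downwardAPI sources mirror what the default kube-api-access volume provides, so client libraries that expect ca.crt and namespace keep working unchanged.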
While the 24-hour token buys you time, it’s a temporary safety net. Microsoft and the Kubernetes community are pushing for shorter token lifespans (AKS 1.33+ will likely enforce this more strictly).
Your to-do list:
Upgrade your SDKs: Modern Kubernetes clients (and Airflow providers) have built-in logic to reload tokens from the disk when they change.
Avoid Persistent Clients: Instead of one long-lived client object, initialize the client inside your retry loops.
Go All In: Finish the migration to Azure Workload Identity and move away from Pod Identity entirely.
Don’t let a security “improvement” become your next P1 incident. Check your batch job durations today!
TL;DR: Pin versions, set sane resources, respect system-node taints, make Gatekeeper happy, don’t double-encode secrets, and mirror images (never pull from public registries and blindly trust them).
Works great on AKS, EKS, GKE — examples below use AKS.
The default DynaKube template that Dynatrace provides will probably not work in the real world: you have zero trust, Calico firewalls, OPA Gatekeeper and perhaps some system pool taints.
Quick checks (healthy install):
dynatrace-operator Deployment is Ready
2x dynatrace-webhook pods
dynatrace-oneagent-csi-driver DaemonSet on every node (incl. system)
OneAgent pods per node (incl. system)
1x ActiveGate StatefulSet ready
Optional OTEL collector running if you enabled it
k get dynakube
NAME APIURL STATUS AGE
xxx-prd-xxxxxxxx https://xxx.live.dynatrace.com/api Running 13d
kubectl -n dynatrace get deploy,sts
# CSI & OneAgent on all nodes
kubectl -n dynatrace get ds
# Dynakube CR status
kubectl -n dynatrace get dynakube -o wide
# RBAC sanity for k8s monitoring
kubectl auth can-i list dynakubes.dynatrace.com \
--as=system:serviceaccount:dynatrace:dynatrace-kubernetes-monitoring --all-namespaces
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/dynatrace-operator 1/1 1 1 232d
deployment.apps/dynatrace-webhook 2/2 2 2 13d
NAME READY AGE
statefulset.apps/xxx-prd-xxxxxxxxxxx-activegate 1/1 13d
statefulset.apps/xxx-prd-xxxxxxxxxxx-otel-collector 1/1 13d
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
xxx-prd-xxxxxxxxxxx-oneagent 9 9 9 9 9 <none> 13d
dynatrace-oneagent-csi-driver 9 9 9 9 9 <none> 13d
NAME APIURL STATUS AGE
xxx-prd-xxxxxxxxxxx https://xxx.live.dynatrace.com/api Running 13d
yes
Here are field-tested tips to keep Dynatrace humming on Kubernetes without fighting OPA Gatekeeper, seccomp, or AKS quirks.
1) Start with a clean Dynakube spec (and pin your versions)
Pin your operator chart/image and treat upgrades as real changes (PRs, changelog, Argo sync-waves). A lean cloudNativeFullStack baseline that plays nicely with Gatekeeper:
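Something along these lines; this is a hedged sketch rather than a drop-in spec, since the name, apiUrl and exact field paths depend on your operator/CRD version, so diff it against kubectl explain dynakube.spec:

apiVersion: dynatrace.com/v1beta3
kind: DynaKube
metadata:
  name: xxx-prd-xxxxxxxx
  namespace: dynatrace
  annotations:
    feature.dynatrace.com/init-container-seccomp-profile: "true"   # keeps PSA/Gatekeeper green
spec:
  apiUrl: https://xxx.live.dynatrace.com/api
  oneAgent:
    cloudNativeFullStack:
      tolerations:
        - key: CriticalAddonsOnly                      # AKS system pool taint
          operator: Exists
          effect: NoSchedule
        - key: node-role.kubernetes.io/control-plane   # just in case
          operator: Exists
          effect: NoSchedule
      oneAgentResources:
        requests: { cpu: 100m, memory: 512Mi }
        limits: { cpu: 300m, memory: 1.5Gi }
  activeGate:
    capabilities:
      - kubernetes-monitoring
      - routing
    resources:
      requests: { cpu: 500m, memory: 1.5Gi }
      limits: { cpu: 1000m, memory: 1.5Gi }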
Why this works: it respects control-plane taints, adds the CriticalAddonsOnly toleration for system pools, sets reasonable resource bounds, and preps you for GitOps.
2) System node pools are sacred — add the toleration
If your CSI Driver or OneAgent skips system nodes, your visibility and injection can be patchy. Make sure you’ve got:
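Something like this toleration on the OneAgent spec (CriticalAddonsOnly is the usual AKS system pool taint and is an assumption here; also check where the CSI driver DaemonSet gets its tolerations in your install):

tolerations:
  - key: CriticalAddonsOnly
    operator: Exists
    effect: NoSchedule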
Your taints might be different, so check what taints you have on your system pools. This is the difference between “almost there” and “golden”.
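To see exactly what you are working with:

# List every node with its taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints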
3) Resource requests that won’t sandbag the cluster
OneAgent: requests: cpu 100m / mem 512Mi and limits: cpu 300m / mem 1.5Gi are a good starting point for mixed workloads.
ActiveGate: requests: 500m / 1.5Gi, limits: 1000m / 1.5Gi. Tune against your SLOs and node shapes; don’t be shy to profile and trim.
4) Make Gatekeeper your mate (OPA policies that help, not hinder)
Enforce the seccomp hint on DynaKube CRs (so the operator sets profiles on init containers and your PSA/Gatekeeper policies stay green).
ConstraintTemplate (checks DynaKube annotations):
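A sketch of what such a template can look like; the template/constraint names and message are made up here, so adapt the Rego to your own Gatekeeper conventions:

apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: dynakuberequireseccomp
spec:
  crd:
    spec:
      names:
        kind: DynakubeRequireSeccomp
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package dynakuberequireseccomp

        # Flag any DynaKube that does not opt in to the seccomp init-container profile
        violation[{"msg": msg}] {
          input.review.kind.kind == "DynaKube"
          annotations := object.get(input.review.object.metadata, "annotations", {})
          object.get(annotations, "feature.dynatrace.com/init-container-seccomp-profile", "") != "true"
          msg := "DynaKube must set feature.dynatrace.com/init-container-seccomp-profile: \"true\""
        }

Pair it with a Constraint whose match scopes to the dynatrace.com API group so it only evaluates DynaKube objects.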
5) Secrets: avoid the dreaded encode (akv2k8s tip)
Kubernetes Secret.data is base64 on the wire, but tools like akv2k8s can also hand you values that are already base64, leaving you with a double-encoded token. If you use such a tool, configure it to transform (decode) the output before it lands in the Secret.
This ensures Dynatrace can read the Kubernetes Opaque secret as is, with no extra layer of base64 encoding on the value.
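The akv2k8s transform config is not reproduced here, but you can verify the end result quickly (secret and key names below are placeholders). One decode should reveal the real API token; if it spits out another base64 blob, you have the double-encoding problem:

kubectl -n dynatrace get secret my-dynatrace-tokens \
  -o jsonpath='{.data.apiToken}' | base64 -d; echo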
6) Mirror images to your registry (and pin)
Air-gapping or just speeding up pulls? Mirror dynatrace-operator, activegate, dynatrace-otel-collector into your ACR/ECR/GCR and reference them via the Dynakube templates.*.imageRef blocks or Helm values. GitOps + private registry = fewer surprises.
We use ACR Cache.
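For a one-off mirror, az acr import copies the image server-side into your ACR; the registry name, source path and tag below are illustrative, so use the path and pinned tag from your current deployment (ACR Cache rules achieve the same thing on demand):

# Copy the operator image into your own ACR and keep the tag pinned
az acr import --name myregistry \
  --source public.ecr.aws/dynatrace/dynatrace-operator:v1.7.0 \
  --image dynatrace/dynatrace-operator:v1.7.0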
7) RBAC: fix the “list dynakubes permission is missing” warning
If you see that warning in the UI, verify the service account:
kubectl auth can-i list dynakubes.dynatrace.com \
  --as=system:serviceaccount:dynatrace:dynatrace-kubernetes-monitoring --all-namespaces
If “no”, ensure the chart installed/updated the ClusterRole and ClusterRoleBinding that grant list/watch/get on dynakubes.dynatrace.com. Sometimes upgrading the operator or re-syncing RBAC via Helm/Argo cleans it up.
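If you need to patch it by hand while waiting on a chart fix, the missing RBAC is plain ClusterRole/ClusterRoleBinding wiring (object names here are illustrative; your chart will have its own):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: dynatrace-kubernetes-monitoring-dynakubes
rules:
  - apiGroups: ["dynatrace.com"]
    resources: ["dynakubes"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: dynatrace-kubernetes-monitoring-dynakubes
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: dynatrace-kubernetes-monitoring-dynakubes
subjects:
  - kind: ServiceAccount
    name: dynatrace-kubernetes-monitoring
    namespace: dynatrace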
When you install the Dynatrace Operator, you’ll see pods named something like dynatrace-webhook-xxxxx. They back one or more admission webhook configurations. In practice they do three big jobs:
Mutating Pods for OneAgent injection
Adds init containers / volume mounts / env vars so your app Pods load the OneAgent bits that come from the CSI driver.
Ensures the right binaries and libraries are available (e.g., via mounted volumes) and the process gets the proper preload/agent settings.
Respects opt-in/opt-out annotations/labels on namespaces and Pods (e.g. dynatrace.com/inject: "false" to skip a Pod; see the sketch after this list).
Can also add Dynatrace metadata enrichment env/labels so the platform sees k8s context (workload, namespace, node, etc.).
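For example, to keep a single workload out of injection, the opt-out can sit on the Pod template metadata; this uses the annotation key mentioned above, so double-check the exact key against the operator docs for your version:

metadata:
  annotations:
    dynatrace.com/inject: "false"   # skip OneAgent injection for this Pod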
Validating Dynatrace CRs (like DynaKube)
Schema and consistency checks: catches bad combinations (e.g., missing fields, wrong mode), so you don’t admit a broken config.
Helps avoid partial/failed rollouts by rejecting misconfigured specs early.
Hardening/compatibility tweaks
With certain features enabled, the mutating webhook helps ensure injected init containers comply with cluster policies (e.g., seccomp, PSA/PSS).
That’s why we recommend the annotation you’ve been using: feature.dynatrace.com/init-container-seccomp-profile: "true". It keeps Gatekeeper/PSA happy when it inspects the injected bits.
Why two dynatrace-webhook pods?
High availability for admission traffic. If one goes down, the other still serves the API server’s webhook calls.
How this ties into Gatekeeper/PSA
Gatekeeper (OPA) also uses validating admission.
The Dynatrace mutating webhook will first shape the Pod (add mounts/env/init).
Gatekeeper then validates the final Pod spec.
If you’re enforcing “must have seccomp/resources,” ensure Dynatrace’s injected init/sidecar also satisfies those rules (hence that seccomp annotation and resource limits you’ve set).
Dynatrace Active Gate
A Dynatrace ActiveGate acts as a secure proxy between Dynatrace OneAgents and Dynatrace Clusters, or between OneAgents and other ActiveGates closer to the Dynatrace Cluster. It establishes a Dynatrace presence in your local network, reducing your interaction with Dynatrace to a single, locally available point. Besides convenience, this optimizes traffic volume, reduces network complexity and cost, and keeps sealed networks secure.
The docs on Active Gate and version compatibility with Dynakube are not yet mature. Ensure the following:
With Dynatrace Operator 1.7 the v1beta1 and v1beta2 API versions for the DynaKube custom resource were removed.
ActiveGates up to and including version 1.323 used to call the v1beta1 endpoint. Starting from ActiveGate 1.325, the DynaKube endpoint was changed to v1beta3. Ensure your ActiveGate is up to date with the latest version.
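You can check which DynaKube API versions your cluster currently serves with:

# Lists the API versions exposed by the DynaKube CRD and whether they are served
kubectl get crd dynakubes.dynatrace.com \
  -o jsonpath='{range .spec.versions[*]}{.name}{"\t"}served={.served}{"\n"}{end}'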
As part of our ongoing platform reliability work, we’ve introduced explicit CPU and memory requests/limits for all Dynatrace components running on AKS.
🧩 Why it matters
Previously, the OneAgent and ActiveGate pods relied on Kubernetes’ default scheduling behaviour. This meant:
No guaranteed CPU/memory allocation → possible throttling or eviction during cluster load spikes.
Risk of noisy-neighbour effects on shared nodes.
Unpredictable autoscaling signals and Dynatrace performance fluctuations.
Setting requests and limits gives the scheduler clear boundaries:
Requests = guaranteed resources for stable operation
Limits = hard ceiling to prevent runaway usage
Helps Dynatrace collect telemetry without starving app workloads
These values were tuned from observed averages across DEV, UAT and PROD clusters. They provide a safe baseline—enough headroom for spikes while keeping node utilisation predictable.
When a product has proved to be a success and has just come out of an MVP (Minimum Viable Product) or MMP (Minimal Marketable Product) state, a lot of corners will usually have been cut to get the product out and act on valuable feedback. So inevitably there will be technical debt to take care of.
What is important is having a technical vision that reduces cost and delivers value and impact in a scalable, resilient and reliable way, and that can be communicated to all stakeholders.
A lot of cost savings can be made when scaling out by putting together a Cloud Architecture Roadmap. The roadmap can then be communicated to your stakeholders, development teams and, most importantly, finance. It provides a high-level “map” of where you are now and where you want to be at some point in the future.
A roadmap is ever changing, just like when my wife and I go travelling around the world. We will have a roadmap of where we want to go for a year but are open to making changes halfway through the trip, e.g. an earthquake hits a country we planned to visit. The same is true in IT: sometimes budgets are cut or a budget surplus needs to be consumed, and such events can affect your roadmap.
It is something that you want to review on a regular schedule. Most importantly you want to communicate the roadmap and get feedback from others.
Feedback from other engineers and stakeholders is crucial: they may spot something that you did not, or suggest better alternative solutions.
Decomposition
The first stage is to decompose your ideas. Below is a list that helps get me started in the right direction. This is by no means an exhaustive list; it will differ based on your industry.
Component | Description | Example
Application Run-time | Where apps are hosted | Azure Kubernetes
Persistent Storage | Non-Volatile Data | File Store, Block Store, Object Store, CDN, Message, Database, Cache
Once you have an idea of all your components, the next step is to break down your roadmap into milestones that will ultimately assist in reaching your final/target state. Which of course will not be final in a few years’ time 😉 or even months!
Sample Roadmap
Below is a link to a Google Slides presentation that you can use for your roadmap.