Tag: Technology

Dynatrace on Kubernetes: Tips from the trenches (AKS + Gatekeeper + Policy)

TL;DR: Pin versions, set sane resources, respect system-node taints, make Gatekeeper happy, avoid double-encoding secrets, and mirror images (never pull from public registries and blindly trust them).

Works great on AKS, EKS, GKE — examples below use AKS.

The default DynaKube template that Dynatrace provides will probably not survive contact with the real world. You have zero trust, Calico firewalls, OPA Gatekeeper, and perhaps some system-pool taints, right?

Quick checks (healthy install):

  • dynatrace-operator Deployment is Ready
  • 2x dynatrace-webhook pods
  • dynatrace-oneagent-csi-driver DaemonSet on every node (incl. system)
  • OneAgent pods per node (incl. system)
  • 1x ActiveGate StatefulSet ready
  • Optional OTEL collector running if you enabled it
# DynaKube CR status
kubectl -n dynatrace get dynakube -o wide
NAME                  APIURL                                STATUS    AGE
xxx-prd-xxxxxxxx      https://xxx.live.dynatrace.com/api    Running   13d

# Operator, webhook, ActiveGate, OTEL collector
kubectl -n dynatrace get deploy,sts
NAME                                 READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/dynatrace-operator   1/1     1            1           232d
deployment.apps/dynatrace-webhook    2/2     2            2           13d

NAME                                                  READY   AGE
statefulset.apps/xxx-prd-xxxxxxxxxxx-activegate       1/1     13d
statefulset.apps/xxx-prd-xxxxxxxxxxx-otel-collector   1/1     13d

# CSI & OneAgent on all nodes
kubectl -n dynatrace get ds
NAME                            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
xxx-prd-xxxxxxxxxxx-oneagent    9         9         9       9            9           <none>          13d
dynatrace-oneagent-csi-driver   9         9         9       9            9           <none>          13d

# RBAC sanity for k8s monitoring
kubectl auth can-i list dynakubes.dynatrace.com \
  --as=system:serviceaccount:dynatrace:dynatrace-kubernetes-monitoring --all-namespaces
yes

Here are field-tested tips to keep Dynatrace humming on Kubernetes without fighting OPA Gatekeeper, seccomp, or AKS quirks.

1) Start with a clean Dynakube spec (and pin your versions)

Pin your operator chart/image and treat upgrades as real changes (PRs, changelog, Argo sync-waves). A lean cloudNativeFullStack baseline that plays nicely with Gatekeeper:

apiVersion: dynatrace.com/v1beta5
kind: DynaKube
metadata:
  name: dynakube-main
  namespace: dynatrace
  labels:
    dynatrace.com/created-by: "dynatrace.kubernetes"
  annotations:
    # Helps Gatekeeper/PSA by ensuring init containers use a seccomp profile
    feature.dynatrace.com/init-container-seccomp-profile: "true"
    # GitOps safety
    argocd.argoproj.io/sync-wave: "5"
    argocd.argoproj.io/sync-options: SkipDryRunOnMissingResource=true
spec:
  apiUrl: https://<your-environment>.live.dynatrace.com/api
  metadataEnrichment:
    enabled: true

  oneAgent:
    hostGroup: PaaS_Development   # pick a sensible naming scheme: PaaS_<Env>
    cloudNativeFullStack:
      tolerations:
        - key: node-role.kubernetes.io/master
          effect: NoSchedule
          operator: Exists
        - key: node-role.kubernetes.io/control-plane
          effect: NoSchedule
          operator: Exists
        - key: "CriticalAddonsOnly"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      oneAgentResources:
        requests:
          cpu: 100m
          memory: 512Mi
        limits:
          cpu: 300m
          memory: 1.5Gi

  activeGate:
    capabilities: [routing, kubernetes-monitoring, debugging]
    resources:
      requests:
        cpu: 500m
        memory: 1.5Gi
      limits:
        cpu: 1000m
        memory: 1.5Gi

  logMonitoring: {}
  telemetryIngest:
    protocols: [jaeger, otlp, statsd, zipkin]
    serviceName: telemetry-ingest

  templates:
    otelCollector:
      imageRef:
        repository: <your-acr>.azurecr.io/dynatrace/dynatrace-otel-collector
        tag: latest   # pin to a specific version in production

Why this works: it respects control-plane taints, adds the CriticalAddonsOnly toleration for system pools, sets reasonable resource bounds, and preps you for GitOps.


2) System node pools are sacred — add the toleration

If your CSI Driver or OneAgent skips system nodes, your visibility and injection can be patchy. Make sure you’ve got:

tolerations:
  - key: "CriticalAddonsOnly"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"

Your taints might be different, so check what taints you actually have on your system pools (quick check below). This is the difference between “almost there” and “golden”.
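
A quick way to list every node with its taints, using plain kubectl:

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'

On AKS, system pools typically carry CriticalAddonsOnly=true:NoSchedule; mirror whatever you see here into your tolerations.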

3) Resource requests that won’t sandbag the cluster

  • OneAgent: requests: cpu 100m / mem 512Mi and limits: cpu 300m / mem 1.5Gi are a good starting point for mixed workloads.
  • ActiveGate: requests: 500m / 1.5Gi, limits: 1000m / 1.5Gi.
    Tune off SLOs and node shapes; don’t be shy to profile and trim.

4) Make Gatekeeper your mate (OPA policies that help, not hinder)

Enforce the seccomp hint on DynaKube CRs (so the operator sets profiles on init containers and your PSA/Gatekeeper policies stay green).

ConstraintTemplate plus Constraint (checks DynaKube annotations). This is a minimal sketch; the template and kind names below are our own, so adapt them to your policy repo:
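
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: dynakuberequireseccomp
spec:
  crd:
    spec:
      names:
        kind: DynakubeRequireSeccomp
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package dynakuberequireseccomp

        violation[{"msg": msg}] {
          input.review.kind.kind == "DynaKube"
          annotations := object.get(input.review.object.metadata, "annotations", {})
          not annotations["feature.dynatrace.com/init-container-seccomp-profile"] == "true"
          msg := "DynaKube must set annotation feature.dynatrace.com/init-container-seccomp-profile to true"
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: DynakubeRequireSeccomp
metadata:
  name: dynakube-require-seccomp
spec:
  enforcementAction: dryrun   # flip to deny once the audit shows no violations
  match:
    kinds:
      - apiGroups: ["dynatrace.com"]
        kinds: ["DynaKube"]

Starting in dryrun lets the audit controller report violations without blocking the operator mid-rollout.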

5) Secrets: avoid the dreaded encode (akv2k8s tip)

Kubernetes Secret.data is base64 on the wire, but tools like akv2k8s can hand you values that are already base64-encoded, leaving you with a double-encoded secret. If you use akv2k8s, apply a transform to decode the output:

apiVersion: spv.no/v1
kind: AzureKeyVaultSecret
metadata:
  name: dynatrace-api-token-akvs
  namespace: dynatrace
spec:
  vault:
    name: kv-xxx-001
    object:
      name: DynatraceApiToken
      type: secret
  output:
    transform:
      - base64decode
    secret:
      name: aks-xxx-001
      type: Opaque
      dataKey: apiToken
---
apiVersion: spv.no/v1
kind: AzureKeyVaultSecret
metadata:
  name: dynatrace-dataingest-token-akvs
  namespace: dynatrace
spec:
  vault:
    name: kv-xxx-001
    object:
      name: DynatraceDataIngestToken
      type: secret
  output:
    transform:
      - base64decode
    secret:
      name: aks-xxx-001
      type: Opaque
      dataKey: dataIngestToken

This ensures Dynatrace can read the Kubernetes Opaque secret as-is, with no double base64 encoding on the secret.
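
To verify you didn’t double-encode, decode the resulting secret once; you should see a plain token (Dynatrace API tokens start with dt0c01), not another blob of base64:

kubectl -n dynatrace get secret aks-xxx-001 -o jsonpath='{.data.apiToken}' | base64 -d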

6) Mirror images to your registry (and pin)

Air-gapping or just speeding up pulls? Mirror dynatrace-operator, activegate, dynatrace-otel-collector into your ACR/ECR/GCR and reference them via the Dynakube templates.*.imageRef blocks or Helm values. GitOps + private registry = fewer surprises.

We use ACR Cache.
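
For a one-off copy, az acr import also works; the registry name, repo path, and tag here are illustrative:

az acr import --name <your-acr> \
  --source docker.io/dynatrace/dynatrace-operator:v1.7.0 \
  --image dynatrace/dynatrace-operator:v1.7.0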

7) RBAC: fix the “list dynakubes permission is missing” warning

If you see that warning in the UI, verify the service account:

# https://docs.dynatrace.com/docs/ingest-from/setup-on-k8s/reference/security
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: dynatrace-k8smon-extra-perms
rules:
  - apiGroups: ["dynatrace.com"]
    resources: ["dynakubes"]
    verbs: ["get","list","watch"]
  - apiGroups: [""]
    resources: ["configmaps","secrets"]
    verbs: ["get","list","watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: dynatrace-k8smon-extra-perms
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: dynatrace-k8smon-extra-perms
subjects:
  - kind: ServiceAccount
    name: dynatrace-kubernetes-monitoring
    namespace: dynatrace

kubectl auth can-i list dynakubes.dynatrace.com \
  --as=system:serviceaccount:dynatrace:dynatrace-kubernetes-monitoring --all-namespaces

If “no”, ensure the chart installed/updated the ClusterRole and ClusterRoleBinding that grant list/watch/get on dynakubes.dynatrace.com. Sometimes upgrading the operator or re-syncing RBAC via Helm/Argo cleans it up.

8) HostGroup naming that scales

Keep it boring and predictable:

PaaS_Development
PaaS_NonProduction
PaaS_Production

9) GitOps tricks (ArgoCD/Flux)

  • Use argocd.argoproj.io/sync-wave to ensure CRDs & operator land before Dynakube.
  • For major upgrades or URL/token churn:
    1. kubectl -n dynatrace delete dynakube <name>
    2. wait for operator cleanup
    3. sync the new spec (Force + Prune if needed); see the commands below.
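
That flow in commands (the Argo application name is a placeholder; adjust to your setup):

kubectl -n dynatrace delete dynakube <name>
kubectl -n dynatrace wait --for=delete dynakube/<name> --timeout=300s
argocd app sync <your-dynatrace-app> --force --prune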

10) Networking & egress

If you restrict egress, either:

  • Allow ActiveGate to route traffic out and keep workload egress tight; or
  • Allowlist Dynatrace SaaS endpoints directly.
    Don’t forget webhook call-backs and OTLP ports if you’re shipping traces/logs. A rough ActiveGate egress policy follows.
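
If you go the ActiveGate route under Calico/NetworkPolicy, the shape is roughly this; the pod label is an assumption, so verify what your operator version actually sets (kubectl -n dynatrace get pods --show-labels):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-activegate-egress
  namespace: dynatrace
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: activegate   # assumption: check your ActiveGate pod labels
  policyTypes: [Egress]
  egress:
    - ports:
        - port: 53    # DNS lookups for the SaaS endpoints
          protocol: UDP
    - ports:
        - port: 443   # Dynatrace SaaS / cluster traffic
          protocol: TCP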

11) Troubleshooting you’ll actually use

  • OneAgent not injecting? Check the CSI Driver DaemonSet and the node you’re scheduling on. Make sure tolerations cover system pools.
  • Pods crash-loop with sidecar errors? Often token/secret issues — confirm you didn’t double-encode.
  • UI shows “permission missing”? Re-check RBAC and chart version; reconcile with Helm/Argo.
  • Gatekeeper blocking? Dry-run constraints first (see the check below); add namespace/label-based exemptions for operator internals.
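
With enforcementAction: dryrun (as in the section 4 sketch), the audit controller reports violations without blocking anything. Check the tally, using the kind/name from that sketch:

kubectl get dynakuberequireseccomp dynakube-require-seccomp \
  -o jsonpath='{.status.totalViolations}'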

12) What “good” looks like

A healthy cluster shows:

  • dynatrace-operator 1/1
  • dynatrace-webhook 2/2
  • dynatrace-oneagent-csi-driver DESIRED == READY == node count
  • OneAgent pods present on all worker and system nodes
  • ActiveGate 1/1
  • Optional OTEL collector 1/1
    …and dashboards populating within minutes.

That’s it — keep it simple, pin your bits, let Gatekeeper help (not hurt), and your Dynatrace setup will surf smooth swells instead of close-outs.

Other useful commands – hardcore diagnosis

kubectl exec -n dynatrace deployment/dynatrace-operator -- dynatrace-operator support-archive --stdout > operator-support-archive.zip

What the Dynatrace webhooks do on Kubernetes

When you install the Dynatrace Operator, you’ll see pods named something like dynatrace-webhook-xxxxx. They back one or more admission webhook configurations. In practice they do three big jobs:

  1. Mutating Pods for OneAgent injection
    • Adds init containers / volume mounts / env vars so your app Pods load the OneAgent bits that come from the CSI driver.
    • Ensures the right binaries and libraries are available (e.g., via mounted volumes) and the process gets the proper preload/agent settings.
    • Respects opt-in/opt-out annotations/labels on namespaces and Pods (e.g. dynatrace.com/inject: "false" to skip a Pod; example after this list).
    • Can also add Dynatrace metadata enrichment env/labels so the platform sees k8s context (workload, namespace, node, etc.).
  2. Validating Dynatrace CRs (like DynaKube)
    • Schema and consistency checks: catches bad combinations (e.g., missing fields, wrong mode), so you don’t admit a broken config.
    • Helps avoid partial/failed rollouts by rejecting misconfigured specs early.
  3. Hardening/compatibility tweaks
    • With certain features enabled, the mutating webhook helps ensure injected init containers comply with cluster policies (e.g., seccomp, PSA/PSS).
    • That’s why we recommend the annotation you’ve been using:
      feature.dynatrace.com/init-container-seccomp-profile: "true"
      It keeps Gatekeeper/PSA happy when it inspects the injected bits.
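
The opt-out from point 1 in practice; a label on a Pod (or its namespace) tells the webhook to skip it. Names and image here are illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: my-app
  labels:
    dynatrace.com/inject: "false"   # webhook leaves this Pod alone
spec:
  containers:
    - name: app
      image: <your-image>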

Why two dynatrace-webhook pods?

  • High availability for admission traffic. If one goes down, the other still serves the API server’s webhook calls.

How this ties into Gatekeeper/PSA

  • Gatekeeper (OPA) also uses validating admission.
  • The Dynatrace mutating webhook will first shape the Pod (add mounts/env/init).
  • Gatekeeper then validates the final Pod spec.
  • If you’re enforcing “must have seccomp/resources,” ensure Dynatrace’s injected init/sidecar also satisfies those rules (hence that seccomp annotation and resource limits you’ve set).

Dynatrace ActiveGate

A Dynatrace ActiveGate acts as a secure proxy between Dynatrace OneAgents and Dynatrace Clusters or between Dynatrace OneAgents and other ActiveGates—those closer to the Dynatrace Cluster.
It establishes Dynatrace presence—in your local network. In this way it allows you to reduce your interaction with Dynatrace to one single point—available locally. Besides convenience, this solution optimizes traffic volume, reduces the complexity of the network and cost. It also ensures the security of sealed networks.

The docs on ActiveGate and version compatibility with DynaKube are not yet mature. Keep the following in mind:

With Dynatrace Operator 1.7, the v1beta1 and v1beta2 API versions for the DynaKube custom resource were removed.

ActiveGates up to and including version 1.323 called the v1beta1 endpoint; starting with ActiveGate 1.325, the DynaKube endpoint changed to v1beta3.
Ensure your ActiveGate is up to date with the latest version.
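
A quick way to see which versions you are actually running (image tags are a decent proxy; names assume the default install):

kubectl -n dynatrace get pods \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'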

Dynatrace CPU and Memory Requests and Limits

Sources:
https://docs.dynatrace.com/docs/ingest-from/setup-on-k8s/guides/deployment-and-configuration/resource-management/dto-resource-limits


https://community.dynatrace.com/t5/Troubleshooting/Troubleshooting-Kubernetes-CPU-Throttling-Problems-in-Dynatrace/ta-p/250345

As part of our ongoing platform reliability work, we’ve introduced explicit CPU and memory requests/limits for all Dynatrace components running on AKS.

🧩 Why it matters

Previously, the OneAgent and ActiveGate pods relied on Kubernetes’ default scheduling behaviour. This meant:

  • No guaranteed CPU/memory allocation → possible throttling or eviction during cluster load spikes.
  • Risk of noisy-neighbour effects on shared nodes.
  • Unpredictable autoscaling signals and Dynatrace performance fluctuations.

Setting requests and limits gives the scheduler clear boundaries:

  • Requests = guaranteed resources for stable operation
  • Limits = hard ceiling to prevent runaway usage
  • Helps Dynatrace collect telemetry without starving app workloads

⚙️ Updated configuration

OneAgent

oneAgentResources:
  requests:
    cpu: 100m
    memory: 512Mi
  limits:
    cpu: 300m
    memory: 1.5Gi

ActiveGate

resources:
  requests:
    cpu: 200m
    memory: 512Mi
  limits:
    cpu: 500m
    memory: 1Gi

These values were tuned from observed averages across DEV, UAT and PROD clusters. They provide a safe baseline—enough headroom for spikes while keeping node utilisation predictable.

🧠 Key takeaway

Explicit resource boundaries = fewer throttled agents, steadier telemetry, and happier nodes.

Other resources: explicit Helm values for the dynatrace-operator chart (operator, webhook, and CSI driver):

installCRD: true

operator:
  resources:
    requests:
      cpu: "50m"
      memory: "64Mi"
    limits:
      cpu: "100m"
      memory: "128Mi"

webhook:
  resources:
    requests:
      cpu: "150m"
      memory: "128Mi"
    limits:
      cpu: "300m"
      memory: "128Mi"

csidriver:
  csiInit:
    resources:
      requests:
        cpu: "50m"
        memory: "100Mi"
      limits:
        cpu: "50m"
        memory: "100Mi"
  server:
    resources:
      requests:
        cpu: "50m"
        memory: "100Mi"
      limits:
        cpu: "100m"
        memory: "100Mi"
  provisioner:
    resources:
      requests:
        cpu: "200m"
        memory: "100Mi"
      limits:
        cpu: "300m"
        memory: "100Mi"
  registrar:
    resources:
      requests:
        cpu: "20m"
        memory: "30Mi"
      limits:
        cpu: "30m"
        memory: "30Mi"
  livenessprobe:
    resources:
      requests:
        cpu: "20m"
        memory: "30Mi"
      limits:
        cpu: "30m"
        memory: "30Mi"

Dynakube

apiVersion: dynatrace.com/v1beta5
kind: DynaKube
metadata:
  name: xxx
  namespace: dynatrace
  labels:
    dynatrace.com/created-by: "dynatrace.kubernetes"
  annotations:
    feature.dynatrace.com/k8s-app-enabled: "true"
    argocd.argoproj.io/sync-wave: "5"
    argocd.argoproj.io/sync-options: SkipDryRunOnMissingResource=true
    feature.dynatrace.com/init-container-seccomp-profile: "true"
# Link to api reference for further information: https://docs.dynatrace.com/docs/ingest-from/setup-on-k8s/reference/dynakube-parameters
spec:
  apiUrl: https://xxx.live.dynatrace.com/api
  metadataEnrichment:
    enabled: true
  oneAgent:
    hostGroup: xxx
    cloudNativeFullStack:
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/master
          operator: Exists
        - effect: NoSchedule
          key: node-role.kubernetes.io/control-plane
          operator: Exists
        - effect: "NoSchedule"
          key: "CriticalAddonsOnly"
          operator: "Equal"
          value: "true"
      oneAgentResources:
        requests:
          cpu: 100m
          memory: 512Mi
        limits:
          cpu: 300m
          memory: 1.5Gi
  activeGate:
    capabilities:
      - routing
      - kubernetes-monitoring
      #- debugging
    resources:
      requests:
        cpu: 200m
        memory: 512Mi
      limits:
        cpu: 500m
        memory: 1Gi
  logMonitoring: {}
  telemetryIngest:
    protocols:
      - jaeger
      - otlp
      - statsd
      - zipkin
    serviceName: telemetry-ingest

  templates:
    otelCollector:
      imageRef:
        repository: xxx.azurecr.io/dynatrace/dynatrace-otel-collector
        tag: latest
      resources:
        requests:
          cpu: 150m
          memory: 256Mi
        limits:
          cpu: 500m
          memory: 1Gi

Creating a Cloud Architecture Roadmap


Overview

When a product has proven to be a success and has just come out of an MVP (Minimal Viable Product) or MMP (Minimal Marketable Product) state, a lot of corners will usually have been cut to get the product out and act on the valuable feedback. So inevitably there will be technical debt to take care of.

What is important is having a technical vision that reduces costs and delivers value and impact while staying scalable, resilient, and reliable, and that can be communicated to all stakeholders.

A lot of cost savings can be made when scaling out by putting together a Cloud Architecture Roadmap. The roadmap can then be communicated to your stakeholders, development teams and, most importantly, finance. It provides a high-level “map” of where you are now and where you want to be at some point in the future.

A roadmap is ever-changing, just like when my wife and I go travelling around the world. We will have a roadmap of where we want to go for a year but are open to making changes halfway through the trip, e.g. an earthquake hits a country we planned to visit. The same is true in IT: sometimes budgets are cut or a budget surplus needs to be consumed, and such events can affect your roadmap.

It is something that you want to review on a regular schedule. Most importantly you want to communicate the roadmap and get feedback from others.

Feedback from other engineers and stakeholders is crucial: they may spot something that you did not, or suggest better alternative solutions.

Decomposition

The first stage is to decompose your ideas. Below is a list that helps get me started in the right direction. This is by no means an exhaustive list; it will differ based on your industry.

  • Application Run-time (where apps are hosted): Azure Kubernetes Service
  • Persistent Storage (non-volatile data): file store, block store, object store, CDN, message queue, database, cache
  • Backup/Recovery (backup/redundant solutions): managed services, Azure OMS, recovery vaults, volume images, geo-redundancy
  • Data/IoT (connected devices/sensors): streaming analytics, Event Hubs, AI/machine learning
  • Gateway (how services are accessed): Azure Front Door, NGINX, Application Gateway, WAF, Kubernetes ingress controllers
  • Hybrid Connectivity (on-premise access, cross-cloud): ExpressRoute, jumpboxes, VPN, Citrix
  • Source Control / Build CI/CD (where code lives and is built): GitHub, Bitbucket, Azure DevOps, Octopus Deploy, Jenkins
  • Certificate Management (SSL certificates): Azure Key Vault, SSL offloading strategies
  • Secret Management (store sensitive configuration): Puppet (Hiera), Azure Key Vault, LastPass, 1Password
  • Mobile Device Management: Google Play, App Store, G-Suite Enterprise MDM, etc.

Once you have an idea of all your components, the next step is to break your roadmap down into milestones that will ultimately assist in reaching your final/target state. Which of course will not be final in a few years’ time 😉 or even months!

Sample Roadmap

Below is a link to a google slide presentation that you can use for your roadmap.

https://docs.google.com/presentation/d/1Hvw46vcWJyEW5b7o4Xet7jrrZ17Q0PVzQxJBzzmcn2U/edit?usp=sharing