Enterprise – Code-less composition in Azure using Terraform

When it comes to deploying enterprise environments in Azure, managing complexity effectively is crucial. The Microsoft Cloud Adoption Framework (CAF) for Azure advocates for using multiple state files to configure various landing zones—this helps to balance risk, manage lifecycles, and accommodate diverse team functions more effectively. The traditional challenges associated with managing multiple state files can be mitigated through what we call “code-less composition,” which is an innovative approach facilitated by Terraform.

What is Code-less Composition?

In the realm of Terraform, every state file potentially interacts with others. Traditionally, configuring these interactions required manual scripting, which could be error-prone and tedious. Code-less composition simplifies this by allowing state files’ outputs to be used as input variables for another landing zone without writing any lines of code.

This feature is particularly valuable in complex architectures where you need to manage dependencies and configurations across multiple landing zones automatically. Essentially, it allows for seamless and scalable infrastructure as code practices.

How Does It Work in Azure with Terraform?

Terraform facilitates this through a feature that reads the state file’s output from one landing zone and uses it as input for another. This process is implemented through a simple variable in the Terraform configuration, vastly simplifying the setup of complex configurations. Here’s a look at how you can utilize this in your Azure environment:

Example Configuration for a Management Landing Zone

Consider a management landing zone configured at level 1:

landingzone = {
  backend_type        = "azurerm"
  level               = "level1"
  key                 = "management"
  global_settings_key = "launchpad"
  tfstates = {
    launchpad = {
      tfstate   = "caf_launchpad.tfstate"
      workspace = "tfstate"
      level     = "lower"
    }
  }
}

In this configuration, tfstates is an object where you specify the Terraform state file to load. For instance, the launchpad object loads the caf_launchpad.tfstate from a workspace (or storage container) called tfstate located one level lower. This setup indicates that any objects within this landing zone can refer to objects deployed in the same or a lower deployment level.
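
To see how such a configuration is put to work, the sketch below assembles a rover command line for the level 1 management landing zone. It reuses flags that the rover pipeline in this document passes (-lz, -var-folder, -tfstate, -level, -env, -a). The directory names and the management.tfstate file name are hypothetical, and the function only prints the arguments rather than running rover.

```shell
#!/usr/bin/env bash
# Print the rover arguments for one landing zone, one argument per line.
# All paths and state file names passed in are hypothetical examples.
rover_args() {
  local lz_dir="$1" var_folder="$2" tfstate="$3" level="$4" env="$5" action="$6"
  local args=(
    -lz "$lz_dir"
    -var-folder "$var_folder"
    -tfstate "$tfstate"
    -level "$level"
    -env "$env"
    -a "$action"
  )
  printf '%s\n' "${args[@]}"
}

# Example: collect the arguments into an array before handing them to rover.
mapfile -t management_args < <(
  rover_args ./landingzones/caf_solution ./configuration/level1/management \
    management.tfstate level1 Test plan
)
```

The array can then be passed on as "${management_args[@]}", which keeps each argument intact even if a path contains spaces.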

Referencing Resources Across Levels

For deploying resources that depend on configurations from another level, you can reference the necessary elements directly through your configurations:

automations = {
  account1 = {
    name = "automationAccount1"
    sku  = "Basic"
    resource_group = {
      key    = "auto-account"
      lz_key = "launchpad"
    }
  }
}

This snippet showcases how to deploy an automation account within a resource group provisioned in a lower level, demonstrating the composability of the framework.

Handling External Objects

When dealing with resources that are not deployed through Terraform or are managed outside of the Azure CAF object model, you can still reference these using their resource names or IDs:

Example with Resource Name

automations = {
  account1 = {
    name = "automationAccount1"
    sku  = "Basic"
    resource_group = {
      name    = "caf-auto-account-zooz-001"
    }
  }
}

Example with Resource ID

virtual_hub_connections = {
  vnet_to_hub = {
    name = "vnet-connectivity-prod-fw-plinks-TO-vhub-prod"
    virtual_hub = {
      lz_key = "connectivity_virtual_hubs_prod"
      key    = "prod"
    }
    vnet = {
      resource_id = "/subscriptions/dklsdfk/etc."
    }
  }
}

Global Settings and Diagnostics

The hierarchy model of Azure CAF allows for global settings and diagnostics settings to be applied across all levels, ensuring consistent application of configurations like supported regions, naming conventions, and tag inheritance.

Conclusion

Code-less composition in Azure using Terraform represents a significant step forward in infrastructure automation. By reducing the need for manual coding, it not only minimizes human error but also speeds up the deployment process, allowing IT teams to focus more on strategic initiatives rather than getting bogged down by configuration complexities. This approach aligns with modern DevOps practices, offering a scalable, repeatable, and efficient method for managing cloud resources.

Azure DevOps – AzureKeyVault@ and Empty Secrets

In cloud architecture and DevOps, managing secrets securely is paramount. Azure Key Vault provides a robust solution by enabling the secure storage of secrets, keys, and certificates. However, integrating Azure Key Vault with Azure DevOps through the AzureKeyVault@ task can present unique challenges, mainly when dealing with empty secrets. This blog post delves into these challenges and provides a practical workaround, which is especially useful when bootstrapping environments with Terraform.

The Challenge with Empty Secrets

When using Terraform, specifically with the Cloud Adoption Framework (CAF) Super Module, to bootstrap an environment, you might encounter a scenario where certain secrets in Azure Key Vault are intended to be empty. This could be by design, especially in dynamic environments where secret values are not immediately available or required. A typical example is the initialization of SSH keys for virtual machine scale sets (VMSS).

Note: It is impossible to create an empty secret in Key Vault via the portal, but who uses the Azure Portal nowadays, Flintstone?

However, when using the AzureKeyVault@ task in Azure DevOps pipelines to fetch these secrets, a peculiar behavior is observed: if a secret is empty, the task does not map it to a variable. Instead, the variable’s value defaults to the variable’s name. This behaviour can lead to unexpected results, especially when the presence or content of a secret dictates subsequent pipeline logic.

Understanding the Workaround

To effectively manage this situation, a strategic approach involves testing for valid secret values before proceeding with operations that depend on these secrets. Specifically, we employ pattern matching or regular expressions to verify that the secrets fetched from Azure Key Vault contain expected values.

Below is a simplified explanation of how to implement this workaround in an Azure DevOps pipeline:

Fetch Secrets with AzureKeyVault@ Task: Initially, use the AzureKeyVault@ task to attempt retrieving the desired secrets from Azure Key Vault, specifying the necessary parameters such as azureSubscription and KeyVaultName.
Validate Secret Values in a Bash Task: Following the retrieval, incorporate a Bash task to validate the contents of these secrets. The logic involves checking if the secret values meet predefined patterns. For SSH keys, for instance, public keys typically begin with ssh-rsa, and private keys contain BEGIN OPENSSH PRIVATE KEY.
Handle Empty or Invalid Secrets: If the secrets do not meet the expected patterns—indicative of being empty or invalid—proceed to generate new SSH key pairs and set them as pipeline variables. Furthermore, upload these newly generated keys back to Azure Key Vault for future use.
Success and Error Handling: Proceed with the intended operations upon successful validation or generation of secrets. Ensure that error handling is incorporated to manage failures, mainly when uploading keys to Azure Key Vault.

Code Implementation

Here’s a code snippet illustrating the key parts of this workaround:
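Note that a Bash script in a pipeline can work with pipeline variables in three ways:

ENV mapping – $VAR, after mapping the variable in the task’s env: block
Direct referencing using the macro syntax $(varname), which the agent expands before the script runs
Logging command – echo "##vso[task.setvariable variable=varname]$value", which sets the variable for later steps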

For the sake of this blog post, I will demonstrate all three approaches.

steps:
  - task: AzureKeyVault@2
    inputs:
      azureSubscription: ${{ parameters.azureSubscription }}
      KeyVaultName: ${{ parameters.keyVaultName }}
      SecretsFilter: 'vmss-img-public-key, vmss-img-private-key'
  - task: Bash@3
    displayName: 'Manage SSH Key'
    inputs:
      targetType: 'inline'
      script: |
        set -e  # Exit immediately if a command exits with a non-zero status.
        set -o pipefail # Makes pipeline return the exit status of the last command in the pipe that failed
        # Check if the keys exist in the Azure Key Vault
        if [[ $VMSS_IMG_PUBLIC_KEY != ssh-rsa* ]] || [[ $VMSS_IMG_PRIVATE_KEY != *"BEGIN OPENSSH PRIVATE KEY"* ]]; then
          # Generate the SSH key pair
          ssh-keygen -t rsa -b 2048 -f "$(Build.SourcesDirectory)/sshkey" -q -N ""
          echo "SSH key pair generated."

          # Read both keys and publish them as pipeline variables
          VMSS_IMG_PUBLIC_KEY=$(cat "$(Build.SourcesDirectory)/sshkey.pub")
          VMSS_IMG_PRIVATE_KEY=$(cat "$(Build.SourcesDirectory)/sshkey")
          echo "##vso[task.setvariable variable=vmss-img-public-key]$VMSS_IMG_PUBLIC_KEY"
          # issecret=true masks the value in logs; a multi-line value may be cut off after its first line
          echo "##vso[task.setvariable variable=vmss-img-private-key;issecret=true]$VMSS_IMG_PRIVATE_KEY"

          # Upload the public key to Azure Key Vault
          az keyvault secret set --name vmss-img-public-key --vault-name "$KEYVAULT_NAME" --file "$(Build.SourcesDirectory)/sshkey.pub" || {
            echo "Failed to upload the public key to Azure Key Vault."
            exit 1
          }

          # Upload the private key to Azure Key Vault
          az keyvault secret set --name vmss-img-private-key --vault-name "$KEYVAULT_NAME" --file "$(Build.SourcesDirectory)/sshkey" || {
            echo "Failed to upload the private key to Azure Key Vault."
            exit 1
          }
        else
          echo "Skipping SSH Key generation, keys already present in Key Vault: $KEYVAULT_NAME"
          echo "Public Key in Keyvault $KEYVAULT_NAME is: $(vmss-img-public-key)"
        fi
    env:
      KEYVAULT_NAME: ${{ parameters.keyVaultName }}
      VMSS_IMG_PUBLIC_KEY: $(vmss-img-public-key)
      VMSS_IMG_PRIVATE_KEY: $(vmss-img-private-key)

The above script could be simplified, could use stricter regular expressions, and does not need such verbose output; it is written this way to demonstrate the different ways of accessing the variables vmss-img-public-key and vmss-img-private-key.
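
As one possible tightening of those checks, the sketch below wraps stricter patterns in two small functions. The function names are made up for illustration, and the patterns only test the general shape of an RSA public key and an OpenSSH private key; they do not prove the key material is valid.

```shell
#!/usr/bin/env bash
# True when the value looks like "ssh-rsa <base64> [comment]".
is_ssh_rsa_public_key() {
  local re='^ssh-rsa[[:space:]]+[A-Za-z0-9+/]+={0,2}([[:space:]].*)?$'
  [[ "$1" =~ $re ]]
}

# True when the value carries both the OpenSSH private key header and footer.
is_openssh_private_key() {
  [[ "$1" == *"-----BEGIN OPENSSH PRIVATE KEY-----"* && "$1" == *"-----END OPENSSH PRIVATE KEY-----"* ]]
}
```

With these in place, the condition in the pipeline script could read: if ! is_ssh_rsa_public_key "$VMSS_IMG_PUBLIC_KEY" || ! is_openssh_private_key "$VMSS_IMG_PRIVATE_KEY"; then ... fi.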

For the Bash gurus out there, you might say, why not check for null or empty:

if [[ -z $VMSS_IMG_PUBLIC_KEY ]] || [[ -z $VMSS_IMG_PRIVATE_KEY ]]

The above will not work for variables originating from a Key Vault task where the secret is an empty string. The variable’s value will be the variable name rather than an empty string, so a -z test never fires.
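
If you prefer an explicit check over pattern matching, a helper along the following lines can flag an unmapped secret. It is a sketch: the function name is hypothetical, and treating the unexpanded macro text $(name) as "unmapped" is an assumption about how the value reaches the script through env mapping.

```shell
#!/usr/bin/env bash
# Returns 0 (true) when a value handed over by the Key Vault task looks unmapped.
# $1 is the secret name, $2 is the value the script received.
is_unmapped_secret() {
  local secret_name="$1" value="$2"
  # Nothing was mapped at all.
  [[ -z "$value" ]] && return 0
  # The value is just the variable name, as described above.
  [[ "$value" == "$secret_name" ]] && return 0
  # Assumption: the literal, unexpanded macro text, e.g. $(vmss-img-public-key).
  [[ "$value" == "\$($secret_name)" ]] && return 0
  return 1
}
```

For example, is_unmapped_secret "vmss-img-public-key" "$VMSS_IMG_PUBLIC_KEY" succeeds for an empty or unmapped secret and fails for a real key.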

There you have it. If you ever find your Key Vault task variables not being mapped to ENV automatically or not accessible directly, e.g., $(vmss-img-public-key), it could be that the secret is null or empty, which can occur when using Terraform or the https://github.com/aztfmod/terraform-azurerm-caf/blob/main/dynamic_secrets.tf module.

# When called from the CAF module it can only be used to set secret values
# For that reason, object must not be set.
# This is only used here for examples to run
# the normal recommendation for dynamic keyvault secrets is to call it from a landingzone
module "dynamic_keyvault_secrets" {
  source     = "./modules/security/dynamic_keyvault_secrets"
  depends_on = [module.keyvaults]
  for_each = {
    for keyvault_key, secrets in try(var.security.dynamic_keyvault_secrets, {}) : keyvault_key => {
      for key, value in secrets : key => value
      if try(value.value, null) != null && try(value.value, null) != ""
    }
  }

  settings = each.value
  keyvault = local.combined_objects_keyvaults[local.client_config.landingzone_key][each.key]
}

output "dynamic_keyvault_secrets" {
  value = module.dynamic_keyvault_secrets
}

Why not just deploy VMSS via Terraform and have this all in the logic, you ask? Well, that’s like expecting your pet cat to fetch your slippers – it’s just not possible! VMSS and Terraform are not supported if the Orchestration Mode is Uniform (--orchestration-mode Uniform), so we have to make do with combining the worlds of AZ CLI and Terraform to dance together like an awkward couple. Think of it as a robot tango, with lots of beeps and boops!
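
For the AZ CLI half of that dance, the sketch below builds the arguments for az vmss create in Uniform mode using the public key from the pipeline. The resource group, scale set name, image and admin user are hypothetical placeholders, and the function only prints the arguments so nothing is created until you pass them to az yourself.

```shell
#!/usr/bin/env bash
# Print az CLI arguments for creating a Uniform scale set, one per line.
# $1 resource group, $2 scale set name, $3 image, $4 SSH public key.
# The admin user name below is a hypothetical placeholder.
vmss_create_args() {
  local resource_group="$1" name="$2" image="$3" public_key="$4"
  local args=(
    vmss create
    --resource-group "$resource_group"
    --name "$name"
    --image "$image"
    --orchestration-mode Uniform
    --admin-username azureuser
    --ssh-key-values "$public_key"
  )
  printf '%s\n' "${args[@]}"
}
```

A pipeline step could then run: mapfile -t args < <(vmss_create_args rg-demo vmss-demo Ubuntu2204 "$VMSS_IMG_PUBLIC_KEY"); az "${args[@]}". Printing one argument per line keeps the public key, which contains spaces, as a single argument.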

Optimizing Terraform – CICD Pipelines Rover

Optimizing Terraform’s performance, especially for plan and apply operations, can involve several strategies. Here are some tips to help speed up these commands:

Parallelism Adjustment: Terraform performs operations concurrently. You can adjust the number of concurrent operations with the -parallelism flag. However, increasing this number can lead to higher memory and CPU usage. Find a balance that suits your machine or CI/CD runner specifications.
Targeted Terraform Runs: If you know exactly which resources need updating, you can use the -target option to run Terraform on specific resources. This reduces the time spent planning and applying by focusing on a subset of your resources.
Incremental Changes: Apply small, incremental changes to your infrastructure rather than large updates. Smaller changes will be quicker to plan and apply.
Module Optimization: Break down your configurations into smaller, reusable modules. This modular approach helps Terraform to process less at any given time.
State Management: Store the Terraform state in a remote backend that supports state locking and consistent reads, such as Azure Blob Storage with state locking enabled. For large infrastructures, consider breaking your configuration into smaller, independent state files to reduce read/write times.
Resource Deferment: Some resources may be inherently slow to create or update due to the nature of the service provider. If possible, manage these resources separately and apply them in different runs.
Minimize Dependencies: Avoid creating unnecessary dependencies between resources. Terraform can’t parallelize dependent resources, so the fewer interdependencies, the more it can do in parallel.
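Use Explicit Dependencies Sparingly: depends_on is a Terraform meta-argument rather than a provider feature. Use it only where Terraform cannot infer a dependency from resource references, since every explicit dependency also limits what Terraform can run in parallel.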
Optimize Resource Usage: Check your resources’ performance on the CI/CD runner or environment where Terraform runs. Upgrading the machine or allocating more CPU/memory might be necessary if you’re consistently seeing exit codes like 137.
Refactor and Review Configurations: Over time, configurations can become inefficient or bloated. Regularly review and refactor Terraform code to simplify and remove unnecessary complexity.
Leverage Data Sources: Prefer data sources over resources for read-only operations where possible, as they can be quicker to evaluate.
Use Terraform Cloud: If you’re using open-source Terraform, consider using Terraform Cloud or Terraform Enterprise for more robust state management and operations.
Caching: Some CI/CD systems support caching between runs. If you’re running Terraform in a CI/CD pipeline, make sure to cache the .terraform directory to avoid re-downloading plugins and modules.
Avoid Unnecessary Outputs: Excessive use of outputs, especially when they contain large amounts of data, can slow down Terraform’s performance. Keep outputs to the minimum necessary.
Profile Apply Time: Use TF_LOG=TRACE for a one-off apply to see where time is being spent. Be aware this will generate a lot of logs but can be useful to spot any bottlenecks.
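
A few of the tips above translate directly into command-line options and environment variables. The sketch below shows one way to assemble them in Bash; the function names, the tfplan file name and the default cache location are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Print arguments for a terraform plan, one per line.
# $1 is the -parallelism value; any further arguments become -target addresses.
plan_args() {
  local parallelism="$1"
  shift
  local args=(plan "-parallelism=${parallelism}" "-out=tfplan")
  local target
  for target in "$@"; do
    args+=("-target=${target}")
  done
  printf '%s\n' "${args[@]}"
}

# Share downloaded providers between runs, which pairs well with CI caching.
# The default location is an assumption; any writable directory works.
enable_plugin_cache() {
  export TF_PLUGIN_CACHE_DIR="${1:-$HOME/.terraform.d/plugin-cache}"
  mkdir -p "$TF_PLUGIN_CACHE_DIR"
}
```

Usage could look like: mapfile -t args < <(plan_args 20 module.networking); terraform "${args[@]}". For a one-off profiling run, TF_LOG=TRACE TF_LOG_PATH=trace.log terraform apply tfplan writes the trace to a file instead of flooding the console. Keep -target for exceptional cases, since a targeted run leaves the rest of the configuration unchecked.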

Lastly, upgrade your DevOps agent CPU and Memory. I ran into Terraform exit code 137 and upgraded the CPU and Memory, which helped the DevOps Agents a lot.
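
Exit code 137 is worth decoding: shells report a process killed by signal N as 128 + N, so 137 means signal 9 (SIGKILL), which is what the kernel sends when it reclaims memory from a process. A small helper, sketched here under a hypothetical name, makes that visible in pipeline logs.

```shell
#!/usr/bin/env bash
# Describe a process exit code in plain words.
explain_exit_code() {
  local code="$1"
  if (( code == 0 )); then
    echo "success"
  elif (( code > 128 )); then
    # Shells report "killed by signal N" as 128 + N.
    echo "terminated by signal $(( code - 128 ))"
  else
    echo "failed with exit code ${code}"
  fi
}
```

Calling explain_exit_code 137 prints "terminated by signal 9".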

And now...

CAF Rover Terraform Pipeline Reuse

parameters:
- name: job
  type: string
  default: ''
- name: displayName
  type: string
  default: ''
- name: environment
  type: string
  values:
  - Production
  - Test
  default: Test
- name: launchpad
  type: boolean
  default: false
- name: agentPool
  type: string
- name: agentClientId
  type: string
- name: landingZoneDir
  type: string
- name: configurationDir
  type: string
- name: level
  type: number
  values:
  - 0
  - 1
  - 2
  - 3
  - 4
- name: tfstate
  type: string
- name: action
  type: string
  values:
    - plan
    - apply
  default: plan
- name: dependsOn
  type: string
  default: ''
- name: tfstateSubscriptionId
  type: string
  default: '' 
- name: targetSubscription
  type: string
  default: ''
- name: token
  type: string
  default: ''  
  

jobs:
- ${{ if ne(parameters.action, 'plan') }}:
  - job: "${{ parameters.job }}waitForValidation"
    ${{ if not(eq(parameters.dependsOn, '')) }}:
      dependsOn: ${{ parameters.dependsOn }}
      condition: and(not(failed()), not(canceled()))
    displayName: "Wait for manual approval"
    pool: "server"
    timeoutInMinutes: "4320" # job times out in 3 days
    steps:
      - task: ManualValidation@0
        timeoutInMinutes: "1440" # task times out in 1 day
        inputs:
          notifyUsers: ''               
          instructions: "Confirm ${{ parameters.job }}"
          onTimeout: "reject"

- job: ${{ parameters.job }}
  variables:
    ${{ if eq(parameters.launchpad, true) }}:
      launchpad_opt: "-launchpad"
      level_opt: '-level level0'  
    ${{ if not(eq(parameters.launchpad, true)) }}:
      launchpad_opt: ''
      level_opt: "-level level${{ parameters.level }}"
    ${{ if not(eq(parameters.tfstateSubscriptionId, '')) }}:
      tfstate_opt: "-tfstate_subscription_id ${{ parameters.tfstateSubscriptionId }}"
    ${{ if not(eq(parameters.targetSubscription, '')) }}:
      target_opt: "-target_subscription ${{ parameters.targetSubscription }}"
    ${{ if not(eq(parameters.token, '')) }}:
      set_token: "export TF_VAR_token=${{ parameters.token }}"
  
  pool: ${{ parameters.agentPool }}
  displayName: ${{ parameters.displayName }}
  ${{ if eq(parameters.action, 'plan') }}:
    dependsOn: ${{ parameters.dependsOn }}
    condition: and(not(failed()), not(canceled()))
  ${{ if ne(parameters.action, 'plan') }}:
    dependsOn: "${{ parameters.job }}waitForValidation"
    condition: and(not(failed()), not(canceled()))

  
  steps:
  - checkout: self
    path: s/tf
  - checkout: caf-terraform-landingzones
    path: s/tf/landingzones
  - checkout: terraform-azurerm-caf
  - bash: | 
        git config --global http.https://edg-technology.visualstudio.com.extraheader "AUTHORIZATION: bearer $(System.AccessToken)"
  - bash: |
        ${{ variables.set_token }}
        
        az login --identity -u ${{parameters.agentClientId}} -o none

        /tf/rover/rover.sh \
        -lz $(Pipeline.Workspace)/s/tf/landingzones/${{ parameters.landingZoneDir }} \
        ${{ variables.tfstate_opt }} \
        ${{ variables.target_opt }} \
        ${{ variables.launchpad_opt }} \
        -var-folder $(Pipeline.Workspace)/s/tf/${{ parameters.configurationDir }} \
        -parallelism 20 \
        -tfstate ${{ parameters.tfstate }} \
        -env ${{ parameters.environment }} \
        ${{ variables.level_opt }} \
        -a ${{ parameters.action }}

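        retVal=$?
        if [ $retVal -eq 137 ]; then
          echo "The process was killed, possibly due to a CPU or memory issue."
          exit $retVal
        elif [ $retVal -ne 1 ]; then
          # Any code other than 1 is treated as success (terraform plan -detailed-exitcode, for example, uses 2 for "changes present")
          exit 0
        else
          # Exit code 1 is a genuine failure; propagate it so the step fails
          exit $retVal
        fi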
    failOnStderr: true
    displayName: ${{ parameters.displayName }}

DevOps Level 3: Automate Your Azure Policy with CAF Enterprise Scale Rover

In the ever-evolving landscape of cloud computing, the need for automation and governance at scale has never been more critical. Microsoft Azure, a leading cloud service provider, offers many features to manage and secure cloud resources effectively. However, the real game-changer in this domain is the Cloud Adoption Framework (CAF) Enterprise Scale Rover, a tool designed to supercharge your Azure governance strategy. This blog post will delve into automating the deployment of Azure Policy Definitions, Policy Sets (Initiatives), and Policy Assignments using CAF Enterprise Scale Rover, ensuring your Azure environment remains compliant, secure, and optimized.

Introduction to Azure Policies and CAF Rover

Azure Policies play a pivotal role in the governance framework of Azure environments. They enable organizations to define, assign, and manage policies that enforce rules over their resources, ensuring compliance with company standards and regulatory requirements. While Azure Policies are powerful, managing them across a large-scale environment can be daunting.

Enter CAF Enterprise Scale Rover, an innovative solution that streamlines the deployment and management of Azure Policies. It is designed to automate the process, making it easier, faster, and more efficient. By leveraging the CAF Rover, IT professionals can focus on strategic tasks, leaving the heavy lifting to the automation processes.

Setting Up Your Environment for CAF Rover

Before diving into the automation process, it’s essential to set up your environment to run the CAF Rover. This setup involves ensuring your development environment is ready, installing necessary tools like Docker, Terraform, Git, and configuring VSCode with specific extensions for Azure Policy and Docker support. Detailed guidance on setting up your environment can be found in the provided recommended reading, highlighting the importance of a properly configured dev environment for a seamless automation experience.

CAF Level Structure and eslz module for Policies

You must plan out your Policy Definitions, then group them into Initiatives, and then assign initiatives to scopes (Management Groups or Subscriptions).

Automating Policy Definitions Deployment

The journey begins with automating Policy Definitions, the cornerstone of Azure Policy management. CAF Rover simplifies this process by leveraging a structured JSON format for defining policies, focusing on key areas such as allowed regions, naming conventions, and resource compliance checks. The process entails writing your Policy Definition in JSON, committing it to your Git repository, and deploying it to your Azure environment via CAF Rover commands. This approach ensures that all your cloud resources adhere to defined governance standards from the get-go.

Sample Policy Definition

{
  "name": "Append-AppService-httpsonly",
  "type": "Microsoft.Authorization/policyDefinitions",
  "apiVersion": "2021-06-01",
  "scope": null,
  "properties": {
    "policyType": "Custom",
    "mode": "All",
    "displayName": "AppService append enable https only setting to enforce https setting.",
    "description": "Appends the AppService sites object to ensure that  HTTPS only is enabled for  server/service authentication and protects data in transit from network layer eavesdropping attacks. Please note Append does not enforce compliance use then deny.",
    "metadata": {
      "version": "1.0.0",
      "category": "App Service",
      "source": "https://github.com/Azure/Enterprise-Scale/",
      "alzCloudEnvironments": [
        "AzureCloud",
        "AzureChinaCloud",
        "AzureUSGovernment"
      ]
    },
    "parameters": {
      "effect": {
        "type": "String",
        "defaultValue": "Append",
        "allowedValues": [
          "Audit",
          "Append",
          "Disabled"
        ],
        "metadata": {
          "displayName": "Effect",
          "description": "Enable or disable the execution of the policy"
        }
      }
    },
    "policyRule": {
      "if": {
        "allOf": [
          {
            "field": "type",
            "equals": "Microsoft.Web/sites"
          },
          {
            "field": "Microsoft.Web/sites/httpsOnly",
            "notequals": true
          }
        ]
      },
      "then": {
        "effect": "[parameters('effect')]",
        "details": [
          {
            "field": "Microsoft.Web/sites/httpsOnly",
            "value": true
          }
        ]
      }
    }
  }
}

Streamlining Policy Sets (Initiatives) Deployment

Next, we focus on Policy Sets, also known as Initiatives, which group multiple Policy Definitions for cohesive management. The CAF Rover enhances the deployment of Policy Sets by automating their creation and assignment. By grouping related policies, you can ensure comprehensive coverage of governance requirements, such as naming conventions and compliance checks, across your Azure resources. The automation process involves defining your Policy Sets in JSON format, committing them to your repository, and deploying them through CAF Rover, streamlining the governance of your cloud environment.

Sample Policy Set (Initiative)

{
  "name": "Audit-UnusedResourcesCostOptimization",
  "type": "Microsoft.Authorization/policySetDefinitions",
  "apiVersion": "2021-06-01",
  "scope": null,
  "properties": {
    "policyType": "Custom",
    "displayName": "Unused resources driving cost should be avoided",
    "description": "Optimize cost by detecting unused but chargeable resources. Leverage this Azure Policy Initiative as a cost control tool to reveal orphaned resources that are contributing cost.",
    "metadata": {
      "version": "2.0.0",
      "category": "Cost Optimization",
      "source": "https://github.com/Azure/Enterprise-Scale/",
      "alzCloudEnvironments": [
        "AzureCloud",
        "AzureChinaCloud",
        "AzureUSGovernment"
      ]
    },
    "parameters": {
      "effectDisks": {
        "type": "String",
        "metadata": {
          "displayName": "Disks Effect",
          "description": "Enable or disable the execution of the policy for Microsoft.Compute/disks"
        },
        "allowedValues": [
          "Audit",
          "Disabled"
        ],
        "defaultValue": "Audit"
      },
      "effectPublicIpAddresses": {
        "type": "String",
        "metadata": {
          "displayName": "PublicIpAddresses Effect",
          "description": "Enable or disable the execution of the policy for Microsoft.Network/publicIpAddresses"
        },
        "allowedValues": [
          "Audit",
          "Disabled"
        ],
        "defaultValue": "Audit"
      },
      "effectServerFarms": {
        "type": "String",
        "metadata": {
          "displayName": "ServerFarms Effect",
          "description": "Enable or disable the execution of the policy for Microsoft.Web/serverfarms"
        },
        "allowedValues": [
          "Audit",
          "Disabled"
        ],
        "defaultValue": "Audit"
      }
    },
    "policyDefinitions": [
      {
        "policyDefinitionReferenceId": "AuditDisksUnusedResourcesCostOptimization",
        "policyDefinitionId": "${current_scope_resource_id}/providers/Microsoft.Authorization/policyDefinitions/Audit-Disks-UnusedResourcesCostOptimization",
        "parameters": {
          "effect": {
            "value": "[parameters('effectDisks')]"
          }
        },
        "groupNames": []
      },
      {
        "policyDefinitionReferenceId": "AuditPublicIpAddressesUnusedResourcesCostOptimization",
        "policyDefinitionId": "${current_scope_resource_id}/providers/Microsoft.Authorization/policyDefinitions/Audit-PublicIpAddresses-UnusedResourcesCostOptimization",
        "parameters": {
          "effect": {
            "value": "[parameters('effectPublicIpAddresses')]"
          }
        },
        "groupNames": []
      },
      {
        "policyDefinitionReferenceId": "AuditServerFarmsUnusedResourcesCostOptimization",
        "policyDefinitionId": "${current_scope_resource_id}/providers/Microsoft.Authorization/policyDefinitions/Audit-ServerFarms-UnusedResourcesCostOptimization",
        "parameters": {
          "effect": {
            "value": "[parameters('effectServerFarms')]"
          }
        },
        "groupNames": []
      },
      {
        "policyDefinitionReferenceId": "AuditAzureHybridBenefitUnusedResourcesCostOptimization",
        "policyDefinitionId": "${current_scope_resource_id}/providers/Microsoft.Authorization/policyDefinitions/Audit-AzureHybridBenefit",
        "parameters": {
          "effect": {
            "value": "Audit"
          }
        },
        "groupNames": []
      }
    ],
    "policyDefinitionGroups": null
  }
}
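
The ${current_scope_resource_id} placeholder in the policyDefinitionId values is filled in at deployment time with the resource ID of the scope being deployed to. The Bash sketch below only imitates that substitution with plain string replacement so the resulting ID is easy to picture. It is not how the eslz module works internally, and the management group name mg-demo is hypothetical.

```shell
#!/usr/bin/env bash
# Replace the ${current_scope_resource_id} placeholder in a template string.
# $1 is the templated ID, $2 is the resource ID of the target scope.
render_policy_id() {
  local template="$1" scope_id="$2"
  local placeholder='${current_scope_resource_id}'
  # Quoting the pattern makes bash match the placeholder literally.
  local rendered=${template//"$placeholder"/$scope_id}
  printf '%s\n' "$rendered"
}

# Hypothetical management group scope used for illustration.
scope='/providers/Microsoft.Management/managementGroups/mg-demo'
render_policy_id \
  '${current_scope_resource_id}/providers/Microsoft.Authorization/policyDefinitions/Audit-Disks-UnusedResourcesCostOptimization' \
  "$scope"
```

This prints /providers/Microsoft.Management/managementGroups/mg-demo/providers/Microsoft.Authorization/policyDefinitions/Audit-Disks-UnusedResourcesCostOptimization, the full ID the initiative needs in order to find the custom definition at that scope.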

Automating Policy Assignments

The final piece of the automation puzzle is Policy Assignments. This step activates the policies, applying them to your Azure resources. CAF Rover facilitates the automation of both custom and built-in Policy Assignments, ensuring your resources are governed according to the defined policies. Whether you are assigning custom initiatives or leveraging Azure’s built-in policies for zone resilience, the process is simplified through automation, allowing for efficient and effective governance at scale.

Sample Policy Assignment

{
  "type": "Microsoft.Authorization/policyAssignments",
  "apiVersion": "2022-06-01",
  "name": "as_baseline_security",
  "dependsOn": [],
  "properties": {
    "description": "This assignment includes EDG baseline security policies.",
    "displayName": "Custom baseline security",
    "policyDefinitionId": "${current_scope_resource_id}/providers/Microsoft.Authorization/policySetDefinitions/custom_baseline_security",
    "enforcementMode": null,
    "metadata": {
    },
    "nonComplianceMessages": [
      {
        "policyDefinitionReferenceId": "custom_audit_function_app_require_msi_tf_1",
        "message": "FUNC-001 - Use Azure-managed identity to securely authenticate to other cloud services/resources" 
      },
      {
        "policyDefinitionReferenceId": "custom_deny_function_app_remotedebugging_tf_1",
        "message": "FUNC-014 - Turn off Remote debugging on your Function apps" 
      },
      {
        "policyDefinitionReferenceId": "custom_deny_mismatched_res_resgroup_locations_tf_1",
        "message": "AZ-001 - Resource has been deployed in a different location from the resource group containing it" 
      },
      {
        "policyDefinitionReferenceId": "custom_deny_non_allowed_resource_locations_tf_1",
        "message": "AZ-002 - Resource has been deployed in an unauthorised location" 
      },
      {
        "policyDefinitionReferenceId": "custom_deny_storage_acc_accessible_over_http_tf_1",
        "message": "ST-013 - Enforce data encryption in transit by enabling HTTPS only" 
      },
      {
        "policyDefinitionReferenceId": "custom_deny_storage_acc_disable_public_network_tf_1",
        "message": "ST-001 - Disable public network access" 
      },
      {
        "policyDefinitionReferenceId": "custom_deploy_function_app_accessible_over_http_tf_1",
        "message": "FUNC-003 - Enforce data encryption in transit by enabling HTTPS only" 
      },
      {
        "policyDefinitionReferenceId": "custom_deploy_function_app_require_ftps_only_tf_1",
        "message": "FUNC-009 - Disable FTP based deployment or configure to accept FTPS only" 
      },
      {
        "policyDefinitionReferenceId": "custom_deploy_function_app_require_tls12_tf_1",
        "message": "FUNC-004 - Enforce minimum TLS version to 1.2" 
      }
    ],
    "parameters": {
    },
    "scope": "${current_scope_resource_id}",
    "notScopes": []
  },
  "location": "${default_location}",
  "identity": {
    "type": "SystemAssigned"
  }
}

Archetypes

Archetypes are used in the Azure landing zone conceptual architecture to describe the Landing Zone configuration using a template-driven approach. The archetype is what fundamentally transforms Management Groups and Subscriptions into Landing Zones.

An archetype defines which Azure Policy and Access control (IAM) settings are needed to secure and configure the Landing Zones with everything needed for safe handover to the Landing Zone owner. This covers critical platform controls and configuration items, such as:

Consistent role-based access control (RBAC) settings
Guardrails for security settings
Guardrails for common workload configurations (e.g. SAP, AKS, WVD, etc.)
Automated provisioning of critical platform resources such as monitoring and networking solutions in each Landing Zone

This approach provides improved autonomy for application teams, whilst ensuring security policies and standards are enforced.

Why use CAF rover?

This tool greatly simplifies secure state management on Azure storage accounts. It also brings the following benefits:

Helps with testing different versions of binaries (new versions of Terraform, Azure CLI, jq, tflint, etc.)
Ubiquitous development environment: everyone works with the same versions of the DevOps toolchain, always up to date, running on laptops, pipelines, GitHub Codespaces, etc.
Facilitates the identity transition to any CI/CD, since all CI/CD platforms have container capabilities.
Allows easy transition from one DevOps environment to another (GitHub Actions, Azure DevOps, Jenkins, CircleCI, etc.)
It’s open source and leverages open-source projects that you often need with Terraform.

Integrating with Azure DevOps Pipelines

A critical aspect of automating Azure Policy deployment using CAF Enterprise Scale Rover is its seamless integration with Azure DevOps pipelines. This integration enables organizations to adopt a DevOps approach to cloud governance, where policy changes are version-controlled, reviewed, and deployed through automated CI/CD pipelines. By incorporating CAF Rover into Azure DevOps pipelines, you can ensure that policy deployments are consistent, repeatable, and auditable across different environments. This process not only enhances governance and compliance but also aligns with best practices for Infrastructure as Code (IaC), facilitating a collaborative and efficient workflow among development, operations, and security teams. Leveraging Azure DevOps pipelines with CAF Rover automation empowers organizations to maintain a high governance standard while embracing the agility and speed that cloud environments offer.

Conclusion

Automating the deployment of Azure Policy Definitions, Policy Sets (Initiatives), and Policy Assignments using CAF Enterprise Scale Rover represents a significant leap forward in cloud governance. This approach not only saves time and reduces the potential for human error but also ensures a consistent and compliant Azure environment. By embracing automation with CAF Rover, organizations can achieve a robust governance framework that scales with their Azure deployments, securing their cloud journey’s success.

For those keen to automate their Azure Policies, diving into the CAF Rover’s capabilities is a must. The combination of detailed documentation, structured JSON for policy definitions, and automated deployment processes provides a clear path to efficient and effective Azure governance. Embrace the power of automation with CAF Enterprise Scale Rover and take your Azure governance to the next level.

Demystifying .NET Core Memory Leaks: A Debugging Adventure with dotnet-dump

It has been a while since I wrote about memory dump analysis; the last post on the subject was back in 2011. Let’s get stuck into the dark arts.

First and foremost, .NET Framework is very different to .NET Core, down to App Domains and how the MSIL is executed. Understanding this is crucial before you kick off a clrstack or dumpdomain! Make sure you understand the architecture of what you are debugging, from ASP.NET to console apps. dumpdomain caught me off guard, because in the past you would use it to get at the source code and decompile via the PDB files.

Feature	ASP.NET Core	ASP.NET Framework
Cross-Platform Support	Runs on Windows, Linux, and macOS.	Primarily runs on Windows.
Hosting	Can be hosted on Kestrel, IIS, HTTP.sys, Nginx, Apache, and Docker.	Typically hosted on IIS.
Performance	Optimized for high performance and scalability.	Good performance, but generally not as optimized as ASP.NET Core.
Application Model	Unified model for MVC and Web API.	Separate models for MVC and Web API.
Configuration	Uses a lightweight, file-based configuration system (appsettings.json).	Uses web.config for configuration.
Dependency Injection	Built-in support for dependency injection.	Requires third-party libraries for dependency injection.
App Domains	Uses a single app model and does not support app domains.	Supports app domains for isolation between applications.
Runtime Compilation	Supports runtime compilation of Razor views (optional).	Supports runtime compilation of ASPX pages.
Modular HTTP Pipeline	Highly modular and configurable HTTP request pipeline.	Fixed HTTP request pipeline defined by the Global.asax and web.config.
Package Management	Uses NuGet for package management, with an emphasis on minimal dependencies.	Also uses NuGet but tends to have more complex dependency trees.
Framework Versions	Applications target a specific version of .NET Core, which is bundled with the app.	Applications target a version of the .NET Framework installed on the server.
Update Frequency	Rapid release cycle with frequent updates and new features.	Slower release cycle, tied to Windows updates.
Side-by-Side Deployment	Supports running multiple versions of the app or .NET Core side-by-side.	Does not support running multiple versions of the framework side-by-side for the same application.
Open Source	Entire platform is open-source.	Only a portion of the platform is open-source.

So we embark on a quest to uncover hidden memory leaks that lurk within the depths of .NET Core apps, armed with the mighty dotnet-dump utility. This tale of debugging prowess will guide you through collecting and analyzing dump files, uncovering the secrets of memory leaks, and ultimately conquering these elusive beasts.

Preparing for the Hunt: Installing dotnet-dump

Our journey begins with the acquisition of the dotnet-dump tool, a valiant ally in our quest. This tool is a part of the .NET diagnostics toolkit, designed to collect and analyze dumps without requiring native debuggers. It’s a lifesaver on platforms like Alpine Linux, where traditional tools shy away.

To invite dotnet-dump into your arsenal, you have two paths:

The Global Tool Approach: Unleash the command dotnet tool install --global dotnet-dump into your terminal and watch as the latest version of the dotnet-dump NuGet package is summoned.
The Direct Download: Navigate to the mystical lands of the .NET website and download the tool executable that matches your platform’s essence.

The First Step: Collecting the Memory Dump

With dotnet-dump by your side, it’s time to collect a memory dump from the process that has been bewitched by the memory leak. Invoke dotnet-dump collect --process-id <PID>, where <PID> is the identifier of the cursed process. This incantation captures the essence of the process’s memory, storing it in a file for later analysis.

The Analytical Ritual: Unveiling the Mysteries of the Dump

Now, the real magic begins. Use dotnet-dump analyze <dump_path> to enter an interactive realm where the secrets of the dump file are yours to discover. This enchanted shell accepts various SOS commands, granting you the power to scrutinize the managed heap, reveal the relationships between objects, and formulate theories about the source of the memory leak.

Common Spells and Incantations:

clrstack: Summons a stack trace of managed code, revealing the paths through which the code ventured.
dumpheap -stat: Unveils the statistics of the objects residing in the managed heap, highlighting the most common culprits.
gcroot <address>: Traces the lineage of an object back to its roots, uncovering why it remains in memory.

The Final Confrontation: Identifying the Memory Leak

Armed with knowledge and insight from the dotnet-dump analysis, you’re now ready to face the memory leak head-on. By examining the relationships between objects and understanding their roots, you can pinpoint the source of the leak in your code.

Remember, the key to vanquishing memory leaks is patience and perseverance. With dotnet-dump as your guide, you’re well-equipped to navigate the complexities of .NET Core memory management and emerge victorious.

Examine managed memory usage

Before you start collecting diagnostic data to help root cause this scenario, make sure you’re actually seeing a memory leak (growth in memory usage). You can use the dotnet-counters tool to confirm that.

Open a console window and navigate to the directory where you downloaded and unzipped the sample debug target. Run the target:

dotnet run

From a separate console, find the process ID:

dotnet-counters ps

The output should be similar to:

4807 DiagnosticScena /home/user/git/samples/core/diagnostics/DiagnosticScenarios/bin/Debug/netcoreapp3.0/DiagnosticScenarios

Now, check managed memory usage with the dotnet-counters tool. The --refresh-interval specifies the number of seconds between refreshes:

dotnet-counters monitor --refresh-interval 1 -p 4807

The live output should be similar to:

Press p to pause, r to resume, q to quit.
    Status: Running

[System.Runtime]
    # of Assemblies Loaded                           118
    % Time in GC (since last GC)                       0
    Allocation Rate (Bytes / sec)                 37,896
    CPU Usage (%)                                      0
    Exceptions / sec                                   0
    GC Heap Size (MB)                                  4
    Gen 0 GC / sec                                     0
    Gen 0 Size (B)                                     0
    Gen 1 GC / sec                                     0
    Gen 1 Size (B)                                     0
    Gen 2 GC / sec                                     0
    Gen 2 Size (B)                                     0
    LOH Size (B)                                       0
    Monitor Lock Contention Count / sec                0
    Number of Active Timers                            1
    ThreadPool Completed Work Items / sec             10
    ThreadPool Queue Length                            0
    ThreadPool Threads Count                           1
    Working Set (MB)                                  83

Focusing on this line:

    GC Heap Size (MB)                                  4

You can see that the managed heap memory is 4 MB right after startup.

Now, go to the URL https://localhost:5001/api/diagscenario/memleak/20000.

Observe that the memory usage has grown to 30 MB.

GC Heap Size (MB)                                 30


By watching the memory usage, you can safely say that memory is growing or leaking. The next step is to collect the right data for memory analysis.
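
If you would rather compare readings programmatically than by eye, the idea can be sketched in a few lines of Python. This assumes you have copied the console output of dotnet-counters into a text string yourself; the function names are made up for this illustration and are not part of any .NET tool.

```python
import re

# Matches the "GC Heap Size (MB)" row that dotnet-counters prints on each refresh.
HEAP_ROW = re.compile(r"GC Heap Size \(MB\)\s+([\d,]+)")


def heap_sizes_mb(counter_output: str) -> list[int]:
    """Return every GC heap size reading found in captured dotnet-counters text."""
    return [int(value.replace(",", "")) for value in HEAP_ROW.findall(counter_output)]


def heap_growth_mb(readings: list[int]) -> int:
    """Growth between the first and last reading (0 if fewer than two readings)."""
    if len(readings) < 2:
        return 0
    return readings[-1] - readings[0]


# The two readings shown in this walkthrough: after startup and after the request.
captured = """
    GC Heap Size (MB)                                  4
    GC Heap Size (MB)                                 30
"""
readings = heap_sizes_mb(captured)
print(readings, heap_growth_mb(readings))
```

A single jump like this only suggests a leak. Growth that persists across many readings, including after garbage collections, is the stronger signal.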

Generate memory dump

When analyzing possible memory leaks, you need access to the app’s memory heap to analyze the memory contents. Looking at relationships between objects, you create theories as to why memory isn’t being freed. A common diagnostic data source is a memory dump on Windows or the equivalent core dump on Linux. To generate a dump of a .NET application, you can use the dotnet-dump tool.

Using the sample debug target previously started, run the following command to generate a Linux core dump:

dotnet-dump collect -p 4807

The result is a core dump located in the same folder.

Writing minidump with heap to ./core_20190430_185145
Complete

For a comparison over time, let the original process continue running after collecting the first dump and collect a second dump the same way. You would then have two dumps over a period of time that you can compare to see where the memory usage is growing.

Restart the failed process

Once the dump is collected, you should have sufficient information to diagnose the failed process. If the failed process is running on a production server, now is the ideal time for short-term remediation by restarting the process.

In this tutorial, you’re now done with the Sample debug target and you can close it. Navigate to the terminal that started the server, and press Ctrl+C.

Analyze the core dump

Now that you have a core dump generated, use the dotnet-dump tool to analyze the dump:

dotnet-dump analyze core_20190430_185145

Where core_20190430_185145 is the name of the core dump you want to analyze.

If you see an error complaining that libdl.so cannot be found, you may have to install the libc6-dev package. For more information, see Prerequisites for .NET on Linux.

You’ll be presented with a prompt where you can enter SOS commands. Commonly, the first thing you want to look at is the overall state of the managed heap:

> dumpheap -stat

Statistics:
              MT    Count    TotalSize Class Name
...
00007f6c1eeefba8      576        59904 System.Reflection.RuntimeMethodInfo
00007f6c1dc021c8     1749        95696 System.SByte[]
00000000008c9db0     3847       116080      Free
00007f6c1e784a18      175       128640 System.Char[]
00007f6c1dbf5510      217       133504 System.Object[]
00007f6c1dc014c0      467       416464 System.Byte[]
00007f6c21625038        6      4063376 testwebapi.Controllers.Customer[]
00007f6c20a67498   200000      4800000 testwebapi.Controllers.Customer
00007f6c1dc00f90   206770     19494060 System.String
Total 428516 objects

Here you can see that most objects are either String or Customer objects.
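
When the statistics table runs to hundreds of rows, or when you want to compare two dumps taken at different times, it can help to parse the text. The following Python sketch assumes you have saved the dumpheap -stat output as plain text; the helper names are hypothetical and the sample rows are taken from the output above.

```python
import re

# One row of `dumpheap -stat`: method table, count, total size, class name.
STAT_ROW = re.compile(r"^\s*([0-9a-fA-F]{8,16})\s+(\d+)\s+(\d+)\s+(.+?)\s*$")


def parse_dumpheap_stat(text: str) -> dict[str, tuple[int, int]]:
    """Map class name -> (instance count, total size in bytes)."""
    stats = {}
    for line in text.splitlines():
        match = STAT_ROW.match(line)
        if match:
            _mt, count, total_size, class_name = match.groups()
            # Simplification: a class name listed twice keeps the last row.
            stats[class_name] = (int(count), int(total_size))
    return stats


def largest_types(stats: dict, top: int = 3) -> list[str]:
    """Class names ordered by total size, largest first."""
    return sorted(stats, key=lambda name: stats[name][1], reverse=True)[:top]


def size_growth(before: dict, after: dict) -> dict[str, int]:
    """Per-class change in total size (bytes) between two parsed dumps."""
    names = set(before) | set(after)
    return {n: after.get(n, (0, 0))[1] - before.get(n, (0, 0))[1] for n in names}


SAMPLE = """
00007f6c1dc014c0      467       416464 System.Byte[]
00007f6c21625038        6      4063376 testwebapi.Controllers.Customer[]
00007f6c20a67498   200000      4800000 testwebapi.Controllers.Customer
00007f6c1dc00f90   206770     19494060 System.String
Total 428516 objects
"""
stats = parse_dumpheap_stat(SAMPLE)
print(largest_types(stats))
```

Running parse_dumpheap_stat on the output from two dumps and passing both results to size_growth shows which types grew between them.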

You can use the dumpheap command again with the method table (MT) to get a list of all the String instances:

> dumpheap -mt 00007f6c1dc00f90

         Address               MT     Size
...
00007f6ad09421f8 00007faddaa50f90       94
...
00007f6ad0965b20 00007f6c1dc00f90       80
00007f6ad0965c10 00007f6c1dc00f90       80
00007f6ad0965d00 00007f6c1dc00f90       80
00007f6ad0965df0 00007f6c1dc00f90       80
00007f6ad0965ee0 00007f6c1dc00f90       80

Statistics:
              MT    Count    TotalSize Class Name
00007f6c1dc00f90   206770     19494060 System.String
Total 206770 objects

You can now use the gcroot command on a System.String instance to see how and why the object is rooted:

> gcroot 00007f6ad09421f8

Thread 3f68:
    00007F6795BB58A0 00007F6C1D7D0745 System.Diagnostics.Tracing.CounterGroup.PollForValues() [/_/src/System.Private.CoreLib/shared/System/Diagnostics/Tracing/CounterGroup.cs @ 260]
        rbx:  (interior)
            ->  00007F6BDFFFF038 System.Object[]
            ->  00007F69D0033570 testwebapi.Controllers.Processor
            ->  00007F69D0033588 testwebapi.Controllers.CustomerCache
            ->  00007F69D00335A0 System.Collections.Generic.List`1[[testwebapi.Controllers.Customer, DiagnosticScenarios]]
            ->  00007F6C000148A0 testwebapi.Controllers.Customer[]
            ->  00007F6AD0942258 testwebapi.Controllers.Customer
            ->  00007F6AD09421F8 System.String

HandleTable:
    00007F6C98BB15F8 (pinned handle)
    -> 00007F6BDFFFF038 System.Object[]
    -> 00007F69D0033570 testwebapi.Controllers.Processor
    -> 00007F69D0033588 testwebapi.Controllers.CustomerCache
    -> 00007F69D00335A0 System.Collections.Generic.List`1[[testwebapi.Controllers.Customer, DiagnosticScenarios]]
    -> 00007F6C000148A0 testwebapi.Controllers.Customer[]
    -> 00007F6AD0942258 testwebapi.Controllers.Customer
    -> 00007F6AD09421F8 System.String

Found 2 roots.

You can see that the String is directly held by the Customer object and indirectly held by a CustomerCache object.
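
To check whether many objects share the same retention path, the reference chain can be pulled out of the gcroot text. This is a minimal Python sketch, assuming the output has been saved as text and that each reference step is printed on its own line starting with "->", as in the output above; the function name is made up for illustration.

```python
import re

# A reference step in gcroot output: "-> <address> <type name>".
STEP = re.compile(r"^\s*->\s+([0-9A-Fa-f]+)\s+(.+?)\s*$")


def reference_chains(gcroot_output: str) -> list[list[str]]:
    """Split gcroot output into one list of type names per root."""
    chains, current = [], []
    for line in gcroot_output.splitlines():
        match = STEP.match(line)
        if match:
            current.append(match.group(2))
        elif current:
            # Any non-step line ends the chain being collected.
            chains.append(current)
            current = []
    if current:
        chains.append(current)
    return chains


# Excerpt of the HandleTable root from the output above.
GCROOT_EXCERPT = """
HandleTable:
    00007F6C98BB15F8 (pinned handle)
    -> 00007F6BDFFFF038 System.Object[]
    -> 00007F69D0033570 testwebapi.Controllers.Processor
    -> 00007F69D0033588 testwebapi.Controllers.CustomerCache
    -> 00007F69D00335A0 System.Collections.Generic.List`1[[testwebapi.Controllers.Customer, DiagnosticScenarios]]
    -> 00007F6C000148A0 testwebapi.Controllers.Customer[]
    -> 00007F6AD0942258 testwebapi.Controllers.Customer
    -> 00007F6AD09421F8 System.String
"""
chains = reference_chains(GCROOT_EXCERPT)
print(" -> ".join(chains[0]))
```

If the chains for several sampled String addresses all pass through the same CustomerCache, that cache is the place to look in the code.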

You can continue dumping out objects to see that most String objects follow a similar pattern. At this point, the investigation provided sufficient information to identify the root cause in your code.

This general procedure allows you to identify the source of major memory leaks.

Epilogue: Cleaning Up After the Battle

With the memory leak defeated and peace restored to your application, take a moment to clean up the battlefield. Dispose of the dump files that served you well, and consider restarting your application to ensure it runs free of the burdens of the past.

Embark on this journey with confidence, for with dotnet-dump and the wisdom contained within this guide, you are more than capable of uncovering and addressing the memory leaks that challenge the stability and performance of your .NET Core applications. Happy debugging!

Sources:
https://learn.microsoft.com/en-us/dotnet/core/diagnostics/debug-memory-leak

Securing Docker Containers with a Risk-Based Approach

Embracing Pragmatism in Container Security

In the world of container orchestration, securing thousands of Docker containers is no small feat. But with a pragmatic approach and a keen understanding of risk assessment, it’s possible to create a secure environment that keeps pace with the rapid deployment of services.

The Risk Matrix: A Tool for Prioritization

At the heart of our security strategy is a Risk Matrix, a critical tool that helps us assess and prioritize vulnerabilities. The matrix classifies potential security threats based on the severity of their consequences and the likelihood of their occurrence. By focusing on Critical and High Common Vulnerabilities and Exposures (CVEs), we use this matrix to identify which issues in our Kubernetes clusters need immediate attention.

Risk Matrix – Use the following to deduce an action/outcome.

Likelihood: The Critical Dimension for SMART Security

To ensure our security measures are Specific, Measurable, Achievable, Relevant, and Time-bound (SMART), we add another dimension to the matrix: Likelihood. This dimension helps us pinpoint high-risk items that require swift action, balancing the need for security with the practicalities of our day-to-day operations.
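
A matrix like this boils down to a lookup from severity and likelihood to an action. The Python sketch below shows the shape of that lookup. The four-point scales, the multiplication of the two dimensions, the score thresholds and the action names are all assumptions made for illustration; substitute the values from your own matrix.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical four-point scale used for both severity and likelihood."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


def risk_score(severity: Level, likelihood: Level) -> int:
    """Combine the two dimensions of the matrix into a single score."""
    return int(severity) * int(likelihood)


def recommended_action(severity: Level, likelihood: Level) -> str:
    """Map a score to an action. Thresholds are illustrative, not a standard."""
    score = risk_score(severity, likelihood)
    if score >= 12:
        return "fix now"
    if score >= 6:
        return "schedule fix"
    if score >= 3:
        return "mitigate and monitor"
    return "accept"


# A Critical CVE that is unlikely to be exploited ranks below
# a High CVE on a component that is likely to be attacked.
print(recommended_action(Level.CRITICAL, Level.LOW))
print(recommended_action(Level.HIGH, Level.HIGH))
```

The point of the second dimension is visible in the example: severity alone would put the Critical CVE first, while the combined score does not.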

DevSecOps: Tactical Solutions for the Security-Minded

As we journey towards a DevSecOps culture, we often rely on tactical solutions to reinforce security, especially if the organization is not yet mature in TOGAF Security practices. These solutions are about integrating security into every step of the development process, ensuring that security is not an afterthought but a fundamental component of our container management strategy.

Container Base Images

Often, you might find Critical and High CVEs that are not under your control but come from a 3rd party base image. Take Cert-Manager and External-DNS as prime examples: they are the backbone of many Kubernetes clusters in the wild. These images rely on Google’s GoLang images, which in turn use a base image from jammy-tiny-stack. You see where I am going here? Many 3rd party images can lead you down a rabbit hole.

Remember, the goal is to MANAGE RISK, not eradicate risk; the latter is futile and leads to impractical security management. Look at ways to mitigate risks by reducing public service ingress footprints or improving North/South and East/West firewall solutions such as Calico Cloud. This allows you to contain security threats if a network segment is breached.

False Positives
The severity rating attached to a CVE often does not reflect the real risk to your environment, so always weigh the risk and likelihood. The Docker Official Images FAQ (see Sources) puts it this way:
Though not every CVE is removed from the images, we take CVEs seriously and try to ensure that images contain the most up-to-date packages available within a reasonable time frame. For many of the Official Images, a security analyzer, like Docker Scout or Clair, might show CVEs, which can happen for a variety of reasons:

The CVE has not been addressed in that particular image
- Upstream maintainers don’t consider a particular CVE to be a vulnerability that needs to be fixed, and so it won’t be fixed.
  - e.g., CVE-2005-2541 is considered a High severity vulnerability, but in Debian is considered “intended behavior,” making it a feature, not a bug.
- The OS Security team only has so much available time and has to deprioritize some security fixes over others. This could be because the threat is considered low or that it is too intrusive to backport to the version in “stable”.
  - e.g., CVE-2017-15804 is considered a High severity vulnerability, but in Debian it is marked as a “Minor issue” in Stretch and no fix is available.
- Vulnerabilities may not have an available patch, and so even though they’ve been identified, there is no current solution.
The listed CVE is a false positive
- In order to provide stability, most OS distributions take the fix for a security flaw out of the most recent version of the upstream software package and apply that fix to an older version of the package (known as backporting).
  - e.g., CVE-2020-8169 shows that curl is flawed in versions 7.62.0 through 7.70.0 and so is fixed in 7.71.0. The version that has the fix applied in Debian Buster is 7.64.0-4+deb10u2 (see security-tracker.debian.org and DSA-4881-1).
- The binary or library is not vulnerable because the vulnerable code is never executed. Security solutions make the assumption that if a dependency has a vulnerability, then the binary or library using the dependency is also vulnerable. This correctly reports vulnerabilities, but this simple approach can also lead to many false positives. It can be improved by using other tools to detect if the vulnerable functions are used. govulncheck is one such tool made for Go-based binaries.
  - e.g., CVE-2023-28642 is a vulnerability in runc less than version 1.1.5 but shows up when scanning the gosu 1.16 binary since runc 1.1.0 is a dependency. Running govulncheck against gosu shows that it does not use any vulnerable runc functions.
The security scanners can’t reliably check for CVEs, so they use heuristics to determine whether an image is vulnerable. Those heuristics fail to take some factors into account:
- Is the image affected by the CVE at all? It might not be possible to trigger the vulnerability at all with this image.
- If the image is not supported by the security scanner, it uses wrong checks to determine whether a fix is included.
  - e.g., For RPM-based OS images, the Red Hat package database is used to map CVEs to package versions. This causes severe mismatches on other RPM-based distros.
  - This also leads to not showing CVEs which actually affect a given image.

Conclusion

By combining a risk-based approach with practical solutions and an eye toward DevSecOps principles, we’re creating a robust security framework that’s both pragmatic and effective. It’s about understanding the risks, prioritizing them intelligently, and taking decisive action to secure our digital landscape.

The TOGAF® Series Guide focuses on integrating risk and security within an enterprise architecture. It provides guidance for security practitioners and enterprise architects on incorporating security and risk management into the TOGAF® framework. This includes aligning with standards like ISO/IEC 27001 for information security management and ISO 31000 for risk management principles. The guide emphasizes the importance of understanding risk in the context of achieving business objectives and promotes a balanced approach to managing both negative consequences and seizing positive opportunities. It highlights the need for a systematic approach, embedding security early in the system development lifecycle and ensuring continuous risk and security management throughout the enterprise architecture.

Sometimes, security has to be driven from the bottom up. Ideally, it should be driven from the top down, but we are all responsible for security; if you own many platforms and compute runtimes in the cloud, you must ensure you manage risk under your watch. Otherwise, it is only a matter of time before you get pwned, something I have witnessed repeatedly.

The Real World:
1. Secure your containers from the bottom up in your CI/CD pipelines with tools like Snyk.
2. Secure your containers from the top down in your cloud infrastructure with tools like Azure Defender – Container Security.
3. Look at ways to enforce the above through governance and policies; this means you REDUCE the likelihood of a threat occurring from both sides of the enterprise.
4. Ensure firewall policies are in place to segment your network so that a breach in one area will not fan out into other network segments. This means you must focus initially on North / South traffic (Ingress/Egress) and then on East / West traffic (traversing your network segments and domains).

There is a plethora of other risk management strategies, from penetration testing and honeypots to SIEM. Ultimately, you can all make a difference no matter where in the technology chart you sit.

Principles
The underlying ingredient for establishing a vision for your reorganisation in the TOGAF framework is defining the principles of the enterprise. In my view, protecting customer data is not just a legal obligation; it’s a fundamental aspect of building trust and ensuring the longevity of an enterprise.

Establishing a TOGAF principle to protect customer data during the Vision stage of enterprise architecture development is crucial because it sets the tone for the entire organization’s approach to cybersecurity. It ensures that data protection is not an afterthought but a core driver of the enterprise’s strategic direction, technology choices, and operational processes. With cyber threats evolving rapidly, embedding a principle of customer data protection early on ensures that security measures are integrated throughout the enterprise from the ground up, leading to a more resilient and responsible business.

Sources:
GitHub – docker-library/faq: Frequently Asked Questions

https://pubs.opengroup.org/togaf-standard/integrating-risk-and-security/integrating-risk-and-security_0.html

https://www.tigera.io/tigera-products/calico-cloud/

https://snyk.io/

GeekOut – Get vCores for Kubernetes

When working out licensing costs, you will more often than not need to know the number of vCores. As a baseline, use the following script.

Get-AKSVCores.ps1

<#
.SYNOPSIS
    Calculates total vCores for each Azure Kubernetes Service (AKS) cluster listed in a CSV file.

.DESCRIPTION
    This script imports a CSV file containing AKS cluster information, iterates through each cluster, and calculates the total number of virtual cores (vCores) based on the node pools associated with each cluster. It requires Azure CLI and assumes that the user has the necessary permissions to access the AKS clusters and VM sizes.

.PARAMETER CsvFilePath
    Full path to the CSV file containing AKS cluster information. The CSV file should have columns named 'ClusterName', 'Subscription', and 'ResourceGroup'.

.PARAMETER VmLocation
    Azure region to get VM sizes for the calculation. Default is 'Australia East'.

.PARAMETER PerformAzureLogin
    Indicates whether the script should perform Azure login. Set to $true if Azure login is required within the script; otherwise, $false. Default is $true.

.EXAMPLE
    .\Get-AKSVCores.ps1 -CsvFilePath "C:\path\to\aks_clusters.csv" -VmLocation "Australia East" -PerformAzureLogin $true

    This example runs the script with the specified CSV file path, VM location, and performs Azure login within the script.

.INPUTS
    CSV file

.OUTPUTS
    Console output of each AKS cluster's name, subscription, resource group, and total vCores.

.NOTES
    Version:        1.0
    Author:         Romiko Derbynew
    Creation Date:  2024-01-22
    Purpose/Change: Get Total VCores for Clusters
#>

param(
    [Parameter(Mandatory = $true)]
    [string]$CsvFilePath,

    [Parameter(Mandatory = $false)]
    [string]$VmLocation = "Australia East",

    [Parameter(Mandatory = $false)]
    [bool]$PerformAzureLogin = $true
)

# Azure login if required
if ($PerformAzureLogin) {
    az login
}

# Import the CSV file
$aksClusters = Import-Csv -Path $CsvFilePath

# Header row for the comma-separated console output
Write-Host "ClusterName,Subscription,ResourceGroup,TotalVCores"

# Iterate through each AKS cluster
foreach ($cluster in $aksClusters) {
    # Set the current subscription
    az account set --subscription $cluster.Subscription

    # Resource group comes straight from the CSV row
    $resourceGroup = $cluster.ResourceGroup

    # Get the node pools for the AKS cluster
    $nodePools = az aks nodepool list --resource-group $resourceGroup --cluster-name $cluster.ClusterName --query "[].{name: name, count: count, vmSize: vmSize}" | ConvertFrom-Json

    $totalVCores = 0

    # Iterate through each node pool and calculate total vCores
    foreach ($nodePool in $nodePools) {
        # Get the VM size details
        $vmSizeDetails = az vm list-sizes --location $VmLocation --query "[?name=='$($nodePool.vmSize)'].{numberOfCores: numberOfCores}" | ConvertFrom-Json
        $vCores = $vmSizeDetails.numberOfCores

        # Calculate total vCores for the node pool (cores per node x node count)
        $totalVCores += $vCores * $nodePool.count
    }

    # Output the total vCores for the cluster
    Write-Host "$($cluster.ClusterName),$($cluster.Subscription),$($cluster.ResourceGroup),$totalVCores"
}

Embracing Microservices Architecture with the TOGAF® Framework

Introduction to Microservices Architecture (MSA) in the TOGAF® Framework

In the ever-evolving digital landscape, the TOGAF® Standard, developed by The Open Group, offers a comprehensive approach for managing and governing Microservices Architecture (MSA) within an enterprise. This guide is dedicated to understanding MSA within the context of the TOGAF® framework, providing insights into the creation and management of MSA and its alignment with business and IT cultures.

What is Microservices Architecture (MSA)?

MSA is a style of architecture where systems or applications are composed of independent and self-contained services. Unlike a product framework or platform, MSA is a strategy for building large, distributed systems. Each microservice in MSA is developed, deployed, and operated independently, focusing on a single business function and is self-contained, encapsulating all necessary IT resources. The key characteristics of MSA include service independence, single responsibility, and self-containment.

The Role of MSA in Enterprise Architecture

MSA plays a crucial role in simplifying business operations and enhancing interoperability within the business. This architecture style is especially beneficial in dynamic market environments where companies seek to manage complexity and enhance agility. The adoption of MSA leads to better system availability and scalability, two crucial drivers in modern business environments.

Aligning MSA with TOGAF® Standards

The TOGAF® Standard, with its comprehensive view of enterprise architecture, is well-suited to support MSA. It encompasses all business activities, capabilities, information, technology, and governance of the enterprise. The Preliminary Phase of TOGAF® focuses on determining the architecture’s scope and principles, which are essential for MSA development. This phase addresses the skills, capabilities, and governance required for MSA and ensures alignment with the overall enterprise architecture.

Implementing MSA in an Enterprise

Enterprises adopting MSA should integrate it with their architecture principles, acknowledging the benefits of resilience, scalability, and reliability. Whether adapting a legacy system or launching a new development, the implications for the organization and architecture governance are pivotal. The decision to adopt MSA principles should be consistent with the enterprise’s overall architectural direction.

Practical Examples
TOGAF does not dictate which tools to use. However, I have found it very useful to couple Domain-Driven Design with Event Storming on a Miro board, where you can get all stakeholders together and nut out the various domains, subdomains and events.

https://www.eventstorming.com/

*Event Storming – Business Process Example* – source Lucidcharts

Within each domain, you can then start making the data independent as well, using patterns such as Strangler Fig or an IPC adapter.

Extract a service from a monolith

After you identify the ideal service candidate, you must identify a way for both microservice and monolithic modules to coexist. One way to manage this coexistence is to introduce an inter-process communication (IPC) adapter, which can help the modules work together. Over time, the microservice takes on the load and eliminates the monolithic component. This incremental process reduces the risk of moving from the monolithic application to the new microservice because you can detect bugs or performance issues in a gradual fashion.

The following diagram shows how to implement the IPC approach:

Figure 2. An IPC adapter coordinates communication between the monolithic application and a microservices module.

In figure 2, module Z is the service candidate that you want to extract from the monolithic application. Modules X and Y are dependent upon module Z. Modules X and Y, which remain in the monolith, use an IPC adapter in the monolithic application to communicate with module Z through a REST API.
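To make the adapter idea concrete, here is a minimal Python sketch. It assumes module Z exposes a single hypothetical operation, `price_for`, and it swaps the real REST call for an injected `transport` function so the example runs without a network. All class, function, and path names are illustrative assumptions, not taken from the Google Cloud guide.

```python
import json
from typing import Callable, Protocol

# Hypothetical transport signature: (method, path, body) -> response body as JSON text.
Transport = Callable[[str, str, str], str]


class ModuleZ(Protocol):
    """Interface that modules X and Y already depend on inside the monolith."""

    def price_for(self, sku: str) -> float: ...


class LegacyModuleZ:
    """Original in-process implementation that is being retired."""

    def price_for(self, sku: str) -> float:
        return {"apple": 1.5}.get(sku, 0.0)


class ModuleZIpcAdapter:
    """Same interface as module Z, but every call is forwarded to the microservice."""

    def __init__(self, transport: Transport) -> None:
        self._transport = transport

    def price_for(self, sku: str) -> float:
        raw = self._transport("GET", f"/prices/{sku}", "")
        return float(json.loads(raw)["price"])


def module_x_total(z: ModuleZ, skus: list[str]) -> float:
    """Module X is unchanged: it only knows the ModuleZ interface."""
    return sum(z.price_for(sku) for sku in skus)


def fake_microservice(method: str, path: str, body: str) -> str:
    """Stand-in for the extracted service; a real transport would issue an HTTP request."""
    prices = {"apple": 1.5}
    sku = path.rsplit("/", 1)[-1]
    return json.dumps({"price": prices.get(sku, 0.0)})


# Swapping the implementation is a wiring change only; module X does not notice.
legacy_total = module_x_total(LegacyModuleZ(), ["apple", "apple"])
adapter_total = module_x_total(ModuleZIpcAdapter(fake_microservice), ["apple", "apple"])
```

Because the callers depend only on the interface, you can switch between the legacy module and the adapter per environment or per feature flag while the microservice gradually takes on the load.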

The next document in the Google Cloud series linked below, Interservice communication in a microservices setup, describes the Strangler Fig pattern and how to deconstruct a service from the monolith.

Learn more about these patterns here – https://cloud.google.com/architecture/microservices-architecture-refactoring-monoliths

Conclusion

When integrated with the TOGAF® framework, Microservices Architecture (MSA) provides a strong and adaptable method for handling intricate, distributed architectures. Implementing MSA enables businesses to boost their agility, scalability, and resilience, thereby increasing their capacity to adjust to shifting market trends.

During the Architecture Vision phase (Phase A), establish your fundamental principles.

Identify key business areas to concentrate on, such as E-Commerce or online services.

Utilize domain-driven design (code patterns) and event storming (practical approach) to delineate domains and subdomains, using this framework as a business reference model to establish the groundwork of your software architecture.

Develop migration strategies like IPC Adapters/Strangler Fig patterns for database decoupling.

In the Technology Architecture phase (Phase D) of the ADM, plan for container orchestration tools such as Kubernetes.

Subsequently, pass the project to Solution Architects to address the remaining non-functional requirements, from observability to security, during Phase F (Migration Planning) of the ADM. This enables them to define distinct work packages adhering to SMART principles.

TIP: When migrating databases, the legacy database will sometimes remain the primary source of data, with the microservice database acting as a secondary until the migration is complete. Do not underestimate how much tools like CDC can help here.

Change Data Capture (CDC) is an approach microservices use to track changes made to the data in a database. It lets a microservice be notified of any modification so that it can update its own data accordingly. Because changes are delivered as they happen, CDC avoids the time that would otherwise be spent on regular full scans of the database.

References

https://pubs.opengroup.org/togaf-standard/guides/microservices-architecture.html

https://www.lucidchart.com/blog/ddd-event-storming

https://waqasahmeddev.medium.com/how-to-migrate-to-microservices-with-the-strangler-pattern-64f6144ae4db

https://cloud.google.com/architecture/microservices-architecture-refactoring-monoliths

https://learn.microsoft.com/en-us/azure/architecture/patterns/strangler-fig

Navigating the TOGAF Government Reference Model (GRM)

Hey Tech Gurus!

Today, let’s decode the Government Reference Model (GRM) from the TOGAF Series Guide. This model is a game-changer for public sector organizations, aiming to standardize the maze of public sector business architecture.

What is the GRM? The GRM is an exhaustive, mutually exclusive framework designed for the public sector. It categorizes various government departments and provides a unified language to describe their business architecture. It’s split across sectors like Defense and Security, Health and Wellbeing, Education, and more.

Objective and Overview: The GRM aims to provide a standard reference model template adaptable across different architectural approaches. It’s all about enabling collaboration between architecture service providers and fostering the Business Architecture profession.

Breaking Down the GRM: The GRM is structured into three levels:

Level 1: Sectors defining business areas of the government.
Level 2: Functions detailing what the government does at an aggregated level.
Level 3: Services, further refining government functions at a component level.
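If it helps to picture the three levels as data, here is a small Python sketch of the hierarchy. The sector names come from the list above; the function and service names are placeholders invented for illustration and are not taken from the GRM itself.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    """Level 3: a component-level refinement of a function."""
    name: str


@dataclass
class Function:
    """Level 2: what the government does at an aggregated level."""
    name: str
    services: list[Service] = field(default_factory=list)


@dataclass
class Sector:
    """Level 1: a business area of the government."""
    name: str
    functions: list[Function] = field(default_factory=list)


def paths(sectors: list[Sector]) -> list[str]:
    """Flatten the hierarchy into 'Sector / Function / Service' strings."""
    return [
        f"{sector.name} / {function.name} / {service.name}"
        for sector in sectors
        for function in sector.functions
        for service in function.services
    ]


# Function and service names below are hypothetical placeholders.
model = [
    Sector("Education", [
        Function("Hypothetical function A", [
            Service("Hypothetical service A1"),
            Service("Hypothetical service A2"),
        ]),
    ]),
    Sector("Health and Wellbeing"),
]

catalogue = paths(model)
```

A flat catalogue like this is one way to check that every service maps to exactly one function and sector, which is the point of a mutually exclusive model.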

Why does the GRM matter? For tech folks in the public sector, the GRM is a toolkit to plan and execute effective transformational changes. It’s about understanding the big picture of public services and aligning technology to strategic objectives.

GRM and TOGAF ADM: The GRM aligns with Phase B: Business Architecture of the TOGAF ADM (Architecture Development Method). It provides a pattern for accelerating the development of reference models within Business Architecture.

In a nutshell, the GRM is a breakthrough in organizing and understanding the complex ecosystem of public sector services. It’s about bringing consistency, collaboration, and clarity to how we view public sector architecture.

So, next time you’re navigating the complex world of public sector IT, remember that the GRM is your compass!

References
https://pubs.opengroup.org/togaf-standard/reference-models/government-reference-model.html

Untangling TOGAF’s C-MDM (Master Data): A Friendly Guide

Hey Tech Friends,

Let’s decode the TOGAF® Series Guide: Information Architecture – Customer Master Data Management (C-MDM). This document isn’t just about mastering data; it’s a journey into the heart of harmonizing customer data across an organization. The C stands for Customer (handily, data architecture also sits in Phase C of the ADM cycle), and MDM is all about the enterprise’s master data.

The Core Idea: C-MDM is all about streamlining and enhancing how an organization manages its customer data. It’s like giving every piece of customer information the VIP treatment, ensuring it’s accurate, accessible, and secure.

Generic Description of the Capabilities of the Organization

Sources: https://pubs.opengroup.org/togaf-standard/master-data-management/index.html (inspired by Michael Porter’s value chain)

Why It Matters: In our tech-driven world, customer data is gold. But it’s not just about having data; it’s about making it work efficiently. C-MDM is the toolkit for ensuring this data is managed smartly, reducing duplication, and enhancing access to this vital resource.

The TOGAF Twist: The guide integrates C-MDM within TOGAF’s Architecture Development Method (ADM). This means it’s not just a standalone concept but a part of the larger enterprise architecture landscape. It’s like having a detailed map for your journey in data management, ensuring every step aligns with the broader organizational goals.

Key Components:

Information Architecture Capability: Think of this as the foundation. It’s about understanding and handling the complexity of data across the organization.
Data Management Capabilities: This is where things get practical. It involves managing the lifecycle of data – from its creation to its retirement.
C-MDM Capability: The star of the show. This section delves into managing customer data as a valuable asset, focusing on quality, availability, and security.
Process and Methodology: Here, the guide adapts TOGAF ADM for C-MDM, offering a structured yet flexible approach to manage customer data.
Reference Models: These models provide a clear picture of what C-MDM entails, including the scope of customer data and detailed business functions.
Integration Methodologies: It’s about fitting C-MDM into the existing IT landscape, ensuring smooth integration and operation.

What’s in It for Tech Gurus? As a tech enthusiast, this guide offers a deep dive into managing customer data with precision. It’s not just about handling data; it’s about transforming it into an asset that drives business value.

Sources: https://pubs.opengroup.org/togaf-standard/master-data-management/index.html

So, whether you’re an enterprise architect, data manager, or just a tech aficionado, this guide is your compass in navigating the complex world of customer data management. It’s about making data not just big, but smart and efficient.

Happy Data Managing!

PS: Fostering a culture of data-driven decisions at all levels of your organisation, from value streams in the Business Domain to Observability in the Technology Domain, will allow your stakeholders and teams to make better strategic and tactical decisions. Invest wisely here and ensure insights are accessible to all key stakeholders – those stakeholders that have the influence and vested interest. This is where AI will revolutionise data-driven decisions; instead of looking at reports, you can “converse” with AI about your data in a customised reference vector DB.

References:

AI Chatbots to make Data-Driven Decisions

Sources: https://pubs.opengroup.org/togaf-standard/master-data-management/index.html
