Service Fabric – Upgrading VMSS Disks, Operating System on Primary Node Type

How do you upgrade the existing Data Disk on a primary Node Type Virtual Machine ScaleSet in Service Fabric?

How do you upgrade the existing Operating System on a primary Node Type VMSS in Service Fabric?

How do you move the Data Disk on a primary Node Type VMSS in Service Fabric?

How do you monitor the status during the upgrade, so you know exactly how many seed nodes have migrated over to the new scale set?

Note – We also successfully increased the SKU size during this process, although this is not supported by Microsoft. To do the same, increase the SKU in your ARM template and then, after the successful transfer to the new VMSS, run Update-AzureRmServiceFabricDurability.

Considerations

  • You know how to use ARM to deploy an Azure Load Balancer
  • You know how to use ARM to deploy a VMSS
  • Service Fabric Durability Tier/Reliability Tier must be at least Silver
  • Keep the original Azure DNS name on the Load Balancer that is used to connect to the Service Fabric Endpoint. It is very important to write it down as a backup
  • Reduce the TTL of all your DNS records before the upgrade, so that downtime during the switch is limited to the TTL value, e.g. 10 minutes. (Ensure you have access to your primary DNS provider to do this)
  • Prepare an ARM template to add the new Azure Load Balancer that the new VMSS will attach to (Backend Pool)
  • Prepare an ARM template to add the new VMSS to an existing Service Fabric primary Node Type
  • Deploy the new Azure Load Balancer + Virtual Machine Scale Set to the Service Fabric primary Node Type
  • Run RemoveScaleSetFromClusterController.ps1 on a NEW node in the NEW VMSS. This script monitors and facilitates moving the primary Node Type to the new VMSS for you. It shows the status of the seed nodes as they move from the original VMSS to the new VMSS.
  • When it has completed, the last step is to update DNS.
  • Run MoveDNSToNewPublicIPController.ps1

ARM Templates

You will need only 2 templates: one to deploy a new Azure Load Balancer and one to deploy the new VMSS to the existing Service Fabric Cluster.

You will also need a PowerShell script that is run by a custom script extension.

Custom Script – prepare_sf_vm.ps1


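# Find every uninitialised (raw) disk, ordered by disk number
$disks = Get-Disk | Where-Object PartitionStyle -eq 'raw' | Sort-Object Number

# Drive letters F..Y (ASCII 70..89)
$letters = 70..89 | ForEach-Object { [char]$_ }
$count = 0
$label = "datadisk"

# Initialise, partition and format each raw disk, assigning the next free letter
foreach ($disk in $disks) {
    $driveLetter = $letters[$count].ToString()
    $disk |
    Initialize-Disk -PartitionStyle GPT -PassThru |
    New-Partition -UseMaximumSize -DriveLetter $driveLetter |
    Format-Volume -FileSystem NTFS -NewFileSystemLabel "$label$count" -Confirm:$false -Force
    $count++
}

# Disable Windows Update (create the policy key first if it does not exist yet)
$auKey = 'HKLM:\SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate\AU'
if (-not (Test-Path $auKey)) { New-Item -Path $auKey -Force | Out-Null }
Set-ItemProperty -Path $auKey -Name NoAutoUpdate -Value 1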

 

Load Balancer – azuredeploy_servicefabric_loadbalancer.json

Use your own Load Balancer ARM template. There is no need to attach a backend pool, as this is done by the VMSS template below.

Service Fabric attach new VMSS – azuredeploy_add_new_VMSS_to_nodeType.json

Create your own VMSS that you attach to Service Fabric. The important settings are the following.

nodeTypeRef (To attach VMSS to existing PrimaryNodeType).
dataPath (To use a new Disk for data)
dataDisk (to add a new managed physical disk)

We use F:\ onwards because D: is reserved for temporary storage and E: is reserved for the CD-ROM drive in Azure VMs.
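As a quick illustration of the drive-letter scheme, the sketch below expands the same 70..89 range used in prepare_sf_vm.ps1 and shows which letter and label each data disk would receive. The `$diskCount` value is a made-up example; the real script counts the raw disks it finds on the VM.

```powershell
# ASCII codes 70..89 map to the letters F..Y, the same range prepare_sf_vm.ps1 uses.
$letters = 70..89 | ForEach-Object { [char]$_ }

# Hypothetical number of data disks, for illustration only.
$diskCount = 3

# Pair each disk index with the drive letter and volume label it would be given.
$plan = for ($i = 0; $i -lt $diskCount; $i++) {
    [pscustomobject]@{
        Index       = $i
        DriveLetter = $letters[$i].ToString()
        Label       = "datadisk$i"
    }
}

$plan | Format-Table -AutoSize
```

With a single data disk at LUN 0, as in the template below, the disk ends up as F: with the label datadisk0, which is why dataPath points at F:.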


{
                                "name": "[concat('ServiceFabricNodeVmExt',variables('vmNodeType0Name'))]",
                                "properties": {
                                    "type": "ServiceFabricNode",
                                    "autoUpgradeMinorVersion": true,
                                    "protectedSettings": {
                                        "StorageAccountKey1": "[listKeys(resourceId('Microsoft.Storage/storageAccounts', variables('supportLogStorageAccountName')),'2015-05-01-preview').key1]",
                                        "StorageAccountKey2": "[listKeys(resourceId('Microsoft.Storage/storageAccounts', variables('supportLogStorageAccountName')),'2015-05-01-preview').key2]"
                                    },
                                    "publisher": "Microsoft.Azure.ServiceFabric",
                                    "settings": {
                                        "clusterEndpoint": "[parameters('existingClusterConnectionEndpoint')]",
                                        "nodeTypeRef": "[parameters('existingNodeTypeName')]",
                                        "dataPath": "F:\\\\SvcFab",
                                        "durabilityLevel": "Silver",
                                        "enableParallelJobs": true,
                                        "nicPrefixOverride": "[variables('subnet0Prefix')]",
                                        "certificate": {
                                            "thumbprint": "[parameters('certificateThumbprint')]",
                                            "x509StoreName": "[parameters('certificateStoreValue')]"
                                        }
                                    },
                                    "typeHandlerVersion": "1.0"
                                }
                            },
....
.......
.........
"storageProfile": {
                        "imageReference": {
                            "publisher": "[parameters('vmImagePublisher')]",
                            "offer": "[parameters('vmImageOffer')]",
                            "sku": "2016-Datacenter-with-Containers",
                            "version": "[parameters('vmImageVersion')]"
                        },
                        "osDisk": {
                            "managedDisk": {
                                "storageAccountType": "[parameters('storageAccountType')]"
                            },
                            "caching": "ReadWrite",
                            "createOption": "FromImage"
                        },
                        "dataDisks": [
                            {
                                "managedDisk": {
                                    "storageAccountType": "[parameters('storageAccountType')]"
                                },
                                "lun": 0,
                                "createOption": "Empty",
                                "diskSizeGB": "[parameters('dataDiskSize')]",
                                "caching": "None"
                            }
                        ]
                    }

...
....
.....
 "virtualMachineProfile": {
                    "extensionProfile": {
                        "extensions": [
                            {
                                "name": "PrepareDataDisk",
                                "properties": {
                                    "publisher": "Microsoft.Compute",
                                    "type": "CustomScriptExtension",
                                    "typeHandlerVersion": "1.8",
                                    "autoUpgradeMinorVersion": true,
                                    "settings": {
                                    "fileUris": [
                                        "[variables('vmssSetupScriptUrl')]"
                                    ],
                                    "commandToExecute": "[concat('powershell -ExecutionPolicy Unrestricted -File prepare_sf_vm.ps1 ')]"
                                    }
                                }
                            },


 

Once the new VMSS is attached to the existing Node Type, you should see the extra nodes in Service Fabric. The next step is to disable and remove the existing VMSS. This is an online operation, so you should be fine. However, later we will need to update DNS for the Cluster Endpoint. This is important so that the PowerShell admin tools can still connect to the Service Fabric cluster.

RemoveScaleSetFromClusterController.ps1

Remote into one of the NEW VMSS virtual machines and run the following script. It makes sure that your seed nodes migrate over. This can take a long time: the Microsoft docs only say that it takes a long time, and how long depends on your cluster. For a cluster with 5 seed nodes, it took nearly 4 hours! So be patient and update the loop timeout to match your environment; increase it if you have more than 5 seed nodes. My general rule is to allow 45 minutes per seed node transfer.
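The 45-minutes-per-seed-node rule can be turned into a loop count with a small helper. This is only a sketch: `Get-SeedMigrationLoopTimeout` is a hypothetical function that is not part of the script below, and the 45 minute figure is my rule of thumb rather than a documented value.

```powershell
# Hypothetical helper: derive the number of loop iterations from the seed node count.
function Get-SeedMigrationLoopTimeout {
    param(
        [Parameter(Mandatory = $true)][int]$SeedNodeCount,
        [int]$MinutesPerSeedNode = 45,   # rule of thumb from this article
        [int]$LoopWaitSeconds = 60       # matches $loopWait in the script
    )
    $totalSeconds = $SeedNodeCount * $MinutesPerSeedNode * 60
    # Round up so a partial interval still gets a full iteration.
    [int][math]::Ceiling($totalSeconds / $LoopWaitSeconds)
}

# 5 seed nodes at 45 minutes each, polled every 60 seconds
Get-SeedMigrationLoopTimeout -SeedNodeCount 5
```

For 5 seed nodes this gives 225 iterations, so the script's default of 360 iterations at 60 seconds (6 hours) leaves some headroom. For larger clusters, assign the result to $loopTimeout.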


#Requires -Version 5.0
#Requires -RunAsAdministrator



param (
    [Parameter(Mandatory = $true)]
    [string]
    $subscriptionName,

    [Parameter(Mandatory = $true)]
    [string] 
    $scaleSetToDisable,

    [Parameter(Mandatory = $true)]
    [string]
    $scaleSetToEnable,

    [Parameter(Mandatory = $true)]
    [string] 
    $resourceGroupName
)

Install-Module AzureRM.Compute -Force

Import-Module ServiceFabric -Force
Import-Module AzureRM.Compute -Force

function Disable-InternetExplorerESC {
    $AdminKey = "HKLM:\SOFTWARE\Microsoft\Active Setup\Installed Components\{A509B1A7-37EF-4b3f-8CFC-4F3A74704073}"
    $UserKey = "HKLM:\SOFTWARE\Microsoft\Active Setup\Installed Components\{A509B1A8-37EF-4b3f-8CFC-4F3A74704073}"
    Set-ItemProperty -Path $AdminKey -Name "IsInstalled" -Value 0
    Set-ItemProperty -Path $UserKey -Name "IsInstalled" -Value 0
    Stop-Process -Name Explorer
    Write-Host "IE Enhanced Security Configuration (ESC) has been disabled." -ForegroundColor Green
}

function Enable-InternetExplorerESC {
    $AdminKey = "HKLM:\SOFTWARE\Microsoft\Active Setup\Installed Components\{A509B1A7-37EF-4b3f-8CFC-4F3A74704073}"
    $UserKey = "HKLM:\SOFTWARE\Microsoft\Active Setup\Installed Components\{A509B1A8-37EF-4b3f-8CFC-4F3A74704073}"
    Set-ItemProperty -Path $AdminKey -Name "IsInstalled" -Value 1
    Set-ItemProperty -Path $UserKey -Name "IsInstalled" -Value 1
    Stop-Process -Name Explorer
    Write-Host "IE Enhanced Security Configuration (ESC) has been enabled." -ForegroundColor Green
}

$ErrorActionPreference = "Stop"

Disable-InternetExplorerESC

Login-AzureRmAccount -SubscriptionName $subscriptionName

Write-Host "Before you continue:  Ensure IE Enhanced Security is off."
Write-Host "Before you continue:  Ensure your new scaleset is ALREADY added to the Service Fabric Cluster"
Pause

try {
    Connect-ServiceFabricCluster
    Get-ServiceFabricClusterHealth
} catch {
    Write-Error "Please run this script from one of the new nodes in the cluster."
}

Write-Host "Please do not continue unless the Cluster is healthy and both Scale Sets are present in the SFCluster."
Pause

$nodesToDisable = Get-ServiceFabricNode | Where NodeName -match "_($scaleSetToDisable)_\d+"
$OldSeedCount = ( $nodesToDisable | Where IsSeedNode -eq  $true | Measure-Object).Count
$nodesToEnable = Get-ServiceFabricNode | Where NodeName -match "_($scaleSetToEnable)_\d+"

if($OldSeedCount -eq 0){
    Write-Error "Node Seed count must be greater than zero."
    exit
}

if($nodesToDisable.Count -eq 0){
    Write-Error "No nodes to disable found."
    exit
}

if($nodesToEnable.Count -eq 0){
    Write-Error "No nodes to enable found."
    exit
}

If (-not ($nodesToEnable.Count -ge $OldSeedCount)) {
    Write-Error "The new VM Scale Set must have at least $OldSeedCount nodes in order for the Seed Nodes to migrate over."
    exit
}

Write-Host "Disabling nodes in VMSS $scaleSetToDisable. Are you sure?"
Pause

foreach($node in $nodesToDisable){
    Disable-ServiceFabricNode -NodeName $node.NodeName -Intent RemoveNode -Force
}

Write-Host "Checking node status..."
$loopTimeout = 360
$loopWait = 60
$oldNodesDeactivated = $false
$newSeedNodesReady = $false

while ($loopTimeout -ne 0) {
    Get-Date -Format o
    Write-Host
    Write-Host "Nodes To Remove"

    # Re-query every node on each pass; the objects captured before deactivation are stale snapshots.
    $disableStates = foreach($nodeToDisable in $nodesToDisable) {
        $state = Get-ServiceFabricNode -NodeName $nodeToDisable.NodeName
        $msg = "{0} NodeDeactivationInfo: {1} IsSeedNode: {2} NodeStatus {3}" -f $nodeToDisable.NodeName, $state.NodeDeactivationInfo.Status, $state.IsSeedNode, $state.NodeStatus
        Write-Host $msg
        $state
    }

    $oldNodesDeactivated = (@($disableStates) | Where-Object { ($_.NodeStatus -eq [System.Fabric.Query.NodeStatus]::Disabled) -and ($_.NodeDeactivationInfo.Status -eq "Completed") } | Measure-Object).Count -eq @($nodesToDisable).Count

    Write-Host
    Write-Host "Nodes To Add Status"

    $enableStates = foreach($nodeToEnable in $nodesToEnable) {
        $state = Get-ServiceFabricNode -NodeName $nodeToEnable.NodeName
        $msg = "{0} IsSeedNode: {1}, NodeStatus: {2}" -f $nodeToEnable.NodeName, $state.IsSeedNode, $state.NodeStatus
        Write-Host $msg
        $state
    }
    $newSeedNodesReady = (@($enableStates) | Where-Object { ($_.NodeStatus -eq [System.Fabric.Query.NodeStatus]::Up) -and $_.IsSeedNode} | Measure-Object).Count -ge $OldSeedCount
    if($oldNodesDeactivated -and $newSeedNodesReady) {
        break
    }
    $loopTimeout -= 1
    Start-Sleep $loopWait
}

if (-not ($oldNodesDeactivated)) {
    Write-Error "A node failed to deactivate within the time period specified."
    exit
}

$loopTimeout = 180
while ($loopTimeout -ne 0) {
    Write-Host
    Write-Host "Nodes To Add Status"

    # Re-query the new nodes on each pass so the seed node check uses current state.
    $enableStates = foreach($nodeToEnable in $nodesToEnable) {
        $state = Get-ServiceFabricNode -NodeName $nodeToEnable.NodeName
        $msg = "{0} IsSeedNode: {1}, NodeStatus: {2}" -f $nodeToEnable.NodeName, $state.IsSeedNode, $state.NodeStatus
        Write-Host $msg
        $state
    }
    $newSeedNodesReady = (@($enableStates) | Where-Object { ($_.NodeStatus -eq [System.Fabric.Query.NodeStatus]::Up) -and $_.IsSeedNode} | Measure-Object).Count -ge $OldSeedCount
    if($newSeedNodesReady) {
        break
    }
    $loopTimeout -= 1
    Start-Sleep $loopWait
}

$NewSeedNodes = Get-ServiceFabricNode | Where-Object {($_.NodeName -match "_($scaleSetToEnable)_\d+") -and ($_.IsSeedNode -eq $True)}
Write-Host "New Seed Nodes are:"
$NewSeedNodes | Select NodeName
$NewSeedNodesCount = ($NewSeedNodes  | Measure-Object).Count

if($NewSeedNodesCount -ge $OldSeedCount) {
    Write-Host "Removing the scale set $scaleSetToDisable"
    Remove-AzureRmVmss -ResourceGroupName $ResourceGroupName -VMScaleSetName $scaleSetToDisable -Force
    Write-Host "Removed scale set $scaleSetToDisable"

    Write-Host "Removing Node State for old nodes"
    $nodesToDisable | Remove-ServiceFabricNodeState -Force
    Write-Host "Done"

    Get-ServiceFabricClusterHealth
    Get-ServiceFabricNode
} else {
    Write-Host "New Seed Nodes do not match the minimum requirements $NewSeedNodesCount."
    Write-Host "Manually run  Remove-AzureRmVmss"
    Write-Host "Then Manually run  Remove-ServiceFabricNodeState"
    Get-ServiceFabricClusterHealth
    Get-ServiceFabricNode
}

Enable-InternetExplorerESC

This script is extremely useful: you can watch the seed nodes transfer and the nodes in the existing primary Node Type VMSS being disabled.

You know it is successful when the old scale set has ZERO seed nodes. All seed nodes must transfer over to the new nodes, and IsSeedNode should be False for every node in the old scale set by the end of the script execution.
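If you want a compact view of that success condition, a per-scale-set seed node count can be computed from the node list. The sketch below is an illustration only: `Get-SeedNodeSummary` is a hypothetical helper, the scale set names are made up, and the sample objects stand in for the output of Get-ServiceFabricNode so the logic can be tried away from a cluster.

```powershell
# Hypothetical helper: count nodes and seed nodes per scale set.
# Accepts any objects with NodeName and IsSeedNode properties,
# such as the output of Get-ServiceFabricNode.
function Get-SeedNodeSummary {
    param(
        [Parameter(Mandatory = $true)][object[]]$Nodes,
        [Parameter(Mandatory = $true)][string[]]$ScaleSetNames
    )
    foreach ($name in $ScaleSetNames) {
        # Same node-name convention the controller script matches on: _<scaleset>_<index>
        $pattern = "_($([regex]::Escape($name)))_\d+"
        $members = @($Nodes | Where-Object { $_.NodeName -match $pattern })
        [pscustomobject]@{
            ScaleSet  = $name
            Nodes     = $members.Count
            SeedNodes = @($members | Where-Object { $_.IsSeedNode }).Count
        }
    }
}

# Stand-in data for illustration; on a cluster node pass (Get-ServiceFabricNode) instead.
$sample = @(
    [pscustomobject]@{ NodeName = '_oldvmss_0'; IsSeedNode = $false }
    [pscustomobject]@{ NodeName = '_oldvmss_1'; IsSeedNode = $false }
    [pscustomobject]@{ NodeName = '_newvmss_0'; IsSeedNode = $true }
    [pscustomobject]@{ NodeName = '_newvmss_1'; IsSeedNode = $true }
)

$summary = Get-SeedNodeSummary -Nodes $sample -ScaleSetNames 'oldvmss', 'newvmss'
$summary | Format-Table -AutoSize
```

The migration is complete when the row for the old scale set shows zero seed nodes and the row for the new scale set shows at least the original seed node count.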

MoveDNSToNewPublicIPController.ps1

Lastly, you MUST update DNS so that the original CNAME is still used. This script can help: it detaches the original internal Azure CNAME from the old public IP and moves it to the new public IP attached to the new load balancer.




param (
        [Parameter(Mandatory = $true)]
        [string]
        $subscriptionName,

        [Parameter(Mandatory = $true)]
        [string]
        $oldLoadBalancerName,

        [Parameter(Mandatory = $true)]
        [string]
        $resourceGroupName,

        [Parameter(Mandatory = $true)]
        [string]
        $oldPublicIpName,

        [Parameter(Mandatory = $true)]
        [string]
        $newPublicIpName
)

    Install-Module AzureRM.Network -Force
    Import-Module AzureRM.Network -Force

    $ErrorActionPreference = "Stop"
    Login-AzureRmAccount -SubscriptionName $subscriptionName

    Write-Host "Are you sure you want to do this? There will be brief connectivity downtime."
    Pause

    $oldprimaryPublicIP = Get-AzureRmPublicIpAddress -Name $oldPublicIpName -ResourceGroupName $resourceGroupName
    $primaryDNSName = $oldprimaryPublicIP.DnsSettings.DomainNameLabel
    $primaryDNSFqdn = $oldprimaryPublicIP.DnsSettings.Fqdn
    
    # Both the DNS label and the FQDN must be present on the old public IP
    if($primaryDNSName.Length -gt 0 -and $primaryDNSFqdn.Length -gt 0) {
        Write-Host "Found the Primary DNS Name" $primaryDNSName
        Write-Host "Found the Primary DNS FQDN" $primaryDNSFqdn
    } else {
        Write-Error "Could not find the DNS attached to Old IP $oldPublicIpName"
        Exit
    }

    Write-Host "Moving the Azure DNS Names to the new Public IP"
    $PublicIP = Get-AzureRmPublicIpAddress -Name $newPublicIpName -ResourceGroupName $resourceGroupName
    $PublicIP.DnsSettings.DomainNameLabel = $primaryDNSName
    $PublicIP.DnsSettings.Fqdn = $primaryDNSFqdn
    Set-AzureRmPublicIpAddress -PublicIpAddress $PublicIP

    Get-AzureRmPublicIpAddress -Name $newPublicIpName -ResourceGroupName $resourceGroupName
    Write-Host "Transfer Done"

    Write-Host "Removing Load Balancer related to old Primary NodeType."
    Write-Host "Are you sure?"
    Pause

    Remove-AzureRmLoadBalancer -Name $oldLoadBalancerName -ResourceGroupName $resourceGroupName -Force
    Remove-AzureRmPublicIpAddress -Name $oldPublicIpName -ResourceGroupName $resourceGroupName -Force

    Write-Host "Done"

Summary

In this article you followed the process to:

  • Configure ARM to add a new VMSS with a new Operating System and Data Disk
  • Add a new Virtual Machine Scale Set to an existing Service Fabric Node Type
  • Run a PowerShell controller script to monitor the outcome of the VMSS transfer
  • Transfer the original management DNS CNAME to the new Public IP Address

Conclusion

This project requires a lot of testing in your environment. Allocate at least a few days to test the entire process before you try it out on your production services.

HTH
