Recently I had a customer hit an issue that was hard to resolve... until we stopped looking at the data and reconsidered our design.
What We Had
- Virtual DC (Hyper-V guest running in Azure, but the location and virtualization don’t matter!)
- C$ OS, E$ Sysvol/NTDS
- BitLocker enabled for all drives
The customer installed a relatively small and innocent piece of software, rebooted, and then we entered a BSOD loop, which was hard to see since it was on a guest in Azure. After working through the night on the Azure aspect and running out of ideas, we asked an AD guru to take a look. Within minutes he had figured it out!
So what was going on?
- When the VM boots up, it first tries to unlock the OS drive. The OS disk key is stored in AD (accessible on another DC) or in a third-party tool, like CloudLink.
- Once the OS drive is unlocked, BitLocker frantically tries to unlock any data drives...
- Meanwhile, AD services are starting up (they are among the first services to start), but they can’t get to the AD database (it’s sitting on that locked E$...).
- AD determines it can’t get to its database, crashes, throws the BSOD (with a nice pause so you can frantically try to write down the error message), and then reboots the server.
- The cycle repeats...
Basically our DC is so secure, even we can’t use it! Luckily a DC is easily rebuilt and we all know better than to only have a single DC in the domain, right?
If you want to BitLocker your DCs, put all those critical DC bits on the C$!
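As a rough illustration of that advice, here is a minimal Python sketch that flags AD-critical paths living on a volume other than the OS drive. The function name and the path layout are hypothetical and simply mirror the C$/E$ setup described above; this does not query a real DC.

```python
from pathlib import PureWindowsPath


def paths_off_os_volume(os_drive, ad_paths):
    """Return the AD-critical paths that do not live on the OS volume."""
    os_drive = os_drive.rstrip("\\").upper()
    return {
        name: path
        for name, path in ad_paths.items()
        if PureWindowsPath(path).drive.upper() != os_drive
    }


# Hypothetical layout matching the post: OS on C:, Sysvol/NTDS on E:
layout = {
    "NTDS database": r"E:\NTDS\ntds.dit",
    "NTDS logs": r"E:\NTDS",
    "SYSVOL": r"E:\SYSVOL",
}

risky = paths_off_os_volume("C:", layout)
for name, path in risky.items():
    print(f"{name} is on a separate data volume: {path}")
```

Anything this reports would still be locked while AD services are starting, which is the condition that caused the reboot loop.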
A big thanks to John Bay, our AD guru!
*previously posted at https://blogs.msdn.microsoft.com/nicole_welch/2016/01/bitlocker-and-domain-controller-logical-disks/
*previously posted at https://blogs.msdn.microsoft.com/nicole_welch/2016/04/moving-azure-provider-images-from-the-commercial-to-azure-government-cloud-mag/
Sometimes we see customers move Microsoft-provided images from the Azure commercial cloud to the Azure Government (MAG) cloud. While this is technically supported, there are several things to consider. Microsoft-provided images carry configuration settings that are specific to the cloud environment. When you move the VM (by moving the VHD), you risk having the wrong settings for your new cloud location.
Below is a list of settings that need to be changed. This is NOT comprehensive and will be updated as needed. Keep in mind that the various endpoints are different as well (https://azure.microsoft.com/en-us/documentation/articles/azure-government-developer-guide/)
When you have a problematic IaaS VM (won’t start, won’t stop, can’t RDP even though it worked just yesterday...) and you’ve exhausted your usual tricks, turn to the Four R’s (the “did you reboot?” of Azure).
- Restart – Most users will think of this; just be sure you restart (and I mean a stop and start, not the restart button) from Azure (portal or PS) and not from within the VM. The Azure restart gives the fabric a chance to look for issues and self-heal. Note: If you have classic VMs (old portal), this also applies to your cloud service. I’ve seen VMs acting up due to issues with their cloud service. You can try restarting the cloud service... but keep in mind that all VMs hosted by the cloud service will be restarted as well!
- Resize – Resizing (esp. if you increase the VM size to the largest possible) recreates certain elements of the VM <-> Fabric connection and could even move you to a new cluster node on the backend. Resize, test to confirm it’s working, and then size back to your original size. You will get charged at the higher VM size rate, but if it’s only for 15 min that’s a minimal cost. *Note: if you’re in ARM (VMs deployed via the new portal, not a classic VM) you can directly redeploy to a new cluster node using Azure PowerShell: https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-linux-redeploy-to-new-node/ (step 4 below)
- Recreate – This one sounds scary, but if you make a note of your current configuration FIRST, the recreation is quick (generally <20 min) and relatively painless. If you need to move your VM to a new cloud service, VNet, etc., or are having those “it’s just acting up” issues, this is a good troubleshooting step to try. You are basically removing the Azure components and then recreating them (modifying if needed), all while leaving your disks untouched. See https://www.petri.com/recreate-virtual-machine-in-microsoft-azure for step-by-step instructions. *Note: The link is specific to ASM (classic portal), but the premise works for both classic and ARM VMs.
- Redeploy – Available in the new portal only, for ARM VMs (not classic). Effectively the same as a resize, except that you are guaranteed to change nodes. https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-linux-redeploy-to-new-node/
If these all fail, and you already confirmed there are no outages that could impact you (https://azure.microsoft.com/en-us/status/), it may be time to engage Microsoft.
See also https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-windows-allocation-failure/
*previously posted at https://blogs.msdn.microsoft.com/nicole_welch/2016/05/azure-iaas-vms-and-the-three-rs/
If you forgot your VM’s logon ID/PW, you have a few options to get back in. Note: Domain Controllers don’t have local users, so these tricks won’t work with them.
- If this is a brand-new VM (never logged on or configured), it’s generally fastest to delete and recreate it.
- If the VM extension is installed, you can reset the password via PowerShell.
- For classic VMs, you can reset your password via the new portal – https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-windows-reset-rdp/#windows-vms-in-the-classic-deployment-model *does not work in MAG
- There is a complicated method (mount the C$ to another VM as a data disk and update a gpt.ini) that will also work. A peer of mine will be blogging about this process in the future.
*MAG = Microsoft Azure for Government
*previously posted at https://blogs.msdn.microsoft.com/nicole_welch/2016/05/azure-iaas-cant-logon-wrong-idpw/
When customers move into the cloud, they tend to mimic their setup on-prem. Not a bad thing, but when it comes to blocking internet access for servers this can create some unusual problems.
If you are using network security groups (NSGs), user-defined routing (UDR), or forced tunneling, be sure to put in an exception for your Azure data center IP ranges, as lack of connectivity will impact many services, including these:
- VM Extensions – https://blogs.msdn.microsoft.com/mast/2016/04/27/vm-stuck-in-updating-when-nsg-rule-restricts-outbound-internet-connectivity/
- Azure Backup – https://azure.microsoft.com/en-us/documentation/articles/backup-azure-vms-prepare/#network-connectivity
- Monitoring Agent/Extension – https://docs.microsoft.com/en-us/azure/log-analytics/log-analytics-proxy-firewall#configure-settings-with-the-microsoft-monitoring-agent
- KMS – https://docs.microsoft.com/en-us/azure/virtual-machines/troubleshooting/custom-routes-enable-kms-activation
Update 16 Aug 2018 – The use of service endpoints will limit the damage of blocking internet access. Ensure all services you use/require are covered by service endpoints before blocking internet access. https://docs.microsoft.com/en-us/azure/virtual-network/virtual-network-service-endpoints-overview
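Before locking down outbound access, it can help to check which destinations your servers need would fall outside the ranges you plan to allow. Below is a minimal Python sketch using the standard `ipaddress` module. The function name is hypothetical and the addresses are documentation-only placeholders, not real Azure ranges; substitute the published data center ranges for your cloud.

```python
import ipaddress


def uncovered_destinations(destinations, allowed_ranges):
    """Return the destination IPs that no allowed CIDR range covers."""
    networks = [ipaddress.ip_network(cidr) for cidr in allowed_ranges]
    return [
        dest
        for dest in destinations
        if not any(ipaddress.ip_address(dest) in net for net in networks)
    ]


# Placeholder values for illustration only (RFC 5737 documentation ranges).
allowed = ["203.0.113.0/24", "198.51.100.0/25"]
needed = ["203.0.113.10", "198.51.100.200", "192.0.2.5"]

for dest in uncovered_destinations(needed, allowed):
    print(f"{dest} would be blocked by the planned exceptions")
```

An empty result means every listed destination is covered by at least one exception.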
*previously posted at https://blogs.msdn.microsoft.com/nicole_welch/2016/08/azure-vms-need-internet-access/
When using ASR to replicate VMware or physical machines into Azure, two roles are required, the configuration server and the process server (often combined on a single server), to help coordinate and facilitate the data replication (https://docs.microsoft.com/en-us/azure/site-recovery/site-recovery-vmware-to-azure#run-site-recovery-unified-setup). On the configuration server, configuration data is stored in a MySQL database. *At this time MySQL is a requirement; other database types are not supported.
There are several scenarios when you may need to verify or modify data stored in this database. Below are samples for your reference.
Note – database modifications will impact ASR and should be done with care
Log in to MySQL and Connect to the ASR Database
From a command prompt:
mysql -u root -p (you will be prompted to enter the password specified during installation)
show databases; (will list all databases for your reference)
use svsdb1; (selects the ASR database so future queries will run against it)
To list all machines registered with the configuration server (CS)
select id as hostid, name, ipaddress, ostype as operatingsystem, from_unixtime(lasthostupdatetime) as heartbeat from hosts where name!='InMageProfiler'\G
To Cleanup Duplicate/Stale Entries
To Update the IP of a Machine
update hosts set ipaddress='[new address]' where ipaddress='[old address]';
Example: update hosts set ipaddress='192.168.0.4' where ipaddress='184.108.40.206';
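Since a typo in either address will silently update the wrong rows (or none), one option is to validate both addresses before running the statement. The sketch below is a hypothetical Python helper that only builds the statement and its parameters; it does not connect to the database. The `%s` placeholder style is the one commonly used by Python MySQL drivers, and the addresses shown are made up for illustration.

```python
import ipaddress

# Same statement as above, with driver-style placeholders instead of literals.
UPDATE_SQL = "update hosts set ipaddress=%s where ipaddress=%s;"


def build_ip_update(old_address, new_address):
    """Validate both addresses and return (sql, params) for the update."""
    old_ip = ipaddress.ip_address(old_address)  # raises ValueError if malformed
    new_ip = ipaddress.ip_address(new_address)
    if old_ip == new_ip:
        raise ValueError("old and new address are identical")
    return UPDATE_SQL, (str(new_ip), str(old_ip))


# Hypothetical addresses for illustration.
sql, params = build_ip_update("10.0.0.5", "192.168.0.4")
print(sql, params)
```

The returned pair can be handed to a driver's `cursor.execute(sql, params)` call, which keeps the quoting out of your hands.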
*previously posted at https://blogs.msdn.microsoft.com/nicole_welch/2017/02/connecting-to-mysql-for-azure-site-recovery-asr/
*Updated June 15, 2018*
For a myriad of reasons it’s nice to know what IPs you can expect to see coming to/from your Azure space. Below is a quick cheat sheet.
Microsoft Azure Datacenter IP Ranges
updated 20 Aug 2018, thanks to Michael Ketchum of Microsoft for the additional information
The XML file now breaks down the IP ranges as follows:
- “<SERVICE>” = Includes all IPs for that service across all regions in the applicable cloud
- “<SERVICE>.<REGION>” = Includes all IPs for that service in a specific region
- “AzureCloud.<REGION>” = Includes all IPs and services for that region
- “AzureCloud” = Includes all IPs and services for that cloud, such as Gov, commercial, etc.
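The naming scheme above can be sketched as a small Python helper that splits an entry name into its service and optional region and describes the scope it covers. The function names and the sample entries are illustrative assumptions; the sketch does not parse the actual XML file.

```python
def parse_entry(name):
    """Split '<SERVICE>' or '<SERVICE>.<REGION>' into (service, region)."""
    service, _, region = name.partition(".")
    return service, (region or None)


def describe_scope(name):
    """Describe which IPs an entry covers, following the list above."""
    service, region = parse_entry(name)
    what = "all IPs and services" if service == "AzureCloud" else f"all {service} IPs"
    where = f"in region {region}" if region else "across the whole cloud"
    return f"{what} {where}"


# Illustrative entry names only.
for entry in ["AzureCloud", "AzureCloud.westus", "Storage", "Storage.westus"]:
    print(f"{entry}: {describe_scope(entry)}")
```

This makes it easy to pick the narrowest entry that still covers what you need, rather than opening the whole cloud.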
The “Secret Azure IPs” you MUST Include – 168.63.129.16 and 169.254.169.254
Office 365 URLs and IP address ranges
Office 365 US Government: Endpoints for US Federal and US Defense Clouds (preview)
*previously posted at https://blogs.msdn.microsoft.com/nicole_welch/2017/02/azure-ip-ranges/