Matt on ... Whatever

Wednesday, August 11, 2021

Customizing MacOS guest VMs in Parallels 17 on Apple Silicon

Those of us that need to test and package software for MacOS on Apple Silicon (aka M1) have spent the past many months bemoaning the lack of virtualization options for MacOS on Apple's new flagship hardware platform. Modern build environments lean heavily on VM images (and/or containers, where available) to ensure safety, isolation, and repeatability. Unfortunately, since the boot process for Apple Silicon MacOS uses a bootloader borrowed from iOS, the typical EFI-based bootloader that's used to boot Intel Mac images won't work. Since its release, Apple Silicon has not offered any virtualization support for MacOS itself, which also explains the lack of things like GHA/AZP build workers for M1/Apple Silicon. So when Apple quietly announced that the upcoming MacOS Monterey offered MacOS guest support, there was much rejoicing.

After installing Monterey Beta4 on my M1 Mac Mini, I spent a long evening playing around with the new Parallels 17 support for MacOS guests. It's definitely *extremely* early days- most of the support for common Parallels features isn't wired up for MacOS guests, as they use a completely different set of disk images and tools that are a very thin wrapper around Apple's new Virtualization framework. Most of Parallels' great automation and command-line tooling is currently completely unaware of the new MacOS guests on Apple Silicon. If you're going through the UI "front door", it doesn't appear possible to customize the VM in any way (even its name in Control Center; as of this writing, creating multiple Mac guests names them all "macOS"). The bigger issue that I set out to solve is the inability to customize the default disk image size of 30GB to make the VM useful for simple development tasks- the default size is too small to even install the Xcode command-line tools. While none of the usual Parallels tools or APIs appear capable of customizing an M1 MacOS guest or its images, a bit of poking revealed a couple of command-line tools buried in the Parallels 17 package that will allow some basic customization of new VMs using undocumented args.

The Parallels tool that wraps the Virtualization framework APIs for creating a new VM image from an Apple IPSW archive can be found at:

/Applications/Parallels\ Desktop.app/Contents/MacOS/prl_macvm_create

It has a couple of modes; calling it with `--getipswurl` will try to find a working download link for a compatible IPSW package to use to seed the new image, though I prefer to just use the list maintained by MrMacintosh. Regardless where it comes from, downloading an IPSW image is the first step to creating a new VM. When you're using the Parallels "New" button, most of the time is spent on the IPSW download, so if you want to make a lot of VMs, downloading and reusing the IPSW for each VM will save a lot of time and bandwidth (as they're ~13GB each).

Once you have an IPSW image locally, run prl_macvm_create with the path to the IPSW, and the path where you want the VM image to live (the default is under '~/Parallels/macOS 12.macvm'). This is also your chance to increase the default disk size by adding `--disksize` and the desired disk image size (in bytes). If you omit this arg, your VM image will be created with a tiny 30GB disk that can't do much more than run the OS itself and allow for some small software installations. If you're planning to install Xcode, I'd recommend at least 60GB.

Here's an invocation that uses a local copy of a Monterey IPSW image to create a new VM at ~/Parallels/devmac1.macvm with a 60GB disk:

/Applications/Parallels\ Desktop.app/Contents/MacOS/prl_macvm_create ~/Downloads/UniversalMac_12.0_21A5294g_Restore.ipsw ~/Parallels/devmac1.macvm --disksize 60000000000
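If you'd rather not count zeros, the byte value can be computed. This is only a sketch: the variable names (disk_gb, ipsw, vm_path, tool) are mine, it assumes decimal gigabytes (which matches the 60000000000 value above), and it prints the command rather than running it.

```shell
# Assumed names for illustration; only the tool path and --disksize come from the post.
disk_gb=60
disk_bytes=$(( disk_gb * 1000 * 1000 * 1000 ))   # decimal GB, matching the value above

ipsw="$HOME/Downloads/UniversalMac_12.0_21A5294g_Restore.ipsw"
vm_path="$HOME/Parallels/devmac1.macvm"
tool="/Applications/Parallels Desktop.app/Contents/MacOS/prl_macvm_create"

# Dry run: print the command so it can be inspected before running it by hand.
printf '"%s" "%s" "%s" --disksize %s\n' "$tool" "$ipsw" "$vm_path" "$disk_bytes"
```

Changing disk_gb is then the only edit needed to size a different image.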

Assuming all's well, it should create a few image and config files under the path you specified, followed by `Starting installation.` and some progress messages. This process usually completes in a couple of minutes.

Once you're greeted with `Installation succeeded.`, your VM should be ready for its first boot. You can use "Open" in the Parallels Control Center to do this (any directory ending with the `.macvm` extension should be visible there) if you want it to behave as if you'd created it in Parallels, or if you want to run the VM directly from the command-line (which has some advantages), you can use

/Applications/Parallels\ Desktop.app/Contents/MacOS/Parallels\ Mac\ VM.app/Contents/MacOS/prl_macvm_app

With no args, this will run the default VM at '~/Parallels/macOS 12.macvm', or you can pass the `--openvm` argument to run any VM you wish.

Here's an invocation that runs the VM I created above:

/Applications/Parallels\ Desktop.app/Contents/MacOS/Parallels\ Mac\ VM.app/Contents/MacOS/prl_macvm_app --openvm ~/Parallels/devmac1.macvm

This runs the VM inside the launched process, so Ctrl-C'ing the command or otherwise killing that process stops the VM. This is definitely a feature in my book for ephemeral VMs; it makes it pretty trivial to manage the running worker VMs in a CI environment by just starting the VM process and hanging onto its handle, signaling/killing it when you're done.
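That start/hold/kill pattern can be sketched in a few lines of shell. I'm using sleep as a stand-in for the prl_macvm_app command so the snippet runs anywhere; the variable names are made up, and which signal the real process prefers is an assumption I haven't characterized beyond "killing it stops the VM".

```shell
# Stand-in for: prl_macvm_app --openvm <path to .macvm>
sleep 60 &
vm_pid=$!                      # the handle we hang onto

# ... run the CI job against the VM here ...

kill -TERM "$vm_pid"           # signal the process when the job is done
vm_status=0
wait "$vm_pid" 2>/dev/null || vm_status=$?   # reap it and record how it exited
echo "VM process ${vm_pid} exited with status ${vm_status}"
```

Because the VM lives and dies with that one process, there's no separate "stop" or "unregister" step for the CI job to forget.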

Side note: there are several OSS projects (eg, KhaosT's MacVM) that also wrap the calls to the Virtualization framework to create and run new MacOS guests under Monterey. The big win for Parallels right now is its single command to create a new image. The open source image builders that I've seen will call into the virt framework to create a blank VM running in DFU recovery mode, but then require you to use Apple Configurator 2 to load the IPSW into the VM yourself. It works fine, but it's definitely less convenient for automation than what Parallels has rolled up into a one-stop package, and I assume much of the rest of the Parallels value add from their excellent automation will come with time.

One thing we need right away is cheap throwaway VM clones; fully realized 60GB disk images are very expensive to copy around, run for a couple minutes, then delete and repeat. Thankfully, APFS' copy-on-write cloned files (created by cp -c on an APFS-formatted filesystem) fit the bill perfectly with the disk image files that Apple's Virtualization.framework uses. Once you've configured a VM image to include whatever tools and startup behavior you want, simply shut down the VM, and copy the entire VM directory as many times as you'd like for (basically) free, eg:

cp -c -r ~/Parallels/devmac1.macvm ~/Parallels/cloned_ephemeral_mac.macvm

The filesystem will only record changed blocks in the copied files once the VM boots up and starts doing writes. Once the clone directory is deleted, so are all the filesystem changes made under it.

Virtualization support is still quite early in the Apple Silicon ecosystem, but now we've got at least the very basic tools to do what's needed. Thanks Apple, Parallels, and all the OSS folks out there taking this stuff apart!

Another random side note: just for giggles, I tried creating a VM with a Big Sur 11.5.1 IPSW (nice to build against a released + supported OS), but it says "prl_macvm_create[10407:185738] No supported Mac configuration." - I assume there's some extra magic in the package required to allow it to be virtualized, so at least for now, it looks like Monterey+ is the only option for guest VMs.

Wednesday, March 18, 2020

Why no Ansible controller for Windows?

As Ansible's first full-time hire working with Windows back in 2015, I often get the question "Why can't I run an Ansible controller on Windows?". It's a really good question, and one that I've spent a lot of time thinking about (and advocating for, and prototyping) over the years. There have been statements in our docs and by our core devs in the community that basically amount to "not gonna happen, don't ask", but I think we're overdue for a deeper dive into the challenges. Rather than sprinkling that discussion over a bunch of Github issues and IRC, I'll try to cover the big stuff all at once in this post.

TL;DR

There are a lot of UNIX-isms deeply baked into most of Ansible that prevent it from working on native Windows at all, and even if we solved every one of them, the likelihood of real-world playbooks executing with 100% fidelity between a *nix controller and a Windows controller is almost zero. If you want to run an Ansible controller on Windows anytime soon, use WSL.

Okay, if you're still with me, I'll assume you're looking for more detail. I've actually done two internal prototypes of a Windows Ansible controller just to see what broke, and how hard it'd be to address. I'll describe the largest issues, and ways they could potentially be solved. This list is by no means exhaustive, but should hopefully illustrate that the overall effort is a non-trivial problem, as is "fixing" it without potentially breaking a lot of other things in Ansible.

Worker Process Model

Ansible's controller worker model (as of 2.10) makes heavy use of the POSIX fork() syscall, which effectively clones the controller process for each task+host as a worker process, executes the host-specific action/connection/module code in the cloned worker, marshals the results of the task to the controller, and exits. This is a tried-and-true mechanism for concurrent execution that works effectively the same on all UNIX-flavored hosts, and especially with Python, often yields much better performance than with threads (for reasons I won't go into here). So what's the problem? Windows doesn't have fork(). This means that the entire worker execution subsystem (including connections, actions, modules) is 100% non-functional on Windows as currently implemented. POSIX-compatibility projects like Cygwin have attempted to implement fork() for Windows, but even after years of really smart people working on it, they admit that sometimes it just breaks, which implies that it shouldn't be relied on for anything important. WSL takes care of this problem in its new process model by implementing a proper fork(), but that's not Windows-native either (and TMK can only be used by WSL Linux processes).
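For a feel of what fork isolation buys, here's a small shell analogy (not Ansible's actual code). On UNIX-like systems a backgrounded subshell is a forked copy of the parent shell, so it inherits the parent's state but can't change it, and only what it explicitly writes out comes back. The host names, variable names, and results file are all made up for the demo.

```shell
controller_state="pristine"
results=$(mktemp)

for host in web1 web2 db1; do
  (
    # Runs in a forked copy of the shell: it starts with the controller's state...
    controller_state="scribbled on by the worker for ${host}"
    # ...and hands back only what it explicitly marshals out.
    echo "${host}: ok" >> "$results"
  ) &
done
wait    # the controller collects its workers

sort "$results"
echo "controller_state is still: ${controller_state}"
```

Each worker's scribbling vanishes when it exits, which is exactly the isolation a process without fork() has to get some other way.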

So why not have threaded workers as an option? Significant effort has been expended to prototype threaded workers in Ansible, but without pretty major changes to the various plugin APIs to optimize their behavior for Python's well-documented limitations around threaded execution, acceptable performance and scaling cannot be achieved. The other issue with threaded workers in the main controller (or any shared/reusable worker process model) is that most plugins (including 3rd party plugins not shipped by Ansible) were written to assume that the worker process is both isolated from the controller and ephemeral. Side effects of things commonly done by plugins that are completely innocuous when they're fork-isolated could range from "annoying and weird" to "fatal" when they're happening concurrently in a shared process. This is an area where we've got a lot of ideas to improve the model in the future (and most of them would be Windows-friendly), but doing so while preserving backward-compatibility with existing user-developed plugins will take a great deal of effort.

UNIX-isms in Core Plugins and Executor

Once the process model problem is solved, the next issue is that much of the Ansible Python module subsystem, core modules, and other parts of the execution engine assume they're running in a POSIX-y environment. Things like POSIX file modes, shebangs, hardcoded forward-slashes on generated paths, assumed presence of POSIX command-line tools/syscalls/TTYs, to name just a few... Many of these items can be addressed with Windows-specific code paths, but it's not a simple task. There are also exceptions that are effectively unresolvable- things like file mode args on common core modules like file, template, and copy. What should a Windows host do when asked to set a UNIX file mode on a Windows filesystem? This is why there are Windows-specific versions of those modules. We try to keep the common arguments and behavior consistent (within reason), but having separate implementations allows us to use the native management technology for the platform (eg Powershell/.NET on Windows, Python on POSIX), and to let the module UIs differ in platform-specific ways where it makes sense. For historical reasons, there are some places where the action/module names are the same for Windows and POSIX, but due to the numerous problems it's caused, we've tried to minimize that as a policy for new work. So effectively, this means that many core POSIX modules could never be fully functional on Windows- it'd be necessary to use the Windows equivalents.

We sometimes get asked why we don't accept pull requests that fix some of these things, along with a policy to reject future changes that regress the fixes... The main reason is that, without comprehensive sanity/unit/integration tests in CI to ensure no regressions, these kinds of changes rot quickly. Policy without enforcement is not really policy, especially on a project as large as Ansible with as many folks able to approve and merge pull requests as we have. We learned long ago that manual code reviews looking for XYZ policy violations will always let things slip through, so sweeping policy-based changes throughout the codebase must be enforced in an automated fashion. Until the code reaches a point where integration tests could actually test and enforce Windows-specific non-regression tests on Windows, sweeping changes in the name of future Windows support probably aren't going to be accepted.

Content Execution Parity

Let's say we've eliminated or built Windows equivalents to all the UNIX-isms in the core codebase, and that everything is working. Huzzah! Now we should be able to run all the Ansible content out there on our shiny new Windows-native Ansible controller, and the world is all rainbows and unicorns! Right? Sorry, nope. Even though we've gotten rid of the UNIX-isms in the code, that doesn't address UNIX-isms that exist in the Ansible content that the world runs on. The most obvious issues are around POSIX-flavored plays that use localhost or the local connection plugin, since it's necessary to use platform-specific versions of the modules to deal with things like paths and file modes. But that's not the only issue; content with commonly-used features like the pipe and template filters, glob lookups, become methods like su and sudo, to name a few, will never be able to execute the same way in a native Windows environment. If you're an all-Windows shop, or your Ansible content is developed specifically to run on a Windows controller, maybe that's all fine, but without a lot of guardrails to inform when something unsupported or unworkable is happening, it's a recipe for frustration for the folks that are just trying to automate all the things.

Honestly, I don't think there's a realistic comprehensive solution to this one. The best we could probably do is to tell you when you're trying to do something that's unsupported on your controller platform, so at least it's obvious that some conditional behavior is necessary in your content if you want to support running on both Windows and POSIX controllers. Maybe part of the solution is also that the implicit localhost for Windows doesn't exist at all, or is called something else (so we won't even try to run POSIX stuff on Windows or vice-versa). That eliminates the need to make most of the POSIX/Python modules (and related subsystem) work on Windows at all, while still allowing the language and controller to work there. Remember: this is only about the behavior of localhost and the local connection plugin- for the majority of tasks where Ansible is managing remote targets of any type, execution parity should be achievable.

Things That Give Me Hope

None of these things are insurmountable. But they're also not going to happen the right way without some serious investment. Red Hat is clearly not afraid to invest in Windows where it makes sense; look to the existing Windows target efforts in Ansible, official support for Windows OpenJDK, Windows containers on Openshift, to name a few. Ansible has historically been an easy sell to all-Linux and mixed Linux/Windows groups, but without native Windows controller support, most all-Windows groups tend to stop the conversation pretty early. If you fall into this latter camp, be sure to let your Red Hat salesperson know how many Ansible nodes they're missing out on because we don't support this configuration today.

All that said, there's never been a better time to run Ansible controllers on Windows. Ansible works great on WSL and WSL2, and is pretty darn seamless. While that configuration is not capital-S-supported by Red Hat, most of the minor issues we've encountered have been easily addressed. We still tell people to avoid using Ansible under Cygwin, as the previously-mentioned fork unreliability will eventually cause things to break.

As we work on the future of Ansible, we're trying to make sure we eliminate barriers to native Windows controllers, and don't erect any new ones. I'd love to someday announce first-class native Windows Ansible controller support. But it's not something that's going to come easily or quickly.

Thursday, August 23, 2018

Testing and modifying GitHub PRs without extra remotes, branches, or clones

As a popular open-source project, Ansible sees dozens of pull requests (PRs) each day from numerous members of our awesome community. Our CI system puts each one of those PRs through its paces on a litany of hosts and containers, but sometimes that's not enough. During the process of reviewing a PR, we may need to run it locally on a specialized test system, and sometimes we'll need to submit changes to it that should also be run through the CI gantlet before being merged. GitHub made this process a lot easier with the ability to commit changes to PR branches on forks, but most of the official documentation of the process either requires a whole new clone of the remote repo, or adding remotes or branches to your local repo. That's a lot of extra unnecessary work for ephemeral branches and forks I don't want to keep around.

It's possible to locally pull down and test a PR, as well as push changes back to the original fork/branch, without messing with any local clones/remotes/branches. This relies on a couple of oft-misunderstood git features: detached HEADs, and the FETCH_HEAD ref. Basically, the process involves fetching the PR branch directly from the remote fork via its URL, then checking out the resultant FETCH_HEAD ref as a detached HEAD (so we don't have to create a local branch either). At that point, we have exactly the commits as they exist on the PR's source branch. This is important, because if we were to use a rebased tree, we could no longer just add commits to the original PR branch. With the original commits, we can make modifications, test things, whatever. Any commits we make are added to the detached HEAD, which we can then push directly back to the PR fork's branch (again by URL), and GitHub will add the new commits to the end of the branch, just as if the original submitter had pushed new commits. All CI and checks on the PR will be triggered as usual, code reviews and comments can happen, etc. We're still taking full advantage of GitHub's PR feature set (instead of direct-merging the changes back to the main Ansible repo and bypassing all the rest of GitHub and Ansible's pre-merge infrastructure).

So let's get to it already!

Let's assume you have an Ansible clone lying around in ~/projects/ansible, and that it's your current working directory...

Before we can fetch a PR branch, we'll first need to know the source fetch URL and branch. As of this writing, when viewing a PR, it can most easily be found just below the PR title, and looks like "(user) wants to merge (commits) into (target_fork:target_branch) from (source_user:source_branch)". That last part is what we need: the username or org where the source fork lives, and the source branch name it's coming from.

The source fork fetch URL should be "https://github.com/(user-or-org)/(repo-name).git" to fetch over HTTPS, or "git@github.com:(user-or-org)/(repo-name).git" for SSH. So if the submitter's username is "bob", the project repo is called "ansible", and the source branch name for the PR is "fix_frob", the HTTPS fetch URL would be "https://github.com/bob/ansible.git", and the SSH version would be "git@github.com:bob/ansible.git".
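If you do this a lot, the URL pattern is easy to wrap in a tiny helper. The function name pr_fetch_url and its argument order are my own invention; the two URL shapes are the ones described above.

```shell
# Hypothetical helper: pr_fetch_url <user-or-org> <repo-name> [https|ssh]
pr_fetch_url() {
  local owner=$1 repo=$2 proto=${3:-https}
  case "$proto" in
    https) printf 'https://github.com/%s/%s.git\n' "$owner" "$repo" ;;
    ssh)   printf 'git@github.com:%s/%s.git\n' "$owner" "$repo" ;;
    *)     echo "unknown protocol: $proto" >&2; return 1 ;;
  esac
}

pr_fetch_url bob ansible        # HTTPS form
pr_fetch_url bob ansible ssh    # SSH form
```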

With these two pieces of information, we can now fetch the PR branch with a command like:

git fetch (source fork fetch URL) (source branch)

For our hypothetical example:

git fetch https://github.com/bob/ansible.git fix_frob

We now have the necessary objects from the remote sitting locally in a temporary ref called FETCH_HEAD (which is used internally by git for all fetch operations). In order to do something useful with them, we need to check them out into a working copy:

git checkout FETCH_HEAD

This gives us the contents of the temporary FETCH_HEAD ref into what's called a "detached HEAD"- it behaves just like a branch checkout in every way, but there's no named branch "handle" for us to use to refer to it, which means there's nothing we need to worry about cleaning up when we're done!

At this point, we can do whatever operations we like, just as if it were a normal working copy or branch checkout. If it was just a local test, and there's nothing we need to push back to the source branch, the next checkout of any branch will zap the state, and there's nothing for us to clean up. If we want to keep it around for some reason, it's easy to convert a detached HEAD into a normal branch.

But maybe there's a small change you want to add to the PR- say the submitter forgot a changelog stub and we just want to get it merged without waiting. GitHub's UI will allow you to make a change to an existing file in a PR as a new commit, but you can't add new files through the UI. No worries- we can use a similar process to push new commits back to the original source branch!

Make whatever changes are necessary and commit them as normal (as many commits as needed)...

For our hypothetical example:

echo "bugfixes: tweaked the frobdingnager to only frob once" > changelogs/fragments/fix_frob.yaml
git add changelogs/fragments/fix_frob.yaml
git commit -m "added missing changelog fragment"

We could just push our changes up, but remember, we're talking about pushing commits to someone else's repo. It's a neighborly thing to do to verify that we've only included the changes we expect, and that the submitter hasn't added anything more. To do that, we'll use the same command we did originally to refresh the FETCH_HEAD ref with the current contents of the source branch (which are hopefully unchanged):

git fetch (source fork fetch URL) (source branch)

so for our example:

git fetch https://github.com/bob/ansible.git fix_frob

and then we'll diff our detached HEAD contents that we want to push against the just-updated FETCH_HEAD:

git diff FETCH_HEAD HEAD

which should show us only the changes from our new commits. If anything else shows up, we've either accidentally committed some unrelated stuff, or new stuff has shown up in the original source branch, and it needs to be reconciled before we push (an exercise left for the reader).
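One way to automate the "has the source branch moved?" half of that check is git merge-base --is-ancestor, which exits 0 only when its first commit is reachable from the second. The sketch below plays out both scenarios in a throwaway repo; the commit messages and variable names are invented, and saved commit hashes stand in for FETCH_HEAD.

```shell
set -eu
demo=$(mktemp -d)
cd "$demo"
git init -q .
g() { git -c user.name=demo -c user.email=demo@example.com "$@"; }

g commit -q --allow-empty -m "submitter: original PR commit"
fetched=$(git rev-parse HEAD)            # plays the role of FETCH_HEAD

g checkout -q --detach
g commit -q --allow-empty -m "reviewer: our extra commit"
ours=$(git rev-parse HEAD)

# Case 1: nothing changed upstream, so the fetched tip is an ancestor of our HEAD.
if git merge-base --is-ancestor "$fetched" "$ours"; then unchanged=safe; else unchanged=diverged; fi

# Case 2: the submitter pushed another commit in the meantime.
g checkout -q --detach "$fetched"
g commit -q --allow-empty -m "submitter: late commit"
moved=$(git rev-parse HEAD)
if git merge-base --is-ancestor "$moved" "$ours"; then changed=safe; else changed=diverged; fi

echo "unchanged source branch: $unchanged"
echo "moved source branch:     $changed"
```

In real use the check would be git merge-base --is-ancestor FETCH_HEAD HEAD right after the second fetch; the diff is still worth eyeballing, since ancestry says nothing about what our own commits contain.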

Assuming all's well and we're ready to push, using the same source repo URL and branch we figured out above, push the changes back to the source repo with a command like:

git push (source fork fetch URL) HEAD:(source branch name)

For our example:

git push https://github.com/bob/ansible.git HEAD:fix_frob

If all's well, you should be prompted for credentials, then the new commits will be pushed. At this point, you can check out any other branch/ref and work on as normal, or repeat this process for other PRs- no cleanup necessary!

If you see an error about "failed to push some refs", it usually means the PR owner has changed something on the source branch, and you'll need to reconcile before you push. Force-pushing is almost never the right thing to do- you may potentially overwrite other commits!
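To try the whole round trip without touching GitHub, here's a self-contained dry run against local throwaway repos. A bare repo in a temp directory stands in for bob's fork, and its filesystem path stands in for the fetch URL; every path, file name, and identity below is made up for the demo.

```shell
set -eu
sandbox=$(mktemp -d)
fork="$sandbox/bob-ansible.git"     # stand-in for https://github.com/bob/ansible.git
g() { git -c user.name=demo -c user.email=demo@example.com "$@"; }

# Play the submitter: publish one commit on a fix_frob branch of the "fork".
git init -q --bare "$fork"
git init -q "$sandbox/submitter"
cd "$sandbox/submitter"
echo "frob once" > frob.txt
g add frob.txt
g commit -q -m "fix the frob"
g push -q "$fork" HEAD:refs/heads/fix_frob

# Play the reviewer: fetch by URL, then check out FETCH_HEAD as a detached HEAD.
git init -q "$sandbox/reviewer"
cd "$sandbox/reviewer"
g fetch -q "$fork" fix_frob
g checkout -q FETCH_HEAD

# Add a commit on the detached HEAD.
echo "bugfixes: frob only once" > fix_frob.yaml
g add fix_frob.yaml
g commit -q -m "added missing changelog fragment"

# Refresh FETCH_HEAD, review what we're about to push, then push by URL.
g fetch -q "$fork" fix_frob
g diff --stat FETCH_HEAD HEAD
g push -q "$fork" HEAD:fix_frob

pushed_tip=$(git --git-dir="$fork" rev-parse refs/heads/fix_frob)
local_tip=$(git rev-parse HEAD)
echo "fork tip:  $pushed_tip"
echo "local tip: $local_tip"
```

The reviewer repo never gains a remote or a named branch along the way, which is the whole point.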

A few other notes:
* Support was later added for SSH push, which makes life much easier if you're using 2FA (of course you are, right?). Pushing over HTTPS with 2FA enabled requires jumping through some extra hoops... You'll have to use a personal access token as your password, since GitHub's 2FA flow doesn't work with command-line password authentication.
* Be very careful about merging or rebasing from other branches if you'll be pushing changes back. A rebase will prevent you from pushing altogether (without force-pushing, but don't do that), and a careless merge from your own target branch will add all the intermediate commits since the PR owner last rebased. At least for Ansible, that's a deal-breaker...

Testing and updating PRs without extra remotes, branches or clones using this process saves me a lot of hassle and cleanup- hope it's useful to you!

Thursday, September 3, 2015

Manage stock Windows AMIs with Ansible (part 2)

In part 1, we demonstrated the use of an AWS User Data script to set a known Administrator password, and configure WinRM on a stock Windows AMI. In part 2, we'll use this technique with Ansible to spin up Windows hosts from scratch and put them to work.

We'll assume that you've got Ansible configured properly for your AWS account (eg, boto installed, IAM credentials set up). See Ansible's AWS Guide if you need help getting this going. The examples in this post were tested against Ansible 2.0 (in alpha as of this writing), however, most of the content is applicable to Ansible 1.9. For simplicity, these samples also assume that you have a functional default VPC in your region (you should, unless you've deleted it). If you need help getting that configured, see Amazon's page on default VPCs.

We'll build up the files throughout the post, but a gist with complete file content is available at https://gist.github.com/nitzmahone/aaf4340ea8d87c7fa578.

First, we'll set up a basic inventory that includes localhost, and define a couple of groups. The hosts we create or connect with in AWS will be added dynamically to the inventory and those groups. Create a file called hosts in your current directory, with the following contents:

localhost ansible_connection=local

[win]

[win:vars]
ansible_connection=winrm
ansible_ssh_port=5986
ansible_ssh_user=Administrator
ansible_ssh_pass={{ win_initial_password }}

Note that we're using a variable in our inventory for the password- in conjunction with a vault, that keeps the password private. We'll set that up next. Create a vault file called secret.yml in the same directory with your inventory by running:

ansible-vault create secret.yml

Assign a strong password to the vault file when prompted, then put the following contents inside it when the editor pops up (note: the default vault editor is vim; ensure it's installed, or preface the command with EDITOR=(your editor of choice here) to use a different one):

win_initial_password: myTempPassword123!

Save and exit the editor to encrypt the vault file.

Next, we'll create a template of the User Data script we used in Part 1, so that the initial instance password can be set dynamically. Create a file called userdata.txt.j2 with the following content:

<powershell>
$admin = [adsi]("WinNT://./administrator, user")
$admin.PSBase.Invoke("SetPassword", "{{ win_initial_password }}")
Invoke-Expression ((New-Object System.Net.Webclient).DownloadString('https://raw.githubusercontent.com/ansible/ansible/devel/examples/scripts/ConfigureRemotingForAnsible.ps1'))
</powershell>

Note that we've replaced the hardcoded password from Part 1 with the variable win_initial_password (that's being set in our vault file).

Finally, we'll create the playbook that will set up our Windows machine. Create a file called win-aws.yml; we'll build our playbook inside.

Since our first play will be talking only to AWS (from our control machine), it only needs to target localhost, and we don't need to gather facts, so we can shut that off. We'll set a play-level var for the AWS region, and load the passwords from secret.yml. The first task looks up an Amazon-owned AMI named for the OS we want to run. The version number changes frequently, and old images are often retired, so we'll wildcard that part of the name, and sort descending so that the first image in the list should be the newest. Thankfully, Amazon pads these version numbers to two digits, so an ASCII sort works here. We want the module to fail if no images are found. Last, we'll register the output from the module to a var named found_amis, so we can refer to it later. Place the following content in win-aws.yml:

- name: infrastructure setup
  hosts: localhost
  gather_facts: no
  vars:
    target_aws_region: us-west-2
  vars_files:
    - secret.yml
  tasks:
  - name: find current Windows AMI in this region
    ec2_ami_find:
      region: "{{ target_aws_region }}"
      platform: windows
      virtualization_type: hvm
      owner: amazon
      name: Windows_Server-2012-R2_RTM-English-64Bit-Base-*
      no_result_action: fail
      sort: name
      sort_order: descending
    register: found_amis
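To see why the zero-padding matters for that descending name sort, here's a quick shell illustration. The names are shortened, made-up stand-ins for AMI names, and sort/head stand in for the module's sort options; it isn't how ec2_ami_find is implemented.

```shell
# "Newest" = first entry after a descending byte-wise (ASCII) sort.
newest() { LC_ALL=C sort -r | head -n 1; }

# Zero-padded fields: string order matches date order.
padded=$(printf '%s\n' Base-2015.08.12 Base-2015.10.01 Base-2015.09.09 | newest)

# Unpadded fields: "9" sorts after "10", so the wrong image wins.
unpadded=$(printf '%s\n' Base-2015.8.12 Base-2015.10.1 Base-2015.9.9 | newest)

echo "padded pick:   $padded"
echo "unpadded pick: $unpadded"
```

With padding the October image comes out on top; without it, the September one does.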

Next, we'll take the first found AMI result and set its ami_id value into a var called win_ami_id:

  - set_fact:
      win_ami_id: "{{ (found_amis.results | first).ami_id }}"

Before we can fire up our instance, we'll need to ensure that there's a security group we can use to access it (in the default VPC, in this case). The group allows inbound access on port 80 for the web app we'll set up later, port 5986 for WinRM over https, and port 3389 for RDP in case we need to log in and poke around interactively. Again, we'll register the output to a var called sg_out so we can get its ID:

  - name: ensure security group is present
    ec2_group:
      name: WinRM RDP
      description: Inbound WinRM and RDP
      region: "{{ target_aws_region }}"
      rules:
      - proto: tcp
        from_port: 80
        to_port: 80
        cidr_ip: 0.0.0.0/0
      - proto: tcp
        from_port: 5986
        to_port: 5986
        cidr_ip: 0.0.0.0/0
      - proto: tcp
        from_port: 3389
        to_port: 3389
        cidr_ip: 0.0.0.0/0
      rules_egress:
      - proto: -1
        cidr_ip: 0.0.0.0/0
    register: sg_out

Now that we know the image and security group IDs, we have everything we need to ensure that we have an instance in the default VPC:

  - name: ensure instances are running
    ec2:
      region: "{{ target_aws_region }}"
      image: "{{ win_ami_id }}"
      instance_type: t2.micro
      group_id: [ "{{ sg_out.group_id }}" ]
      wait: yes
      wait_timeout: 500
      exact_count: 1
      count_tag:
        Name: stock-win-ami-test
      instance_tags:
        Name: stock-win-ami-test
      user_data: "{{ lookup('template', 'userdata.txt.j2') }}"
    register: ec2_result

We're just passing through the target_aws_region var we set earlier, as well as the win_ami_id we looked up. From the sg_out variable that contains the output from the security group module, we pull out just the group_id value and pass that as the instance's security group. For our sample, we just want one instance to exist, so we ask for an exact_count of 1, which is enforced by the count_tag arg finding instances with the Name tag set to "stock-win-ami-test". Finally, we use an inline template render to substitute the password into our User Data script template and pass it directly to the user_data arg; that will cause our instance to set up WinRM and reset the admin password on initial bootup. Once again, we register the output to the ec2_result var, as we'll need it later to add the EC2 hosts to inventory.

Once this task has run, we need some way to ensure that the instances have booted, and that WinRM is answering (which can take some time). The easiest way is to use the wait_for action, against the WinRM port:

  - name: wait for WinRM to answer on all hosts
    wait_for:
      port: 5986
      host: "{{ item.public_ip }}"
      timeout: 300
    with_items: ec2_result.tagged_instances

This task will return immediately if the instance is already answering on the WinRM port, and if not, poll it for up to 300 seconds before giving up and failing. Our next step will consume the output from the ec2 task to add the host to our inventory dynamically:

  - name: add hosts to groups
    add_host:
      name: win-temp-{{ item.id }}
      ansible_ssh_host: "{{ item.public_ip }}"
      groups: win
    with_items: "{{ ec2_result.tagged_instances }}"

This task loops over all the instances that matched the tags we passed (whether they were created or pre-existing) and adds them to our in-memory inventory, placing them in the win group (that we defined statically in the inventory earlier). This allows us to use the group_vars we set on the win group with all the WinRM connection details, so the only values we have to supply are the host's name and its IP address (via ansible_ssh_host, so WinRM knows how to reach it). Once this task completes, we have fully-functional Windows instances that we can immediately target in another play in the same playbook (for instance, to do common configuration tasks, like resetting the password), or we could use a separate playbook run later against an ec2 dynamic inventory that targets these hosts. Let's do the former; we'll install IIS and configure a simple Hello World web app. First, let's create a web page that we'll copy over. Create a file called default.aspx with the following content:

Hello from <%= Environment.MachineName %> at <%= DateTime.UtcNow %>

Next, add the following play to the end of the playbook we've been working with:

- name: web app setup
  hosts: win
  vars_files: [ "secret.yml" ]
  tasks:
  - name: ensure IIS and ASP.NET are installed
    win_feature:
      name: AS-Web-Support

  - name: ensure application dir exists
    win_file:
      path: c:\inetpub\foo
      state: directory

  - name: ensure default.aspx is present
    win_copy:
      src: default.aspx
      dest: c:\inetpub\foo\default.aspx

  - name: ensure that the foo web application exists
    win_iis_webapplication:
      name: foo
      physical_path: c:\inetpub\foo
      site: Default Web Site

  - name: ensure that application responds properly
    uri:
      url: http://{{ ansible_ssh_host }}/foo
      return_content: yes
    register: uri_out
    delegate_to: localhost
    until: "'Hello from' in uri_out.content"
    retries: 3

  - debug:
      msg: web application is available at http://{{ ansible_ssh_host }}/foo

This play targets the win group with the dynamic hosts we just added to it. We pull in our secrets file again (as the inventory will always need the password value inside). The play ensures that IIS and ASP.NET are installed with the win_feature module, creates a directory for the web application with win_file, copies the application content into that directory with win_copy, and ensures that the web application is created in IIS. Finally, we delegate a uri task to the local Ansible runner, and have it make up to 3 requests to the foo application, looking for the content that should be there.
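One caveat with the earlier wait_for step: an open port only shows that something is listening, not that WinRM will accept a login yet. On Ansible versions that ship the wait_for_connection module, a small extra play placed before the web app play can retry a real connection instead. This is a sketch under that assumption, not part of the original playbook; the play and task names are made up for illustration:

```yaml
# Sketch only: assumes an Ansible version that includes wait_for_connection.
# Fact gathering is disabled so the first thing that touches the host is the wait.
- name: wait for WinRM logins to work
  hosts: win
  gather_facts: no
  vars_files: [ "secret.yml" ]
  tasks:
  - name: retry until a WinRM session can be opened
    wait_for_connection:
      timeout: 300
```

The 300-second timeout simply mirrors the value used for the port check; tune it to how long your instances usually take to finish first boot.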

At this point, we've got a complete playbook that will idempotently stand up a Windows machine in AWS with a stock AMI, then configure and test a simple web application. To run it, just tell ansible-playbook where to get its inventory, what to run, and that you'll need to specify a vault password, like:

ansible-playbook -i hosts win-aws.yml --ask-vault-pass

After supplying your vault password, the playbook should run to completion, at which point you should be able to access the web application via http://(your AWS host IP)/foo/.

We've shown that it's pretty easy to use Ansible to provision Windows instances in AWS without needing custom AMIs. These techniques can be expanded to set up and deploy most any application with Ansible's growing Windows support. Give it a try for your code today! Happy automating...

Manage stock Windows AMIs with Ansible (part 1)

Ever wished you could just spin up a stock Windows AMI and manage it with Ansible directly? Linux AMIs usually have SSH enabled and private key support configured at first boot, but most stock Windows images don't have WinRM configured, and the administrator passwords are randomly assigned and only accessible via APIs several minutes post-boot. People go to some pretty awful lengths to get plug-and-play Windows instances working with Ansible under AWS, but the most common solution seems to be building a derivative AMI from an instance with WinRM pre-configured and a hard-coded Administrator password. This isn't too hard to do once, but between Amazon's frequent base AMI updates, and the need to repeat the process in multiple regions, it can quickly turn into an ongoing hassle.

Enter User Data. If you're not familiar with it, you're not alone. It's a somewhat obscure option buried in the Advanced area of the AWS instance launch UI. It can be used for many different purposes; much of the AWS documentation treats it as a mega-tag that can hold up to 16k of arbitrary data, accessible only from inside the instance. Less well-known is that scripts embedded in User Data will be executed by the EC2 Config Windows service near the end of the first boot. This allows a small degree of first-boot customization on a vanilla instance, including setting up WinRM and changing the administrator password; once those two items are completed, the instance is manageable with Ansible immediately!

We'll build up the files throughout the post, but a gist with complete file content is available at https://gist.github.com/nitzmahone/4271319ab8e7acf3330c.

Scripts can be embedded in User Data by wrapping them in <powershell> tags, or in <script> tags for Windows batch scripts; in this case, we'll stick to PowerShell. The following User Data script will set the local Administrator password to a known value, then download and run a script hosted in Ansible's GitHub repo to auto-configure WinRM:

<powershell>
$admin = [adsi]("WinNT://./administrator, user")
$admin.PSBase.Invoke("SetPassword", "myTempPassword123!")
Invoke-Expression ((New-Object System.Net.Webclient).DownloadString('https://raw.githubusercontent.com/ansible/ansible/devel/examples/scripts/ConfigureRemotingForAnsible.ps1'))
</powershell>

A word of caution: User Data is accessible via HTTP from inside the instance without any authentication. While this technique will get your instances quickly accessible from Ansible, DO NOT use a sensitive password (e.g., your master domain admin password), as it will be visible as long as the User Data exists, and User Data requires an instance stop/start cycle to modify. Anyone/anything inside your instance that can make an HTTP request to an arbitrary host can see the password you set with this technique. A good practice is to have one of your first Ansible tasks against your new instance change the password to a different value. Another thing to keep in mind is that the default Windows password policy is usually enabled, so the passwords you choose need to satisfy its complexity requirements.
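As a rough sketch of that follow-up password change, a first task could use the win_user module. The variable new_admin_password is a hypothetical name, not something defined in this post; you'd supply it yourself, ideally from a vault-encrypted vars file:

```yaml
# Sketch only: new_admin_password is a hypothetical variable you define yourself.
# The new value must still satisfy the Windows password complexity policy.
- name: replace the temporary Administrator password
  win_user:
    name: Administrator
    password: "{{ new_admin_password }}"
    update_password: always
```

Once this task succeeds, the password from User Data no longer works, so any later connection needs the new value in ansible_ssh_pass.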

Before we get to the Holy Grail of actually using Ansible to spin up Windows instances using this technique, let's just try it manually from the AWS Console first. Click Launch Instance, and select a Windows image, then under Configure Instance Details, expand Advanced Details at the bottom to see the User Data textbox.

Paste the script above into the textbox, then click through to Configure Security Group, and ensure that TCP ports 3389 and 5986 are open for all IPs. Continue to Review and Launch, select your private key (which doesn't make any difference now, since you know the admin password), and wait for the instance to launch. If all's well, after the instance has booted you should be able to reach RDP on port 3389, and WinRM on port 5986 with Ansible (both protocols using the Administrator password set by the script). It can often take several minutes for Windows instances set up this way to begin responding, so be patient!

Let's test this using the win_ping module with a dirt simple inventory. Create a file called hosts with the following contents:

aws-win-host ansible_ssh_host=(your aws host public IP here)

[win]
aws-win-host

[win:vars]
ansible_connection=winrm
ansible_ssh_port=5986
ansible_ssh_user=Administrator
ansible_ssh_pass=myTempPassword123!

then run the win_ping module using Ansible, referencing this inventory file:

ansible win -i hosts -m win_ping

If all's well, you should see the ping response, and your AWS Windows host is fully manageable by Ansible without using a custom AMI!
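The same check can also live in a playbook, which is handy once you start adding real tasks. A minimal sketch, assuming the hosts inventory above; the file name ping.yml is just an example:

```yaml
# ping.yml (hypothetical file name)
# Run with: ansible-playbook -i hosts ping.yml
- name: verify WinRM connectivity
  hosts: win
  gather_facts: no
  tasks:
  - name: ping over WinRM
    win_ping:
```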

In part 2, we'll show an end-to-end example of using Ansible to provision Windows AWS instances.

Monday, November 19, 2012

DHCP Failover Breaks with Custom Options

I was really itching to try out the new DHCP Failover goodies in Windows Server 2012. I ran into a couple weird issues when trying to configure it- hopefully I can save someone else the trouble.

When I tried to create the partner server relationship and configure failover, I'd get the following error: Configure failover failed. Error: 20010. The specified option does not exist.

We have a few custom scope options defined for our IP phones. Apparently, it won't propagate the custom option configuration during the partner relationship setup- you have to do it manually. I haven't found this step or error message documented anywhere in the context of failover configuration.

Since we only had one custom option, and I knew what it was, I just manually added it. If you don't know which options are custom and need to be copied over, it's not hard to figure out. In the DHCP snap-in on the primary server, right-click the IPv4 container and choose Set Predefined Options, then scroll through values in the Option Name dropdown with the keyboard arrows or mouse wheel until you see the Delete button light up (that'll be a custom value). Hit Edit and copy the values down, then in the same place on the partner server, hit Add and poke in the custom values. If you have lots of custom options, you can use netsh dhcp or PowerShell to get/set the custom option config.

Once the same set of custom options exists on both servers, you can run Configure Failover as normal on the scopes and it should work. The values of any custom options defined under the scopes will sync up just fine.

I also had one scope where Configure Failover wasn't an option. I had imported all my scopes from a 2003 DC a while back, so I'm guessing there was something else corrupted in the scope config- just deleting and recreating the scope fixed the problem (it was on a rarely used network, so no big deal; YMMV).

Hope this helps!

Friday, March 2, 2012

Enabling AHCI/RAID on Windows 8 after installation

UPDATE: MS has recently published a KB article on a simpler way to address this. Thanks to commenter Keymapper for the heads up!

Been playing around with Windows 8 Consumer Preview and Windows 8 Server recently. After installing, I needed to enable RAID mode (Intel ICH9R) on one of the machines that was incorrectly configured for legacy IDE mode (why is this the default BIOS setting, Dell?). In Win7, you would just ensure that the Start value for the Intel AHCI/RAID driver is set to 0 in the registry, then flip the switch in the BIOS, and all's well. Under Win8 though, you still end up with the dreaded INACCESSIBLE_BOOT_DEVICE. The answer is simple enough: turns out they've added a new registry key underneath the driver that you'll need to tweak: StartOverride. I just deleted the entire key, but if you're paranoid, you can probably just set the named value 0 to "0".

So, the full process:

- Enable the driver you need before changing the RAID mode setting in the BIOS:
(for Intel stuff, the driver name is usually iaStorV or iaStorSV, others may use storahci)
-- Delete the entire StartOverride key (or tweak the value) under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\(DriverNameHere)
- Reboot to BIOS setup
- Enable AHCI or RAID mode
- Profit!