Wednesday, March 18, 2020

Why no Ansible controller for Windows?

As Ansible's first full-time hire working with Windows back in 2015, I often get the question "Why can't I run an Ansible controller on Windows?". It's a really good question, and one that I've spent a lot of time thinking about (and advocating for, and prototyping) over the years. There have been statements in our docs and by our core devs in the community that basically amount to "not gonna happen, don't ask", but I think we're overdue for a deeper dive into the challenges. Rather than sprinkling that discussion over a bunch of Github issues and IRC, I'll try to cover the big stuff all at once in this post.


There are a lot of UNIX-isms deeply baked into most of Ansible that prevent it from working on native Windows at all, and even if we solved every one of them, the likelihood of real-world playbooks executing with 100% fidelity between a *nix controller and a Windows controller is almost zero. If you want to run an Ansible controller on Windows anytime soon, use WSL.

Okay, if you're still with me, I'll assume you're looking for more detail. I've actually done two internal prototypes of a Windows Ansible controller just to see what broke, and how hard it'd be to address. I'll describe the largest issues, and ways they could potentially be solved. This list is by no means exhaustive, but should hopefully illustrate that the overall effort is a non-trivial problem, as is "fixing" it without potentially breaking a lot of other things in Ansible.

Worker Process Model

Ansible's controller worker model (as of 2.10) makes heavy use of the POSIX fork() syscall, which effectively clones the controller process for each task+host as a worker process, executes the host-specific action/connection/module code in the cloned worker, marshals the results of the task to the controller, and exits. This is a tried-and-true mechanism for concurrent execution that works effectively the same on all UNIX-flavored hosts, and especially with Python, often yields much better performance than with threads (for reasons I won't go into here). So what's the problem? Windows doesn't have fork(). This means that the entire worker execution subsystem (including connections, actions, modules) is 100% non-functional on Windows as currently implemented. POSIX-compatibility projects like Cygwin have attempted to implement fork() for Windows, but even after years of really smart people working on it, they admit that sometimes it just breaks, which implies that it shouldn't be relied on for anything important. WSL takes care of this problem in its new process model by implementing a proper fork(), but that's not Windows-native either (and TMK can only be used by WSL Linux processes).

So why not have threaded workers as an option? Significant effort has been expended to prototype threaded workers in Ansible, but without pretty major changes to the various plugin APIs to optimize their behavior for Python's well-documented limitations around threaded execution, acceptable performance and scaling cannot be achieved. The other issue with threaded workers in the main controller (or any shared/reusable worker process model) is that most plugins (including 3rd party plugins not shipped by Ansible) were written to assume that the worker process is both isolated from the controller and ephemeral. Side effects of things commonly done by plugins that are completely innocuous when they're fork-isolated could range from "annoying and weird" to "fatal" when they're happening concurrently in a shared process. This is an area where we've got a lot of ideas to improve the model in the future (and most of them would be Windows-friendly), but doing so while preserving backward-compatibility with existing user-developed plugins will take a great deal of effort.

UNIX-isms in Core Plugins and Executor

Once the process model problem is solved, the next issue is that much of the Ansible Python module subsystem, core modules, and other parts of the execution engine assume they're running in a POSIX-y environment. Things like POSIX file modes, shebangs, hardcoded forward-slashes on generated paths, assumed presence of POSIX command-line tools/syscalls/TTYs, to name just a few... Many of these items can be addressed with Windows-specific code paths, but it's not a simple task. There are also exceptions that are effectively unresolvable- things like file mode args on common core modules like file, template, and copy. What should a Windows host do when asked to set a UNIX file mode on a Windows filesystem? This is why there are Windows-specific versions of those modules. We try to keep the common arguments and behavior consistent (within reason), but having separate implementations allows us to use the native management technology for the platform (eg Powershell/.NET on Windows, Python on POSIX), and to let the module UIs differ in platform-specific ways where it makes sense. For historical reasons, there are some places where the action/module names are the same for Windows and POSIX, but due to the numerous problems it's caused, we've tried to minimize that as a policy for new work. So effectively, this means that many core POSIX modules could never be fully functional on Windows- it'd be necessary to use the Windows equivalents.

We sometimes get asked why we don't accept pull requests that fix some of these things, along with a policy to reject future changes that regress the fixes... The main reason is that, without comprehensive sanity/unit/integration tests in CI to ensure no regressions, these kinds of changes rot quickly. Policy without enforcement is not really policy, especially on a project as large as Ansible with as many folks able to approve and merge pull requests as we have. We learned long ago that manual code reviews looking for XYZ policy violations will always let things slip through, so sweeping policy-based changes throughout the codebase must be enforced in an automated fashion. Until the code reaches a point where integration tests could actually test and enforce Windows-specific non-regression tests on Windows, sweeping changes in the name of future Windows support probably aren't going to be accepted. 

Content Execution Parity

Let's say we've eliminated or built Windows equivalents to all the UNIX-isms in the core codebase, and that everything is working. Huzzah! Now we should be able to run all the Ansible content out there on our shiny new Windows-native Ansible controller, and the world is all rainbows and unicorns! Right? Sorry, nope. Even though we've gotten rid of the UNIX-isms in the code, that doesn't address UNIX-isms that exist in the Ansible content that the world runs on. The most obvious issues are around POSIX-flavored plays that use localhost or the local connection plugin, since it's necessary to use platform-specific versions of the modules to deal with things like paths and file modes. But that's not the only issue; content with commonly-used features like the pipe and template filters, glob lookups, become methods like su and sudo, to name a few, will never be able to execute the same way in a native Windows environment. If you're an all-Windows shop, or your Ansible content is developed specifically to run on a Windows controller, maybe that's all fine, but without a lot of guardrails to inform when something unsupported or unworkable is happening, it's a recipe for frustration for the folks that are just trying to automate all the things.

Honestly, I don't think there's a realistic comprehensive solution to this one. The best we could probably do is to tell you when you're trying to do something that's unsupported on your controller platform, so at least it's obvious that some conditional behavior is necessary in your content if you want to support running on both Windows and POSIX controllers. Maybe part of the solution is also that the implicit localhost for Windows doesn't exist at all, or is called something else (so we won't even try to run POSIX stuff on Windows or vice-versa). That eliminates the need to make most of the POSIX/Python modules (and related subsystem) work on Windows at all, while still allowing the language and controller to work there. Remember: this is only about the behavior of localhost and the local connection plugin- for the majority of tasks where Ansible is managing remote targets of any type, execution parity should be achievable.

Things That Give Me Hope

None of these things are insurmountable. But they're also not going to happen the right way without some serious investment. Red Hat is clearly not afraid to invest in Windows where it makes sense; look to the existing Windows target efforts in Ansible, official support for Windows OpenJDK, Windows containers on Openshift, to name a few. Ansible has historically been an easy sell to all-Linux and mixed Linux/Windows groups, but without native Windows controller support, most all-Windows groups tend to stop the conversation pretty early. If you fall into this latter camp, be sure to let your Red Hat salesperson know how many Ansible nodes they're missing out on because we don't support this configuration today.

All that said, there's never been a better time to run Ansible controllers on Windows. Ansible works great on WSL and WSL2, and is pretty darn seamless. While that configuration is not capital-S-supported by Red Hat, most of the minor issues we've encountered have been easily addressed. We still tell people to avoid using Ansible under Cygwin, as the previously-mentioned fork unreliability will eventually cause things to break.

As we work on the future of Ansible, we're trying to make sure we eliminate barriers to native Windows controllers, and don't erect any new ones. I'd love to someday announce first-class native Windows Ansible controller support. But it's not something that's going to come easily or quickly.


EmperorArthur said...

What's interesting to me is the use of fork() instead of Python's built in multiprocessing. To me that seems like not just a POSIXism, but a general improvement that could be made by using tools built into the language.

I would really appreciate it if this could be elaborated on further.

Matt Davis said...

It does use `multiprocessing`, but it currently relies exclusively on the forked implementation of multiprocessing in order to get a (nearly free) process-isolated snapshot of all the vars, worker state, etc; IOW, the worker model is optimized for the worst-possible case of a badly-behaved action that corrupts state and needs access to all vars. That's not necessary for the vast majority of tasks, so a reuseable spawned worker where only the necessary task state is shipped over would be sufficient. You can't just switch MP over to `spawn` though- at that point each worker is starting from empty. There are a lot of moving parts necessary to maintain backward compatibility with all the existing plugins that are out there, while still allowing new plugins to be written against a different model.

We've prototyped a few different versions of new worker models over the years, and I've got a hybrid fork/spawn one that covers nearly all the backwards compatibility cases that we'll probably start playing with trying to bring to production in 2022. v1 may or may not be able to do "spawn only" (which would be a prereq for a native Win32 Ansible), but with the ever-degrading condition of fork-sans-exec on MacOS, we might want to expend the extra effort to make spawn-only at least an option.