Can somebody explain to me why async IO is so important and why it is better than using the operating system scheduler?
If process A is blocked because of IO, then the thing that needs to be done will need to wait for the IO anyway.
Of course, in a server context, process A cannot handle new server requests while it is blocked. But luckily we can run more than one process, so process B will be free to pick it up. I will need to run a few more worker processes than there are CPU cores, but is there a problem with that?
I'm thinking now the problem is maybe that running more workers than there are cores will mean that the server accepts more concurrent connections than it can handle?
If I use async code and run exactly as many workers as I have cores, the workers will never be blocked.
But then, I have the scenario where multiple async callbacks resolve in short sequence, but cannot be picked up by a worker because all workers are busy.
So, in both scenarios (no async but more workers than cores VS async with as many workers as cores) it can happen that the server puts too much on its plate and accepts more than it can handle.
I have a feeling that this is a fundamental problem that manifests itself differently in both paradigms, but exists nonetheless?
However, many people do believe that async/await and event loops make reasoning about non-blocking IO much easier. Has your opinion changed since http://techspot.zzzeek.org/2015/02/15/asynchronous-python-an...?
they do, until their program is mysteriously having MySQL server drop their connections randomly because something CPU-bound snuck in and the server dumps non-authenticated connections after ten seconds. Three days of acquiring and poring over HAProxy debug logs from the production system finally reveals the issue that never really should have happened in the first place because the server is only handling about 30 requests per second, and of course the fix is to switch that part of the program to threads.
asyncio certainly makes it easier to reason about non-blocking IO but it also means you have to construct your own preemptive multitasking system by hand, given only points of IO where context can actually switch. We're coding in high level scripting languages. Low level details like memory allocation, garbage collection, and multitasking should be taken care of for us.
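To make the failure mode above concrete, here is a minimal sketch (not taken from the incident described). A busy loop stands in for the CPU-bound code that snuck in, and a heartbeat coroutine stands in for the I/O that has to stay on schedule. The names `cpu_bound`, `heartbeat`, and `worst_gap`, and all the timings, are made up for illustration; `run_in_executor` is the stdlib call for handing a blocking function to a thread pool.

```python
import asyncio
import time


def cpu_bound(seconds: float) -> None:
    """Busy loop standing in for CPU-heavy work (hypothetical workload)."""
    end = time.perf_counter() + seconds
    while time.perf_counter() < end:
        pass


async def heartbeat(gaps: list) -> None:
    """Record the gap between ticks that are supposed to fire every 10 ms."""
    last = time.perf_counter()
    for _ in range(20):
        await asyncio.sleep(0.01)
        now = time.perf_counter()
        gaps.append(now - last)
        last = now


async def worst_gap(offload: bool) -> float:
    gaps: list = []
    ticker = asyncio.create_task(heartbeat(gaps))
    await asyncio.sleep(0.03)  # let the heartbeat get going
    if offload:
        # Run the blocking function in the loop's default thread pool.
        await asyncio.get_running_loop().run_in_executor(None, cpu_bound, 0.2)
    else:
        cpu_bound(0.2)  # no await inside, so nothing else can run meanwhile
    await ticker
    return max(gaps)


inline_gap = asyncio.run(worst_gap(offload=False))
threaded_gap = asyncio.run(worst_gap(offload=True))
print(f"worst heartbeat gap, inline:    {inline_gap:.3f}s")
print(f"worst heartbeat gap, in thread: {threaded_gap:.3f}s")
```

Under the GIL the thread does not make the CPU work any faster; it only lets the interpreter preempt it, so the event loop keeps getting turns.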
But keep in mind asyncio has many issues of its own, which is why I'm happy that alternatives like http://trio.readthedocs.io/ are possible in Python.
I don't really want to write callback-driven code anywhere, but even for simple definitely-not-thousands-of-concurrent-connections kinda problems, the typical language's standard library operation of "do this one thing and wait indefinitely until it's done" gets fairly annoying, working around it with threads is also annoying, and I'm just kinda hoping that the people working on making the async IO user experience better will accidentally make my life easier as well.
1. Yield points are implicit rather than explicit, so their interaction with other effects is unpredictable. https://glyph.twistedmatrix.com/2014/02/unyielding.html has a description of the problem; more generally think about the reputation that thread safety bugs have. (There are those who argue that the advantages of implicit pervasive yielding outweigh the disadvantages).
2. At every yield the processor has to swap out the full C stack (usually 4K or 8K). This is a slow operation ("context switch") and inefficient when often the only information that actually needed to be passed from one task to the next was a single integer (e.g. a socket ID, user ID, SQL query ID, etc.) or something similarly small. Whereas with userspace scheduling, a task switch only has to pass the actual task state that's needed for the task in question.
It's also possible (but not common) to release accessed-but-no-longer-needed pages. Doing this when returning to a thread pool avoids situations where one occasional deep call chain causes all the threads in the pool to eventually become huge.
You still have to bank a whole bunch of registers and there might be detrimental side effects but I can't imagine why the stack would get in the way. Process swapping is obviously much worse since you change the process ID and the virtual memory mapping but even then I don't see why you'd ever copy the stack.
A new process requires a new stack and a new heap segment; the latter is usually in the MB range.
Besides, starting and stopping threads takes some time. If you are serving small files, forking at their start means that your process will spend most of its time forking. And if you don't use an easy architecture that forks on every connection, it isn't much more difficult to go all the way and make your server fully asynchronous.
You still do.
The difference is with thread-based IO you block a task until the IO scheduler is done with one operation, while with async IO you block a task until the IO scheduler is done with any operation.
The reason why async can be more efficient is that you have fewer tasks and possibly less task-housekeeping-overhead.
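A rough sketch of the "block until any operation is done" half, using the stdlib `selectors` module. The three local socket pairs are stand-ins for client connections, and the connection ids are arbitrary; a real server would register accepted sockets instead.

```python
import selectors
import socket

sel = selectors.DefaultSelector()

# Three local socket pairs stand in for three client connections.
pairs = [socket.socketpair() for _ in range(3)]
for conn_id, (_, server_side) in enumerate(pairs):
    server_side.setblocking(False)
    sel.register(server_side, selectors.EVENT_READ, data=conn_id)

# Only "connection" 2 has sent anything.
pairs[2][0].sendall(b"hello")

# A single task blocks here until ANY registered socket becomes readable.
ready = [(key.data, key.fileobj.recv(1024)) for key, _ in sel.select(timeout=1.0)]
print(ready)

sel.close()
for client_side, server_side in pairs:
    client_side.close()
    server_side.close()
```

The thread-per-connection equivalent would be three threads, each parked in its own blocking `recv()`, two of which would still be waiting.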
It's important to note that disk I/O generally has poor support for asynchronous operation (because disk I/O traditionally meant that high I/O concurrency brutally murders throughput, which has changed). It's something that's being worked on for Linux, though.
For each of those network connections, your program needs to retain some state for what it is doing.
There are various places to put the state. You can put the state in a callstack and use blocking IO, but then you have to have one process per connection, which has quite a high minimum memory overhead.
So people have developed a lot of frameworks which keep (connection,program) state in various ways and use the operating system's asynchronous frameworks, so you can have two worker processes handling thousands of connections.
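As one illustration of keeping per-connection state outside an OS-level callstack, here is a sketch using asyncio streams: each connection's state is just the local variables of a coroutine, and one process serves all of them. The handler, the reply format, and the count of 50 clients are invented for the example; the real ceiling depends on file descriptor limits and memory.

```python
import asyncio


async def handle(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    # Per-connection state lives in this coroutine's locals,
    # not in a dedicated OS thread or process.
    seen = 0
    while line := await reader.readline():
        seen += 1
        writer.write(f"{seen}:".encode() + line)
        await writer.drain()
    writer.close()
    await writer.wait_closed()


async def client(port: int, name: str) -> bytes:
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    writer.write(f"{name}\n".encode())
    await writer.drain()
    reply = await reader.readline()
    writer.close()
    await writer.wait_closed()
    return reply


async def main(n_clients: int = 50) -> list:
    # Port 0 asks the OS for any free port.
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    async with server:
        return await asyncio.gather(*(client(port, f"c{i}") for i in range(n_clients)))


replies = asyncio.run(main())
print(len(replies), replies[0])
```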
Multiprocessing has a lower CPU overhead and is usually the better choice if you have fewer but heavier connections (i.e. you do a lot of work on each connection), since the OS scheduler can then properly allocate resources to each process (i.e. provide a minimum amount of CPU time to every process so no connection gets stuck). Async can do that too, but it doesn't have the resource allocation granularity that processes have (in Go or JS I can't just allocate fewer resources, such as CPU time, to a connection).
If, however, you need to keep a lot of shared state up to date, it may become a bit messy. Consider a multiplayer video game server, for instance: everybody needs to know where everybody else is. If you have a different thread or process for every connection, you need some sort of synchronization mechanism to update the shared game state. Meanwhile, async I/O kind of hides the nasty details of synchronizing multiple connections, and you end up with a more or less serialized, single-threaded stream of events. You can handle each update one at a time, synchronously.
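A toy version of the game-server point, assuming a shared `world` dict that every connection updates (the dict, the counter, and the step count are all hypothetical). The threaded variant needs an explicit lock around the read-modify-write; the asyncio variant gets away without one because only one coroutine runs between awaits.

```python
import asyncio
import threading

STEPS = 1_000  # updates per simulated connection


def run_threaded(n_players: int) -> int:
    world = {"updates": 0}
    lock = threading.Lock()

    def connection() -> None:
        for _ in range(STEPS):
            with lock:  # the read-modify-write must be guarded explicitly
                world["updates"] += 1

    threads = [threading.Thread(target=connection) for _ in range(n_players)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return world["updates"]


async def run_async(n_players: int) -> int:
    world = {"updates": 0}

    async def connection() -> None:
        for _ in range(STEPS):
            # No lock: on a single event loop nothing else runs between awaits.
            world["updates"] += 1
            await asyncio.sleep(0)  # yield, as a real handler would on I/O

    await asyncio.gather(*(connection() for _ in range(n_players)))
    return world["updates"]


threaded_total = run_threaded(4)
async_total = asyncio.run(run_async(4))
print(threaded_total, async_total)
```

The catch is that the async guarantee only holds as long as no `await` sits in the middle of an update, which is the explicit-yield-point argument made elsewhere in this thread.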
Web services generally manage to go the first route because the synchronization is typically handled at the database level so there's no shared state at the webapp layer so each worker is effectively independent from the rest.
Also, you may only need the concurrency for a small portion of your code, and the other approaches may be simpler to code and maintain (I’m looking at async).
It comes down to this: Mutual exclusion is a requirement for concurrent operations.
This can be accomplished with mutexes (usually how thread-based implementations work), cooperative scheduling (how some async implementations work, but not all), shared nothing architectures (one request per process, actor/CSP model), among other approaches.
There are plenty of hybrids out there. libevhtp is a threaded async web server. It screams. Erlang presents a no shared memory model along with an aggressive scheduler, allowing for one request per process designs, except in this case a process is extremely lightweight. Golang does something similar with goroutines, similarly lightweight units of execution.
Doing async IO was a way to solve the C10k problem back in the day.
The max is only 4M, so I guess I'll change my statement to: Linux doesn't handle millions of threads very well.
I'm not aware of any system that manages millions of threads well; it's also not clear what the use case for this would be.
As per the top post, millions of threads means there is no need to implement your own scheduling system with async IO, which can handle millions of idle connections.
There are many comments discussing the general case, but there's also a specific case that is relevant here, which is that Python is not very good at threading. Retrofitting threading onto dynamic scripting languages 10+ years into their lifetime has not proved a very successful project, with results ranging from "unusable" to "usable but probably not something you should put into production".
So for these languages, "spawning lots of threads and using the OS scheduler" is simply not an option. The only way to recover any sort of concurrency is via async-style operations, where there's various ways of wrapping more or less syntax sugar around it but fundamentally at any given moment, only one instruction is being executed by the interpreter.
(I don't think dynamic scripting languages have any fundamental reason why they can't support threading, it's just really damned hard to retrofit that on to a code base that was optimized for many years for single-threaded behavior. The dynamic scripting languages that are popular all date back to the 1990s. Theoretically a new threadable one could be developed, but I suspect there's a lot of reasons why it would have a hard time gaining any traction, because this hypothetical new language would be trying to go toe-to-toe with all these other languages with decades of experience in the dynamic field, and on the "but we have working concurrency!" side you face competition from Go and other up-and-comers like Crystal, and I'm not sure there's enough sunlight in that niche to allow anything to grow.)
Generally, when I say this, people claim that there are some dynamic scripting languages that do support threading. Please point me at the exact module that implements it and show me some community consensus that it is safe to use in production. Last I knew, when I said this about a year ago, PHP was closest with a threading library, but community consensus was still "Yeah, don't use this in production." I have no issue with acknowledging that some dynamic scripting language has finally run the gauntlet to having a production-ready threading library, because the point I'm defending is that it was a gauntlet in the first place. If PHP does have production-ready threading, it was a project that took something like a full third or half of its lifetime to accomplish!
> So for these languages, "spawning lots of threads and using the OS scheduler" is simply not an option.
Threaded Python applications are extremely common and are in widespread production use. The popular mod_wsgi Apache plugin is typically run in "daemon" mode where the Python code is run in a threaded server.
The issue where "spawning lots of threads" is not an option is when you are trying to parallelize IO in the range of many hundreds/thousands of concurrent connections within a single process. But that is not the general use case, because that process can only use one CPU core at a time which even for an IO-heavy process still presents a limiting factor. The typical "I want to run a web server" has fairly CPU-busy Python processes where a process typically handles a few dozen simultaneous requests, and for CPU parallelism you use multiple processes. Python has more CPU-busyness than people expect sometimes because it is after all an interpreted scripting language.
A similar situation exists in Ruby.
It can be used instead of the default asyncio.SelectorEventLoop on Linux (or at least not on Windows: https://github.com/MagicStack/uvloop/issues/14).
You will still, for better or worse, use the asyncio standard library when writing code running on uvloop.
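For what that looks like in practice, a small sketch: the coroutine is plain asyncio, and only the loop construction changes. It assumes uvloop is installed (`pip install uvloop`) and falls back to the stdlib loop if it is not. `uvloop.new_event_loop` is the only uvloop call used; `main` and `loop_module` are names made up for the example.

```python
import asyncio

try:
    import uvloop  # third-party; assumed to be installed
    new_loop = uvloop.new_event_loop
except ImportError:
    new_loop = asyncio.new_event_loop  # fall back to the stdlib loop


async def main() -> str:
    # Ordinary asyncio code; nothing in here is uvloop-specific.
    await asyncio.sleep(0.01)
    return type(asyncio.get_running_loop()).__module__


loop = new_loop()
try:
    loop_module = loop.run_until_complete(main())
finally:
    loop.close()
print("event loop implementation comes from:", loop_module)
```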
But they didn't actually present the total processing time for all of the methods - I assume all of the parallel methods were about 17 seconds? (Compared to the sequential baseline of 29 seconds.) And how were the threaded frameworks configured? How many threads were they told to use (or just the default?), how many threads can they use, and what kind of parallel hardware did they run on?
This blog post presents the decision as one-dimensional; it claims all parallelization methods are the same, so the only dimension to choose on is memory efficiency. But I'm skeptical that all parallelization methods are the same, and the experimental design gives me no information on that front.
> Changed in version 3.5: If max_workers is None or not given, it will default to the number of processors on the machine, multiplied by 5, assuming that ThreadPoolExecutor is often used to overlap I/O instead of CPU work and the number of workers should be higher than the number of workers for ProcessPoolExecutor.
Not knowing how many things are happening in parallel means it is difficult to draw conclusions from it.
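One way a benchmark could remove that ambiguity is to pin the pool size and record how many tasks were actually in flight. A sketch, where `MAX_WORKERS = 8`, the 40 fake requests, and the 20 ms sleep are arbitrary choices for illustration; note that the default quoted above is the 3.5-era one, and Python 3.8 changed it again to `min(32, os.cpu_count() + 4)`.

```python
import os
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 8  # hypothetical choice, pinned so the run is reproducible

# The 3.5-era default quoted above, computed by hand for comparison.
default_in_3_5 = (os.cpu_count() or 1) * 5

in_flight = 0
peak = 0
lock = threading.Lock()


def fake_request(_: int) -> None:
    """Sleep in place of a network call and track concurrent executions."""
    global in_flight, peak
    with lock:
        in_flight += 1
        peak = max(peak, in_flight)
    time.sleep(0.02)
    with lock:
        in_flight -= 1


with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    list(pool.map(fake_request, range(40)))

print(f"3.5-era default: {default_in_3_5}, pinned: {MAX_WORKERS}, observed peak: {peak}")
```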
In terms of speed (overall script duration), they are very similar, and the differences between them might just as well be caused by varying network conditions. My main interest was how the methods perform regarding memory usage. That's why I focused on one dimension only.
I suppose that sending is similar: you pass the OS a buffer, and wait for completion. Sending of the data occurs while you wait.
So, if you can parallelize waits, the OS could be doing strictly parallel I/O for you (e.g. via two network interfaces, or network and disk), even though your code is concurrent but not parallel, and you don't run two sync OS I/O calls in parallel.
you can run sync IO calls and wait for each in separate threads. the GIL is released for IO. the waiting is "in parallel" just as much with a threaded / blocking approach as with a non-blocking.
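A quick way to see this for yourself, with `time.sleep` and `asyncio.sleep` standing in for real network waits (like blocking socket calls in CPython, `time.sleep` releases the GIL). The five waits of 0.1 s are arbitrary.

```python
import asyncio
import threading
import time

N, DELAY = 5, 0.1


def blocking_io() -> None:
    time.sleep(DELAY)  # stand-in for a blocking socket read


# Sequential: the waits add up.
start = time.perf_counter()
for _ in range(N):
    blocking_io()
sequential = time.perf_counter() - start

# Threads: each thread blocks, but the waits overlap.
start = time.perf_counter()
threads = [threading.Thread(target=blocking_io) for _ in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - start


async def non_blocking() -> None:
    # Event loop: one thread, and the waits overlap as well.
    await asyncio.gather(*(asyncio.sleep(DELAY) for _ in range(N)))


start = time.perf_counter()
asyncio.run(non_blocking())
evented = time.perf_counter() - start

print(f"sequential {sequential:.2f}s, threads {threaded:.2f}s, asyncio {evented:.2f}s")
```

Both concurrent variants should land near the duration of a single wait; the difference between them is in per-task overhead, not in how much waiting overlaps.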
Parallelism is things physically being executed simultaneously, increasing throughput via physically multiple resources. The multiple IOs you're waiting on might be executing in parallel if you have e.g. a RAID array or SAN in the loop, but it's an implementation detail and isn't what the article is about.
The two terms mean something technically different. The article isn't about parallelism.
If you want to parallelize network I/O, use async. Otherwise, don't.
Technically: not parallelize, but overlap.