If you think about what a program does, and how a computer works, with concerted effort a really dedicated programmer could implement multithreadding as a whole series of binary patches to trap and redirect execution.
it would just be really hard, and nobody would want to do it.
I disagree with the assessment (unless "a series of binary patches" essentially means replacement of everything). To implement multi threading you need to protect data stores so access is blocked while the data items are modified, and you need to ensure modifications are performed in the correct order even when done by different threads, based on the correct version of the data (not old data available when the thread was started, later superseded by another thread), etc. This means you need administration to set the threads off to work with defined sets of data (as opposed the semi random state of a common store at various access times), as well as collection and integration of the results from the various threads into a new common state, and you may need to use several "pulses" within a tick. Micro threading, however, is a different thing that might be achieved by a good compiler or possibly through binary patches.
A compiled binary is a long string of instructions, a bit like a chain of pearls. Instructions are executed, and position on the chain is tracked using the stack pointer. This pointer gets updated on conditional jumps, which move the execution to different locations on that chain, depending on what it just did.
The hypothetical binary patch set I mention, would insert a new routine that traps execution, and spawns new execution threads, sending them to the appropriate parts of the binary stack. At the end of each thread instruction, is a call to return to this new landing pad area, where the worker thread instance then determines it is not the master thread, and waits for a new assignment. The exact same code that toady uses for say-- finding a path from point A to point B would be called, it would just be executed a bunch of times in parallel, the results sanitized by the main parent thread, and then after paths are calculated, the next logical process would happen.
EG, look at what the process flow is for the activities of each creature is, then parallelize those processes. Trap execution before moving to the next execution step, check sanity (and re-run as needed to retain concurrency), then continue execution to the next batch of evaluations.
Not a complete re-write, but it would be a mammoth undertaking. Like I said, nobody would want to do it.
You are of course, correct about needing to protect all the data to keep the threads from walking on each other. That is one of the major functions of the main thread, after it gets trapped in our little patched in routine. It becomes the watchdog, and handles the necessary spinlocks to prevent write accesses, tracks threads that have failed due to race conditions, and aborts/restarts them as needed.