Topic: Multi-core concatenation? (Read 3395 times)

dree12 · « **Reply #15 on:** November 03, 2011, 04:39:46 pm »

Quote from: janglur on November 03, 2011, 03:50:22 pm

Lectorog is correct. Whether DF is on one, two, or all three cores, performance doesn't improve. In fact, each additional dedicated* core makes it lose FPS at increasingly significant intervals.

* = Dedicated, as in it's the ONLY program AT ALL allowed to access the core.

Ah, yes, DF needs to manage its cores as well. I ignored that in my 13 number. That number stems from its supposed 13 threads it uses - this can easily be seen in a task manager.

kaijyuu · « **Reply #16 on:** November 03, 2011, 04:41:32 pm »

Since every major OS can delegate different programs to different cores, having multiple cores without multithreaded applications is far from useless. DF will get 99.9% of a single core when it is run, while the OS and everything else is run on other cores.

Software's what is going to have to step up to the plate eventually. We'll still have faster and faster cores (for a while anyway) but yeah that'll hit an an upper limit without some sort of breakthrough.

thobal · « **Reply #17 on:** November 03, 2011, 05:28:32 pm »

Say that during a given tick, each of your AI-units needs to assess it's situation, and come to a decision. Can not then each unit be given a different core(or queued up for a core), all that stuff be run through the cores, then blasted onto the next turn? Because I think multi-core processing is going to open up an incredible world of intelligent AI. At least as far as games are concerned.

Nadaka · « **Reply #18 on:** November 03, 2011, 06:21:25 pm »

Oh... This again. It gets brought up every couple weeks or months, but usually on the upper forums.

There is automated method for improving performance of an arbitrary single threaded application by splitting it into discrete parallel tasks.

Some of the greatest minds in computer science have spent decades on this task and failed.

Why? There are a several reasons why. One is that some large tasks that are atomic and can not be split into discrete tasks. Another is that the average human mind is incapable of dissecting arbitrary code without domain expertise and documentation and AI is worse.

That said, writing new code that is or can be made parallel is actually very easy once you understand the basic principals. Ensuring that you squeeze increased efficiency out of it is only marginally more difficult.

I am very much an expert on this. It is one of my specialties.

Levi · « **Reply #19 on:** November 03, 2011, 10:30:30 pm »

I dare ya to write a program that outputs the numbers 1 to 1 million in order that runs faster on multiple CPUs than it does on a single processor.

Starver · « **Reply #20 on:** November 04, 2011, 06:39:54 am »

Quote from: Levi on November 03, 2011, 10:30:30 pm

I dare ya to write a program that outputs the numbers 1 to 1 million in order that runs faster on multiple CPUs than it does on a single processor.

Indeed. Although ask for something a tad more complicated (e.g. Roman Numeral representations of said numbers) you could (perhaps) dedicate n-1 cores to working out a number, on demand, and the nth to shepherd this process by allocating the numbers to be processed, retrieving the answers and outputting these in the correct order. The more complex numbers (e.g. 888,888, depending on how you translate the respective numerics) might lag behind some of the following numbers[1], but you can probably absorb that overhead.

(Make it an analysis of the periodicity of a PRNG with each of a range of successive seed conditions, e.g. for crypto-breaking purposes[2], and you might find that some outputs (the better ones!) are delayed enough that a queue of the quickly-ending seed-conditions builds up, although you're probably not concerned about sequential output, just "largest periodicity so far confirmed", and perhaps even "periodicities not yet confirmed, but certainly beyond a certain quantity of terms" alongside, to be enumerated as and when. But I digress.)

But it really needs a programmer's eye to work that sort of stuff out, in general, or at least some way of fulfilling the programmer's intentions (who might not care about sequentiality of some massively repetitive test, but it was only that the linearly-looping code most logically went through a sequence in order and output results in said order). Getting (as per the OP) the single-process code split virtually into into a multi-core version could probably done automatically (distilling a subset of the programmer's own knowledge into the MCP-thingy), but the method for doing so needs code of its own which has enough 'understanding' to be able to rip apart the linear code and re-assemble it in a functionally identical (or sufficiently specification-compliant) asynchronous version. Doing it on the fly would incur overheads (although on-the-fly with a view to looking for patterns that can be cloned and kept running persistently in their own little microcosm might mean it speeds up towards the end), and a suitably-complex pre-processor would certainly work for such a simple program but add its own difficulties, and I very much suspect that either kind of automated re-optimisation would be murder, in the case of DF. Until and unless Toady personally rejigs the whole thing towards asynchronous code (and I think majority consensus is composed of those that want the next version ASAP plus those that that don't mind waiting a bit, but on the basis of wait==content++[3], not (as would be needed) a huge pause while existing features are just 'multicore-friendlified') the side-optimisation of the rendering engine and just keeping the main DF-running core largely free of other system stuff is what we've got. (As noted, by someone else, seems to work well for those that have mentioned they've checked this out.)

[1] Actually, bad example, as most of the complexity in that number is shared by the next few numbers, anyway, with just the last one or two decimal digits needing whatever "extra effort" you might consider VIII to be over the following IX, X, [XC] and [XC]I that would be the next few numbers in the queue. BYGTI.

[2] Or, something that I've been wondering... It's possible (obviously increasingly and exponentially unlikely) to roll successively similar numbers with a dice, or a continuous sequence of common coin-results (as in Rosencrantz And Guildenstern Are Dead, Alas), and yet the first conditional requirement for a good PRNG seems to be that such sequences are less desirable in PRNGs than in actual 'real life'. Obviously, for cryptography purposes, that is all for the best (because it avoids possibly identifiable sequences in a stream-cipher that are essentially encoded by a simple shift-cipher, albeit that it reduces the entropy in the same way as "no character is ever equal to itself" under the Enigma system which was used to good effect in paring down the possibilities), but in a dice-rolling game you do sometimes get multiple sixes (or ones) in a row, and so you should get that in a computationally-equivalent environment, and with the same average frequency (over many, many iterations, naturally, to make it statistically significant) as you'd get in RL, once you distill the output states down to groups that represent a roll of a one, a roll of a two, a roll of a three, etc[2a]. But I digress once more.

[2a] Another point of analysis being not just the PRNG mechanism and status/seed, but the final grouping mechanism, given that a high-precision output bit-reversed (or otherwise chopped and mixed up) might give a differing frequency of group-following-group patterns than the original (once this new output is once more split into several contiguous ranges of values that boil down to the desired lower-precision output states), and might be worth chasing up as an additional (computationally-intensive!) variable in pursuing the desired sequence that provides one with the same "possibility of flukes". More digression. Ignore.

[3] Or "++content", I suppose, technically.

MorleyDev · « **Reply #21 on:** November 04, 2011, 07:48:43 am »

Quote from: Levi on November 03, 2011, 10:30:30 pm

I dare ya to write a program that outputs the numbers 1 to 1 million in order that runs faster on multiple CPUs than it does on a single processor.

I dare you to fight a tiger in a one metre by one metre room naked and armed with a sniper rifle instead of a shotgun. Ergo sniper rifles suck compared to shotguns.

Do I even need to point out the logical fallacy you're making here?

What about Conway's Game of Life? The functional in-out nature of it is well suited for splitting the grid into sections that can be divided amongst the cores and processed in parallel.

ChairmanPoo · « **Reply #22 on:** November 04, 2011, 08:24:52 am »

I dare you to fight a tiger in a one metre by one metre room naked and armed with a rubber hose and an used diaper

Starver · « **Reply #23 on:** November 04, 2011, 08:27:12 am »

While the 1..1,000,000 example was perhaps at the extremity, so is yours, at the other end. I don't think anyone would argue against massive parallelism being the best way of dealing with a massive-parallelism-suitable problem. As it is, however, DF is far from the latter, even while it not as cut'n'dry as the former.

It is eminently arguable that DF's multi-agent nature and grid-based system could actually be reimplemented (via a total re-write) so that the individual units and sub-regions involved in path-finding, water-sloshing, temperature transitioning, etc, could have all their calculations dedicated to separate forks which could be drawn away to work on different cores and spread the load, just as you are suggesting with the GoL example (very similar, in many ways, I know). But that's a separate argument[1] from what the OP of this thread was talking about and would take much time and effort that Toady has previously stated he is not currently willing to divert to (and would need to spend time changing his coding style for) just to take the current game and (doubtless with some related refinements along the way, admittedly) redo it in that manner.

Quote from: ChairmanPoo on November 04, 2011, 08:24:52 am

I dare you to fight a tiger in a one metre by one metre room naked and armed with a rubber hose and an used diaper

Is that fully naked and the diaper 'used' before going went into the room?

[1] And contrary to both the "bugfix now!" and "moar content!" attitudes being pitted against each other on another thread I'm posting to.

Shades · « **Reply #24 on:** November 04, 2011, 08:44:15 am »

Quote from: Levi on November 03, 2011, 10:30:30 pm

I dare ya to write a program that outputs the numbers 1 to 1 million in order that runs faster on multiple CPUs than it does on a single processor.

This is actually trivial to do as the calculations at each stage are discrete. Each processor output a different set of the numbers to a memory mapped file for example would leave the output in order but take significantly less time.

Displaying it to the screen faster would be a problem because the IO is a bottleneck.

Starver · « **Reply #25 on:** November 04, 2011, 09:26:21 am »

Quote from: Shades on November 04, 2011, 08:44:15 am

Quote from: Levi on November 03, 2011, 10:30:30 pm
I dare ya to write a program that outputs the numbers 1 to 1 million in order that runs faster on multiple CPUs than it does on a single processor.

This is actually trivial to do as the calculations at each stage are discrete. Each processor output a different set of the numbers to a memory mapped file for example would leave the output in order but take significantly less time.

Displaying it to the screen faster would be a problem because the IO is a bottleneck.

There's two points to that (I believe). Firstly, the display bottleneck is the primary factor, and almost definitely takes vastly more effort (GPU shortcuts aside) than the successive incremental operations that are bread-and-butter to any non-esoteric programmable computer.

But, secondly and probably most important, in a parallelism situation the management thread itself is expending more effort (and thus slowing its own core down) in doing all the necessary management functionality. Compare just getting the main thread to do the "incremental by 1" itself, without any need for all the management-code, with farming out the "increment by n" to n other processes, and correctly collating the return signals. Even cut down to the bone with minimal attempts to keep everything synchronised and buffered, it can't possibly be any faster.

If you're taking the value i₀ from 1 to 1,000,000 and calculating (say) i_t+1 = (ai_t+c) mod M to the 400th iteration of t for each, then farming it out would probably (almost certainly!) save time, but if what you're outsourcing is no more complex than func ForDisplay(x) {return x} (hey, probebly even func ForDisplay(x) {return ToChar(x)}!), in whatever form of code you're actually programming, then all the management involved to do that farming out is going to be the killer, and begs an extraordinary explanation to justify being more efficient (say dedicated monatomic CISC instructions designed precisely to deal with the shared memory reading and writing in the slave threads, with a similar DWIM-like instruction on the MCP) before you can claim any faster operation.

Shades · « **Reply #26 on:** November 04, 2011, 09:51:54 am »

If your having to worry about synchronisation then you'd doing the parallelism incorrectly. When you start going beyond a handful of processes you need to be building in a way that there isn't a need to synchronise for the processed tasks. For the particular problem mentioned once you are past the setup there is no possibility of collision and not need to collate data, results would just be written in order to the memory map.

I agree if you had to outsource func ForDisplay(x) {return x} you would have an issue but if you are in a situation like that you have divided your problem scope poorly. There are of course classes of problems that are non-trivial and potentially impractical to parallelise, but in practice I've not come across one.

Starver · « **Reply #27 on:** November 04, 2011, 10:44:24 am »

Quote from: Shades on November 04, 2011, 09:51:54 am

If your having to worry about synchronisation then you'd doing the parallelism incorrectly. When you start going beyond a handful of processes you need to be building in a way that there isn't a need to synchronise for the processed tasks. For the particular problem mentioned once you are past the setup there is no possibility of collision and not need to collate data, results would just be written in order to the memory map.

I agree if you had to outsource func ForDisplay(x) {return x} you would have an issue but if you are in a situation like that you have divided your problem scope poorly. There are of course classes of problems that are non-trivial and potentially impractical to parallelise, but in practice I've not come across one.

_{(It's "you're"... sorry, makes me shudder when so obvious. Probably an honest mistake, I know I make them. Probably made them myself in the last few posts, in fact. Feel free to highlight my own thinkos and typos as I make them.)}

I'm just wondering how you're writing the parallelism then, because the only thing you can outsource is some form of "return x", in that situation. Or a kind of "return x+=n" thing. And you still need to ensure synchronisation to ensure the cores post their values into the memory (or directly to display) in order. Or are you reserving a 1,000,000 three-byte slots (at least, for values, seven-bytes at least for the character-based output forms), getting the not-needing-to-synchronise cores to fill their own "nx+c" positions, and then flooding it out to the GPU when finished?

(Are one million divided by 'n' shared-memory writes plus the "print this" directive at the end quicker than "increment, send to display memory, increment, send to display memory, ..."? Maybe, if you're not using the most efficient screen-writing techniques in either case, but the loop needed on the first mass-send is what I think'd be the killer for that method, as you're effectively doing "increment memory pointer, send to display memory, increment memory pointer, send to display memory, ..." (so similar to the second method's initial and only phase) after having done the 'faster' (if shared-memory issues and the need to block delay writes don't slow it down, that is... depends on the chip firmware's possibly faster memory-management method) memory-poking exercise.)

I suppose there could be something in the extended instruction set (since MMX there have been monatomic array-operators, at machine-code level, I know, and I imagine the multi-core chipsets have even more specialised additions to the instruction sets that might speed up some segments of inter-core communication and shared-resource access), but I can't help feeling that these need multiple cycles of the core's own microcode handler (compared with the number of cycles needed to execute the incremental sum and then poking out to video ram or the GPU's bus), in a manner normally accepted in a multi-core program by the management and expectations imposed by the synchronising/collating/general control core's processing. And this is where my credulity lies.

Just to explain my thoughts. Not very well, I know, but without knowing exactly where you're coming from I've rather rambled around the conceptual area in which I think your solution resides, and probably widely missed the mark anyway. While I did give higher-level-looking pseudocode, I'm rather still assuming that we're writing as at low as level as we possibly can, to avoid the inevitable inefficiencies of having to run through a compiler that doesn't get to the nub of the issue exactly the way we want it to.

Shades · « **Reply #28 on:** November 04, 2011, 11:05:56 am »

Quote from: Starver on November 04, 2011, 10:44:24 am

_{(It's "you're"... sorry, makes me shudder when so obvious. Probably an honest mistake, I know I make them. Probably made them myself in the last few posts, in fact. Feel free to highlight my own thinkos and typos as I make them.)}

Sorry about that, It is a mistake I make far too much. (Along with the it's/its one)

Quote from: Starver on November 04, 2011, 10:44:24 am

Or are you reserving a 1,000,000 three-byte slots (at least, for values, seven-bytes at least for the character-based output forms), getting the not-needing-to-synchronise cores to fill their own "nx+c" positions, and then flooding it out to the GPU when finished?

Being a memory map I would have been reserving the whole range regardless, so handing each sub process it's own chunk and range to fill seem the easiest solution. Each chunk being self sorted and the chunks themselves in order.
The output in that case is the memory map so it's already in order at the output location. Nothing particularly clever with instruction sets, but solves the problem at hand.

I did mention it would hit an IO problem if displaying to screen was required. Although you could do some magic with modern GPUs by doing both the calculation of the list parts and the rendering itself on the GPU rather than the CPU which avoids the bottle neck of them talking to each other.

As to the OP's question, it is theoretically possible to split any single threaded process across multiple cores like that but I've not seen it done anything outside proof of concept. Certainly a non-trival problem to achieve this with a real program.
For that reason alone it seems much easier for developers to change the way they program (and think) so they deal with discrete tasks rather than threads, unless something major changes in the immediate future they won't really have a choice as all the progress is being made towards many small cores.

Nadaka · « **Reply #29 on:** November 04, 2011, 11:30:12 am »

Shades... No, it is not even close to theoretically possible to split any single threaded process out to multiple processes capable of running in parallel, much less do so while increasing efficiency.

Bay 12 Games Forum

News:

Author Topic: Multi-core concatenation? (Read 3395 times)

dree12

Re: Multi-core concatenation?

kaijyuu

Re: Multi-core concatenation?

thobal

Re: Multi-core concatenation?

Nadaka

Re: Multi-core concatenation?

Levi

Re: Multi-core concatenation?

Starver

Re: Multi-core concatenation?

MorleyDev

Re: Multi-core concatenation?

ChairmanPoo

Re: Multi-core concatenation?

Starver

Re: Multi-core concatenation?

Shades

Re: Multi-core concatenation?

Starver

Re: Multi-core concatenation?

Shades

Re: Multi-core concatenation?

Starver

Re: Multi-core concatenation?

Shades

Re: Multi-core concatenation?

Nadaka

Re: Multi-core concatenation?