Topic: Multi-core concatenation? (Read 3398 times)

Shades · « **Reply #30 on:** November 04, 2011, 11:42:33 am »

Quote from: Nadaka on November 04, 2011, 11:30:12 am

Shades... No, it is not even close to theoretically possible to split any single threaded process out to multiple processes capable of running in parallel, much less do so while increasing efficiency.

Yes, "any" was the wrong term to use. I need to be more careful.

Proof of point: a single-instruction program cannot be split.

MorleyDev · « **Reply #31 on:** November 04, 2011, 11:46:44 am »

Quote from: Starver on November 04, 2011, 08:27:12 am

While the 1..1,000,000 example was perhaps at the extremity, so is yours, at the other end. I don't think anyone would argue against massive parallelism being the best way of dealing with a massive-parallelism-suitable problem. As it is, however, DF is far from the latter, even while it is not as cut'n'dry as the former.

I never claimed DF would gain from massive-parallelism, I was simply demonstrating the point that different tools are better suited for different requirements ^^ Heck, I doubt DF could be parallelised at this point without some massive rewrites and headaches. From what I've seen and read it's in the very nature of a lot of roguelike games and their ilk to have some highly intertwined code after all...

Starver · « **Reply #32 on:** November 04, 2011, 11:50:26 am »

I took the original challenge ("outputs the numbers") as including printing...

Yup, I somewhat concede that if each core got a shot at poking the GPU directly (still, consider the access-blocking and permitting needed while each is doing its half/quarter/sixth-of-a-million pokes) and then the "display page swap" (or whatever you do, these days, on modern GPUs), it might be quicker. I'm worried about the aforementioned shared-memory issue, over the lower bandwidth of the bus towards it, though.

A multicore GPU being given the task, internally, now... Hmmm... Same technique as getting modern graphics cards to do some of the physics calculations (but far simpler, maybe not even needing to 'report back' any more than a cursory "done" to the motherboard). It has probably already been given the firmware for efficiently delineated/shared access to screen memory, and that wouldn't even need a multicore CPU to take advantage, the CPU having abdicated all responsibility beyond the minimum necessary to set the situation up and resolve any "Done!" signals. It would highly depend on the card (and the GPU's core speed/suitability for a non-graphics command set compared to the CPU), of course, assuming I'm not giving them credit for far more autonomy and self-determination than they actually deserve at today's tech level. I know I saw proof-of-concept simulations of the kind of tech I'm describing a few years back, and am rather assuming that it is ubiquitous (though maybe non-unified and proprietary to each manufacturer) on the actual high-end stuff of today.

Maybe we have a plan! Now, who wanted these million numbers printing out again, and when does he/she want it doing by? 'Cos I probably need to source some hardware, get an intense refresher in Assembler, subscribe to the card manufacturer's developer programme, etc... What's this project's budget? Do I get any staff? I positively demand a cool domain name. We even need to work out whether we're comma-delimiting the thousands (with optional dot-delimiting for systems with regional settings specifying that!), whether it's one number per line or space-separated... So much to think about, so much to do, but I'm feeling confident that we can get a 95% bug-free program before we even move it to the beta stage.

Levi · « **Reply #33 on:** November 04, 2011, 11:54:59 am »

Hee hee. I guess that is the other big downside, the complexity and extra time it takes to write programs in parallel. (Although some types of things lend themselves to it).

Nadaka · « **Reply #34 on:** November 04, 2011, 12:04:30 pm »

A better thought experiment would be asking to write a program that outputs the numbers from 1 to 1 million in a random* order without repetition, because it actually touches on some of the problems that arise from attempting to run in parallel on general purpose hardware.

Quote from: Levi on November 04, 2011, 11:54:59 am

Hee hee. I guess that is the other big downside, the complexity and extra time it takes to write programs in parallel. (Although some types of things lend themselves to it).

I don't find it complex at all, though I have been told that I have expert bias.
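To make the random-order variant concrete, here is a minimal sketch, assuming the asterisk on "random*" means a scrambled-looking order is good enough. The obvious approach (pick a random number, check a shared "already used" set) forces every worker to contend for that set. The sketch swaps in a different technique: an affine permutation, where each position maps to a unique value, so workers need no shared state at all. The constants `A` and `B` and every function name are hypothetical, and the result is far from truly random. Under CPython the threads also won't run the arithmetic in parallel, so this shows the structure rather than a speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from math import gcd

N = 1_000_000          # how many numbers to emit
A = 738_191            # hypothetical multiplier; must be coprime with N
B = 271_828            # hypothetical offset; any integer works

# Coprimality guarantees i -> (A*i + B) % N is a bijection on 0..N-1.
assert gcd(A, N) == 1


def scrambled(i: int) -> int:
    """Map position i (0-based) to a unique value in 1..N."""
    return (A * i + B) % N + 1


def chunk(start: int, stop: int) -> list[int]:
    """Each worker computes its own slice of positions, with no shared state."""
    return [scrambled(i) for i in range(start, stop)]


def scrambled_sequence(workers: int = 4) -> list[int]:
    """Return 1..N in a scrambled order, without repetition."""
    step = -(-N // workers)  # ceiling division
    bounds = [(s, min(s + step, N)) for s in range(0, N, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() hands results back in submission order
        parts = pool.map(lambda b: chunk(*b), bounds)
        return [n for part in parts for n in part]
```

The trade-off is that uniqueness comes from the arithmetic instead of from bookkeeping, which is exactly why it parallelises easily and exactly why it is only "random" with an asterisk.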

Levi · « **Reply #35 on:** November 04, 2011, 12:09:45 pm »

Quote from: Nadaka on November 04, 2011, 12:04:30 pm

A better thought experiment would be asking to write a program that outputs the numbers from 1 to 1 million in a random* order without repetition, because it actually touches on some of the problems that arise from attempting to run in parallel on general purpose hardware.

Quote from: Levi on November 04, 2011, 11:54:59 am
Hee hee. I guess that is the other big downside, the complexity and extra time it takes to write programs in parallel. (Although some types of things lend themselves to it).

I don't find it complex at all, though I have been told that I have expert bias.

Well, more complex than doing things as a single process at least. I admit my 1 to a million example is silly, but it does somewhat show how much more thinking a person needs to do compared to the single threaded version.
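A rough sketch of that extra thinking, in Python with hypothetical function names: the single-threaded version is one line, while the chunked version has to decide how to split the range, cover the last uneven chunk, and stitch the pieces back together in order. It builds the text rather than printing it, since the output device is shared either way. Note that CPython threads won't actually speed this up; the point is the bookkeeping, not the timing.

```python
from concurrent.futures import ThreadPoolExecutor


def render_sequential(n: int) -> str:
    """Single-threaded version: one line does all the work."""
    return "\n".join(str(i) for i in range(1, n + 1))


def render_chunk(start: int, stop: int) -> str:
    """Format the half-open range [start, stop) as newline-separated text."""
    return "\n".join(str(i) for i in range(start, stop))


def render_parallel(n: int, workers: int = 4) -> str:
    """Chunked version: split the range, format the pieces, join them in order."""
    if n < 1:
        return ""
    step = -(-n // workers)  # ceiling division so every number is covered
    bounds = [(s, min(s + step, n + 1)) for s in range(1, n + 1, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in submission order, which keeps 1..n sorted
        parts = list(pool.map(lambda b: render_chunk(*b), bounds))
    return "\n".join(parts)
```

Usage would be something like `print(render_parallel(1_000_000))`, with a single writer doing the final output.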

alway · « **Reply #36 on:** November 04, 2011, 12:11:43 pm »

There was actually a talk at RIT this morning about runtime/dynamic parallelization; unfortunately I didn't go to it, but there is plenty of research in that direction and as such it may be pretty typical in the not so distant future.


Though keep in mind any sort of system would almost certainly need to be built in during compilation, and as such would be useless for DF unless it was Toady himself implementing it (as he's the only one with access to the source code and such).

Thief^ · « **Reply #37 on:** November 04, 2011, 12:23:50 pm »

Quote from: janglur on November 03, 2011, 03:54:30 pm

Why? It's not strictly for DF. I have many, many single-threaded apps that need the same boost. Minecraft, Minecraft's Server, this fractal compression tool I'm using that's marvelous but unbelievably slow, etc.

Think of CPU cores like cars, threads as people, and your argument rephrased as "I have two cars, why can't I use both to make me get somewhere faster than I could with one?"...
You can't, but it's much quicker to move ten people to somewhere with two cars than one.
Translating back into software speak: one thread can only ever run on one CPU core at a time. But a dual-core can run a 10-thread program up to twice as fast as a single-core could.
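The analogy can be sketched in code, under the assumption that `step` is a hypothetical stand-in for one unit of real work. In `one_long_trip` every step needs the previous result, so a second core has nothing to start on. In `many_trips` the chains are independent, so they can be handed out to however many "cars" exist. (With CPython threads the hand-out is real but the speedup is not, because of the interpreter lock; a process pool or another language would be needed to see it.)

```python
from concurrent.futures import ThreadPoolExecutor


def step(x: int) -> int:
    """One unit of work (hypothetical stand-in for a game tick or similar)."""
    return (x * 31 + 7) % 1_000_003


def one_long_trip(seed: int, steps: int) -> int:
    """One 'person': each step depends on the last, so it cannot be split."""
    x = seed
    for _ in range(steps):
        x = step(x)
    return x


def many_trips(seeds: list[int], steps: int, cars: int = 2) -> list[int]:
    """Ten 'people': independent chains can be spread over separate workers."""
    with ThreadPoolExecutor(max_workers=cars) as pool:
        return list(pool.map(lambda s: one_long_trip(s, steps), seeds))
```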

Eagleon · « **Reply #38 on:** November 04, 2011, 12:35:06 pm »

Quote from: Thief^ on November 04, 2011, 12:23:50 pm

Think of CPU cores like cars, threads as people, and your argument rephrased as "I have two cars, why can't I use both to make me get somewhere faster than I could with one?"...
You can't, but it's much quicker to move ten people to somewhere with two cars than one.

Just blow up one car behind the other to get a boost in acceleration. Then claim the insurance money to buy a faster car.

Jay · « **Reply #39 on:** November 04, 2011, 04:48:58 pm »

Regarding multiple cores' usefulness - Hell no, it's not useless to get more than four cores.
Your computers, although you don't generally KNOW this, tend to run upwards of 70 threads at a time. Despite most of these being sleeper threads, you can still easily get a boost in performance on any one application by emptying relevant cores of these threads. They still take CPU time to run, just a negligible amount under normal circumstances. That time adds up, though.
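For the curious, dedicating a core can be approximated from code. This is a minimal sketch, assuming Linux, where Python exposes `os.sched_getaffinity` and `os.sched_setaffinity`; the helper name `run_pinned` is hypothetical. It only restricts the calling process to one core and does not move other processes off that core, which would need the same treatment applied to them.

```python
import os


def run_pinned(work, cpu=None):
    """Run work() with this process restricted to a single CPU, then restore.

    Returns (result, cpus_used); cpus_used is None where affinity
    control is not available (e.g. macOS, Windows).
    """
    if not hasattr(os, "sched_setaffinity"):
        return work(), None
    original = os.sched_getaffinity(0)         # 0 means "the calling process"
    target = min(original) if cpu is None else cpu
    os.sched_setaffinity(0, {target})
    try:
        return work(), os.sched_getaffinity(0)
    finally:
        os.sched_setaffinity(0, original)      # hand the other cores back
```

Something like `run_pinned(lambda: sum(range(10_000_000)))` would run the sum on one core only.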

You cannot get a speed boost by using multiple cores for one thread. Full stop. Unless programmed with threading in mind, stuff always happens in sequence, so only one of your cores that you've dedicated to the application will actually run at a time, wasting the cycles of the others.
This is also why hyperthreading is a gimmick. You can emulate more cores, sure, but it's still being done in sequence, so you've actually just made what amounts to two cores at half the speed, which is rarely relevant.

Regarding the future: Sure, it's in multiple cores. But so are the applications. DF will get there "eventually", and the modders are working on Minecraft (see: Optifine). Stuff is getting there. It takes time to develop. Same way IPv6 is such a joke right now. Older applications that don't receive updates of that caliber, sure, you'll be stuck with the same number of threads. But again, more cores means you can absolutely dedicate the cores it needs.

Contrary to popular belief, core speed is not "stuck". The current trend is towards heat efficiency. With that in mind, remember that core speed is in direct balance with heat output. You can almost always increase speed, so long as you can deal with that heat.
Ivy Bridge's new design involves more efficient transistors, which will deflect the overall heat "balance" downwards, allowing either more power efficiency (which is how it's advertised, usually), or more core speed.

Virex · « **Reply #40 on:** November 04, 2011, 05:27:49 pm »

I thought the problem with having too many cores was that eventually your bus gets "clogged" and the cores are spending more and more time waiting until they can access the memory? To solve that you'd need to get a faster bus or you'd need a memory type that supports parallel access and both are pretty expensive.

Also, core speed is going to get a jump very soon as chip manufacturers are bringing their 20-nm production systems on-line. And don't forget that it's not just clock speed that matters, there's something to be gained when it comes to a processor's microarchitecture, as Intel proved with its Sandy Bridge architecture (which is the reason an i7 is faster than most other processors with the same clock speed).

Tellemurius · « **Reply #41 on:** November 04, 2011, 09:57:26 pm »

You mean with the brand new tri-gate transistors?

janglur · « **Reply #42 on:** November 06, 2011, 01:43:36 am »

Quote from: Virex on November 04, 2011, 05:27:49 pm

I thought the problem with having too many cores was that eventually your bus gets "clogged" and the cores are spending more and more time waiting until they can access the memory? To solve that you'd need to get a faster bus or you'd need a memory type that supports parallel access and both are pretty expensive.

Also, core speed is going to get a jump very soon as chip manufacturers are bringing their 20-nm production systems on-line. And don't forget that it's not just clock speed that matters, there's something to be gained when it comes to a processor's microarchitecture, as Intel proved with its Sandy Bridge architecture (which is the reason an i7 is faster than most other processors with the same clock speed).

Glad you mention this. This is addressed by AMD, although only partially. Since the Phenom II, the processors actually have two parallel memory controllers. In my BIOS I can set them to 'ganged' or 'unganged', which (loosely translated, as I understand it) determines whether the controllers work in tandem, each as a single-channel memory bus, or synchronously as a sort of quasi-quad-channel. The result, as I have found, is actually more variable than anticipated. Dwarf Fortress gets a (small) boost from using the Unganged setting over Ganged, along with many other applications, whereas similar programs, like Minecraft, prefer Ganged. Ultimately, it boils down to whether the application is multithreaded AND doing a metric fuckton of read/write activity, or not. Programs that consume and thrash memory much prefer the controllers working asynchronously, and thus able to read/write in dual channel across four chips (if you have four), over the alternative where it sequences the read/write activity. Programs that do large reads/writes but in relatively low quantity, on the other hand, will benefit from the increased throughput of Unganged. In both cases, the gain or loss is relatively marginal, in the 3-7% range, which falls short of the '15% noticed' rule. (The rule states that anything must be faster or slower by an overall 15% for it to really be noticeable by the user.) Ultimately it's a tweak only for people who feel the need to tackle it rather than addressing the main problem a bit more directly: you need faster memory architecture!
In any case, AMD plans the next-gen of their 8-cores (that is, the ones AFTER the current 8-core Phenoms) to implement quad memory controllers with true octo-channel capabilities [Not the quasi-QDR currently used by them or the quadi-SDR Intel boasts]. This QDR step is definitely in the right direction, but there's no definitive word on when it will happen exactly.
Multiplexing seems to be the name of the technology game. At this rate, by 2020 we can expect to see 8-32 core CPUs with a separate memory controller for every DIMM, maybe two if they ever perfect true QDR on an electrical level rather than just logical.

[Note: I may have the Ganged/Unganged reversed, I'd have to check and it's too late for me to care enough to.]
