Topic: Hope is on the horizon - one day FPS drops might be a thing of the past! (Read 8276 times)

wierd · « **Reply #30 on:** September 19, 2016, 04:36:47 pm »

I think they are confused. It is the epitome of centralized, because it is the big iron + thin client model all over again, and certainly not new.

But just throw the word "cloud!" on it, and the PHBs won't know it isn't new or awesome.

SpoCk0nd0pe · « **Reply #31 on:** September 19, 2016, 06:11:58 pm »

Quote from: wierd on September 18, 2016, 11:54:47 am

FPS drop in DF is not usually caused by CPU speed, but by RAM access speed.

Modern CPUs are already much faster than the switching speed of most RAM modules (yes, even DDR4). This means that if the program does lots of things with RAM (like DF does), it will spend lots and lots of time waiting around for the memory bus to be free so it can perform another memory operation.

DF uses very large data structures in RAM (like the creature and item index vectors), which get bigger the more dwarves or items you have in your fortress. There are limits to how much data one can pull in or out in a single memory read or write. This is usually called the "word size" and is processor and architecture dependent. 64-bit processors have larger word sizes, which is one reason (among many!) that the 64-bit version of the game is so well received; it likely gives a performance increase on the same hardware.

Could the waiting problem be reduced by pulling different things from RAM in the meantime? E.g. by multithreading (for things that make sense to multithread) or by using some kind of queue?

Or is it more of a bus width problem?

I probably only have a superficial understanding of this, but I'm curious.

[edit] I really, really hope Intel releases another L4 cache CPU. If I understand things correctly, that would be more geared towards this kind of problem. Something like the quad-core Kaby Lake-X, but with eDRAM.

wierd · « **Reply #32 on:** September 19, 2016, 06:43:09 pm »

The speed of RAM is usually the FSB speed, by definition.

On modern systems, this is around 400 MHz (up to 800 MHz).
The CPU runs at 1.2 to 3.5 GHz, depending on the chip. Depending on the pairing, the CPU clock is anywhere from about 1.5x to almost 9x faster than the RAM.
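
A rough sketch of that arithmetic, taking the clock figures in this post at face value. The constant and function names are made up for illustration, and clock ratio alone ignores real memory latency, so treat it as a lower bound on how long the CPU waits.

```python
# Clock figures as quoted in the post; illustrative, not measured.
RAM_CLOCK_MHZ = (400, 800)
CPU_CLOCK_MHZ = (1200, 3500)


def clock_ratio(cpu_mhz: float, ram_mhz: float) -> float:
    """Return how many CPU cycles elapse during one memory-bus cycle."""
    return cpu_mhz / ram_mhz


# Slowest CPU paired with the fastest RAM, and the reverse.
smallest_gap = clock_ratio(min(CPU_CLOCK_MHZ), max(RAM_CLOCK_MHZ))
largest_gap = clock_ratio(max(CPU_CLOCK_MHZ), min(RAM_CLOCK_MHZ))

print(f"CPU cycles per bus cycle: {smallest_gap:.2f} to {largest_gap:.2f}")
```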

Many times, the results of an operation do not need to go into RAM to be used by the next computation, as the value can be stored inside the CPU in one of the registers. The problem is that the data structures used by DF are enormous, and in many cases are bigger than the register size. This means the compiler can't use such tricks to optimize the binary.

Cache can operate at the internal speed of the CPU, because it is on-die, and does not have the signal propagation speed issues of going out onto the memory bus.

Because of this, some data can be stored in the cache, but cache is shared by all active tasks running on that chip, including those on a different core. A cache miss forces a fetch cycle.

The speed difference comes from the decoherence that happens when driving signals at gigahertz frequencies. The trace length before decoherence sets in is usually on the order of centimeters. Basically, beyond a certain distance the data being sent to and from the RAM gets all jumbled by RF noise from the wire itself. This distance is so prohibitive that it actually poses design problems inside the CPU itself. To overcome this problem, the RAM runs much slower than the chip's internal speed.

L2 and L3 cache is typically on die, but distance to the ALUs is where the distinction gets made. The areas closest to the ALUs are prime real estate, and L1 cache gets priority there. This is why L1 is usually so small. L2 is on die, but just outside the part used for all the cores. L3 is outside the area used for any coprocessors, and is shared by all cores. L4 is off die, and suffers from propagation delays. It is usually 2x the FSB speed though.

The issues inherent with driving signals at these frequencies are very much at the heart of this issue.

The size of the word controls how much data can get in or out of the memory bus on a read or write, not how long the CPU waits between operations. E.g., the city bus arrives and departs every 10 minutes. But if a bigger bus is put into service, twice as many patrons can ride per arrival/departure.
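
The bus analogy can be put into numbers with a small sketch. The 400 MHz transfer rate is only the figure used earlier in this post, and both function names are hypothetical: doubling the word size doubles bytes moved per second, while the wait between transfers stays the same.

```python
def bytes_per_second(word_bits: int, transfers_per_second: float) -> float:
    """Data moved per second: passengers per bus times buses per second."""
    return word_bits / 8 * transfers_per_second


def wait_between_transfers_ns(transfers_per_second: float) -> float:
    """Time between bus departures; independent of how big the bus is."""
    return 1e9 / transfers_per_second


BUS_RATE = 400e6  # transfers per second, assumed for illustration

for word_bits in (32, 64):
    throughput = bytes_per_second(word_bits, BUS_RATE)
    wait = wait_between_transfers_ns(BUS_RATE)
    print(f"{word_bits}-bit: {throughput / 1e9:.1f} GB/s, {wait:.1f} ns between transfers")
```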

jecowa · « **Reply #33 on:** September 19, 2016, 06:48:47 pm »

Quote from: wierd on September 19, 2016, 06:43:09 pm

The size of the word controls how much data can get in or out of the memory bus on a read or write, not how long the CPU waits between operations. E.g., the city bus arrives and departs every 10 minutes. But if a bigger bus is put into service, twice as many patrons can ride per arrival/departure.

With more people riding on the bus, do you think the 64-bit version of Dwarf Fortress v0.43.05 should be faster than the 32-bit version of Dwarf Fortress v0.43.05?

wierd · « **Reply #34 on:** September 19, 2016, 06:52:08 pm »

Only if the 64-bit version makes use of the bigger word size.

If it only requests 32-bit reads and writes per cycle, instead of dispatching double-width reads and writes per cycle, the bus is only half full of patrons.

Other issues: the data requested must be contiguous in RAM. It can't be at separate addresses. If the structures don't align, you have bad performance.
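
A minimal sketch of the alignment point, assuming an 8-byte (64-bit) bus word and made-up addresses. The helper name is hypothetical; it just counts how many bus words a read of a given size has to touch.

```python
def bus_words_touched(address: int, size: int, word_bytes: int = 8) -> int:
    """Count the bus words covered by `size` bytes starting at `address`."""
    first_word = address // word_bytes
    last_word = (address + size - 1) // word_bytes
    return last_word - first_word + 1


# Two adjacent 4-byte values starting on an 8-byte boundary: one full bus.
print(bus_words_touched(address=16, size=8))  # aligned pair

# The same 8 bytes starting 4 bytes off the boundary straddle two words.
print(bus_words_touched(address=20, size=8))  # misaligned pair

# Two 4-byte values at unrelated addresses need one half-empty bus each.
print(bus_words_touched(16, 4) + bus_words_touched(40, 4))
```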

SpoCk0nd0pe · « **Reply #35 on:** September 19, 2016, 07:07:28 pm »

But if most operations do not fit into L3, it would be beneficial to pull several different things at once, right? So if you could use 4 cores equally, the calculation should be 4 times faster, minus the extra time lost to more probable cache misses (the data pulled by the 3 other threads replacing data that might otherwise have been a cache hit)?

wierd · « **Reply #36 on:** September 19, 2016, 07:09:48 pm »

That is multithreading. DF does not multithread. It becomes hard for the core process to know whether the results from other cores have arrived or not. This is called concurrency. To prevent one core from scrambling the data in cache that is being used by another core, things like spinlocks and mutexes are needed, which prevent a process from doing that. This slows things down, and adds complexity.
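
For anyone who has not seen a mutex, here is a minimal sketch in Python (DF itself is not written this way; the function and variable names are made up). Every worker has to take the lock before touching the shared counter, which keeps the result correct but forces the workers to queue up.

```python
import threading


def run_counter(workers: int, increments: int) -> int:
    """Have several threads bump one shared counter, guarded by a mutex."""
    total = 0
    lock = threading.Lock()

    def work() -> None:
        nonlocal total
        for _ in range(increments):
            # Only one thread at a time may pass this point.
            with lock:
                total += 1

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return total


print(run_counter(workers=4, increments=10_000))
```

The lock is the extra cost being described: the work is spread over four threads, but the guarded section still runs one thread at a time.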

Concurrency can be nasty, especially for programs that were never made to have to deal with it. Toady is not a masochist, and does not want to go there. His program, his choices.

Instead:

The compiler uses branch prediction to help determine what the code execution path will be, and makes compiler optimisations to minimize fetch cycles. Things like predicting if two operations will read adjacent addresses in 32bit sized chunks, and preemptively grabbing that adjacent data with a 64bit read, then proceeding with a cache hit for that data when the code does need it. Modern compilers do a pretty good job of this. A cache miss is still a cache miss though.
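
The "grab the adjacent data too" idea can be illustrated with a toy model. It makes no claim about which part of the system does the grabbing; it only counts fetches from RAM when each fetch brings in a larger chunk. The function name and addresses are invented, and real caches usually fetch whole cache lines (commonly 64 bytes) rather than 8.

```python
def count_misses(addresses, fetch_bytes: int) -> int:
    """Count RAM fetches when every miss pulls in a `fetch_bytes` chunk."""
    cached_chunks = set()
    misses = 0
    for address in addresses:
        chunk = address // fetch_bytes
        if chunk not in cached_chunks:
            misses += 1
            cached_chunks.add(chunk)
    return misses


# Sixteen adjacent 32-bit (4-byte) values read in order.
adjacent_reads = list(range(0, 64, 4))

print(count_misses(adjacent_reads, fetch_bytes=4))  # one fetch per value
print(count_misses(adjacent_reads, fetch_bytes=8))  # every second read is a hit
```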

jecowa · « **Reply #37 on:** September 19, 2016, 07:18:47 pm »

Quote from: wierd on September 19, 2016, 06:52:08 pm

Only if the 64-bit version makes use of the bigger word size.

If it only requests 32-bit reads and writes per cycle, instead of dispatching double-width reads and writes per cycle, the bus is only half full of patrons.

Other issues: the data requested must be contiguous in RAM. It can't be at separate addresses. If the structures don't align, you have bad performance.

So Dwarf Fortress probably doesn't make use of bigger words since they are still compiling 32-bit versions. You don't think the compiler could fix that automagically for them, do you?

Quote from: SpoCk0nd0pe on September 19, 2016, 07:07:28 pm

But if most operations do not fit into L3, it would be beneficial to pull several different things at once, right? So if you could use 4 cores equally, the calculation should be 4 times faster, minus the extra time lost to more probable cache misses (the data pulled by the 3 other threads replacing data that might otherwise have been a cache hit)?

If the bottleneck is moving data from the RAM to the cache, I don't think a multi-threaded Dwarf Fortress would be faster. Maybe quantum magic will allow the CPU cache to be far away but still really fast so we can have 16 GB of quantum cache.

wierd · « **Reply #38 on:** September 19, 2016, 07:26:46 pm »

The compiler can do the best it can to predict the execution path, and read data into cache preemptively, yes. But that only gets you so far when the cache is shared by multiple processes, such as on a multi-core chip.

That is one of the reasons I suggested a board with physically discrete CPUs rather than a multi-core chip (or a multi-chip, multi-core system with all other cores idle). That way there is no cache contention between processes.

SpoCk0nd0pe · « **Reply #39 on:** September 19, 2016, 07:28:36 pm »

The idea in my head looks like this:

With one thread, you do x+y, then you decide you need to do +z. z is too big for the cache, so the cpu has to wait for the ram.

With 100 threads (hypothetical number), there is already another thread where z is already fetched from ram, so the cpu isn't idle. It can do something else. Even all 4 CPUs (or how many you have) can do something else.

Would I run into some other bottleneck? Or would cache contention negate all the benefits from having multiple threads?
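
The question can be framed as a toy model. All numbers below are invented, the function name is hypothetical, and the model assumes perfect overlap with no synchronization cost, so it is an upper bound on the benefit rather than a prediction. Extra threads hide the wait for RAM only until the memory bus itself is saturated.

```python
def estimated_time_ns(items, fetch_ns, compute_ns, bus_ns, threads):
    """Best-case run time when `threads` fetches can be in flight at once.

    fetch_ns:   how long one item waits for RAM
    compute_ns: how long the CPU works on one item
    bus_ns:     how long one fetch occupies the memory bus
    """
    latency_bound = items * (fetch_ns + compute_ns) / threads
    compute_bound = items * compute_ns
    bus_bound = items * bus_ns
    return max(latency_bound, compute_bound, bus_bound)


for threads in (1, 4, 100):
    time_ns = estimated_time_ns(
        items=1000, fetch_ns=100, compute_ns=10, bus_ns=50, threads=threads
    )
    print(f"{threads:>3} threads: {time_ns:.0f} ns")
```

With these made-up numbers, going from 1 to 4 threads helps, but going from 4 to 100 does not, because the bus term has become the limit.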

wierd · « **Reply #40 on:** September 19, 2016, 07:30:30 pm »

Again, keeping the threads synchronized so they don't walk all over each other is a nontrivial exercise.

Toady does not want to do that. His program, his decision.

Also yes: cache contention. Each core shares L3 and L4 caches with the other cores on the die. They can push data out of the cache, leading to more cache misses if not tightly managed.
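
Cache contention is easy to reproduce in a toy simulation. This assumes a tiny 8-entry cache with least-recently-used eviction, which is a simplification of real hardware; the task names and sizes are made up.

```python
from collections import OrderedDict


def hit_rate(accesses, capacity: int) -> float:
    """Fraction of accesses served from a small LRU cache."""
    cache = OrderedDict()
    hits = 0
    for key in accesses:
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(accesses)


# One task looping over 6 items fits in the 8-entry cache.
alone = [("df", i % 6) for i in range(600)]

# Two such tasks interleaved need 12 entries and keep evicting each other.
shared = []
for i in range(600):
    shared.append(("df", i % 6))
    shared.append(("other", i % 6))

print(f"alone:  {hit_rate(alone, capacity=8):.2f}")
print(f"shared: {hit_rate(shared, capacity=8):.2f}")
```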

SpoCk0nd0pe · « **Reply #41 on:** September 19, 2016, 07:34:20 pm »

Quote from: wierd on September 19, 2016, 07:30:30 pm

Again, keeping the threads synchronized so they don't walk all over each other is a nontrivial exercise.

Toady does not want to do that. His program, his decision.

I can imagine that is quite difficult to do.

Did he post or say he doesn't want to do it? Or could it be just low on his priority list, kind of a current non-issue but maybe in the future?

Thanks for all the explaining btw!

jecowa · « **Reply #42 on:** September 19, 2016, 07:36:34 pm »

Quote from: wierd on September 19, 2016, 07:30:30 pm

Also yes: cache contention. Each core shares L3 and L4 caches with the other cores on the die. They can push data out of the cache, leading to more cache misses if not tightly managed.

So does this mean you think a hypothetical multi-threaded Dwarf Fortress would be slower since it would likely have more cache misses?

wierd · « **Reply #43 on:** September 19, 2016, 07:49:50 pm »

Depends on cache size, number of other processes being run, and number of threads DF would be using. Just saying more is not always better.

Miuramir · « **Reply #44 on:** September 21, 2016, 01:21:07 pm »

Quote from: SpoCk0nd0pe on September 19, 2016, 07:28:36 pm

The idea in my head looks like this:

With one thread, you do x+y, then you decide you need to do +z. z is too big for the cache, so the cpu has to wait for the ram.

With 100 threads (hypothetical number), there is already another thread where z is already fetched from ram, so the cpu isn't idle. It can do something else. Even all 4 CPUs (or how many you have) can do something else.

But fetched from RAM to *where*? The whole point is that all of the on-die cache put together is smaller than some of the individual data structures DF may need to iterate over. So it has to read a piece from system RAM, work on it, then send it back and get the next piece. In general terms, IF this is the limiting factor, there are four sorts of hypothetical improvements one can make to that sort of process:

* Reduce the size of the data involved so it fits into on-die cache. Usually not applicable; but at some future point we might find that with continuing improvements in hardware, a no-history 1x1 hermit fort or something would fit entirely on the chip, and it would run dramatically faster than larger ones. Few improvements expected here in the near future.

* Reduce the number of different passes needed to be done over the data per "tick". This is largely a compiler optimization issue, although turning features like temperature and weather off can help if the user is willing to accept the reduction in quality. Toady jumping up a couple of compiler versions may have brought some improvement here, although I'd not expect further huge gains.

* Significant improvements in the amount of low-level cache; this is Intel's responsibility, and it's not likely to get huge jumps. Theoretically, you could design a variant CPU package that would normally have a large number of compute cores, but instead had only a few cores swimming in a huge L2 / L3 cache. It would be a niche product, however.

* Improvements in the overall speed of system RAM access. This is an area where moderate improvement is likely due to continuing progress at Intel, but dramatic improvement is not expected. This could be improvements in the FSB speed, the actual RAM latency, the memory controller, or whatever.
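
The first two bullets can be put into rough numbers with a toy model. The sizes are invented, the function name is hypothetical, and it assumes a cold cache at the start of each tick plus the worst case where data larger than the cache is re-read in full on every pass.

```python
MIB = 1024 * 1024


def bytes_fetched_per_tick(data_bytes: int, cache_bytes: int, passes: int) -> int:
    """Estimate RAM traffic for `passes` sweeps over one data structure."""
    if data_bytes <= cache_bytes:
        # Fits on the chip: loaded once, later passes hit the cache.
        return data_bytes
    # Too big: by the time a pass wraps around, the start has been evicted.
    return data_bytes * passes


cache = 8 * MIB

print(bytes_fetched_per_tick(64 * MIB, cache, passes=5) // MIB)  # big fort, 5 passes
print(bytes_fetched_per_tick(64 * MIB, cache, passes=1) // MIB)  # same fort, passes fused
print(bytes_fetched_per_tick(4 * MIB, cache, passes=5) // MIB)   # tiny fort that fits
```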

Again, we don't know for sure whether DF is thus constrained, or more likely what fraction of the time DF is thus constrained. And this will depend on the rest of the computer as well. It's entirely possible that DF is CPU limited on weaker processors, and RAM access limited on stronger ones, for a very simple example. And anything that gets anywhere near a disk, even an SSD, is so ridiculously slow by comparison that even momentary glitches there can have significant effects on overall speed. This discussion assumes you already have enough system RAM to hold all of DF's code, all of DF's working data, and all of your operating system, web browser, music player, and whatever else; otherwise they will be stealing precious slices of memory access bandwidth with the very slow process of paging or streaming.
