Bay 12 Games Forum


Author Topic: System Memory issues to consider on a 'refactoring' of DF memory handling  (Read 3564 times)

Telgin

  • Bay Watcher
  • Professional Programmer
    • View Profile
Re: System Memory issues to consider on a 'refactoring' of DF memory handling
« Reply #30 on: December 13, 2011, 02:25:26 am »

Assuming Toady is using more or less typical data structures and algorithms, I imagine the memory access patterns are reasonable.  What's DF programmed in anyway?  C++?  He's probably using STL or Boost containers if so, and they're very well written to keep that in line.  Of course, if they're sufficiently large and he doesn't organize them well, that doesn't matter...

Anyway, I've heard that Toady has stated he doesn't know of any way to parallelize DF to make it faster.  Presumably he knows how to write parallel code, just not in a way that's useful for DF.  I'm wondering how true that really is.  Depending on exactly how he handles time steps, it could be as simple as a worker thread pool that does the updates, which should dramatically improve FPS...
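For what it's worth, a minimal sketch of that kind of worker pool in C++ might look like this. Purely hypothetical, since none of us have seen the real timestep code; `Creature` and `parallel_tick` are made-up names, and it only works if each update touches independent state:

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical per-creature state; DF's real structures are unknown.
struct Creature {
    int x = 0;
    int hunger = 0;
};

// One simulation tick: split the creature list into contiguous chunks and
// update each chunk on its own thread.  Assumes num_threads > 0.  Any
// cross-creature interaction would need locking or a separate resolution
// pass, which is exactly the collision problem with non-cooperating threads.
void parallel_tick(std::vector<Creature>& creatures, unsigned num_threads) {
    std::vector<std::thread> workers;
    const std::size_t chunk =
        (creatures.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = t * chunk;
        const std::size_t end = std::min(begin + chunk, creatures.size());
        if (begin >= end) break;
        workers.emplace_back([&creatures, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                creatures[i].x += 1;       // stand-in for movement logic
                creatures[i].hunger += 1;  // stand-in for needs logic
            }
        });
    }
    for (auto& w : workers) w.join();
}
```

Spawning threads every tick is wasteful; a real version would keep the pool alive between ticks, but the partitioning idea is the same.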

It's going to pretty much take a rewrite of the timestep code if he does that, and handling potential collisions between object states from non-cooperating threads is going to make it more complicated.  Not impossible, though.  All the more reason I think he should focus on it before making too many more fundamental changes.

All speaking with no knowledge of the source code, of course.  And I can understand his trepidation: I've written much simpler things that I dreaded going back to alter in much less fundamental ways.
Logged
Through pain, I find wisdom.

Romaq

  • Bay Watcher
    • View Profile
Re: System Memory issues to consider on a 'refactoring' of DF memory handling
« Reply #31 on: December 13, 2011, 07:16:03 am »

One change I thought of would be to have a separate 'display' fork from the running code. Simply turn the 'do stuff' DF process into a server that puts state changes into a queue and reads from a command queue (things like 'build' or 'designate' that force a halt to the running 'do stuff' thread).

Then you would have a separate thread whose mission is simply to display state changes and interpret 'change display' requests (level up/down, move the display in a compass direction, alter the display size). Even if your 'do stuff' thread had a frame rate of 10 FPS, your display thread (running on its own core) would cheerfully display at 50 FPS, and if you gave it a 'build/designate/pause' command, it would receive the command right away and be able to throw a HALT signal to the 'do stuff' process. The processes could run on separate cores with totally separate memory. They would have a common I/O queue to communicate state changes and commands. Both processes would still have to be aware of what memory they have to work with, so as to not CTD on the memory boundary.
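The command-queue half of this idea can be sketched in C++ with a mutex-protected queue; all names here are made up for illustration, not anything from DF's actual code:

```cpp
#include <mutex>
#include <optional>
#include <queue>
#include <string>

// Hypothetical command queue between a 'display' thread and the 'do stuff'
// simulation thread.  The display thread pushes commands ("pause",
// "designate", ...); the simulation thread drains the queue at the top of
// each tick, so a pause takes effect right away even if the tick rate is low.
class CommandQueue {
public:
    void push(std::string cmd) {
        std::lock_guard<std::mutex> lock(m_);
        q_.push(std::move(cmd));
    }

    // Returns the next command, or nothing if the queue is empty.
    std::optional<std::string> pop() {
        std::lock_guard<std::mutex> lock(m_);
        if (q_.empty()) return std::nullopt;
        std::string cmd = std::move(q_.front());
        q_.pop();
        return cmd;
    }

private:
    std::mutex m_;
    std::queue<std::string> q_;
};
```

The state-change queue going the other way would look similar, though snapshotting enough of the map for the display thread to render from is the harder part.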

Just ideas. I really don't know what the hell I'm talking about. Other than... it bites hitting a CTD on 2 GB of memory on a 9 GB machine with plenty of open real memory space.
Logged

King Mir

  • Bay Watcher
    • View Profile
Re: System Memory issues to consider on a 'refactoring' of DF memory handling
« Reply #32 on: December 13, 2011, 07:56:59 am »

Assuming Toady is using more or less typical data structures and algorithms, I imagine the memory access patterns are reasonable.  What's DF programmed in anyway?  C++?  He's probably using STL or Boost containers if so, and they're very well written to keep that in line.  Of course, if they're sufficiently large and he doesn't organize them well, that doesn't matter...
I'm sure the containers are decently implemented, but cache use optimization isn't as simple as making sure to use good STL implementations. It's more about size and organisation. The fact that DF manages to use 2GB sometimes doesn't help, because it means you can't use more memory for speedup.

One change I thought of would be to have a separate 'display' fork from the running code. Simply turn the 'do stuff' DF process into a server that puts state changes into a queue and reads from a command queue (things like 'build' or 'designate' that force a halt to the running 'do stuff' thread).

Then you would have a separate thread whose mission is simply to display state changes and interpret 'change display' requests (level up/down, move the display in a compass direction, alter the display size). Even if your 'do stuff' thread had a frame rate of 10 FPS, your display thread (running on its own core) would cheerfully display at 50 FPS, and if you gave it a 'build/designate/pause' command, it would receive the command right away and be able to throw a HALT signal to the 'do stuff' process. The processes could run on separate cores with totally separate memory. They would have a common I/O queue to communicate state changes and commands. Both processes would still have to be aware of what memory they have to work with, so as to not CTD on the memory boundary.

Just ideas. I really don't know what the hell I'm talking about. Other than... it bites hitting a CTD on 2 GB of memory on a 9 GB machine with plenty of open real memory space.
That may be a good idea, but as a separate thread, not a process, because of the need to share large amounts of data. That means it wouldn't help with the 2GB limit, though it could potentially help the frame rate.

Romaq

  • Bay Watcher
    • View Profile
Re: System Memory issues to consider on a 'refactoring' of DF memory handling
« Reply #33 on: December 13, 2011, 08:38:46 am »

Yeah. I can't really guess how the data is managed, and that aspect is still way out of anything I really know anything about. Other than CTD. I don't like CTDs. I know that much. :)
Logged

thisisjimmy

  • Bay Watcher
    • View Profile
Re: System Memory issues to consider on a 'refactoring' of DF memory handling
« Reply #34 on: December 13, 2011, 06:33:41 pm »

One change I thought of would be to have a separate 'display' fork from the running code.

Yeah, running the user interface in a separate thread is fairly standard and is good practice. It'll likely take a lot of work to retrofit onto code that's been in the making for almost a decade now.  And of course, this just keeps the game feeling responsive; it doesn't actually speed up the simulation.

Assuming Toady is using more or less typical data structures and algorithms, I imagine the memory access patterns are reasonable.  What's DF programmed in anyway?  C++?  He's probably using STL or Boost containers if so, and they're very well written to keep that in line.  Of course, if they're sufficiently large and he doesn't organize them well, that doesn't matter...
I'm sure the containers are decently implemented, but cache use optimization isn't as simple as making sure to use good STL implementations. It's more about size and organisation. The fact that DF manages to use 2GB sometimes doesn't help, because it means you can't use more memory for speedup.

This is very much speculation. Not only have we not profiled the code, we haven't even seen the code.  We don't even know what algorithms or data structures are used.

But I'll make a few guesses of my own, anyway.
  • First, maps are about 48x48 tiles per embark zone square, and about 150 tiles in depth.  Memory used seems to be proportional to map size, with somewhere around 35 bytes per tile being used. 
  • The vast majority of tiles aren't accessed regularly, or the framerate would be abysmal.  Most tiles on a map are filled solid.
  • Pathfinding is expensive and uses some form of A*.  There are probably only a few thousand often-searched tiles (mainly, inside your fortress).  A* is also likely to access adjacent memory locations.
  • Waterfalls and fluid dynamics can also cause slowdowns.  These also tend to occur in a localized area and repeatedly update the same few tiles.
So overall, I see no particular reason to assume it's memory bound.  It could be, but I wouldn't assume so without profiling.
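Taking those per-tile estimates at face value, the memory arithmetic is easy to work through. For a 4x4 embark it comes out around 185 MiB of tile data (assuming the figures above are right; the function name is just mine):

```cpp
#include <cstdint>

// Back-of-the-envelope tile memory, using the estimates from this thread:
// 48x48 tiles per embark square, ~150 z-levels, ~35 bytes per tile.
std::uint64_t tile_memory_bytes(int embark_squares_wide,
                                int embark_squares_high) {
    const std::uint64_t width  = embark_squares_wide * 48;
    const std::uint64_t height = embark_squares_high * 48;
    const std::uint64_t depth  = 150;
    const std::uint64_t bytes_per_tile = 35;
    return width * height * depth * bytes_per_tile;
}
```

So a 4x4 embark is 192 * 192 * 150 = ~5.5 million tiles, times 35 bytes = ~193 MB. Tile data alone is nowhere near the 2GB limit; whatever eats the rest of the address space is items, units, histories, or fragmentation.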
Logged

King Mir

  • Bay Watcher
    • View Profile
Re: System Memory issues to consider on a 'refactoring' of DF memory handling
« Reply #35 on: December 13, 2011, 10:31:44 pm »

Yeah, my guess that it's memory bound is speculation, and your guess may be better than mine. Some of it could be tested by swapping out hardware, but nobody is likely to do much research into it that way.

As a few points to back up my speculation:
--In general, things in DF that are used often have more detail, which means larger object sizes. But any particular process would only access one of these details at a time, which effectively spaces out memory access.
--Many quantities in DF are variable in size, which means another level of indirection and non-locality.
--35 bytes per tile is over half of a typical 64-byte cache line, so if that estimate is accurate, even querying adjacent tiles is likely to result in a cache miss. Of course, it's possible that tile information isn't stored in one big "array", but that would be less intuitive to code, so it's unlikely to happen except as an optimization.
--3-dimensional arrays can only have one axis on which conceptually adjacent tiles are physically adjacent, so at best path-finding isn't cache-optimal for two of the three dimensions.
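The last point is easy to see with flat 3D indexing, the usual way to store such an array in C++ (the sizes are just the earlier estimates, and this may not resemble how DF actually stores tiles):

```cpp
#include <cstddef>

// A 3D map stored as one flat array keeps only one axis contiguous.
// With x varying fastest, the x-neighbor is 1 element away, but the
// y- and z-neighbors are W and W*H elements away, almost certainly
// on different cache lines.
constexpr std::size_t W = 192, H = 192, D = 150;

constexpr std::size_t index(std::size_t x, std::size_t y, std::size_t z) {
    return (z * H + y) * W + x;
}
```

So a pathfinder stepping north/south or up/down jumps 192 or 36,864 elements per move, which is why only one of the three directions gets any help from the cache.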

Also, while I think DF is generally memory bound, many of the FPS bugs may well be CPU or cache bound.