A clock speed of 3200 MHz means you get 3.2 billion operations in a second
Sort of. The DDR (double data rate) frequency quoted is twice the actual clock speed, because it transfers data twice per clock cycle, so DDR-3200 is really 1.6 billion clock cycles per second. But the actual operations (activate a row, read a column, etc.) take several clock cycles each; let's say 15, because that's what mine takes. That's about 10 ns of latency just for the RAM chip to respond, which works out to around 33 clock cycles of my CPU unless I'm mistaken. So I'm thinking halving the memory clock would double the inherent latency on the chip to a full 66 CPU clock cycles.
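As a sanity check, here's that arithmetic as a quick sketch. The DDR-3200 and CL15 figures are the example values from above; the ~3.5 GHz CPU clock is an assumption chosen to match the "~33 cycles" estimate.

```python
# Back-of-envelope check of the latency numbers above.
# All figures are the example values from the post, not universal constants.
ddr_transfer_rate = 3200e6              # "DDR-3200": transfers per second
mem_clock = ddr_transfer_rate / 2       # actual clock: 1.6 GHz
cas_cycles = 15                         # example CAS latency (CL15)
cas_ns = cas_cycles / mem_clock * 1e9   # ~9.4 ns for the chip to respond

cpu_clock = 3.5e9                       # assumed ~3.5 GHz CPU
cpu_cycles_waited = cas_ns * cpu_clock / 1e9   # ~33 CPU cycles

# Halving the memory clock doubles the CAS delay in wall-clock time:
cas_ns_halved = cas_cycles / (mem_clock / 2) * 1e9   # ~18.75 ns

print(round(cas_ns, 2), round(cpu_cycles_waited), round(cas_ns_halved, 2))
```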
But anyway, neither the RAM chip's latency nor the delay waiting for the signal to travel through the little wires from RAM to CPU would be increased by underclocking the CPU, so I don't see how a highly memory-bound workload would show such a direct correlation with CPU clock speed.
Right, sorry: 3.2 billion distinct opportunities per second to issue or return an operation, which is not the same thing as operations per second. My bad.
But we're talking like 100 ns for a round trip here. It's not "one hundred cycles" as Putnam threw out as a random example, and it's certainly not ~30 cycles as in your example; it's MULTIPLE HUNDREDS of cycles.
Op time and data transfer windows are tiny fractions of the total time the CPU spends waiting. Nobody talks about this part because there isn't dick anyone can do about it: we haven't managed to make signal propagation any faster over those kinds of distances. But when you hear things like "accessing data in RAM is an order of magnitude slower than accessing data in the L2 cache," this is what they're talking about. Adjusting the RAM clock makes a fractional-percent difference here.
That kind of small optimization -really- adds up when you're performing tens of thousands of operations in series, and I'm not saying RAM timings don't matter for performance. But the moment each operation depends on the previous one completing, it becomes a very meager difference. Not all memory I/O problems are created equal, hence my talking about 'latency' vs. 'bandwidth' in network terms.
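A toy model makes the latency vs. bandwidth split concrete. The numbers here (100 ns round trip, one overlapped request every 2 ns) are illustrative assumptions, not measurements: a dependent chain pays the full round trip for every access, while independent requests can be kept in flight simultaneously.

```python
# Toy model: N memory requests, each with a 100 ns round-trip latency.
# All figures are illustrative assumptions, not measurements.
latency_ns = 100
n = 10_000

# Dependent chain (pointer chasing): each request must wait for the
# previous result, so round-trip latencies add up serially.
serial_ns = n * latency_ns                 # 1,000,000 ns total

# Independent requests: the memory controller can keep many in flight,
# so throughput is bandwidth-limited, say one request retired every 2 ns
# after the first one lands.
pipelined_ns = latency_ns + (n - 1) * 2    # ~20,000 ns total

print(serial_ns, pipelined_ns, round(serial_ns / pipelined_ns, 1))
```

The same number of accesses differs by roughly 50x depending on whether they depend on each other, which is why timing tweaks that shave a few cycles off each access barely register on the serial case.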
Here's how this relates to the CPU / RAM timing test:
Decrease CPU clock speed - program moves slower. But you have introduced other opportunities for a bottleneck: a program that was not CPU bound before can now become CPU bound.
We have proven nothing with this test, save that there is a point at which CPU speed can be a limiting factor for DF performance, but this shouldn't really be a surprise to anyone.
Turn the CPU clock speed back up, and adjust RAM timings - program stays the same speed. But as we've established, if the RAM I/O problem you are having is one of data latency, and not one of bandwidth, adjusting the timings will have near-zero effect.
We have also proven nothing with this test, other than that our problem probably wasn't one of RAM bandwidth.
Hence, we have not proven that the program is CPU bound or memory bound; those observations are consistent with a scenario in which your performance problems are a result of conditional execution against data in RAM. Given that Putnam has also talked a bit about how conditionals on data in RAM are a big problem for DF performance, I'm inclined to take that as the explanation until evidence to the contrary is presented.
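For concreteness, the shape of workload being described looks something like this pointer-chasing sketch. The table and traversal are entirely hypothetical, but they show why neither RAM timings nor CPU clock helps much here: every branch tests data that was just loaded, and the next address is unknown until that load returns, so nothing can be prefetched or overlapped.

```python
import random

# Hypothetical sketch of a latency-bound workload with conditionals on
# data in RAM. Each step loads a value from an unpredictable location,
# branches on it, and uses it as the next address; the CPU cannot
# prefetch or overlap these loads because each address depends on the
# result of the previous load.
random.seed(0)
n = 1 << 16
table = [random.randrange(n) for _ in range(n)]

def chase(steps):
    idx = 0
    acc = 0
    for _ in range(steps):
        val = table[idx]   # address of this load came from the last load
        if val % 2:        # conditional on just-loaded data
            acc += 1
        idx = val          # next address known only after this load lands
    return acc

result = chase(10_000)
print(result)
```

In a compiled program each `table[idx]` would typically be a cache miss, so the whole loop runs at roughly one round-trip latency per iteration regardless of how fast the CPU is clocked.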