Topic: 64-bit Complied Exe? (Read 3612 times)

Corona688 · « **Reply #15 on:** May 16, 2010, 04:05:53 pm »

Quote from: numerobis on May 15, 2010, 04:28:13 pm

I feel like I saw that the code is 64-bit clean (maybe the linux build is 64? I forget). So it is just a compiler option.

It's almost certainly not; sorry. And I just ran 'file' on a linux dwarfort.exe:

Code: [Select]

$ file ./dwarfort.exe 
./dwarfort.exe: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.8, stripped

Not 64-bit and has never been.

Quote

3. caches, internal registers, etc

For 3: Almost no matter how dumb the compiler, you're generally benefiting when the caches grow and the number of internal registers increases and clock rates increase.

You get the benefit of caches no matter what you do -- they're hardwired.

Quote

By default, compilers often assume you're running on an ancient machine (a 386, or maybe a pentium)

For maximum compatibility, things have traditionally been compiled for i386, but recently this has fallen by the wayside. Practically all modern chips are compatible with each other apart from their extensions; compile for i686 and you're compiling for a modern 32-bit chip.

Quote

but the relative cost of various operations has changed in the intervening 25 years; you can get marginally faster code by tuning to a more recent chip so the compiler has a more realistic view of the relative cost of operations on machines people actually use. Your code will still run on a 386 though.

I do think Toady should at the very least compile it for i686. Even mainstream linux has dropped i386 support for newer things now.

Quote

For 2: If you have the compiler use the newer instruction sets (MMX, SSE, etc etc), then it will frequently be able to produce faster code. Chips that don't support that new instruction set are left out: they just won't be able to run your code.

It takes an extremely smart compiler to use MMX and SSE operations. There are basically two: one by Intel, and gcc. When autovectorization is enabled they both produce code about as stable as a dwarven submarine. So, this sort of thing is best done with great care.

Quote

For 1: If you have the compiler use 64-bit pointers rather than 32, then assuming you have OS support for it, you can access much more memory, at the cost that now your pointers take up twice as much space -- so now your caches are somewhat less effective, and you need to move more memory over the bus.

I love it when people get accolades for things I already said.

QuakeIV · « **Reply #16 on:** May 16, 2010, 07:05:16 pm »

THE IGNORANCE; IT BURNS!

Capntastic · « **Reply #17 on:** May 16, 2010, 07:11:14 pm »

Quote from: QuakeIV on May 16, 2010, 07:05:16 pm

THE IGNORANCE; IT BURNS!

Then by all means, dispel it. These sorts of comments aren't productive.

numerobis · « **Reply #18 on:** May 16, 2010, 08:33:52 pm »

Accidentally ending up with 64-bit clean code is really not that hard; most code I've written has worked just fine after a few compiler errors. Nevertheless, thanks for checking; you're right, we don't appear to have a 64-bit version being shipped yet.

You usually benefit from the caches, and when they increase in size, that's usually a good thing. But it's not always true. For one, maybe you don't have any cache misses anyway in the critical path; then growing the cache doesn't matter (shrinking it might be bad). For another, the new cache may have more or less associativity, which changes what makes for the best blocking. A lot of work has gone into ATLAS to optimize the use of the caches for the precise machine you're targeting. An approach that's more robust to changing machine details is to use cache-oblivious algorithms. The most common approach is to ignore the issue and hope something reasonable happens; that usually works quite well.

The instruction set extensions that AMD and Intel have created offer more than just vectorization. The two things that make the most difference in code I've written are being able to move 64 bits in one instruction, and being able to access the floating point registers as registers rather than as a stack; vectorization didn't help me at all. Thinking about the kind of code Toady is probably writing, I doubt it would help his code either, but I'd love to be wrong. Recent gcc (4.3 and 4.4) seem to be quite usable at -O3 even though that turns on -ftree-vectorize, so perhaps they fixed the bugs you encountered?

Various 40d# versions got bugs filed saying there was an illegal instruction, which indicates (1) there are people out there with rather ancient processors who want to play, and (2) the versions were compiled using some newer instructions.

Finally, I'm terribly sorry about having peed in your wheaties. Corona is a more potent diuretic than I'd realized.

Corona688 · « **Reply #19 on:** May 21, 2010, 12:46:56 am »

Quote from: numerobis on May 16, 2010, 08:33:52 pm

Accidentally ending up with 64-bit clean code is really not that hard; most code I've written has worked just fine after a few compiler errors. Nevertheless, thanks for checking; you're right, we don't appear to have a 64-bit version being shipped yet.

I based my opinion on DF's 64-bit cleanliness on comments made by Toady. He was search-and-replacing text involving 'long'; long integer variables are one of the big compatibility breakers for 64-bit compilation since long integers are one kind of variable that differs in size when compiled for common 64-bit architectures. Any code that assumes sizeof(int) == sizeof(long) can break.

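That he's writing it in C++ and not C is a step towards 64-bit compatibility though. C++ doesn't let you get away with another common 64-bit compatibility breaker in C: not including all necessary headers. Older C assumes an undeclared function returns int, which silently truncates a 64-bit pointer; C++ refuses to compile the call at all.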

Quote

Thinking about the kind of code Toady is probably writing, I doubt it would help his code either, but I'd love to be wrong. Recent gcc (4.3 and 4.4) seem to be quite usable at -O3 even though that turns on -ftree-vectorize, so perhaps they fixed the bugs you encountered?

Perhaps.

Quote

Various 40d# versions got bugs filed saying there was an illegal instruction, which indicates (1) there are people out there with rather ancient processors who want to play, and (2) the versions were compiled using some newer instructions.

Or instructions specific to a line of processors; AMD processors have extensions like 3DNow! which Intel never picked up. Or, more likely, memory or file corruption or simply a bug caused a genuinely invalid instruction to exist in memory, or caused memory that doesn't contain instructions at all to be inadvertently executed.

TomCatFort · « **Reply #20 on:** May 27, 2010, 07:20:49 pm »

Built-in types in C++ have serious drawbacks when you want to compile for different processors. So it's a good idea to use typedefs to create your own types; then you only have to change them in one source file. Also, some programmers assume that sizeof(int)==sizeof(void*). I also found it a great pain to make x64 applications for Windows. It's usually hard or impossible to find the necessary 64-bit libs, and if you compile them yourself, they have bugs, errors, or segfaults. I never found the 64-bit OpenGL libs or an extension wrangler lib for Windows.

Anyway, programs usually benefit from 64-bit.

chmod · « **Reply #21 on:** May 28, 2010, 12:36:09 pm »

This may be the most painful thread I've seen here.

Have you tried compiling with ULTRA_FAST_MODE?

darius · « **Reply #22 on:** May 28, 2010, 03:18:19 pm »

Quote from: chmod on May 28, 2010, 12:36:09 pm

This may be the most painful thread I've seen here.

Have you tried compiling with ULTRA_FAST_MODE?

or maybe -O666 (optimize till it bleeds?)

zimluura · « **Reply #23 on:** May 30, 2010, 11:23:12 am »

Quote from: darius on May 15, 2010, 12:29:49 pm

32 bit programs have max of 2 gb memory (or 3 if compiled with some flags) while 64 bit has limit of (very big number) also 64 bit commands usually speed up programs witch use big amounts of memory as every command is operating with twice as much data (approx)

dwarf fortress generally uses 500-800MB on my windows machine, think i've had spikes up to 1.2GB. and i'm guessing the operations the exe does during play are growing in detail and complexity. it might be a good idea to start coding practices that yield word-size-portable code sooner rather than later, thinking how difficult the merge seems to have been. since he's just touched all that code once it could even be an opportune time to give it a twice over to get some 64bit exes. but of course we still have a little ways to go before it hits that limitation. and, if i were toady, i would be pretty sick of doing code porting tasks right now.

Dr. Melon · « **Reply #24 on:** May 30, 2010, 11:28:49 am »

I think that a multi-core version would also benefit. Such as splitting pathfinding into its own thread, and having water simulation etc in another one would boost speeds magnificently.

That is, of course, provided Toady doesn't do that already. He does a damn fine job of optimisation, I know that much.

DDR · « **Reply #25 on:** May 30, 2010, 09:30:48 pm »

Currently, the only thing multi-core is the rendering stuff. Everything else is single core, and probably going to stay that way.

Stargrasper · « **Reply #26 on:** June 06, 2010, 05:26:19 pm »

Quote from: Dr. Melon on May 30, 2010, 11:28:49 am

I think that a multi-core version would also benefit. Such as splitting pathfinding into its own thread, and having water simulation etc in another one would boost speeds magnificently.

That is, of course, provided Toady doesn't do that already. He does a damn fine job of optimisation, I know that much.

I think we've gone over this before. As helpful as it might be for DF to use multiple cores, threading a program this mature (read: lots and lots of existing code) could be a massive project in itself. Threading programs that weren't designed to be threaded from the very beginning is usually far more trouble than it's worth...unless Toady plans to arbitrarily end development on Slaves to Armok 2 and begin Slaves to Armok 3 with all of this built from the beginning in lieu of completing the current project.

Kholint · « **Reply #27 on:** June 07, 2010, 05:13:09 am »

I'll be honest: I'm not entirely convinced by the whole "threading would be too difficult" argument. I've done a lot of multithreading for a game engine recently and I've tried loads of different methods and you're right- it *is* very difficult. Right now I've settled on using a spatial partitioning design which is more than a little bit complicated (and would be pretty difficult to retrofit into a game like DF, but not impossible) - but it has the advantage of meaning that all real game logic (moving things around, building things, removing things, etc. etc.) can be done from a single-threaded perspective.

Imagine the game world being split into multiple little bubbles of activity (that are constantly changing after areas are processed and the 'bubble' moves on) - with non-active areas serving as a buffer between all the bubbles. The code inside the bubble works totally serially and single threaded, as long as it doesn't reach outside the bubble.

For my application it's quite easy to abide by this constraint, because it mostly involves simulating particle movements. The pathfinding used by dwarves in DF would complicate matters, because the pathfinding code necessarily has to "reach out" beyond the bubble's safe boundaries. The best solution for this, I think, would be to maintain a thread-safe, abstracted data structure designed for pathfinding that the real map feeds data into. The pathfinding dwarves would use this overlaid structure to read from, so they don't ever interfere with one another. This also has the advantage of providing an obvious place for pathfinding optimisations (caching, predictive pathing, you name it).

Er, anyway - that was beside the point. What I *actually* wanted to say was this: if adding some sort of all-encompassing data-parallelised multithreading (as I described above) would be too difficult (and I think we'd all agree that it probably would), that doesn't mean the most CPU-intensive parts of DF can't be parallelised.
Putting pathfinding and flows on their own separate threads (known as functional threading) isn't a great idea because it doesn't scale - it'll work for now, but in 3 or 4 years when we're running 12 or 16 cores as standard we'll be crying out for DF to scale beyond 3 or 4 cores and it'll have to be multithreaded all over again.

Here's how DF could have parallelised pathfinding that scales practically indefinitely:

I'm assuming right now that DF handles the turn system by having each creature individually take his turn: assess the situation, pathfind (or not), then move.
This is hard to parallelise, as one creature might be pathfinding whilst another creature moves into the area the first is trying to pathfind through.

Instead, if toady were able to rejig DF's turn system so that every pathfinding request for that turn were all executed at once, you could offload the pathfinding to every core on the user's machine. DF's pathfinding is totally read-only (I'm assuming!) and doesn't affect the landscape it's pathfinding over, so there's absolutely no problem with thread/memory interference.

A user's 8 core i7 mega-machine could blast its way through the "pathfinding phase" then go back to handling all the creature movement, combat and so on on a single thread. This would be pretty easy to implement - the only thing it'd require is that DF's turn system is changed so that it allows all the pathfinding to be done at once. Everything else would be the same.

The game as a whole won't scale indefinitely (flows, general movement and combat etc. would still be just as slow), but it'll remove a significant bottleneck in current fortresses - possibly kicking up our maximum fortress populations from the 200 mark to maybe 400, 500?

Given that 80% of gamers* are now using multicore machines (25% of gamers have a quad core!), I think it'd be a really good idea to seriously consider adding a "parallelised pathfinding phase" to DF, especially considering how easy it'd be if the turn system were changed.

What do you guys think? Any ideas?

*http://store.steampowered.com/hwsurvey/

numerobis · « **Reply #28 on:** June 07, 2010, 08:12:14 pm »

Except when for example a goblin army shows up all at once, it's probably rare for a frame to have more than one or two creatures needing to compute an expensive path. The problem arises when they have to go a long way, so that one pathfinding call is expensive, and two frames later another dwarf finishes what they were doing and also goes a long way, etc. Those calls can't really be parallelized easily.

If you're interested in pathfinding, there was a whole thread about a programming project to deal with it, like Baughn with the new OpenGL stuff. I'm not sure if it died.

helf · « **Reply #29 on:** August 04, 2010, 06:01:05 pm »

I was about to post asking about a 64bit version of DF when I noticed this thread.

The only reason I'm curious about it is because if you try generating a world with 100 races and 10000 years of history, the game quickly runs into its 2GB memory limit and crashes, usually before it even hits year 1200 in the history genning. I'd love to be able to actually generate a world with 10k years of history with 100 working races. But I seem to be limited to maybe 12 races or so and 10k years of history. That hits around 1.6GB :p
