Bay 12 Games Forum

Author Topic: Virtual machines to increase performance?  (Read 4234 times)

draeath

  • Bay Watcher
  • So it has come to this...
    • View Profile
Re: Virtual machines to increase performance?
« Reply #15 on: December 06, 2014, 08:32:15 am »

Without getting into the nitty-gritty, a VM with one core on a multi-core system will be bound to doing its processing on a single host core.

A bit more detail: most VM setups use paravirtualization - the majority of processor work is passed through, so you are still subject to the host architecture's limitations. While full virtualization breaks away from this (QEMU, for example, can run an ARM VM on an x86 host), the performance of doing so is atrocious.
Logged
Urist McAlchemist cancels extract isotope: interrupted by supercriticality accident.
This kea is so raw it stole my wheelbarrow!

MitchTheLich

  • Escaped Lunatic
    • View Profile
Re: Virtual machines to increase performance?
« Reply #16 on: December 08, 2014, 06:05:16 pm »

Assuming funds weren't really an issue, would it be possible to make like a 32gig single core? And then run that in a normal computer, and run solely DF on the 32 gig core?
Although someone was saying DF doesn't actually use that much ram, and that a higher-speed cpu would have more of an effect.
/computers are hard/
Logged

draeath

  • Bay Watcher
  • So it has come to this...
    • View Profile
Re: Virtual machines to increase performance?
« Reply #17 on: December 08, 2014, 07:01:38 pm »

Not a single core. The time it takes the electrical signal to propagate across the chip is the primary limiting factor. That's why clock speeds have stayed where they are while performance keeps getting better as process sizes shrink - we can do more with less at the same frequency. We can still only operate around 3.5 GHz, but in each of those cycles more circuitry can be communicated with, which means that one cycle can do a more complex operation.
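
For a rough sense of scale (ballpark figures assumed here, not measurements - signals in on-chip wiring only move at a fraction of the speed of light):

Code: [Select]
# Back-of-the-envelope: how far can a signal travel in one clock cycle?
# The signal speed is an assumption (~half the speed of light in wiring).
clock_hz = 3.5e9                       # ~3.5 GHz
signal_speed_m_s = 0.5 * 3.0e8         # rough guess for on-chip propagation

cycle_time_s = 1.0 / clock_hz                  # ~0.29 ns per cycle
distance_m = signal_speed_m_s * cycle_time_s   # distance covered per cycle

print(f"One cycle lasts about {cycle_time_s * 1e9:.2f} ns")
print(f"A signal covers roughly {distance_m * 100:.1f} cm in that cycle")
# A few centimetres - about the scale of a chip package - so you can't just
# keep raising the clock and expect the signal to make it across in time.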

Let me show you a little something...

The empty chassis costs like $2000. That's empty CPU sockets and no RAM. Yep. A bit expensive.

We put lots of these in Cisco UCS chassis, cluster them, and run our infrastructure virtualized inside them. Last I checked we had over 4000 virtual machines. There's some serious gear here. But note that it takes 20 2-core CPUs to pull it off. (Oh, and don't let that "2 NICs" fool you. That's a pair of uplinks to fabric extenders, 10 Gbps each.)
« Last Edit: December 08, 2014, 07:05:37 pm by draeath »
Logged
Urist McAlchemist cancels extract isotope: interrupted by supercriticality accident.
This kea is so raw it stole my wheelbarrow!

Miuramir

  • Bay Watcher
    • View Profile
Re: Virtual machines to increase performance?
« Reply #18 on: December 09, 2014, 04:22:04 pm »

Assuming funds weren't really an issue, would it be possible to make like a 32gig single core? And then run that in a normal computer, and run solely DF on the 32 gig core?
Although someone was saying DF doesn't actually use that much ram, and that a higher-speed cpu would have more of an effect.
/computers are hard/

You seem to have some confusion about computers in general and DF in particular.  DF is in practice much more like scientific computing than most games.  Simplifying, a computer has a small amount of memory actually on the chip with the CPU, which is very fast; this is the "cache".  (In practice, there are several levels of this, don't worry about that yet.)  Then there is the computer's main memory, which is moderate speed.  Then there is SSD style storage or hybrid disk cache, which is painfully slow, and actual spinning hard drives, which are unimaginably glacially slow. 

Let's say you're at an exclusive club, and you suddenly realize you need a different credit card than the one you were planning on using originally.  If your wallet is in your hand already and you just pull out a different card, that's like the data being on the stack in the CPU.  If you have to pull your wallet out of your pocket and swap cards, that's like the data being in the cache.  If you left that card in a different wallet in your coat pocket that you checked at the door when you came in, that's like the data being in main memory.  If you left that card in your coat which you left in the trunk of your car which was valet-parked in a different building by a unionized worker who just went on break, that's like the data being on the hard drive.  If your car with the wallet was handed over to a weird foreigner who promised you wouldn't have to pay parking fees, because they were going to drive it around town using it as a delivery vehicle for their catering business, that's like the data being "in the cloud". 
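
To put very rough numbers on that ladder (commonly cited ballpark latencies, assumed here purely for illustration - not measurements of any particular system):

Code: [Select]
# Ballpark latencies, orders of magnitude only (assumed for illustration).
latency_ns = {
    "Register / stack (wallet in hand)":   1,
    "On-die cache (wallet in pocket)":     4,
    "Main memory (coat at the door)":      100,
    "SSD random read (valet-parked car)":  100_000,
    "Spinning disk (car out catering)":    10_000_000,
}

base = latency_ns["Register / stack (wallet in hand)"]
for name, ns in latency_ns.items():
    print(f"{name:38s} ~{ns:>11,} ns  ({ns // base:>10,}x)")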

When you start a program, it loads from the super-slow hard drives into memory.  If you don't have enough memory to load the whole program (plus the operating system itself, and all the *other* programs you are using), the operating system has to play juggling tricks with the super-slow hard drive to pretend it has more memory than it does; this makes everything run very slow.  So, at first, adding more memory to a computer helps *tremendously* in performance.  In particularly pathological cases, going from 512M to 1G RAM in an older system might speed up certain programs by as much as a factor of ten. 

But once you have enough memory to load *everything*, adding more doesn't really help much.  The operating system can try to "guess ahead" at what you might want from the super-slow hard drive, and use idle time to preemptively load it into memory; but you reach diminishing returns on that fairly soon.  This is why most ordinary users don't see benefits from more than 8G of memory, if that; most ordinary programs still don't use more than 2G to 4G at most, with the majority much smaller. 

Once everything is in memory, you have to get the info from the memory to the CPU, and back again.  Most programs that ordinary home users deal with use fairly small files, and have a lot of free time waiting for user input.  For DF and some sorts of scientific computing, this is more problematic.  CPUs with enough on-die L3 cache to load a full embark don't exist; even the Intel 4790K has only 8M, for instance.  L4 cache (in-package but not on-die), as seen in some Haswell mobile CPUs, may help somewhat once it reaches 2nd-gen desktop CPUs, but that's still only 128M.  In any case, memory access speed or memory bus speed is the limiting factor here. 

Let's switch examples.  You're trying to optimize a restaurant, and have one guy who takes clean silverware out of one bin, clean napkins out of another, and one of those colored paper sticky strips out of a box.  This gets handed to another guy, who folds the napkin around the silverware and attaches the paper strip, then hands it back to be put in a different bin. 

If you just had a security camera watching the output bin, you couldn't tell who was the limiting factor.  If the guy doing the folding (CPU) was the limiting factor, spending more money on a guy who is better at taking things out of bins (the memory speed) isn't going to help, and vice versa.  A program where the CPU has everything it needs when it needs it, but takes too long to do the work, is "CPU bound"; improving the speed of the CPU will help directly, while other improvements will help only a little if any.  Conversely, a program that has the CPU waiting around idle for the memory bus to give it the next piece of data would be "I/O bound". 
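
If you want to feel the difference yourself, here is a rough sketch (illustrative only, not a rigorous benchmark; the array sizes and loop counts are arbitrary assumptions - tune them for your machine):

Code: [Select]
# One loop hammers arithmetic on a small, cache-friendly array (roughly CPU
# bound); the other streams once through a large array doing almost no work
# per element (roughly memory bound). Needs NumPy and ~1 GB of free RAM.
import time
import numpy as np

small = np.random.rand(50_000)        # ~0.4 MB, fits comfortably in cache
large = np.random.rand(100_000_000)   # ~800 MB, far larger than any cache

t0 = time.perf_counter()
for _ in range(1_000):                # heavy math on a tiny working set
    small = np.sin(small) + 0.0001
cpu_bound_s = time.perf_counter() - t0

t0 = time.perf_counter()
total = large.sum()                   # one pass, trivial math per element
mem_bound_s = time.perf_counter() - t0

print(f"CPU-bound loop:    {cpu_bound_s:.2f} s")
print(f"Memory-bound pass: {mem_bound_s:.2f} s  (checksum {total:.1f})")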

We don't really know under what conditions DF is CPU bound vs. I/O bound.  In general we suspect that with a fast modern processor, DF is more often (or more likely to be) I/O bound, yet sensitive to errors.  Thus the general recommendation for optimizing DF: figure out the fastest long-term-stable memory transfer speed you can afford, and then the fastest individual core speed compatible with that.  Note that "fastest" is not merely GHz, either; raw clock speeds are only valid for comparing chips of identical architecture and similar generations. 

Note that any sort of virtual machine arrangement applies overhead; you are *at best* fractionally slower than running without one.  There are a wide variety of logistical and practical reasons to run VMs, but raw computational performance isn't one of them. 

Note also that you don't need a VM to access quite a lot of memory these days; Dell will cheerfully sell you an off-the-shelf system with 4x 15-core, 2-thread-per-core CPUs (total 60 cores, 120 threads) and 96 slots of 16G each, for 1.5T (1,536G) of directly-accessible RAM.  But each of those cores only has a max turbo of 3.1 GHz, and DF can't use more than 4G at most (plus some for OS overhead, etc.); a decent i7 at 3.5 GHz in a desktop with 8G of premium RAM would probably do better.  (The aforementioned server monster could, of course, run multiple copies of DF at once without slowing down; but that is rarely what people are looking to do.  And given memory bus limitations, it might not actually keep up full speed past 4 copies.) 
« Last Edit: December 09, 2014, 04:24:38 pm by Miuramir »
Logged

Putnam

  • Bay Watcher
  • DAT WIZARD
    • View Profile
Re: Virtual machines to increase performance?
« Reply #19 on: December 12, 2014, 03:18:45 am »

hmm yeah imma keep a copy paste of that around

wierd

  • Bay Watcher
  • I like to eat small children.
    • View Profile
Re: Virtual machines to increase performance?
« Reply #20 on: December 12, 2014, 11:59:09 pm »

It's a little wordy, and a bit heavy on switching analogies, but otherwise pretty good at describing the problem. I liked it too.

I keep hoping that some day we will get a major breakthrough in computing/signal processing that does away with the gap between memory bus speeds and internal computation times - that would make things exciting in computing again.

I keep seeing academic research into optical transistors, quantum gates, memristors, and other radical technologies. I personally really look forward to a day where only a very tiny copper interconnect interface exists between devices and the system bus architectures, and where everything else is wholly optical in nature. Signal propagation delays would be substantially less, the unique properties of polarized and entangled light streams could be leveraged in clever ways, and having such tiny interconnects would permit driving the connected peripherals at rates well outside of what can be realistically driven on existing copper circuit traces.

But I probably will never see that day. It makes me rather sad that computers have slowed down in their evolution in the past decade when compared to previous decades.  I want to see the crazy-mad evolution again, where we went from an 8086 in '82 to a 486 in '95. (12 years, 4 major processor generations with radical increases in power)

*sigh*

As it stands now, the signal propagation delay over the copper means that even if we switch away from silicon and embrace a new semiconductor technology (like diamond or graphene) that can be clocked much higher, we will still have the crippling problem of the slow memory bus, due to noise saturation when driving it over copper at "long distances" from the CPU.  We would have to integrate the RAM directly into the CPU to overcome it, and that would really be un-ideal.  That's why I keep hoping for optical computing to one day escape from the lab and really take root.

Maybe some day.  Just not today.

Sigh.
Logged

Lightman

  • Bay Watcher
  • The groboclones are looking for you.
    • View Profile
Re: Virtual machines to increase performance?
« Reply #21 on: December 15, 2014, 02:29:06 am »

DF and Hardware Performance: Random Factoids:
  • Dwarf Fortress is, in fact, multithreaded. However, all of the game is run on one thread. The other thread is the rendering code.
  • Your graphics card has somewhere between very little and no effect on DF's performance. A few hundred letters are vastly easier to draw than hundreds of thousands of polygons.
  • Because (usually) DF cannot access more than 2 GB of RAM at any time, and (usually) doesn't use that much in the first place, the amount of RAM you have is largely irrelevant.

Your second point is not very accurate.  The rendering in Dwarf Fortress does make a difference and there are options for a reason.  Text-mode printing is significantly faster than graphical output (via SDL and such).

You should consider that each tile is two polygons; more are needed for overlaying different combinations of colours and characters.  A grid of 80 x 50 is at least 8000 polygons.  Drawn at 60Hz, that's 480,000 polygons per second.  I don't think graphics cards make much difference but it's not as simple as drawing "a few hundred letters".
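
Spelling out the arithmetic (same numbers as above):

Code: [Select]
# Two polygons (triangles) per tile, 80x50 grid, drawn at 60 Hz.
cols, rows = 80, 50
polys_per_tile = 2
frames_per_second = 60

polys_per_frame = cols * rows * polys_per_tile           # 8,000 per frame
polys_per_second = polys_per_frame * frames_per_second   # 480,000 per second
print(polys_per_frame, polys_per_second)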
Logged

Orange Wizard

  • Bay Watcher
  • mou ii yo
    • View Profile
    • S M U G
Re: Virtual machines to increase performance?
« Reply #22 on: December 15, 2014, 02:38:50 am »

Fair enough. Hundreds of millions of polygons, perhaps.
Logged
Please don't shitpost, it lowers the quality of discourse
Hard science is like a sword, and soft science is like fear. You can use both to equally powerful results, but even if your opponent disbelieve your stabs, they will still die.

Lightman

  • Bay Watcher
  • The groboclones are looking for you.
    • View Profile
Re: Virtual machines to increase performance?
« Reply #23 on: December 15, 2014, 02:56:13 am »

Fair enough. Hundreds of millions of polygons, perhaps.

:) It might be interesting to do some comparisons sometime and see if we can get some quantifiable numbers.  I would imagine that the video card has little impact.
Logged

superbob

  • Bay Watcher
    • View Profile
Re: Virtual machines to increase performance?
« Reply #24 on: December 15, 2014, 02:37:24 pm »

I wonder how modern server hardware would handle DF. I mean as a dedicated platform. Obviously disregarding cost - let's assume the fate of the world depends on being able to handle a larger embark size.

Consider this: http://ark.intel.com/compare/75260,78940

The server CPU has massively larger cache and memory bandwidth. By disabling all but two cores, it could get closer to its max turbo boost of 3.7GHz. Assuming it would still be able to take advantage of all of its cache and memory bandwidth with just two cores, this seems to be an ideal platform for playing DF.
« Last Edit: December 15, 2014, 02:39:07 pm by superbob »
Logged

Miuramir

  • Bay Watcher
    • View Profile
Re: Virtual machines to increase performance?
« Reply #25 on: December 15, 2014, 03:39:38 pm »

I wonder how modern server hardware would handle DF. I mean as a dedicated platform. Obviously disregarding cost - let's assume the fate of the world depends on being able to handle a larger embark size.

Consider this: http://ark.intel.com/compare/75260,78940

The server CPU has massively larger cache and memory bandwidth. By disabling all but two cores, it could get closer to its max turbo boost of 3.7GHz. Assuming it would still be able to take advantage of all of its cache and memory bandwidth with just two cores, this seems to be an ideal platform for playing DF.

Tricky question.  A better comparison might be this one: between the E7-8893 v2 you mention and the i7-4790K.
It's unclear whether the 8 GT/s vs. 5 GT/s system bus and 85 GB/s vs. 25.6 GB/s memory bandwidth would overcome the 3.7 GHz vs. 4.4 GHz clock speed difference.  Of course, there's also the 20x cost issue; but if someone has $15k burning a hole in their pocket *after* having made a truly epic donation to Bay 12, it would be a good way to really answer to what degree DF is I/O bound vs CPU bound.
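
Just to put the trade-off side by side (ratios computed from the spec-sheet numbers quoted above; whether DF actually scales with either is exactly the open question):

Code: [Select]
# Spec-sheet numbers quoted above; the ratios show what each chip trades away.
e7_bandwidth_gb_s, i7_bandwidth_gb_s = 85.0, 25.6
e7_clock_ghz,      i7_clock_ghz      = 3.7,  4.4

print(f"Memory bandwidth: E7 has {e7_bandwidth_gb_s / i7_bandwidth_gb_s:.1f}x the i7's")
print(f"Clock speed:      E7 runs at {e7_clock_ghz / i7_clock_ghz:.2f}x the i7's")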

Also note that it's quite common to have i7-4790K setups that are designed to run the memory well over the official DDR3-1600 rate; whereas they rarely make server boards that do the same.  That might well make up for a significant part of the difference. 
Logged

wierd

  • Bay Watcher
  • I like to eat small children.
    • View Profile
Re: Virtual machines to increase performance?
« Reply #26 on: December 16, 2014, 12:25:52 am »

I am physically in possession of one of these.

http://www.amazon.com/HP-Proliant-DL380-2x3-2GHz-Server/dp/B005G5MH6Y

(It's a large, outdated toaster)

I obtained it second-hand from a friend, who asked if I wanted it. It was cast-off from a midsize company doing a (clearly much needed) equipment upgrade.

On a lark, I decided to test how well it runs DF.  The craptastical video hardware is a severe limiter. It is NOT accelerated in any meaningful way. The "Rage XL" does not have any 3D capabilities (despite being a Rage chip, of which others in the series had primitive 3D), and even lacks any good 2D acceleration. I think the best it can do is BitBlt. (Maybe. I have to disable all acceleration and compositing in order for Linux to boot into the GUI, because it crashes X with any turned on. This chip just does not have it.) To take this out of consideration, I installed Linux and ran DF in full ncurses mode in a non-graphical terminal. (Used alt+F5 to get to a REAL terminal.) This removed the GPU almost entirely from the equation.

As you can see, the system lacks any kind of sound hardware. It DOES, however, have a PCI-X riser bay with 3 user slots.  I happen to own a suitably card-edged PCI audio card that can live in a PCI-X slot. (An antique Aureal Vortex 1. Good card, but also another dinosaur. I felt they belonged together, but I digress...) I installed it, and Linux knew what it was and turned it on for me.  I then fired up DF and did a few large embark tests.

A 5x5 embark ran at about 25 FPS when I loaded up a pretty mature fortress with the FPS cap removed. This is not what I would consider exemplary, but not exactly unplayable either. (The same fortress runs at 5 FPS in "2D" mode on a Windows laptop I have with a craptastical "Dorito" (cough) type chip in it (AMD E300).) Heavy compute was not exactly what this server was intended for; it is clearly a mid-range, mixed-application type server with a RAID disk array which (for the time it was produced) was pretty spacious and fairly fast. (Anything modern blows it away though, both in available capacity and in raw disk access. Again, this thing is a dinosaur.)

I have both CPU sockets populated with XEONs from the era, and have fully matched dual channel memory installed. (8 GB)

My desktop is an Intel i7 (an older model, but still an i7) with 12 GB of dual channel RAM, running at 3.4 GHz.  It gets much better than 25 FPS in pure terminal mode. (I will have to do another run to get the exact figure, but it was much better than 25 FPS. IIRC, it was closer to 40 FPS on this fortress. I do not know what impact the speed of the video architecture has here. The GPU in my i7 is a 16x PCIe Nvidia offering which is getting a bit long in the tooth, but that is MUCH better than the "Integrated AGP 4x" (*cough*) Rage XL inside that server in terms of raw ability to push bits to the card over the bus. I don't know how I could quantify this objectively. Since both are running Linux, I suppose I could run a test over SSH and cut the video card out of the question entirely.)

The i7 has a better die process, and likely much faster internal cache memory, in addition to a superior internal clock speed. The memory is a newer tech with a faster memory bus as well.  However, the XEONs have bigger L2 and L3 caches, if I remember correctly. (Again, my i7 is an older one.)

If it weren't for my abysmal upload speed, I would be up for opening that dinosaur server up for live operation tests with remote hosts, but my upload speed is capped at 250 kbit. (Hooray for ADSL in rural flyoverland.)

I don't have a more modern datacenter computer to run tests against; I can, however, benchmark against the antique one I have.  If more people had datacenter equipment, and were willing to run various test fortresses on them for benchmarking (possibly with some DFHack scripts to help get evaluation data out of DF with better granularity than just the FPS count), we could get a better idea of whether DF is IO-bound or CPU-bound.

Currently, this comparison is apples to oranges, but again, I think that if we had enough datapoints from systems all over the spectrum, running a consistent set of fortresses that present the game engine with various kinds of stresses, we could answer this question much better.
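
Even something as simple as this would make those datapoints comparable (a hypothetical sketch - the fortress names, sample counts, and FPS figures below are placeholders, not real results; real readings would come from the actual test runs, however they are collected):

Code: [Select]
# Hypothetical benchmarking protocol: everyone runs the same saved fortresses,
# records FPS samples the same way, and reports the same summary statistics.
from statistics import mean, stdev

def summarize(system_name, fortress_name, fps_samples):
    """Reduce a list of FPS readings to one comparable summary line."""
    return (f"{system_name:>12s} | {fortress_name:>12s} | "
            f"mean {mean(fps_samples):5.1f} FPS | "
            f"min {min(fps_samples):5.1f} | stdev {stdev(fps_samples):4.1f}")

# Placeholder readings only -- swap in numbers from real runs.
print(summarize("old DL380",  "5x5 mature", [24.0, 25.5, 26.1, 24.8]))
print(summarize("desktop i7", "5x5 mature", [39.2, 41.0, 40.5, 38.7]))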
Logged

superbob

  • Bay Watcher
    • View Profile
Re: Virtual machines to increase performance?
« Reply #27 on: December 16, 2014, 04:02:48 am »

I think that's a Pentium 4 Xeon server. I also had one (I think the mobo with its CPUs is still gathering dust somewhere around here) and tried running DF on it. Can confirm the result was underwhelming. But AFAIK the Pentium 4 was just a crap CPU even at the time. Also it seems yours had 2MB of L2 cache, assuming it's what pops up when I look up that server model - about 3GHz, 2MB L2, single core w/ hyperthreading. The early desktop i7s typically had 8MB (up to 12MB) of cache and were a lot more efficient than a Pentium 4; also I think they used DDR3 from the start. So our ancient servers are significantly worse everywhere it matters.

Still, it's amazing that it's possible to play DF on hardware that old and underpowered. I ran a sealed-off, self-sufficient fort on mine, since it was on 24/7. It had a working GPU, so I played under X Window, usually checking on my dwarves remotely using VNC.

It was fun, but the power bill was not worth it.

I've also played DF on a workstation equipped with a Xeon L5420 (kinda like the Core 2 Quad CPUs, but with 12MB of L2 cache). It was ok I guess, but performance likely suffered from using slow FB-DIMM memory.

I had nothing to compare it to at the time, so I can't really tell how well it handled DF. But I do remember playing pretty large embarks with it, like 6x6, and not really getting FPS death until I triggered it somehow.

Switching to my current i5 4670k made everything a lot faster, including DF, even though it "only" has 6MB of cache. But it's also running at 4GHz, can do more instructions per clock cycle and uses much faster memory.

Which is why I'm curious as to how modern server hardware would handle DF. Everything I've tried so far was already old when I got it, and also flawed in one way or another.

Still I don't expect a miracle, like being able to play a full featured 64x64 embark at 100FPS. I think we'll have to wait for some kind of a breakthrough in CPU technology for that to work.
« Last Edit: December 16, 2014, 05:36:50 am by superbob »
Logged

Thief^

  • Bay Watcher
  • Official crazy person
    • View Profile
Re: Virtual machines to increase performance?
« Reply #28 on: December 16, 2014, 05:55:26 am »

I think that's a Pentium 4 Xeon server. I also had one (I think the mobo with its CPUs is still gathering dust somewhere around here) and tried running DF on it. Can confirm the result was underwhelming. But AFAIK Pentium 4 was just a crap CPU even at the time.

IIRC the design remit for the Pentium 4 was "as high a clock speed as possible". As a result it had an insanely long instruction pipeline (20+ stages, compared to 10 for the Pentium 3), and anything with heavy instruction dependencies or branching (both common in real code) slowed to a crawl on it. The early implementation of hyperthreading didn't help either; some benchmarks actually showed multithreaded performance (let alone single-threaded performance) going down with it enabled (although generally there was a 5-10% increase overall).
Logged
Dwarven blood types are not A, B, AB, O but Ale, Wine, Beer, Rum, Whisky and so forth.
It's not an embark so much as seven dwarves having a simultaneous strange mood and going off to build an artifact fortress that menaces with spikes of awesome and hanging rings of death.