Below I implemented an example (with synchronization). It looks more complicated than it is, because I had to build some scaffolding around it so that it actually does something, but the core idea is really as described above. Specifically, I followed the idea that the main thread delegates "path finding" to the worker threads. Here, path finding simply means filling up a queue of integers: for instance, the main thread requests a path from "5" to "8" by putting a queue containing those two numbers on the to_do_queue, and the worker thread pushes a queue containing "5,6,7,8" onto the done_queue as the result, after doing some bogus work to keep the worker threads busy. The complete code follows below, but I'll break it down and describe it first.
I start with the main function/thread.
int main(int argc, char* argv[])
{
work_queue<queue<int>, queue<int> > the_work_queue;
This instantiates the work_queue. The latter is a template class, and the two type parameters define the data objects on the to_do_queue and on the done_queue, respectively. In this case, the data items are paths, and both input and output paths are realised as standard integer queues.
Next, I am actually using multiple worker threads here (see below).
int nWorkers=3; //how many workers and threads
vector<queue_worker> workers;
queue_worker is the class that defines a worker thread. It provides a start function that will be passed to the actual thread, making the thread wait for work/paths to be pushed onto the work_queue. Then, it has a compute_path method that does the actual work. The latter is the only bit that would have to be reimplemented for a different application.
I use the above vector to store the multiple queue_workers that all get their own thread (in this case I could have used a single queue_worker instance, but never mind).
Next, having multiple threads, I use boost::thread_group for easy handling:
boost::thread_group workerThreads;
for (int iWorker=1;iWorker<=nWorkers;iWorker++)
{
In this loop I create nWorkers queue_worker instances and threads
queue_worker worker(&the_work_queue);
workers.push_back(worker);
The queue_worker constructor takes as argument a pointer to the work_queue, so that both main thread and worker thread can access the latter.
workerThreads.create_thread(boost::bind(&queue_worker::start, workers.back()));
//note that boost::bind stores a copy of the worker, so each thread actually
//runs on its own copy; that is fine here, because a queue_worker only holds
//a pointer to the shared work_queue
}
workerThreads is the thread_group. One complication here: we had boost::thread my_thread(&hello_world) before, but now I actually want to pass an argument to the thread function, which I can do with boost::bind, e.g.
void hello_world(int x);
boost::thread my_thread(boost::bind(&hello_world,1234));
In the above case, I pass the start method of the queue_worker class, and a queue_worker instance as the argument (taken from the vector that stores all the queue_workers). Each create_thread call creates a new thread and "starts" the queue_worker by calling its start method. Note that boost::bind stores a copy of the worker, so each thread runs on its own copy; that is fine here, because a queue_worker only holds a pointer to the shared work_queue. Starting makes each worker wait on the work_queue for work to do.
Now comes the part where the main thread creates work for the worker threads; this part would obviously have to be reimplemented for the specific application at hand. Here, I just have the main thread create 5 paths per cycle over 4 cycles. The main thread pushes the 5 paths onto the work queue and then waits until they have been processed. In "real life", it could do other work instead.
int nCycles=4;
int nPaths=5;
for (int iCycle=1;iCycle<=nCycles;iCycle++)
{
Now five paths are created (just by selecting arbitrary pairs of integers). The paths are then pushed onto the work queue (the to_do_queue, specifically) and will immediately be picked up by the waiting worker threads.
for (int iPath=1;iPath<=nPaths;iPath++)
{
queue<int> path;
path.push(iCycle*10);
path.push(iCycle*10+iPath);
the_work_queue.to_do_push(path);
}
Next, the main thread waits until the current batch of work is done. To this end I added a wait_all_done() function to the work_queue class.
the_work_queue.wait_all_done(); //wait for workers to finish batch
Essentially, the main thread now waits on a condition variable (internally). The work_queue wakes the main thread up once all work is done, which it detects by checking whether the count of to_do items pushed so far equals the count of done items pushed so far.
Below is just some output from the main thread whenever it has received a batch of results:
cout<<"Retrieved paths:"<<endl;
queue<int> resultPath;
while(the_work_queue.done_try_pop(resultPath)) //false when empty
{
while(!resultPath.empty())
{
cout<<resultPath.front()<<",";
resultPath.pop();
}
cout<<endl;
}
}
After the cycles are over, this (part of the) program is supposed to finish. Boost currently offers no way to kill a running thread other than ending the program (?) or waiting for the thread to finish; here, however, the workers would never finish on their own, because they are just waiting on the empty work_queue. As a workaround, I implemented the queue_worker such that it returns whenever an empty path is pushed onto the to_do_queue (and the push itself wakes it up from its waiting).
queue<int> dummy;
while(the_work_queue.to_do_try_pop(dummy)){}; //emptying queue
queue<int> empty_path;
the_work_queue.to_do_push(empty_path); //signals the worker to finish
// wait for the worker threads to finish
workerThreads.join_all();
return 0;
}
That's it for the main thread. I haven't talked about the implementation of the work_queue and the queue_worker (code comes below), but I hope they are usable in the above fashion even without knowing the details. The work_queue can be used as is. For the queue_worker, you would have to reimplement the compute_path(...) function so that it does the actual work you want it to do, and change the data type of the work_queue pointer accordingly if you don't work on queue<int> ("paths" here). Also, if you use a data type that has no .empty() method, you would have to modify my workaround that stops the thread when an empty datum is pushed (see the start function).
work_queue class, copy&paste into work_queue.h:
#include <queue>
#include <boost/thread/thread.hpp>
#include <boost/thread/mutex.hpp>
#include <boost/thread/condition.hpp>
/* based on
http://www.justsoftwaresolutions.co.uk/threading/implementing-a-thread-safe-queue-using-condition-variables.html
*/
/* Provides two concurrently accessible queues for uncompleted and
completed work objects, and a condition variable based method to
wait until all the work is done. */
template<typename to_do_data, typename done_data>
class work_queue
{
private:
std::queue<to_do_data> to_do_queue;
std::queue<done_data> done_queue;
mutable boost::mutex to_do_mutex;
mutable boost::mutex done_mutex;
boost::condition to_do_empty_condition;
boost::condition done_empty_condition;
boost::condition all_done_condition;
long to_do_count; //items pushed onto to_do_queue so far
long done_count; //items pushed onto done_queue so far
public:
work_queue()
{
to_do_count=0;
done_count=0;
}
void to_do_push(to_do_data const& data)
{
boost::mutex::scoped_lock lock(to_do_mutex);
bool const was_empty=to_do_queue.empty();
to_do_queue.push(data);
{
//done_mutex guards both counters, so that done_push
//can compare them safely
boost::mutex::scoped_lock count_lock(done_mutex);
to_do_count++;
}
lock.unlock();
if(was_empty)
{
to_do_empty_condition.notify_all();
}
}
bool to_do_empty() const
{
boost::mutex::scoped_lock lock(to_do_mutex);
return to_do_queue.empty();
}
bool to_do_try_pop(to_do_data& popped_value)
{
boost::mutex::scoped_lock lock(to_do_mutex);
if(to_do_queue.empty())
{
return false;
}
popped_value=to_do_queue.front();
to_do_queue.pop();
return true;
}
void to_do_wait_and_pop(to_do_data& popped_value)
{
boost::mutex::scoped_lock lock(to_do_mutex);
while(to_do_queue.empty())
{
to_do_empty_condition.wait(lock);
}
popped_value=to_do_queue.front();
to_do_queue.pop();
}
void done_push(done_data const& data)
{
boost::mutex::scoped_lock lock(done_mutex);
bool const was_empty=done_queue.empty();
done_queue.push(data);
done_count++;
//compare and reset the counters while still holding done_mutex,
//so that no other thread can modify them in between
bool const all_done=(done_count==to_do_count);
if(all_done)
{
done_count=0;
to_do_count=0;
}
lock.unlock();
if(all_done)
{
all_done_condition.notify_all();
}
if(was_empty)
{
done_empty_condition.notify_all();
}
}
bool done_empty() const
{
boost::mutex::scoped_lock lock(done_mutex);
return done_queue.empty();
}
bool done_try_pop(done_data& popped_value)
{
boost::mutex::scoped_lock lock(done_mutex);
if(done_queue.empty())
{
return false;
}
popped_value=done_queue.front();
done_queue.pop();
return true;
}
void done_wait_and_pop(done_data& popped_value)
{
boost::mutex::scoped_lock lock(done_mutex);
while(done_queue.empty())
{
done_empty_condition.wait(lock);
}
popped_value=done_queue.front();
done_queue.pop();
}
void wait_all_done()
{
boost::mutex::scoped_lock lock(done_mutex);
//wait until as many items are done as were pushed (the counters
//are reset to zero once they match); the loop guards against
//spurious wakeups
while(to_do_count!=done_count)
{
all_done_condition.wait(lock);
}
}
};
The above example, including the queue_worker class:
#include "work_queue.h"
#include <boost/thread/thread.hpp>
#include <boost/thread/mutex.hpp>
#include <boost/bind.hpp>
#include <iostream>
#include <queue>
using namespace std;
class queue_worker
{
public:
queue_worker(work_queue<queue<int>, queue<int> >* ptr)
{
work_queue_ptr = ptr;
}
void start()
{
queue<int> path;
while(1)
{
work_queue_ptr->to_do_wait_and_pop(path); //call by reference!
//can't think of a better way to gracefully end the thread,
//because otherwise it will wait over an empty queue forever.
//Thus, I'm using the path itself as a signal. Pushing an empty
//path onto the work queue ends the worker thread.
if(path.empty())
{
work_queue_ptr->to_do_push(path); //for other worker threads
break;
}
//got data? Work!
compute_path(path);
work_queue_ptr->done_push(path);
}
}
private:
work_queue<queue<int>, queue<int> >* work_queue_ptr;
void compute_path(queue<int>& path)
{
//bogus work to keep the worker busy for a while
float bogus_work=42;
for(long j=1; j<=100000;j++)
{
for(int k=1; k<=10000;k++)
{
bogus_work=bogus_work+0.00000001;
}
}
//"finding" the path by filling in the intermediate integers,
//so that e.g. [5,8] becomes [5,6,7,8]
int p1=path.front();
path.pop();
int p2=path.front();
path.pop();
path.push(p1);
if (p2>p1)
{
for (int i=p1+1;i<p2;i++) path.push(i);
}
else
{
for (int i=p1-1;i>p2;i--) path.push(i);
}
path.push(p2);
}
};
int main(int argc, char* argv[])
{
work_queue<queue<int>, queue<int> > the_work_queue;
int nWorkers=3; //how many workers and threads
//here: could use one worker instance, but be careful if member data
//is changing in a worker specific way etc.
vector<queue_worker> workers;
boost::thread_group workerThreads;
for (int iWorker=1;iWorker<=nWorkers;iWorker++)
{
queue_worker worker(&the_work_queue);
workers.push_back(worker);
workerThreads.create_thread(boost::bind(&queue_worker::start, workers.back()));
//note that boost::bind stores a copy of the worker, so each thread actually
//runs on its own copy; that is fine here, because a queue_worker only holds
//a pointer to the shared work_queue
}
int nCycles=4;
int nPaths=5;
for (int iCycle=1;iCycle<=nCycles;iCycle++)
{
for (int iPath=1;iPath<=nPaths;iPath++)
{
queue<int> path;
path.push(iCycle*10);
path.push(iCycle*10+iPath);
the_work_queue.to_do_push(path);
}
the_work_queue.wait_all_done(); //wait for workers to finish batch
cout<<"Retrieved paths:"<<endl;
queue<int> resultPath;
while(the_work_queue.done_try_pop(resultPath)) //false when empty
{
while(!resultPath.empty())
{
cout<<resultPath.front()<<",";
resultPath.pop();
}
cout<<endl;
}
}
queue<int> dummy;
while(the_work_queue.to_do_try_pop(dummy)){}; //emptying queue
queue<int> empty_path;
the_work_queue.to_do_push(empty_path); //signals the worker to finish
// wait for the worker threads to finish
workerThreads.join_all();
return 0;
}
Performance-wise, of course you don't want to create arbitrarily many worker threads, because managing threads causes a certain overhead. However, from my tests it seems that having more threads than processors can be worthwhile: I went up to as many as 10 worker threads in the above example (5 of which can't even do any work, because there are only 5 paths per cycle), and it still ran at maximum speed.
Conclusion
Okay, I know this all sounds complicated. But I'd recommend anyone who's interested in the matter to download the above example and try it out. To make it work on something you want, all you have to do is change compute_path in the queue_worker, some data types, and main() accordingly.
For DF... well, I don't really know. I don't know whether Toady has the nerve to try this stuff out, maybe not now. Maybe if he ever does an experiment analogous to Battle Champs, or describes the code sections he's using, I or others could do the implementation for him so he can just try it out (again, he seems to be busy with other stuff right now, so adding another non-content task to his workload might not be in his interest). Btw, I would have tested the thing in BC, but there wasn't a CPU-intensive part analogous to the above problem around...
That's it from me, for now. Maybe this will be useful to someone at some point - if only to myself.