Latest update:
The data I can collect seems to be too noisy (heterogeneous, etc.), or maybe my techniques are just really bad. Either way, I haven't had much success, nor the time to improve the system. As such, I'm putting this on hold for now.
Hi guys! Over the last few days I've been working on a project in Python. The idea is to train a machine learning classifier to differentiate between Town and Mafia players based on the text of their posts. So far, I've got the extraction step working: I can pull all the users and their posts from an entire Bay12 Forums thread and save them into neat objects.
The problem I have now is labeling the players as "Town", "Mafia", or "Irrelevant" (OP, people who are posting to watch, ICs in BMs, etc.). This has to be done by hand, and unfortunately a lot of GMs don't state the final alignments of players when they flip or when the game is over, or they only do it in the game-end post (which isn't the last post in the thread, because of the discussion after the game...).
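To make the labeling scheme concrete, here is a minimal sketch of what a labeled player object could look like. The names `Alignment`, `Player`, `unlabeled` and `training_players` are hypothetical and only for illustration; they are not the actual classes in my code.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Alignment(Enum):
    """The three labels described above."""
    TOWN = "Town"
    MAFIA = "Mafia"
    IRRELEVANT = "Irrelevant"  # OP, watchers, ICs in BMs, etc.


@dataclass
class Player:
    """Hypothetical container: one forum user and their posts in one thread."""
    name: str
    posts: List[str] = field(default_factory=list)
    alignment: Optional[Alignment] = None  # None means "not labeled yet"


def unlabeled(players):
    """Names of players that still need a label by hand."""
    return [p.name for p in players if p.alignment is None]


def training_players(players):
    """Only Town and Mafia players are useful for training the classifier."""
    return [p for p in players
            if p.alignment in (Alignment.TOWN, Alignment.MAFIA)]


# Made-up example thread.
players = [
    Player("alice", ["I vote bob."], Alignment.TOWN),
    Player("bob", ["Why me?"], Alignment.MAFIA),
    Player("gm", ["Day 1 begins."], Alignment.IRRELEVANT),
    Player("carol", ["Hmm."]),  # alignment never stated by the GM
]
```

Keeping "Irrelevant" separate from "not labeled yet" makes it easy to see which players still need someone to go through the thread.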
The part I would like help with, if anyone wants to volunteer, is labeling the players in mafia threads, for example from the Beginner's Mafia threads. If you want to help, or are otherwise interested in what I'm doing, please PM me. Labeling by myself has taken some time, and the bigger the training set, the less biased it is and the better I can train the model.
I'll keep this post updated with my progress on this.
[Final] Extract posts (text) and players
[WIP] Label players - Perpetual WIP (I need more data to improve accuracy, at this stage)
[Final] Generate features from text (several ways to do it - currently tf-idf for SVM/NB/DTree and term count for NB, 1-grams, then LSA to reduce dimensionality)
[WIP] Train models based on the features, choose best one (SVM, Naive Bayes and Decision Trees).
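
For anyone curious what the feature and training steps in the list above look like in code, here is a rough sketch using scikit-learn. The choice of scikit-learn, the toy posts, the labels and the parameter values are all assumptions for illustration; this is not my actual pipeline or data.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up data: one document per player (all of their posts joined together).
docs = [
    "I think we should lynch the quiet player today",
    "that vote looks scummy to me, explain yourself",
    "I agree with the town read, nothing to add",
    "let us not rush the lynch, wait for more claims",
    "my read is that he is town, his vote made sense",
    "I am suspicious of the late vote switch",
]
labels = ["Town", "Mafia", "Town", "Mafia", "Town", "Town"]

model = make_pipeline(
    # tf-idf weights over single words (1-grams).
    TfidfVectorizer(ngram_range=(1, 1)),
    # LSA: truncated SVD on the tf-idf matrix to reduce dimensionality.
    # n_components=3 is only sensible for this tiny toy example.
    TruncatedSVD(n_components=3, random_state=0),
    # Linear SVM as one of the candidate classifiers.
    LinearSVC(),
)
model.fit(docs, labels)

# Predicting on the training data only shows that the pipeline runs;
# it says nothing about how well the model generalizes.
predictions = model.predict(docs)
```

One detail worth noting: LSA output can contain negative values, and scikit-learn's `MultinomialNB` rejects negative features, so in a setup like this the Naive Bayes variant would have to take the term counts or tf-idf weights directly instead of the LSA output.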
Edit (March 21): Two people (TheDarkStar and FallocyofUrist) have contacted me so far. I'll be finishing up the specification for how forum games + labels for players would be "uploaded" to the program, based on feedback on whatever's easiest to do.
Edit (March 24): The specification is done and I have a clean pipeline built. Now the bottleneck is a lack of data: lots of parameters and few data points mean that I currently overfit completely. So, I'll be working on adding labeled data, and I ask for help with that. I'll post the specification in a little bit as a separate post.
Edit (April 22): So, I've come back to this as I've gotten more time on my hands. I've added some data by hand, and my algorithms are finally outputting something other than "all are town". Accuracy is bad (the best accuracy currently still comes from the "all town" prediction...), and so are the F1 score and ROC-AUC. The bottleneck is clean data: there are tons of posts that are too short and need to be thrown out, as well as noise from "in!" posts and such.
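Here is a small sketch of the two things mentioned above: throwing out short and "in!" posts, and why the "all town" prediction looks fine on accuracy but not on F1. The cutoff `MIN_WORDS`, the sign-up pattern and the 7 Town / 3 Mafia split are made-up values for illustration, not what I actually use or measured.

```python
import re

MIN_WORDS = 5  # hypothetical cutoff; the right value would need tuning
# Matches posts that are just "in", "In!", "in.", and similar sign-ups.
SIGNUP_PATTERN = re.compile(r"^\s*in\b[\s!.]*$", re.IGNORECASE)


def is_usable(post):
    """Keep a post only if it is not a bare sign-up and is long enough."""
    if SIGNUP_PATTERN.match(post):
        return False
    return len(post.split()) >= MIN_WORDS


def accuracy(y_true, y_pred):
    """Fraction of players whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1(y_true, y_pred, positive="Mafia"):
    """F1 score for the positive class; 0.0 if it is never predicted correctly."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


posts = [
    "In!",
    "in",
    "I vote for Bob",
    "I think Alice is scum because she dodged the question",
]
usable = [p for p in posts if is_usable(p)]

# Made-up game: 7 Town and 3 Mafia players, classifier says "Town" for all.
y_true = ["Town"] * 7 + ["Mafia"] * 3
y_pred = ["Town"] * 10
baseline_accuracy = accuracy(y_true, y_pred)
baseline_f1 = f1(y_true, y_pred)
```

With an imbalanced game like this, the "all town" baseline gets most players right without ever finding a single Mafia player, which is why accuracy alone is a poor measure here.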
I'm contemplating a slightly different approach to deal with the lack of data, but I would be really grateful if someone would help with just labeling more data.