Topic: [On Hold] Machine Learning on Mafia games (Read 2484 times)

TolyK · « **on:** March 19, 2017, 06:18:10 pm »

Latest update:
It seems that the data I can collect is too noisy (heterogeneous, etc.), or maybe my techniques are just really bad; either way, I haven't had much success, nor the time to improve the system. As such, I'm putting this on hold for now.

Hi guys! Over the last few days I've been working on a project in Python. The idea is to train a machine learning classifier to differentiate between Town and Mafia players based on the text of their posts. So far, I've gotten down extracting all the users + posts from an entire Bay12 Forums thread, and saving it into neat objects.

The problem I have now is labeling the players as "Town", "Mafia", or "Irrelevant" (OP, people who are posting to watch, IC's in BMs, etc.). This has to be done by hand, and unfortunately a lot of GMs don't label the final alignments of players when they flip or when the game is over, or they do at the game-end post (which isn't the last one, because of discussion after the game...).

The part I would like help on, if anyone wants to volunteer, is labeling the players in mafia threads, for example from the Beginner's Mafia threads. If you want to help, or are otherwise interested in what I'm doing, please PM me. Labeling myself has taken some time, and the bigger the training set the less biased it is and the better I can train the model.

I'll keep this post updated with my progress on this.

Code: [Select]

[Final] Extract posts (text) and players
[WIP] Label players - Perpetual WIP (I need more data to improve accuracy, at this stage)
[Final] Generate features from text (several ways to do it - currently tf-idf for SVM/NB/DTree and term count for NB, 1-grams, then LSA to reduce dimensionality)
[WIP] Train models based on the features, choose best one (SVM, Naive Bayes and Decision Trees).

Edit (March 21): Two people (TheDarkStar and FallocyofUrist) have currently contacted me. I'll be finishing up the specification for how forum games + labels for players would be "uploaded" to the program, based on feedback on whatever's easiest to do.

Edit (March 24): Specification is done, I have a clean pipeline built, now the bottleneck is a lack of data - lots of parameters and few points mean that I completely overfit currently. So, I'll be working on adding labeled data, and I ask for help with that. I'll post the specification in a little bit as a separate post.

Edit (April 22): So, I've been back to this as I've gotten more time on my hands. I've added some data by hand, and my algorithms are finally outputting something other than "all are town". Accuracy is bad (currently best for the "all town" prediction...), and so are the F1 score and ROC-AUC. The bottleneck is clean data - there are tons of posts that are too short that I need to throw out, as well as noise from "in!" posts and such.
I'm contemplating a slightly different approach to deal with the lack of data, but I would be really grateful if someone would help with just labeling more data.

TolyK · « **Reply #1 on:** March 24, 2017, 08:03:31 am »

So, my current implementation of the program takes a folder full of files in the following format:

.csv (or .xls/.xlsx sheets, if you want)
First row - url to reach the thread.
Next rows - username, alignment (mafia, town, replacement, other), additional information (for replacements, who they were replaced by [required]; otherwise optional).

Again, m for mafia, t for town, r for replacement, o for others (these are currently ignored, but it's useful to have to debug sometimes).

Example:

Code: [Select]

http://www.bay12forums.com/smf/index.php?topic=152897.0
Jack A T, o, GM
frostmoon, m
griffonday, t
njw2000, t
tntey, t
fillipk, t
DoctormcTaalik, r, Ddynamo
Ddynamo, t
The Moonlit Shadow, t
Kilakan, r, Roo
Roo, t
Persus13, m

If you have some free time and want to help, PM me please! The bottleneck so far has been lack of data.
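
For anyone preparing files, here is a minimal sketch of how a file in the format above could be read. It is not the actual loader from the project: `GameData`, `parse_game` and `ALIGNMENTS` are hypothetical names chosen for illustration, and only the Python standard library is assumed.

```python
import csv
import io
from dataclasses import dataclass, field

# Single-letter codes from the format description above.
ALIGNMENTS = {"m": "mafia", "t": "town", "r": "replacement", "o": "other"}


@dataclass
class GameData:
    """Hypothetical container for one labeled game."""
    url: str
    players: dict = field(default_factory=dict)      # username -> alignment
    replaced_by: dict = field(default_factory=dict)  # replaced user -> replacement


def parse_game(text: str) -> GameData:
    """Parse one game file: the first row is the thread URL, the rest are players."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    rows = [row for row in reader if row]  # skip blank lines
    game = GameData(url=rows[0][0].strip())
    for row in rows[1:]:
        name, code = row[0].strip(), row[1].strip().lower()
        extra = row[2].strip() if len(row) > 2 else ""
        if code not in ALIGNMENTS:
            raise ValueError(f"unknown alignment code {code!r} for {name!r}")
        if code == "r":
            # The third column is required for replacements.
            if not extra:
                raise ValueError(f"replacement row for {name!r} needs a replacer")
            game.replaced_by[name] = extra
        game.players[name] = ALIGNMENTS[code]
    return game


example = """http://www.bay12forums.com/smf/index.php?topic=152897.0
Jack A T, o, GM
frostmoon, m
Kilakan, r, Roo
Roo, t
"""
game = parse_game(example)
print(game.players["frostmoon"], game.replaced_by["Kilakan"])
```

Rejecting unknown codes up front is a design choice made here so that typos in hand-labeled files surface early instead of silently dropping a player.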

origamiscienceguy · « **Reply #2 on:** March 27, 2017, 07:51:12 pm »

How do you plan to teach the program what posts are more likely attached to scum? neural networking?

TolyK · « **Reply #3 on:** March 28, 2017, 03:12:24 pm »

So, it might be useful to explain the pipeline.

Pre-processing
> Excel (.xls/.xlsx) -> GameData (.csv) by sheet
> GameData (.csv) -> load all HTML files for mafia thread -> parse into Posts, i.e. [ User : Text ] pairs -> merge into [ User : AllText ] pairs
> GameData (.csv) -> load all [ User : Alignment ] pairs, fit with previous to form [ User : Alignment : AllText ], save to disk
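
The merge and join steps above can be sketched roughly as follows. This is only an illustration under assumptions: `merge_posts` and `build_rows` are hypothetical helper names, the post texts are made up, and how replaced players' text should be credited is deliberately left out.

```python
from collections import defaultdict


def merge_posts(posts):
    """Collapse (user, text) pairs into one text blob per user, in post order."""
    merged = defaultdict(list)
    for user, text in posts:
        merged[user].append(text)
    return {user: "\n".join(texts) for user, texts in merged.items()}


def build_rows(posts, alignments, keep=("mafia", "town")):
    """Join merged text with labels, keeping only the classes used for training."""
    all_text = merge_posts(posts)
    return [
        (user, alignments[user], text)
        for user, text in all_text.items()
        if alignments.get(user) in keep
    ]


# Made-up posts for illustration only, not taken from a real game.
posts = [
    ("frostmoon", "I think we should wait."),
    ("griffonday", "Vote: frostmoon"),
    ("Jack A T", "Day 1 has begun."),
    ("frostmoon", "That vote is premature."),
]
alignments = {"frostmoon": "mafia", "griffonday": "town", "Jack A T": "other"}
rows = build_rows(posts, alignments)
print(rows)
```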

Machine Learning
> DataFrame is [ User : Alignment : Text ] rows. Really, I don't need the User column to train, so currently I throw it out.
> DataFrame is split into Training and Test sets, at random. I'm still testing whether I need to balance the Town/Mafia (class balance) ratio or not.
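
One option for the class balance question is a stratified split, which keeps the Town/Mafia ratio roughly the same in both sets. Note that this is a different technique from the plain random split described above, and `stratified_split`, the 25% test fraction and the toy rows are all assumptions made for this sketch.

```python
import random
from collections import defaultdict


def stratified_split(rows, test_fraction=0.25, seed=0):
    """Split (user, alignment, text) rows so each alignment keeps its share in both sets."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    train, test = [], []
    for label_rows in by_label.values():
        rng.shuffle(label_rows)
        # At least one row per class goes to the test set; a class with a
        # single row therefore ends up in the test set only.
        n_test = max(1, round(len(label_rows) * test_fraction))
        test.extend(label_rows[:n_test])
        train.extend(label_rows[n_test:])
    return train, test


# Toy rows with placeholder text: 12 town players and 4 mafia players.
rows = [(f"t{i}", "town", "placeholder") for i in range(12)]
rows += [(f"m{i}", "mafia", "placeholder") for i in range(4)]
train, test = stratified_split(rows)
print(len(train), len(test))
```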

On the Training set:
> Text is vectorized (a vocabulary of words and/or bigrams is learned, then each text is converted to numbers by either WordCount or TF-IDF) into vectors of a really large dimension count (>1000, though I can remove lots of words that occur only a few times, etc.)
> The vectorized text (TextFeatures) is then sent through a partial singular value decomposition (this is basically Latent Semantic Analysis) to lower dimensionality. In vague human terms, each resulting vector is a direction signifying the "substantive direction of the text" or something.

> I train an SVM (Support Vector Machine classifier) on the TextFeatures matrix + Alignment labels. I could use Naive Bayes or a Neural Net or something, but currently nothing really gives an answer, and SVM is fastest.
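
The training steps above could look roughly like this. The thread does not say which library is used, so scikit-learn is an assumption here, as are the parameter values (`n_components=3`, a linear kernel, `class_weight="balanced"`) and the toy texts and labels, which are made up and far too few to learn anything real.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up stand-in texts and labels, not real game data.
train_texts = [
    "vote for him he is scum",
    "i am town please believe me",
    "that vote looks like scum to me",
    "town should lynch the lurker",
    "i was busy sorry for lurking",
    "he is lurking so vote him",
]
train_labels = ["town", "mafia", "town", "town", "mafia", "town"]

model = make_pipeline(
    # Learn a vocabulary of words and bigrams, weight counts by TF-IDF.
    TfidfVectorizer(ngram_range=(1, 2)),
    # Truncated SVD on TF-IDF features is what is usually called LSA.
    TruncatedSVD(n_components=3, random_state=0),
    # class_weight="balanced" is one way to counter the Town majority
    # (an assumption, not something the original setup is known to use).
    SVC(kernel="linear", class_weight="balanced"),
)
model.fit(train_texts, train_labels)

# Unseen texts are transformed with the vocabulary and SVD learned in fit().
predictions = model.predict(["sorry i was busy", "vote the lurker"])
print(predictions)
```

Wrapping the steps in one pipeline means the vocabulary and the SVD are fitted on the training texts only, so nothing from unseen texts leaks into the features.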

On the Test set:
> I vectorize the Text using only the vocabulary learned from the Training set, get reduced TextFeatures using same transformation as before.
> Get Predictions from SVM using only the TextFeatures from the Test set
> Use metrics (accuracy/recall, F-score, etc.) to determine how far the Predictions are from the Labels.
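
To see why accuracy alone is misleading here, a small sketch of the metrics with "mafia" as the positive class. `binary_metrics` is a hypothetical helper and the 15 town / 5 mafia split is a made-up example, not the real label counts.

```python
def binary_metrics(labels, predictions, positive="mafia"):
    """Accuracy, precision, recall and F1, treating `positive` as the class of interest."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    correct = sum(1 for y, p in pairs if y == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# A classifier that answers "town" for everyone, on a made-up 15/5 split.
labels = ["town"] * 15 + ["mafia"] * 5
all_town = ["town"] * 20
scores = binary_metrics(labels, all_town)
print(scores)
```

The "all town" predictor gets 0.75 accuracy on this example while its recall and F1 for mafia are both 0, which is why the F-score is the more useful number to watch.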

So far, the SVM has only returned "Town" on all the data, which is mostly a result of very little repetition in the data I have. Basically, the dimensionality of TextFeatures is less than or around equal to the number of observations I have, so it can't really compute any useful metrics. (I have around 80 "mafia" and "town" labels processed, so far). So, the bottleneck is data. Sorry for being possibly a bit too technical.

TheBiggerFish · « **Reply #4 on:** April 03, 2017, 09:21:07 am »

Hold on, hold on, IC-ness is intrapost, and the Town ICs still play the game, IIRC?

TolyK · « **Reply #5 on:** April 04, 2017, 08:47:10 am »

Town IC's play starting from a specific BM... I can't say ATM since not at computer.

I do agree that town ic posting is different from normal town, but that currently isn't differentiated. That might indeed be a problem.

Dorsidwarf · « **Reply #6 on:** April 07, 2017, 05:47:20 pm »

You could always just cut out the BMs, since play in them is frequently highly atypical and liable to skew your program's inputs.
