I have had an amount of sleep sufficient to elevate me above counting on fingers, so a reply appears.
<snip>
Implementing this correctly will probably result in an AI that's not very good at macro-scale plans, but pretty good at not losing pieces to two- or three-move gambits, and possibly decent at throwing such gambits at you.
One of the problems with this approach for tafl games is that simple evaluation functions like counting material on the board don't really work all that well: your average tafl game has very few captures, and barring accidentally bad play, the first capture usually happens beyond the search horizon.
Encoding some strategic knowledge into move generation is something I've considered, but that's one of the things that made go AI so hard: strategic knowledge is inherently situational, and working out rules for when to apply other rules can turn into rules-ception pretty quickly. So far, I've had good results with good old-fashioned tree search with alpha-beta pruning, plus 2-3 plies of extension search on the five-ish best moves to help catch horizon effects; the hard part is a good board evaluation function, which I haven't yet hit on.
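To make the shape of that search concrete, here's a minimal Python sketch: plain alpha-beta, plus a re-search of the few best root moves a couple of plies deeper. `generate_moves`, `apply_move`, and `evaluate` are hypothetical stand-ins for whatever your engine provides (with `evaluate` assumed to score from the searching player's point of view); this is an illustration of the idea, not OpenTafl's code.

```python
# Minimal alpha-beta with a shallow extension on the best few root moves.
# generate_moves, apply_move, and evaluate are hypothetical helpers;
# evaluate is assumed to score from the searching player's perspective.

EXTENSION_PLIES = 2   # extra depth for promising lines
EXTENSION_WIDTH = 5   # the "five-ish best moves"

def alpha_beta(state, depth, alpha, beta, maximizing):
    if depth == 0 or state.is_terminal():
        return evaluate(state)
    if maximizing:
        best = float('-inf')
        for move in generate_moves(state):
            best = max(best, alpha_beta(apply_move(state, move),
                                        depth - 1, alpha, beta, False))
            alpha = max(alpha, best)
            if beta <= alpha:
                break  # beta cutoff: the opponent won't allow this line
        return best
    else:
        best = float('inf')
        for move in generate_moves(state):
            best = min(best, alpha_beta(apply_move(state, move),
                                        depth - 1, alpha, beta, True))
            beta = min(beta, best)
            if beta <= alpha:
                break  # alpha cutoff
        return best

def search_with_extension(state, depth):
    # Score every root move, then re-search the top few a couple of
    # plies deeper to push the horizon out on the critical lines.
    scored = [(alpha_beta(apply_move(state, m), depth - 1,
                          float('-inf'), float('inf'), False), m)
              for m in generate_moves(state)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    best_score, best_move = scored[0]
    for score, move in scored[:EXTENSION_WIDTH]:
        deeper = alpha_beta(apply_move(state, move),
                            depth - 1 + EXTENSION_PLIES,
                            float('-inf'), float('inf'), False)
        if deeper > best_score:
            best_score, best_move = deeper, move
    return best_move
```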
Another, more experimental option might be to train a neural net to evaluate board states, like AlphaGo did. I don't know exactly how AlphaGo did it, but I'm guessing it took a board state, tree-searched it to the end (using whatever rules you used for your base tree search), and then gave that board state (and most of the intermediate ones) a value based on the win-loss probabilities the tree search turned up.
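If that guess is right, the data-generation step could look something like the sketch below. `play_out_with_search` and `encode_state` are hypothetical helpers, and this is my reading of the idea, not AlphaGo's actual pipeline.

```python
# Hypothetical sketch: label positions with the outcome a deep search
# assigns them, producing (features, value) pairs a net could train on.
# play_out_with_search and encode_state are assumptions.

def label_states(games):
    examples = []
    for game in games:
        for state in game.states():
            # +1 if the search says this position leads to a win for the
            # side to move, -1 for a loss (or a probability in between).
            value = play_out_with_search(state)
            examples.append((encode_state(state), value))
    return examples
```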
One of the facts from the AlphaGo paper is that its board evaluation is really good: good enough that just generating all possible board states one move ahead and evaluating those makes it as good as or better than every computer go engine to date. Perfect board evaluation is impossible, but great board evaluation is very possible with NNs.
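That one-move-ahead player is tiny in code; assuming the same hypothetical `generate_moves`, `apply_move`, and `evaluate` helpers as before:

```python
# One-ply lookahead: generate every successor, score each with the
# evaluator, play the best. Strong evaluation makes even this competitive.
def greedy_move(state):
    return max(generate_moves(state),
               key=lambda move: evaluate(apply_move(state, move)))
```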
The way AlphaGo did its training initially was to look at one board state per game (more or less randomly selected, to avoid bias toward any particular style), and rate that board state on whether it led to victory or defeat. Once it had a solid foundation, it began to improve itself by self-play, generating games and rating states in those games the same way.
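A sketch of that sampling step, assuming a hypothetical game record that knows its states and its winner; the point of one position per game is keeping the training set from filling up with near-duplicate positions from the same game:

```python
import random

# Hypothetical sketch: one randomly chosen position per game, labeled with
# the game's eventual outcome from the perspective of the side to move.
# game.states(), game.winner, state.side_to_move, and encode_state are
# assumptions, not AlphaGo's actual interfaces.
def sample_training_pairs(games):
    pairs = []
    for game in games:
        state = random.choice(game.states())
        value = 1.0 if game.winner == state.side_to_move else -1.0
        pairs.append((encode_state(state), value))
    return pairs
```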
The problem here is, as you say, the corpus of games. One of the reasons I'm running this tournament is to get some quality computer players so I can begin to generate such a corpus.
(Of course, "count pieces and see who has more" is a stronger scoring function than you think -- a lot of chess bots paired it with strong short-term tactical search and did pretty well, IIRC.)
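For reference, that baseline fits in a few lines. The 2:1 weighting below is an illustrative guess reflecting the attackers' starting numerical advantage in most tafl variants, not a tuned value:

```python
# Bare material count. Positive favors the attackers. The weights are
# illustrative guesses: defenders start with roughly half as many pieces
# in most tafl variants, so each defender counts for roughly two attackers.
ATTACKER_WEIGHT = 1.0
DEFENDER_WEIGHT = 2.0

def material_eval(state):
    return (ATTACKER_WEIGHT * len(state.attacker_pieces)
            - DEFENDER_WEIGHT * len(state.defender_pieces))
```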
As I said, there's unfortunately no good analogue to this for tafl, as far as I'm aware. Evaluation functions are necessarily complex, more akin to the checkers stuff, and tafl has the added problem of lacking opening and endgame databases. (The former can be fixed, but the latter is basically impossible: tafl endgames are too complex to build a database for, although I'd love to build a database of solved corner states sometime.)
EDIT: Minor note -- it looks like the algorithm I'm describing when I say "tree search" is not strictly Monte Carlo tree search. After some quick reading, it seems typical to play out all the way to the end of the game, making random choices the whole way. I'm specifically saying you should play out not too deep, then substitute a better "evaluate board" function when you get there, so that your random picks are more likely to be representative. There are enough possible moves from each hnefatafl state that playing six or seven moves down at near-random will probably not give you a good valuation of the position, AFAICT.
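Concretely, the playout I mean looks something like this, with the same hypothetical helpers as the earlier sketches and a `state.result()` assumed to return the actual outcome at a terminal position:

```python
import random

# Truncated playout: random moves for a few plies, then hand the position
# to the evaluation function instead of playing to the end of the game.
def truncated_playout(state, max_plies=6):
    for _ in range(max_plies):
        if state.is_terminal():
            return state.result()  # a real result inside the horizon
        state = apply_move(state, random.choice(generate_moves(state)))
    return evaluate(state)  # fall back on the evaluator past the horizon
```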
You'd think that about go, too, which has a much bigger branching factor, but consider this: my tafl engine is relatively slow, but it can still generate and evaluate 30,000 states per second on my laptop, and 50,000-100,000 on my much faster desktop. At their longest, tafl games run about 150 moves. In thirty seconds of searching, that's at least 6,000 playouts on my laptop (more, since OpenTafl's AI spends a lot of time sorting potential moves at each state by 'best'), if you assume that random playouts longer than 150 moves are unlikely to be uniquely good. Six thousand playouts is enough to start finding pretty good pathways, if you have a decent move generator.
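The arithmetic behind that figure, using the laptop numbers above:

```python
states_per_second = 30_000   # laptop figure quoted above
search_time_s = 30           # seconds of thinking time
longest_game_moves = 150     # upper bound on playout length

playouts = states_per_second * search_time_s // longest_game_moves
print(playouts)  # 6000
```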
The thing about MCTS is that it doesn't require an explicit evaluation function, which is a huge benefit for go, since go positions are so hard to evaluate (at least, without a fancy neural network). The hybrid you're describing is interesting, though: a sort of depth-limited MCTS which falls back on an evaluation function when it can't finish a random playout quickly enough. This might prove to be a useful innovation for tafl games -- it isn't nearly as hard to write an evaluation function for tafl as it is for go, and the playouts buy a lot of depth cheaply, which lets your evaluation function be simpler. (Generally speaking, it's almost always better to search more deeply than to write a better evaluation function.)
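For completeness, here's a rough sketch of how that hybrid could slot into a standard UCT loop, reusing the `truncated_playout` sketch above. All the names are assumptions, the sign convention assumes the playout returns a value from the perspective of the player who just moved, and real node bookkeeping is elided; this is an illustration of the idea, not OpenTafl's implementation.

```python
import math
import random

class Node:
    def __init__(self, state, move=None, parent=None):
        self.state, self.move, self.parent = state, move, parent
        self.children = []
        self.visits = 0
        self.value = 0.0

def uct_select(node, c=1.4):
    # Balance average value against exploration; unvisited children first.
    def score(child):
        if child.visits == 0:
            return float('inf')
        return (child.value / child.visits
                + c * math.sqrt(math.log(node.visits) / child.visits))
    return max(node.children, key=score)

def mcts(root_state, iterations=6000):
    root = Node(root_state)
    for _ in range(iterations):
        # Selection: walk down the tree by UCT until we reach a leaf.
        node = root
        while node.children:
            node = uct_select(node)
        # Expansion: add the leaf's successors and pick one to simulate.
        if not node.state.is_terminal():
            node.children = [Node(apply_move(node.state, m), m, node)
                             for m in generate_moves(node.state)]
            node = random.choice(node.children)
        # Simulation: the truncated playout stands in for a full rollout;
        # its value is taken from the view of the player who just moved.
        reward = truncated_playout(node.state)
        # Backpropagation: flip the sign each ply for alternating players.
        while node is not None:
            node.visits += 1
            node.value += reward
            reward = -reward
            node = node.parent
    # Play the most-visited move from the root.
    return max(root.children, key=lambda child: child.visits).move
```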