Okay, after some sleep and with my internet still unusable (fuck this incompetent ISP), lemme address this in parts.
1. Web pages would only be opened once a day at most. The crawler won't constantly crawl the same page non-stop while you play. If the dungeon for a page is already generated and hasn't expired (expired = 24 hours without access and not a homepage), it has no reason to reload. Of course, it won't have the most recent version of the last page of a forum topic, but I think we can all live with that.
(And I think I've said this a few times before, so why does everyone insist on thinking it will be opening & refreshing pages like crazy?)
Reasons: Bandwidth, both for the game and the website. It would be rude to waste more than the minimum necessary bandwidth from a site just to procedurally generate a dungeon for a game. Plus there's no real need for it.
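To make the rule in point 1 concrete, here's a rough Python sketch of it. This is not game code (the real thing would live in the MOO database); `CachedPage`, `is_expired` and `needs_fetch` are names I made up, and it assumes "expired" means exactly what's written above: 24 hours without access and not a homepage.

```python
from dataclasses import dataclass
from typing import Dict

EXPIRY_SECONDS = 24 * 60 * 60  # 24 hours without access


@dataclass
class CachedPage:
    """Hypothetical record for one already-generated page."""
    url: str
    last_access: float        # Unix timestamp of the last time a player entered it
    is_homepage: bool = False


def is_expired(page: CachedPage, now: float) -> bool:
    # Homepages are exempt; everything else expires after 24h with no visitor.
    if page.is_homepage:
        return False
    return now - page.last_access >= EXPIRY_SECONDS


def needs_fetch(cache: Dict[str, CachedPage], url: str, now: float) -> bool:
    # Hit the website only if we have nothing for this URL or what we have expired.
    page = cache.get(url)
    return page is None or is_expired(page, now)
```

So a page someone walked through an hour ago never triggers a request, no matter how many players enter it.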
2. It will crawl pages with a unique user-agent header. So if any web admin gets pissed off at some game crawling their site, they can simply add the agent to their block list.
Reasons: We're trying to be nice here.
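For point 2, this is roughly what sending a unique user-agent looks like, sketched with Python's standard `urllib.request`. The agent string `WebDungeonCrawler/0.1` is just a placeholder I invented; the actual name is still to be decided.

```python
import urllib.request

# Hypothetical agent string, only for illustration.
USER_AGENT = "WebDungeonCrawler/0.1 (game crawler; loads a page once a day at most)"


def build_request(url: str) -> urllib.request.Request:
    # Every request carries the same identifiable User-Agent header,
    # so an admin can block it by name.
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_page(url: str, timeout: float = 10.0) -> bytes:
    # Downloads only the page itself: no images, no linked resources.
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read()
```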
3. It's not an aggressive robot. It will not attack link after link several times a day. It should be somewhat safe to ignore robots.txt but we probably shouldn't anyway.
Reasons: Respect, and most blocked pages seem pointless to crawl for the game anyway.
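If we do respect robots.txt, the check itself is cheap. A minimal sketch using Python's standard `urllib.robotparser`; the agent name is the same made-up placeholder, and the robots.txt text is passed in as a string so the check doesn't need its own network request here.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "WebDungeonCrawler"  # hypothetical agent name


def allowed_by_robots(robots_txt: str, url: str, agent: str = USER_AGENT) -> bool:
    # Parse the site's robots.txt and ask whether our agent may load this URL.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Example: an admin who wants the game gone entirely only needs two lines.
BLOCK_US = "User-agent: WebDungeonCrawler\nDisallow: /\n"
print(allowed_by_robots(BLOCK_US, "http://example.com/forum/topic1"))
```

A blocked link could then simply become a locked door in the dungeon instead of a room.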
4. If we're using a MOO, SVN will not work. You cannot feasibly program a MOO database offline. It must be programmed in the game while it's running. SVN can be used for database dumps, however.
Reason: Well, that's how MOOs work. The server compiles object verbs (called methods in other languages) to bytecode on the fly. The language is not a script language, it's compiled. The server dumps the code into text form when saving the database and recompiles it when starting the game again. That way you can do offline database maintenance if needed, and if you switch from one MOO version to another, the code won't horribly break everything, since the server will try to recompile it and show any errors that pop up instead of running stale compiled code right away and crashing things.
5. My internet link is being a horrible bitch right now. The ISP called earlier when I wasn't here and said they would call again later today, but they're incompetent bastards, so I doubt it. I've been having >35% packet loss & 600-3000ms ping to their own root server (10.10.0.1, which is an internal IP range, btw) for 2 weeks now. They've sent 2 crews over, who noticed I was right when I said the problem wasn't on my end, and they still haven't done a thing to fix it. So don't expect this done any time soon; I still have to get a minimal database ready so people can log in and poke around. If it's not me doing it, then I'm sure you guys can find someone else or start this somehow. At this moment I'm too pissed about this issue to put any honest work into this.
Reasons: Stupid ISP. >.<
Addendum since people are still at it:
6. Even if it's not me doing this or someone else picks it up, please keep this rule for when the game crawls a web page. It crawls on the fly and generates a dungeon on the fly when a player tries to enter a link. It will only make an external connection to that link if the page is either not already generated or expired. In fact, when a page expires it gets deleted from the database, so the only case where the game ever crawls a page is when it isn't already generated.
Things it will not do:
-Will not follow links automatically (It will only load links when a player tries to enter one)
-Will not download images (If we use MXP tags, the client can download images itself)
-Will not download anything other than the page itself (No reason to)
-Will not load a single webpage more than once every 24 hours
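Point 6 plus the list above boils down to something like the following. Again a Python sketch under my own assumptions, not the MOO implementation: `DungeonStore` is a made-up stand-in for the game database, the dungeon generator is a placeholder string, the fetch function is passed in from outside, and the homepage exemption is left out to keep it short.

```python
import time
from typing import Callable, Dict, Optional, Tuple

EXPIRY_SECONDS = 24 * 60 * 60  # 24 hours without access


class DungeonStore:
    """Hypothetical stand-in for the database of generated dungeons."""

    def __init__(self, fetch: Callable[[str], str]):
        self._fetch = fetch  # the only place an external connection happens
        self._pages: Dict[str, Tuple[str, float]] = {}  # url -> (dungeon, last access)

    def purge_expired(self, now: float) -> None:
        # Expired pages are deleted outright, so afterwards
        # "expired" and "never generated" are the same case.
        stale = [u for u, (_, seen) in self._pages.items()
                 if now - seen >= EXPIRY_SECONDS]
        for url in stale:
            del self._pages[url]

    def enter_link(self, url: str, now: Optional[float] = None) -> str:
        # Called only when a player tries to enter a link; nothing is followed automatically.
        now = time.time() if now is None else now
        self.purge_expired(now)
        if url in self._pages:
            dungeon = self._pages[url][0]
        else:
            html = self._fetch(url)
            # Placeholder for the real procedural generator.
            dungeon = f"dungeon built from {len(html)} chars of {url}"
        self._pages[url] = (dungeon, now)  # record this access
        return dungeon
```

Since a fetch also counts as an access, a page can't be loaded again until at least 24 hours after it was last loaded.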
Things it will do:
-Whenever a new word is found, it will be looked up in about 2 dictionaries for its definition, synonyms, etc., only -once- per word, ever. This does not expire unless an admin decides to delete a word from the database for some reason. The lookups could be queued and done one every 30 seconds to avoid wasting bandwidth. Preferably the common words should be in the database before the game is in a working state.
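The queued lookup could work something like this. A Python sketch with invented names (`WordQueue`, `saw_word`, `tick`); the `lookup` function stands in for whatever dictionaries end up being used, and the 30 second interval is the figure from above.

```python
from collections import deque
from typing import Callable, Deque, Dict, Optional, Set

LOOKUP_INTERVAL = 30.0  # seconds between dictionary requests


class WordQueue:
    def __init__(self, lookup: Callable[[str], str]):
        self._lookup = lookup             # hypothetical dictionary lookup function
        self.known: Dict[str, str] = {}   # word -> definition, never expires
        self._pending: Deque[str] = deque()
        self._queued: Set[str] = set()
        self._last_lookup: Optional[float] = None

    def saw_word(self, word: str) -> None:
        # Queue a word only if it is neither known nor already waiting.
        word = word.lower()
        if word not in self.known and word not in self._queued:
            self._queued.add(word)
            self._pending.append(word)

    def tick(self, now: float) -> Optional[str]:
        # Call this regularly; it does at most one lookup per LOOKUP_INTERVAL.
        if not self._pending:
            return None
        if self._last_lookup is not None and now - self._last_lookup < LOOKUP_INTERVAL:
            return None
        word = self._pending.popleft()
        self._queued.discard(word)
        self.known[word] = self._lookup(word)
        self._last_lookup = now
        return word
```

Preloading the common words would just mean filling `known` before the game opens, so the queue only ever sees the rare stuff.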