So for all of you wonderful people out there wanting a Loud Whisper Official Posts Authentically Datamined Textfile of Words, here ya go. It took friggin forever even with the scraper (It's 15 fucking megabytes holy shit), but here's 32,000 posts worth of distilled LW. 200 proof.
The weirdest textfile you ever did see. (It's on tinyupload cause it's too big for pastebin)
Side Note: I am actually ridiculously proud of the scraper, having never programmed a working thing in my life beyond a simple calculator. It's a work of art. Even if it is likely horrible optimization-wise.
If you want to use the scraper, I'll post the code below. It requires the following:
requests (Can be downloaded through pip)
BeautifulSoup 4 (Also pip)
lxml (Ditto)
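If you've not used pip before, this one line should grab all three at once (the pip package name for BeautifulSoup 4 is beautifulsoup4):
pip install requests beautifulsoup4 lxml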
I take no responsibility for anything this does to your computer, blah blah blah, use at your own risk, try to not crash any websites. It's commented, but only as far as I understand it. It's likely that some comments are entirely wrong and show a fundamental misunderstanding of everything.
############ O.Wilde's Bay 12 Scraper ############
################## V.1 1/30/16 ##################
import requests
import bs4
import lxml
import re
from html.parser import HTMLParser
###########################################################################
def findprofile(profilepage):
    print('Finding pages to scrape...')
    user = re.sub(re.escape('http://www.bay12forums.com/smf/index.php?action=profile;u='), '', profilepage) #Removes everything but the user ID number from the profile link. re.escape keeps the ? and . from being treated as regex wildcards.
    url = 'http://www.bay12forums.com/smf/index.php?action=profile;area=showposts;sa=messages;u=' + user #Adds the user ID number obtained in the last step to build the link to their messages page.
    return url
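# For example, with a made-up user ID: a profile link like
#   http://www.bay12forums.com/smf/index.php?action=profile;u=12345
# should come back out of findprofile as
#   http://www.bay12forums.com/smf/index.php?action=profile;area=showposts;sa=messages;u=12345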
def scrapeposts(url):
    page = requests.get(url)
    soup = bs4.BeautifulSoup(page.text, "lxml") #Takes the text of the HTML contained on the URL messages page and makes it usable for our purposes
    data = [a.attrs.get('href') for a in soup.select('div.pagesection a.navPages')] #Grabs the href of every page-navigation link (a.navPages) inside the div.pagesection blocks, i.e. the links to the other pages of posts.
    ppg = re.sub('http(.+?)start=', '', data[0]) #Finds the number of posts per page (the start value of the second page equals the posts shown per page)
    tp = re.sub('http(.+?)start=', '', data[len(data)-1]) #Finds the post number that the final page of posts starts on
    pagenumber = (int(tp) // int(ppg)) + 1 #Finds the total number of pages of posts (integer division, so the page count stays a whole number)
    counter = 0
    scrapedata = ''
    while counter < pagenumber: #While the page we are working on is less than the total number of pages
        counter = counter + 1 #Add 1 to the counter
        print('Now scraping page ' + str(counter) + ' out of ' + str(pagenumber) + '!')
        scrapeurl = url + ';start=' + str((counter - 1) * int(ppg)) #Builds the URL of this page of posts; each page starts ppg posts after the last one
        scrapepage = requests.get(scrapeurl)
        scrapesoup = bs4.BeautifulSoup(scrapepage.text, "lxml") #Takes the text of the HTML contained on that page and makes it usable for our purposes
        scrapetext = scrapesoup.select('div.list_posts') #Selects the data contained in our HTML that we want to mine. Specifically, the posts on this page
        scrapedata = scrapedata + ' ' + str(scrapetext) #Adds our freshly scraped data to the string of scraped data mined so far.
    print('Done!')
    return scrapedata
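# Rough worked example with made-up numbers: the navPages links end in things like
# ;start=15, ;start=30, ... ;start=3195. The second page starting at 15 means 15 posts
# per page, and a final page starting at post 3195 gives 3195 // 15 + 1 = 214 pages
# for the loop to walk through.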
def cleardata(data):
    print('Removing HTML...')
    cleareddata = re.sub('<[^<]+?>', ' ', str(data)) #Removes all strings contained within <...>, this is to remove HTML tags. Replaces with a space.
    print('Removing Quote Tags...')
    cleareddata1 = re.sub("Quote(.+)pm", ' ', cleareddata) #Removes quote headers ending in pm and replaces with a space
    cleareddata2 = re.sub("Quote(.+)am", ' ', cleareddata1) #Same as above, but for am. (I could do this in one step, but I don't know how.)
    print('Removing Tabs...')
    cleareddata3 = re.sub(r'\s', ' ', cleareddata2) #Replaces every whitespace character (tabs, newlines, etc.) with a plain space
    print('Removing Non-ASCII Data...')
    cleareddata4 = re.sub(r'[^\x00-\x7F]', ' ', cleareddata3) #Removes any non-ASCII characters so the text file can be written without encoding errors.
    print('Removing Spaces...')
    cleareddata5 = re.sub(r'\s+', ' ', cleareddata4) #Collapses the clutter of spaces created by the previous few steps into a single space each.
    return cleareddata5
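# Quick before/after with a made-up snippet: something like
#   <div class="post">Urist likes plump helmets</div>
# comes out of cleardata as roughly
#   ' Urist likes plump helmets '
# (tags swapped for spaces, then the extra spaces squashed down to one apiece).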
def writefile(data, name):
    print('Writing File...')
    file = open(name + '.txt', 'w') #Creates a new file in which to save our data
    file.write(data) #Writes our data to the file
    file.close() #Closes the file
    input('Posts have been scraped, and file created. Thank you for using the Bay 12 Scraper by O.Wilde!')
###########################################################################
url = findprofile(input("Please input the profile of the member whose posts you would like to scrape: ")) #Asks for a profile link to scrape, and calls findprofile using that link. Sets url equal to the returned value
messydata = scrapeposts(url)
cleandata = cleardata(messydata)
writefile(cleandata, input('Please input the name of the text file you want to be generated. WARNING: Any file with the same name will be overwritten!!!: '))
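To actually run it: save the code as something like bay12scraper.py (the name doesn't matter, that's just an example), make sure you're on Python 3 (it's written for 3, not 2), then:
python bay12scraper.py
It'll ask for the profile link, chug through the pages, and ask what to call the text file at the end.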