I am examining how certain phrases crop up in vBulletin web forums to see how ideas spread among Internet audiences over time. I’m having problems and suggestions are welcome. (Update: to see how this research worked out, see this post)
Method at the moment…
After trying a few alternatives, I've started to use an application called Sitesucker, which downloads the messages from the forum onto a local disk as a series of HTML files.
For the data mining I'm using Anthracite, which seems to have the right features: it can strip the meta text, find phrases with the excerpts surrounding them, and grab the date of posting, the user's reputation, etc. However, it keeps crashing when I feed it larger numbers of files.
What I need is a tool to go through the forum, find a number of phrases that I am interested in, and then extract: a) the paragraphs or a set number of characters around each instance of the target phrases; b) the time and date of the posts they appear in; c) the title of the thread; d) the user name and reputation of the poster; e) possibly a summary of follow-up posts. The tool would ideally produce a file I can look through manually, and also output a CSV for statistical analysis.
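As a rough illustration of items a) and c), here is a minimal Python sketch that reads the downloaded HTML files, finds each target phrase, and writes the surrounding characters to a CSV. It is only a starting point under stated assumptions: the function names (find_excerpts, scan_folder, write_csv) and the column names are made up for this example, the thread title is assumed to be the page's <title> tag, and the files are assumed to end in .html. It does not attempt the post date, user name or reputation, since those depend on the markup of the particular forum template.

```python
import csv
import re
from html.parser import HTMLParser
from pathlib import Path

# Hypothetical column names for this sketch.
FIELDS = ["file", "thread_title", "phrase", "excerpt"]


class TextExtractor(HTMLParser):
    """Collects visible text and the page <title>, skipping script/style."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.title_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif not self._skip_depth:
            self.chunks.append(data)


def find_excerpts(html, phrases, context=200, source=""):
    """Return one row per phrase match, with `context` characters either side."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so excerpts read cleanly.
    text = " ".join(" ".join(parser.chunks).split())
    title = " ".join("".join(parser.title_parts).split())
    rows = []
    for phrase in phrases:
        for m in re.finditer(re.escape(phrase), text, flags=re.IGNORECASE):
            start = max(0, m.start() - context)
            end = min(len(text), m.end() + context)
            rows.append({"file": source, "thread_title": title,
                         "phrase": phrase, "excerpt": text[start:end]})
    return rows


def scan_folder(folder, phrases, context=200):
    """Run find_excerpts over every .html file under `folder`."""
    rows = []
    for path in sorted(Path(folder).rglob("*.html")):
        html = path.read_text(encoding="utf-8", errors="replace")
        rows.extend(find_excerpts(html, phrases, context, source=str(path)))
    return rows


def write_csv(rows, out):
    """Write rows to an open text file (or any file-like object) as CSV."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```

To use it, call scan_folder on the download folder with a list of phrases, then pass the result to write_csv with a file opened using newline="". Because each file is processed on its own, a large download should not need to fit in memory all at once.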
Post script – Daniel Lee may have a solution… Will update shortly.
Post script 2:
Since posting this, some possible alternatives have been suggested:
Skec suggests Web Sphinx, which is a great Java crawler and can certainly find instances of text, but it would take a lot of work (correct me if I'm wrong) to effectively excerpt the data I need.
Martin suggests Automap, which may come in handy at a later stage if I need to use machine analysis to show how frequently phrases appear near other relevant terms.
I'll try them all, but right now I'm trying to get a better understanding of Anthracite, and waiting on Daniel's custom tool.
Ideas and suggested apps/approaches are still welcome!