Over the weekend I started work on a new project. As I've mentioned before, I want to make an app that will assist others in reading the encyclopedia. Of course, that means Wikipedia rather than Britannica.
The first step involved processing the vast amount of data in Wikipedia. I downloaded the 15 GB bzip2-compressed file containing all of Wikipedia's textual data. I spent most of Saturday night and Sunday afternoon working on it, and I now have a program that processes that file (without needing to decompress it first) and outputs an XML file with a stripped-down version of each article: only the introduction, the article id, and an image from the article if possible. By my calculations, it should be able to process the whole file in about 30 hours.
There were plenty of obstacles. I needed to find a Java library that handles streaming a bzip2 file. I found one in the Ant project. I needed to parse the XML in a streamed manner (there's no way I could load a 100 GB decompressed XML file into memory). That was easy enough. I've used Java StAX before.
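A minimal sketch of the streamed parsing, using only the StAX classes that ship with the JDK. The element names `page` and `title` are assumptions about the dump layout, and `StreamingTitles` and `readTitles` are hypothetical names made up for illustration. The method takes any `InputStream`, so in the real pipeline the argument would be the bzip2-decompressing stream rather than the in-memory one used in the usage line below.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: pulls article titles out of an XML stream one event
// at a time, so the whole document never has to fit in memory.
class StreamingTitles {

    // Returns the text of every <title> that appears inside a <page>.
    // The element names are assumed, not taken from the original program.
    static List<String> readTitles(InputStream in) {
        List<String> titles = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(in);
            boolean inPage = false;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if (name.equals("page")) {
                        inPage = true;
                    } else if (inPage && name.equals("title")) {
                        // getElementText() consumes the text up to </title>.
                        titles.add(reader.getElementText());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && reader.getLocalName().equals("page")) {
                    inPage = false;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse the XML stream", e);
        }
        return titles;
    }
}
```

Calling `StreamingTitles.readTitles(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))` on a small test string is enough to try it out. For a full dump it would make more sense to write each record out as it is read instead of collecting everything in a list.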
Finally, I needed to parse the MediaWiki format, strip out all the data I wanted to ignore, convert it to a format that I could use in an Android app, then find a useful image link from the article. This was tricky, but I found a library called Sweble that parses MediaWiki format. It's poorly documented, but once I figured it out, it worked beautifully.
MediaWiki articles sometimes have image links in the description, sometimes they have them in a sidebar, sometimes they have them only in the main article, and sometimes they have no image at all. I can now handle most of the easy cases, but to save time I currently omit the body of the article from parsing. That means I have no image link for any article that has image links exclusively in the body. I'll have to work out how to handle that without adding significant time to the parsing.
The next step is to get genre information and compute a page rank for each article so that I can determine which articles to prioritize. When someone is reading the encyclopedia, it'll be terribly boring if I only show them the most popular articles. Those would include almost nothing but celebrities and current events. Some might be interested in that, but I want to split things into genres and then include only the most popular and/or important articles in each genre.
Importance is hard to judge, though. Popularity is one way (judged by page hits). Another way is page rank (a Wikipedia article linked to by lots of other Wikipedia articles has a higher page rank than one that's never linked to). Another way is centrality. If you plot a graph of how everything is linked, some articles will show up more centrally whether or not they are linked directly to more articles.
I wasn't able to find a freely available source of page rank data for Wikipedia, so I'll have to calculate it myself. That'll be my next project. I'll parse the Wikipedia data and output a list of pairs indicating which articles link to which other articles. Then I'll generate page rank data from that.
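As a rough sketch of that second step, here is one common way to turn a list of link pairs into page rank scores: plain power iteration. The damping factor, the iteration count, and the choice to spread the rank of articles with no outgoing links evenly over all articles are assumptions, and `PageRankSketch` and `rank` are hypothetical names. Duplicate pairs count as extra links in this version.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: page rank by power iteration over (from, to) pairs.
class PageRankSketch {

    // links: each element is {fromArticle, toArticle}.
    static Map<String, Double> rank(List<String[]> links, double damping, int iterations) {
        Set<String> nodes = new LinkedHashSet<>();
        Map<String, List<String>> outgoing = new HashMap<>();
        for (String[] link : links) {
            nodes.add(link[0]);
            nodes.add(link[1]);
            outgoing.computeIfAbsent(link[0], k -> new ArrayList<>()).add(link[1]);
        }

        int n = nodes.size();
        Map<String, Double> rank = new HashMap<>();
        for (String node : nodes) {
            rank.put(node, 1.0 / n); // start with a uniform distribution
        }

        for (int i = 0; i < iterations; i++) {
            // Rank held by articles that link to nothing is shared by everyone.
            double danglingMass = 0.0;
            for (String node : nodes) {
                if (!outgoing.containsKey(node)) {
                    danglingMass += rank.get(node);
                }
            }
            double base = (1.0 - damping) / n + damping * danglingMass / n;

            Map<String, Double> next = new HashMap<>();
            for (String node : nodes) {
                next.put(node, base);
            }
            // Each article splits its damped rank evenly among its link targets.
            for (Map.Entry<String, List<String>> e : outgoing.entrySet()) {
                List<String> targets = e.getValue();
                double share = damping * rank.get(e.getKey()) / targets.size();
                for (String target : targets) {
                    next.merge(target, share, Double::sum);
                }
            }
            rank = next;
        }
        return rank;
    }
}
```

A call like `PageRankSketch.rank(links, 0.85, 50)` returns scores that sum to roughly 1. Keeping every pair in in-memory maps is fine for a toy graph, but the full Wikipedia link graph would probably need integer ids and more compact arrays.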
I'm hoping to find genre information, but I may need to derive that from information in the article itself, like the infoboxes. Then, for articles without infoboxes, I could probably determine genre by finding how strongly they are linked to articles of known genre. That'll probably be a big project. I'm still hoping I can find sources of this data so that I can make use of it, but if it's not available, it should be fun to figure it out. If it works out I could even try to make it available to others (although I don't think I want to pay bandwidth costs, so I'd need to find a place to put it that wouldn't charge me for bandwidth).
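One very simple reading of "how strongly they are linked" is a neighbor vote: look at every article linked to or from the unlabeled one and pick the genre that shows up most often. The sketch below assumes exactly that, counting links in both directions equally and breaking ties arbitrarily. `GenreGuess` and `guess` are hypothetical names, and a real version would likely want weighting and a minimum number of votes.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: guess an article's genre from its labeled neighbors.
class GenreGuess {

    // links: each element is {fromArticle, toArticle}.
    // knownGenres: article title -> genre, for the articles already labeled.
    static Optional<String> guess(String article, List<String[]> links,
                                  Map<String, String> knownGenres) {
        Map<String, Integer> votes = new HashMap<>();
        for (String[] link : links) {
            String neighbor;
            if (link[0].equals(article)) {
                neighbor = link[1];   // outgoing link
            } else if (link[1].equals(article)) {
                neighbor = link[0];   // incoming link
            } else {
                continue;             // link does not touch this article
            }
            String genre = knownGenres.get(neighbor);
            if (genre != null) {
                votes.merge(genre, 1, Integer::sum);
            }
        }
        // Empty when no labeled neighbor exists; ties are broken arbitrarily.
        return votes.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }
}
```

Returning an `Optional` keeps the "no labeled neighbors at all" case explicit, which is likely to be common for obscure articles.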
Of course, all this is what I need to do before I even bother developing a portable app for reading the information. I'm not even going to think about that until I've done all this data processing. That part is probably more fun anyway. I'm interested in probability and statistics. I got a minor in math in college, but I somehow never went through a statistics course. I've gotten a few weeks into an online MIT course in the subject, but I've been putting off working on that while I've been doing this encyclopedia stuff. I'll get back to it though. It seems like it'll come in useful for data mining in any case.