Building a (fast) Wikipedia offline reader →
https://marco.org/2008/08/14/building-a-fast-wikipedia-offline-reader
From inky:
An ingenious method for reading Wikipedia offline. It efficiently uses the 3.9 GB compressed XML dump without requiring additional hard disk space.
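The core trick is that the dump's bzip2 container can be cut into independently decompressible pieces, so an index of byte offsets lets you seek straight to the piece holding the article you want instead of decompressing the whole file. Here's a minimal sketch of that idea, assuming one of the pages-articles-multistream dumps, where each offset in the companion index file (`offset:pageid:title` lines) starts a self-contained bz2 stream of a batch of pages. The linked article rolls its own index, so treat this as the shape of the technique, not his implementation:

```python
import bz2

def read_article(dump_path, offset, title):
    """Decompress the single bz2 stream starting at `offset` and pull
    one article out of it. Assumes a pages-articles-multistream dump;
    `offset` and `title` come from its companion index file."""
    decomp = bz2.BZ2Decompressor()
    chunks = []
    with open(dump_path, "rb") as f:
        f.seek(offset)
        while not decomp.eof:          # stop at the end of this one stream
            block = f.read(64 * 1024)
            if not block:
                break
            chunks.append(decomp.decompress(block))
    xml = b"".join(chunks).decode("utf-8")
    # Each stream holds a batch of <page> elements; find ours by title.
    # (Assumes the title needs no XML escaping.)
    hit = xml.find("<title>%s</title>" % title)
    if hit == -1:
        return None
    start = xml.rfind("<page>", 0, hit)
    end = xml.find("</page>", hit) + len("</page>")
    return xml[start:end]
```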
I haven’t talked about it much here, but I’m actually very familiar with the Wikipedia data dumps. At my last job, one of my projects was the Clusty Wikipedia search. I wrote the entire pipeline to download the dump files, extract the useful metadata, and convert the article contents to the search engine’s input XML format. I also had to generate short abstracts to show in the Firefox toolbar’s “Clusty Clips” popups (those were actually Tiff’s idea).
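The dump is one enormous XML file, so a pipeline like that has to stream it rather than load it. A rough sketch of the streaming half using Python's iterparse; the namespace string and the abstract heuristic are my assumptions, and the real Clusty pipeline did much more:

```python
import bz2
import xml.etree.ElementTree as ET

# MediaWiki export schema namespace; the version suffix varies by dump.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def iter_articles(dump_path):
    """Stream (title, wikitext) pairs out of a pages-articles dump
    without ever holding the multi-gigabyte file in memory."""
    with bz2.open(dump_path, "rb") as f:
        for _, elem in ET.iterparse(f):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title")
                text = elem.findtext(NS + "revision/" + NS + "text") or ""
                yield title, text
                elem.clear()  # discard the subtree we just consumed

def short_abstract(text, limit=250):
    """Crude stand-in for abstract generation: first paragraph, truncated."""
    return text.strip().split("\n\n", 1)[0][:limit]
```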
One of my project ideas has always been a complete Wikipedia offline reader (in case you haven’t noticed, I like offline web-page reading). I’ve started and abandoned a few prototypes. And I’m thinking of making one for the iPhone.
It’s not for the faint of heart. The Wikipedia data is huge, inconsistently formatted, unranked, and full of obscure list articles that you really don’t want cluttering up the index, especially if you’re short on space. None of that matters much if you’re willing to park the full 4 GB dump on a desktop hard drive, but it’s much harder to intelligently select a good subset and compress it down to an acceptable size for cramped laptop drives or an iPhone.
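A lot of that pruning can be done with cheap heuristics on titles and wikitext before any real ranking happens. A sketch of the kind of filter I mean; every pattern and threshold here is an assumption, not a tested recipe:

```python
def worth_keeping(title, text, min_chars=500):
    """Heuristic filter for carving a small offline subset out of the dump.

    Drops redirects, non-article namespaces, and the list/disambiguation
    pages that bloat an index. Every rule here is a judgment call."""
    lowered = text.lstrip().lower()
    if lowered.startswith("#redirect"):
        return False
    if ":" in title:                      # Talk:, Template:, Category:, ...
        return False
    if title.startswith(("List of ", "Index of ")):
        return False
    if "{{disambig" in lowered:           # most disambiguation templates
        return False
    return len(text) >= min_chars         # drop stubs below a size threshold
```

Chained onto a streaming pass like the iter_articles sketch above, that gives a first cut at a subset; ranking what survives is the harder problem.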
I have a few ideas on how to do it, of course. This will continue to sit in my idea closet for a few more years, at least.