February 15, 2003

Distributed Proofreading

I have been reading books backwards, odd pages only, and with unusual attention to their punctuation; I have been sucked into the Distributed Proofreading project. A page a day for the public domain, you know. Proofreading a page a day doesn't seem so useful the slightly-old way, with a proofreader assigned to each book; much can happen in a year, and your partial efforts can easily be lost. Also it's daunting. DP just shows you a black-and-white image of a single page, and checks out a copy of the OCR text produced from it, and if all you can do today is that page, fine; the improvement in the common store of available text is proportional to your effort.

I'd be better off if I could keep it down to a page a day; I find it as mindlessly enthralling as Tetris. I'm not even attempting the hard stuff yet, the Hakluyt or Anatomy of Melancholy. Latin, Greek, superscripts as contractions, ligature characters, poetry, nested footnotes; tricky. Especially tricky because the Project Gutenberg standards, towards which DP tends its efforts, are all old skool Latin-1 completely linear text. I was made a bit sad taking the page numbers out of an index; the topic titles in a good index are not always enough to find the page referred to, because a good index may file a topic under an explicit term when the text identifies it only in context; "Clarissa Character, bankruptcy of" might point to a paragraph saying "From epistolary evidence, in this year his sister signed over her share of the inheritance completely, and it was lost with the whole." Hypertext can be very good at this, of course, and at footnotes and endnotes. Really excellent footnotes are a form of commented linking that hypertext is still thinking about. It seems a pity to be washing out some of the links that we could improve instead. On the other hand, do I feel like doing it all myself? Not that book, no. Maybe I'll think of a tool.

Really, embarrassingly, if the two goals are to allow fairly precise internal references and to be readable by both machines and humans, page numbers are not at all bad. Three or four digits every eighty lines? There aren't many HTML anchors smaller, let alone identifiers in newer, cooler schemes.
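
For the curious, here is a rough sketch of what keeping the page numbers might look like, in Python. Everything in it is my own assumption: the `[p. 212]` marker on a line by itself is a made-up convention (not a Project Gutenberg or DP standard), and the function names and the `page-N` anchor ids are invented for illustration. It turns page markers into HTML anchors and links the trailing page numbers of an index entry to them.

```python
import re

# Hypothetical marker: a line holding only "[p. 212]".
# This format is an assumption, not a Project Gutenberg or DP convention.
PAGE_MARKER = re.compile(r"^\[p\. (\d+)\]$")

# Numbers in the trailing comma-separated list of an index entry.
TRAILING_PAGE = re.compile(r"(\d+)(?=(?:, \d+)*$)")


def pages_to_anchors(lines):
    """Replace page-marker lines with HTML anchors; leave other lines alone."""
    out = []
    for line in lines:
        m = PAGE_MARKER.match(line.strip())
        if m:
            out.append('<a id="page-%s"></a>' % m.group(1))
        else:
            out.append(line)
    return out


def link_index_entry(entry):
    """Link the page numbers at the end of an index entry to their anchors."""
    return TRAILING_PAGE.sub(r'<a href="#page-\1">\1</a>', entry)


if __name__ == "__main__":
    text = ["[p. 211]", "From epistolary evidence, in this year...", "[p. 212]"]
    print("\n".join(pages_to_anchors(text)))
    print(link_index_entry("Clarissa Character, bankruptcy of, 212"))
```

It only gets you to the page, not the paragraph, which is all the printed index ever promised. It will also mistake a year for a page number if the year sits right before the page list ("War of 1812, 45"), so a human would still have to look the results over.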

Many sites are storing images of the originals, although they present a more emphatic choice between easy-to-read formats, images good enough to be useful and attractive, and compression algorithms that will be reversible later. So wrote clew in Meta.
