July 29, 2010

Bookselling: bushels or boutique

The resale value of books seems to have collapsed recently; physical booksellers near me remark that they have to buy at the lowest price found online plus shipping, which, for almost all books, is effectively the cost of shipping. You'd think the post office would be doing better.

That makes electronic books that we can't resell rather a lot closer, economically, to physical books; and physical books are hard to grep. If the DMCA is restrained until we can back up our own books, this might be a wash for most readers (assuming we learn to back up and index our digital possessions, which is also getting cheaper and easier).

Selling physical books seems increasingly quixotic. There's a warehouse in Seattle where booksellers who krill-filter the few books worth indexing sell the rest by weight. Taking the opposite approach, there's Ada's Technical Books, in a charming stone building (the Loveless building!) in Capitol Hill. Now, why buy there, instead of more cheaply? They're right close to a local hackerspace, so some clients will want to buy while the soldering iron is hot; and they're a nice place to be, with chairs and witty adornments and, generally, assistance in constructing a representation of self; one with some momentum. *I* bought.


June 22, 2010

Terrible court decision on copyright

Court Says It's Okay To Remove Content From The Public Domain And Put It Back Under Copyright. It's in a series of appeals, so we might not be doomed yet, but if Congress and the courts ignore the GAO and the First Amendment -- and non-centralized value, for that matter -- in favor of RIAA, the MPAA, increased control, and profits for a few, then the ratchet will keep strangling the public domain.

February 15, 2010

Used books support book ecology?

Presumably the market for used books raises the price of new books.

If almost all new-book stores vanished, leaving us to buy new physical books online from middlemen or direct from the publishers, used bookstores would still be useful. Some would still be big enough or closely enough attuned to their customers to carry some new books. Publishers would get paid and editors and authors would be paid and we would all have market access to new and old books. People who don't have computers would still have somewhere to go. Good enough. I find that the local-culture, book-recommending, serendipitous benefits of bookstores are stronger in ones that carry used books already. Abebooks.com evidently keeps some specialist used bookstores afloat while letting them keep a storefront; that seems like a very good combination of internet one-market-to-rule-them-all and independent shops.

I hope that Apple and Amazon and Abebooks and Google don't become monopsonists and monopolists of books new and used, but that isn't really the same question as the death-of-new-bookstores. The chain stores seemed to be trying to become monopsonists, just not as effectively.

When Amazon and Macmillan trompled on their authors last week, several authors I like started selling some of their work online directly. Excellent; I am not willing to buy DRM-ed books, I am happy to pay a roughly-used-book price for electronic files if I have them forever, I am even happier if the authors get more money than for a new paper sale (which, remember, was higher partly because of an implicit possible used book sale).

But now I'm worried for used book stores, because even if I do have the right to transfer a digital copy -- beats me about the legal right, but they were sold, not licensed, so I think probably it's legal if I give up my access -- I'm not likely to be doing that out of a used bookstore, am I? Poof, there they go. I suppose this only happens if too many book sales are electronic. Query: what's happening to used CD stores?

Also, *this* is the point at which the 'marginal cost of transferring a digital file' is relevant to the price of a digital book. The original publisher/editor/author are going to charge something that keeps them alive and pays for commercially-reliable server space. Me, as a used reseller, I am presumably content with the pittance I now get for paper books from used bookstores. Doesn't that drive down the price of the 'new' digital version? Perhaps we assume people virtuously buy from the original authors for the same reason that they aren't warezing them left and right.

Maybe we get to buy books with a right to 'return' them to the publisher within a year, but not to sell them elsewhere. That's about the same as assuming that decent people won't pirate in a reasonable market. I think it might be true, and earnestly fail to disprove it myself, but it seems unstable for the authors.

Moreover, there's going to be some insane twilight-of-empire condition for books that were published on paper in the period of Effectively Eternal Copyright. Now I can imagine a world in which everything before 1928 is potentially freely available through Project Gutenberg and its later, sleeker imitators; and everything after 20?0 is 'natively digital' and gets sold at either the easy-legal-resale, high price, or the piracy-is-inevitable, fans-are-sponsors very high price... and everything between is sold and resold in increasingly specialized '20th Century Culture' shops, among vinyl furniture and records and collectible Pez dispensers.

This may be a more cheerful view than the one in which everything after 1928 gets so gummed up with DRM and viral copyright that we have to fork cultures and raise authors who are never 'legally contaminated' by exposure to owned memes.

I only meant to point you to two authors I like who have some work for sale:

Barbara Hambly


C. J. Cherryh

and I was also thinking of Nick Harkaway's comments on Oxfam becoming an Evil Used Bookstore Chain (!!), as well as all his essays against the Google book scanning deal.

March 18, 2009

Ideas Into Words, Elise Hancock

If I were writing about science for non-scientists, I would cleave to this book, which reminds me very pleasantly of Strunk & White but covers even more material. I particularly liked the chapters on distinguishing cutting-edge scientific research from crankery, and on how to manage an interview, and on the possible forms of organization and style for an article. The last is metaphorical and practical at once:

Within the general framework of get in (clear), tell 'em (interesting), and get out (short) lie a thousand possibilities, each of which has a particular organic shape. As you go along, try to "see" that inherent shape in the material itself--a spiral, meander, beech leaf, delta, or such--then use it to structure your article.

It is not a book on how to write a scientific paper.

Find in a Library: Ideas Into Words

March 15, 2009

I dreamt I dwelt in marble stalls

This is a truly useful bathroom-stall door:

Oak door with rack and hooks

When you walk in, the sheaf of paper and books you probably have in your arms goes into the hayrack, and then your bag onto the intermediate hook, and a coat (and hat, even) onto the top hook. They've been useful for about a hundred years now. The building as a whole needs an overhaul, and I hope these subtle perfections of the old days will not be lost.

March 18, 2007


I aten't dead.

It's a poor defense of fantasy that "it's only jailers who despise escapism", because it isn't quite true: jailers despise escape, but escape-ism without escape makes the inmates quiet.

The phrase 'x trumps y', where x and y are categories of outsiderdom, annoys a lot of people; not just those who think that y trumps x, but people who object to describing social ills as a game. I am not going to add injury to insult by making this joke in the middle of a serious conversation, but of course it's a game; we're all playing Social Contract Bridge Called My Back.

March 11, 2005

Getting there from here

I should be modelling a final, but I wanted to nod enthusiastically to Caveat Lector's latest musings on the Google/scanning/Gorman kerfuffle. Also, I am more optimistic than she is about the eventual conversion of buckets of bitmap to useful digital texts.

As a friend of Distributed Proofreading, and a constant reader of its products, I feel that it will all work itself out eventually as long as the scanned images are available to anyone who's perplexed by a reading. (AFAIK the original books will also be as available as they used to be?) Eventually may be decades or centuries, but somebody will be gripped by the need to make this book a pleasure to read for all the people who really ought to read it. The more people the books have access to, the sooner each one will find its loving midwife. And the tools will get better - Dorothea's use of a concordance to backstop proofreading for scannos is GENIUS.
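A rough sketch of how a concordance might backstop proofreading, offered as an illustration only: the details of Dorothea's actual method are not given here, so the confusion list, thresholds, and function names below are all assumptions. The idea is that a word form seen once, which turns into a frequent word under one typical OCR substitution, deserves a second look.

```python
import re
from collections import Counter

# Assumed OCR confusions (illustrative, not exhaustive): "rn" read as "m", "h" as "b", and so on.
CONFUSIONS = [("m", "rn"), ("rn", "m"), ("b", "h"), ("h", "b"), ("ri", "n"), ("n", "ri")]

def build_concordance(text):
    """Count how often each lowercased word form appears."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def variants(word):
    """Yield spellings reachable from word by one substitution from CONFUSIONS."""
    for old, new in CONFUSIONS:
        start = word.find(old)
        while start != -1:
            yield word[:start] + new + word[start + len(old):]
            start = word.find(old, start + 1)

def likely_scannos(text, rare=1, common=3):
    """Map each rare word to a frequent word it may be a misreading of."""
    concordance = build_concordance(text)
    hits = {}
    for word, count in concordance.items():
        if count > rare:
            continue
        for candidate in variants(word):
            if concordance.get(candidate, 0) >= common:
                hits[word] = candidate
                break
    return hits

sample = "The modern reader and the modem reader; the modern book, the modern page."
print(likely_scannos(sample))
```

A human still makes the call; the list only says where to look, and a perplexed reader would still want the page image to check against.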

I wouldn't say PG has had scanning licked, actually; I have a Civil War book with every tip-in, pearl-font table, list of idiosyncratically spelled proper names, incredibly fine-engraved map you don't want to deal with, and a terribly broken five-inch spine; I think there's an orbital scanner in SF that I could use through DP, but I'm not in SF...

It is true that the text artisanry, the conversion of a collection of proofread pages to a coherent book, is the slowest point, probably the bottleneck, and (I fear) the part least amenable to someone casually coming by and fixing errors later. DP would be better off if it/they/we could teach/learn markup skills faster. I'm kind of perplexed by the TEI instructions I've found online, and I can't be the worst-prepared person looking; I can use LaTeX, I've edited DTDs, I read books with Scholarly Apparatus. The online instructions seem to be reminders for people who have been taught by a human, as is proper for something both an art and a craft, but doesn't speed up a whole lot when done in BBS and IM. (In no way do I speak for DP, but I think this would not be a minority view in its forums.)

The University of Washington has a digital humanities ?minor?, including a course next quarter that probably touches text-artisanry and certainly discusses metadata. I cannot possibly fit this course in. It does make me wonder why literature and history classes don't do more proofing and markup, as, one might say, the letter-scale act of close reading.

November 08, 2004

Spam, spam, spam

I'm closing comments because I have too much spam and not enough time. Pfui.

There are a whole lot of comments of the form


"Hi! Just looking around!"

None of the URLs go anywhere... either there's a sophomore class with an ill-thought-out Internet assignment, or someone's seeding Google with links to something that will be commercial later, or ???

So I've deleted them. In the unlikely case that they belonged to real people with bad hosting, sorry; in this troubled world you have to be cleverer than Eliza to survive.

July 11, 2004

English as She's spoke

I've been having a bout of tendinitis, and much else to do with what typing I want to risk. I will therefore be using voice dictation software for a bit, which will produce some unlikely misrepresentations of what I actually said.

June 11, 2004

The new Seattle Public Library

I don't like it. It will grow on me, if only from a Pavlovian association of the place with its contents. The librarians will be clever and thorough in mitigating its navigation problems by providing maps and signs. The Architecture, though, is maybe a fifth part an interesting try at a Machine for Information, maybe a fifth part actively stupid, and more than a fifth humorous because it's already so dated. It was instantaneously passé when the third Matrix movie was a disappointment.

The aesthetic dorkiness has its charming side, though. I'm fond of Seattle partly because we're a provincial, optimistic, overambitious, pratfall of a place; very homely, even when tremendously annoying. The library and the sports stadiums will be our cultural bookends to the fortunate 1990s, reminding us what we spent and what we thought would save us.

I also look forward to no-budget dystopian films being made in the library, with Blöödhäg soundtrack.

What specifically don't I like about it? From the outside, the pompous, looming approach on 4th, in which the entry has all the appeal of a pore. Seattle doesn't need to shade its streets, and the high overhang won't protect that door from much rain. (Next visit, I'll start on 5th and see if that feels better.) Then the navigation is hideously, hilariously bad, so that the librarians have already taped up (tidy and color-coördinated) copier notices explaining where to go and how to get out. There's all this whooptedo about the easy navigation and the spiral of books, but (postponing the question of whether the Dewey line is really how we access books) you don't walk in and meet the books, you walk into a sort of distant-concierge hotel lobby on 5th or a crowded industrial arrangement of dead ends on 4th. The lobby on 5th presents vast vertical space with no books. The children's and multilingual books are on floor 1; 2 isn't public; 3 is other fiction (all of it?); the spiral is floors 6 to 9. You don't get to see the spiral when you walk in (I didn't find a good overview of it anywhere). The building doesn't invite you into the knowledge of the ages, rather it does more to hide the books than I would have thought possible in an open-plan, glass-walled building.

I suppose many people will be more comforted by the "mixing chamber" combined reference and information desk than I am, and I am content in the expectation that the librarians will make a good thing of it and make it an abstract introduction to the Horn of Plenty. All that concrete isn't abstract, and I disdain it for disdaining that spiraling horn.

It isn't the modernism I dislike; I enjoyed the temporary library, which spent the interim years in an authentically construction-surfaced installation in a convention-center building.

Just plain dim details: the stairs in the book spiral are incredibly noisy. The boxed-in stairwells are of a different material and aren't noisy, so I suspect the noisy ones were chosen because they look cool. There are what seem to be water sprays (for fire suppression, I can't remember the name) that are boxed in most of 300°, by the glass wall of the escalators on one side and by their heavy brackets on the other. Maybe they pop out and wave tentacularly in event of a fire. The continuous ramp concrete floor of the spiral is ribbed with cast-in-place level supports for the bookcases, etc. Where there aren't bookcases, these 1" or 2" teeth extend into the corridor floor a few inches; enough for me to trip on, as the corridors are narrow enough that I hug the wall going around corners. The verticals next to these teeth are mostly sheetrock, e.g. boxes around supports or stairwells, and it's slipshod that the teeth and the sheetrock don't match.

The lower end of the purportedly-important Book Spiral has already been commented on by one of our newsweeklies; it stubs out without ceremony or explanation, facing yet another drop filled with I-beam supports (which are covered with rough black fire retardant, to bolster the cheap-SF-movie effect), with no way in or out visible. I don't like the top of the spiral much better; you finally pass the Special Collections which are in a glass-walled Don't Touch city. Again, an actively disinviting transition from finding materials to using them.

I'm dubious about whether this is really a "light-filled" library. There's a lot of inaccessible space with an angled glass wall above and below, and maybe this will bring in comfortable indirect light year-round, and maybe it will be a wearying greyness all winter. I like grey, I go west and wetter from here to relax, but a building to concentrate that would be a bit much even for me. I hope someone did extensive solar modelling.

I liked the floor in the multilingual section a lot. Making the organizational principle of the books the spine of the building is a pretty idea. I sort of like the perpetual keyhole views through the grid-skin. The book-transport system is cute. I, mmm, I hope I'm more cheerful on my next visit. I don't like not liking my library. (I love the Capitol Hill design, although I think it should have more room for books, and I was horrified when the roof leaked catastrophically last winter.) The popup power strips and wireless access at the study tables aren't as good as the stunning blue leather and brass scholar's fitments at the British Library Reading Room, but they are attempts at being as useful (I should check whether the desks are comfortable for non-computer work.)

Back to the Dewey spiral... There are good reasons why our book-ordering systems map to the number line, but I don't think that's a good map of how we use them. (I Am Not A Librarian. Ignorant Pontification Ahead. Not Much Worse Than Everything Else So Far, Though.) Trotting up and down the spiral, making constant use of the rubber markers in the floor, I really, really noticed that books on related subjects aren't usually next to each other in Dewey. The back-and-forth pattern in boring old rectangular stacks is okay because you don't have to go the length of each bookcase, and sometimes you luck out and everything is on the same physical corridor. So the perfect Library is arranged with an infinite number of petals, extensible as their subjects grow, but all opening to the student in the middle: the one corridor collapsed to a point: the one place we want a Panopticon. (The BLRR catalogue tried, eh? but the actual books were elsewhere. I wish I could find a picture of the desk furniture.)

A spiral could have the elevators as the single point, stretched out by mere physical necessity; but I don't think this one does. I didn't stop at every stop but the elevator mostly doesn't face books, that I recall.

Well. More later. May this embarrass me a decade from now when it's obvious that the library is the help and pride of the city.

June 02, 2004


In a frenzy of deleting comment-spam, I think I deleted at least one possible non-spam; at least, it seems to have pointed to a reputable software house. I don't know anything about it but that. Oops; sorry.

April 25, 2004


This should do to catalogue a very small library; and if it doesn't, why then, I can take advantage of the open source.

For instance, rather a small SMS application might work for saying-goodbye book checkout; it's true that standing in the front hall using a cellphone to email my own basement to update information about an object I have just physically handed someone is pretty silly, but it would amuse our friends. And it wouldn't require sticking paper or RFID to every volume, either.

Or, once they're in a database, we can presumably print a barcode for each checkout card and update loan status that way. But I'd want SMS anyhow for the embarrassingly frequent bookstore question, 'Do we have this yet?' —and the dual, suitable for use in airports and around holidays, of 'Now we own a copy, don't buy another.'
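A minimal sketch of what the message-handling end of such a checkout system might look like. Everything here is hypothetical: the three message shapes (OUT ... TO ..., IN ..., HAVE ...?), the class name, and the replies are invented for illustration, and actually receiving the SMS or email is left out entirely.

```python
import re

# Hypothetical message formats, matched case-insensitively.
OUT = re.compile(r"^OUT\s+(.+)\s+TO\s+(.+)$", re.IGNORECASE)
IN = re.compile(r"^IN\s+(.+)$", re.IGNORECASE)
HAVE = re.compile(r"^HAVE\s+(.+?)\??$", re.IGNORECASE)

class LoanShelf:
    def __init__(self, titles):
        # Lowercased title -> borrower's name, or None while the book is at home.
        self.loans = {title.lower(): None for title in titles}

    def handle(self, message):
        """Apply one text message to the shelf and return a short reply."""
        text = message.strip()
        match = OUT.match(text)
        if match:
            title, borrower = match.group(1).strip().lower(), match.group(2).strip()
            if title not in self.loans:
                return "Unknown title"
            self.loans[title] = borrower
            return f"Lent to {borrower}"
        match = IN.match(text)
        if match:
            title = match.group(1).strip().lower()
            if title not in self.loans:
                return "Unknown title"
            self.loans[title] = None
            return "Returned"
        match = HAVE.match(text)
        if match:
            title = match.group(1).strip().lower()
            if title not in self.loans:
                return "Not owned"
            borrower = self.loans[title]
            return "Owned, on shelf" if borrower is None else f"Owned, lent to {borrower}"
        return "Unrecognized message"

shelf = LoanShelf(["Ideas Into Words"])
print(shelf.handle("OUT Ideas Into Words TO Sam"))
print(shelf.handle("HAVE ideas into words?"))
```

The HAVE query covers the bookstore question; a real version would want fuzzier title matching than exact lowercase equality.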

February 04, 2004

Printer's devils, scanner's drivers

Ick. Scanning books to OCR will be punishingly slow if the driver (TWAIN driver? Imitation Photoshop plugin? whatever) remains flaky. I was fine with having it not installed by default, although the Readme installation instructions were not helpful. But it gives me fits that the software connection fades in and out.

And enough stuff has now been sold me with sleazy, or rather willfully optimistic, compatibility assurances that I don't even want to buy the driver-upgrade CD for the scanner I did settle on, although that is the most likely fix, and it's not terrifically expensive. Compared to the scanner and an upgrade to the OCR software, not expensive; but the HP website doesn't list what's been upgraded, or for what OSes, so it also might not be at all useful. And (huff) really, they ought to be able to pull that info out of their fixed-bug database.

October 16, 2003

The Columbia Guide to Digital Publishing, William E. Kasdorf, ed.

A probably-useful summary of how digital publishing could work and what tools are available now. Very much aimed at existing publishers, in the part I read. I am more interested in how one makes a digital library, especially of public domain work, which is not quite the same thing, but I was happy to see some acronyms turn up where I expected them.

I only got halfway through, as another hold was put on it at the library before I could extend my original borrowing. I am taking it back promptly like a good citizen, not leaving it in my "I'll finish it tomorrow" ziggurat like a Rhinedwarf.

Found because of Dorothea Salo, who is one of the contributors.

ISBN: 0-231-12499-6

August 13, 2003

Complete RFID systems

Well, isn't the 3M Digital Identification System tempting? Okay, no, I don't need the whole automated checkout system, and I think the exit detection monitors are too big for my hall. Maybe a roll-my-own with the TI Tag-It™ inlays. They're considered "consumable", so they must be getting affordable¹; and there are 10,000 on a whole reel of the smaller ones... that's at most three personal libraries. I bet I can find two other Seattle-based book maniacs to divvy up a reel.

Despite the good geeky fun, I should think about what good this would do me. I'm not actually likely to wander through my friends' houses with a reader; doing so would probably reduce the number of friends who borrowed books. Might actually reduce the number of my friends. Gadget blowback, very insidious. No, to increase the likelihood of getting books back, I think the old-fashioned physical sign-out card is best. Even when I forgot to have the borrower sign and return the card, they'd see the card pocket, which would have some form of our address on it.

On the other hand, we lose books we have, even though we don't have all that many. A system that made it easy to find books densely packed into the quondam garage might be cheaper than reinforcing the house foundations enough to support the much larger bookshelves necessary to file by subject. (Not that we would successfully maintain a filing system; and the data entry for maintaining a proper index, with some books relevant to several subjects, would be much of the work in organizing them by RFID in the first place.)

What would be efficient and all sorts of geeky would be to give up on keeping anything published too late for it to go online, or not published online in the first place. The current commercial stuff I can get from the city library, which has already reinforced its foundations. In that case I need a Minolta Overhead scanner and more server space...

¹ Although the TI online store is closed as I write, and I can't find TI transponders in the DigiKey catalog - am I blind? - so I don't know.

August 01, 2003

Some know, some don't

I don't know why it took me so long to notice that Dorothea Salo works (and thinks and opines) in electronic publishing. Her blog is not so much about cataloging, although clearly I would be happy with metadata as she thinks it should be done. It is much about the formats books go through, and where to put what markup when. Many points that match my memory of my startlingly long-ago efforts putting vast programming manuals into a format from which we could publish simultaneously to paper and CD - yeah, right - and in less than six months in Japanese. We learned to stay with the most abstract format as long as possible. (We used to talk about a problem being 'up' in SGML or 'down' in the wordprocessing or CDROM formats, and open our hands gently while looking up in imagination towards SGML, very like Hope on a Beaux-Arts monument. SGML! That was a long time ago!)

On the other hand, there's the Internet Book List, which I admire for existing, but which causes me pain because the identifying data for a book is so... so... so newbie. No field for publisher, for instance, although that's a necessary part of the older ways of identifying books. "Series" and "Series Part" considered very important, though; also "Genre" soi-disant. Now, it may become a useful reference for slightly geeky light reading, apt to be mined for If-You-Liked-This recommendations; and that wouldn't be a bad thing. But it would need munging to be integrated into a similar design made by people with slightly different interests, and more munging to be clean data in a system designed to identify nearly all books.

June 13, 2003

Pushmi Pullyou

In the perfect world, some few bookish institutions would keep the database of all known printed books, and someone referring to a book (in a review, bibliography, for sampling, whatever) could not only refer to it by this official, understood description, and provide a link there, but could leave a link with the record itself, so that one could automatically see a list of all references and uses. "Trackback", in blog.

But gosh, what a spammable resource. I can just see every book on any political or philosophical topic full of references to Chomsky's and Gingrich's latest... entertaining as it is to imagine something written by both of them, I am actually thinking of more likely duelling spams. Euck.

I don't think anyone wants to provide server space for that. Amazon does the moderation necessary to prune it, but (leaving out questions of who owns the review) Amazon is not good at things not for sale. Besides, any centralized and corporate entity would be susceptible to censorship or greed.

I think an intermediate database would make me happy enough; some reasonably global bibliography, legally copiable, with some grand smarts to lump and split as users want. (Sometimes I want to know that these-three-reviews are of different printings of the same text: sometimes not.)
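One way to picture lumping and splitting, as a sketch under invented assumptions: suppose each review record names both a work and a particular printing. The field names and the sample identifiers below are hypothetical, not taken from any real bibliography.

```python
from collections import defaultdict

# Hypothetical review records; "work" and "printing" are assumed fields.
reviews = [
    {"reviewer": "A", "work": "work-1", "printing": "first"},
    {"reviewer": "B", "work": "work-1", "printing": "second"},
    {"reviewer": "C", "work": "work-1", "printing": "second"},
]

def group_reviews(records, split_by_printing=False):
    """Lump reviews by work, or split them further by printing on request."""
    groups = defaultdict(list)
    for record in records:
        if split_by_printing:
            key = (record["work"], record["printing"])
        else:
            key = (record["work"],)
        groups[key].append(record["reviewer"])
    return dict(groups)

print(group_reviews(reviews))                          # all three lumped together
print(group_reviews(reviews, split_by_printing=True))  # first and second printings apart
```

The grand smarts would lie in deciding that two records share a work in the first place; this only shows that, once they do, the choice between lumping and splitting can be left to the reader.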

And then all the cleverness of reputation servers might come into it; I could tell my Reader's Agent that I liked reviews by these six people, and wanted, every week, to see twelve fiction reviews and twelve nonfiction reviews, starting with them and working outward through the circles of trust. For that matter, I'd like to see the commentary by these other four people who I despise... one of the most likely uses of reputation services, it seems to me; the Holy Rollers and the Rock-and-Rollers are likely to use each other's recommendations, only inverted.
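"Working outward through the circles of trust" reads like a breadth-first walk over a who-trusts-whom graph. A minimal sketch, assuming the trust data is simply a dictionary from each person to the people they trust; the names and the function are made up for illustration.

```python
from collections import deque

def reviewers_by_trust(trust, start, limit):
    """Return up to `limit` reviewers: the chosen ones first, then whoever they trust, and so on."""
    seen = set(start)
    queue = deque(start)
    ordered = []
    while queue and len(ordered) < limit:
        person = queue.popleft()
        ordered.append(person)
        for trusted in trust.get(person, []):
            if trusted not in seen:
                seen.add(trusted)
                queue.append(trusted)
    return ordered

# Hypothetical trust graph.
trust = {"ann": ["bo", "cy"], "bo": ["dee"], "cy": ["dee", "eve"]}
print(reviewers_by_trust(trust, ["ann"], limit=4))
```

The inverted use would need a second list of people whose recommendations count against a book; the walk itself stays the same.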

In this case, I am thinking of the (Dublin Core or something like) database as providing an intelligent ID, along with help in finding a copy of the work. Instead of references to references to a book, references to copies of it: available for sale, scanning, download, rental, borrowing with such-a library privilege. I drool at the thought.

June 05, 2003

Referring to...

In which I rant about things that bother me because I think they're encoding bad librarianship[?]. Yet I know nothing about library science - or the Semantic Web, or knowledge representation, whatever. Sorry. Corrections hoped for.

The ISBN is a bad choice for the default ID of books mentioned online.

  1. Plenty of books were published without an ISBN; many of those are still in copyright, so we can't 'just digitize them'. (Are books still published without ISBN? In what countries, in what sense of 'published'?)
  2. No way built into the ISBN to get from the ISBN for one physical representation of a book to the ISBN for another representation; so no good way for autodiscovery that two mentions refer to the same thing.
  3. And what about books that are available online?
    • most of which were published well before the ISBN;
    • some are copies of a text which may have been republished with an ISBN;
    • some are approximations of several physical versions, so are bibliographically distinct although related to the others.
  4. And what about non-books: obviously Webpages, but also broadcasts, music, journal articles; heck, any time-space coördinate.
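To make the second point concrete: about the only thing a ten-digit ISBN lets a program work out from the number itself is whether it was typed correctly. The sketch below implements the standard ISBN-10 check (weights 10 down to 1, sum divisible by 11, "X" standing for 10 in the last place); the function name is my own. The sample number is the ISBN given for The Columbia Guide to Digital Publishing elsewhere in these posts.

```python
def isbn10_is_valid(isbn):
    """Check the ISBN-10 check digit; hyphens and spaces are ignored."""
    chars = [c for c in isbn.upper() if c not in "- "]
    if len(chars) != 10:
        return False
    total = 0
    for position, char in enumerate(chars):
        if char == "X" and position == 9:
            value = 10  # "X" is allowed only as the final check character
        elif char in "0123456789":
            value = int(char)
        else:
            return False
        total += (10 - position) * value
    return total % 11 == 0

print(isbn10_is_valid("0-231-12499-6"))
print(isbn10_is_valid("0-231-12499-7"))
```

Nothing in that arithmetic says whether two valid numbers belong to the same text; that mapping has to live in a table somewhere else.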


The ISBN is a great online ID for books.

It's common, recognizable and short enough to build apps around.
  • With Jon Udell's bookmarklet, you can look up an ISBN at many public libraries, or Amazon, or....
  • AllConsuming can aggregate references to what people are reading, as long as they use the ISBN and link to one of the main stores.
  • Or there's a MovableType plugin, based on AllConsuming, to create a link to Amazon's book page, and put a little picture of the front cover on your blog.
Three things about these make me unnecessarily cranky. The first is that we're too effin' close to trashing everything older than the ISBN anyhow. I'm not [forgot his name] who gets all miffy about electronic catalogs because 'a word incised in stone demands to be read as stone'; I think everything good about card catalogs can be encoded in an online catalog (except the smell), and more and better besides; although I doubt it always is. But I do worry that old books and their contents are hard to get. It doubly bugs me that ignoring any pre-ISBN book loses pre-TV history, because it seems to me that we are especially bad at remembering anything before TV. Old fuzzy newsreels may appear mostly as background in cheap amusement, but that's what makes them the unquestioned Way Things Used to Be. Separate rant there, sorry.

The second thing that bothers me is an increase in encouraging us all to read the same things at the same time. Individual blogs do this only weakly; bestseller lists and AllConsuming strongly. I don't think there's an equally strong online method to find 'more like this' measured by subtler similarities. What I miss is the great joy of seeing all the possible cross-reference subjects in a card catalog, once I had finally found the entry I was looking for.

And, pettily, the pictures of the covers annoy me a lot. With a link to an online store right there, why waste space and attention to attract the magpie brain? (Because some people want to pick it up in a physical place: okay, fair.) I'm looking forward to print-on-demand so I can get the books I want hardcopy all the same size, and bound to match. If all my books were the same size, I could pack them in my bookcases much more efficiently. If they were all bound with my binding, I could retrieve them from my friends' shelves, especially if I included RFID. I wave my hand in lordly fashion to assume that the careful work of layout & design for each book can be scaled to fit the sizes I like (wide margins?).

Dublin Core

And look! the annoying detail work about how to refer to anything anyone might want to look up has been well-begun! They're even going to be in my town this fall.

At least one other book-reviewing enthusiast has worked out an RDF by which an item announces which items it's about.

Catalogablog does know about libraries, and has a list of blog-suitable metadata initiatives.

There should be a public-knowledge database of books, for everyone to refer to. Maybe being the de facto database is worth enough to Amazon that they would commit to providing an interface to the skeleton of their data; it might not be worth the effort, esp. as they're now linked to without promising anything.

Maybe one could be assembled by scraping library databases; hm; do libraries own their databases, do they buy rights from publishers, do they share data already?


Some problems with referring to things online:
  1. they vanish
  2. they move
  3. they get edited and make your comments irrelevant
  4. I might really be referring to both a source and a process: for instance, for the many things kept in 'dirty ASCII', maybe I should be scrupulous and say both that I started with the data at [URI1], but read it after it was processed by the program at [URI2]. Examples:
    • I read an online text after it's been HTML-prettied by one program, or turned into a .pdb file by another
    • The English-language version of something I read (a foreign newspaper?) was generated by a particular translating program; boy, there's something to hack to damage international relations.
Some fixes: caching the copy you first saw; permalinks; store a hash instead of a version# or ISBN.
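The hash idea, together with the source-plus-process worry in item 4, might be sketched like this. It is an illustration under assumptions: SHA-256 is my choice of hash, the record layout and function names are invented, and the example.org URIs are placeholders standing in for [URI1] and [URI2].

```python
import hashlib

def make_citation(source_uri, content, processor_uri=None):
    """Record where the bytes came from, what processed them, and a hash of what was read."""
    return {
        "source": source_uri,
        "processor": processor_uri,  # None if the text was read unprocessed
        "sha256": hashlib.sha256(content).hexdigest(),
    }

def still_matches(citation, content_now):
    """True if the bytes found today are the same bytes that were cited."""
    return hashlib.sha256(content_now).hexdigest() == citation["sha256"]

seen = b"Latin-1 completely linear text"
citation = make_citation("http://example.org/source", seen,
                         processor_uri="http://example.org/prettifier")
print(still_matches(citation, seen))
print(still_matches(citation, seen + b" (edited)"))
```

A hash only reports that something changed, not what; it works best alongside a cached copy, so the comment can still be read against the text it was about.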

February 15, 2003

Distributed Proofreading

I have been reading books backwards, odd pages only, and with unusual attention to their punctuation; I am sucked into the Distributed Proofreading project. A page a day for the public domain, you know. Proofreading a page a day doesn't seem so useful the slightly-old way, with a proofreader assigned to each book; much can happen in a year, your partial efforts can easily be lost. Also it's daunting. DP just shows you a black-and-white image of a single page, and checks out a copy of the OCR text produced from it, and if all you can do today is that page, fine; the improvement in the common store of available text is proportional to your effort.

I'd be better off if I could keep it down to a page a day; I find it as mindlessly enthralling as Tetris. I'm not even attempting the hard stuff yet, the Hakluyt or Anatomy of Melancholy. Latin, Greek, superscripts as contractions, ligature characters, poetry, nested footnotes; tricky. Especially tricky because the Project Gutenberg standards, towards which DP tends its efforts, are all old skool Latin-1 completely linear text. I was made a bit sad taking the page numbers out of an index; the topic titles in a good index are not always enough to find the page referred to, because a good index may have a topic filed under an explicit term when the text identifies something in context; "Clarissa Character, bankruptcy of" might point to a paragraph saying "From epistolary evidence, in this year his sister signed over her share of the inheritance completely, and it was lost with the whole." Hypertext can be very good at this, of course, and at footnotes and endnotes. Really excellent footnotes are a form of commented linking that hypertext is still thinking about. It seems a pity to be washing out some of the links that we could improve instead. On the other hand, do I feel like doing it all myself? Not that book, no. Maybe I'll think of a tool.

Really, embarrassingly, if the two goals are to allow fairly precise internal references and to be readable by both machines and humans, page numbers are not at all bad. Three, four digits every eighty lines? there aren't many HTML anchors smaller, let alone identifiers in newer cooler schemes.
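A small sketch of keeping page numbers as link targets instead of washing them out. The "[Page N]" marker is an assumed convention for this example, not a Project Gutenberg or DP standard, and the function names are invented.

```python
import re

# Assumed marker left in the proofread text at each page break.
PAGE_MARK = re.compile(r"\[Page (\d+)\]")

def add_page_anchors(text):
    """Turn each page marker into an empty HTML anchor named after the page."""
    return PAGE_MARK.sub(lambda m: f'<a id="p{m.group(1)}"></a>', text)

def link_index_entry(topic, page):
    """Render an index entry whose page number links to that page's anchor."""
    return f'{topic}, <a href="#p{page}">{page}</a>'

print(add_page_anchors("[Page 12]It began."))
print(link_index_entry("Clarissa Character, bankruptcy of", 212))
```

The anchor costs more characters than the bare number did, which rather supports the point about page numbers being compact; but the index keeps working either way.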

Many sites are storing images of the originals, although they present a more emphatic choice between easy-to-read formats, images good enough to be useful and attractive, and compression algorithms that will be reversible later.

