Thursday, December 01, 2005

Massive Digitization Projects

Earlier this week I went to a talk sponsored by George Mason University's Center for History and New Media on massive digitization projects and their long-term implications. The speakers, Clifford Lynch (Executive Director, Coalition for Networked Information) and attorney Jonathan Band, were engaging and informative; the audience seemed to be library-oriented rather than lawyer-packed. My notes here do not necessarily reflect what was said, only what the speakers made me think about.

Lynch pointed out that different classes of works (books, photographs, etc.) have very different histories and patterns of use, but all are now grist for large-scale digitization projects. This clearly has some relationship to the change in humanities scholarship, which is increasingly looking for non-text sources of evidence such as images and music. The harder sciences also stand to gain a lot from digitization of things like hand-written/typed temperature logs; once the information is scanned and made manipulable for large-scale data analysis, previously unaskable questions can be answered.
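To make that concrete, here is a toy sketch of my own (not anything the speakers showed, and the file layout is invented for illustration): suppose a century of hand-kept station logs had been transcribed into a machine-readable table. A question that would take months to answer from the paper originals, like the average temperature per decade at one station, becomes a few lines of code:

    # Toy sketch (my illustration): decadal temperature averages from
    # digitized logs. Assumes a hypothetical CSV with columns
    # year, station, temp_c.
    import csv
    from collections import defaultdict

    def decadal_means(path, station):
        sums, counts = defaultdict(float), defaultdict(int)
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                if row["station"] == station:
                    decade = (int(row["year"]) // 10) * 10
                    sums[decade] += float(row["temp_c"])
                    counts[decade] += 1
        # Average per decade, in chronological order
        return {d: sums[d] / counts[d] for d in sorted(sums)}

    # e.g., decadal_means("logs.csv", "Station A") might yield
    # {1900: 8.9, 1910: 9.1, 1920: 9.4, ...}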

Highly concentrated copyright ownership made digitization easier for journals (e.g., JSTOR) than for books, where rights tend to revert to the author after some time. This connects with Jonathan Band's later argument, which I think deserves close attention, that the very fact that both publishers and authors are suing Google for its library project supports Google's fair use case. That is, like the performers and record companies in Napster, the plaintiffs can't agree on who owns the rights to authorize digitization. To the extent that individual authors do (or even may) own those rights, the claim that asking permission would be easy, and thus that a market could form, is less persuasive, which supports Google's position on the fourth fair use factor. Compare this to the Tasini litigation and other Authors' Guild suits against content delivery companies whose contracts with publishers turned out not to protect them from copyright claims.

Lynch emphasized that there is a lot we don't know about Google's plans beyond its contract with the University of Michigan (apparently Stanford also plans to let Google go comprehensively through its monograph collection, including in-copyright works, but I haven't found anything from Stanford that definitively says so). Harvard, Oxford and the NYPL have agreed to experimental digitization of public domain materials, extensible by mutual agreement. (Lynch noted the irony that the protesting publishers' groups include the respective university presses as members, and suggested that greater press-university coordination might have been in order.) Google has been coy about what it will allow people to do with the public domain works; there is a difference between letting you read a book online, letting you download a book, and letting you download the corpus of scanned books. Of course, non-copyright controls on public domain works are nothing new, as museums have been using physical control to manage copies for decades.

Lynch closed by offering what he considered a bizarre prospect: that students and other searchers would have easy access to a lavishly documented world before 1923, which would then just sort of stop (at least with respect to high-investment content), with later material behind digital locks. Researchers might justly consider this an arbitrary division between the commercialized and the shared world. Will the 20th century be 80% missing from digital life? My comment: this of course assumes that pay-for-access digitization will be prohibitively priced; for many university students, at least, the commercialized digital world will be available, if their libraries can afford it.

Jonathan Band then spoke about the legal issues in the Google controversy, expanding on his previously published analysis to argue that Google is engaging in fair use. According to him, Google has recognized that, for certain reference works such as dictionaries, even a snippet might substitute for access to the work, so Google Book Search might give you a result listing for a term found in a dictionary but no text at all. As noted above, he emphasized the difficulty of finding the proper rights owners of all the relevant books; many are essentially orphan works, and an opt-in rather than opt-out arrangement would gut the project even if rights owners were generally willing to opt in once contacted.

As he pointed out, Google's web search business depends on opt-out, and if Google Book Search isn't fair use, then its web search seems vulnerable as well (a fact that may help Google more than hurt it, since web search is so important to the internet as it has developed). Of course, a properly configured robots.txt file or robots meta tag doesn't require an author to keep up with who's doing search these days the way Google's opt-out for books does. On the other hand, as a practical matter only Google (and maybe Microsoft and Yahoo!) is likely to engage in large-scale digitization. I think this cuts both ways: maybe Google uniquely has the resources to seek permission, or maybe letting Google proceed threatens no other likely market, and opting out of digitization is as easy as telling Google. Band also emphasized the economic benefit to Google from Book Search: if it makes Google's search engine more attractive, suddenly there's a billion-dollar barrier to entry that can't be overcome just by writing a better search spider.
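(A concrete aside of my own, not from the talk: under the robots exclusion convention, a site owner opts out of every compliant crawler at once with two lines in a robots.txt file at the site's root, and never has to learn the name of the next search engine to come along:

    User-agent: *
    Disallow: /

Google's book opt-out, by contrast, is Google-specific, so an author who objects to digitization in general would have to track down and notify each digitizer separately.)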

Ultimately, I don't think Google's size or uniqueness should be central to the fair use inquiry. Either an intermediary should be able to make a digital copy for the purpose of returning small segments to searchers, analogous to the software reverse engineering cases, or it shouldn't. This is especially important given that Google plans to give a complete digital copy to the library whose book was digitized; the library will then be in a similar position to Google. If it also restricts uses the way Google does (or some other way, for example by locking up the physical copy for preservation purposes and then allowing one electronic check-out at a time, the way some ebook lending works now), then the fair use analysis should be similar, with the exception of the commercial/noncommercial factor. The conceptual difficulty here is that there are a number of interrelated, arguably fair uses: Google gets to make a digital copy for itself in consideration for digitizing the book for the library; the library gets a digital copy back in consideration for lending the book to Google; these copies may never be accessible in full to any end-user.

I, for one, am very interested to see what's going to happen next.
