A fascinating smackdown for all the metadata whiners, from a member of the Google Books team. A lot of those critics have illusions about the quality of the metadata their own institutions produce.
In paragraph 3, Geoff describes some of the problems we have with dates, and in particular the prevalence of 1899 dates. This is because we recently began incorporating metadata from a Brazilian metadata provider that, unbeknownst to us, used 1899 as the default date when they had no other. Geoff responded by saying that only one of the books he cited was in Portuguese. However, that metadata provider supplies us with metadata for all the books they know about, regardless of language. To them, Stephen King’s Christine was published in 1899, as were 250K other books.
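To make the 1899 problem concrete, here is a minimal Python sketch of one way a suspicious default date could be spotted in a single provider's feed. It is an illustration only, not a description of how Google Books works: the function name `suspected_default_year`, the `min_share` threshold, and the sample years are all hypothetical.

```python
from collections import Counter


def suspected_default_year(years, min_share=0.5):
    """Return the most common year if it covers at least min_share of the
    provider's records, otherwise None.

    Hypothetical heuristic: a single year dominating a feed hints that the
    provider uses it as a placeholder when the real date is unknown.
    """
    if not years:
        return None
    year, count = Counter(years).most_common(1)[0]
    return year if count / len(years) >= min_share else None


# Made-up sample feed in which 1899 appears far too often.
provider_years = [1899, 1899, 1983, 1899, 2001, 1899]
print(suspected_default_year(provider_years))
```

A real feed would likely need a much lower threshold and a comparison against other providers, since a placeholder year can hide inside millions of legitimate dates.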
To which I hear you saying, “if you have all these metadata sources, why can’t the correct dates outvote the incorrect ones?” That is exactly what happens. We have tens of metadata records telling us that Stephen King’s Christine was written in 1983. That’s the correct date. So what should we do when we have a metadata record with an outlier date? Should we ignore it completely? That would be easy. It would also be wrong. If we put in simple common-sense checks, we’d occasionally bury uncommonly strange but genuine metadata. Sometimes there is a very old book with the same name as a modern book. We can either include metadata that is very possibly wrong, or we can prevent that metadata from ever being seen. The scholar in me, if he’s even still alive, prefers the former.
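The outvoting idea, with outliers kept instead of discarded, can be sketched in a few lines of Python. This is a hedged illustration under simple assumptions, not the actual Google Books logic: `resolve_date`, the returned field names, and the record counts are hypothetical.

```python
from collections import Counter


def resolve_date(record_years):
    """Pick the most frequently reported year as the display date, and keep
    every dissenting year instead of throwing it away."""
    if not record_years:
        raise ValueError("need at least one metadata record")
    counts = Counter(record_years)
    winner, _ = counts.most_common(1)[0]
    # Outliers stay visible, since a rare date may be genuine
    # (for example, an old book sharing a title with a modern one).
    outliers = sorted(year for year in counts if year != winner)
    return {"display_year": winner, "outlier_years": outliers}


# Hypothetical counts: twelve records say 1983, one says 1899.
christine = [1983] * 12 + [1899]
print(resolve_date(christine))
```

The design choice mirrors the trade-off in the text: the majority decides what is shown, while the minority record remains available to anyone who wants to check it.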