Google Books Metadata

A fascinating smackdown for all the metadata whiners, from a member of the Google Books team. A lot of them have illusions about the quality of the metadata their own institutions produce.

In paragraph 3, Geoff describes some of the problems we have with dates, and in particular the prevalence of 1899 dates. This is because we recently began incorporating metadata from a Brazilian metadata provider that, unbeknownst to us, used 1899 as the default date when they had no other. Geoff responded by saying that only one of the books he cited was in Portuguese. However, that metadata provider supplies us with metadata for all the books they know about, regardless of language. To them, Stephen King’s Christine was published in 1899, as were 250K other books.

To which I hear you saying, “if you have all these metadata sources, why can’t the correct dates outvote the incorrect ones?” That is exactly what happens. We have tens of metadata records telling us that Stephen King’s Christine was published in 1983. That’s the correct date. So what should we do when we have a metadata record with an outlier date? Should we ignore it completely? That would be easy. It would also be wrong. If we put in simple common-sense checks, we’d occasionally bury uncommonly strange but genuine metadata. Sometimes there is a very old book with the same name as a modern book. We can either include metadata that is very possibly wrong, or we can prevent that metadata from ever being seen. The scholar in me — if he’s even still alive — prefers the former.
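To make the trade-off concrete, here is a minimal sketch of date reconciliation by majority vote that keeps outliers visible instead of discarding them. The record format and the `reconcile_dates` function are assumptions for illustration only, not Google Books’ actual pipeline.

```python
from collections import Counter

def reconcile_dates(records):
    """Pick the majority publication date across metadata records,
    keeping outlier dates as alternates rather than throwing them away.

    `records` is a hypothetical list of {"source": ..., "date": ...} dicts.
    """
    votes = Counter(r["date"] for r in records if r.get("date"))
    if not votes:
        return None, []
    best, _ = votes.most_common(1)[0]
    # Outliers stay visible so rare-but-genuine metadata (e.g. an old
    # book sharing a title with a modern one) is not silently buried.
    alternates = [d for d in votes if d != best]
    return best, alternates

# Example: twenty records say 1983, one default-dated record says 1899.
records = [{"source": f"lib{i}", "date": 1983} for i in range(20)]
records.append({"source": "default_1899_provider", "date": 1899})
print(reconcile_dates(records))  # (1983, [1899])
```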

Jeff Dean keynote

The attention to detail at Google is remarkable. Jeff gleefully described the various index compression techniques they created and used over the years. He talked about how they finally settled on a format that groups four position deltas together in order to minimize the number of shift operations needed during decompression. They paid attention to how their data was laid out on disk, keeping the data they needed to stream quickly on the faster outer tracks and leaving the inner tracks for cold data and short reads. They wrote their own error recovery for non-parity memory. They wrote their own disk scheduler. They repeatedly modified the Linux kernel to meet their needs. They designed their own servers with no cases, then switched to more standard off-the-rack servers, and are now back to custom servers with no cases again.
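The group-of-four format Jeff describes sounds like a group-varint style delta encoding: one tag byte records how many bytes each of four deltas occupies, so the decoder does a small, fixed amount of shifting per group instead of checking a continuation bit on every byte. The sketch below is an illustrative reconstruction under that assumption, not Google’s actual index format.

```python
def encode_group(deltas):
    """Encode exactly four position deltas (each < 2**32) after one tag byte.

    Each 2-bit field in the tag gives the byte length (1-4) of the
    corresponding delta, so decoding needs only a few shifts per group.
    """
    assert len(deltas) == 4 and all(0 <= d < 2**32 for d in deltas)
    tag = 0
    body = bytearray()
    for i, d in enumerate(deltas):
        nbytes = max(1, (d.bit_length() + 7) // 8)
        tag |= (nbytes - 1) << (2 * i)
        body += d.to_bytes(nbytes, "little")
    return bytes([tag]) + bytes(body)

def decode_group(buf, offset=0):
    """Decode one group of four deltas; return (deltas, next_offset)."""
    tag = buf[offset]
    offset += 1
    deltas = []
    for i in range(4):
        nbytes = ((tag >> (2 * i)) & 0x3) + 1
        deltas.append(int.from_bytes(buf[offset:offset + nbytes], "little"))
        offset += nbytes
    return deltas, offset

# Example: positions 5, 18, 19, 400 become deltas 5, 13, 1, 381,
# which pack into 1 tag byte + 5 data bytes.
positions = [5, 18, 19, 400]
deltas = [positions[0]] + [b - a for a, b in zip(positions, positions[1:])]
print(decode_group(encode_group(deltas)))  # ([5, 13, 1, 381], 6)
```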

You think?