comparison of various knowledge graphs
Tag: semweb
Google Books Metadata
a fascinating smackdown for all the metadata whiners, from a member of the google books team. a lot of them have illusions about the quality of the metadata their institutions produce.
In paragraph 3, Geoff describes some of the problems we have with dates, and in particular the prevalence of 1899 dates. This is because we recently began incorporating metadata from a Brazilian metadata provider that, unbeknownst to us, used 1899 as the default date when they had no other. Geoff responded by saying that only one of the books he cited was in Portuguese. However, that metadata provider supplies us with metadata for all the books they know about, regardless of language. To them, Stephen King’s Christine was published in 1899, as well as 250K other books.
To which I hear you saying, “if you have all these metadata sources, why can’t the correct dates outvote the incorrect ones?” That is exactly what happens. We have 10s of metadata records telling us that Stephen King’s Christine was written in 1983. That’s the correct date. So what should we do when we have a metadata record with an outlier date? Should we ignore it completely? That would be easy. It would also be wrong. If we put in simple common sense checks, we’d occasionally bury uncommonly strange but genuine metadata. Sometimes there is a very old book with the same name as a modern book. We can either include metadata that is very possibly wrong, or we can prevent that metadata from ever being seen. The scholar in me — if he’s even still alive — prefers the former.
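to make the trade-off concrete, here's a tiny sketch of the voting idea (my own illustration, nothing from google's actual pipeline): a majority vote picks the date, but the losing dates are kept around as flagged outliers rather than thrown away.

```python
from collections import Counter

def reconcile_dates(records, min_votes=3):
    """Pick the majority publication date across metadata records,
    but keep outliers as flagged alternatives instead of silently
    discarding them."""
    votes = Counter(r["date"] for r in records if r.get("date"))
    if not votes:
        return None, []
    best_date, best_count = votes.most_common(1)[0]
    # Outliers lost the vote. They may be junk (a provider default like
    # 1899) or a genuinely older work that shares a title.
    outliers = [d for d in votes if d != best_date]
    return {"date": best_date, "confident": best_count >= min_votes}, outliers

# Example: a dozen records agree on 1983, one provider says 1899.
records = [{"date": 1983}] * 12 + [{"date": 1899}]
best, outliers = reconcile_dates(records)
print(best, outliers)  # {'date': 1983, 'confident': True} [1899]
```

the hard part, which this glosses over, is deciding when an outlier is a provider default and when it's a genuinely old book with a reused title.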
Intelligence in Wikipedia
using wikipedia infoboxes for training extractors, and then asking users to confirm guesses, increasing contribution and extraction quality in a mutual positive feedback loop.
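roughly, the trick (my own simplified sketch, not the paper's code): infobox attribute/value pairs act as distant supervision, labelling the article sentences that mention them; the extractor trained on those labels then shows its low-confidence guesses to users for confirmation.

```python
def label_training_examples(sentences, infobox):
    """Distant supervision: any sentence containing an infobox value
    becomes a positive example for that attribute's extractor."""
    examples = []
    for attribute, value in infobox.items():
        for sentence in sentences:
            if value.lower() in sentence.lower():
                examples.append((sentence, attribute))
    return examples

# Made-up article and infobox, just to show the labelling step.
infobox = {"birth_place": "Ulm", "field": "physics"}
sentences = [
    "Einstein was born in Ulm, in the Kingdom of Wurttemberg.",
    "He received the 1921 Nobel Prize in Physics.",
]
print(label_training_examples(sentences, infobox))
# -> each sentence paired with the attribute it supports
```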
WikiProfessional
semantic wiki for scientific collaboration. this could actually work
Nash Equilibria Modeling
I personally think that Exhibit and Potluck are the best examples out there of solutions that don’t specifically change the nature of the game but shift the payoffs, thus attempting to reduce the gap between Pareto-optimal states and Nash-equilibrium ones. A lot more has to happen on the Potluck front, of course, since it is still practically just paperware, and a lot more has to happen about harvesting the collective intelligence of the people using these tools, to further improve their use and surface data that could increase coordination and make it easier to predict integration costs.
i love stefano, but sometimes he is just full of buzzwords
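to be fair, the gap he's pointing at is real. a toy example (mine, not from his post): in a prisoner's-dilemma-style game the only nash equilibrium is mutual defection, even though mutual cooperation is the pareto-optimal outcome that payoff-shifting tools would try to make reachable.

```python
import itertools

# Payoffs as (row player, column player). Mutual cooperation
# Pareto-dominates the unique Nash equilibrium (defect, defect).
PAYOFF = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"):    (0, 5),
    ("defect",    "cooperate"): (5, 0),
    ("defect",    "defect"):    (1, 1),
}
MOVES = ["cooperate", "defect"]

def is_nash(profile):
    """No player can gain by unilaterally changing their move."""
    for player in (0, 1):
        for alt in MOVES:
            deviation = list(profile)
            deviation[player] = alt
            if PAYOFF[tuple(deviation)][player] > PAYOFF[profile][player]:
                return False
    return True

def is_pareto_optimal(profile):
    """No other profile helps someone without hurting someone else."""
    for other in itertools.product(MOVES, repeat=2):
        a, b = PAYOFF[other], PAYOFF[profile]
        if a[0] >= b[0] and a[1] >= b[1] and a != b:
            return False
    return True

for p in itertools.product(MOVES, repeat=2):
    print(p, "nash" if is_nash(p) else "", "pareto" if is_pareto_optimal(p) else "")
```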
Automated Scraping
our scraping tools (Solvent and Crowbar) let you deal with web pages at the level of the DOM (e.g., evaluating XPaths, retrieving HTML attributes) rather than at the level of streaming characters. This higher level of abstraction is easier to operate in. Furthermore, Solvent and Crowbar can wait for all the dynamic Javascript code in web pages to finish running; this means that you can even scrape those new Web 2.0 sites rather than just static web pages.
their DOM scraping has come a long way
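not solvent or crowbar themselves, but a minimal illustration of what "the level of the DOM" buys you, using python's lxml: XPath over a parsed tree instead of regexes over a character stream. (waiting for javascript to finish, as crowbar does, is what a headless browser adds on top of this.)

```python
from lxml import html

page = """
<html><body>
  <div class="listing">
    <a class="title" href="/item/1">First item</a>
    <span class="price">$9.99</span>
  </div>
  <div class="listing">
    <a class="title" href="/item/2">Second item</a>
    <span class="price">$4.50</span>
  </div>
</body></html>
"""

doc = html.fromstring(page)

# Work against the parsed tree: XPath queries and attribute access,
# not string matching over streaming characters.
for listing in doc.xpath('//div[@class="listing"]'):
    title = listing.xpath('.//a[@class="title"]/text()')[0]
    href = listing.xpath('.//a[@class="title"]/@href')[0]
    price = listing.xpath('.//span[@class="price"]/text()')[0]
    print(title, href, price)
```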
Semantic web mashups
To define a facet, you drag/drop the column name to another area of the canvas. For my merged exhibit, I used the facets origin (i.e., CSAIL vs CCNMT), plus group, position, and building.
Then I exported the merged data into the same JSON format as the original sources, cloned one of the pages, and referenced the merged data set. From there it was just a bit of tweaking to make the div elements in the HTML page reference the facets that I’d defined.
Stunning. Behind the scenes it’s all RDF, but the point is that nobody needs to know or care about that. And the larger point is that the Simile folks — having spent years fighting ontology wars — have now gone AWOL. The new stance is: Everybody gets to name their fields as they prefer, and mashup tools like Potluck can define equivalences among them. All the original source data, and all the merged data, is available in a common format that becomes grist for the RDF mill. All the data, and all the interactive behavior associated with the data, is cleanly separated from the presentation.
he really likes what he sees. and so do i. this needs to ride on the coattails of someone to get wide exposure
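a toy version of the "name your fields however you like, declare equivalences later" move (plain python of my own, not potluck's internals, and the sample people are made up): two sources with different field names get merged through a small mapping, and a facet is then just the distinct values of a merged field.

```python
from collections import defaultdict

# Two sources that chose different field names for the same things.
csail = [{"fullName": "Alice", "lab_group": "Haystack", "office": "32-G490"}]
ccnmt = [{"name": "Bob", "team": "New Media", "room": "E15-320"}]

# Equivalences declared after the fact: map each source's field names
# onto a shared vocabulary instead of agreeing on one up front.
equivalences = {
    "csail": {"fullName": "name", "lab_group": "group", "office": "building"},
    "ccnmt": {"name": "name", "team": "group", "room": "building"},
}

def merge(sources):
    merged = []
    for origin, items in sources.items():
        mapping = equivalences[origin]
        for item in items:
            record = {"origin": origin}
            record.update({mapping.get(k, k): v for k, v in item.items()})
            merged.append(record)
    return merged

def facet(records, field):
    """A facet is just the distinct values of a field with their counts."""
    counts = defaultdict(int)
    for r in records:
        counts[r.get(field)] += 1
    return dict(counts)

records = merge({"csail": csail, "ccnmt": ccnmt})
print(facet(records, "origin"))  # {'csail': 1, 'ccnmt': 1}
print(facet(records, "group"))   # {'Haystack': 1, 'New Media': 1}
```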
Giant Global Graph
So, if only we could express these relationships, such as my social graph, in a way that is above the level of documents, then we would get re-use. That’s just what the graph does for us. We have the technology — it is Semantic Web technology, starting with RDF, OWL and SPARQL. Not magic bullets, but the tools which allow us to break free of the document layer.
maybe the time for the semweb is finally ripe.
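a minimal sketch of the graph-above-the-documents idea, using rdflib and FOAF (my choice of tooling, not something the post prescribes): triples about two people, as if gleaned from separate pages, land in one graph and a SPARQL query walks across them.

```python
from rdflib import Graph, URIRef, Literal
from rdflib.namespace import FOAF, RDF

g = Graph()

# Two "documents" contribute triples about people; the graph is what
# survives above either document.
alice = URIRef("http://example.org/people/alice#me")
bob = URIRef("http://example.org/people/bob#me")

g.add((alice, RDF.type, FOAF.Person))
g.add((alice, FOAF.name, Literal("Alice")))
g.add((alice, FOAF.knows, bob))

g.add((bob, RDF.type, FOAF.Person))
g.add((bob, FOAF.name, Literal("Bob")))

# SPARQL over the merged graph, regardless of which page each triple came from.
results = g.query("""
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?name ?friendName WHERE {
        ?person foaf:name ?name ;
                foaf:knows ?friend .
        ?friend foaf:name ?friendName .
    }
""")
for row in results:
    print(row.name, "knows", row.friendName)
```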
GRDDL Primer
This document serves as an introduction to GRDDL (Gleaning Resource Descriptions from Dialects of Languages), a mechanism for obtaining RDF data from XML documents and in particular XHTML pages using explicitly associated transformation algorithms, typically represented in XSLT. It uses a number of examples from the GRDDL Use Cases document to illustrate the techniques GRDDL provides for associating documents with appropriate instructions for extracting any embedded data.
not sure how this helps anything.
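for reference, the mechanics are: the page links to an XSLT transform, and a GRDDL-aware client applies it to get RDF out. a rough sketch of just that application step with lxml (not a real GRDDL client; the stylesheet here is a trivial stand-in for the one a page would link to).

```python
from lxml import etree

# A toy XHTML document; under GRDDL it would declare its transformation
# via <link rel="transformation" href="..."/> in the head.
xhtml = etree.fromstring(b"""
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>GRDDL demo</title></head>
  <body><span class="author">Jane Doe</span></body>
</html>
""")

# Stand-in for the externally referenced XSLT: it gleans the author span
# and emits an RDF/XML description of the page.
xslt = etree.fromstring(b"""
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xh="http://www.w3.org/1999/xhtml"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <xsl:template match="/">
    <rdf:RDF>
      <rdf:Description rdf:about="">
        <dc:creator><xsl:value-of select="//xh:span[@class='author']"/></dc:creator>
      </rdf:Description>
    </rdf:RDF>
  </xsl:template>
</xsl:stylesheet>
""")

transform = etree.XSLT(xslt)
print(etree.tostring(transform(xhtml), pretty_print=True).decode())
```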
OSM Army
I’m not convinced that the state of the art in GIS databases has appropriate answers. The OSM community, as ever, creates new cart-tracks across well-paved spaces. The debate is too heated for any but the really committed to follow, the tracks become effaced in debate, but perhaps they’re leading somewhere new. Or as the New Data Model paper puts it, “Complexity does not mean that it has to be more complicated.”
jo thinks the new OSM data model is more RDF-like, which of course she approves of