Tag: search

Bot classes

After a couple of days of robots.txt love, I now have much less crap in my logs. A good opportunity to see which bots are well written. Based on what I am seeing with /robots.txt, I am sure glad I blocked most of these festering piles of dung from my site.

not using conditional GET when requesting /robots.txt

Only kinjabot, OnetSzukaj/5.0 and Seekbot/1.0 get this right. All other bots, including Google and Yahoo, do not. Lame.
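For the record, a conditional GET is not rocket science. Here is a minimal sketch in Python of what a well-behaved bot could do: remember the ETag and Last-Modified headers from the previous fetch and send them back, so the server can answer 304 Not Modified instead of the full file. The caching structure and function names are made up; the headers are plain standard HTTP.

```python
import requests

def fetch_robots(url, cached):
    """cached is a dict like {'etag': ..., 'last_modified': ..., 'body': ...}, or {} on first fetch."""
    headers = {}
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]

    resp = requests.get(url, headers=headers, timeout=10)
    if resp.status_code == 304:
        # Nothing changed on the server, reuse the copy we already have.
        return cached
    return {
        "etag": resp.headers.get("ETag"),
        "last_modified": resp.headers.get("Last-Modified"),
        "body": resp.text,
    }
```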

requesting /robots.txt too often

The biggest offender is VoilaBot, checking /robots.txt every 5 minutes, every day. You gotta be kidding me. Google and Yahoo are not much better; you’d think they’d have figured out a way by now to communicate the state of /robots.txt across their different crawlers. Other bots fare better by virtue of being less desperate.
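The obvious fix does not seem hard either. A minimal sketch, with the names and the TTL entirely my own guesses: keep one cached copy of robots.txt per host with a sane expiry that all crawler workers consult, instead of each one hitting /robots.txt on its own schedule.

```python
import time

ROBOTS_TTL = 24 * 60 * 60   # re-check at most once a day
_robots_cache = {}           # host -> (fetched_at, body)

def get_robots(host, fetch):
    """fetch(host) is whatever actually does the HTTP request (ideally a conditional GET as above)."""
    now = time.time()
    entry = _robots_cache.get(host)
    if entry and now - entry[0] < ROBOTS_TTL:
        # Still fresh: no request at all, the site never sees us.
        return entry[1]
    body = fetch(host)
    _robots_cache[host] = (now, body)
    return body
```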

Problems like this are economic opportunities.

Cog in the crawler

Now that Google is helping me to surf faster (works as advertised, by the way), I have effectively become a cog in a huge distributed crawling machine. Obviously, this is only the first step (Alexa-style traffic analysis is naturally already happening). If you control the proxy that people use, annotation and tagging at internet scale suddenly become feasible. A ‘tag this’ button in the Google Toolbar, anyone? This will lead to a repeat of the Third Voice lawsuits, but these features are too useful to be derailed by such problems for long. Years ago at KPMG, I experimented with the Office Server Extensions annotation system, and I am eager to see it return in a cross-platform way. People have been pointing out the possibilities for AdSense (targeted ads based on your surfing history), personalized search and co-browsing.

Search as a force for good

Peter starts with the well-known globe with the Google queries superimposed. “Google saves over 9000 person-years of effort every year. So Google saves 9000 lives per year.” 🙂 Numerous mentions of people making a living from Google ads. Shows Keyhole. “The computers from Star Trek are always omniscient but never helpful, they never tell you: don’t do this.”

The spelling checker is not dictionary based but works on their huge accumulated data, like the 500 spellings of Britney. The web is 1M times larger than the largest computational linguistics corpora; 1B documents really make the difference. Humans achieve around 95% accuracy. Work is being done on semantic understanding. One program they run extracts categories and the members of those categories from their corpus. “Done very simply: you take the whole web, break it into sentences, and look for 6 patterns, such as including, as in Software companies, including…” They take an automated approach to machine translation by looking for pages on the web that have documents in 2 languages and deriving the model from that. This yields a level of translation that is good enough.

“Doug Cutting was more interested in the crawling and indexing side, so that’s where Lucene looks good, not so much in the sorting.” Google focuses on the 95%, the easy part, to get more leverage, but will have to go back to the hard part. They found that the feedback button didn’t work. Some people were writing them that they were looking for a specific book by typing in “library”, which indicates a deeper problem. The first Google logo was done during Burning Man 1999 to indicate that “no one is at the office, don’t blame us”.
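The “including” trick from the talk is simple enough to sketch. Here is a toy Python version of that one pattern (the talk mentions six); the regex and the sentence splitting are very much simplified guesses, not how Google actually does it.

```python
import re

# Matches "<category>, including <member>, <member> and <member>"
PATTERN = re.compile(
    r"(?P<category>\w+(?: \w+)*?), including (?P<members>[^.]+)", re.IGNORECASE
)

def extract_pairs(text):
    """Scan naively split sentences and collect (category, member) pairs."""
    pairs = []
    for sentence in text.split("."):
        m = PATTERN.search(sentence)
        if not m:
            continue
        category = m.group("category").strip()
        for member in re.split(r",| and ", m.group("members")):
            member = member.strip()
            if member:
                pairs.append((category, member))
    return pairs

print(extract_pairs("Software companies, including Microsoft, Oracle and SAP."))
# -> [('Software companies', 'Microsoft'), ('Software companies', 'Oracle'), ('Software companies', 'SAP')]
```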