Tag: search

Bot classes

After a couple of days of robots.txt love, I now have much less crap in my logs. A good opportunity to see which bots are well written. Based on what I am seeing with /robots.txt, I am sure glad I blocked most of these festering piles of dung from my site.

not using conditional GET when requesting /robots.txt

Only kinjabot, OnetSzukaj/5.0 and Seekbot/1.0 get this right. All other bots, including Google and Yahoo, do not. Lame.
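For the record, a conditional GET is not rocket science. Here is a minimal sketch in Python of what a well-behaved bot could do: remember the ETag and Last-Modified headers from the previous fetch and send them back, so the server can answer 304 Not Modified instead of the full file. The caching structure and function names are made up; the headers are plain standard HTTP.

```python
import requests

def fetch_robots(url, cached):
    """cached is a dict like {'etag': ..., 'last_modified': ..., 'body': ...}, or {} on first fetch."""
    headers = {}
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]

    resp = requests.get(url, headers=headers, timeout=10)
    if resp.status_code == 304:
        # Nothing changed on the server, reuse the copy we already have.
        return cached
    return {
        "etag": resp.headers.get("ETag"),
        "last_modified": resp.headers.get("Last-Modified"),
        "body": resp.text,
    }
```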

requesting /robots.txt too often

The biggest offender is VoilaBot, checking /robots.txt every 5 minutes, every day. You gotta be kidding me. Google and Yahoo are not much better; you’d think they’d have figured out a way by now to communicate the state of /robots.txt across their different crawlers. Other bots fare better by virtue of being less desperate.
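The obvious fix does not seem hard either. A minimal sketch, with the names and the TTL entirely my own guesses: keep one cached copy of robots.txt per host with a sane expiry that all crawler workers consult, instead of each one hitting /robots.txt on its own schedule.

```python
import time

ROBOTS_TTL = 24 * 60 * 60   # re-check at most once a day
_robots_cache = {}           # host -> (fetched_at, body)

def get_robots(host, fetch):
    """fetch(host) is whatever actually does the HTTP request (ideally a conditional GET as above)."""
    now = time.time()
    entry = _robots_cache.get(host)
    if entry and now - entry[0] < ROBOTS_TTL:
        # Still fresh: no request at all, the site never sees us.
        return entry[1]
    body = fetch(host)
    _robots_cache[host] = (now, body)
    return body
```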

Problems like this are economic opportunities.

Cog in the crawler

Now that Google is helping me to surf faster (works as advertised, by the way), I have effectively become a cog in a huge distributed crawling machine. Obviously, this is only the first step (Alexa-style traffic analysis is naturally already happening). If you control the proxy that people use, annotation and tagging at internet scale suddenly become feasible. A ‘tag this’ button in the Google Toolbar, anyone? This will lead to a repeat of the Third Voice lawsuits, but these features are too useful to be derailed by such problems for long. Years ago at KPMG, I experimented with the Office Server Extensions annotation system, and I am eager to see it return in a cross-platform way. People have been pointing out the possibilities for AdSense (targeted ads based on your surfing history), personalized search and co-browsing.

Search as a force for good

Peter starts with the well-known globe with the Google queries superimposed. “Google saves over 9000 person-years of effort every year. So Google saves 9000 lives per year.” 🙂 Numerous mentions of people making a living from Google ads. Shows Keyhole. “The computers from Star Trek are always omniscient but never helpful, they never tell you: don’t do this.”

The spelling checker is not dictionary based but works on their huge accumulated data, like the 500 spellings of Britney. The web is 1M times larger than the largest computational linguistics corpora; 1B documents really make the difference. Humans achieve around 95% accuracy. Work is being done on semantic understanding. One program they run extracts categories and the members of those categories from their corpus. “Done very simply: you take the whole web, break it into sentences, and look for 6 patterns, such as including, as in Software companies, including…” They take an automated approach to machine translation by looking for pages on the web that have documents in 2 languages and deriving the model from that. This yields a level of translation that is good enough.

“Doug Cutting was more interested in the crawling and indexing side, so that’s where Lucene looks good, not so much in the sorting.” Google focuses on the 95%, the easy part, to get more leverage, but will have to go back to the hard part. They found that the feedback button didn’t work. Some people were writing them that they were looking for a specific book by typing in “library”, which indicates a deeper problem. The first Google logo was done during Burning Man 1999 to indicate that “no one is at the office, don’t blame us”.
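The “including” trick from the talk is simple enough to sketch. Here is a toy Python version of that one pattern (the talk mentions six); the regex and the sentence splitting are very much simplified guesses, not how Google actually does it.

```python
import re

# Matches "<category>, including <member>, <member> and <member>"
PATTERN = re.compile(
    r"(?P<category>\w+(?: \w+)*?), including (?P<members>[^.]+)", re.IGNORECASE
)

def extract_pairs(text):
    """Scan naively split sentences and collect (category, member) pairs."""
    pairs = []
    for sentence in text.split("."):
        m = PATTERN.search(sentence)
        if not m:
            continue
        category = m.group("category").strip()
        for member in re.split(r",| and ", m.group("members")):
            member = member.strip()
            if member:
                pairs.append((category, member))
    return pairs

print(extract_pairs("Software companies, including Microsoft, Oracle and SAP."))
# -> [('Software companies', 'Microsoft'), ('Software companies', 'Oracle'), ('Software companies', 'SAP')]
```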