Abstract: This paper describes a strategy for searching for domain names with typographical differences by using Google, and compares the results to a previous search using approximate string matching.
An earlier paper by this author (Seth Finkelstein) considered the problem of finding the targets of squatted domains as an issue of fuzzy-matching against strings. But another approach is to view the "typographical error" as an issue of spelling-correction.
Google, the famous search engine, has a feature where it will suggest a correction for search terms which might be misspelled. So if the user searches for the term "Gogle", the result page will contain the suggestion
"Did you mean: Google"
This suggested spelling correction uses Google's proprietary algorithms, so it is somewhat opaque and perhaps even changes over time. However, despite these oracular aspects, given Google's popularity as an information-retrieval source, it seemed worthwhile to examine it as a cybersquatting-search tool.
The general procedure was simple. Domains with typographical errors were submitted to Google as search queries, and when Google returned a "Did you mean:" suggested correction, that was recorded. If no correction was given, the domain suffix (.com, .net, .org) was removed, and the bare term was re-submitted. A few times, this yielded a suggested correction of more than one word, or one containing characters not permitted in domain names. In those situations, any characters not permitted in domain names were removed from the result, and the suffix was then re-attached.
Note - no attempt was made to determine whether the results were always valid (currently registered) domains!
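As a rough illustration, the lookup-and-cleanup steps of this procedure can be sketched in Python. The `suggest` callable below is a stand-in for Google's "Did you mean:" lookup (the actual query mechanism is outside this sketch), and the example domains in the usage note are drawn from this paper:

```python
import re

def sanitize_correction(correction, suffix):
    """Remove characters not permitted in domain names (keep letters,
    digits, hyphens) from a suggested correction, then re-attach the suffix."""
    cleaned = re.sub(r"[^a-z0-9-]", "", correction.lower())
    return cleaned + suffix

def find_target(typo_domain, suggest):
    """Look up a suggested correction for typo_domain.

    `suggest` stands in for Google's "Did you mean:" feature: it takes a
    query string and returns a suggestion, or None.  The full domain is
    tried first; if that yields nothing, the suffix (.com/.net/.org) is
    stripped and the bare name re-submitted (the "noext" case in the
    results files).  Returns (correction, used_second_attempt).
    """
    correction = suggest(typo_domain)
    if correction is not None:
        return correction, False
    name, dot, suffix = typo_domain.rpartition(".")
    correction = suggest(name)
    if correction is not None:
        # Second-attempt corrections may contain spaces or other
        # characters not valid in a domain name, so clean them up.
        return sanitize_correction(correction, dot + suffix), True
    return None, True
```

With a toy `suggest` table, `find_target("gogle.com", ...)` would fail on the full domain, succeed on the bare term "gogle", and return "google.com" flagged as a second-attempt ("noext") result.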
The format of the search results files is as follows:
typo domain name, a comma separator, the ID used for it (to facilitate matching against the dataset), another comma separator, then the "corrected" domain name followed by a final comma separator. If the result came from the second attempt at correction, the keyword "noext" appears after that last comma separator.
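Assuming exactly the layout described above (including the trailing comma after the corrected name), one line of a results file could be parsed like this; the example ID in the usage note is hypothetical:

```python
def parse_results_line(line):
    """Split one results-file line into its fields: typo domain, ID,
    corrected domain, and whether the "noext" keyword is present."""
    fields = line.strip().split(",")
    typo, ident, corrected = fields[:3]
    # "noext" marks corrections obtained on the second attempt,
    # after the domain suffix was removed.
    noext = len(fields) > 3 and fields[3] == "noext"
    return {"typo": typo, "id": ident, "corrected": corrected, "noext": noext}
```

For example, the (hypothetical) line `gogle.com,42,google.com,noext` would parse to a record with `noext` set, while `gogle.com,42,google.com,` would not.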
About 26% of the domain names yielded no suggested correction at all, as opposed to only 5-6% using the agrep -x -1 procedure. In the table below, in the attempt to find the target of a typo'ed domain name, a "hit" was defined as the procedure returning at least one result, while a "miss" meant the procedure returned no results.
| dataset  | total items | agrep hit  | agrep miss | Google hit | Google miss |
|----------|-------------|------------|------------|------------|-------------|
| list-A   | 303         | 283 (93%)  | 20 (7%)    | 212 (70%)  | 91 (30%)    |
| complete | 5342        | 5039 (94%) | 303 (6%)   | 3954 (74%) | 1388 (26%)  |
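The percentages in the table follow directly from the counts; a one-line helper makes the arithmetic explicit:

```python
def hit_rate(hits, total):
    """Hit percentage: share of typo domains for which the procedure
    returned at least one result, rounded to the nearest whole percent."""
    return round(100 * hits / total)

# Figures from the table above:
print(hit_rate(283, 303))    # agrep on list-A -> 93
print(hit_rate(3954, 5342))  # Google on the complete dataset -> 74
```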
The interesting aspect of the Google corrections was that when there was a result, it seemed to be, in some subjective sense, the "most popular". That is, the agrep results might have several variants for the target domain name, but the Google result would have only the relevant match. Perhaps Google was returning the equivalent of the highest-scoring result in its internal rankings.
Google also did better than agrep in that it more readily found targets resulting from transposition (the exchange of two adjacent letters, e.g. audioreveiw.com vs audioreview.com). This is because agrep regards a transposition as two changes in the string (two letters differ), which makes such matches rank low and exceed the one-error setting used in the agrep-based results.
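The difference can be made concrete with two edit-distance functions: plain edit distance (the error model behind agrep's counting) charges a transposition as two substitutions, while the Damerau variant charges it as a single edit. This is an illustrative sketch of the two distance measures, not agrep's actual implementation:

```python
def levenshtein(a, b):
    """Plain edit distance: insertions, deletions, substitutions only."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def damerau(a, b):
    """Edit distance that also counts an adjacent transposition as one edit."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

# The swapped "ei"/"ie" pair costs 2 plain edits but only 1 with
# transpositions allowed, so it falls outside a one-error agrep search:
assert levenshtein("audioreveiw", "audioreview") == 2
assert damerau("audioreveiw", "audioreview") == 1
```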
This project was not supported by anyone. If anyone is providing financial support for such projects, this author would like to know too.
Version 1.2 February 14 2003 (small correction to agrep statistics, full Google dataset calculated)
See also: Domains with Typographical Errors - A Simple Search Strategy
See more of Seth Finkelstein's Censorware Investigations