Abstract: This paper describes a simple strategy for searching for domain names with typographical differences, and the results of one such search.
The problem of finding the domains that squatters are targeting can be viewed as one of fuzzy string searching. Approximate searching of this kind is a well-analyzed problem, and a tool for approximate string matching already exists. The program is called agrep.
Thus, one need only perform an agrep search over the appropriate domain space in order to generate candidate domain targets.
Searches were done for all names in the above typographical errors dataset. The data itself can be extracted from the given files with a simple perl expression. For each matching line, the following small script prints the domain name, a comma separator, and then the ID used.
perl -ne 'print "$1,$2\n" if (/domainname=([^\046]+)\046domainid=(\d+)\047/);' datafile.html
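The octal escapes in the expression above stand for characters that are awkward to type inside a single-quoted shell argument: \046 is "&" and \047 is "'". As an illustration only, the same extraction might be sketched in Python as follows. The function name and the sample input line are hypothetical; the real data files are not reproduced here.

```python
import re

# Same pattern as the perl one-liner, with the octal escapes written out:
# \046 is "&" and \047 is "'".
PATTERN = re.compile(r"domainname=([^&]+)&domainid=(\d+)'")

def extract_pairs(lines):
    """Return one "name,id" string per line that contains a match.

    Like the perl version, only the first match on each line is used.
    """
    out = []
    for line in lines:
        m = PATTERN.search(line)
        if m:
            out.append(f"{m.group(1)},{m.group(2)}")
    return out

# Hypothetical input line, shaped only to satisfy the pattern.
sample = ["<a href='lookup?domainname=exmple.com&domainid=42'>exmple.com</a>"]
for pair in extract_pairs(sample):
    print(pair)
```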
Each typo-domain-name was searched against the space of domains, using zone files from December 31, 2002. That is, a .com typo-name was searched through all .com domains, a .org typo-name through all .org domains, etc. The typo-domain-name itself was eliminated from the search results. The agrep search options used were:
agrep -x -1
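With these options, agrep requires the whole line to match (-x) and allows at most one error (-1), where an error is a single inserted, deleted, or substituted character. The following is a minimal Python sketch of that matching rule, not of agrep's actual algorithm. It assumes one domain name per line of input; the function names and the sample zone list are hypothetical.

```python
def within_one_error(a, b):
    """True if a and b differ by at most one insertion, deletion, or substitution."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    # Skip the common prefix.
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        # One substitution: everything after the mismatch must agree.
        return a[i + 1:] == b[i + 1:]
    # One insertion into a: drop b[i] and the remainders must agree.
    return a[i:] == b[i + 1:]

def candidates(typo, zone_names):
    """Names in the same top-level domain within one error of typo, excluding typo itself."""
    tld = typo.rsplit(".", 1)[-1]
    return [
        name for name in zone_names
        if name != typo
        and name.rsplit(".", 1)[-1] == tld
        and within_one_error(typo, name)
    ]

# Hypothetical stand-in for a zone file, one name per entry.
sample_zone = ["example.com", "exmple.com", "exmples.com", "example.org", "sample.com"]
print(candidates("exmple.com", sample_zone))
```

In this made-up sample, the search for exmple.com returns both example.com and exmples.com. A transposition such as exampel.com versus example.com counts as two errors under this rule and would not match.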
The format of the search results files is as follows:
typo-domain-name, a comma separator, the ID used for it (to facilitate matching against the dataset), a comma separator, then the list of candidate matches, separated by semicolons
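A line in this format can be written and read back with a few lines of code. The sketch below is illustrative only: it assumes no spaces around the separators and that an empty third field means no candidates were found, neither of which is confirmed here. The function names and the sample values are hypothetical.

```python
def format_result(typo, domain_id, matches):
    """Build one results line: typo,id,match1;match2;..."""
    return f"{typo},{domain_id},{';'.join(matches)}"

def parse_result(line):
    """Split one results line back into (typo, id, list of matches)."""
    typo, domain_id, rest = line.rstrip("\n").split(",", 2)
    matches = rest.split(";") if rest else []
    return typo, domain_id, matches

# Hypothetical values, for illustration only.
line = format_result("exmple.com", "42", ["example.com", "exmples.com"])
print(line)
print(parse_result(line))
```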
While not every search produced an obvious match, the results were clearly helpful in determining squatting targets.
The most interesting outcome was that, for a typo-domain, the approximate search generated other obvious typo-domains. Further investigation of these groupings might reveal other squatters.
This project was not supported by anyone. If anyone is providing financial support for such projects, this author would like to know too.
Version 1.0 January 31 2003
See also: Domains with Typographical Errors - A Google Search Strategy
See more of Seth Finkelstein's Censorware Investigations