Domains With Typographical Errors - A Simple Search Strategy

by Seth Finkelstein

Abstract: This paper describes a simple strategy for searching for domain names with typographical differences, and the results of one such search.

Introduction

In the report Large-Scale Registration of Domains with Typographical Errors, Benjamin Edelman describes an extensive series of domain names with typographical errors that have been registered by a cybersquatter. This creates what might be called an "inverse problem": determining the intended targets of the squatted, typo'ed names. Edelman asks for help in identifying these targets.

Strategy

This problem of finding squatted domain targets can be viewed as one of fuzzy-searching for strings. In general, such approximate searching is a well-analyzed problem, and a tool has been developed for performing approximate string matching. The program is called agrep.

Thus, one need only perform an agrep search over the appropriate domain space in order to generate candidate domain targets.

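Searches were done for all names in the above typographical errors dataset. The data itself can be extracted from the given files with a simple Perl expression. The following small script prints, for each matching line, the domain name, a comma separator, and then the ID used. In the pattern, \046 and \047 are the octal escapes for the ampersand and the single quote, written that way so the expression can sit inside a single-quoted shell argument.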

perl -ne 'print "$1,$2\n" if (/domainname=([^\046]+)\046domainid=(\d+)\047/);' datafile.html
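For readers who prefer not to decode the octal escapes, the same extraction can be sketched in Python. This is only an illustrative equivalent of the one-liner above, not part of the original work; the function name extract_pairs is made up for this sketch, and it assumes the data file contains the literal text domainname=...&domainid=...' exactly as the Perl pattern expects.

```python
import re

# Same pattern as the Perl one-liner, with \046 written as & and \047 as '.
PAIR_RE = re.compile(r"domainname=([^&]+)&domainid=(\d+)'")


def extract_pairs(html):
    """Return (domain name, id) tuples, at most one per input line."""
    pairs = []
    for line in html.splitlines():
        # Like perl -ne, only the first match on each line is used.
        match = PAIR_RE.search(line)
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs
```

Reading the data file into a string, passing it to extract_pairs, and printing each pair joined by a comma should give the same name,id lines as the Perl version.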

For every typo-domain-name, that name was searched over the space of domains using zone files from December 31, 2002. That is, a .com typo-name was searched through all .com domains, a .org typo-name through all .org domains, etc. The typo-domain-name itself was eliminated from the search results. The agrep search options used were -x, which requires the pattern to match the whole line, and -1, which allows at most one error (an insertion, deletion, or substitution):

agrep -x -1
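To make the matching rule concrete, here is a minimal Python sketch of a whole-name comparison that tolerates at most one error. It is not agrep and does not use agrep's algorithm; the names within_one_edit and candidate_targets are hypothetical, and the sketch assumes the zone data has already been reduced to a plain list of names in the same form as the typo-name.

```python
def within_one_edit(a, b):
    """True if a and b differ by at most one insertion, deletion, or substitution."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    # Make a the shorter (or equal-length) string.
    if len(a) > len(b):
        a, b = b, a
    # Skip the common prefix.
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        # One substitution at position i: everything after it must match.
        return a[i + 1:] == b[i + 1:]
    # One extra character in b at position i: the rest of a must follow it.
    return a[i:] == b[i + 1:]


def candidate_targets(typo, zone_names):
    """Names within one edit of typo, with the typo-name itself excluded."""
    return [name for name in zone_names
            if name != typo and within_one_edit(typo, name)]
```

Note that in this sketch two swapped adjacent letters count as two errors, so a transposition typo would not be found with a limit of one.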

Results

The format of the search results files is as follows:

typo-domain-name, a comma separator, the ID used for it (to facilitate matching the dataset), a comma separator, and then the list of candidate matches, which are semicolon-separated
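As an illustration of that layout, the following Python sketch writes and reads one result line. The function names are hypothetical, and the exact spacing around the separators in the real results files is an assumption here (none is emitted and none is expected).

```python
def format_result(typo, domain_id, candidates):
    """Build one line: typo-name, ID, then semicolon-separated candidates."""
    return f"{typo},{domain_id},{';'.join(candidates)}"


def parse_result(line):
    """Split one result line back into (typo-name, ID, list of candidates)."""
    # Split on the first two commas only; the rest is the candidate list.
    typo, domain_id, matches = line.rstrip("\n").split(",", 2)
    # A search with no candidates leaves the last field empty.
    return typo, domain_id, [m for m in matches.split(";") if m]
```

Keeping the ID in every line makes it straightforward to join the results back to the original dataset.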

While not every search produced an obvious match, the results were clearly helpful in determining squatting targets.

The most interesting outcome was that, for a typo-domain, the approximate search generated other obvious typo-domains. Further investigation of these groupings might reveal other squatters.

Support

This project was not supported by anyone. If anyone is providing financial support for such projects, this author would like to know too.


Version 1.0 January 31 2003

See also: Domains with Typographical Errors - A Google Search Strategy


Mail comments to: Seth Finkelstein <sethf@sethf.com>

For future information: subscribe to Seth Finkelstein's Infothought list or read the Infothought blog


See more of Seth Finkelstein's Censorware Investigations