August 07, 2006

AOL Search Data Launches World's Biggest Experiment On Privacy Invasion

AOL Search Data has been released for more than half a million users:

This collection consists of ~20M web queries collected from ~650k users over three months. The data is sorted by anonymous user ID and sequentially arranged.

This is a privacy problem!

While it'll be well-discussed, I'll observe: AOL has just given us the world's biggest real-world experiment as to whether privacy invasion can be done from search-engine data. Previously, when discussing the Google Search subpoena, all people could do was speculate - the data might have this, it could include that, maybe possibly someone could do this from it. Now we have both a huge amount of data, and many interested geeks playing with it and mining it.

I joked we'll now see a huge distributed reverse-engineering collaborative effort to track down as many anonymous user IDs as possible. At least, I hope that was a joke. Maybe it wasn't.

Note this data is being far, far more widely released than the subpoena data, which would have been under confidentiality agreements and protective orders. Worrying about Big Government can be a distraction from far worse Big Corporations.

By Seth Finkelstein | posted in google | on August 07, 2006 10:29 AM (Infothought permalink)

Seth Finkelstein's Infothought blog (Wikipedia, Google, censorware, and an inside view of net-politics)

Comments

Thanks for the comment about how fears of big government distract from the actions of big corporations.

I wrote this comment for the FD list, but it deserves wider distribution:

Since it has been widely mirrored, AOL will next find a scapegoat so the public will be more worried about those villains that dared to point out the problem and mirror the evidence.

Here is the instant recipe:

1) PR department reaches out to their media contacts. Journalists then tell sensationalist story of "hackers" or "bloggers" who mirrored *your* private data. AOL worms out of responsibility for letting the data loose in the first place by declaring war on the evil bloggers.

2) Now that there's no public support for the blogger, AOL can safely trick a government agency into publicly denouncing the blogger. Since the blogger is clearly a danger to public safety, the government is allowed to ignore all applicable law. After all, their heart was in the right place, and that matters more than an individual's rights. Also, since the press is already committed to portraying the blogger as a villain, the government knows that they will never have to apologize if they make a mistake. The press has a vested interest not to report the error.

3) Next AOL's team of corporate lawyers will file a lawsuit. It doesn't matter if the lawsuit is frivolous - they are after the PR value of "prosecuting on behalf of the public", and reinforcing to the media that the blogger who dared link to the info is the evil one. If the blogger is poor, weak, and has no media platform of their own, then AOL might actually win the lawsuit by default, adding further legitimacy to their "public defender" posture.

4) The public doesn't understand that killing the messenger only guarantees successful cover-ups in the future. And as far as I can tell, they don't care that there is a layer of people whom corporations can calculate as having no Constitutional rights in this country (if a person can't defend their rights, they might as well not exist). AOL's "issues management" team is weaving these assumptions into their strategy.

Scapegoating worked for Kaiser Permanente. It'll work for AOL.

Posted by: gadfly at August 7, 2006 02:37 PM

I already found personally identifiable information, though to verify the person in question really searched for this, I'd have to embarrass him and send an email asking. And then, probably, that person would ignore me or pretend it's not him even though it is.

And that, of course, was just found by quickly glancing over the data... and without knowledge of this person. It would actually be even easier to *recognize* a person you know through his or her searches... and then connect it to potentially embarrassing other searches.

The database of intentions, as John Battelle calls it... maybe someone will release a book of all searches through Lulu.com -- *every user is a chapter*. If I understand AOL's terms right, as long as the book is non-commercial, that's OK... so you just need to make sure you don't get a commission from sales.

Posted by: Philipp Lenssen at August 7, 2006 03:51 PM

As someone who in a previous career worked with highly confidential personal data - a government statistics agency - let me say that I think while you are imagining how bad it is, if you are in the list you may wish to get religion if you don't already.

As a test for another section of a confidentialized file, I broke it in minutes - an academic complained about a confidentialized unit record file, saying he could see himself there. After mining the released file I came up with 2 likely candidates. I then compared his actual data to what I pulled - as I expected, the candidates in the released data were an amalgam of actual data - no surprise, as this is how the confidentializing was done.

So trolling individuals will be done easily. Will AOL pay the blackmail for those affected?

The real problem is you don't need names and addresses.

Now, assuming it is actually a random sample, names and addresses, while a concern for those affected, aren't my greatest fear. In gov statistics you learn fast that people move and change names fast enough that matching it up is almost always more pain than it is worth.

The rest of us will suffer as well.

However - it is the breadth and depth of data which will give real power to the evil doers. Insomniacs may wish to look up "factorial experiment design".

A genuine random sample will give marketers, "smart" spammers and other less desirables a goldmine of targeting data. Spam filters will become less effective; sites will be constructed targeting niches based on this data, which is fine until all they load is trojans; ads will be targeted with precision.

Paid-to-post scum who pose as "individuals" posting to forums/blogs/comments wherever will now be able to be realistically automatically generated - essentially bots given personality. Need a doctor to comment on a medical scandal for a PR firm? Save money: generate one based on the data in the list.

What makes bots and shills/trolls so easy to pick right now is lack of depth from not having a believable background - prepare to repel boarders. A smart PR firm would start generating the personalities now and "evolve" them (i.e. start having machine-generated posts now) so that they have doctors, engineers, whatever on hand for when they need them. [Seth - still looking for an interesting job?]

If we are really lucky AOL have screwed up and not generated a true random sample - but 600k+ individuals will probably outweigh any non-randomised bias.

Posted by: tqft at August 8, 2006 07:42 AM

Seth, we should not worry about either the gov or the corporation, we should just be ourselves and do what we think is right. No matter what we do - right or wrong - the electronic fingerprints are all over cyberspace. It's easier to let others know about your own intentions and attention data; in this way, one negates being isolated. If totally isolated, you could trigger a "positive positive", rather than a "false positive".

Posted by: /pd at August 8, 2006 10:36 AM

AOL has just given us the world's biggest real-world experiment as to whether privacy invasion can be done from search-engine data

Did you see the NYT article today?

They scanned through the queries and positively identified user No. 4417749 on the basis of her queries. Contacted the woman and confirmed it:

[S]earch by search, click by click, the identity of AOL user No. 4417749 became easier to discern. There are queries for "landscapers in Lilburn, Ga," several people with the last name Arnold and "homes sold in shadow lake subdivision gwinnett county georgia."
It did not take much investigating to follow that data trail to Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her friends’ medical ailments and loves her three dogs. "Those are my searches," she said, after a reporter read part of the list to her.

She's cancelling her AOL subscription, by the way.

Posted by: Lis Riba at August 9, 2006 10:04 AM

BTW, this morning's NPR talked about a provision to a House appropriations bill that would require labels on sexually explicit Web sites to make it easier for filtering software.

I tuned in on the middle of the story, but this doesn't sound like DOPA or any of the filtering bills I've been hearing about.

Do you have any more info on it?

http://www.npr.org/templates/story/story.php?storyId=5629338

Posted by: Lis Riba at August 9, 2006 10:08 AM

Lis: Thanks, yes, I had seen it, but I didn't think I had anything to say that everyone else wasn't already saying - I just did a post anyway. The labeling proposal isn't DOPA - here's a news article on it

tqft: Sometimes I wonder if my life would have been better if I had been willing to work for the Dark Side :-).

Posted by: Seth Finkelstein at August 9, 2006 11:18 AM