A dataset released by the New York City Taxi and Limousine Commission is causing quite an uproar in the privacy community. The set contains details about every yellow cab ride in New York in 2013, including pickup and drop-off times, locations, fare and tip amounts, as well as anonymized (hashed) versions of each taxi’s license and medallion numbers. The Commission did not release the dataset voluntarily; it was obtained through a Freedom of Information request.
Vijay Pandurangan had already found that the dataset wasn’t properly anonymized. He was able to link taxi rides in the dataset to particular taxi license numbers and medallion numbers. And with data available elsewhere, Pandurangan noted, it is possible to find out which taxi driver drove where:
“There’s a ton of resources on NYC Taxi and Limousine commission, including a mapping from licence number to driver name, and a way to look up owners of medallions. I haven’t linked them here but it’s easy to find using a quick Google search.”
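Why hashing alone does not anonymize such identifiers can be shown with a minimal sketch. It assumes an unsalted MD5 hash (the text above only says “hashed”) and a hypothetical four-character ID format of digit, letter, digit, digit; the function names and the sample ID are illustrative as well. Because so few IDs are possible, every one of them can be hashed in advance and the published hashes simply looked up:

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits


def candidate_ids():
    """Yield every ID in a hypothetical format: digit, letter, digit, digit."""
    for d1, letter, d2, d3 in product(digits, ascii_uppercase, digits, digits):
        yield f"{d1}{letter}{d2}{d3}"


def build_lookup():
    """Map the MD5 hex digest of each candidate ID back to the plain ID."""
    return {
        hashlib.md5(candidate.encode("ascii")).hexdigest(): candidate
        for candidate in candidate_ids()
    }


# Precompute the hashes of all 10 * 26 * 10 * 10 possible IDs.
lookup = build_lookup()

# A "published" hash, as it might appear in an anonymized dataset (made-up ID).
published = hashlib.md5(b"7B42").hexdigest()

# Reversing the hash is now a dictionary lookup.
recovered = lookup[published]
print(len(lookup), recovered)
```

The point of the sketch is the size of the input space, not the choice of hash function: any unsalted hash over a few thousand or even a few million possible values can be reversed this way.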
With the license numbers deanonymized, Anthony Tockar took Pandurangan’s work one step further. Tockar searched the internet for images of “celebrities in taxis in Manhattan in 2013”. Some of the images he found showed a celebrity getting into or out of a taxi with a visible license number. This information enabled him to link the taxi ride data he already had to particular celebrities. Tockar explains what he was able to reveal about Bradley Cooper’s and Jessica Alba’s taxi rides:
“In Brad Cooper’s case, we now know that his cab took him to Greenwich Village, possibly to have dinner at Melibea, and that he paid $10.50, with no recorded tip. Ironically, he got in the cab to escape the photographers! We also know that Jessica Alba got into her taxi outside her hotel, the Trump SoHo, and somewhat surprisingly also did not add a tip to her $9 fare. Now while this information is relatively benign, particularly a year down the line, I have revealed information that was not previously in the public domain. Considering the speculative drivel that usually accompanies these photos (trust me, I know!), a celebrity journalist would be thrilled to learn this additional information.”
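The linking step described above can be sketched roughly as follows. This is only an illustration, not Tockar’s actual code: the records are made-up toy data, and the field names, the `match_sighting` function and the ten-minute window are assumptions. Given a taxi identifier read from a photo and the approximate time the photo was taken, the matching trip is the one by that taxi with a pickup close to that time:

```python
from datetime import datetime, timedelta

# Hypothetical toy records; none of these values come from the real dataset.
trips = [
    {"taxi_id": "7B42", "pickup": datetime(2013, 7, 8, 19, 34), "fare": 12.0, "tip": 0.0},
    {"taxi_id": "7B42", "pickup": datetime(2013, 7, 8, 22, 5), "fare": 8.5, "tip": 2.0},
    {"taxi_id": "3C17", "pickup": datetime(2013, 7, 8, 19, 35), "fare": 20.0, "tip": 4.0},
]


def match_sighting(trips, taxi_id, seen_at, window=timedelta(minutes=10)):
    """Return trips by the given taxi whose pickup lies within `window` of the sighting."""
    return [
        trip
        for trip in trips
        if trip["taxi_id"] == taxi_id and abs(trip["pickup"] - seen_at) <= window
    ]


# A photo shows taxi 7B42 picking someone up at around 19:30 (made-up sighting).
matches = match_sighting(trips, "7B42", datetime(2013, 7, 8, 19, 30))
for trip in matches:
    print(trip["pickup"], trip["fare"], trip["tip"])
```

Once a single trip matches, everything else recorded for it, such as the drop-off location, fare and tip, is attached to the person in the photo.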
But it gets worse. Using the data, Tockar was also able to determine the drop-off locations of people who had presumably visited Larry Flynt’s Hustler Club. If such a location is in front of a large apartment building, the identity of the visitor is hard to establish. But if only a few people live at that location, it is of course much easier to find out who “had to work late”:
“Examining one of the clusters in the map above revealed that only one of the 5 likely drop-off addresses was inhabited; a search for that address revealed its resident’s name. In addition, by examining other drop-offs at this address, I found that this gentleman also frequented such establishments as “Rick’s Cabaret” and “Flashdancers”. Using websites like Spokeo and Facebook, I was also able to find out his property value, ethnicity, relationship status, court records and even a profile picture!”
The research done by Pandurangan and Tockar shows not only the importance of proper data anonymization, but also the risks incurred by putting all kinds of data online. Combining datasets is one way to deanonymize data. The more computing power and publicly available data there are, the easier it becomes to identify individuals in the data. At a time when even government institutions publish large datasets online for the sake of open data policies, the problem of deanonymized data exposing people’s day-to-day lives will only grow.
Written by: Stefan Kulk. He is a PhD candidate at Utrecht University in the Netherlands.
Thanks to Joost Gerritsen (Lawyer at De Gier | Stam & Advocaten) for alerting us to this new research.
big data, deanonymization, open data, privacy