It is no secret that the way someone speaks or writes, including their accent, helps reveal where they are from and what their social background is.
LADO (Language Analysis for the Determination of Origin) has been used since 1993 to determine where asylum seekers are really from (1), as different patterns of speech can be linked with different places of origin.
Speech patterns can also help governments find out the identity of an online user.
The true identity of Satoshi Nakamoto, the creator of Bitcoin
Last summer, an article published on medium.com claimed that the NSA had found out who Satoshi Nakamoto really is (2). According to the author, the NSA used stylometry to analyze thousands of emails, posts and comments written by Nakamoto and extract the 50 most common words in his writing. It then compared them to "trillions of writings from more than a billion people" gathered through PRISM and MUSCULAR, and was able to find Satoshi's identity in less than a month. The author said that this information came from someone working at the Department of Homeland Security.
There is no proof that the author's claims about Satoshi Nakamoto are true, but we do know that patterns found in our writing can help identify us.
Dr David Wright, a lecturer at Nottingham Trent University, managed to identify the authors of emails based on their speech patterns (3).
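The word-frequency comparison described in the article can be illustrated with a minimal sketch. It assumes a simple approach: take the most common words of a known author, turn each text into a vector of relative frequencies over those words, and rank candidates by cosine similarity. The function names, the tokenizer and the choice of cosine similarity are assumptions made for illustration; nothing here reflects how the NSA or any real system works.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def top_words(texts, n=50):
    """Return the n most frequent words across a list of texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common(n)]


def profile(text, vocabulary):
    """Relative frequency of each vocabulary word in the text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty text
    return [counts[word] / total for word in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two frequency vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_candidates(known_texts, candidates, n=50):
    """Rank candidate authors by similarity to the known author's word profile.

    candidates maps a (hypothetical) author name to a sample of their writing.
    """
    vocabulary = top_words(known_texts, n)
    target = profile(" ".join(known_texts), vocabulary)
    scores = {
        name: cosine_similarity(target, profile(text, vocabulary))
        for name, text in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Calling `rank_candidates(known_texts, {"alice": "...", "bob": "..."})` returns the candidates sorted from most to least similar. With only a handful of short texts the scores mean little; the approach relies on having a large amount of writing per author.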
To do so, he used a database that contains 1.7 million emails (4) sent by employees of a company called Enron. He selected thousands of emails sent by 12 different employees and searched for sequences and patterns particular to those individuals. He identified the words and expressions used by each person and the order in which they used them. He explained that the key to finding someone's identity lies in the banal phrases they use every day.
Although one might use the same word or expression as a coworker, one's speech pattern will still be identifiable: where a word sits in the sentence matters as much as the word itself.
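To see why word order matters, here is a minimal sketch that compares texts by the word sequences they share. It assumes word n-grams (runs of n consecutive words) as the unit of comparison and Jaccard overlap as the score; the source does not specify the exact measure Dr Wright used, so both choices, as well as the function names, are illustrative assumptions.

```python
import re


def word_ngrams(text, n=3):
    """Return the set of n-word sequences found in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Share of n-grams the two sets have in common (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_likely_author(disputed_email, emails_by_author, n=3):
    """Score each author by n-gram overlap with the disputed email.

    emails_by_author maps a (hypothetical) author name to a list of emails
    known to be written by that person.
    """
    disputed = word_ngrams(disputed_email, n)
    scores = {}
    for author, emails in emails_by_author.items():
        known = set()
        for email in emails:
            known |= word_ngrams(email, n)
        scores[author] = jaccard(disputed, known)
    return max(scores, key=scores.get), scores
```

Two people who both write "let me know" and "please" still end up with different n-gram sets if one habitually writes "please let me know" and the other "let me know please", which is what lets the overlap score tell them apart.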
Forensic linguistics is a powerful tool: when authorship is disputed, it can help determine the real author of a text. It can also help law enforcement find out who posted threats online or who blackmailed someone. If abused, however, it could be used to surveil people.
Whistleblowers and informers might want to protect their identity for safety reasons; the same goes for anyone who would like to be excluded from mass surveillance.
Drexel University's Privacy, Security and Automation Laboratory (PSAL) published a "Document Anonymization Tool" called Anonymouth on GitHub (5). The goal of this tool is to detect the stylometric patterns that could reveal the user's identity, so that the user can rewrite those sentences and better hide who they are.
Written by Marine Rouet
Published on November 7th, 2017
(1) http://etheses.whiterose.ac.uk/15266/1/Kim%20Wilson%20MPhil%20Thesis%20Feb%202016%20-%20LADO%20An%20Investigative%20Study.pdf
(2) https://medium.com/cryptomuse/how-the-nsa-caught-satoshi-nakamoto-868affcef595
(3) http://www.newsweek.com/identity-fraud-linguistics-email-scam-identify-crime-695972
(4) https://www.ntu.ac.uk/about-us/news/news-articles/2017/10/small-words-in-an-email-can-reveal-a-persons-identity
(5) https://github.com/psal/anonymouth