An analysis of the Yahoo! passwords
Last month, the biggest security news in the mainstream press was the password (hash) "breaches" at LinkedIn, eHarmony, and last.fm. Last week, it was a batch of passwords leaked via a Yahoo! service. These passwords were for one particular Yahoo! service, but the e-mail addresses used spanned quite a few domains. There has been some discussion of whether, for example, the passwords for Google accounts were also exposed. The short answer is that if a user committed one of the cardinal sins of passwords and reused the same one for multiple accounts, then yes, some Google (or other) passwords may also have been exposed. That said, it isn't primarily what I wanted to look at today. I also don't plan to spend much time on the password policy (or lack thereof) or the fact that the passwords were apparently stored in the clear; most security folks would probably agree that both are bad ideas.
The domains
First, I did a quick analysis of the domains. I should note that some of the e-mail addresses were clearly invalid (misspelled domains, etc.). There were a total of 35008 domains represented. The top 20 domains (after converting all to lower case) are shown in the table below.
137559 yahoo.com
106873 gmail.com
55148 hotmail.com
25521 aol.com
8536 comcast.net
6395 msn.com
5193 sbcglobal.net
4313 live.com
3029 verizon.net
2847 bellsouth.net
2260 cox.net
2133 yahoo.co.in
2077 ymail.com
2028 hotmail.co.uk
1943 earthlink.net
1828 yahoo.co.uk
1611 aim.com
1436 charter.net
1372 att.net
1146 mac.com
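For readers who want to reproduce a tally like the one above, here is a minimal Python sketch. It assumes the addresses have already been extracted into a list of strings; the function name count_domains and the sample addresses are made up for illustration. It skips entries without an "@" but makes no attempt to detect misspelled domains.

```python
from collections import Counter

def count_domains(addresses):
    """Tally the domain part of each address, case-insensitively."""
    counts = Counter()
    for address in addresses:
        # Split on the last "@"; skip entries with no "@" or nothing after it.
        _local, sep, domain = address.strip().rpartition("@")
        if sep and domain:
            counts[domain.lower()] += 1
    return counts

# Made-up sample addresses, for illustration only.
sample = ["alice@Yahoo.com", "bob@gmail.com", "carol@YAHOO.COM", "not-an-address"]
for domain, n in count_domains(sample).most_common(20):
    print(n, domain)
```

The number of distinct domains is then simply len(count_domains(addresses)).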
The passwords
I saw an interesting analysis of the eHarmony passwords by Mike Kelly at the Trustwave SpiderLabs blog and thought I'd do a similar analysis of the Yahoo! passwords (and I didn't even need to crack them myself, since the Yahoo! ones were posted in the clear). I pulled out my trusty install of pipal and went to work. As an aside, pipal is an interesting tool for those of you who haven't tried it. As I was preparing this diary, I noted that Mike says the Trustwave folks used PTJ, so I may have to take a look at that one, too.
The first thing to note is that of the 442,836 passwords, there were 342,508 unique passwords, so over 100,000 of them were duplicates.
Looking at the top 10 passwords and the top 10 base words, we note that some of the worst possible passwords are right there at the top of the list. 123456 and password are always among the first passwords that the bad guys guess because, for some reason, we haven't trained our users well enough to get them to stop using them. It is interesting to note that the base words in the eHarmony list seemed to be somewhat related to the purpose of the site (e.g., love, sex, luv, ...). I'm not sure what the significance of ninja, sunshine, or princess is in the list below.
Top 10 passwords
123456 = 1667 (0.38%)
password = 780 (0.18%)
welcome = 437 (0.1%)
ninja = 333 (0.08%)
abc123 = 250 (0.06%)
123456789 = 222 (0.05%)
12345678 = 208 (0.05%)
sunshine = 205 (0.05%)
princess = 202 (0.05%)
qwerty = 172 (0.04%)
Top 10 base words
password = 1374 (0.31%)
welcome = 535 (0.12%)
qwerty = 464 (0.1%)
monkey = 430 (0.1%)
jesus = 429 (0.1%)
love = 421 (0.1%)
money = 407 (0.09%)
freedom = 385 (0.09%)
ninja = 380 (0.09%)
sunshine = 367 (0.08%)
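The unique count and the two top-10 lists above can be approximated with a few lines of Python. This is only a sketch, not pipal's code: the base-word rule used here (strip non-letters from both ends, lowercase the rest, ignore results shorter than 4 letters) is an assumption, and the names base_word and summarize are hypothetical.

```python
import re
from collections import Counter

def base_word(password):
    """Assumed rule: strip non-letters from both ends, then lowercase."""
    return re.sub(r"^[^a-zA-Z]+|[^a-zA-Z]+$", "", password).lower()

def summarize(passwords, top=10):
    """Return total, unique count, and the most common passwords and base words."""
    exact = Counter(passwords)
    # Assumption: base words shorter than 4 letters are not counted.
    bases = Counter(b for b in map(base_word, passwords) if len(b) > 3)
    return {
        "total": len(passwords),
        "unique": len(exact),
        "top_passwords": exact.most_common(top),
        "top_base_words": bases.most_common(top),
    }

# Small made-up sample, for illustration only.
sample = ["password", "Password1", "123456", "123456", "password!", "ninja"]
report = summarize(sample)
print(report["total"], "passwords,", report["unique"], "unique")
print(report["top_passwords"])
print(report["top_base_words"])
```

Under this rule, password, Password1, and password! all share the base word password, which is why a base word can show a higher count than the identical exact password.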
Next, I looked at the lengths of the passwords. They ranged from 1 character (117 users) to 30 characters (2 users). Who thought allowing 1-character passwords was a good idea?
Password length (count ordered)
8 = 119135 (26.9%)
6 = 79629 (17.98%)
9 = 65964 (14.9%)
7 = 65611 (14.82%)
10 = 54760 (12.37%)
12 = 21730 (4.91%)
11 = 21220 (4.79%)
5 = 5325 (1.2%)
4 = 2749 (0.62%)
13 = 2658 (0.6%)
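A count-ordered length table like the one above is easy to rebuild. The sketch below assumes the passwords are already in a list; length_histogram is a hypothetical helper name, and the sample values are made up.

```python
from collections import Counter

def length_histogram(passwords):
    """Return (length, count, percent) rows, most common length first."""
    total = len(passwords)
    counts = Counter(len(p) for p in passwords)
    return [
        (length, n, round(100.0 * n / total, 2))
        for length, n in counts.most_common()
    ]

# Small made-up sample, for illustration only.
sample = ["123456", "password", "welcome", "sunshine", "a"]
for length, n, pct in length_histogram(sample):
    print(f"{length} = {n} ({pct}%)")
```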
We security folks have long preached (and rightly so) the virtues of a "complex" password. By increasing the size of the alphabet and the length of the password, we increase the work the bad guys must do to guess or crack the passwords. We've gotten in the habit of telling users that a "good" password consists of [lower case, upper case, digits, special characters] (choose 3). Unfortunately, if that is all the guidance we give, users, being human and, by nature, somewhat lazy, will apply those rules in the easiest way.
First capital last symbol = 1259 (0.28%)
First capital last number = 17467 (3.94%)
On the other hand, if we don't enforce at least that much, users won't bother.
Only lowercase alpha = 146516 (33.09%)
Only uppercase alpha = 1778 (0.4%)
Only alpha = 148294 (33.49%)
Only numeric = 26081 (5.89%)
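The character-class figures above can be approximated with a handful of regular expressions. The patterns below are my own reading of the category names, not pipal's definitions. In the figures above, "Only alpha" equals the sum of the lowercase-only and uppercase-only counts, so this sketch derives it the same way and leaves mixed-case alphabetic passwords out of that bucket; that choice is an assumption.

```python
import re

# Assumed definitions; each pattern must match the whole password.
CHECKS = {
    "First capital last symbol": re.compile(r"[A-Z].*[^a-zA-Z0-9]"),
    "First capital last number": re.compile(r"[A-Z].*[0-9]"),
    "Only lowercase alpha": re.compile(r"[a-z]+"),
    "Only uppercase alpha": re.compile(r"[A-Z]+"),
    "Only numeric": re.compile(r"[0-9]+"),
}

def classify(passwords):
    """Count how many passwords fully match each pattern."""
    counts = {
        name: sum(1 for p in passwords if rx.fullmatch(p))
        for name, rx in CHECKS.items()
    }
    # Assumption: "Only alpha" is the sum of the two single-case buckets.
    counts["Only alpha"] = (
        counts["Only lowercase alpha"] + counts["Only uppercase alpha"]
    )
    return counts

# Small made-up sample, for illustration only.
sample = ["password", "PASSWORD", "123456", "Password1", "Password!", "PassWord"]
for name, n in classify(sample).items():
    print(f"{name} = {n}")
```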
I thought it was also interesting looking at the passwords that contained a year:
Years (Top 10)
2008 = 1145 (0.26%)
2009 = 1052 (0.24%)
2007 = 765 (0.17%)
2000 = 617 (0.14%)
2006 = 572 (0.13%)
2005 = 496 (0.11%)
2004 = 424 (0.1%)
1987 = 413 (0.09%)
2001 = 404 (0.09%)
2002 = 404 (0.09%)
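A rough way to produce a year tally like this one is to search each password for a four-digit run that looks like a year. The sketch below assumes years from 1900 to 2099, which is my choice and may not match the range pipal uses. It counts each year at most once per password and will also pick up coincidental matches inside longer digit strings.

```python
import re
from collections import Counter

# Assumed range: any four digits starting with 19 or 20.
YEAR = re.compile(r"(?:19|20)[0-9]{2}")

def count_years(passwords):
    """Count passwords containing each year-like substring."""
    counts = Counter()
    for p in passwords:
        # set() so a year repeated within one password is counted once.
        for year in set(YEAR.findall(p)):
            counts[year] += 1
    return counts

# Small made-up sample, for illustration only.
sample = ["summer2008", "2008", "jim1987!", "20082009", "password"]
for year, n in count_years(sample).most_common(10):
    print(f"{year} = {n}")
```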
What is the significance of 1987, and why is there nothing more recent than 2009? When I analyzed some other passwords, I'd see either the current year, the year the account was created, or the year the user was born. And finally, some statistics inspired by the Trustwave analysis:
Months (abbr.) = 10585 (2.39%)
Days of the week (abbr.) = 6769 (1.53%)
Containing any of the top 100 boys names of 2011 = 18504 (4.18%)
Containing any of the top 100 girls names of 2011 = 10899 (2.46%)
Containing any of the top 100 dog names of 2011 = 17941 (4.05%)
Containing any of the top 25 worst passwords of 2011 = 11124 (2.51%)
Containing any NFL team names = 1066 (0.24%)
Containing any NHL team names = 863 (0.19%)
Containing any MLB team names = 1285 (0.29%)
I wish I had their list of curse words to test. :)
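The "containing any of" counts above boil down to a case-insensitive substring test against a word list. Here is a minimal sketch using the month and day abbreviations, since those lists are unambiguous; the name lists, team lists, and worst-password lists would have to be supplied separately. count_containing is a hypothetical helper, and plain substring matching is an assumption about how such counts are made.

```python
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]
DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]

def count_containing(passwords, words):
    """Count passwords that contain at least one word, ignoring case."""
    words = [w.lower() for w in words]
    return sum(
        1 for p in passwords
        if any(w in p.lower() for w in words)
    )

# Small made-up sample, for illustration only.
sample = ["sunshine", "monkey", "ninja", "March2009", "qwerty"]
print("Months (abbr.) =", count_containing(sample, MONTHS))
print("Days of the week (abbr.) =", count_containing(sample, DAYS))
```

Note that a plain substring test counts sunshine as containing "sun" and monkey as containing "mon", so short abbreviations will overstate how many users actually had a day or month in mind.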
Conclusions?
So, what conclusions can we draw from all of this? Well, the obvious one is that without any direction, most users will not choose particularly strong passwords, and the bad guys know this. What constitutes a good password? What constitutes a good password policy? Personally, I think the longer, the better, and I actually recommend [lower case, upper case, digit, special character] (choose at least one of each). Hopefully none of these users were using the same password here as on their banking sites. What do you, our faithful readers, think?
---------------
Jim Clausing, GIAC GSE #26
jclausing --at-- isc [dot] sans (dot) edu
The opinions expressed here are strictly those of the author and do not represent those of SANS, the Internet Storm Center, the author's spouse, kids, or pets.