## Friday, August 12, 2011

### Learning from comics – XKCD at least

This is a bit surprising to me. I’m not a big fan of random passwords, mostly because they are so difficult to remember that they tend to get written down.

But as if the cartoon itself isn’t nerdy enough, there’s a further discussion and explanation here: http://ask.metafilter.com/193052/Oh-Randall-you-do-confound-me-so#2779020

which the artist contributes to, thereby cementing his reputation as the nerdiest cartoonist working today:

Sorry I don't have time to dive into this discussion; I knew this one would lead to a lot of arguments :)
But to clear up a point or two, zompist is right—there are some subtle problems here, and I was looking at the math pretty carefully (at least, carefully for a webcomic). Information theory is one of my favorite areas, and I spent a while recently working on practical password cracking, which is what I based a lot of this on. I tried to pack in as much context as I could for the math for people who wanted to follow, but I figured there was only so much wall-of-text people would read before they get bored and switch over to Garfield :)

*He's not pointing to the length of the password, but to something subtler. He's basically claiming that when asked to create strong passwords (i.e. with numbers, mixed case, and punctuation), people end up following a predictable formula. The formula is explained in the first panel, with an estimate of how much variation there is in each piece, measured in bits. E.g. a number can be stored in just three bits (actually wrong, you need four),*

I was rounding to the closest bit, since that gives you a smaller error on the resulting time than rounding up—even though four is the number of bits required (since for storage you obviously have to round up). But there's actually another reason: I don't think all numbers are equally probable, so the entropy is actually less than 3 bits. Here's a table of ending digit frequency from an actual set of decrypted passwords:
```shell
aram:~$ cat plaintext_passes.txt | egrep -oi "[0-9]$" | sort | uniq -c
     52 0
    350 1
     63 2
    168 3
     76 4
     62 5
     99 6
     71 7
     60 8
     42 9
```
In practice, I think the ending digit probably has more like 2.5 bits of entropy, if you're looking only at passwords constructed of the form in my comic.
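As a sanity check on that estimate, the raw Shannon entropy of the ten counts above is easy to compute directly. (This is my own back-of-the-envelope check, not Munroe's calculation; it comes out just under 3 bits, versus the 3.32 bits a uniformly random digit would carry.)

```python
import math

# Ending-digit counts from the decrypted-password table quoted above.
counts = [52, 350, 63, 168, 76, 62, 99, 71, 60, 42]
total = sum(counts)

# Shannon entropy in bits: H = -sum(p * log2(p)).
entropy = -sum(c / total * math.log2(c / total) for c in counts)

print(f"{entropy:.2f} bits")  # 2.94 bits, versus log2(10) ≈ 3.32 for uniform digits
```

So even this raw table sits below 3 bits; his 2.5-bit figure for passwords following the comic's formula is a further judgment call on top of that.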

*while a single bit suffices to indicate whether there's an initial capital or not.*

*The four-common-words password is also a formula, and his estimate of 11 bits for each one is equivalent to assuming that the words are chosen from a list of 2048 words. (Which is an adequate definition of "common".)*

I actually cheated a little on that one, because "staple" isn't actually in the 2048-most-common wordlist I checked, but it sounded funnier to me :P (and it depends on your choice of list anyway.)
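The passphrase arithmetic falls straight out of that 2048-word assumption. (This is my own recomputation, not from the thread; the 1,000-guesses-per-second rate is the one the comic itself assumes.)

```python
import math

words = 2048                     # assumed size of the "common words" list
bits = 4 * math.log2(words)      # four words drawn independently from it

guesses = 2 ** bits                          # worst case: exhaust the whole space
years = guesses / 1000 / (3600 * 24 * 365)   # at 1000 guesses/sec, the comic's rate

print(bits, round(years))  # 44.0 bits, about 558 years (the comic rounds to 550)
```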

*I think he's underestimating the "strong password" variability-- e.g. three bits for "common substitutions" means he thinks there are only 8 possibilities, which is awfully low. It looks like he's just counting the vowels, but there are other easy substitutions.*

There are other easy substitutions, but they're made in a pattern, so the practical entropy is again reduced. People usually either do all the numeral substitutions or none, and the more esoteric ones (like 7 for t) come up a lot less frequently. That aside, you also have to look at how frequently there are opportunities for substitutions in the pool of base words. I came up with my number based on frequencies of common and uncommon substitutable letters in a list of six- to nine-letter words. But my guesses for what substitutions are most common could, of course, be wrong!
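A rough illustration of both sides of that exchange (my own sketch, using an assumed four-letter substitution map, not Munroe's actual tally): in the comic's base word only three letters are commonly substitutable, so treating each one as an independent choice gives 2³ = 8 variants, the 8 possibilities zompist mentions, while the all-or-none behavior Munroe describes collapses that to just 2 variants, a single bit.

```python
word = "troubador"    # the comic's (misspelled) base word
subs = set("aeio")    # letters with an assumed common numeral substitution (4, 3, 1, 0)

n = sum(ch in subs for ch in word)  # substitutable positions here: o, a, o -> 3

print(2 ** n)  # 8 variants if each letter is substituted independently -> 3 bits
print(2)       # 2 variants if people substitute everything or nothing -> 1 bit
```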

*On the other hand, I think he's underestimating the pool of dictionary words too. No one has the OED memorized, but an educated speaker knows at least ten times that number of words, so his total goes up to 56 bits. And even more if you use a quirky non-dictionary word.*

I did random sampling from the default Debian dictionary and from a few other corpuses, and against whatever algorithm they have at http://testyourvocab.com/ and decided that even if people *were* picking randomly, 60,000 was a generously high estimate for the number of base words in their vocabulary. 30,000 is closer to typical based on dictionary words. I suspect including non-dictionary words doesn't expand the list nearly as much as one might think; we spend all our lives reading and learning dictionary words, and comparatively little of the text we write pulls from any larger vocabulary of strings. Now, if I took into account the fact that people are without a doubt not capable of generating anything close to a "random word", my entropy would actually be a serious overestimate, but I decided to be generous with that one to make the comparison more fair—since I was assuming that in the passphrase example, the person had a good method for picking a random word. In practice, I bet if you asked a bunch of people on the street to pick a random word they thought no one else would guess, the result would have 8 or 9 bits of entropy at best; I mean, half of them would say "lol, ok, i'm so random ... penguin!"
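Putting those vocabulary figures side by side (again my own arithmetic, not from the thread), even the generous 60,000-word estimate only adds about five bits per word over the 2048-word list, and only if the choice were truly random, which is exactly what Munroe argues it isn't.

```python
import math

# Per-word and four-word entropy under different assumed vocabulary sizes.
sizes = (2048, 30000, 60000)
bits = {v: math.log2(v) for v in sizes}

for v in sizes:
    print(f"{v:>6} words: {bits[v]:.1f} bits/word, {4 * bits[v]:.1f} bits for four words")
```

This gives 11.0, 14.9, and 15.9 bits per word respectively, i.e. 44.0, 59.5, and 63.5 bits for a four-word passphrase.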
Anyway, I hope that clears a few things up! I'm not sure I got everything right, and there are a lot of fundamentally hazy unknowns that depend on the person and the schema, but I think the assumptions and math are basically solid for what they cover. On the other hand, I'm just a guy with a webcomic, so I'll defer to any actual CS and security experts out there.
Information theory is a really cool field, and if you find these issues at all interesting but don't know a lot about them, I definitely recommend picking up a book that talks about entropy and randomness. It's one of the most interesting things I've ever studied.

posted by xkcd at 9:17 AM on August 10 [194 favorites]

Note that first sentence: ‘Sorry I don't have time to dive into this discussion’. I wonder how long his post would have been if he had had time to dive into the discussion!