Anth's Computer Cave Tutorials

Python Password Analyzer

In this series we create a new open-source password-cracking program for penetration testing, and for clowns who've locked themselves out of their files.

First we build scripts to analyze passwords and find the average number of guesses for different password lengths and guessing methods, then we use the results to design our program.

Use the links below to read each article.

You can download all of the code for this series here.

Password Analyzer four: Word list

In the previous article we created brute-force password guessing methods and saw the massive number of guesses required to guess any passwords longer than about five characters.

Today we'll prepare to add a dictionary-attack method that takes password length out of the equation. This method uses a large list of words to guess passwords instead of trying every combination of characters.

Word list

Our first step is to create a word list. We'll create a script to scrape words from general text documents.

Source text

I've written a lot of words over the years, so I have hundreds of pages of my own text to scrape. If you have some word-rich documents on hand, let's scrape those words and store them in a way our dictionary method can use them. We'll convert them to .txt files then use our script to store the words in a JSON list.

Most of my texts were in an old DOCX format created in Microsoft Word. Our word-scraping script can't easily deal with DOCX, but it's quite easy to convert to a Python-friendly TXT format: just highlight the entire text inside the document, then copy and paste it into a blank Notepad file. Notepad will save it as a .txt file. This should also work with text from other file formats like PDFs.

Scraper script

The scraper script is in the code download folder (see link at top of page). The folder also contains a 9500-word word list called words.txt to get you started, as well as the finished password analyzer code we'll cover in the next article.

Let's look at the code.

There is a wordfile variable (the source text to scrape) and a wordlist variable (the file to store your new wordlist when done).

wordfile = "source.txt"   # Source text
wordlist = "words.txt"  # JSON list to store all_words array

# mode is 'new' to create new list, 'append' to add to existing list
mode = "append"

The mode variable decides whether the program opens and adds words to an existing word list, or creates a new list. To build on the provided 9500-word list, leave as 'append'. To start a new list, change to 'new'. Note this will overwrite the provided list unless you also change the wordlist variable.

We have an alphabetic array to hold our words, which will be written to the wordlist file.

# Alphabetic array to store words
all_words = {"a": [], "b": [], "c": [], "d": [], "e": [], "f": [], "g": [], "h": [], "i": [], \
            "j": [], "k": [], "l": [], "m": [], "n": [], "o": [], "p": [], "q": [], "r": [], \
            "s": [], "t": [], "u": [], "v": [], "w": [], "x": [], "y": [], "z": [], "common": []}
# Destination in words array, letters or common
destination = "letters"

line_count = 0  # Number of lines in source file
word_count = 0 # Number of unique words collected

# Items to exclude from words
junk = ['"', ".", ",", "!", "?", ":", ";", "\n"]
numbers = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '0']

The all_words array also has a 'common' section you can populate by changing the destination variable to 'common'. In the provided word list, words.txt, the common section contains the 2000 most-common passwords used on the web.

There is a junk array of symbols we should exclude, and a numeric array used to exclude words beginning with numbers: chapter headings, dates, etc.

The next block of code opens your word list and loads any existing words into the all_words array if mode is set to append.

# Load existing word list
if mode != "new":
    with open(wordlist) as word_file:
        word_holder = json.load(word_file)
        if len(word_holder[0]) > 0:
            all_words = word_holder[0]
            for word in all_words:
                word_count += len(all_words[word])
        print("Words loaded: " + str(word_count))

It also checks the number of words for each letter and increments the word_count, then prints the total number of words loaded.

Next the script will open the source text file to scrape. It reads each line of your target text and splits the line by spaces.

# Open the source file to read text
word_source = open(wordfile, 'r')
for line in word_source:
    if len(line) > 1:
        # Split line into individual words
        words = line.split(" ")
        for w in words:
            # Check that word has length
            if w != "":
                bare_word = str(w)
                # Handle single quotes
                if "'" in bare_word:
                    if bare_word.startswith("'") or bare_word.endswith("'"):
                        # If used as quotation marks, remove all
                        bare_word = bare_word.replace("'", "")
                    else:
                        # If used as apostrophe, remove apostrophe and letters after
                        bare_word = bare_word.split("'")[0]
                # Remove junk characters
                for j in junk:
                    if j in bare_word:
                        bare_word = bare_word.replace(j, "")
                # Convert word to lower-case
                bare_word = bare_word.lower()

It works through the resulting words, excluding junk characters and removing apostrophes.

It also converts the words to lower-case. We're converting to lower-case so we can store just one instance of each word. The programs that use the list can then create upper-case variations on the fly.
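As a quick illustration of what "on the fly" means here, a guessing routine could expand each stored lower-case word into a few common case variations before trying it. This is just a sketch of the idea; the particular variation set is an assumption, not the analyzer's actual code:

```python
def case_variations(word):
    """Expand a stored lower-case word into common case variations."""
    # The chosen variations are an illustrative assumption
    variations = [word, word.capitalize(), word.upper()]
    # Remove duplicates (e.g. for one-letter words) while keeping order
    unique = []
    for v in variations:
        if v not in unique:
            unique.append(v)
    return unique

print(case_variations("password"))  # ['password', 'Password', 'PASSWORD']
```

Storing only the lower-case form keeps the word list small, while the guesser pays the tiny cost of generating the variants at run time.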

Next the program removes any hidden formatting or encoding, then ensures the remaining text still has length and begins with a letter (which also excludes words starting with numbers). strip() fails to remove some unwanted encoding, such as the '\u2019' and '\u201d' items below.

                # Remove hidden characters
                bare_word = bare_word.strip()
                # Check that word still has length, and begins with a letter
                if len(bare_word) > 1 and bare_word[0] in all_words:
                    # These may be in texts copied from .docx format
                    bare_word = bare_word.replace(u'\u201d', '')
                    bare_word = bare_word.replace(u'\u2026', '')
                    bare_word = bare_word.replace(u'\u2019s', '')
                    bare_word = bare_word.replace(u'\u2019t', '')
                    bare_word = bare_word.replace(u'\u2019', '')
                    bare_word = bare_word.replace(u'\u2019ll', '')
                    bare_word = bare_word.replace(u'\u2019d', '')
                    bare_word = bare_word.replace('\n', '')

A mixture of these were appended to some of the words in my initial list. I reckon it's because I copied my source text from .DOCX to .TXT.

You may find some new items in your words depending on the format of the text you are scraping. If so, just replicate one of the bare_word.replace() lines above and substitute your characters to exclude.
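If the list of stray characters grows, an alternative to stacking bare_word.replace() calls is Python's str.translate(), which deletes a whole set of single characters in one pass. Note it works per character, so it won't strip multi-character sequences like '\u2019s' in one go. A minimal sketch, using the characters from the block above:

```python
# Map each unwanted character's code point to None so translate() deletes it
unwanted = {ord(c): None for c in '\u201d\u2026\u2019\n'}

bare_word = 'don\u2019t\u201d\n'
print(bare_word.translate(unwanted))  # dont
```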

The program now checks if the word is hyphenated and, if so, removes the hyphen and splits the word into two.

                    # Split word if hyphenated
                    second_word = ""
                    if "-" in bare_word:
                        second_word = bare_word.split("-")[1]
                        bare_word = bare_word.split("-")[0]
                    # Add word to either alphabetical or common section of list
                    if bare_word not in all_words[bare_word[0]] and bare_word not in all_words["common"]:
                        if destination == "letters":
                            all_words[bare_word[0]].append(bare_word)
                        else:
                            all_words["common"].append(bare_word)
                        word_count += 1
                    # Add second word if original was hyphenated
                    if second_word != "" and second_word[0] in all_words:
                        if second_word not in all_words[second_word[0]] and second_word not in all_words["common"]:
                            if destination == "letters":
                                all_words[second_word[0]].append(second_word)
                            else:
                                all_words["common"].append(second_word)
                            word_count += 1
    line_count += 1  

It then checks the finished word and determines whether it has already stored that exact word. If not it adds it to the relative letter in the all_words array and increments the word count. If the destination variable is set to 'common', the word will instead be added to the 'common' section of all_words.

To finish up, the program writes your updated word list to file and displays the results.

# Write updated list to file
holder = [all_words]
with open(wordlist, 'w') as data_file:
    json.dump(holder, data_file)

# Create a dict to hold the number of words for each letter
ordered_letters = {}
for let in all_words:
    ordered_letters[let] = len(all_words[let])
# Print the letters in order of most words
letter_order = sorted([(value, key) for (key, value) in ordered_letters.items()], reverse=True)
for i in letter_order:
    print(i[1] + ": " + str(i[0]))
print("Lines read: " + str(line_count))
print("Unique words: " + str(word_count))

It prints the number of lines read. Note the lines are actually paragraphs, not lines as you saw them in your source text.

It prints each letter, and the number of words beginning with that letter. Lastly it prints the total number of unique words in the word list.

Run the program

Let's take it for a run.

Change the wordfile variable at the top of the code to the name of your source text file. I'm using an old half-finished novel with a couple of hundred pages of text.

Here's the result from my text file.

An alphabetical summary of words scraped from a document.

The program has read 2652 lines, scraped 7033 unique words and sorted them alphabetically. It has then displayed the number of words beginning with each letter.

You can now change the mode variable to 'append' and run it against more text files to add to your word list.

Common passwords

A while ago I downloaded a list of 2000 of the most common passwords. Unfortunately I can't remember where I found it, so I can't direct you there, but I've scraped the old downloaded file and added it to the list in a separate section called common. It contains many of the key-combinations produced by sliding fingers along a keyboard, and many common names. This is the first section the Dictionary attack will try, before moving on to the slower alphabetical sections.
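That lookup order could be sketched like this. This is only an illustrative assumption about how the next article's dictionary attack might walk the list, using the all_words structure our scraper produces:

```python
def candidate_words(all_words):
    """Yield guesses: common passwords first, then the alphabetical sections."""
    for word in all_words["common"]:
        yield word
    for letter in sorted(k for k in all_words if k != "common"):
        for word in all_words[letter]:
            yield word

# Tiny example structure
all_words = {"a": ["apple"], "b": ["banana"], "common": ["123456"]}
print(list(candidate_words(all_words)))  # ['123456', 'apple', 'banana']
```

Using a generator means the attack can stop as soon as a guess matches, without ever building the full candidate list in memory.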

If you have a list of known common passwords you wish to add to the common section of your word file, change the destination variable to 'common' and run the program with the source text.

Finding source text

As I mentioned earlier, if you have any word-heavy ebooks or PDF files to scrape, start by copying and pasting those into a text file.

If you use Linux you may already have access to large word lists to convert with our word grabber. Debian has a range of wordlist packages you can install. You can copy the word files as txt files and run our script to add to your JSON list. In my tests with the wamerican package our script scraped 72,000 words to my list.

These lists are often already in .txt format, usually with a single word per line, so they're ready for scraping with our script.
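Because these lists already hold one word per line, a much simpler loader than the full scraper will do. Here's a minimal sketch (the file names are placeholders, not part of the series code) that sorts such a list straight into the all_words structure:

```python
import json

def load_wordlist(lines):
    """Sort one-word-per-line text into the alphabetical all_words structure."""
    all_words = {letter: [] for letter in "abcdefghijklmnopqrstuvwxyz"}
    all_words["common"] = []
    for line in lines:
        word = line.strip().lower()
        # Keep words longer than one letter that start with a letter and aren't stored yet
        if len(word) > 1 and word[0] in all_words and word not in all_words[word[0]]:
            all_words[word[0]].append(word)
    return all_words

# Usage (file names are placeholders):
# with open("wamerican.txt") as f:
#     holder = [load_wordlist(f)]
# with open("words.txt", "w") as out:
#     json.dump(holder, out)
```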

You'll find open word lists online, too, but you'll need to figure out how to download them in a format you can use. I'll try to get some examples here soon.


In the next article we'll add dictionary attack functionality to our password analyzer and use our new word list to try out some passwords.



Previous: Brute-force password analyzer Part 2

Next: Dictionary password analyzer



Leave a comment on this article