Anth's Computer Cave Tutorials

Python Password Analyzer

In this series we create a new open-source password-cracking program for penetration testing, and for clowns who've locked themselves out of their files.

First we build scripts to analyze passwords and find the average number of guesses for different password lengths and guessing methods, then we use the results to design our program.

Use the links below to read each article.

You can download all of the code for this series here.

Password Analyzer four: Word list

In the previous article we created brute-force password guessing methods and saw the massive number of guesses required to guess any passwords longer than about five characters.

Today we'll prepare to add a dictionary-attack method that takes password-length out of the equation. This method uses a large list of words to guess passwords instead of trying every combination of characters.

Word list

Our first step is to create a word list. We'll create a script to scrape words from general text documents.

Source text

I've written a lot of words over the years, so I have hundreds of pages of my own text to scrape. If you have some word-rich documents on hand, let's scrape those words and store them in a way our dictionary method can use them. We'll convert them to .txt files then use our script to store the words in a JSON list.

Most of my texts were in an old DOCX format created in Microsoft Word. Our word-scraping script can't easily deal with DOCX, but it's quite easy to convert to a Python-friendly TXT format. Just highlight the entire text inside the document, copy it, and paste it into a blank Notepad file. Notepad will save it as a .txt file. This should also work with text copied from other file formats like PDFs.
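If you have a lot of documents to convert, you can automate the copy-and-paste step. A .docx file is really just a zip archive containing XML, so a short standard-library script can pull the text out. This is a minimal sketch (the docx_to_txt name is mine, and real documents contain structure it ignores, such as tables, headers and footnotes):

```python
import re
import zipfile

def docx_to_txt(docx_path, txt_path):
    """Extract plain paragraph text from a .docx file.

    A .docx is a zip archive; the body text lives in word/document.xml,
    where each <w:p> element is a paragraph and <w:t> elements hold text.
    """
    with zipfile.ZipFile(docx_path) as z:
        xml = z.read("word/document.xml").decode("utf-8")
    paragraphs = []
    for p in re.findall(r"<w:p[ >].*?</w:p>", xml, flags=re.DOTALL):
        runs = re.findall(r"<w:t[^>]*>(.*?)</w:t>", p, flags=re.DOTALL)
        paragraphs.append("".join(runs))
    with open(txt_path, "w") as f:
        f.write("\n".join(paragraphs))
```

Each paragraph becomes one line in the output file, which suits our scraper since it splits the source text line by line.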

Scraper script

Click here to download the password_analyser folder. This contains the word grabber script we'll use today, and a 9500-word word list to get you started. It also has the finished password analyser code we'll cover in the next article.

Let's look at the code in the scraper script.

There is a wordfile variable (the source text to scrape) and a wordlist variable (the file that will store your new word list when done).

The list_mode variable decides whether the program opens and adds words to an existing word list, or creates a new list. To build on the provided 9500-word list, set it to 'append'. To start a new list, set it to 'new'. Note that 'new' will overwrite the provided list unless you also change the wordlist variable.

import json

wordfile = "source.txt"  # Source text
wordlist = "words.txt"  # JSON list to store all_words array
list_mode = "new"  # 'new' to create new list, 'append' to append to existing list
# Set destination to 'letters' for alphabetical storage
# or 'common' to store in the most-common list
destination = "letters"
line_count = 0  # Number of lines in source file
word_count = 0  # Number of unique words collected
# Alphabetical sections plus a common-password section
all_words = {"a": [], "b": [], "c": [], "d": [], "e": [], "f": [], "g": [],
             "h": [], "i": [], "j": [], "k": [], "l": [], "m": [], "n": [],
             "o": [], "p": [], "q": [], "r": [], "s": [], "t": [], "u": [],
             "v": [], "w": [], "x": [], "y": [], "z": [], "common": []}
# Items to exclude from words
junk = ['"', ".", ",", "!", "?", "\n"]
numbers = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '0']

We have an alphabetic array to hold our words, which will be written to the wordlist file. The array also has a 'common' section you can populate by changing the destination variable. In the provided word list, words.txt, the common section contains the 2000 most-common passwords used on the web.

There is a junk array of symbols we should exclude, and a numbers array used to exclude words beginning with numbers: chapter headings, dates, etc.

The next block of code opens your word list and loads any existing words into the all_words array if list_mode is set to append.

# Load existing word list
if list_mode != "new":
    with open(wordlist) as word_file:
        word_holder = json.load(word_file)
        if len(word_holder[0]) > 0:
            all_words = word_holder[0]
            for word in all_words:
                word_count += len(all_words[word])
        print("Words loaded: " + str(word_count))

It also checks the number of words for each letter and increments the word_count, then prints the total number of words loaded.

Next the script will open the source text file to scrape. It reads each line of your target text and splits the line by spaces.

# Open the source file to read text
word_source = open(wordfile, 'r')
for line in word_source:
    if len(str(line)) > 1:
        # Split line into individual words
        words = str(line).split(" ")
        for w in words:
            if w != "" and "*" not in w:
                bare_word = str(w)
                if "'" in bare_word:
                    if bare_word.startswith("'") or bare_word.endswith("'"):
                        # If used as quotation marks, remove all
                        bare_word = bare_word.replace("'", "")
                    else:
                        # If used as apostrophe, remove apostrophe and letters after
                        bare_word = bare_word.split("'")[0]
                # Remove junk characters
                for j in junk:
                    if j in bare_word:
                        bare_word = bare_word.replace(j, "")
                # Convert word to lower-case
                bare_word = bare_word.lower()

It works through the resulting words, excluding junk characters and removing apostrophes.

It also converts the words to lower-case. We're converting to lower-case so we can store just one instance of each word. The programs that use the list can then create upper-case variations on the fly.
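As a sketch of what that on-the-fly variation could look like (case_variants is a hypothetical helper for illustration, not part of the scraper):

```python
def case_variants(word):
    """Return the capitalisation variants a guessing program
    might try for one stored lower-case word."""
    return [word, word.capitalize(), word.upper()]
```

Storing only 'password' but guessing 'password', 'Password' and 'PASSWORD' keeps the word list a third of the size it would otherwise need to be.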

Next the program ensures the word does not start with a number and checks for any hidden formatting or encoding, such as the items below beginning with '\u201d' or '\u2019'.

A mixture of these were appended to some of the words in my initial list. You may find some new items in your words depending on the format of the text you are scraping. If so, just replicate one of the bare_word.replace() lines below and substitute your characters to exclude.

                if len(bare_word) > 1 and bare_word[0] not in numbers:
                    if bare_word[0] in all_words:
                        # Remove hidden characters
                        bare_word = bare_word.replace(u'\u201d', '')
                        bare_word = bare_word.replace(u'\u2026', '')
                        bare_word = bare_word.replace(u'\u2019s', '')
                        bare_word = bare_word.replace(u'\u2019t', '')
                        bare_word = bare_word.replace(u'\u2019ll', '')
                        bare_word = bare_word.replace(u'\u2019d', '')
                        bare_word = bare_word.replace(u'\u2019', '')
                        bare_word = bare_word.replace('\n', '')
                        # Split word if hyphenated
                        second_word = ""
                        if "-" in bare_word:
                            second_word = bare_word.split("-")[1]
                            bare_word = bare_word.split("-")[0]
                        # Add word to either alphabetical or common section of list
                        if bare_word not in all_words[bare_word[0]] and bare_word not in all_words["common"]:
                            if destination == "letters":
                                all_words[bare_word[0]].append(bare_word)
                            else:
                                all_words["common"].append(bare_word)
                            word_count += 1
                        if second_word != "" and second_word[0] in all_words:
                            if second_word not in all_words[second_word[0]] and second_word not in all_words["common"]:
                                if destination == "letters":
                                    all_words[second_word[0]].append(second_word)
                                else:
                                    all_words["common"].append(second_word)
                                word_count += 1
    line_count += 1

The program checks if the word is hyphenated and, if so, removes the hyphen and splits the word into two.

It then checks the finished word and determines whether it has already stored that exact word. If not, it adds it to the relevant letter in the all_words array and increments the word count. If the destination variable is set to 'common', the word will instead be added to the 'common' section of all_words.

To finish up, the program writes your updated word list to file and displays the results.

# Write updated list to file
holder = [all_words]
with open(wordlist, 'w') as data_file:
    json.dump(holder, data_file)
print("Lines: " + str(line_count))
ordered_letters = {}
for b in all_words:
    ordered_letters[b] = len(all_words[b])
# Print the letters in order of most words
letter_order = sorted([(value, key) for (key, value) in ordered_letters.items()], reverse=True)
for i in letter_order:
    print(i[1] + ": " + str(i[0]))
print("Words: " + str(word_count))

It prints the number of lines read. Note that these "lines" are actually paragraphs, not the wrapped lines you saw in your source text.

It prints each letter, and the number of words beginning with that letter. Lastly it prints the total number of unique words in the word list.

Run the program

Let's take it for a run.

Change the wordfile variable at the top of the code to the name of your source text file. I'm using an old half-finished novel with a couple of hundred pages of text.

Here's the result from my text file.

An alphabetical summary of words scraped from a document.

The program has read 2652 lines, scraped 7033 unique words and sorted them alphabetically. It has then displayed the number of words beginning with each letter.

You can now change the list_mode variable to 'append' and run it against more text files to add to your word list.

Common passwords

A while ago I downloaded a list of 2000 of the most common passwords. Unfortunately I can't remember where I found it, so I can't direct you there, but I've scraped the old downloaded file and added it to the list in a separate section called common. It contains many of the key-combinations produced by sliding fingers along a keyboard, and many common names. This is the first section the Dictionary attack will try, before moving on to the slower alphabetical sections.
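To make that ordering concrete, here is a rough sketch of how a dictionary attack could walk the list, common section first (the candidate_words generator is illustrative; the real method arrives in the next article):

```python
import string

def candidate_words(all_words):
    """Yield guesses from the word list: the common-password
    section first, then each alphabetical section in turn."""
    for word in all_words["common"]:
        yield word
    for letter in string.ascii_lowercase:
        for word in all_words[letter]:
            yield word
```

Because it is a generator, the attack can stop the moment a guess succeeds rather than building the full guess list up front.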

If you have a list of known common passwords you wish to add to the common section of your word file, change the destination variable to 'common' and run the program with the source text.

Build your word list

To get a head-start you can use the provided word list as a base for your own word list, and use the script to add words from your own texts.

Otherwise you can start from scratch.

Either way you'll need some text files to scrape words from.

If you have any word-heavy ebooks or PDF files to scrape, start with those.

If you use Linux you may already have access to large word lists to convert with our word grabber. Debian has a range of wordlist packages you can install. You can copy the word files as txt files and run our script to add to your JSON list. In my tests with the wamerican package our script scraped 72,000 words to my list.
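Those system word lists are plain one-word-per-line files, so you can also convert one directly into our JSON structure without going through the scraper. This is a minimal sketch (convert_system_wordlist is my name for it, and /usr/share/dict/words is where Debian's wamerican package installs its list):

```python
import json
import string

def convert_system_wordlist(src_path, dest_path):
    """Convert a one-word-per-line system word list (e.g.
    /usr/share/dict/words from Debian's wamerican package)
    into the JSON structure our scripts load."""
    all_words = {letter: [] for letter in string.ascii_lowercase}
    all_words["common"] = []
    seen = set()
    with open(src_path) as f:
        for line in f:
            word = line.strip().lower()
            # Keep only purely alphabetic words starting with a-z,
            # skipping possessives ("apple's") and duplicates
            if word and word.isalpha() and word[0] in all_words and word not in seen:
                seen.add(word)
                all_words[word[0]].append(word)
    # Wrap in a list to match the format the other scripts expect
    with open(dest_path, "w") as f:
        json.dump([all_words], f)
```

The seen set keeps duplicate checks fast even on a 72,000-word source file.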

You'll find open word lists online, too, but you'll need to figure out how to download them in a format you can use. I'll try to get some examples here soon.


In the next article we'll add dictionary-attack functionality to our password analyzer and use our new word list to try out some passwords.



Previous: Brute-force password analyzer Part 2

Next: Dictionary password analyzer



Leave a comment on this article