Anth's Computer Cave Tutorials

Python Password Analyzer

In this series we'll create a new open-source password-cracking program for penetration testing, and for clowns who've locked themselves out of their files.

First we'll build scripts to analyze passwords and find the average number of guesses for different password lengths and guessing methods, then we'll use the results to design our program.

This is an ongoing series, and we'll add to it every week. Use the links below to read each article.

Password Analyzer four: Word list

In the previous article we created brute-force password-guessing methods and saw the massive number of guesses required to crack any password longer than about five characters.

Today we'll prepare to add a dictionary-attack method that takes password length out of the equation. Instead of trying every combination of characters, this method uses a large list of words to guess passwords.

Word list

Our first step is to create a word list. We'll create a script to scrape words from general text documents.

Source text

I've written a lot of words over the years, so I have hundreds of pages of my own text to scrape. If you have some word-rich documents on hand, we can scrape those words and store them in a form our dictionary method can use. We'll convert the documents to .txt files, then use our script to store the words in a JSON list.

Most of my texts were in an old DOCX format created in Microsoft Word. Our word-scraping script can't easily deal with DOCX, but it's quite easy to convert it to a Python-friendly TXT format. Just highlight the entire text inside the document and copy and paste it into a blank Notepad file. Notepad will save it as a .txt file. This should also work with text from other file formats, such as PDFs.
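If you'd rather skip the copy-and-paste step, here's a minimal sketch (not part of the tutorial's script) that pulls the text out of a .docx using only the standard library. The docx_to_txt name is my own, and real-world documents may need extra cleanup:

```python
# Minimal sketch: a .docx file is a zip archive whose body text
# lives in word/document.xml, so we can extract it without extra libraries.
import re
import zipfile

def docx_to_txt(docx_path, txt_path):
    with zipfile.ZipFile(docx_path) as archive:
        xml = archive.read("word/document.xml").decode("utf-8")
    # End-of-paragraph tags become newlines; all other tags are stripped
    xml = xml.replace("</w:p>", "\n")
    text = re.sub(r"<[^>]+>", "", xml)
    with open(txt_path, "w") as out:
        out.write(text)
```

This drops all formatting and keeps only paragraph breaks, which is all our scraper needs.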

Scraper script

Click the button below to copy the full code.

Paste the code into an empty Python file in the same folder as the source text file you wish to scrape.

Let's look at the code.

There is a wordfile variable (the source text to scrape) and a wordlist variable (the file to store your new wordlist when done).

The list_mode variable decides whether the program opens and adds words to an existing word list, or creates a new list. You can change that to 'append' after you've created your first list.

import json

wordfile = "source.txt"  # Source text
wordlist = "words.txt"  # JSON list to store all_words array
list_mode = "new"  # 'new' to create new list, 'append' to append to existing list
destination = "letters"  # Section of all_words array to store words: 'letters' (alphabetical) or 'common'
line_count = 0  # Number of lines in source file
word_count = 0  # Number of unique words collected
all_words = {"a": [], "b": [], "c": [], "d": [], "e": [], "f": [], "g": [], "h": [], "i": [], \
            "j": [], "k": [], "l": [], "m": [], "n": [], "o": [], "p": [], "q": [], "r": [], \
            "s": [], "t": [], "u": [], "v": [], "w": [], "x": [], "y": [], "z": [], "common": []}
# Items to exclude from words
junk = ['"', ".", ",", "!", "?", "\n"]
numbers = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '0']

We have an alphabetic array to hold our words, which will be written to the wordlist file. There is also a 'common' section you can populate by changing the destination variable.

There is a junk array of common symbols to exclude, and a numbers array used to exclude words beginning with numbers: chapter headings, dates, etc.
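In isolation, the junk-stripping and lower-casing steps amount to the following (strip_junk is a hypothetical helper name for illustration; the real script does this inline):

```python
# Characters to remove from scraped words, as in the main script
junk = ['"', ".", ",", "!", "?", "\n"]

def strip_junk(word):
    # Remove each junk character, then lower-case, as the main loop does
    for j in junk:
        word = word.replace(j, "")
    return word.lower()
```

So a raw token like 'Cave!' becomes the clean word 'cave' before it is stored.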

The next block of code opens your word list and loads any existing words into the all_words array. This won't happen until you've created your first list and changed the list_mode variable to 'append'.

# Load existing word list
if list_mode != "new":
    with open(wordlist) as word_file:
        word_holder = json.load(word_file)
        if len(word_holder[0]) > 0:
            all_words = word_holder[0]
            for word in all_words:
                word_count += len(all_words[word])
        print("Words loaded: " + str(word_count))

It also checks the number of words for each letter and increments the word_count, then prints the total number of words loaded.

Next the script will open the text file to scrape. It reads each line of your target text and splits the line by spaces.

# Open the source file to read text
word_source = open(wordfile, 'r')
for line in word_source:
    if len(str(line)) > 1:
        # Split line into individual words
        words = str(line).split(" ")
        for w in words:
            if w != "" and "*" not in w:
                bare_word = str(w)
                if "'" in bare_word:
                    if bare_word.index("'") == 0 or bare_word.rindex("'") == len(bare_word) - 1:
                        # If used as quotation marks, remove all
                        bare_word = bare_word.replace("'", "")
                    else:
                        # If used as apostrophe, remove apostrophe and letters after
                        bare_word = bare_word.split("'")[0]
                # Remove junk characters
                for j in junk:
                    if j in bare_word:
                        bare_word = bare_word.replace(j, "")
                # Convert word to lower-case
                bare_word = bare_word.lower()

It works through the resulting words, excluding junk characters and removing apostrophes. It also converts the words to lower-case.

Next it checks that the word does not start with a number, and removes any hidden formatting or encoding characters.

The Unicode escapes below (curly quotes, ellipses and similar) were appended to some of the words in my initial list. You may find different stray characters in your words depending on the format of the text you are scraping. If so, just replicate one of the bare_word.replace() lines below and substitute the characters to exclude.

                if len(bare_word) > 1 and bare_word[0] not in numbers:
                    # Remove hidden characters (longer patterns first)
                    if bare_word[0] in all_words:
                        bare_word = bare_word.replace(u'\u201d', '')
                        bare_word = bare_word.replace(u'\u2026', '')
                        bare_word = bare_word.replace(u'\u2019ll', '')
                        bare_word = bare_word.replace(u'\u2019d', '')
                        bare_word = bare_word.replace(u'\u2019s', '')
                        bare_word = bare_word.replace(u'\u2019t', '')
                        bare_word = bare_word.replace(u'\u2019', '')
                        bare_word = bare_word.replace('\n', '')
                        # Split word if hyphenated
                        second_word = ""
                        if "-" in bare_word:
                            second_word = bare_word.split("-")[1]
                            bare_word = bare_word.split("-")[0]
                        # Add word to either alphabetical or common section of list
                        if bare_word not in all_words[bare_word[0]] and bare_word not in all_words["common"]:
                            if destination == "letters":
                                all_words[bare_word[0]].append(bare_word)
                            else:
                                all_words["common"].append(bare_word)
                            word_count += 1
                        if second_word != "" and second_word[0] in all_words:
                            if second_word not in all_words[second_word[0]] and second_word not in all_words["common"]:
                                if destination == "letters":
                                    all_words[second_word[0]].append(second_word)
                                else:
                                    all_words["common"].append(second_word)
                                word_count += 1
    line_count += 1
word_source.close()

The program checks if the word is hyphenated and, if so, removes the hyphen and splits the word into two.

It then checks whether it has already stored that exact word. If not, it adds the word to the relevant letter's section in the all_words array and increments the word count. If the destination variable is set to 'common', the word is instead added to the 'common' section of all_words.
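The hyphen handling can be tried on its own (split_hyphenated is a hypothetical helper name; the script performs this inline):

```python
def split_hyphenated(word):
    # Mirror the script's inline hyphen handling: keep the part before the
    # first hyphen as the main word and the next part as a second word
    second_word = ""
    if "-" in word:
        second_word = word.split("-")[1]
        word = word.split("-")[0]
    return word, second_word
```

Note that only the first two parts of a multi-hyphen word survive, matching the script's behaviour.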

To finish up, the program writes your updated word list to file and displays the results.

# Write updated list to file
holder = [all_words]
with open(wordlist, 'w') as data_file:
    json.dump(holder, data_file)
print("Lines: " + str(line_count))
ordered_letters = {}
for b in all_words:
    ordered_letters[b] = len(all_words[b])
# Print the letters in order of most words
letter_order = sorted([(value, key) for (key, value) in ordered_letters.items()], reverse=True)
for i in letter_order:
    print(i[1] + ": " + str(i[0]))
print("Words: " + str(word_count))

It prints the number of lines read. Note that these 'lines' are actually paragraphs, not lines as you saw them in your source text.

It prints each letter, and the number of words beginning with that letter, then prints the letters in order of highest to lowest number of words.

Lastly it prints the total number of unique words in the word list.

Run the program

Let's take it for a run.

Change the wordfile variable at the top of the code to the name of your source text file. I'm using an old half-finished novel with a couple of hundred pages of text.

Here's the result from my text file.

An alphabetical summary of words scraped from a document.

The program has read 2652 lines, scraped 7033 unique words and sorted them alphabetically. It has then displayed the number of words beginning with each letter.

You can now change the list_mode variable to 'append' and run it against more text files to add to your word list.

Common passwords

A while ago I downloaded a list of 2000 of the most common passwords. Unfortunately I can't remember where I found it, so I can't direct you there, but I've scraped the old downloaded file and added it to my list in a separate section called common. It contains many of the key combinations produced by sliding fingers along a keyboard, and many common names. This is the first section the dictionary attack will try, before moving on to the alphabetical sections.
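The lookup order described above can be sketched like this (candidate_words is a hypothetical helper; the actual dictionary attack arrives in the next article). It assumes a words.txt produced by this article's script, i.e. a JSON file containing a one-element list holding the all_words dictionary:

```python
import json

# Sketch of the guessing order: try the 'common' section first,
# then fall back to the alphabetical sections.
def candidate_words(wordlist_path):
    with open(wordlist_path) as word_file:
        all_words = json.load(word_file)[0]
    # Common passwords get tried first
    for word in all_words["common"]:
        yield word
    # Then every alphabetical section
    for letter in sorted(k for k in all_words if k != "common"):
        for word in all_words[letter]:
            yield word
```

A generator like this lets the attack stop as soon as a guess succeeds, without loading the candidate order into a second list.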

If you wish to add words to the common section of your word file, change the destination variable to 'common' and run the program with your source text of common words.

Build your word list

To get a head start, you can download my 9500-word list here as a base for your own word list, and use the script to add words from your own texts.

Otherwise you can start from scratch.

If you have any word-heavy ebooks or PDF files to scrape, start with those.

If you use Linux you may already have access to large word lists to convert with our word grabber. Debian has a range of wordlist packages you can install. You can copy the word files as .txt files and run our script to add them to your JSON list. In my tests with the wamerican package, our script scraped 72,000 words into my list.

You'll find open word lists online, too, but you'll need to figure out how to download them in a format you can use. I'll try to get some examples here soon.


In the next article we'll add dictionary-attack functionality to our password analyzer and use our new word list to try out some passwords.



Previous: Brute-force password analyzer Part2

Next: Dictionary password analyzer


