Anth's Computer Cave

AAIMI SiteSearch Tutorial

Today I'll show you how to embed the AAIMI SiteSearch system in your web site to help your visitors find what they need.

An AAIMI SiteSearch embeddable search-box

We covered the SiteSearch features in an article last week. It is a drop-in search-box that runs on your site, without redirecting your visitors or collecting their personal data.

Let's download the program, scan the server to create a word list, and embed the code into your web files.


Limitation

As long as your site has only static content within the body of its HTML files, everything will work fine.

In this version the word list is created on the server side by scanning HTML files and extracting the content. It is not a web-based crawler, meaning it currently can't get dynamic content. There are work-arounds, but they are beyond the scope of this tutorial.

The next version will use a web-based crawler to get the dynamic content as well. We built the server-side crawler first because it indexes sites much faster than a crawler that has to download each page over the web.


Download

You can download the AAIMI SiteSearch program folder here.

Extract the folder and move it to the webroot directory on your server.

You can leave the Python programs, crawl_html_pages.py and site_search.py, in the main program folder, but I prefer to move them outside the webroot. If you do the same, you'll also need to copy the page_titles.txt and word_list.txt files into the same folder as the Python files. Whenever you crawl your site, you'll need to copy these txt files back to the aaimi_site_search directory afterwards. There are also several code modifications, which I'll cover in a moment.

All the other files must stay in the program folder.


Create word list

The first step is to create a list of every unique word on your site and record the number of instances of each word in every page.

For this we use the included Python program, crawl_html_pages.py.

Code modifications

There are a couple of code modifications required before you run the program.

On line 63 you'll find a variable called starting_point, followed by another called last_line.

Setting a unique line for AAIMI SiteSearch to begin scraping text

The starting_point is a unique line in all your pages close to the beginning of the page's content. By default it looks for the opening body tag, but you can change that if you wish. For instance, I had a nav panel and search boxes at the top of the body on my pages. I changed the starting point to the last unique line in the search-box to avoid scraping words from identical objects that are on every page.

The last_line is a unique line after the content on your site where you wish to stop scraping. The default is the closing body tag, but I changed that to a line at the top of the comment sections.
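
To make the idea of the two markers concrete, here is a minimal Python sketch of scraping only the lines between them. It is not the code from crawl_html_pages.py: the function name lines_between and the simple substring matching are assumptions for illustration, and the real program may compare lines differently.

```python
def lines_between(lines, starting_point, last_line):
    """Return the lines after the one containing starting_point,
    up to (but not including) the one containing last_line."""
    content = []
    started = False
    for line in lines:
        if not started:
            # Skip everything until the starting marker is seen.
            if starting_point in line:
                started = True
            continue
        if last_line in line:
            # Stop scraping at the end marker.
            break
        content.append(line)
    return content


page = [
    "<html>",
    "<head><title>Demo</title></head>",
    "<body>",
    "<p>Hello world</p>",
    "</body>",
    "</html>",
]
print(lines_between(page, "<body", "</body>"))  # ['<p>Hello world</p>']
```

If a marker also appears somewhere earlier in the page, scraping would start or stop too soon, which is why each marker needs to be a unique line.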

The next code-mod is the excluded_folders list on line 69. Here you can exclude entire directories or single files.

To exclude a directory, add the full path, along with a trailing slash, in quotes to the excluded_folders list. To exclude a file, enter the full path and filename instead. You can add as many exclusions as you want, separated by commas.

Exclude folders from crawling by AAIMI SiteSearch
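
As a rough illustration, a filled-in exclusion list might look like the sketch below. The paths are hypothetical placeholders, and the is_excluded helper is only an assumption about how such a list could be checked, not the program's actual logic.

```python
# Hypothetical paths for illustration; use the full paths on your own server.
excluded_folders = [
    "/var/www/html/drafts/",              # a directory: note the trailing slash
    "/var/www/html/private/notes.html",   # a single file: full path and filename
]


def is_excluded(path, exclusions):
    """Return True if path is an excluded file or sits inside an excluded directory."""
    for entry in exclusions:
        if entry.endswith("/"):
            # Directory entry: exclude anything underneath it.
            if path.startswith(entry):
                return True
        elif path == entry:
            # File entry: exclude only an exact match.
            return True
    return False


print(is_excluded("/var/www/html/drafts/old.html", excluded_folders))  # True
print(is_excluded("/var/www/html/index.html", excluded_folders))       # False
```

The trailing slash matters in a check like this one: without it, an entry for a folder called drafts would also match a sibling folder called drafts2.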

On line 72 change the yoursite variable to the URL for your website.

Next there is the existing_list variable on line 88. This denotes whether to open and append to an existing word list, or to create one from scratch.

If you are just indexing one site you should leave this as 'no'. If you are indexing more than one site into the same word list, change this to 'yes' once you have indexed the first site, and subsequent sites will be appended.

If you wish to move the crawl_html_pages.py file outside of the webroot directory you'll need to manually add the full path to your webroot directory on line 77.

If you also wish to move the site_search.py file, you'll need to modify the PHP file, site_search.php. Comment out lines 6 and 7, and uncomment line 9. Replace '/home/path/to/site_search.py' with the full path to its new location.

Run the crawler

Save the Python file, open a terminal and navigate into the aaimi_site_search folder. Type:

python crawl_html_pages.py

The program will move recursively through the folders on your server and create a list of every HTML file.
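
A recursive scan like this can be sketched in a few lines with Python's standard os.walk. This is a simplified illustration rather than the crawler's own code; the function name find_html_files and the choice of file extensions are assumptions, and exclusions are left out to keep it short.

```python
import os


def find_html_files(webroot):
    """Walk webroot recursively and return the paths of all HTML files found."""
    html_files = []
    for folder, _subfolders, filenames in os.walk(webroot):
        for name in sorted(filenames):
            # Assumed extensions; the real crawler may look for others.
            if name.lower().endswith((".html", ".htm")):
                html_files.append(os.path.join(folder, name))
    return html_files


if __name__ == "__main__":
    # List the HTML files under the current directory.
    for path in find_html_files("."):
        print(path)
```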

It then opens each file and scans for its starting_point line. Until it finds that line, it ignores everything except title tags. It adds any content within a title tag to the page titles list, to display with the page in search results. Once it finds the starting_point line, it begins scraping every word that is not inside a tag.
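
The two scraping jobs described here, pulling out the title and collecting the words that sit outside tags, can be sketched with regular expressions. This is an assumption-laden illustration, not the program's real parser: the names find_title and words_outside_tags are made up, and the patterns ignore cases such as tags split across lines or HTML entities.

```python
import re

TAG = re.compile(r"<[^>]*>")  # anything between < and >
TITLE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
WORD = re.compile(r"[a-z0-9']+")


def find_title(html):
    """Return the text inside the first title tag, or None if there is none."""
    match = TITLE.search(html)
    return match.group(1).strip() if match else None


def words_outside_tags(html):
    """Strip the tags, then return the remaining words in lower case."""
    text = TAG.sub(" ", html)
    return WORD.findall(text.lower())


print(find_title("<head><title> My Page </title></head>"))
# My Page
print(words_outside_tags('<p class="intro">Hello <b>search</b> box</p>'))
# ['hello', 'search', 'box']
```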

Once it reaches the defined last line it will add an array for each word to the main list. If the word already exists it will add an entry for the page's url to that word's array, along with the number of instances.
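
Here is one way the merge step could look in Python. Treat it as a sketch under stated assumptions: it keeps the word list as a dictionary keyed by word, with a [url, count] pair for each page, whereas the actual layout of word_list.txt may differ. The add_page name and the example.com URLs are hypothetical.

```python
from collections import Counter


def add_page(word_list, url, words):
    """Record how many times each word appears on the page at url."""
    for word, count in Counter(words).items():
        # New words get a fresh array; existing words gain another page entry.
        word_list.setdefault(word, []).append([url, count])
    return word_list


word_list = {}
add_page(word_list, "https://example.com/a.html", ["search", "box", "search"])
add_page(word_list, "https://example.com/b.html", ["search"])
print(word_list["search"])
# [['https://example.com/a.html', 2], ['https://example.com/b.html', 1]]
```

Storing a count per page is what would let a search rank pages by how often a term appears on them.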

It will take a few seconds to index your entire site.


Embed the search box HTML and Javascript

In the setup folder you'll find two files called site_search_embed_code.html and site_search_embed_code.js.

HTML code

The HTML file contains the code for the search box. Paste this into your desired position in the HTML of all of your webpages. On page load it will merely display a button that unhides the search-box. You can see on this page that I have placed my code at the bottom of the menu panel in the left column. Being in the left column means it will be towards the top of the page on mobile devices using the dynamic one-column mobile layout.

You can add your own inline CSS to the SiteSearch code as long as you don't change any classes or ids for the elements (that would break the Javascript functions).

Javascript code

The Javascript file contains the functions to show and hide the search box, run the search program and display the results. Paste this anywhere in your existing Javascript file.

Try it out

That's it. Save your work and you're done.

Open a browser and visit your site. You should see the AAIMI SiteSearch button. Once clicked, you'll see the search-box, along with a Hide Search button. Type some search terms and click Search.

An AAIMI SiteSearch embeddable search-box

Cheers

Anth
