Today’s post will walk you through how your products can automatically suggest subcategorisation opportunities for you. This technique will work for any eCommerce website. The idea is to generate as many n-gram combinations as possible from your existing product inventory and validate each opportunity by checking for search volume and/or CPC data.
Example of the Script Output
Here’s an example of the unfiltered final output. All of the subcategory suggestions shown below were created automatically by the script by clustering the product range!
Just want to get straight to the GitHub? Find it here.
- Automatic suggestions for new subcategories
- Subcategory suggestions are tied back to their parent category automatically!
- Includes the number of products available to populate the suggested subcategories
- Includes CPC and search volume data
- Shows the similarity percentage match to an existing category (suggestions are automatically discarded if a matching category already exists; matching can catch words out of order, plurals, close variations and so on, and is 100% user configurable)
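The similarity matching in the script is handled by PolyFuzz (installed later in this guide). To illustrate the idea behind that last feature, here is a dependency-free sketch using Python’s standard-library difflib instead, with made-up category names; the 96 threshold mirrors the script’s min_sim_match default described later:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score between two category names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100

existing = ["Walking Boots", "Running Shoes"]
suggested = "walking boot"  # singular / lower-case variant

# Discard a suggestion if it is near-identical to an existing category
best = max(similarity(suggested, e) for e in existing)
if best >= 96:  # mirrors the min_sim_match default
    print(f"Discard '{suggested}' ({best:.0f}% match to an existing category)")
```

This is why close variants like plurals are caught: character-level similarity stays very high even when the wording differs slightly.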
To get started, we’ll need to export two crawls from Screaming Frog to be read into Python. This post will guide you through how to get the initial crawl data as well as how to set up and use the script. The steps towards the final output roughly break down as:
- Crawl (Crawl the site using Screaming Frog)
- Cluster (Cluster product names using the NLTK library)
- Filter (Pre and post filtering to remove nonsensical n-grams)
- Review (The final output to be reviewed by a human)
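The clustering step boils down to generating n-grams from product names and counting how many products share each one. The script uses the NLTK library for this; the sketch below uses a dependency-free equivalent of NLTK’s ngrams helper, with invented product names:

```python
from collections import Counter

def ngrams(tokens, n):
    """Yield every contiguous n-word combination from a token list."""
    return zip(*[tokens[i:] for i in range(n)])

products = [
    "Mens Leather Walking Boots",
    "Womens Leather Walking Boots",
    "Mens Suede Walking Boots",
]

counts = Counter()
for name in products:
    tokens = name.lower().split()
    for n in (2, 3):
        counts.update(" ".join(g) for g in ngrams(tokens, n))

# n-grams shared by several products are subcategory candidates
print(counts.most_common(3))
```

Here “walking boots” appears in all three products, making it the strongest subcategory candidate; the filtering steps then weed out nonsensical fragments.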
To get started you will need the following:
- Screaming Frog SEO Spider
- Keywords Everywhere API Key
- Google Search Console API Key (Optional)
- The script from our GitHub
Prepping the Crawl Files
In order for the script to work, we need two exports from Screaming Frog, with two custom extractions set. The custom extractions are used to inform the script which pages are products and which are categories.
The idea is to set a custom extraction for something that is unique to each page. (For product pages, it’s usually the price attribute and for categories, it’s usually a product sorting parameter).
The extractions can be anything at all, as long as they clearly distinguish between the two page types, as illustrated in the picture below.
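Once the crawl is exported, Screaming Frog writes each extraction into its own column, so telling the page types apart is just a matter of checking which column is populated. A minimal sketch of that idea (the column names below are hypothetical and depend on what you name your extractions):

```python
import pandas as pd

# Hypothetical slice of a crawl export; "product 1" and "category 1"
# are the custom extraction columns Screaming Frog would create.
df = pd.DataFrame({
    "Address":    ["/boots/mens-boot", "/boots/", "/about"],
    "product 1":  ["£89.99", None, None],       # fired only on product pages
    "category 1": [None, "?sort=price", None],  # fired only on category pages
})

products = df[df["product 1"].notna()]
categories = df[df["category 1"].notna()]
print(len(products), len(categories))  # 1 1
```

Pages where neither extraction fired (like /about above) are simply ignored.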
Setting the Custom Extractions
The first thing to do is to navigate to either a product or category page on the target site and bring up the developer console (Chrome / Firefox: Ctrl + Shift + J on Windows, Command + Option + J on Mac).
Once open, click on the ‘Elements’ section in the very top left.
Search for a piece of content on the page which is unique to either a product or category.
In this example I’m going to select the ‘Availability’ section of the page, which is an element unique to the product page.
Once you have chosen an element, we need to search for it.
Click into the elements tab and press Ctrl + F or Command + F and search for your chosen element.
Typically there will be more than one match on the page, so you’ll need to cycle through them until the element you’re interested in is highlighted, like in the image below:
Once you’re happy you have selected the correct element, you can copy the extractor ready to be imported into Screaming Frog.
To copy the extractor, you just need to right click on it, then choose ‘Copy’ > ‘Copy selector’.
Open Screaming Frog and choose
Configuration > Custom > Extraction from the navigation.
- Name the custom extraction ‘product’ (The script expects this naming convention)
- Choose CSSPath from the dropdown menu
- Paste in the selector from your clipboard
- Choose ‘Extract text’ from the dropdown
Rinse and repeat the entire process for the category pages, naming the custom extraction ‘category’. When complete, the custom extraction should look like the image below:
Tip: If you plan to repeat this process regularly for a website, you can export a configuration file from Screaming Frog which will save your extractions for next time. (You can even schedule the crawl to automate this process weekly, monthly, quarterly and so on.)
Crawling the Site
Once the extractors are set, it’s time to crawl the site as normal. If you have done everything correctly, you should see extractions which are unique to both product and category pages when clicking the ‘Custom Extraction’ tab.
If it looks different to the above, you will need to check your custom extractions are highlighting each page type correctly.
Exporting the CSV Files
Once the crawl has completed satisfactorily, it’s time to make the exports.
We require two .csv exports from Screaming Frog:
Please note: The exports must be in csv format
Exporting the Internal HTML File
The internal HTML report can be exported from Internal > HTML > Export. It is solely used to merge the product and category H1s into the all_inlinks report.
Exporting the All Inlinks File
The inlinks export is required so that suggested subcategory names can be tied back to a parent category.
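The merge between the two exports is straightforward: join the H1 from the internal HTML export onto both ends of each link in the inlinks export, so every product can be tied to the category page linking to it. A sketch with tiny invented data (the column names are Screaming Frog defaults, but treat them as assumptions if your export differs):

```python
import pandas as pd

html = pd.DataFrame({
    "Address": ["/boots/", "/boots/mens-boot"],
    "H1-1":    ["Walking Boots", "Mens Leather Walking Boot"],
})
inlinks = pd.DataFrame({
    "Source":      ["/boots/"],
    "Destination": ["/boots/mens-boot"],
})

# Attach the category H1 (link source) and product H1 (link destination)
merged = (
    inlinks
    .merge(html.rename(columns={"Address": "Source", "H1-1": "Category H1"}), on="Source")
    .merge(html.rename(columns={"Address": "Destination", "H1-1": "Product H1"}), on="Destination")
)
print(merged[["Category H1", "Product H1"]])
```

After this merge, every suggested subcategory n-gram derived from a product H1 carries its parent category alongside it.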
Getting the API Keys
Keywords Everywhere API (Mandatory)
Keywords Everywhere is used to check for search volume and / or CPC data. In order to use this script, you must have a valid Keywords Everywhere API key.
You can sign up for an API Key here. It’s inexpensive at $10 for 100,000 keyword checks.
Once you have your API key, place it in a text file named kwe_key.txt
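For reference, here is a hedged sketch of how the key might be used: the script reads kwe_key.txt and queries the Keywords Everywhere get_keyword_data endpoint. The endpoint name and parameter names below come from the API documentation linked at the end of this post, but treat the exact details as assumptions:

```python
API_URL = "https://api.keywordseverywhere.com/v1/get_keyword_data"

def build_kwe_request(api_key, keywords, country="us", currency="usd", source="gkp"):
    """Assemble the auth header and form payload for a volume/CPC lookup."""
    headers = {"Authorization": f"Bearer {api_key}"}
    payload = {"kw[]": keywords, "country": country,
               "currency": currency, "dataSource": source}
    return headers, payload

# In the real script the key comes from kwe_key.txt:
# key = open("kwe_key.txt").read().strip()
headers, payload = build_kwe_request("YOUR_KWE_KEY", ["mens walking boots"])

# The actual lookup (commented out: needs a valid key and network access):
# import requests
# response = requests.post(API_URL, headers=headers, data=payload)
```

Each keyword checked counts against your credit balance, which is why the script filters n-grams before querying the API.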
Search Console API (Optional)
This is a completely optional requirement. If you are just getting started with the script, I recommend not enabling this until you are comfortable with the wider script setup.
To use the Search Console API, we require two files named
I recommend reading this guide on how to get these files. It’s a bit of a pain to set up the first time, but once you have these files they’re good for all projects, so it’s worth doing.
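Once authenticated, the impression check is simple: pull the last three months of query data and keep only keywords with at least one impression. Below is a sketch of that filtering logic on plain dictionaries; the authentication call (via the google-searchconsole library installed later) is left commented out because it needs your credential files:

```python
# import searchconsole
# account = searchconsole.authenticate(client_config=..., credentials=...)

def has_impressions(gsc_rows, keyword, min_impressions=1):
    """Return True if a keyword earned at least min_impressions impressions."""
    return sum(r["impressions"] for r in gsc_rows
               if r["query"] == keyword) >= min_impressions

rows = [
    {"query": "mens walking boots", "impressions": 12},
    {"query": "suede walking boots", "impressions": 0},
]
print(has_impressions(rows, "mens walking boots"))   # True
print(has_impressions(rows, "suede walking boots"))  # False
```

A suggestion with zero GSC impressions and no search volume is almost certainly noise, which is what this optional check is for.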
Get the Code from Github
I have uploaded the script, along with supporting files and demo files, to GitHub.
Download all of the files listed below and place them into an empty folder.
SEO Spider Config.seospider – Example configuration file for Screaming Frog with a default custom extraction (for reference only)
inlinks.csv – optional demo crawl csv file to test the script (use your own crawl if you have it!)
internal_html.csv – optional demo crawl csv file to test the script (use your own crawl if you have it!)
category-splitter.py – The main Python script
kwe_key.txt – Paste your Keywords Everywhere API key into this file
If done correctly your folder should contain the following files:
Installing Python (Anaconda Distribution)
Download and install Anaconda which is an open source Python distribution favoured heavily by data scientists.
Once the installation has finished, open the Anaconda Prompt
Once open you’ll be greeted with this screen:
The first thing we need to do is change directory to where we placed the files downloaded from GitHub. (I find the easiest way to do this on Windows is to navigate to the folder in Windows Explorer and copy the location.)
Navigate to the folder containing your files by typing cd followed by the path you copied:
Next we need to install all the required libraries to make the script work. Copy and paste the following commands into the Anaconda Prompt one at a time.
pip install git+https://github.com/joshcarty/google-searchconsole
pip install pandas
pip install requests
pip install polyfuzz
pip install nltk
Once installed, the script can be run by typing:
python category-splitter.py
Setting the Configuration Options
The script contains many configurable options:
min_product_match – Set minimum matching products in order for a category to be suggested
min_sim_match – Set how close the suggested subcategories are allowed to match existing categories. Scale 1-100 // Default 96%
keep_longest_word – Keeps the longest n-gram fragment (reduces QA time at the expense of some false positives)
check_gsc_impressions – Optional setting to check GSC data for at least 1 impression in the last three months
min_search_vol – Match keywords to a minimum search volume to be considered for final output
min_cpc – Match keywords to a minimum CPC amount to be considered for final output
country_kwe – Set the country code for Keywords Everywhere API
currency_kwe – Set the default Currency for the Keywords Everywhere API
data_source_kwe – Set the Keywords Everywhere data source (AdWords / Clickstream or both)
All Keywords Everywhere API Settings are explained here: https://api.keywordseverywhere.com/docs/#/
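Taken together, the volume and CPC settings act as a final filter over the suggestions before human review. A sketch of how such a filter might look with pandas (the column names and the choice to require both thresholds are assumptions for illustration; the script makes this configurable):

```python
import pandas as pd

MIN_SEARCH_VOL = 10   # mirrors min_search_vol
MIN_CPC = 0.10        # mirrors min_cpc

suggestions = pd.DataFrame({
    "Suggested Subcategory": ["leather walking boots", "boots brown", "suede boots"],
    "Search Volume":         [390, 0, 50],
    "CPC":                   [0.45, 0.00, 0.05],
})

# Keep only suggestions with real demand behind them
keep = suggestions[
    (suggestions["Search Volume"] >= MIN_SEARCH_VOL)
    & (suggestions["CPC"] >= MIN_CPC)
]
print(keep["Suggested Subcategory"].tolist())  # ['leather walking boots']
```

Raising either threshold shrinks the final report and the amount of manual review needed; lowering them surfaces more long-tail candidates.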
Using this method is an efficient way to generate new subcategories at scale. Let me know how you get on by tweeting me @LeeFootSEO