Automatically Generate Subcategories From Your Products Using Python


Today’s post will walk you through how your existing products can automatically suggest subcategorisation opportunities. This technique will work for any eCommerce website. The idea is to generate as many n-gram combinations as possible from your existing product inventory and validate each opportunity by checking for search volume and/or CPC data.

Example of the Script Output

Here’s an example of the unfiltered final output. All of the subcategory suggestions shown below were created automatically by the script by clustering the product range.

Example of the final script output. 149 New Subcategory Suggestions with a Search Volume of 1,581,503 p/m Generated in 11.45 Seconds!

Just want to get straight to the GitHub? Find it here.

Features

  • Automatic suggestions for new subcategories
  • Subcategory suggestions are tied back to their parent category automatically
  • Includes the number of products available to populate the suggested subcategories
  • Includes CPC and search volume data
  • Shows the percentage similarity match to an existing category, and automatically discards suggestions where that category already exists. Matching can find words out of order, plurals and close variations, and is fully user configurable (a rough sketch of this matching follows the list)
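To give a rough idea of how that similarity check can work, here is a minimal sketch using PolyFuzz (the library the script installs later in this guide). The category names are made up and the 0.96 threshold simply mirrors the 96% default mentioned later, so don’t treat this as the script’s exact code:

from polyfuzz import PolyFuzz

suggested = ["corner sofas", "leather sofa beds", "fabric armchairs"]
existing = ["Corner Sofas", "Dining Tables", "Armchairs"]

model = PolyFuzz("TF-IDF")
model.match([s.lower() for s in suggested], [e.lower() for e in existing])
matches = model.get_matches()  # columns: From, To, Similarity (0-1 scale)

# keep only suggestions that don't already exist as a category
new_suggestions = matches[matches["Similarity"] < 0.96]
print(new_suggestions)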

Getting Started

To get started, we’ll need two crawl exports from Screaming Frog to feed into Python. This post will guide you through how to get the initial crawl data, as well as how to set up and use the script. The steps towards the final output roughly break down as:

  • Crawl (Crawl the site using Screaming Frog)
  • Cluster (Cluster product names into candidate n-grams using the NLTK library; see the sketch after this list)
  • Filter (Pre and post filtering to remove nonsensical n-grams)
  • Review (The final output to be reviewed by a human)
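To give a feel for the cluster step, here is a minimal sketch (not the script itself) of pulling n-grams out of product names with NLTK and counting how many products share each one. The product names are made up:

from collections import Counter
from nltk import ngrams

products = [
    "3 Seater Leather Corner Sofa",
    "2 Seater Leather Corner Sofa",
    "Grey Fabric Sofa Bed",
]

counts = Counter()
for name in products:
    tokens = name.lower().split()
    for n in (2, 3):
        for gram in ngrams(tokens, n):
            counts[" ".join(gram)] += 1

# n-grams shared by multiple products are candidate subcategories
print(counts.most_common(5))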

Requirements

To get started you will need the following:

  • Screaming Frog SEO Spider (to crawl the site and make the two exports)
  • A Keywords Everywhere API key (mandatory, covered below)
  • Search Console API access (optional, covered below)
  • Python via the Anaconda distribution (installation covered below)

Prepping the Crawl Files

In order for the script to work, we need two exports from Screaming Frog, with two custom extractions set. The custom extractions are used to inform the script which pages are products and which are categories.

The idea is to set a custom extraction for something that is unique to each page. (For product pages, it’s usually the price attribute and for categories, it’s usually a product sorting parameter).

The extractions can be anything at all, as long as they clearly distinguish between the two page types as illustrated in the picture below.

Setting the Custom Extractions

The first thing to do is navigate to either a product or category page on the target site and bring up the developer console. (Chrome / Firefox Ctrl + Shift + J / Command + Option + J)

Once open, click on the ‘Elements’ section in the very top left.

Search for a piece of content on the page which is unique to either a product or category.

In this example, I’m going to select the ‘Availability’ section of the page, which is an element unique to the product page.

Once you have chosen your element, you need to search for it.

Click into the elements tab and press Ctrl + F or Command + F and search for your chosen element.

Typically there is more than one match on the page, so you’ll need to cycle through the results until the element you’re interested in is highlighted, like the image below:

Once you’re happy you have selected the correct element, you can copy the extractor ready to be imported into Screaming Frog.

To copy the extractor, you just need to right click on it, choose ‘copy’ and then ‘copy selector’.
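The exact selector depends entirely on the site’s markup, but the copied CSSPath usually looks something like this (a made-up example):

#product-info > div.availability > span.stock-status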

Open Screaming Frog and choose Configuration > Custom > Extraction from the navigation.

  1. Name the custom extraction ‘product’ (The script expects this naming convention)
  2. Choose CSSPath from the dropdown menu
  3. Paste in the selector from your clipboard
  4. Choose ‘Extract text’ from the dropdown

Rinse and repeat the entire process for the category pages, naming the custom extraction ‘category’. When complete, the custom extraction should look like below:

Tip: If you plan to repeat this process regularly for a Website, you can export a configuration file from Screaming Frog which will save your extractions for next time. (You can even schedule this crawl to automate this process weekly/monthly/quarterly and so on).

Crawling the Site

Once the extractors are set, it’s time to crawl the site as normal. If you have done everything correctly, you should see extractions which are unique to both product and category pages when clicking the ‘Custom Extraction’ tab.

If it looks different to the above, you will need to check your custom extractions are highlighting each page type correctly.

Exporting the CSV Files

Once the crawl has completed satisfactorily, it’s time to make the exports.

We require two .csv exports from Screaming Frog:

  • internal_html.csv
  • all_inlinks.csv

Please note: the exports must be in .csv format.

Exporting the Internal HTML File

The internal HTML report can be exported from Internal > HTML > Export. It is solely used to merge the product and category H1s into the all_inlinks report.

Internal > HTML > Export

Exporting the ‘All Inlinks’ File

The inlinks export is required so that suggested subcategory names can be tied back to a parent category.

Bulk Export > Links > All Inlinks
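To illustrate how the two exports fit together, here is a rough sketch (not the script’s exact code) of merging the H1s from the internal HTML report into the inlinks report with pandas. The column names assume Screaming Frog’s default export headers, so adjust them if your export differs:

import pandas as pd

html_df = pd.read_csv("internal_html.csv")
inlinks_df = pd.read_csv("all_inlinks.csv")

# attach each destination page's H1 to every link pointing at it
merged = inlinks_df.merge(
    html_df[["Address", "H1-1"]],
    left_on="Destination",
    right_on="Address",
    how="left",
)
print(merged.head())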

Getting the API Keys

Keywords Everywhere API (Mandatory)

‘Keywords Everywhere’ is used to check for search volume and / or CPC data. In order to use this script, you must have a valid Keywords Everywhere API key.

You can sign up for an API Key here. It’s inexpensive at $10 for 100,000 keyword checks.

Once you have your API key, place it in a text file named kwe_key.txt
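If you’re curious how the key file gets used, the request looks roughly like the sketch below. Treat the endpoint and field names as illustrative and check the Keywords Everywhere API docs (linked at the end of this post) for the exact format:

import requests

with open("kwe_key.txt") as f:
    api_key = f.read().strip()

response = requests.post(
    "https://api.keywordseverywhere.com/v1/get_keyword_data",
    headers={"Authorization": f"Bearer {api_key}", "Accept": "application/json"},
    data={
        "country": "us",
        "currency": "usd",
        "dataSource": "gkp",
        "kw[]": ["corner sofas", "leather sofa beds"],
    },
)
print(response.json())  # search volume and CPC per keyword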

Search Console API (Optional)

This is a completely optional requirement. If you are just getting started with the script, I recommend not enabling this until you are comfortable with the wider script setup.

To use the Search Console API, we require two files named client_secrets.json and credentials.json

I recommend reading this guide on how to get these files. It’s a bit of a pain to setup in the first instance, but once you have these files – they’re good for all projects, so it’s worth doing.
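Once you have both files, authenticating with the google-searchconsole library (installed later in this guide) looks roughly like the sketch below; the property URL is a placeholder:

import searchconsole

# uses client_secrets.json on the first run, then caches credentials.json
account = searchconsole.authenticate(
    client_config="client_secrets.json",
    credentials="credentials.json",
)
webproperty = account["https://www.example.com/"]
report = webproperty.query.range("today", days=-90).dimension("query").get()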

Get the Code from Github

I have uploaded the script code, along with supporting files and demo files, to GitHub.

Download all of the files listed below and place them into an empty folder.

File Explanations

SEO Spider Config.seospider – Example configuration file for Screaming Frog with a default custom extraction (for reference only)

inlinks.csv – optional demo crawl csv file to test the script (use your own crawl if you have it)

internal_html.csv – optional demo crawl csv file to test the script (use your own crawl if you have it)

category-splitter.py – The main Python script

kwe_key.txt – Paste your Keywords Everywhere API key into this file

If done correctly your folder should contain the following files:

Installing Python (Anaconda Distribution)

Download and install Anaconda, an open-source Python distribution heavily favoured by data scientists.

Once the installation has finished, open the Anaconda Prompt.

Once open, you’ll be greeted with this screen:

The first thing we need to do is change the directory to where we placed the files downloaded from GitHub. (I find the easiest way to do this on Windows is to navigate to the folder in Windows Explorer and copy the location.)

Navigate to the folder containing your files by typing:

cd c:/path/to/files/here

Example

cd C:\Users\lee\Documents\python_scripts\

Next, we need to install all of the required libraries to make the script work. Copy and paste the following commands into the Anaconda Prompt one at a time.

pip install git+https://github.com/joshcarty/google-searchconsole
pip install pandas
pip install requests
pip install polyfuzz
pip install nltk

Once installed, the script can be run by typing:

python category-splitter.py


Setting the Configuration Options

The script contains many configurable options:

min_product_match – Set the minimum number of matching products required for a subcategory to be suggested

min_sim_match – Set how closely the suggested subcategories are allowed to match existing categories. Scale 1-100 // Default 96%

keep_longest_word – Keeps the longest n-gram fragment (Reduces QA time at the expense of some false positives)

check_gsc_impressions – Optional setting to check GSC data for at least 1 impression in the last three months

min_search_vol – Set the minimum search volume a keyword must have to be considered for the final output

min_cpc – Set the minimum CPC a keyword must have to be considered for the final output

country_kwe – Set the country code for ‘Keywords Everywhere’ API

currency_kwe – Set the default Currency for the ‘Keywords Everywhere’ API

data_source_kwe – Set the ‘Keywords Everywhere’ Data Source (AdWords / Clickstream or Both)

All ‘Keywords Everywhere’ API settings are explained here: https://api.keywordseverywhere.com/docs/#/
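For reference, the options sit together near the top of category-splitter.py and look roughly like the block below. The values shown are illustrative only (apart from the 96% similarity default mentioned above), so set them to suit your project:

min_product_match = 3          # minimum matching products before a subcategory is suggested
min_sim_match = 96             # discard suggestions 96%+ similar to an existing category
keep_longest_word = True       # keep only the longest n-gram fragment
check_gsc_impressions = False  # require at least 1 GSC impression in the last 3 months
min_search_vol = 10            # minimum monthly search volume to keep a suggestion
min_cpc = 0                    # minimum CPC to keep a suggestion
country_kwe = "us"             # Keywords Everywhere country code
currency_kwe = "usd"           # Keywords Everywhere currency
data_source_kwe = "gkp"        # AdWords, Clickstream or both (see the API docs)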

Conclusion

Using this method is an efficient way to generate new subcategories at scale. Let me know how you get on by tweeting me @LeeFootSEO.
