Tag Smarter, Not Harder: AI for Content Classification

Tags are a type of metadata assigned to content pages (like blog posts) to make them easier to navigate or organize. They’re often used to create filtered views like /tags/ai/ or show related content by tag.

In Hugo, taxonomies are classification systems for your content. They’re used to group related content together. Tags are a form of taxonomy and are placed in the post front matter.

Hugo ships with two built-in taxonomies:

tags
categories

I find it useful to use tags to group related technologies and create connections between content. The concept of linking relational data has obvious organizational benefits and helps maintain a level of structure, especially when using a static site generator like Hugo which does not rely on a backend database.

The Goal#

After writing a sufficient amount of content per post, I find the most accurate way of generating tags is to let AI parse the body and spit out some contextual tags plus keywords. Keywords are words or short phrases that summarize the core topics or themes of content. They help describe what the post is about. Keywords are typically used for Search Engine Optimization. Having AI generate tags/keywords is a great timesaver and offers greater accuracy and a broader array of results. I wanted to automate the tag generation as part of my site build/deploy process. The goal was to include an additional process during Pull Request creation that generated the tags, giving me the opportunity to review them before merging to the main branch.

Processing#

I developed a Python process that could load markdown and generate tags using OpenAI. The OpenAI API allows using natural language prompts just like a UI, such as ChatGPT. Using a semantic AI prompt within the code alleviated the need for complex sanitizing of results as my prompt specified lowercase, single words, use hyphens for compound terms etc.

My default new post front matter is in YAML format, where I define only the fields I need. The title field is generated by a Hugo template expression which extracts the basename of the content file:

---
title: "{{ replace .File.ContentBaseName "-" " " | title }}"
date: "{{ .Date }}"
author: "Oliver Reardon"
tags: []
keywords: []
description: ""
showFullContent: false
readingTime: true
hideComments: true
---

The Python library frontmatter is used to read, parse, and write front matter metadata from text files. Loading the post content into a post object allows access to two main attributes post.metadata (dict) and post.content (str). The front matter and post body respectively.

Some people organize their spice rack. I use an ordered list to keep front matter fields in check.

 # Manually write YAML with proper field order
field_order = [
    'title', 'date', 'author', 
    'tags', 'keywords', 'description', 'showFullContent', 
    'readingTime', 'hideComments', 'draft'
]

The primary formatting logic inside the generate_and_apply_suggestions function helps retain the correct YAML data type format (array, boolean, string etc) and then write the file once done. The main guard pattern function is the orchestrator, it sets up the command line argument options and processes each supplied file.

The Python code can also be run independently against one or multiple posts using glob patterns, and will by default skip a post if tags and keywords already exist.

export OPENAI_API_KEY='my_api_token'
# All posts
python .github/scripts/generate_frontmatter.py content/posts/*/index.md

# Specific posts by pattern
python .github/scripts/generate_frontmatter.py content/posts/jamf*/index.md
python .github/scripts/generate_frontmatter.py content/posts/*terraform*/index.md

# With force flag
python .github/scripts/generate_frontmatter.py --force content/posts/*/index.md

Automation#

A GitHub Workflow is used in order to trigger the above process during Pull Request creation. The workflow watches for new or modified markdown posts using:

on:
  pull_request:
    paths: 
      - 'content/posts/**/*.md'
    types: [opened, synchronize]
...
 - name: Get changed files
        id: changed-files
        uses: tj-actions/changed-files@v40
        with:
          files: content/posts/**/*.md

It then runs the Python tag generator process on the detected changed file(s) and commits them to the PR branch explicitly:

      - name: Generate and apply AI suggestions
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          python .github/scripts/generate_frontmatter.py ${{ steps.changed-files.outputs.all_changed_files }}
          
      - name: Commit AI changes
        run: |
          git config --local user.email "action@github.com"
          git config --local user.name "AI Front Matter Bot"
          git add .
          if ! git diff --cached --quiet; then
            git commit -m "Add AI-generated tags and keywords"
            git push origin ${{ github.head_ref }}  # Push to the PR branch explicitly
          fi

The final step of the generate-frontmatter job uses a summary report to create a pull request comment specifying which post front matter was modified. After reviewing the new AI generated tags I have the option to merge the pull request or cancel it if the generated tags are not satisfactory for the subject matter or content.

Result#

The AI generated front matter tags and keywords for this blog post:

---
title: "Tag Smarter, Not Harder: AI for Content Classification"
date: 2025-08-01 18:57:45-04:00
author: Oliver Reardon
tags:
  - ai
  - content-classification
  - taxonomy
  - metadata
  - static-site-generator
  - natural-language-processing
  - python
  - openai
keywords:
  - tagging
  - content organization
  - keyword generation
  - seo
  - automation
  - machine learning
description: "Learn how to automate content tagging for Hugo static sites using AI. This guide covers building a Python script with OpenAI's API to generate relevant tags and keywords, integrating it into GitHub Actions workflows for automatic processing during pull requests, and maintaining consistent YAML front matter structure across all blog posts."
showFullContent: false
readingTime: true
hideComments: true
---