Improving Related Posts Across My Sites

A few years ago I switched to Jekyll across all of my sites, and one of the things I enjoy most is that the content is stored as Markdown files, which makes it easy to work with.

When I first switched to Jekyll I wanted to have related posts on my sites. With WordPress I used Jetpack related posts, but Jekyll didn’t have anything similar. So I built my own related posts system using PHP. I made a small control panel as well, so I could easily trigger updates.

These days I use a shared related posts library which updates the related posts across many of my websites, which means I only have to solve this problem once and every site benefits. The library loads in specified markdown files, extracts the text content, and then compares posts using Jaccard similarity. It then saves the results as front matter in the markdown files.
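
As a rough illustration of the comparison step, Jaccard similarity divides the number of terms two posts share by the number of unique terms across both. The sketch below assumes each post has already been reduced to an array of words. The function name `jaccard_similarity` and the sample data are made up for this example and are not taken from the library.

```php
<?php
// Hypothetical helper: a minimal sketch, not the library's actual code.
function jaccard_similarity( array $a, array $b ): float {

	// Treat each post as a set of unique terms.
	$a = array_unique( $a );
	$b = array_unique( $b );

	$intersection = count( array_intersect( $a, $b ) );
	$union = count( array_unique( array_merge( $a, $b ) ) );

	// Two empty posts have nothing in common.
	if ( 0 === $union ) {
		return 0.0;
	}

	return $intersection / $union;
}

$post_a = [ 'jekyll', 'markdown', 'static', 'site' ];
$post_b = [ 'jekyll', 'markdown', 'wordpress', 'plugin' ];

// 2 shared terms out of 6 unique terms.
printf( "%.2f\n", jaccard_similarity( $post_a, $post_b ) );
```

A score of 1 means the two posts use exactly the same terms, and 0 means they share none.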

Recently I started to feel like the results were a bit off. Posts that should have been obvious matches were sometimes missing, so I wanted to take a closer look at what was actually going on, and it turned out there were a few small improvements I could make.

HTML Improvements

The first issue was how I handled HTML. I was stripping it out entirely, which seemed sensible at first since I didn’t want any markup interfering with the rankings, but in doing so I was also removing useful information.

Link text in particular carries a lot of meaning, especially in blog posts where links often describe concepts or tools. So I adjusted the parsing to keep meaningful text like links while still removing the parts that add noise. This alone made quite a big difference in the quality of the matches.
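
One way this kind of parsing could look is sketched below. It assumes the noisy parts are `script`, `style`, `pre` and `code` elements, which is a guess for the sake of the example. The function name `extract_text` is also hypothetical.

```php
<?php
// Hypothetical helper: a minimal sketch, not the library's actual code.
function extract_text( string $html ): string {

	// Drop elements whose contents are noise rather than prose.
	// Which tags count as noise is an assumption made for this example.
	$html = preg_replace( '#<(script|style|pre|code)\b[^>]*>.*?</\1>#is', ' ', $html );

	// Add a space before each tag so neighbouring words are not glued together.
	$html = str_replace( '<', ' <', $html );

	// Remove the remaining tags but keep their text, including link text.
	$text = strip_tags( $html );

	// Decode entities and collapse runs of whitespace.
	$text = html_entity_decode( $text, ENT_QUOTES | ENT_HTML5, 'UTF-8' );

	return trim( preg_replace( '/\s+/', ' ', $text ) );
}

// Prints: I use Jekyll a lot.
echo extract_text( '<p>I use <a href="https://jekyllrb.com">Jekyll</a> a lot.</p><pre>$x = 1;</pre>' ), "\n";
```

The link text survives as an ordinary word, while the contents of the `pre` block never reach the comparison.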

Phrases and Word Groupings

Since I’d made some small improvements I thought I would dig a bit deeper and see if there were any other quick wins.

Originally, everything was based on individual words, which works but misses phrases made up of multiple words, such as brand names or people's names. To tackle this in the early version I manually maintained a list of common word pairs that would be treated as single terms.

This time I automated the whole thing. I split the text into words, then combined neighbouring words to create phrases. Most of them were nonsense, but after removing the phrases that only appeared once I ended up with some useful groups. I could then use this code to automate creating two-word, three-word, and four-word groupings without any manual effort. Because it is driven by the content itself, it adapts naturally to whatever I write.

I then used these as “words” in the Jaccard similarity rankings, which means that if two posts both mention “machine learning” as a phrase, it will be treated as a stronger match than if they just had the words “machine” and “learning” separately. This helps to surface more relevant related posts, especially when there are specific terms or concepts that are important to the content.

function create_ngrams( $words, $n = 2, $multiplier = 2 ) {

	$ngrams = [];
	$words = array_values( $words );
	$count = count( $words );

	for ( $i = 0; $i <= $count - $n; $i++ ) {

		$parts = array_slice( $words, $i, $n );

		// Skip if any word is empty
		if ( in_array( '', $parts, true ) ) {
			continue;
		}

		// Skip if any word is repeated within the phrase.
		if ( count( array_unique( $parts ) ) !== $n ) {
			continue;
		}

		$phrase = implode( ' ', $parts );
		$ngrams[] = $phrase;
	}

	// Count occurrences
	$ngrams = array_count_values( $ngrams );

	// Only keep repeated phrases
	$ngrams = array_filter(
		$ngrams,
		function( $count ) {
			return $count > 1;
		}
	);

	$result = [];

	// Add each phrase to the result array as many times as it occurs, multiplied by the multiplier.
	// This way, phrases that occur more frequently will have a greater influence on the Jaccard similarity score.
	foreach ( $ngrams as $phrase => $count ) {
		for ( $i = 0; $i < $count * $multiplier; $i++ ) {
			$result[] = $phrase;
		}
	}

	return $result;
}
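
Repeating a phrase only affects the score if the comparison counts duplicates rather than treating each post as a set of unique terms. A minimal sketch of one such comparison, a multiset (weighted) variant of Jaccard similarity, is shown below. This is an assumption about how the repeated phrases could be used, and `weighted_jaccard` is a hypothetical name rather than a function from the library.

```php
<?php
// Hypothetical helper: a multiset (weighted) variant of Jaccard similarity.
function weighted_jaccard( array $a, array $b ): float {

	// Count how often each term appears in each post.
	$count_a = array_count_values( $a );
	$count_b = array_count_values( $b );

	// Every term that appears in either post.
	$terms = array_keys( $count_a + $count_b );

	$min_total = 0;
	$max_total = 0;

	foreach ( $terms as $term ) {
		$in_a = $count_a[ $term ] ?? 0;
		$in_b = $count_b[ $term ] ?? 0;

		$min_total += min( $in_a, $in_b );
		$max_total += max( $in_a, $in_b );
	}

	return $max_total > 0 ? $min_total / $max_total : 0.0;
}

// The shared phrase appears once per post: prints 0.33
printf( "%.2f\n", weighted_jaccard( [ 'php', 'machine learning' ], [ 'jekyll', 'machine learning' ] ) );

// The shared phrase appears twice per post: prints 0.50
printf(
	"%.2f\n",
	weighted_jaccard(
		[ 'php', 'machine learning', 'machine learning' ],
		[ 'jekyll', 'machine learning', 'machine learning' ]
	)
);
```

With this kind of scoring, each extra copy of a shared phrase pulls the two posts closer together, which is the effect a multiplier is meant to have.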

Singular and Plural Forms

The last improvement was handling singular and plural forms of words. Previously, “game” and “games” were treated as completely separate terms, which lowered the similarity score between posts that were clearly about the same thing. To handle plurals I went with a lightweight approach: a few simple rules based on common plural patterns, plus a handful of exceptions for the irregular cases I tend to use. This probably misses some plurals, but it catches the most common ones, so related words now count as a match.

function stem_word( $word ) {

	static $exceptions = [
		'vertices' => 'vertex',
		'indices' => 'index',
		'matrices' => 'matrix',
		'analyses' => 'analysis',
		'theses' => 'thesis',
		'crises' => 'crisis',
		'axes' => 'axis',
		'series' => 'series',
		'species' => 'species',
	];

	if ( isset( $exceptions[ $word ] ) ) {
		return $exceptions[ $word ];
	}

	$len = strlen( $word );

	if ( $len <= 3 ) {
		return $word;
	}

	// libraries -> library
	if ( preg_match( '/[^aeiou]ies$/', $word ) ) {
		return substr( $word, 0, -3 ) . 'y';
	}

	// boxes -> box, classes -> class, brushes -> brush
	if ( preg_match( '/(xes|ches|shes|sses|zzes)$/', $word ) ) {
		return substr( $word, 0, -2 );
	}

	// pages -> page, games -> game, tools -> tool
	if (
		substr( $word, -1 ) === 's' &&
		! preg_match( '/(ss|us|is)$/', $word ) &&
		$len > 4
	) {
		return substr( $word, 0, -1 );
	}

	return $word;
}

All Done

With these changes combined, the related posts are now much more accurate.

Because all of my sites use the same build system, changing this one library is like updating a WordPress plugin: every site picks up the change automatically. The related posts are calculated before Jekyll generates the static site, so the build stays fast because Jekyll doesn’t have to work them out itself.
