A few years ago I switched to Jekyll across all of my sites, and one of the things I enjoy most is that the content is stored as markdown files, and so is easy to work with.
When I first switched to Jekyll I wanted to have related posts on my sites. With WordPress I used Jetpack's related posts feature, but Jekyll didn't have anything similar, so I built my own related posts system in PHP. I also made a small control panel so I could easily trigger updates.
These days I use a shared related posts library that updates the related posts across many of my websites, so I only have to solve this problem once and every site benefits. The library loads the specified markdown files, extracts the text content, compares posts using Jaccard similarity, and then saves the results as front matter in the markdown files.
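For anyone unfamiliar with it, Jaccard similarity is just the size of the intersection of two sets divided by the size of their union. Here's a minimal sketch of the idea in PHP, treating each post as an array of words; the function name and shape are illustrative, not the library's actual API:

```php
<?php
// A minimal sketch of Jaccard similarity between two posts, each
// represented as an array of words. The name jaccard_similarity()
// is illustrative, not the library's actual API.
function jaccard_similarity( array $words_a, array $words_b ): float {
    $set_a = array_unique( $words_a );
    $set_b = array_unique( $words_b );

    $intersection = count( array_intersect( $set_a, $set_b ) );
    $union        = count( array_unique( array_merge( $set_a, $set_b ) ) );

    return $union > 0 ? $intersection / $union : 0.0;
}
```

The result is a score between 0 (nothing in common) and 1 (identical word sets), which makes it easy to rank every other post against the current one and keep the top few.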
Recently I started to feel like the results were a bit off. Posts that should have been obvious matches were sometimes missing, so I wanted to take a closer look at what was actually going on, and it turned out there were a few small improvements I could make.
HTML Improvements
The first issue was how I handled HTML. I was stripping it out entirely, which seemed sensible at first since I didn’t want any markup interfering with the rankings, but in doing so I was also removing useful information.
Link text in particular carries a lot of meaning, especially in blog posts where links often describe concepts or tools. So I adjusted the parsing to keep meaningful text like links while still removing the parts that add noise. This alone made quite a big difference in the quality of the matches.
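One way to do this in PHP is to parse the HTML with DOMDocument, drop the noisy elements entirely, and then read the remaining text content, which naturally includes link text. This is a sketch of the approach under those assumptions, not the library's actual parser, and the choice of which tags count as "noise" is illustrative:

```php
<?php
// A sketch: remove noisy elements entirely, but keep the text of
// meaningful ones such as links. The function name and the list of
// "noise" tags are illustrative assumptions.
function extract_text( string $html ): string {
    $doc = new DOMDocument();

    // The XML prologue hints the encoding to libxml; @ silences
    // warnings from imperfect real-world HTML.
    @$doc->loadHTML( '<?xml encoding="utf-8"?>' . $html );

    // Remove elements whose contents add noise rather than meaning.
    foreach ( [ 'script', 'style' ] as $tag ) {
        $nodes = $doc->getElementsByTagName( $tag );

        // Iterate backwards because the live node list shrinks as we remove.
        for ( $i = $nodes->length - 1; $i >= 0; $i-- ) {
            $node = $nodes->item( $i );
            $node->parentNode->removeChild( $node );
        }
    }

    // textContent keeps the text of everything left, including link text.
    $text = $doc->documentElement ? $doc->documentElement->textContent : '';

    return trim( preg_replace( '/\s+/', ' ', $text ) );
}
```

The key difference from a blunt strip is that the text inside `<a>` tags survives, so a post that links to a tool by name still "mentions" that tool as far as the similarity scoring is concerned.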
Phrases and Word Groupings
Since I’d made some small improvements I thought I would dig a bit deeper and see if there were any other quick wins.
Originally, everything was based on individual words, which does work, but misses phrases made up of multiple words: brand names or people's names, for example. In the early version I tackled this by manually maintaining a list of common word pairs to keep together.
This time I automated the whole thing. I split the text into words, then combined neighbouring words to create phrases. Most of them were nonsense, but after removing the phrases that only appeared once I ended up with some useful groups. I could then use this code to automate creating two-word, three-word, and four-word groupings without any manual effort. Because it is driven by the content itself, it adapts naturally to whatever I write.
I then used these as “words” in the Jaccard similarity rankings, which means that if two posts both mention “machine learning” as a phrase, it will be treated as a stronger match than if they just had the words “machine” and “learning” separately. This helps to surface more relevant related posts, especially when there are specific terms or concepts that are important to the content.
function create_ngrams( $words, $n = 2, $multiplier = 2 ) {
    $ngrams = [];
    $words  = array_values( $words );
    $count  = count( $words );

    for ( $i = 0; $i <= $count - $n; $i++ ) {
        $parts = array_slice( $words, $i, $n );

        // Skip if any word is empty.
        if ( in_array( '', $parts, true ) ) {
            continue;
        }

        // Skip if the words are the same.
        if ( count( array_unique( $parts ) ) !== $n ) {
            continue;
        }

        $ngrams[] = implode( ' ', $parts );
    }

    // Count occurrences of each phrase.
    $ngrams = array_count_values( $ngrams );

    // Only keep repeated phrases.
    $ngrams = array_filter(
        $ngrams,
        function ( $count ) {
            return $count > 1;
        }
    );

    // Add each phrase to the result array as many times as it occurs,
    // multiplied by the multiplier. This way, phrases that occur more
    // frequently have a greater influence on the Jaccard similarity score.
    $result = [];

    foreach ( $ngrams as $phrase => $count ) {
        for ( $i = 0; $i < $count * $multiplier; $i++ ) {
            $result[] = $phrase;
        }
    }

    return $result;
}
Singular and Plural Forms
The last improvement was handling singular and plural forms of words. Previously, "game" and "games" would be treated as completely separate terms, which reduces the similarity score. I went with a lightweight approach: a few simple rules based on common plural patterns, plus a handful of exceptions for the irregular cases I tend to use. This probably misses some plurals, but it catches the most common ones, giving more weight to connected words.
function stem_word( $word ) {
    static $exceptions = [
        'vertices' => 'vertex',
        'indices'  => 'index',
        'matrices' => 'matrix',
        'analyses' => 'analysis',
        'theses'   => 'thesis',
        'crises'   => 'crisis',
        'axes'     => 'axis',
        'series'   => 'series',
        'species'  => 'species',
    ];

    if ( isset( $exceptions[ $word ] ) ) {
        return $exceptions[ $word ];
    }

    $len = strlen( $word );

    // Leave very short words alone.
    if ( $len <= 3 ) {
        return $word;
    }

    // libraries -> library.
    if ( preg_match( '/[^aeiou]ies$/', $word ) ) {
        return substr( $word, 0, -3 ) . 'y';
    }

    // boxes -> box, classes -> class, brushes -> brush.
    if ( preg_match( '/(xes|ches|shes|sses|zzes)$/', $word ) ) {
        return substr( $word, 0, -2 );
    }

    // pages -> page, games -> game, tools -> tool.
    if (
        substr( $word, -1 ) === 's' &&
        ! preg_match( '/(ss|us|is)$/', $word ) &&
        $len > 4
    ) {
        return substr( $word, 0, -1 );
    }

    return $word;
}
All Done
With these changes combined, the related posts are now much more accurate.
Because all of my sites use the same build system, changing this one library is like updating a WordPress plugin: every site picks up the improvement automatically. And since Jekyll builds a static site, the related posts are calculated before the site is generated, which keeps the build fast because Jekyll doesn't have to compute them itself.
How was it for you? Let me know on Bluesky or Mastodon.
(Please) Link to this page
Thanks for reading. I'd really appreciate it if you'd link to this page if you mention it in your newsletter or on your blog.