A few years ago I switched to Jekyll across all of my sites, and one of the things I enjoy most is that the content is stored as markdown files, and so is easy to work with.
When I first switched to Jekyll I wanted to have related posts on my sites. With WordPress I used Jetpack's related posts feature, but Jekyll didn't have anything similar, so I built my own related posts system in PHP. I also made a small control panel so I could easily trigger updates.
These days I use a shared related posts library that updates the related posts across many of my websites, so I only have to solve this problem once and every site benefits. The library loads the specified markdown files, extracts the text content, compares posts using Jaccard similarity, and then saves the results as front matter in the markdown files.
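For anyone unfamiliar with it, Jaccard similarity is just the size of the intersection of two sets divided by the size of their union. Here's a minimal sketch of the idea in PHP, treating each post as an array of words; the function name and shape are illustrative, not the library's actual API:

```php
<?php
// A minimal sketch of Jaccard similarity between two posts, each
// represented as an array of words. The name jaccard_similarity()
// is illustrative, not the library's actual API.
function jaccard_similarity( array $words_a, array $words_b ): float {
    $set_a = array_unique( $words_a );
    $set_b = array_unique( $words_b );

    $intersection = count( array_intersect( $set_a, $set_b ) );
    $union        = count( array_unique( array_merge( $set_a, $set_b ) ) );

    return $union > 0 ? $intersection / $union : 0.0;
}
```

The result is a score between 0 (nothing in common) and 1 (identical word sets), which makes it easy to rank every other post against the current one and keep the top few.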
Recently I started to feel like the results were a bit off. Posts that should have been obvious matches were sometimes missing, so I wanted to take a closer look at what was actually going on, and it turned out there were a few small improvements I could make.
HTML Improvements
The first issue was how I handled HTML. I was stripping it out entirely, which seemed sensible at first since I didn’t want any markup interfering with the rankings, but in doing so I was also removing useful information.
Link text in particular carries a lot of meaning, especially in blog posts where links often describe concepts or tools. So I adjusted the parsing to keep meaningful text like links while still removing the parts that add noise. This alone made quite a big difference in the quality of the matches.
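One way to do this in PHP is to parse the HTML with DOMDocument, drop the noisy elements entirely, and then read the remaining text content, which naturally includes link text. This is a sketch of the approach under those assumptions, not the library's actual parser, and the choice of which tags count as "noise" is illustrative:

```php
<?php
// A sketch: remove noisy elements entirely, but keep the text of
// meaningful ones such as links. The function name and the list of
// "noise" tags are illustrative assumptions.
function extract_text( string $html ): string {
    $doc = new DOMDocument();

    // The XML prologue hints the encoding to libxml; @ silences
    // warnings from imperfect real-world HTML.
    @$doc->loadHTML( '<?xml encoding="utf-8"?>' . $html );

    // Remove elements whose contents add noise rather than meaning.
    foreach ( [ 'script', 'style' ] as $tag ) {
        $nodes = $doc->getElementsByTagName( $tag );

        // Iterate backwards because the live node list shrinks as we remove.
        for ( $i = $nodes->length - 1; $i >= 0; $i-- ) {
            $node = $nodes->item( $i );
            $node->parentNode->removeChild( $node );
        }
    }

    // textContent keeps the text of everything left, including link text.
    $text = $doc->documentElement ? $doc->documentElement->textContent : '';

    return trim( preg_replace( '/\s+/', ' ', $text ) );
}
```

The key difference from a blunt strip is that the text inside `<a>` tags survives, so a post that links to a tool by name still "mentions" that tool as far as the similarity scoring is concerned.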
Phrases and Word Groupings
Since I’d made some small improvements I thought I would dig a bit deeper and see if there were any other quick wins.
Originally, everything was based on individual words, which does work, but misses phrases made up of multiple words: brand names or people's names, for example. In the early version I tackled this by manually maintaining a list of common word pairs to keep together.
This time I automated the whole thing. I split the text into words, then combined neighbouring words to create phrases. Most of them were nonsense, but after removing the phrases that only appeared once I ended up with some useful groups. I could then use this code to automate creating two-word, three-word, and four-word groupings without any manual effort. Because it is driven by the content itself, it adapts naturally to whatever I write.
I then used these as “words” in the Jaccard similarity rankings, which means that if two posts both mention “machine learning” as a phrase, it will be treated as a stronger match than if they just had the words “machine” and “learning” separately. This helps to surface more relevant related posts, especially when there are specific terms or concepts that are important to the content.
function create_ngrams( $words, $n = 2, $multiplier = 2 ) {
    $ngrams = [];
    $words  = array_values( $words );
    $count  = count( $words );

    for ( $i = 0; $i <= $count - $n; $i++ ) {
        $parts = array_slice( $words, $i, $n );

        // Skip if any word is empty.
        if ( in_array( '', $parts, true ) ) {
            continue;
        }

        // Skip if the words are the same.
        if ( count( array_unique( $parts ) ) !== $n ) {
            continue;
        }

        $ngrams[] = implode( ' ', $parts );
    }

    // Count occurrences of each phrase.
    $ngrams = array_count_values( $ngrams );

    // Only keep repeated phrases.
    $ngrams = array_filter(
        $ngrams,
        function ( $count ) {
            return $count > 1;
        }
    );

    // Add each phrase to the result array as many times as it occurs,
    // multiplied by the multiplier. This way, phrases that occur more
    // frequently have a greater influence on the Jaccard similarity score.
    $result = [];

    foreach ( $ngrams as $phrase => $count ) {
        for ( $i = 0; $i < $count * $multiplier; $i++ ) {
            $result[] = $phrase;
        }
    }

    return $result;
}
Singular and Plural Forms
The last improvement was handling singular and plural forms of words. Previously, "game" and "games" would be treated as completely separate terms, which reduces the similarity score. I went with a lightweight approach: a few simple rules based on common plural patterns, plus a handful of exceptions for the irregular cases I tend to use. This probably misses some plurals, but it catches the most common ones, giving more weight to connected words.
function stem_word( $word ) {
    static $exceptions = [
        'vertices' => 'vertex',
        'indices'  => 'index',
        'matrices' => 'matrix',
        'analyses' => 'analysis',
        'theses'   => 'thesis',
        'crises'   => 'crisis',
        'axes'     => 'axis',
        'series'   => 'series',
        'species'  => 'species',
    ];

    if ( isset( $exceptions[ $word ] ) ) {
        return $exceptions[ $word ];
    }

    $len = strlen( $word );

    // Leave very short words alone.
    if ( $len <= 3 ) {
        return $word;
    }

    // libraries -> library.
    if ( preg_match( '/[^aeiou]ies$/', $word ) ) {
        return substr( $word, 0, -3 ) . 'y';
    }

    // boxes -> box, classes -> class, brushes -> brush.
    if ( preg_match( '/(xes|ches|shes|sses|zzes)$/', $word ) ) {
        return substr( $word, 0, -2 );
    }

    // pages -> page, games -> game, tools -> tool.
    if (
        substr( $word, -1 ) === 's' &&
        ! preg_match( '/(ss|us|is)$/', $word ) &&
        $len > 4
    ) {
        return substr( $word, 0, -1 );
    }

    return $word;
}
All Done
With these changes combined, the related posts are now much more accurate.
Because all of my sites use the same build system, changing this one library is like updating a WordPress plugin: every site picks up the improvement automatically. And since Jekyll builds a static site, the related posts are calculated before the site is generated, which keeps the build fast because Jekyll doesn't have to compute them itself.
How was it for you? Let me know on Bluesky or Mastodon.
(Please) Link to this page
Thanks for reading. I'd really appreciate it if you'd link to this page if you mention it in your newsletter or on your blog.