Categorising articles using their embeddings



As our relationship with information, and the tools we use to access it, continues to evolve, I wanted to spend some time learning about embeddings.

And I found a perfect scenario in which to build and test them out:

Once I started adding articles to this site, I wanted to classify them. Of course, I didn’t want to manually come up with tags and categories; I wanted the content of the articles to dictate the section names and structure. I may be writing another book review or sharing some coding examples, and I want such articles to appear in relevant sections without my having to define those sections manually.

TL;DR: This is the first post and there is still a long way to go, but here is what I managed to achieve so far:

  • Create embeddings for current articles.
  • Visualise embeddings in 2 dimensions.
  • Extract tags and categories from each article and create a taxonomy file that covers all articles.

For this post I’m using AWS Bedrock models and direct LLM calls via the boto3 client in Python. The full step-by-step notebook is available in the repo.

The first question I wanted to answer is:

What do my articles look like?

Or more specifically, can I visualise and cluster my articles based purely on embeddings?

I did a quick search and found a useful blog post from Thomas Rogers at AWS.

It confirmed my approach, so I was curious to see my articles embedded and clustered. I was sure there would be two obvious clusters: “book reviews”, and maybe something to do with “data analytics”.

I loaded my articles from the local directory:

Loading articles from local directories...
✓ Loaded 4 articles

Article types: {'blog': 4}
Date range: 14 Sep 2024 00:00:000 to 2025-06-20 14:00
Average content length: 2206 characters

Generated the embeddings using amazon.titan-embed-text-v2:0 and saved locally to a pickle file for later:

Generating embeddings with caching...
✓ Loaded cache with 9 embeddings
  1/4: Using cached embedding for 'Analysis of property in Zaporizhzhia declared "own...'
  2/4: Using cached embedding for 'Book Review - The Men Who Killed the News...'
  3/4: Using cached embedding for 'Book Review - The Coming Wave...'
  4/4: Using cached embedding for 'Book Review - Nexus...'
✓ Saved cache with 9 embeddings
✓ Generated embeddings shape: (4, 1024)
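The caching flow behind that output can be sketched roughly like this. The helper names are mine, and it assumes a boto3 `bedrock-runtime` client and the Titan v2 request/response shapes:

```python
import json
import pickle
from pathlib import Path


def embed_text(client, text, model_id="amazon.titan-embed-text-v2:0"):
    """Embed a single text with Bedrock's Titan v2 embedding model."""
    response = client.invoke_model(
        modelId=model_id,
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]


def embed_with_cache(client, texts, cache_path="embeddings.pkl"):
    """Return one embedding per text, reusing vectors cached in a pickle file."""
    path = Path(cache_path)
    cache = pickle.loads(path.read_bytes()) if path.exists() else {}
    vectors = []
    for text in texts:
        if text not in cache:  # only call the model for texts we haven't seen
            cache[text] = embed_text(client, text)
        vectors.append(cache[text])
    path.write_bytes(pickle.dumps(cache))
    return vectors
```

The client would be created with `boto3.client("bedrock-runtime")`; passing it in keeps the functions easy to test without AWS credentials.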

And ran HDBSCAN to detect some clusters:

Clustering 4 articles...
✓ Found 0 clusters
  Noise points (unclustered): 4

🤔 Hmm, not even 1? Not even book reviews?

I don’t have many articles; potentially each article could be the beginning of its own category (a cluster of size 1). But I have at least 3 book reviews, which should form a cluster. So let’s keep going.

I was determined to get my “book reviews” category, so I went ahead and visualised my embeddings. Since these embeddings are high-dimensional (1024 dimensions), we can’t visualise them directly; we need to reduce their dimensionality first. I used UMAP to bring them down to 2 dimensions:

Yep, not very exciting.

Next, let’s see if we can extract some useful info from the articles and cluster based on that. Hopefully it will give us more data points.

Can I find clusters using LLM-generated tags and categories?

The next step was to take each article, feed it into the LLM to get a structured response containing tags and categories, and see if we can find any clusters there. Here is an example prompt I used:

prompt = f"""Analyze the following blog article and provide:

1. Tags (single words or short phrases) for categorisation.
2. Categories (one or multiple) under which to appear in a blog (type of article, topic, etc.)
3. A brief summary of what this article is about (1 sentence) (eg. "Personal account of xyz...")

Article Title: {title}

Article Content:
{content_for_analysis}

Respond in this exact JSON format:
{{
  "tags": ["tag1", "tag2", "tag3"],
  "categories": ["category1", "category2"],
  "summary": "Brief summary about the article"
}}"""
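Sending this prompt and parsing the reply might look like the sketch below, using Bedrock’s Converse API. The model id is a placeholder assumption, not necessarily what I ran, and the fence-stripping is a defensive touch because models sometimes wrap JSON in markdown:

```python
import json


def analyse_article(client, prompt,
                    model_id="anthropic.claude-3-haiku-20240307-v1:0"):
    """Send the analysis prompt via the Bedrock Converse API and parse the JSON reply."""
    response = client.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    text = response["output"]["message"]["content"][0]["text"]
    # Models sometimes wrap JSON in markdown fences; strip them before parsing.
    text = text.strip().removeprefix("```json").removesuffix("```").strip()
    return json.loads(text)
```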

And here are the results I got back:

📄 Analysis of property in Zaporizhzhia declared "ownerless" by Russia.
   🏷️ LLM Tags: Ukraine, Russia, occupied territories, property seizure, Zaporizhzhia, data analysis
   📋 LLM Categories: Conflict Analysis, Data Visualization
   📝 Summary: An analysis of property in Russian-occupied Zaporizhzhia, Ukraine, declared 'ownerless' by Russian authorities, revealing patterns in seizures ranging from small items to large businesses.

📄 Book Review - The Men Who Killed the News
   🏷️ LLM Tags: media industry, journalism, book review, media ownership, digital transformation
   📋 LLM Categories: Media Analysis, Book Reviews
   📝 Summary: A review of Eric Beecher's book 'The Men Who Killed the News', exploring the history and challenges of the news industry, including visualizations of media outlet timelines, ownership, and audience sizes.

📄 Book Review - The Coming Wave
   🏷️ LLM Tags: AI, technology, book review, The Coming Wave, Mustafa Suleyman, future, innovation
   📋 LLM Categories: Book Reviews, Technology Trends
   📝 Summary: A review of Mustafa Suleyman's book 'The Coming Wave', which explores the rapid advancement of technology and its societal implications.

📄 Book Review - Nexus
   🏷️ LLM Tags: book review, information technology, Yuval Noah Harari, information networks, AI, truth, fallibility
   📋 LLM Categories: Book Reviews, Technology and Society
   📝 Summary: A review of Yuval Noah Harari's book 'Nexus: Information Networks from Stone Age to AI', exploring the evolution and impact of information on societies and governments.

👍 These look pretty good.

I had to play around with the prompt a bit to improve the quality of the answers. At the beginning the LLM struggled with the same challenge I had faced before: is it “book review” or “Book Review”? It is possible to restrict it to always use lower case, or add other constraints, but it demonstrates the point: results won’t always line up perfectly semantically unless we have some sort of structure to fit them into. That’s why we need the taxonomy: to create “harbours” an article could belong to, or assign it to its own category.

Based on the current articles, we have the following identified categories:

  • Conflict Analysis, Data Visualization
  • Media Analysis, Book Reviews
  • Book Reviews, Technology Trends
  • Book Reviews, Technology and Society

And finally we have “Book Reviews”! But there are some other categories which could be closely aligned. Let’s visualise them and then attempt to cluster them.

I generated embeddings for each tag and category, and plotted them together with the articles. This time some clusters were identified.

Looks a bit like a pizza 🍕

Now we can pass them to the LLM to come up with unifying names for our clusters. Here is the prompt I used:

prompt = f"""
    Based on the clustered tags and categories from blog articles below, create a clear, hierarchical blog taxonomy.  
    Your task is to:  
    1. Create a consolidated list of 1-7 main blog sections/categories.
    2. For each section, suggest 3-8 relevant tags that are well-organised and non-duplicative.
    3. Merge similar/duplicate terms and use clear, consistent naming.
    4. Provide a short explanation of major consolidation or naming decisions. 

    CATEGORY CLUSTERS:
    {chr(10).join(category_cluster_summaries)}

    TAG CLUSTERS:
    {chr(10).join(tag_cluster_summaries)}

    Respond in **valid JSON only** and in this exact format (no extra text):  
    {{
    "main_sections": [
        {{
        "name": "Section Name",
        "description": "Brief description of this section",
        "tags": ["tag1", "tag2", "tag3"]
        }}
    ],
    "all_tags": ["consolidated_tag1", "consolidated_tag2"],
    "taxonomy_notes": "Short explanation of how duplicates were merged or terms renamed"
    }}
    """
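The `category_cluster_summaries` and `tag_cluster_summaries` fed into that prompt can be assembled from the cluster labels. A minimal sketch, with a helper name of my own and HDBSCAN’s convention that label -1 means noise:

```python
from collections import defaultdict


def summarise_clusters(labels, terms):
    """Group terms by cluster label into 'Cluster N: a, b, c' lines; -1 is noise."""
    groups = defaultdict(list)
    for label, term in zip(labels, terms):
        groups[label].append(term)
    lines = []
    for label in sorted(groups):
        name = "Noise" if label == -1 else f"Cluster {label}"
        lines.append(f"{name}: {', '.join(sorted(groups[label]))}")
    return lines
```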

Here is the file I received (minus the tags); it identified the following main sections:

{
 "main_sections": [
  {
   "name": "Media & Information",
   "description": "Analysis of media trends, journalism, and information networks"
  },
  {
   "name": "Technology & Future",
   "description": "Exploration of technology trends, AI, and future scenarios"
  },
  {
   "name": "Geopolitics & Conflict",
   "description": "Analysis of global conflicts and geopolitical issues"
  },
  {
   "name": "Data & Visualization",
   "description": "Data analysis techniques and visualization methods"
  },
  {
   "name": "Book Reviews",
   "description": "Reviews and discussions of notable books"
  }
 ],
 "taxonomy_notes": "Merged 'AI' into 'artificial intelligence'. Combined 'technology trends' and 'future' into 'future trends'. Grouped author names under a general 'authors' tag. Removed specific book titles and kept general 'book review' tag."
}

As someone who knows the content, I’m quite happy with the result; nothing surprising.

Why bother with clustering and not just feed all articles into an LLM to generate taxonomy based on all the content?

We could, but that would be a very lazy approach and it wouldn’t scale. Even with today’s long context windows, it doesn’t sound like an elegant solution. We can’t let LLMs do all the work!

What’s next?

The next logical step would be to test our taxonomy to see how well it covers all our articles, which is exactly what I went on to do. I didn’t want to use an LLM for that: based on semantic similarity, we should be able to map our taxonomy to the existing articles it was created from. For new articles we could use an LLM, but for existing ones we shouldn’t have to. And this is where I’m currently at:

============================================================
TAXONOMY MAPPING RESULTS
============================================================
...

🏷️  SECTION DISTRIBUTION:
  Unassigned: 3 articles (avg score: 0.000)
  Media & Information: 1 articles (avg score: 0.000)

Not quite what I expected, but that’s for next article.
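The mapping step can be sketched as cosine similarity between article embeddings and section embeddings, with a threshold below which an article stays unassigned. The function name and threshold value are my own assumptions:

```python
import numpy as np


def map_articles_to_sections(article_embs, section_embs, section_names,
                             threshold=0.5):
    """Assign each article to its most similar section by cosine similarity,
    falling back to 'Unassigned' below the threshold."""
    A = np.asarray(article_embs, dtype=float)
    S = np.asarray(section_embs, dtype=float)
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    S = S / np.linalg.norm(S, axis=1, keepdims=True)
    sims = A @ S.T  # (n_articles, n_sections) cosine similarities
    assignments = []
    for row in sims:
        best = int(np.argmax(row))
        score = float(row[best])
        name = section_names[best] if score >= threshold else "Unassigned"
        assignments.append((name, score))
    return assignments
```

A threshold that is too high, or section embeddings built from descriptions that are too abstract, would produce exactly the kind of “everything unassigned” output above; tuning that is what the next post is for.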

In the next post I plan to cover:

  • Testing the generated taxonomy by mapping it over existing articles.
  • Using it to classify new articles and assign relevant tags/categories.
  • Updating the UI to reflect the new structure.

Thank you for reading!