
Fun Teaching My PC to See History

For the last few years, I've been (slowly) building a tool called MS:Inspector - a computer vision system for searching and classifying objects in medieval manuscript images. The stack is YOLOv8 for detection, DINOv2 for embeddings, UMAP and HDBSCAN for clustering, backed by a Go API, PostgreSQL, and a React frontend. A Go scraper downloads around 20,000 images to form a manuscript database, each record carrying metadata: manuscript name, folio, date range, country of origin, institution, and crowd-sourced tags from the source sites.

The workflow is straightforward in principle. You define a 'motif', the visual concept you want to detect. You build it from 'descriptors': crowd-sourced tags that already exist on artworks in the database. Importing descriptors pulls matching images into the motif's pool, which you can then triage: 'active' (you draw bounding boxes on a canvas), 'not_present' (hard negative), 'skip' (ambiguous).
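
A minimal sketch of that motif/pool data model. The names here (`Motif`, `PoolImage`, the tag sets) are invented for illustration, not the tool's actual schema:

```python
from dataclasses import dataclass, field
from enum import Enum

class Triage(Enum):
    ACTIVE = "active"            # bounding boxes drawn on a canvas
    NOT_PRESENT = "not_present"  # hard negative
    SKIP = "skip"                # ambiguous, excluded from training

@dataclass
class PoolImage:
    image_id: str
    status: Triage = Triage.SKIP
    boxes: list = field(default_factory=list)  # (x, y, w, h) tuples when ACTIVE

@dataclass
class Motif:
    name: str
    descriptors: set                          # crowd-sourced tags defining the pool
    pool: dict = field(default_factory=dict)  # image_id -> PoolImage

    def import_descriptors(self, corpus):
        """Pull any image whose tags overlap the descriptors into the pool."""
        for image_id, tags in corpus.items():
            if tags & self.descriptors and image_id not in self.pool:
                self.pool[image_id] = PoolImage(image_id)

    def training_set(self):
        """Only triaged images reach the trainer; 'skip' stays out."""
        return [img for img in self.pool.values()
                if img.status in (Triage.ACTIVE, Triage.NOT_PRESENT)]

# Hypothetical corpus: image id -> crowd-sourced tags from the source site
corpus = {
    "ms1_f12r": {"helmet", "knight"},
    "ms2_f03v": {"kettle helm"},
    "ms3_f20r": {"saint", "halo"},
}
motif = Motif("war hat", descriptors={"helmet", "kettle helm"})
motif.import_descriptors(corpus)
motif.pool["ms1_f12r"].status = Triage.ACTIVE
motif.pool["ms2_f03v"].status = Triage.NOT_PRESENT
```

The important property is that 'skip' images never reach the trainer: ambiguity is recorded, not forced into a label.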

Then you train: the tool exports the annotated set, trains a YOLOv8 model, runs inference across the full corpus, and extracts DINOv2 embeddings from every prediction.

Next, you review predictions: 'accepted' predictions feed back as training data, and rejected predictions become hard negatives. Retrain. Loop. Once the model is reliable enough, you can 'release' it and run inference across filtered subsets of the corpus - by date, country, institution, etc. - to find more instances or support research.
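
The review step is essentially a single fold over the last inference run. A sketch, with the function name and judgement strings assumed rather than taken from the real tool:

```python
def review_predictions(predictions, judgements, annotations, hard_negatives):
    """Fold reviewed predictions back into the next training round.

    predictions:    list of (image_id, box) pairs from the last inference run
    judgements:     image_id -> "accepted" or "rejected" (hypothetical review output)
    annotations:    image_id -> list of boxes, the growing training set
    hard_negatives: set of image ids the model must learn are NOT the motif
    """
    for image_id, box in predictions:
        verdict = judgements.get(image_id)
        if verdict == "accepted":
            annotations.setdefault(image_id, []).append(box)  # new training data
        elif verdict == "rejected":
            hard_negatives.add(image_id)
        # unreviewed predictions are simply left for the next pass
    return annotations, hard_negatives

predictions = [("ms1_f12r", (10, 10, 40, 40)), ("ms3_f20r", (5, 5, 20, 20))]
judgements = {"ms1_f12r": "accepted", "ms3_f20r": "rejected"}
annotations, hard_negatives = review_predictions(predictions, judgements, {}, set())
```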

That part works. What I didn't anticipate was what the model would tell me about the problem itself.


The problem the model handed back

I started with 'kettle helms', the open-faced metal helmets with a brim, very common in medieval manuscript images.

If you train a model to find them, it finds them. It also finds things that look like 'kettle helms' but aren't quite. Subtypes. Hybrids. Things that are clearly mid-transition between one form and another.

First instinct: tighten the training data. Cleaner labels, harder negatives.

Wrong instinct. I'd broken the category, not the model.

'Kettle helms' isn't a binary classification - it's a spectrum. And no amount of cleaner labels fixes that, because the fuzziness isn't in the training data. It's in the subject matter. The model accurately reflected the domain's structure. I just didn't want to see it.


Every category is fuzzier than you think

If you try to define "hat" formally, you break immediately.

"A hat is a covering for the head" - so is a hood. Is a hood a hat? Most people say no.

"A hat has a brim" - a beanie has no brim.

"A hat is rigid" - a floppy sun hat isn't.

Every rule you add either excludes something you'd call a hat or includes something you wouldn't.

What you really have is a cluster of features: brim, rigidity, sits-on-top, not-attached-to-clothing. Stack enough of them together and you have a hat, but no single one is required.

Wittgenstein called this family resemblance: members of a category share overlapping features the way family members do, without any one feature running through all of them.

You know what a hat is because in your head, you have built a fuzzy prototype from thousands of examples. "Close enough" equals hat. But the threshold for "close enough" is never formally specified.
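
You can make the fuzzy-prototype idea concrete as weighted feature overlap against a threshold. The weights and the threshold below are invented for illustration - that's exactly the point, since your head never writes them down either:

```python
# Hypothetical feature weights for the "hat" cluster; no single feature is required.
HAT_FEATURES = {"brim": 1.0, "rigid": 0.7, "sits_on_top": 1.0, "detached": 0.8}

def hat_score(observed):
    """Sum the weights of the cluster features this object exhibits."""
    return sum(weight for feat, weight in HAT_FEATURES.items() if feat in observed)

def close_enough(observed, threshold=1.5):
    """'Close enough' equals hat - but the threshold is never formally specified."""
    return hat_score(observed) >= threshold

beanie = {"sits_on_top", "detached"}                   # no brim, not rigid: still a hat
floppy_sun_hat = {"brim", "sits_on_top", "detached"}  # not rigid: still a hat
hood = {"sits_on_top"}                                # attached to clothing: not a hat
```

Every counterexample from above falls out of the scoring rather than breaking a rule: the beanie and the floppy sun hat each clear the bar on different features, and the hood doesn't.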

This cluster of features is precisely the CV problem I was facing. A hard decision boundary around a soft category will always produce edge cases, because the edge cases are genuinely part of the domain. The model finding things that are "almost a kettle helmet" isn't a failure. It's the correct output. The failure was expecting a hard boundary to exist in the first place.


Three vocabularies, none of them clean

The problem compounds when you look at the source data. The crowd-sourced tags feeding the motif pool come from different people with different mental models and at least three distinct vocabularies that don't map cleanly onto one another or onto the objects themselves.

Academic terms are precise and defined within scholarly tradition. "Helm" means one specific thing. Exact, consistent within the specialist community, useful, but it creates artificially hard edges around objects that don't necessarily have them in the material record.

Common and reenactor terms are widely used and understood, but often technically incorrect and applied loosely. "Kettle helm" - an ugly modern coinage that gets used to describe anything vaguely hat-shaped and metal. "Helmet" in a crowd-sourced tag could mean a Roman helmet, a bicycle helmet, a bascinet, or a great helm: high coverage, almost no precision.

Then there's semantic drift; it's worth watching it happen in slow motion. "Helm" starts as a precise academic term with a specific referent. It gets borrowed by reenactors who want to sound precise. Through repetition, it blurs into a general medieval-flavoured synonym for "helmet." Meanwhile, "helmet", already vague, absorbs it entirely. The word that arrives in your training data still looks technical. The precision left years ago.

A tag of "helmet" in the source metadata is nearly useless for classification. But treating any single tag as ground truth is wrong anyway, because the data is an aggregate of different people's prototypes, with disagreement baked into the labels.

There's a further complication worth flagging early here. The images themselves don't give you a direct record of what existed. Illuminators drew from their own visual vocabulary; they painted what they knew. A French illuminator depicting a German battle draws French armour, because that's the only armour they know. Their own exposure bounds their visual vocabulary: local workshops, local fashion, local conventions. The subject matter is German. The visual language is French. I'll come back to this, but it matters for how you read the data from the start.


My fix: define geometrically, not terminologically

The vocabulary problem forced a decision. "Kettle helm" - the term I'd started with, the term in a lot of the crowd-sourced tags - turns out not to be a typology at all. A "helm" is a specific object: a fully enclosed head covering, which this one isn't. "Kettle helm" is a modern hybrid coinage that incorrectly welds "helm" onto an open-faced object, and it's the term most people use. "Hat" fits better - it's something that goes 'on' the head, not around it. So "kettle hat" might fit, but then: what's a kettle? What's a kettle hat?

This "thing" is open-faced headgear that sits on top of the head rather than enclosing it like a helm, and it's specifically for war and battle. The thing I'm trying to capture is a war hat.

So the first move was to rename the motif. The name doesn't matter to the model - it doesn't care - but the working definition underneath it needs solid ground. If you build a definition on a bad term, you inherit all the ambiguity and baggage that comes with it.

My solution was to stop using terminology entirely and define motifs geometrically: on observations about the shape of an object, not terms from any vocabulary.

Here's my current working definition for a 'war hat':

Protective headgear with a brim extending away from the wearer's skull. Distinct from close-fitting helmets (bascinets, cervellières) and enclosed helmets (great helms). Includes wide-brimmed, narrow-brimmed, pointed, and flat-topped variants across all regions and periods.

The diagnostic feature is the brim. Skull only = cervellière. Skull plus brim = war hat. Everything else - brim width, profile, period, region - is variance within the category, not grounds for a new one.
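
As a toy encoding of that rule (the predicate names are mine, chosen for the sketch; the model itself learns this from boxes, not booleans):

```python
def classify_headgear(open_faced, sits_on_skull, has_brim):
    """One necessary feature (the brim), specific exclusions, everything else accepted."""
    if not sits_on_skull:
        return "not headgear"
    if not open_faced:
        return "enclosed helmet"   # great helm territory: excluded
    # cervelliere / bascinet vs war hat: the brim is the only diagnostic
    return "war hat" if has_brim else "close-fitting helmet"
```

Note what's absent: no brim width, no profile, no period, no region. They're all variance inside "war hat".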

A geometric definition works for two reasons. It's trainable: one necessary feature, specific exclusions, and everything else accepted. And it doesn't carry any of the vocabulary baggage. Academic terms drift between scholarly communities and common usage. Reenactor terms drift between subcultures. Common terms drift between anyone who uses them. Geometry doesn't drift. "Brim extending away from the skull" means the same thing in any language, any period, any community. That's what the model actually learns.


Letting the clusters surface the child taxonomy

Once the coarse model is running reliably, the clustering pipeline takes over: DINOv2 embeddings on cropped detections, UMAP for dimensionality reduction, HDBSCAN for clustering. The scatter plot surfaces groupings you wouldn't have named in advance.

We also have the metadata: dates and regions. These layer on top of the visual clustering, so a grouping isn't just "these look similar" - it's "these look similar, and they're all from northern France in the 14th century."
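
Layering the metadata on is essentially a group-and-count. A sketch with hypothetical detection records (field names invented for illustration):

```python
from collections import defaultdict

def layer_metadata(cluster_labels, records):
    """Summarise provenance within each visual cluster.

    cluster_labels: detection_id -> cluster id (from HDBSCAN; -1 = noise)
    records:        detection_id -> {"country": ..., "date_range": ...}
    """
    summary = defaultdict(lambda: defaultdict(int))
    for det_id, cluster in cluster_labels.items():
        if cluster == -1:
            continue  # noise points carry no grouping signal
        rec = records[det_id]
        summary[cluster][(rec["country"], rec["date_range"])] += 1
    return summary

clusters = {"d1": 0, "d2": 0, "d3": -1}
records = {
    "d1": {"country": "France", "date_range": "1300-1400"},
    "d2": {"country": "France", "date_range": "1300-1400"},
    "d3": {"country": "England", "date_range": "1200-1300"},
}
summary = layer_metadata(clusters, records)
# cluster 0 reads as: "looks similar, and it's all France, 1300-1400"
```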

Some of those groupings will be real typological distinctions, the things that become sallets, morions, cabacetes as the form evolves. Some will be regional variants. Some will be period variants. Some will be illuminator conventions. The point is, you don't decide upfront which groupings matter. You let the data show you, then promote meaningful clusters to child motifs and sub-train.

The taxonomy is discovered, not imposed. War hat is the parent. Everything else emerges from the evidence.


The complication you don't escape

I flagged a problem with the illuminators earlier. For context, armour and clothing in the medieval period weren't separate systems; they co-evolved as fashion.

The armourer and the tailor were responding to the same aesthetic pressures simultaneously. Pointed sabatons mirroring pointed fashionable shoes, fluted surfaces echoing textile pleating. The illuminator drawing a knight isn't recording armour. They're drawing fashionable martial dress as it looked in their world, at their time, in their region, using their visual vocabulary.

Which means when you surface a cluster from DINOv2 embeddings, you can't cleanly attribute it to a single cause. It might be a real typological distinction, a genuine difference in object form. It might be a regional variant. It might be a period variant. Or it might be an illuminator convention, a visual habit inherited from a workshop tradition, repeated across dozens of manuscripts, filtered through what the illuminator had actually seen rather than what soldiers at the depicted location actually wore.

These variables are entangled. A French illuminator working in 1380 most likely depicts their local fashions, and those fashions colour the illustration. A cluster that appears to represent "German armour c.1350" might be exactly that, or it might be a French interpretation of German armour, or a stylistic quirk of one illuminator, or something in between. The system can't tell. Neither can a human working from the same material.

What the system does is make the entanglement visible. You can cross-reference clusters against manuscript provenance, known illuminator attributions, and date ranges to start disentangling them over time. But the limitation is in the sources, not in the model.


The honest answer to a simple question

Ask the system: "Show me 13th-century English kettle helms."

Here's what it actually returns: everything a French illuminator drew that looks like a brim, in a manuscript dated to 13th-century England.

Which may or may not reflect what a 13th-century English soldier actually wore. And "kettle helm" was already the wrong term before I even ran the first query!

That's multiple layers of complexity in one simple question: wrong vocabulary, illuminator knowledge limits, manuscript metadata that describes provenance rather than subject matter, and a geometric category that casts wider than the user intended.
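
A toy version of the query makes the provenance problem explicit: every filter applies to manuscript metadata, and none of them can touch what's actually depicted. The field names and data are invented for the sketch:

```python
def query(detections, manuscripts, country=None, century=None):
    """Filter detections by *manuscript provenance*, not depicted subject.

    detections:  list of dicts with "ms_id" and "motif"
    manuscripts: ms_id -> {"country": ..., "century": ...}
    """
    hits = []
    for det in detections:
        ms = manuscripts[det["ms_id"]]
        if country and ms["country"] != country:
            continue  # where the manuscript was made, not what the scene shows
        if century and ms["century"] != century:
            continue
        hits.append(det)
    return hits  # "what an illuminator in that place/time drew" - nothing more

detections = [
    {"ms_id": "ms1", "motif": "war hat"},
    {"ms_id": "ms2", "motif": "war hat"},
]
manuscripts = {
    "ms1": {"country": "England", "century": 13},
    "ms2": {"country": "France", "century": 13},
}
english_hits = query(detections, manuscripts, country="England", century=13)
```

Nothing in the record says whether the brim in `ms1` reflects English practice or the illuminator's own visual vocabulary. The query can't ask that question, because the sources never answered it.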

The system doesn't resolve this complexity. It makes it visible. You see exactly what you're getting and exactly why the answer is incomplete. That's not a failure mode! It is the correct behaviour for a tool working with genuinely difficult source material.


I didn't build a classifier. I built a tool that shows you exactly how complicated the question was in the first place.