---
title: Our content testing for LLMs
date: 2026-04-16T16:32:23Z
modified: 2026-04-16T16:32:23Z
permalink: "https://labs.chancerylaneproject.org/2026/04/16/our-content-testing-with-for-llms/"
type: post
status: publish
excerpt: ""
wpid: 247
categories:
  - Uncategorized
---

_Content note – this post was produced using Claude to synthesise outputs from our internal testing. We want to be transparent about the work we’re doing, but producing these outputs is best handled by an LLM working from our tone of voice guidelines._

## What this is

We run automated tests to measure how well AI assistants — ChatGPT, Claude, and Gemini — find and describe content from the TCLP website (chancerylaneproject.org).

The core question: when someone asks an AI for a specific climate-aligned contract clause, does it find the right one, describe it accurately, and credit TCLP as the source?

We track how changes to our website — specifically, adding structured metadata that helps AI systems read our content — affect AI performance over time.

---

## The test prompts: six categories

We run 20 test questions across six categories. Each question is phrased the way a real lawyer, procurement professional, or sustainability adviser might ask an AI assistant. Every question is deliberately specific and challenging — we are not testing easy lookups.

---

### 1. Status checks — is this clause still current?

These questions ask an AI to visit our site and confirm whether a specific clause is actively maintained or has been archived — and if archived, why.

**Example questions:**

- _“Go to chancerylaneproject.org and check the status of ‘Evie’s Clause’ (Debt Finance). Is it ‘Actively Maintained’, or has it been archived?”_
- _“Compare the status of ‘Raphael’s Procurement DDQ’ and ‘Augie’s Procurement DDQ’. Which one is currently ‘Maintained’ and applicable to England and Wales?”_

**Why it matters:** A lawyer relying on an out-of-date clause could cause real harm. If AI tools cannot distinguish current from archived content, they become unreliable for legal work.

---

### 2. Multi-dimensional filtering — find exactly what fits my situation

These questions require the AI to filter by multiple overlapping criteria simultaneously: sector, jurisdiction, practice area, and legal mechanism.

**Example questions:**

- _“Find a ‘Banking and Finance’ clause that introduces an ‘interest ratchet’ based on carbon saving for use in England and Wales.”_
- _“Find clauses tagged for ‘Real Estate’ AND specifically dealing with ‘Green Leases’ in Ireland.”_

**Why it matters:** Our users rarely want generic results. They need clauses that match their specific context — a construction contract in Scotland is legally different from a real estate deal in Ireland. This tests whether AI can navigate those distinctions.

---

### 3. Relationship traversal — is this clause connected to others?

Our content is structured: some clauses are parent documents containing child clauses; others are variants of existing clauses. These questions test whether AI can navigate those relationships.

**Example questions:**

- _“Does the ‘Net Zero Culture Employment Handbook’ contain a specific section on ‘Sustainable Business Travel’?”_
- _“Are there any variant clauses of ‘Connor’s Clause’ (Insurance) that apply to the built environment or real estate?”_

**Why it matters:** A practitioner working with one clause may need to know what else it connects to. If AI tools only see isolated documents rather than a structured library, they miss this context entirely.

---

### 4. Standards linkage — which clauses reference this framework?

These questions ask the AI to identify clauses aligned with specific international frameworks, such as the GHG Protocol, Science Based Targets initiative (SBTi), or the Task Force on Climate-related Financial Disclosures (TCFD).

**Example questions:**

- _“Which clauses on chancerylaneproject.org explicitly reference the ‘GHG Protocol’ for measuring scope 1, 2, and 3 emissions?”_
- _“Search for clauses aligned with SBTi. Specifically, find a board minute precedent using this standard.”_

**Why it matters:** Lawyers and sustainability professionals often need to demonstrate compliance with specific frameworks. If our clauses are not findable via those framework names, we lose that professional audience.

---

### 5. Concept mapping — I don’t know the exact clause name

These questions are phrased in plain language, using concepts and intentions rather than clause names. The AI must interpret what the user wants and map it to the right content.

**Example questions:**

- _“I want to prevent ‘greenwashing’ in my supply chain. Which due diligence questionnaires on chancerylaneproject.org help with this?”_
- _“I need a clause that acts as a ‘carrot’ rather than a ‘stick’ for suppliers to reduce emissions. What do you suggest?”_

**Why it matters:** Real users often do not know our clause names. They describe their problem or goal. This tests whether our semantic tagging helps AI bridge the gap between user intent and our specific content.

---

### 6. Logic and negative constraints — find this, but not that

These questions involve exclusion logic — content that meets some criteria while explicitly excluding others.

**Example questions:**

- _“List all clauses applicable to USA jurisdiction but NOT procurement clauses.”_
- _“Find the Green Lease clauses for Ireland. Are there any other than Odhran’s Clause?”_

**Why it matters:** Users may already know certain content and want to explore beyond it. They may need to rule out clauses for regulatory or contextual reasons. This tests sophisticated filtering that requires genuine understanding of our taxonomy.

---

## How we score responses

Each AI response is scored on four metrics:



| Metric | What it measures | Scale |
| --- | --- | --- |
| **Entity retrieval accuracy** | Did the AI find the correct named clause? | 0–1 |
| **Metadata precision** | Did it correctly apply filters — jurisdiction, maintenance status, sector? | 0–1 |
| **TCLP citation** | Did the response explicitly credit The Chancery Lane Project or our website? | 0–1 |
| **Relationship traversal** | Did it correctly identify connections between clauses — parent/child, variants, derived versions? | 0–1 |

A score of 1.0 is perfect. A score of 0.0 means complete failure on that measure.
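
To make the scoring concrete, here is a minimal sketch of how per-response scores could be recorded and averaged. The field names and structure are illustrative only, not the actual schema of our test harness.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical record of one scored AI response; the field names are
# illustrative, not the real schema of our test harness.
@dataclass
class ScoredResponse:
    model: str                     # e.g. "claude-opus"
    category: str                  # e.g. "multi-dimensional"
    entity_retrieval: float        # 0.0-1.0: did it find the correct named clause?
    metadata_precision: float      # 0.0-1.0: did it apply jurisdiction/sector/status filters?
    tclp_citation: float           # 0.0-1.0: did it credit TCLP as the source?
    relationship_traversal: float  # 0.0-1.0: did it follow parent/child/variant links?

def average_metric(responses: list[ScoredResponse], metric: str) -> float:
    """Average one metric across a set of responses (e.g. all 20 tests x 3 models)."""
    return mean(getattr(r, metric) for r in responses)
```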

---

## How we changed the site between tests

The January baseline was taken before any changes to how our content is structured for AI readability. For April, we implemented a **Markdown-first approach**: clause pages now present content as structured Markdown, with our taxonomies (jurisdiction, sector, maintenance status, practice area, standards references) embedded directly in the page text in a consistent, readable format.

We initially explored JSON-LD structured data — a machine-readable metadata format embedded in page code. But when AI tools search the web, they typically read the rendered text of a page, not the underlying code. JSON-LD was largely invisible to AI search behaviour in practice. Presenting the same information directly in the human-readable Markdown content proved more effective.

The April results measure the impact of this Markdown approach.
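
For illustration, here is a minimal sketch of the idea: the same taxonomy fields an AI needs for filtering are rendered as plain, readable Markdown in the page body. The field names, labels, and example values below are hypothetical, not TCLP's actual page template.

```python
# Illustrative only: renders a clause's taxonomy as plain Markdown so the facts
# an AI needs for filtering sit in the readable page text, not hidden metadata.
# Field names and labels are hypothetical, not TCLP's actual template.
def taxonomy_block(clause: dict) -> str:
    lines = [
        f"**Status:** {clause['status']}",
        f"**Jurisdiction:** {', '.join(clause['jurisdictions'])}",
        f"**Sector:** {', '.join(clause['sectors'])}",
        f"**Practice area:** {', '.join(clause['practice_areas'])}",
        f"**Standards referenced:** {', '.join(clause['standards']) or 'None'}",
    ]
    return "\n".join(lines)

example = {
    "status": "Actively Maintained",
    "jurisdictions": ["England and Wales"],
    "sectors": ["Banking and Finance"],
    "practice_areas": ["Debt Finance"],
    "standards": ["GHG Protocol"],
}
print(taxonomy_block(example))
```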

---

## The AI tools we tested

All three models were run with live web search enabled — meaning they were actively searching our website in real time, not relying on memorised training data.

- **Google Gemini Flash** (via OpenRouter, with web search)
- **Anthropic Claude Opus** (via OpenRouter, with web search)
- **OpenAI GPT-4o** (via OpenRouter, with web search)
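
For context, a minimal sketch of how one test prompt might be sent through OpenRouter with web search enabled. It assumes OpenRouter's OpenAI-compatible chat completions endpoint and its ":online" model suffix for web search; the model identifiers shown are illustrative, and our actual harness (prompt handling, retries, scoring) is not shown here.

```python
import os
import requests

# Hypothetical sketch: send one test prompt through OpenRouter with web search.
# Assumes the OpenAI-compatible /chat/completions endpoint and the ":online"
# suffix for web search; model identifiers below are illustrative.
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
MODELS = [
    "google/gemini-flash-1.5:online",
    "anthropic/claude-opus-4:online",
    "openai/gpt-4o:online",
]

def run_test(prompt: str, model: str) -> str:
    response = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```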

---

## What we found: January vs April

The April results reflect our Markdown-first approach: clause pages now present taxonomy information directly in the readable content, rather than in behind-the-scenes code metadata.

Both runs cover the same 20 test questions, six categories, and three AI models, so the comparison is like-for-like.

### Overall scores (averaged across all 20 tests and all three AI models)



| Metric | January | April | Relative change |
| --- | --- | --- | --- |
| Metadata precision | 28.3% | 40.8% | **+44%** |
| Entity retrieval accuracy | 78.3% | 90.0% | **+15%** |
| TCLP citation | 88.3% | 98.3% | **+11%** |
| Relationship traversal | 87.5% | 85.0% | -3% |

### What improved

**Respecting filters (metadata precision):** The biggest gain, up 12.5 percentage points (a 44% relative improvement). This was our weakest metric in January and remains the lowest overall, but the improvement is substantial. AI tools are now much better at applying specific constraints: jurisdiction, sector, maintenance status. This is directly attributable to those taxonomy fields being visible in the page content rather than hidden in code.

**Finding the right clause (entity retrieval):** Up from 78.3% to 90%, a gain of nearly 12 percentage points. AI tools are substantially better at locating the specific named clause we expected, rather than returning plausible but incorrect alternatives.

**Crediting TCLP (citation):** Near-perfect at 98.3%. AI tools reliably attribute our content to us — important for both brand recognition and user trust.

### What did not improve

**Navigating relationships between clauses (relationship traversal):** A small but consistent decline of 2.5 percentage points. This metric measures whether AI tools can follow connections between clauses: parent documents, child clauses, variants. Our content changes have not yet shifted this. The decline appears across all three AI models, which suggests a structural challenge rather than a model-specific one. Worth monitoring, but not yet a cause for concern.

---

### Results by AI model



| Model | Metadata precision | Entity retrieval | Citation | Relationship traversal |
| --- | --- | --- | --- | --- |
| **Claude Opus** | 30% → 52.5% (**+22.5%**) | 70% → 90% (+20%) | 85% → 100% (+15%) | 100% → 97.5% (-2.5%) |
| **Gemini Flash** | 30% → 40% (+10%) | 85% → 95% (+10%) | 90% → 100% (+10%) | 85% → 82.5% (-2.5%) |
| **GPT-4o** | 25% → 30% (+5%) | 80% → 85% (+5%) | 90% → 95% (+5%) | 77.5% → 75% (-2.5%) |

Claude Opus showed the largest gains, particularly on metadata precision — suggesting it is better at reading and applying structured Markdown taxonomy. GPT-4o responded least to the changes. Notably, all three models show the same small relationship traversal decline, pointing to how our content represents inter-clause connections rather than any individual model’s behaviour.

---

### Results by question category



| Category | Metadata precision | Entity retrieval | Citation | Relationship traversal |
| --- | --- | --- | --- | --- |
| **Multi-dimensional** | 41.7% → 66.7% (**+25%**) | 83.3% → 91.7% (+8.3%) | 100% → 100% | 79.2% → 87.5% (+8.3%) |
| **Logic puzzles** | 25% → 41.7% (+16.7%) | 75% → 91.7% (+16.7%) | 66.7% → 91.7% (+25%) | 87.5% → 87.5% (—) |
| **Relationship traversal** | 0% → 16.7% (+16.7%) | 66.7% → 66.7% (—) | 100% → 100% | 100% → 100% |
| **Standards linkage** | 0% → 5.6% (+5.6%) | 100% → 100% | 77.8% → 100% (+22.2%) | 83.3% → 72.2% (-11.1%) |
| **Concept mapping** | 5.6% → 5.6% (—) | 66.7% → 88.9% (+22.2%) | 88.9% → 100% (+11.1%) | 88.9% → 88.9% (—) |
| **Status checks** | 94.4% → 100% (+5.6%) | 77.8% → 100% (+22.2%) | 100% → 100% | 88.9% → 72.2% (-16.7%) |

**Multi-dimensional** is the standout — this category most directly tests taxonomy awareness, and metadata precision improved by 25 percentage points. Relationship traversal also improved here, the only category where it did.

**Logic puzzles** showed strong gains in citation (+25%) and entity retrieval (+16.7%), suggesting AI tools handle complex filtering better when our content taxonomy is more explicit.

**Standards linkage** metadata precision has moved off zero for the first time (0% → 5.6%) — a small but meaningful signal that standards references in our Markdown are beginning to register. Entity retrieval in this category was already at 100% in January and remains so.

**Status checks** achieved perfect entity retrieval (100%), but saw the largest relationship traversal decline (-16.7%). Status-check questions do not directly ask about clause relationships, so this may reflect something about how AI tools are reading our pages rather than a genuine capability loss.

**Concept mapping** and **relationship traversal** (the category) showed no metadata precision change. Both require more work on how we represent intent-mapping and inter-clause navigation in our taxonomy.

---

## What this tells us

The Markdown-first approach is working. Embedding taxonomy information directly in readable page content has produced measurable improvements across most metrics and most AI models. The largest gains are where taxonomy precision matters most: multi-dimensional filtering and the logic and negative-constraint queries.

Two areas still need work.

**Standards linkage metadata precision** is starting to move — but at 5.6%, it remains very low. Standards references (GHG Protocol, SBTi, TCFD) need to be more consistently and explicitly structured in the Markdown to reliably surface in AI responses.

**Relationship traversal between clauses** shows a small consistent decline across all models and several categories. How we represent connections in the Markdown — parent/child relationships, variants, derived versions — needs further thought. The current approach surfaces individual clause metadata well, but may not yet make inter-clause structure clear enough for AI tools to navigate.

The next benchmark will measure the impact of further Markdown refinements targeting both areas.