This document specifies how an Agential Semioverse Repository published as a website makes itself discoverable to automated agents — search engine crawlers, AI training pipelines, retrieval-augmented generation (RAG) systems, and other machines that traverse the web looking for knowledge.

Motivation

An ASR is a structured knowledge repository. Its directory organization, frontmatter schema, and internal linking conventions already carry semantic structure. When published as a website, that structure can be made legible to machines in three complementary ways:

  1. Crawl permissions — telling machines they are welcome.
  2. Structured data — telling machines what each page is.
  3. Sitemap — telling machines what pages exist.

Together, these transform a static site from a collection of pages into a machine-readable knowledge surface.

1. Crawl permissions: robots.txt

An ASR SHOULD publish a robots.txt file at the site root that explicitly welcomes crawlers relevant to the repository’s goals. For a research site aiming to be discoverable by AI agents, this means naming and allowing AI-specific user-agents:

  • GPTBot (OpenAI)
  • ClaudeBot (Anthropic)
  • anthropic-ai (Anthropic training)
  • Google-Extended (Google AI)
  • PerplexityBot (Perplexity)
  • CCBot (Common Crawl)
  • Applebot-Extended (Apple Intelligence)

The robots.txt file SHOULD also point to the sitemap:

Sitemap: https://example.com/sitemap.xml
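Putting the pieces together, a robots.txt following this specification might look like the sketch below. The user-agent names come from the list above; the allow-all posture reflects the openness principle in section 4, and the domain is a placeholder.

```
# Welcome AI crawlers explicitly (two shown; repeat for each agent listed above)
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

# Default posture: open to all other crawlers
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
```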

2. Structured data: JSON-LD and Schema.org

An ASR SHOULD emit per-page JSON-LD in the <head> element, using Schema.org vocabulary. The Schema.org type is determined by the page’s position in the ASR directory structure:

ASR directory type               Schema.org type    Rationale
terms/                           DefinedTerm        Glossary entries define specific terms
concepts/                        DefinedTerm        Concept notes define specific ideas
text/                            Article            Papers and essays are articles
curricula/                       LearningResource   Lessons are educational resources
specifications/                  Article            Specifications are technical articles
Encyclopedia entries for people  Person             Biographical pages
Site root                        WebSite            Identifies the site as a whole
Default                          Article            Fallback for any other page
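The mapping above can be sketched as a build-time function that derives a Schema.org type from a page's path. This is a minimal illustration, not the specification's required implementation: the function name and matching rules are hypothetical, and detection of Person pages (which depends on an encyclopedia convention rather than a directory name) is elided.

```typescript
// Hypothetical sketch: derive a Schema.org type from an ASR page path,
// following the directory-to-type table above. Person detection is omitted
// because it is not keyed to a directory name in this specification.
function schemaTypeFor(path: string): string {
  if (path === "" || path === "index") return "WebSite"; // site root
  if (/\/?terms\//.test(path) || /\/?concepts\//.test(path)) return "DefinedTerm";
  if (/\/?curricula\//.test(path)) return "LearningResource";
  if (/\/?text\//.test(path) || /\/?specifications\//.test(path)) return "Article";
  return "Article"; // default fallback for any other page
}
```

Because the type is derived from the path alone, authors never annotate pages by hand, which is the "let structure do the work" principle in section 4.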

Multi-level pages SHOULD also emit a BreadcrumbList reflecting the directory path. This tells machines the hierarchical context: a term defined under mathematics/objects/posets/terms/ sits within mathematics, within the objects of mathematics, within the study of partially ordered sets.
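For the poset example above, the emitted JSON-LD might combine a DefinedTerm with a BreadcrumbList roughly as follows. All names, URLs, and property values here are illustrative placeholders, not mandated output.

```json
{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "DefinedTerm",
      "name": "closure operator",
      "inDefinedTermSet": "https://example.com/mathematics/objects/posets/terms/"
    },
    {
      "@type": "BreadcrumbList",
      "itemListElement": [
        { "@type": "ListItem", "position": 1, "name": "mathematics", "item": "https://example.com/mathematics/" },
        { "@type": "ListItem", "position": 2, "name": "objects", "item": "https://example.com/mathematics/objects/" },
        { "@type": "ListItem", "position": 3, "name": "posets", "item": "https://example.com/mathematics/objects/posets/" },
        { "@type": "ListItem", "position": 4, "name": "closure operator" }
      ]
    }
  ]
}
```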

Mapping ASR frontmatter to Schema.org properties

ASR frontmatter   Schema.org property   Notes
title             headline / name       Display name of the concept
description       description           Summary for indexing
authors           author                Creator(s) of the content
date-created      datePublished         Publication date
date-updated      dateModified          Last edit date
tags              keywords              Topical classification
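The frontmatter mapping can likewise be expressed as a small translation function. This is a sketch under assumptions: the interface shape, function name, and the choice to represent authors as Person objects are illustrative, and a real build (such as a Quartz Head component) may structure this differently.

```typescript
// Hypothetical sketch: translate ASR frontmatter into Schema.org properties
// per the mapping table above. Field names follow the ASR frontmatter schema;
// everything else is illustrative.
interface Frontmatter {
  title: string;
  description?: string;
  authors?: string[];
  "date-created"?: string;
  "date-updated"?: string;
  tags?: string[];
}

function toSchemaProps(fm: Frontmatter): Record<string, unknown> {
  return {
    headline: fm.title,                                          // title -> headline
    name: fm.title,                                              // title -> name
    description: fm.description,                                 // description -> description
    author: fm.authors?.map((a) => ({ "@type": "Person", name: a })), // authors -> author
    datePublished: fm["date-created"],                           // date-created -> datePublished
    dateModified: fm["date-updated"],                            // date-updated -> dateModified
    keywords: fm.tags?.join(", "),                               // tags -> keywords
  };
}
```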

3. Sitemap

An ASR published as a website SHOULD generate an XML sitemap listing all published pages with their last-modified dates. The sitemap enables crawlers to discover content without following every link, and to prioritize recently updated pages.
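A minimal sitemap entry in the standard Sitemaps XML format might look like the fragment below; the URL and date are placeholders.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/mathematics/objects/posets/terms/closure-operator/</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
</urlset>
```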

4. Design principles

Be explicit about what you are

AI systems are better at indexing content when pages declare their type. A page that says “I am a DefinedTerm called ‘closure operator’” is more useful to a knowledge graph than a page that contains the words “closure operator” somewhere in its text.

Welcome agents rather than blocking them

The default posture of an ASR is openness. The repository exists to contribute knowledge to the commons. Blocking crawlers contradicts this purpose. If specific content should not be crawled (private notes, drafts), it should be excluded from publication rather than published and then blocked.

Let structure do the work

The ASR directory organization already encodes semantic structure. The web discoverability layer translates that structure into formats machines understand. No additional markup or annotation should be required from the author — the build process should derive structured data from the directory position and frontmatter that already exist.

Implementation

emsenn.net implements this specification through:

  • quartz/static/robots.txt — crawl permissions
  • quartz/components/Head.tsx — per-page JSON-LD generation
  • Quartz’s built-in sitemap emitter — XML sitemap at /sitemap.xml