
Schema-Driven Crawling is Cheap and Effective

Published by morgan@noosphereanalytics.com on August 8, 2025

tl;dr

Small models are bad at agentic tool use but surprisingly good at structured extraction. By moving navigation decisions into plain crawler logic and asking a cheap LLM only to fill out a Pydantic schema for each page, I can index board documents for most of the districts I track at well under a penny per site.

The Problem of Crawling Public Documents

There are over 13,000 school districts in the United States. The state of Washington alone has nearly 300, and each one stores its public documents differently: naming conventions, document schemas, and file types vary, and so do the places and platforms where the files live.

Before we even think about solving the hard problems of format normalization, we have to solve these first:

  1. Figuring out the (usually multiple) places where public documents are stored.
  2. Figuring out whether each storage location is an archive or a dumping ground for fresh data.

Many districts create a new subfolder or subpage for each school year, then move the whole thing into an archive when the year turns over.

If we can tell when a page is only hosting fresh documents for the current year, we can skip crawling it until the year turns over and save a lot of compute.

Examples:

someurl.edu/board/docs/2024-2025
someurl.edu/board/docs/archive/2024-2025
someurl.edu/board/docs/2024-2025-archived
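
If all you needed was a quick heuristic, URL patterns like these are already informative. A rough sketch of that idea, purely illustrative (the helper and regex below are mine; the crawler later in the post lets the LLM make this call instead):

import re
import typing as t
from datetime import date

# Hypothetical helper: guess from the URL alone whether a documents page
# is an archive or the current school year.
YEAR_RANGE = re.compile(r"(\d{4})-(\d{4})")

def looks_like_archive(url: str, today: t.Optional[date] = None) -> bool:
    today = today or date.today()
    # Anything explicitly labeled "archive"/"archived" is a safe bet.
    if "archive" in url.lower():
        return True
    match = YEAR_RANGE.search(url)
    if not match:
        return False
    end_year = int(match.group(2))
    # A school year that ended before the current calendar year is stale.
    return end_year < today.year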

The Basic Approach

  1. Hook an LLM up to a web browser. The usual method is an MCP server plus Playwright (more on why I didn’t do this below).
  2. Navigate to each district website and have the LLM click through until it finds a store of documents.
  3. Record metadata about what’s there and save the location for future crawls.

Abandoned Approaches

Tool calling plus Playwright

In my tests, cheaper models could only navigate a few clicks deep before getting lost. Eventually they’d find an event calendar and declare the job done. A typical path to the documents looks like:

Homepage -> Board of Directors -> Documents -> School Year [subpages for each year]

Even with aggressive markdown conversion, a direct path through a site like this can take hundreds of thousands of tokens.

With a little overhead, the rough cost might look something like:

tokens in/out: 55,000 / 5,000

    gpt-5-mini:
    (55,000/1e6 * 0.25) + (5,000/1e6 * 2.00) = $0.02375

    gpt-5:
    (55,000/1e6 * 1.25) + (5,000/1e6 * 10.00) = $0.11875

This may not seem like much, but factor in recurring crawls of 10,000+ district sites and it adds up to many thousands of dollars a year.
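
To make that extrapolation concrete, here is the same arithmetic as a throwaway script. The per-site token counts come from the estimate above; the monthly recrawl cadence is my assumption:

# Back-of-the-envelope annual cost for the agentic approach, assuming
# ~55k input / 5k output tokens per site and a monthly recrawl of 10,000 sites.
PRICES_PER_MILLION = {  # (input, output) in dollars per 1M tokens
    "gpt-5-mini": (0.25, 2.00),
    "gpt-5": (1.25, 10.00),
}

def cost_per_site(model: str, tokens_in: int = 55_000, tokens_out: int = 5_000) -> float:
    price_in, price_out = PRICES_PER_MILLION[model]
    return tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out

for model in PRICES_PER_MILLION:
    per_site = cost_per_site(model)
    print(f"{model}: ${per_site:.5f}/site, ~${per_site * 10_000 * 12:,.0f}/year")
# gpt-5-mini: $0.02375/site, ~$2,850/year
# gpt-5: $0.11875/site, ~$14,250/year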

Agents like Operator or firecrawl.dev

Still expensive. In Firecrawl’s case, the metadata extraction was too limited, so I would have to run results through another LLM anyway to decide if a page was current or archival.

The Turning Point

After a lot of frustration with agentic approaches, I realized two things:

  1. Small models are bad at tool use.
  2. Small models are pretty good at extracting data into schemas.

If I shift decision-making into crawler logic and keep LLMs focused on structured extraction, I can use much cheaper models without losing accuracy.

The Schema

Here’s the Pydantic model I ended up with (truncated). Full version: gist link

import typing as t

from pydantic import Field

# BaseLLMModel, DataPageInfo, and PossibleRelevantPage are defined in the full version.
class RelevantPage(BaseLLMModel):
    url: str
    title: str
    has_data: bool = Field(
        description="True if this page contains the desired data, False otherwise."
    )
    has_data_links: bool = Field(
        description="True if this page contains links to subpages with the desired data, False otherwise."
    )
    description: t.Optional[str] = Field(
        description="A brief description of the page's content."
    )
    data_page_info: t.Optional[DataPageInfo] = Field(
        description="If this page has relevant data this should have metadata about it."
    )
    possible_relevant_pages: t.List[PossibleRelevantPage] = Field(
        description="Links on the current page that seem likely to lead to relevant data"
    )
    no_data_found: t.Optional[int] = Field(
        default=None,
        description="Number of times a scraper visited this page and found no data.",
    )

I treat the schema like a prompt by using descriptive field definitions. If I change the schema, I don’t have to update a separate prompt - the descriptions are built in.
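
This works because Pydantic exposes those descriptions in the JSON schema, so "schema as prompt" is mostly string assembly. A minimal sketch of the idea (not the exact helper from the gist; schema_as_prompt is an illustrative name):

import json
import typing as t

from pydantic import BaseModel

def schema_as_prompt(template: str, model: t.Type[BaseModel]) -> str:
    # model_json_schema() carries every Field(description=...) string, so the
    # schema doubles as per-field extraction instructions.
    schema = json.dumps(model.model_json_schema(), indent=2)
    return (
        f"{template}\n\n"
        "Respond with a single JSON object that validates against this schema. "
        "Treat the field descriptions as instructions:\n"
        f"{schema}"
    )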

How It Works

  1. Start on a given page.
  2. Ask the LLM to classify:
    • Does this page have the data we want?
    • Is it current or archival? (I pass the current date)
    • Which links might lead to relevant data?
  3. Follow the highest-ranked links until we’ve exhausted the site.

This moves complexity out of the LLM and into crawler heuristics.
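
"Highest-ranked" just means the frontier is ordered by the confidence the model attached to each link. The crawler shown below keeps a plain list, but a priority queue is the natural upgrade. A sketch (push and pop_best_url are illustrative names, not part of the crawler):

import heapq

# Frontier of (negative confidence, url) pairs; confidence is the 0.0-1.0
# score the model assigns to each candidate link.
frontier: list[tuple[float, str]] = []

def push(url: str, confidence: float) -> None:
    # heapq is a min-heap, so store negative confidence to pop the best link first.
    heapq.heappush(frontier, (-confidence, url))

def pop_best_url() -> str:
    return heapq.heappop(frontier)[1]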

In Practice

We don't explore the whole site; instead, we look for possible data sources while checking the current page for signs of documents. For possible_relevant_pages, the model is instructed to assign a confidence of 0.5 when it's truly unsure, scaling up to 1.0 for complete certainty. This is qualitative, but it works as a baseline, especially once we train LoRA adapters on known-good data and use them in future runs.
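
filter_possible_relevant_pages, used in the code below, isn't shown in this post. What matters is its contract: resolve each candidate link, drop anything already visited or below the confidence threshold, and split the rest into likely data pages versus pages still worth visiting. A hedged sketch of that contract - the fields read off each candidate (url, confidence, data_page_info) are my assumptions, not the gist's actual definition:

import typing as t
from collections import defaultdict
from urllib.parse import urljoin

def filter_possible_relevant_pages(
    candidates: t.List["PossibleRelevantPage"],
    base_url: str,
    confidence_threshold: float,
    visited: t.Set[str],
) -> t.Tuple[t.Dict[str, list], list]:
    data_page_candidates: t.Dict[str, list] = defaultdict(list)
    to_visit_candidates: list = []
    for candidate in candidates:
        candidate.url = urljoin(base_url, candidate.url)  # make relative links absolute
        if candidate.url in visited or candidate.confidence < confidence_threshold:
            continue
        if candidate.data_page_info:  # the model already tagged this link as a data page
            data_page_candidates[candidate.data_page_info.data_type].append(candidate)
        else:
            to_visit_candidates.append(candidate)
    return data_page_candidates, to_visit_candidates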

In a future post, I'll explore the LoRA adapters I've trained to distill good decisions from expensive models and to stabilize fuzzy "assign a value" instructions.

Truncated example. Full crawler logic: gist link

start_page: StartPage = llm_client.invoke_with_model_response_json(
    prompt, StartPage
)

visited.add(url)

data_pages_candidates, to_visit_candidates = filter_possible_relevant_pages(
    start_page.possible_relevant_pages, url, 0.5, visited
)

for dt, page_list in data_pages_candidates.items():
    for p in page_list:
        if p not in data_pages[dt]:
            data_pages[dt].append(p)

to_visit += to_visit_candidates

# From here forward, we're crawling as usual.
while to_visit and pages_visited < max_pages:
    current_url = (to_visit.pop()).url
    if current_url in visited:
        continue
    pages_visited += 1
    logger.info(f"visiting: {current_url}")
    page_html = fetch_html_inner(current_url)
    if not page_html:
        logger.warning(f"No HTML found at the provided URL: {current_url}")
        continue

    page_text = render_page_with_links_as_markdown(page_html, base_url=current_url)
    page_text = llm_utils.trim_to_token_count(page_text)
    visited.add(current_url)
    prompt = RelevantPagesPrompt.format(
        entity_type=EntityType.SCHOOL_BOARD, text=page_text, url=current_url
    )
    try:
        relevant_page: RelevantPage = llm_client.invoke_with_model_response_json(
            prompt, RelevantPage
        )
    except ValueError as e:
        logger.error(
            f"Error generating relevant page response for {current_url}: {e}"
        )
        continue
    relevant_page.url = current_url
    if relevant_page.data_page_info:
        # Record this page under its data type, skipping duplicates.
        data_type = relevant_page.data_page_info.data_type
        if relevant_page not in data_pages[data_type]:
            data_pages[data_type].append(relevant_page)

    data_pages_candidates, to_visit_candidates = filter_possible_relevant_pages(
        relevant_page.possible_relevant_pages,
        current_url,  # base the candidates on the page we just parsed, not the start URL
        crawl_confidence_threshold,
        visited,
    )

    for dt, page_list in data_pages_candidates.items():
        for p in page_list:
            if p not in data_pages[dt]:
                data_pages[dt].append(p)

    to_visit += to_visit_candidates
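
Two helpers above do the unglamorous work: fetch_html_inner is presumably a plain HTTP GET with error handling, and render_page_with_links_as_markdown flattens the page so the model sees each link's URL next to its anchor text. A rough sketch of the latter, assuming BeautifulSoup (the real version is in the gist):

from urllib.parse import urljoin

from bs4 import BeautifulSoup

def render_page_with_links_as_markdown(page_html: str, base_url: str) -> str:
    soup = BeautifulSoup(page_html, "html.parser")
    # Drop tags that add tokens but no visible text.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    # Rewrite anchors as [text](absolute-url) so the model can cite them as candidates.
    for anchor in soup.find_all("a", href=True):
        text = anchor.get_text(strip=True) or anchor["href"]
        anchor.replace_with(f"[{text}]({urljoin(base_url, anchor['href'])})")
    return soup.get_text(separator="\n", strip=True)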

The LLM Client

Part of making this easy was adding helper methods to my LLM client for models without tool-calling support. Truncated. Full file: gist link

def invoke_with_model_response_json(self, template: str, model: t.Any) -> t.Any:
    """This method is intended for cases where a model does not have explicit tool calling capability.

    In these cases the Pydantic model cannot be bound directly, so its schema is injected into the
    provided prompt along with instructions to return output in that format. The output is then trimmed
    to handle the common case where a valid JSON object is embedded in surrounding junk.
    """
    template = self.compile_model_response_template(template, model)

    def _run(template, last_error: t.Optional[Exception] = None):
        if last_error:
            template = f"{template}\n\n--- THIS IS A RETRY, THE LAST CALL RESULTED IN AN ERROR ---:\n{last_error}"

        # Invoke the llm with the new template
        result = self.invoke(template)

        # Trim any superfluous characters from the start and end of the response
        trimmed_response = self.trim_json_response(result)

        # Try to parse the response as JSON
        json_response = json.loads(trimmed_response)

        # Validate the response against the model's schema
        model_result = model.model_validate(json_response)
        model_result.llm_model_name = self.MODEL_NAME

        return model_result

    try:
        return _run(template)
    except (json.JSONDecodeError, ValueError) as e:
        logger.error(f"Failed to parse or validate response: {e}")
        logger.info("retrying llm call and model serialization.")
        return _run(template, e)
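
Of the two helpers referenced there, compile_model_response_template does roughly what the schema_as_prompt sketch earlier showed: append the model's JSON schema and formatting instructions to the prompt. trim_json_response handles the other common failure mode, a valid JSON object wrapped in chatter. A minimal sketch of the trimming, written as a free function rather than a method and assuming at most one top-level object per response:

def trim_json_response(raw: str) -> str:
    # Keep everything from the first '{' to the last '}', discarding any
    # preamble or trailing commentary the model wrapped around the JSON.
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError(f"No JSON object found in response: {raw[:80]!r}")
    return raw[start : end + 1]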

The Results

Sample output (full JSON: gist link):

{
  "pages": {
    "Board of Trustee Information": [
      {
        "url": "https://www.swsd.k12.wa.us/o/swsd/page/series-1000-the-board-of-directors",
        "title": "Series - 1000 The Board of Directors",
        "has_data": true,
        "has_data_links": false,
        "description": "Information about the Board of Directors including policies and procedures.",
        "data_page_info": {
          "data_type": "Board of Trustee Information",
          "is_archive": false,
          "data_years_available": [],
          "confidence": 0.9
        },
        "possible_relevant_pages": [],
        "no_data_found": null,
        "llm_model_name": "deepseek-r1:70b"
      }
    ],
    "Meeting Recordings": [
      {
        "url": "https://www.swsd.k12.wa.us/documents/district/school-board-information/2024-2025-board-documents/2024-2025-board-meeting-recording/694276",
        "title": "2024-2025 Board Meeting Recordings",
        "has_data": true,
        "has_data_links": false,
        "description": "A page containing links to audio and video recordings of board meetings for the 2024-2025 academic year.",
        "data_page_info": {
          "data_type": "Meeting Recordings",
          "is_archive": false,
          "data_years_available": [
            2024,
            2025
          ],
          "confidence": 0.9
        },
        "possible_relevant_pages": [],
        "no_data_found": null,
        "llm_model_name": "deepseek-r1:70b"
      }
    ]
  }
}

At a bit over 55,000 tokens in and 5,000 out, deepseek-r1:70b currently costs (55,000/1e6 * $0.10) + (5,000/1e6 * $0.40) = $0.0075 for this crawl, just under a penny.

Out of 298 school districts I track in Washington, this method got me 273 usable data_pages.json files with very cheap models. For the rest, I use more expensive fallback methods.

(venv) ➜  data-pipeline git:(main) find data/countries/usa/states/wa/counties/*/school_boards/* -maxdepth 1 -name 'data_pages.json' | wc -l
     273

It all adds up to a much more efficient and maintainable infrastructure.

What’s Next

In the next post, I’ll show how I use feedback from scraping runs to prune dead pages and make the process a bit faster and cheaper.
