markdown-for-agents

Advanced Options

Token estimation, deduplication, and server timing for fine-tuning conversions.

Token Estimation

Every conversion returns a tokenEstimate with token, character, and word counts. This helps with LLM context window planning and cost estimation without needing a separate tokenizer dependency.

Built-in Heuristic

The default estimator uses a fast heuristic of roughly 4 characters per token. This is accurate enough for planning purposes and adds negligible overhead:

TypeScript:

import { convert } from 'markdown-for-agents';

const { markdown, tokenEstimate } = convert(html);

console.log(tokenEstimate);
// { tokens: 12, characters: 46, words: 8 }

Python:

from markdown_for_agents import convert

result = convert(html)

print(result.token_estimate)
# TokenEstimate(tokens=12, characters=46, words=8)

The estimate is also surfaced by middleware via the x-markdown-tokens response header.
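
To make the heuristic concrete, here is a minimal standalone sketch of a 4-characters-per-token estimator. It does not use the library: the local `TokenEstimate` dataclass and `estimate_tokens` function are illustrative stand-ins, and rounding up is an assumption (it happens to agree with the sample outputs on this page, but the library's exact rounding is not documented here).

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenEstimate:
    """Illustrative stand-in for the library's estimate object."""
    tokens: int
    characters: int
    words: int


def estimate_tokens(text: str, chars_per_token: int = 4) -> TokenEstimate:
    """Estimate tokens as roughly one token per `chars_per_token` characters."""
    characters = len(text)
    # Assumption: round up, so any non-empty text counts as at least one token.
    tokens = math.ceil(characters / chars_per_token)
    # Words are whitespace-separated runs of characters.
    words = len(text.split())
    return TokenEstimate(tokens=tokens, characters=characters, words=words)


print(estimate_tokens("Some markdown text"))
# TokenEstimate(tokens=5, characters=18, words=3)
```

Because the estimate is a pure function of string length, it costs far less than running a real tokenizer, at the price of drifting on code-heavy or non-English text.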

Custom Token Counter

Replace the built-in heuristic with an exact tokenizer when precision matters:

TypeScript:

import { convert } from 'markdown-for-agents';
import { encoding_for_model } from 'tiktoken';

const enc = encoding_for_model('gpt-4o');

const { markdown, tokenEstimate } = convert(html, {
    tokenCounter: text => ({
        tokens: enc.encode(text).length,
        characters: text.length,
        words: text.split(/\s+/).filter(Boolean).length
    })
});

Python:

from markdown_for_agents import convert, TokenEstimate
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

result = convert(html, token_counter=lambda text: TokenEstimate(
    tokens=len(enc.encode(text)),
    characters=len(text),
    words=len(text.split()),
))

The custom counter receives the final markdown string and must return a TokenEstimate with tokens, characters, and words fields. When used with middleware, the custom counter's value flows through to the x-markdown-tokens response header.

Standalone Usage

The token estimator is also available as a standalone function:

import { estimateTokens } from 'markdown-for-agents/tokens';

const estimate = estimateTokens('Some markdown text');
// { tokens: 5, characters: 18, words: 3 }
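
As a rough illustration of context-window planning with these estimates, the sketch below greedily keeps pages whose estimated token counts fit a budget. The `pack_within_budget` helper, the page names, and the token numbers are all hypothetical; in practice the counts would come from `tokenEstimate.tokens` or the x-markdown-tokens header.

```python
from typing import List, Tuple


def pack_within_budget(
    docs: List[Tuple[str, int]], budget: int
) -> Tuple[List[str], int]:
    """Keep documents, in order, while their estimated tokens fit the budget."""
    selected: List[str] = []
    used = 0
    for name, tokens in docs:
        if used + tokens > budget:
            continue  # skip any document that would overflow the budget
        selected.append(name)
        used += tokens
    return selected, used


# Hypothetical per-page token estimates.
pages = [("intro", 1200), ("api", 5400), ("faq", 900)]
print(pack_within_budget(pages, budget=3000))
# (['intro', 'faq'], 2100)
```

Since heuristic estimates can undercount, leaving some headroom below the model's real limit is a sensible precaution.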

Deduplication

Real-world HTML pages often contain repeated content - navigation links that appear in multiple places, "Read more" buttons, or footer text duplicated across sections. Deduplication removes these repeated blocks from the converted Markdown.

Basic Usage

TypeScript:

import { convert } from 'markdown-for-agents';

const { markdown } = convert(html, { deduplicate: true });

Python:

from markdown_for_agents import convert

result = convert(html, deduplicate=True)

Minimum Length

The minLength option (min_length in Python, default: 10) sets the minimum length, in characters, that a block must have to be eligible for deduplication. Shorter blocks are always kept, which protects separators (---), short headings, and formatting elements.

TypeScript:

const { markdown } = convert(html, {
    deduplicate: { minLength: 5 } // catch short repeated phrases like "Read more"
});

Python:

from markdown_for_agents import DeduplicateOptions, convert

result = convert(html, deduplicate=DeduplicateOptions(min_length=5))

Lower it to catch short repeated phrases. Raise it for more conservative deduplication.
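
The effect of the minimum length can be illustrated with a small standalone sketch. It assumes blocks are separated by blank lines and compared by exact text after trimming; the library's actual block splitting and matching rules may differ, and `deduplicate_blocks` is a hypothetical name.

```python
def deduplicate_blocks(markdown: str, min_length: int = 10) -> str:
    """Drop repeated blocks, keeping the first occurrence of each."""
    seen = set()
    kept = []
    for block in markdown.split("\n\n"):
        key = block.strip()
        # Blocks shorter than min_length are always kept (separators, short headings).
        if len(key) >= min_length:
            if key in seen:
                continue  # an identical block already appeared earlier
            seen.add(key)
        kept.append(block)
    return "\n\n".join(kept)


doc = "Read more\n\nFirst section text.\n\n---\n\nRead more\n\nSecond section text.\n\n---"

# "Read more" is 9 characters, so the default threshold of 10 leaves it alone.
print(deduplicate_blocks(doc) == doc)  # True

# Lowering the threshold removes the second "Read more" but keeps both "---".
print(deduplicate_blocks(doc, min_length=5))
```

In this sketch the "---" separators survive at either threshold because they are only 3 characters long.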

When to Use It

Deduplication is most useful when content extraction is disabled or only partially enabled. When extract: true is set, most boilerplate is already stripped before conversion, so there is less left to deduplicate.

Enable deduplication when:

  • You're converting pages without extraction and want to clean up repeated blocks
  • Pages have content that is legitimately repeated across sections (e.g. repeated CTAs)
  • You want to minimize token usage as aggressively as possible

Server Timing

Enable serverTiming to measure conversion duration. The result includes convertDuration (in milliseconds), and middleware adapters use it to set a Server-Timing header:

TypeScript:

const { markdown, convertDuration } = convert(html, { serverTiming: true });
console.log(`Conversion took ${convertDuration}ms`);
// Middleware sets: Server-Timing: mfa.convert;dur=4.7;desc="HTML to Markdown"

Python:

result = convert(html, server_timing=True)
print(f"Conversion took {result.convert_duration}ms")
# Middleware sets: Server-Timing: mfa.convert;dur=4.7;desc="HTML to Markdown"

Server-Timing surfaces in browser devtools (Network tab > Timing) and is accessible via the PerformanceServerTiming API. Middleware also sets an x-markdown-timing header with the same data, which survives CDN caching.
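
For readers wiring this up by hand, the sketch below shows one way to time a call and format a Server-Timing value in the shape shown above. The `timed` and `server_timing_header` helpers are hypothetical, the one-decimal precision is an assumption, and `str.upper` merely stands in for a real conversion.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Run fn and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    duration_ms = (time.perf_counter() - start) * 1000.0
    return result, duration_ms


def server_timing_header(duration_ms: float) -> str:
    """Format a Server-Timing value: metric name, duration, description."""
    # Assumption: one decimal place, mirroring the example header above.
    return f'mfa.convert;dur={duration_ms:.1f};desc="HTML to Markdown"'


# str.upper is a stand-in for the actual HTML-to-Markdown conversion.
output, duration_ms = timed(str.upper, "<p>hello</p>")
print(server_timing_header(duration_ms))
```

`time.perf_counter()` is used because it is monotonic and high-resolution, which suits short measurements better than wall-clock time.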

Content-Signal Header

Middleware can set a content-signal HTTP header to communicate publisher consent for AI usage. This is opt-in - the header is only set when explicitly configured.

Basic Usage

TypeScript:

import { markdown } from '@markdown-for-agents/express';

app.use(
    markdown({
        contentSignal: {
            aiTrain: true,
            search: true,
            aiInput: true
        }
    })
);
// Sets header: content-signal: ai-train=yes, search=yes, ai-input=yes

Python:

from markdown_for_agents import MiddlewareOptions, ContentSignalOptions

app.add_middleware(
    MarkdownMiddleware,
    options=MiddlewareOptions(
        content_signal=ContentSignalOptions(
            ai_train=True,
            search=True,
            ai_input=True,
        ),
    ),
)
# Sets header: content-signal: ai-train=yes, search=yes, ai-input=yes

Signal Fields

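| Field   | Header value | Description                                     |
| ------- | ------------ | ----------------------------------------------- |
| aiTrain | ai-train=yes | Consent for AI model training on this content   |
| search  | search=yes   | Consent for search engine indexing              |
| aiInput | ai-input=yes | Consent for AI agents to use content as context |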

Granular Control

Only explicitly set fields are included in the header. Set a field to false to signal denial, or omit it to exclude it entirely:

TypeScript:

app.use(
    markdown({
        contentSignal: {
            aiTrain: false, // ai-train=no
            search: true    // search=yes
            // aiInput omitted - not included in header
        }
    })
);
// Sets header: content-signal: ai-train=no, search=yes

Python:

MiddlewareOptions(
    content_signal=ContentSignalOptions(
        ai_train=False,  # ai-train=no
        search=True,     # search=yes
        # ai_input omitted - not included in header
    ),
)
# Sets header: content-signal: ai-train=no, search=yes
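
The three-state behavior (yes, no, or omitted) can be sketched as a small standalone function. This is an illustration of the mapping described above, not the middleware's implementation; `content_signal_header` is a hypothetical name, and returning `None` when nothing is configured is an assumption modeling "header not set".

```python
from typing import Optional


def content_signal_header(
    ai_train: Optional[bool] = None,
    search: Optional[bool] = None,
    ai_input: Optional[bool] = None,
) -> Optional[str]:
    """Build a content-signal value from explicitly set fields only."""
    fields = [("ai-train", ai_train), ("search", search), ("ai-input", ai_input)]
    # None means "not configured": the field is left out of the header entirely.
    parts = [
        f"{name}={'yes' if value else 'no'}"
        for name, value in fields
        if value is not None
    ]
    return ", ".join(parts) if parts else None


print(content_signal_header(ai_train=False, search=True))
# ai-train=no, search=yes
```

Using `None` as the default keeps "deny" (`False`) distinct from "say nothing", which a plain boolean default could not express.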
