Skip to content

fix(fetch): add fallback extraction for readability-stripped content#3947

Open
morozow wants to merge 3 commits intomodelcontextprotocol:mainfrom
morozow:fix/fetch-readability-hidden-content-fallback
Open

fix(fetch): add fallback extraction for readability-stripped content#3947
morozow wants to merge 3 commits intomodelcontextprotocol:mainfrom
morozow:fix/fetch-readability-hidden-content-fallback

Conversation

@morozow
Copy link
Copy Markdown

@morozow morozow commented Apr 14, 2026

Description

Add fallback content extraction in mcp-server-fetch when Mozilla Readability strips too much content from the page. This fixes silent content loss on sites using progressive SSR with hidden pre-hydration markup.
Issue: #3878

Server Details

  • Server: fetch
  • Changes to: extract_content_from_html() in src/fetch/src/mcp_server_fetch/server.py

Motivation and Context

Sites using progressive server-side rendering (streaming + Lambda SSR) deliver content in two phases:

  1. An immediate visible loading shell (small HTML)
  2. SSR-rendered content in a hidden container (visibility:hidden; position:absolute; top:-9999px) that becomes visible after React hydration

Mozilla Readability treats hidden elements as non-content and strips them. The fetch tool receives the full HTML (e.g. 83 KB or 245 KB) but returns only the loading shell text — typically a single line — with no error or warning.

Before fix:

fetch("https://runtimeweb.com")
→ "Unified Serverless Framework for Full-Stack TypeScript Applications"
   (1 line from 245 KB)

fetch("https://stdiobus.com/spec")
→ "Transport-level routing for MCP/ACP protocols"
   (1 line from 83 KB)

After fix:
Both return full page content — headings, paragraphs, tables, code blocks, navigation.

This pattern is used by Next.js streaming, Remix deferred loaders, and custom SSR architectures. The number of affected sites will grow as progressive SSR adoption increases.

How Has This Been Tested?

37 tests total (20 existing + 17 new), all passing.

Tests cover:

  • Happy path — Readability works, no fallback triggered, including threshold boundary
  • Stage 2 fallback — Readability returns too little / None / missing key / whitespace-only → no-readability mode recovers
  • Stage 3 fallback — both readabilipy modes fail → raw markdownify recovers content
  • Real-world SSR patterns — visibility:hidden, position:absolute; top:-9999px, opacity:0
  • Edge cases — empty input, very small HTML, degenerate markup
  • Contract guarantee — no code path produces <error> tags; function always returns extracted content or empty string, leaving quality judgment to the caller
  • Full cascade verification — confirms exact call sequence and stage transitions

Manually verified against https://stdiobus.com and https://runtimeweb.com.

curl -s https://stdiobus.com/ | wc -c        # ~83965 bytes
curl -s https://runtimeweb.com/ | wc -c      # ~245608 bytes
curl -s https://stdiobus.com/ | grep -c "visibility:hidden"   # 1

Breaking Changes

None. The fix only activates when Readability extracts less than 1% of the HTML size as text. Normal sites where Readability works correctly are unaffected.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update

Checklist

  • I have read the MCP Protocol Documentation
  • My changes follows MCP security best practices
  • I have updated the server's README accordingly
  • I have tested this with an LLM client
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have documented all environment variables and configuration options

Additional context

The fix adds a three-stage extraction pipeline:

  1. Readability (existing) — works for most sites
  2. readabilipy without Readability JS (new fallback) — less aggressive, doesn't filter by visibility
  3. Raw markdownify (new last resort) — converts full HTML directly

Stage 2 and 3 only activate when stage 1 produces text shorter than 1% of the input HTML (max(1, len(html) // 100)). No new dependencies added.

Root cause: readabilipy.simple_json.simple_json_from_html_string(html, use_readability=True) invokes Mozilla Readability which evaluates CSS visibility and discards elements with visibility:hidden, position:absolute; top:-9999px, opacity:0. These styles are standard in progressive SSR for pre-hydration content delivery.

morozow added 2 commits April 9, 2026 01:50
- Add fallback mechanism when readability extracts minimal content (e.g., SSR hydration divs)
- Implement minimum content length threshold (1% of original HTML) to detect over-aggressive stripping
- Fall back to extraction without readability, then raw HTML markdownify if threshold not met
- Update extract_content_from_html to gracefully handle empty content instead of returning error
- Add comprehensive test suite for fallback scenarios including SSR patterns and hidden content
- Update test_empty_content_returns_error to test graceful handling instead of error message
- Replace single generic test with 11 comprehensive test cases covering fallback scenarios
- Add tests for readability returning sufficient content without fallback
- Add tests for readability stripping content and triggering fallback to non-readability mode
- Add tests for both readability modes failing and falling back to raw markdownify
- Add tests for None content and missing 'content' key edge cases
- Add tests for 1% threshold calculation and boundary conditions
- Add tests for whitespace-only content triggering fallback
- Add tests for stage 2 (no-readability) failures falling back to raw extraction
- Improve test clarity with descriptive names and docstrings explaining each scenario
- Ensure comprehensive coverage of the three-stage fallback extraction pipeline
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant