fix(fetch): fall back when Readability strips hidden SSR content#3922

Open
Christian-Sidak wants to merge 1 commit into modelcontextprotocol:main from Christian-Sidak:fix/fetch-ssr-content-fallback

Conversation

@Christian-Sidak

Summary

  • Adds a three-stage fallback to extract_content_from_html() so that pages using progressive SSR (hidden pre-hydration markup) are not silently reduced to a single line of loading-shell text
  • Stage 1: Readability (existing behavior, unchanged for normal sites)
  • Stage 2: readabilipy without Readability JS (less aggressive, does not filter by CSS visibility)
  • Stage 3: Raw markdownify conversion (last resort)
  • Fallback only activates when Readability output is shorter than 1% of the input HTML
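The staged fallback described above can be sketched in a few lines. This is a hedged illustration with made-up names (`staged_extract` and the extractor callables are not from the PR); the real implementation in server.py chains Readability, readabilipy, and markdownify:

```python
# Hypothetical sketch of a staged extraction fallback. The extractor
# callables stand in for Readability / readabilipy / markdownify.

def staged_extract(html: str, extractors, min_ratio: float = 0.01) -> str:
    """Try each extractor in order; accept the first result whose text
    length is at least min_ratio (1% by default) of the input HTML length.
    If no stage clears the threshold, return the last stage's output."""
    threshold = len(html) * min_ratio
    result = ""
    for extract in extractors:
        result = extract(html)
        if len(result) >= threshold:
            return result
    return result  # last resort: whatever the final stage produced
```

Because the threshold is proportional to the input size, a 500 KB SSR payload that Readability reduces to a one-line loading shell falls through to the next stage, while a small page whose legitimate article text is short does not.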

Motivation

Sites using progressive server-side rendering (Next.js streaming, Remix deferred, custom Lambda SSR) deliver content in two phases: a small visible loading shell, then the real content in a hidden container (visibility:hidden; position:absolute; top:-9999px) that becomes visible after client-side hydration. Mozilla Readability treats hidden elements as non-content and strips them entirely, causing mcp-server-fetch to return only the loading shell text with no indication that content was lost.

For example, fetching https://runtimeweb.com returns just "Unified Serverless Framework for Full-Stack TypeScript Applications" instead of the full page content.
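To make the failure mode concrete, here is a toy visibility-aware extractor that mimics how a Readability-style parser drops hidden pre-hydration containers. The page markup is invented for illustration (it is not runtimeweb.com's actual HTML), and the parser is deliberately naive:

```python
from html.parser import HTMLParser

# Toy example only: real pages have void tags and external CSS that this
# sketch does not handle.

PAGE = """
<div id="shell">Loading...</div>
<div id="app" style="visibility:hidden;position:absolute;top:-9999px">
  The real article text, delivered before hydration.
</div>
"""

class TextExtractor(HTMLParser):
    def __init__(self, skip_hidden: bool):
        super().__init__()
        self.skip_hidden = skip_hidden
        self.hidden_depth = 0  # >0 while inside a hidden subtree
        self.parts = []

    def handle_starttag(self, tag, attrs):
        style = dict(attrs).get("style", "")
        if self.skip_hidden and "visibility:hidden" in style.replace(" ", ""):
            self.hidden_depth += 1
        elif self.hidden_depth:
            self.hidden_depth += 1  # nested tag inside a hidden subtree

    def handle_endtag(self, tag):
        if self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth:
            self.parts.append(data.strip())

def extract(html: str, skip_hidden: bool) -> str:
    parser = TextExtractor(skip_hidden)
    parser.feed(html)
    return " ".join(p for p in parser.parts if p)
```

With `skip_hidden=True` (the Readability-like behavior) only the loading shell survives; with `skip_hidden=False` (the behavior of the later fallback stages) the hidden article text is preserved.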

Changes

  • src/fetch/src/mcp_server_fetch/server.py: Modified extract_content_from_html() to try three extraction stages, falling back only when the previous stage produces disproportionately little text
  • src/fetch/tests/test_server.py: Added 6 unit tests covering all fallback paths, threshold behavior, and no-regression for normal pages
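A hedged sketch of how one such fallback path can be covered deterministically with stubbed stages (the helper and test names here are illustrative; the PR's actual tests mock the real extractors in test_server.py):

```python
# Simplified two-stage stand-in for the extractor under test.

def extract_with_fallback(html, primary, fallback, min_ratio=0.01):
    text = primary(html)
    if len(text) < len(html) * min_ratio:
        text = fallback(html)
    return text

def test_loading_shell_triggers_fallback():
    # Large page whose primary extractor returns only loading-shell text.
    html = "<div>" + "x" * 10_000 + "</div>"
    result = extract_with_fallback(
        html,
        primary=lambda h: "Loading...",         # stage 1 stub: shell only
        fallback=lambda h: "full article text", # stage 2 stub: real content
    )
    assert result == "full article text"

def test_normal_page_keeps_primary_result():
    # Small page: the 1% threshold is far below the stage 1 output length.
    html = "<p>hello</p>"
    result = extract_with_fallback(html, lambda h: "hello", lambda h: "unused")
    assert result == "hello"

test_loading_shell_triggers_fallback()
test_normal_page_keeps_primary_result()
```

Stubbing the stages keeps the tests fast and independent of the Node.js Readability dependency.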

Breaking Changes

None. The fix only activates when Readability extracts less than 1% of the HTML size as text. Normal sites where Readability works correctly are completely unaffected.

Test plan

  • All 6 new fallback tests pass
  • Existing tests unaffected (they test Readability directly via Node.js, independent of this change)
  • No new dependencies added

Fixes #3878

Add a three-stage extraction pipeline to extract_content_from_html():

1. Readability (existing, best quality for standard pages)
2. readabilipy without Readability JS (less aggressive, no CSS visibility filtering)
3. Raw markdownify conversion (last resort)

Stages 2 and 3 only activate when stage 1 produces text shorter than 1% of the
input HTML length, which indicates Readability stripped meaningful content. This
commonly happens with progressive SSR sites that deliver content in hidden
containers (visibility:hidden, position:absolute) awaiting client-side hydration.

No new dependencies. No behavior change for sites where Readability works correctly.

Fixes modelcontextprotocol#3878
Member

@olaservo olaservo left a comment


This is the strongest of the three Readability fallback PRs (#3879, #3894, #3922):

  • 3-stage pipeline (Readability → readabilipy without Readability → raw markdownify) gives a good quality gradient
  • Proportional 1% threshold scales with page size, unlike a fixed constant
  • Preserves the <error> return for truly empty pages (no behavior change)
  • 6 fully-mocked deterministic tests with good edge case coverage
  • Smallest diff to production code (only 4 lines removed)

We'll close the other two PRs with credit to @morozow for filing the original issue (#3878).


This review was assisted by Claude Code.

@morozow

morozow commented Apr 14, 2026

@Christian-Sidak @olaservo I analyzed both implementations (#3922 and #3879) in the context of the MCP fetch server and the expected contract of a transport-level content extractor.

The #3922 (current) variant introduces opinionated validation via length thresholds, .strip(), and a hard <error> state. This breaks neutrality: it can discard valid outputs (e.g., whitespace-only content), produces non-data responses, and makes the result depend on heuristics tied to HTML size. In MCP terms, this violates separation of concerns: the fetch layer should not decide what counts as "good enough" content.

The #3879 variant behaves as a proper extraction primitive: it attempts Readability, falls back when needed, and always returns the extracted content without enforcing interpretation or artificial failure states. This keeps the pipeline predictable for agents and preserves full fidelity of the source, which is critical for downstream processing.

The #3879 variant is the correct approach for MCP fetch, as it maintains a clean transport contract and avoids embedding policy and validation logic in the extraction layer.

This PR needs either refactoring or a revert to the implementation from #3879/#3947 to be eligible for merge and to properly resolve the issue described in #3878.

@morozow

morozow commented Apr 14, 2026

@olaservo Updated the tests for full edge-case coverage and reopened as #3947.



Development

Successfully merging this pull request may close these issues.

mcp-server-fetch drops SSR content from streaming/progressive rendering sites

3 participants