fix(fetch): fall back when Readability strips hidden SSR content#3922

Open
Christian-Sidak wants to merge 1 commit into modelcontextprotocol:main from Christian-Sidak:fix/fetch-ssr-content-fallback

Conversation

@Christian-Sidak

Summary

  • Adds a three-stage fallback to extract_content_from_html() so that pages using progressive SSR (hidden pre-hydration markup) are not silently reduced to a single line of loading-shell text
  • Stage 1: Readability (existing behavior, unchanged for normal sites)
  • Stage 2: readabilipy without Readability JS (less aggressive, does not filter by CSS visibility)
  • Stage 3: Raw markdownify conversion (last resort)
  • Fallback only activates when Readability output is shorter than 1% of the input HTML
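The staged fallback described above can be sketched in a few lines. This is a hedged illustration with made-up names (`staged_extract` and the extractor callables are not from the PR); the real implementation in server.py chains Readability, readabilipy, and markdownify:

```python
# Hypothetical sketch of a staged extraction fallback. The extractor
# callables stand in for Readability / readabilipy / markdownify.

def staged_extract(html: str, extractors, min_ratio: float = 0.01) -> str:
    """Try each extractor in order; accept the first result whose text
    length is at least min_ratio (1% by default) of the input HTML length.
    If no stage clears the threshold, return the last stage's output."""
    threshold = len(html) * min_ratio
    result = ""
    for extract in extractors:
        result = extract(html)
        if len(result) >= threshold:
            return result
    return result  # last resort: whatever the final stage produced
```

Because the threshold is proportional to the input size, a 500 KB SSR payload that Readability reduces to a one-line loading shell falls through to the next stage, while a small page whose legitimate article text is short does not.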

Motivation

Sites using progressive server-side rendering (Next.js streaming, Remix deferred, custom Lambda SSR) deliver content in two phases: a small visible loading shell, then the real content in a hidden container (visibility:hidden; position:absolute; top:-9999px) that becomes visible after client-side hydration. Mozilla Readability treats hidden elements as non-content and strips them entirely, causing mcp-server-fetch to return only the loading shell text with no indication that content was lost.

For example, fetching https://runtimeweb.com returns just "Unified Serverless Framework for Full-Stack TypeScript Applications" instead of the full page content.
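To make the failure mode concrete, here is a toy visibility-aware extractor that mimics how a Readability-style parser drops hidden pre-hydration containers. The page markup is invented for illustration (it is not runtimeweb.com's actual HTML), and the parser is deliberately naive:

```python
from html.parser import HTMLParser

# Toy example only: real pages have void tags and external CSS that this
# sketch does not handle.

PAGE = """
<div id="shell">Loading...</div>
<div id="app" style="visibility:hidden;position:absolute;top:-9999px">
  The real article text, delivered before hydration.
</div>
"""

class TextExtractor(HTMLParser):
    def __init__(self, skip_hidden: bool):
        super().__init__()
        self.skip_hidden = skip_hidden
        self.hidden_depth = 0  # >0 while inside a hidden subtree
        self.parts = []

    def handle_starttag(self, tag, attrs):
        style = dict(attrs).get("style", "")
        if self.skip_hidden and "visibility:hidden" in style.replace(" ", ""):
            self.hidden_depth += 1
        elif self.hidden_depth:
            self.hidden_depth += 1  # nested tag inside a hidden subtree

    def handle_endtag(self, tag):
        if self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth:
            self.parts.append(data.strip())

def extract(html: str, skip_hidden: bool) -> str:
    parser = TextExtractor(skip_hidden)
    parser.feed(html)
    return " ".join(p for p in parser.parts if p)
```

With `skip_hidden=True` (the Readability-like behavior) only the loading shell survives; with `skip_hidden=False` (the behavior of the later fallback stages) the hidden article text is preserved.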

Changes

  • src/fetch/src/mcp_server_fetch/server.py: Modified extract_content_from_html() to try three extraction stages, falling back only when the previous stage produces disproportionately little text
  • src/fetch/tests/test_server.py: Added 6 unit tests covering all fallback paths, threshold behavior, and no-regression for normal pages
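A hedged sketch of how one such fallback path can be covered deterministically with stubbed stages (the helper and test names here are illustrative; the PR's actual tests mock the real extractors in test_server.py):

```python
# Simplified two-stage stand-in for the extractor under test.

def extract_with_fallback(html, primary, fallback, min_ratio=0.01):
    text = primary(html)
    if len(text) < len(html) * min_ratio:
        text = fallback(html)
    return text

def test_loading_shell_triggers_fallback():
    # Large page whose primary extractor returns only loading-shell text.
    html = "<div>" + "x" * 10_000 + "</div>"
    result = extract_with_fallback(
        html,
        primary=lambda h: "Loading...",         # stage 1 stub: shell only
        fallback=lambda h: "full article text", # stage 2 stub: real content
    )
    assert result == "full article text"

def test_normal_page_keeps_primary_result():
    # Small page: the 1% threshold is far below the stage 1 output length.
    html = "<p>hello</p>"
    result = extract_with_fallback(html, lambda h: "hello", lambda h: "unused")
    assert result == "hello"

test_loading_shell_triggers_fallback()
test_normal_page_keeps_primary_result()
```

Stubbing the stages keeps the tests fast and independent of the Node.js Readability dependency.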

Breaking Changes

None. The fix only activates when Readability extracts less than 1% of the HTML size as text. Normal sites where Readability works correctly are completely unaffected.

Test plan

  • All 6 new fallback tests pass
  • Existing tests unaffected (they test Readability directly via Node.js, independent of this change)
  • No new dependencies added

Fixes #3878

Add a three-stage extraction pipeline to extract_content_from_html():

1. Readability (existing, best quality for standard pages)
2. readabilipy without Readability JS (less aggressive, no CSS visibility filtering)
3. Raw markdownify conversion (last resort)

Stages 2 and 3 only activate when stage 1 produces text shorter than 1% of the
input HTML length, which indicates Readability stripped meaningful content. This
commonly happens with progressive SSR sites that deliver content in hidden
containers (visibility:hidden, position:absolute) awaiting client-side hydration.

No new dependencies. No behavior change for sites where Readability works correctly.

Fixes modelcontextprotocol#3878
Member

@olaservo olaservo left a comment


This is the strongest of the three Readability fallback PRs (#3879, #3894, #3922):

  • 3-stage pipeline (Readability → readabilipy without Readability → raw markdownify) gives a good quality gradient
  • Proportional 1% threshold scales with page size, unlike a fixed constant
  • Preserves the <error> return for truly empty pages (no behavior change)
  • 6 fully-mocked deterministic tests with good edge case coverage
  • Smallest diff to production code (only 4 lines removed)

We'll close the other two PRs with credit to @morozow for filing the original issue (#3878).


This review was assisted by Claude Code.

@morozow

morozow commented Apr 14, 2026

@Christian-Sidak @olaservo I analyzed both implementations (#3922 and #3879) in the context of the MCP fetch server and the expected contract of a transport-level content extractor.

The #3922 (current) variant introduces opinionated validation via length thresholds, .strip(), and a hard <error> state. This breaks neutrality: it can discard valid outputs (e.g., whitespace-only content), produces non-data responses, and makes the result depend on heuristics tied to HTML size. In MCP terms, this violates separation of concerns: the fetch layer should not decide what counts as "good enough" content.

The #3879 variant behaves as a proper extraction primitive: it attempts Readability, falls back when needed, and always returns the extracted content without enforcing interpretation or artificial failure states. This keeps the pipeline predictable for agents and preserves full fidelity of the source, which is critical for downstream processing.

The #3879 variant is the correct approach for MCP fetch, as it maintains a clean transport contract and avoids embedding policy and validation logic in the extraction layer.

This PR needs either refactoring or a revert to the implementation from #3879/#3947 to be eligible for merge and to properly resolve the issue described in #3878.

@morozow

morozow commented Apr 14, 2026

@olaservo Updated the tests for full edge-case coverage and reopened as #3947.



Development

Successfully merging this pull request may close these issues.

mcp-server-fetch drops SSR content from streaming/progressive rendering sites

3 participants