Bug report
Bug description:
Summary
_ssl__SSLSocket_read_impl and _ssl__SSLSocket_write_impl in Modules/_ssl.c do not call ERR_clear_error() before SSL_read_ex() / SSL_write_ex(). This allows stale entries on the per-thread OpenSSL error queue to corrupt the result of SSL_get_error(), causing spurious BrokenPipeError or OSError on healthy SSL connections.
Affected versions
All current CPython versions. Confirmed in the 3.12 branch (Modules/_ssl.c line 2544) and in main / 3.15-dev (line 2941).
Root cause
The do { ... } while() retry loop in _ssl__SSLSocket_read_impl (Modules/_ssl.c L2939-2942 on main):
do {
    Py_BEGIN_ALLOW_THREADS;
    retval = SSL_read_ex(self->ssl, mem, (size_t)len, &count);
    err = _PySSL_errno(retval == 0, self->ssl, retval);
    Py_END_ALLOW_THREADS;
    // ...
} while (err.ssl == SSL_ERROR_WANT_READ || err.ssl == SSL_ERROR_WANT_WRITE);
_PySSL_errno() calls SSL_get_error(ssl, retcode), which internally calls ERR_peek_last_error(). Per the OpenSSL documentation:
"In addition to ssl and ret, SSL_get_error() inspects the current thread's OpenSSL error queue. Thus, SSL_get_error() must be called in the same thread that performed the TLS/SSL I/O operation, and no other OpenSSL function calls should appear in between. The current thread's error queue must be empty before the TLS/SSL I/O operation is attempted, or SSL_get_error() will not work reliably."
If stale error entries are present on the queue from a prior SSL operation (on the same thread but a different SSL object), SSL_get_error() misattributes them and returns SSL_ERROR_SYSCALL instead of the correct SSL_ERROR_WANT_READ.
The same issue exists in _ssl__SSLSocket_write_impl.
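The misattribution described above can be illustrated with a toy model. This is only a sketch of the decision order the report describes, not OpenSSL's actual code: the function name model_ssl_get_error, its parameters, and the "SYS" tag are hypothetical, and which end of the queue OpenSSL inspects does not matter for this illustration.

```python
from collections import deque


def model_ssl_get_error(io_succeeded, error_queue, transport_would_block):
    """Toy stand-in for SSL_get_error(); not OpenSSL's actual code."""
    if io_succeeded:
        return "SSL_ERROR_NONE"
    if error_queue:
        # Any queued entry is attributed to the call that just failed,
        # regardless of which SSL object originally produced it.
        library = error_queue[0]
        return "SSL_ERROR_SYSCALL" if library == "SYS" else "SSL_ERROR_SSL"
    if transport_would_block:
        # Only reached when the queue is empty.
        return "SSL_ERROR_WANT_READ"
    return "SSL_ERROR_SYSCALL"


clean_queue = deque()
stale_queue = deque(["SYS"])  # left behind by an earlier failed write elsewhere

# Same failed read (no data yet), two different queue states.
print(model_ssl_get_error(False, clean_queue, True))  # SSL_ERROR_WANT_READ
print(model_ssl_get_error(False, stale_queue, True))  # SSL_ERROR_SYSCALL
```

In this model the only difference between the two calls is the state of the queue, which is why emptying it before the I/O call restores the expected result.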
When this manifests
This bug is invisible in programs that give each connection its own OS thread, because each OS thread has its own OpenSSL error queue. It becomes critical in cooperative multitasking frameworks (gevent, eventlet, asyncio with SSL), where multiple coroutines or greenlets share a single OS thread and thus a single OpenSSL error queue.
Concrete scenario (gevent):
1. Greenlet A performs an SSL write on an HTTPS connection. The remote client has disconnected, so SSL_write_ex() → send() fails with EPIPE. OpenSSL pushes an error entry onto the (per-thread) error queue. The greenlet handles the exception, but the error queue is not cleared.
2. The gevent hub switches to Greenlet B, which is an AMQP consumer doing SSL_read_ex() on a healthy RabbitMQ connection.
   - SSL_read_ex() → recv() fails with EAGAIN (no data available, which is normal for a non-blocking socket).
   - SSL_get_error() finds the stale error from step 1 via ERR_peek_last_error() and returns SSL_ERROR_SYSCALL instead of SSL_ERROR_WANT_READ.
   - _PySSL_errno() captures errno = 32 (stale EPIPE from step 1).
3. CPython exits the retry loop, enters PySSL_SetError(), and raises BrokenPipeError(errno=32, "Broken pipe") on a perfectly healthy connection.
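The difference between per-thread and shared queues in the scenario above can be sketched without OpenSSL or gevent. In this illustration, threading.local stands in for OpenSSL's per-thread error queue and plain generators stand in for greenlets; every name here is hypothetical and nothing in it calls into the ssl module.

```python
import threading

_per_thread = threading.local()  # stand-in for OpenSSL's per-thread error queue


def error_queue():
    """Return the calling thread's private list of pending errors."""
    if not hasattr(_per_thread, "errors"):
        _per_thread.errors = []
    return _per_thread.errors


def failing_writer():
    error_queue().append("EPIPE")  # the failed write leaves an entry behind
    yield                          # hand control back to the scheduler


def healthy_reader(observed):
    yield                                 # let the writer run first
    observed.append(list(error_queue()))  # what the reader finds queued


def drive(*tasks):
    """Round-robin scheduler: a minimal stand-in for a greenlet hub."""
    pending = list(tasks)
    while pending:
        for task in list(pending):
            try:
                next(task)
            except StopIteration:
                pending.remove(task)


def run_in_new_thread(target, *args):
    worker = threading.Thread(target=target, args=args)
    worker.start()
    worker.join()


def one_thread_per_task():
    observed = []
    run_in_new_thread(drive, failing_writer())
    run_in_new_thread(drive, healthy_reader(observed))
    return observed[0]


def tasks_sharing_one_thread():
    observed = []
    run_in_new_thread(drive, failing_writer(), healthy_reader(observed))
    return observed[0]


print(one_thread_per_task())       # []
print(tasks_sharing_one_thread())  # ['EPIPE']
```

When each task has its own thread, the reader sees an empty queue. When both tasks are scheduled on one thread, the reader inherits the writer's leftover entry, which mirrors what Greenlet B encounters in step 2.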
Evidence
- Disassembly: The compiled _ssl.cpython-312-x86_64-linux-gnu.so confirms no ERR_clear_error (PLT 0x9050) before SSL_read_ex (PLT 0x93b0) at the call site.
- Production telemetry: At the moment of every BrokenPipeError, getsockopt(SO_ERROR) returns 0 (no kernel-level error), and tcpdump shows no FIN/RST from the remote side, so the TCP connection is healthy.
- Workaround validation: Calling ERR_clear_error() (via ctypes) before every _sslobj.read() in a monkey-patched ssl.SSLSocket.read() completely eliminates the spurious errors. Tested for 15+ minutes under production load with zero errors, after months of constant failures every ~90 seconds.
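A minimal sketch of such a ctypes workaround follows. It is not the exact patch used in production. It assumes that _ssl is a shared extension module dynamically linked against libcrypto, so that ERR_clear_error can be resolved through the module's own file; the helper names load_err_clear_error and clearing_error_queue are hypothetical. Where the symbol cannot be resolved (for example a statically linked build), the wrapper leaves the method unchanged.

```python
import ctypes
import functools
import _ssl


def load_err_clear_error():
    """Return OpenSSL's ERR_clear_error as a Python callable, or None."""
    path = getattr(_ssl, "__file__", None)  # absent if _ssl is built in
    if path is None:
        return None
    try:
        # Assumption: resolving through _ssl's own shared object reaches the
        # same libcrypto instance that the ssl module itself uses.
        func = ctypes.CDLL(path).ERR_clear_error
    except (OSError, AttributeError):
        return None
    func.argtypes = []
    func.restype = None
    return func


def clearing_error_queue(method, clear=None):
    """Wrap method so the OpenSSL error queue is emptied before each call."""
    if clear is None:
        clear = load_err_clear_error()
    if clear is None:
        return method  # symbol unavailable: leave the method untouched

    @functools.wraps(method)
    def wrapper(*args, **kwargs):
        clear()
        return method(*args, **kwargs)

    return wrapper
```

It could be applied as ssl.SSLSocket.read = clearing_error_queue(ssl.SSLSocket.read). Under gevent the class that actually needs patching may be gevent's own SSLSocket replacement, so the patch target is deployment-specific.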
Proposed fix
Add ERR_clear_error() before SSL_read_ex() and SSL_write_ex() in their respective retry loops:
do {
    Py_BEGIN_ALLOW_THREADS;
    ERR_clear_error(); /* Prevent stale errors from affecting SSL_get_error() */
    retval = SSL_read_ex(self->ssl, mem, (size_t)len, &count);
    err = _PySSL_errno(retval == 0, self->ssl, retval);
    Py_END_ALLOW_THREADS;
    // ...
} while (err.ssl == SSL_ERROR_WANT_READ || err.ssl == SSL_ERROR_WANT_WRITE);
This matches OpenSSL's documented requirement and is consistent with how CPython already calls ERR_clear_error() in other SSL functions (e.g., _ssl__SSLSocket_do_handshake_impl, _ssl_ctx_new).
Reproducer
A minimal reproducer requires two SSL connections on the same OS thread. In pseudocode:
# Assumed setup, as in a typical gevent deployment: patch the standard
# library before ssl/socket are imported so blocking calls yield to the hub.
from gevent import monkey
monkey.patch_all()

import ssl, socket, gevent

def writer_greenlet():
    """SSL connection that will fail, leaving a stale error on the queue."""
    ctx = ssl.create_default_context()
    sock = ctx.wrap_socket(socket.socket(), server_hostname="...")
    sock.connect(...)
    # Remote side disconnects
    sock.write(b"data")  # raises BrokenPipeError and leaves a stale OpenSSL error

def reader_greenlet():
    """Healthy SSL connection that reads and gets a spurious BrokenPipeError."""
    ctx = ssl.create_default_context()
    sock = ctx.wrap_socket(socket.socket(), server_hostname="...")
    sock.connect(...)
    # This should block waiting for data, but instead raises BrokenPipeError
    sock.read(4096)  # BrokenPipeError on a HEALTHY connection

gevent.joinall([
    gevent.spawn(writer_greenlet),
    gevent.spawn(reader_greenlet),
])
Versions
- CPython: 3.12.12, also present in main (3.15-dev, commit d14e31e)
- OpenSSL: 3.5.1 (also reproducible with 3.0.x, 3.2.x)
- OS: RHEL 10.1
- gevent: 25.4.1 / 25.8.2
The pseudocode reproducer is schematic — in practice, the trigger requires precise greenlet switching timing. The production scenario (AMQP consumers + HTTPS server in gevent) triggers it reliably every ~90 seconds.
CPython versions tested on:
3.12
Operating systems tested on:
Linux
Related
- PySSL_SetError needs to be updated #115627 (commit ea9a296): improved PySSL_SetError handling for SSL_ERROR_SYSCALL, but only in main; does not add ERR_clear_error() before read/write calls.
- ERR_LIB_SYS handling improvement, backported to 3.12; does not address this issue.