Description
- Initially raised as discussion Percent encoding `|` in paths #3479
I got no response, so I'm opening this issue for more visibility.
OS: Windows 11
python --version: Python 3.12.8
httpx version: 0.28.1
I believe | should be percent-encoded in paths, which is not currently the case. If I'm understanding RFC 3986 correctly, path characters are pchar, which can be unreserved, pct-encoded, sub-delims, ":", or "@". unreserved can be ALPHA, DIGIT, "-", ".", "_", or "~". pct-encoded covers percent-encoding sequences ("%" followed by two hex digits). sub-delims can be "!", "$", "&", "'", "(", ")", "*", "+", ",", ";", or "=". The | character appears nowhere in this set, meaning it has to be percent-encoded.
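To make the grammar argument concrete, here is a minimal sketch that builds the set of characters RFC 3986 allows literally in a path segment and checks a few characters against it. The names `PCHAR_LITERALS` and `is_literal_pchar` are hypothetical, made up for this illustration; they are not part of httpx.

```python
import string

# Character classes from RFC 3986 (sections 2.2, 2.3 and 3.3).
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
SUB_DELIMS = set("!$&'()*+,;=")

# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
# pct-encoded is "%" plus two hex digits, so it is not a single literal
# character and is left out of this set.
PCHAR_LITERALS = UNRESERVED | SUB_DELIMS | {":", "@"}


def is_literal_pchar(ch: str) -> bool:
    """Hypothetical helper: True if ch may appear unencoded in a path segment."""
    return ch in PCHAR_LITERALS


# Print each sample character and whether it may appear literally.
for ch in ["a", "~", "@", " ", "|"]:
    print(repr(ch), is_literal_pchar(ch))
```

Under this reading of the RFC, both the space and | fall outside the set, so both would need percent-encoding.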
To simplify my problem: httpx seems to call its internal urlparse function to process URLs, so here's an example using that function. It normally percent-encodes characters as needed, like spaces:
httpx._urlparse.urlparse('http://example.com/ ') will return
ParseResult(scheme='http', userinfo='', host='example.com', port=None, path='/%20', query=None, fragment=None)
However, this does not happen for |:
httpx._urlparse.urlparse('http://example.com/|') will return
ParseResult(scheme='http', userinfo='', host='example.com', port=None, path='/|', query=None, fragment=None)
In Firefox and Google Chrome, | is percent-encoded:
encodeURI('http://example.com/|') will return
"http://example.com/%7C"
In the requests library, | is also percent-encoded:
requests.utils.requote_uri('http://example.com/|') will return
'http://example.com/%7C'
The rfc3986 library also percent-encodes |:
rfc3986.urlparse('http://example.com/|') will return
ParseResult(scheme='http', userinfo=None, host='example.com', port=None, path='/%7C', query=None, fragment=None)
Using urllib itself, | also seems to be percent-encoded for path components:
urllib.parse.quote('/|') will return
'/%7C'
I'm fairly certain that I've interpreted this RFC right, and I think that | should be excluded from the PATH_SAFE set here. Here is its current value: "!$%&'()*+,-./0123456789:;=@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_abcdefghijklmnopqrstuvwxyz|~".
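As a rough cross-check, the quoted PATH_SAFE value can be compared against the RFC 3986 path characters. This sketch assumes the value above is a Python string literal (so `\\` stands for one backslash) and treats "/" and "%" as allowed, since they act as the segment separator and the percent-encoding marker. `RFC_PATH_LITERALS` is a name invented for this illustration.

```python
import string

# PATH_SAFE as quoted in this issue, read as a Python string literal.
# Assumption: "\\" in the quote represents a single backslash character.
PATH_SAFE = "!$%&'()*+,-./0123456789:;=@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_abcdefghijklmnopqrstuvwxyz|~"

# RFC 3986 character classes.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
SUB_DELIMS = set("!$&'()*+,;=")

# pchar literals, plus "/" (segment separator) and "%" (starts pct-encoded).
RFC_PATH_LITERALS = UNRESERVED | SUB_DELIMS | {":", "@", "/", "%"}

# Characters PATH_SAFE leaves unencoded that the RFC path grammar does not list.
extra = sorted(set(PATH_SAFE) - RFC_PATH_LITERALS)
# Characters the RFC allows that PATH_SAFE would encode.
missing = sorted(RFC_PATH_LITERALS - set(PATH_SAFE))

print("in PATH_SAFE but not in the RFC set:", extra)
print("in the RFC set but not in PATH_SAFE:", missing)
```

With the value quoted above, `extra` contains | along with "[", "\", "]" and "^", and `missing` is empty. Whether those other characters should be treated the same way is outside what this issue asks for.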
Potential Fix: nathaniel-daniel@a2f327f