gh-144015: Add portable SIMD optimization for bytes.hex()#143991

Merged
gpshead merged 25 commits into python:main from gpshead:opt-pystrhex
Feb 23, 2026

Conversation

@gpshead
Member

@gpshead gpshead commented Jan 18, 2026

Add SIMD optimization for bytes.hex(), bytearray.hex(), and binascii.hexlify(), as well as hashlib's .hexdigest() methods, using platform-agnostic GCC/Clang vector extensions that compile to native SIMD instructions on our PEP-11 Tier 1 Linux and macOS platforms.

  • Up to 11x faster for large data (1KB+)
  • 1.1-3x faster for common small data (16-64 bytes, covering md5 through sha512 digest sizes)
  • Retains the existing scalar code for short inputs (<16 bytes) and for platforms lacking SIMD instructions, with no observable performance regressions there.

Supported platforms:

  • x86-64: the compiler generates SSE2 - always available, no flags or CPU feature checks needed
  • ARM64: NEON is always available - no flags or CPU feature checks needed
  • ARM32: Requires NEON support and compiler flags that enable it (e.g., -march=native on a Raspberry Pi 3+). We could use runtime detection to allow NEON when compiled without a recent enough -march= flag (cortex-a53 and later, IIRC), but there are diminishing returns in doing so: anyone using 32-bit ARM in a situation where performance matters will already be compiling with such flags. (As opposed to 32-bit Raspbian, whose default build targets the armv6 armhf architecture for compatibility with the rpi 1 and 0, which lacks NEON.)
  • Windows/MSVC: Not supported in this PR. MSVC lacks __builtin_shufflevector, so the existing scalar path is used. This is left as a future opportunity for someone to figure out how to express the same intent to that compiler.

Are there SIMD risks?

This is compile-time detection of features that are always available on the target architectures. No runtime feature inspection is needed, thus we do not require #125022.

The platform ifdefs you see in configure.ac, beyond the check for __builtin_shufflevector itself, exist because compilers make the builtin available regardless of whether it makes sense to use. Building on aarch32 with the rpi's default armv6l-hf configuration, for example, has it available but produced 4x slower code. Only when the correct -march= flag is passed, enabling the __ARM_NEON define and the matching code generation, does it produce faster code. So this check is tuned to accept only known-good configurations on our PEP-11 tiered architectures.

Performance details

Benchmarked using https://github.com/python/cpython/blob/0f94c061d49821a74096e57df8dff9617b80fad7/Tools/scripts/pystrhex_benchmark.py

Performance wins confirmed across the board on x86_64 (zen2), ARM64 (RPi4), ARM32 (RPi5 running 32-bit Raspbian with -march=native to enable NEON), and ARM64 Apple M4.

The commit history on this branch contains earlier experiments for reference.

Example benchmark results (M4):

  1. bytes.hex() without separator: Scales extremely well - 1.02x at 16 bytes up to 9.8x at 4KB.
  2. hashlib hexdigest: Modest 7-15% improvement on the hex conversion portion; the hash computation dominates total time.
Expand to see the M4 performance table:
  bytes.hex() (no separator)
  ┌────────────┬───────────┬───────────┬─────────┐
  │    Size    │ Baseline  │ Optimized │ Speedup │
  ├────────────┼───────────┼───────────┼─────────┤
  │ 16 bytes   │ 22.9 ns   │ 22.4 ns   │ 1.02x   │
  ├────────────┼───────────┼───────────┼─────────┤
  │ 32 bytes   │ 28.4 ns   │ 22.7 ns   │ 1.25x   │
  ├────────────┼───────────┼───────────┼─────────┤
  │ 64 bytes   │ 44.4 ns   │ 24.4 ns   │ 1.82x   │
  ├────────────┼───────────┼───────────┼─────────┤
  │ 256 bytes  │ 154.9 ns  │ 47.6 ns   │ 3.25x   │
  ├────────────┼───────────┼───────────┼─────────┤
  │ 4096 bytes │ 1969.2 ns │ 201.6 ns  │ 9.8x    │
  └────────────┴───────────┴───────────┴─────────┘
  hashlib hexdigest (hash + hex conversion)
  ┌───────────────────┬──────────┬───────────┬─────────┐
  │      Digest       │ Baseline │ Optimized │ Speedup │
  ├───────────────────┼──────────┼───────────┼─────────┤
  │ md5 (16 bytes)    │ 238.2 ns │ 231.7 ns  │ 1.03x   │
  ├───────────────────┼──────────┼───────────┼─────────┤
  │ sha1 (20 bytes)   │ 210.8 ns │ 197.3 ns  │ 1.07x   │
  ├───────────────────┼──────────┼───────────┼─────────┤
  │ sha256 (32 bytes) │ 214.6 ns │ 200.0 ns  │ 1.07x   │
  ├───────────────────┼──────────┼───────────┼─────────┤
  │ sha512 (64 bytes) │ 282.9 ns │ 255.9 ns  │ 1.11x   │
  └───────────────────┴──────────┴───────────┴─────────┘
More typical increase on a Raspberry Pi 5 (32-bit OS `CFLAGS=-march=native`) - _AMD Zen2 shows similar speedups on the small end and 8x on the large end_:
  bytes.hex() (no separator)
  ┌────────────┬───────────┬───────────┬─────────┐
  │    Size    │ Baseline  │ Optimized │ Speedup │
  ├────────────┼───────────┼───────────┼─────────┤
  │ 16 bytes   │ 99.9 ns   │ 75.6 ns   │ 1.3x    │
  ├────────────┼───────────┼───────────┼─────────┤
  │ 32 bytes   │ 123.6 ns  │ 82.2 ns   │ 1.5x    │
  ├────────────┼───────────┼───────────┼─────────┤
  │ 64 bytes   │ 172.7 ns  │ 91.1 ns   │ 1.9x    │
  ├────────────┼───────────┼───────────┼─────────┤
  │ 256 bytes  │ 535 ns    │ 195 ns    │ 2.7x    │
  ├────────────┼───────────┼───────────┼─────────┤
  │ 4096 bytes │ 6322 ns   │ 1131 ns   │ 5.6x    │
  └────────────┴───────────┴───────────┴─────────┘
And if you're curious about the AVX path not taken by the end state of this PR, here it is on a zen4:
  bytes.hex() without separator
  ┌────────┬───────────┬─────────────────┬──────────────────┬──────────────────┐
  │  Size  │ Baseline  │     SIMD PR     │     AVX-512      │       AVX2       │
  ├────────┼───────────┼─────────────────┼──────────────────┼──────────────────┤
  │ 32 B   │ 44.7 ns   │ 27.4 ns (1.6x)  │ 29.2 ns (1.5x)   │ 29.0 ns (1.5x)   │
  ├────────┼───────────┼─────────────────┼──────────────────┼──────────────────┤
  │ 64 B   │ 64.5 ns   │ 28.3 ns (2.3x)  │ 29.2 ns (2.2x)   │ 29.4 ns (2.2x)   │
  ├────────┼───────────┼─────────────────┼──────────────────┼──────────────────┤
  │ 128 B  │ 104.8 ns  │ 31.7 ns (3.3x)  │ 29.0 ns (3.6x)   │ 30.8 ns (3.4x)   │
  ├────────┼───────────┼─────────────────┼──────────────────┼──────────────────┤
  │ 256 B  │ 185.8 ns  │ 45.0 ns (4.1x)  │ 35.9 ns (5.2x)   │ 40.4 ns (4.6x)   │
  ├────────┼───────────┼─────────────────┼──────────────────┼──────────────────┤
  │ 512 B  │ 361.1 ns  │ 75.3 ns (4.8x)  │ 55.0 ns (6.6x)   │ 61.4 ns (5.9x)   │
  ├────────┼───────────┼─────────────────┼──────────────────┼──────────────────┤
  │ 4096 B │ 2242.6 ns │ 278.1 ns (8.1x) │ 138.5 ns (16.2x) │ 174.0 ns (12.9x) │
  └────────┴───────────┴─────────────────┴──────────────────┴──────────────────┘
  The SIMD PR (SSE2/SSSE3) delivers strong speedups across the board, reaching 8x at 4KB.
  The AVX variants push further - AVX-512 hits 16x at 4KB, AVX2 achieves 13x.

gpshead and others added 16 commits January 18, 2026 02:04
Add AVX2-accelerated hexlify for the no-separator path when converting
bytes to hexadecimal strings. This processes 32 bytes per iteration
instead of 1, using:

- SIMD nibble extraction (shift + mask)
- Arithmetic nibble-to-hex conversion (branchless)
- Interleave operations for correct output ordering

Runtime CPU detection via CPUID ensures AVX2 is only used when
available. Falls back to scalar code for inputs < 32 bytes or when
AVX2 is not supported.

Performance improvement (bytes.hex() no separator):
- 32 bytes:   1.3x faster
- 64 bytes:   1.7x faster
- 128 bytes:  3.0x faster
- 256 bytes:  4.0x faster
- 512 bytes:  4.9x faster
- 4096 bytes: 11.9x faster

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add AVX-512 accelerated hexlify for the no-separator path when
available. This processes 64 bytes per iteration using:

- AVX-512F, AVX-512BW for 512-bit operations
- AVX-512VBMI for efficient byte-level permutation (permutex2var_epi8)
- Masked blend for branchless nibble-to-hex conversion

Runtime detection via CPUID checks for all three required extensions.
Falls back to AVX2 for 32-63 byte remainders, then scalar for <32 bytes.

CPU hierarchy:
- AVX-512 (F+BW+VBMI): 64 bytes/iteration, used for inputs >= 64 bytes
- AVX2: 32 bytes/iteration, used for inputs >= 32 bytes
- Scalar: remaining bytes

Expected performance improvement over AVX2 for large inputs (4KB+)
due to doubled throughput per iteration.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add NEON vectorized implementation for AArch64 that processes 16 bytes
per iteration using 128-bit NEON registers. Uses the same nibble-to-hex
arithmetic approach as AVX2/AVX-512 versions.

NEON is always available on AArch64, so no runtime detection is needed.
The implementation uses vzip1q_u8/vzip2q_u8 for interleaving high/low
nibbles into the correct output order.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add SSE2 vectorized implementation that processes 16 bytes per iteration.
SSE2 is always available on x86-64 (part of AMD64 baseline), so no runtime
detection is needed.

This provides SIMD acceleration for all x86-64 machines, even those without
AVX2. The dispatch now cascades: AVX-512 (64+ bytes) → AVX2 (32+ bytes) →
SSE2 (16+ bytes) → scalar.

Benchmarks show ~5-6% improvement for 16-20 byte inputs, which is useful
for common hash digest sizes (MD5=16 bytes, SHA1=20 bytes).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Benchmarks showed SSE2 performs nearly as well as AVX2 for most input
sizes (within 5% up to 256 bytes, within 8% at 512+ bytes). Since SSE2
is always available on x86-64 (part of the baseline), this eliminates:

- Runtime CPU feature detection via CPUID
- ~200 lines of AVX2/AVX-512 intrinsics code
- Maintenance burden of multiple SIMD implementations

The simpler SSE2-only approach provides most of the performance benefit
with significantly less code complexity.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…sions

Replace separate platform-specific SSE2 and NEON implementations with a
single unified implementation using GCC/Clang vector extensions. The
portable code uses __builtin_shufflevector for interleave operations,
which compiles to native SIMD instructions:
- x86-64: punpcklbw/punpckhbw (SSE2)
- ARM64: zip1/zip2 (NEON)

This eliminates code duplication while maintaining SIMD performance.
Requires GCC 12+ or Clang 3.0+ on x86-64 or ARM64.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Extend the portable SIMD hexlify to handle separator cases where
bytes_per_sep >= 16. Uses in-place shuffle: SIMD hexlify to output
buffer, then work backwards to insert separators via memmove.

For 4096 bytes with sep=32: ~3.3µs (vs ~7.3µs for sep=1 scalar).
Useful for hex dump style output like bytes.hex('\n', 32).

Also adds benchmark for newline separator every 32 bytes.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Lower the threshold from abs_bytes_per_sep >= 16 to >= 8 for the SIMD
hexlify + memmove shuffle path. Benchmarks show this is worthwhile for
sep=8 and above, but memmove overhead negates benefits for smaller values.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
GCC's vector extensions generate inefficient code for unsigned byte
comparison (hi > nine): psubusb + pcmpeqb + pcmpeqb (3 instructions).

By casting to signed bytes before comparison, GCC generates the
efficient pcmpgtb instruction instead. This is safe because nibble
values (0-15) are within signed byte range.

This reduces the SIMD loop from 29 to 25 instructions, matching the
performance of explicit SSE2 intrinsics while keeping the portable
vector extensions approach.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Extract the scalar hexlify loop into _Py_hexlify_scalar() which is
shared between the SIMD fallback path and the main non-SIMD path.
Uses table lookup via Py_hexdigits for consistency.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Extend portable SIMD support to ARM32 when NEON is available.
The __builtin_shufflevector interleave compiles to vzip instructions
on ARMv7 NEON, similar to zip1/zip2 on ARM64.

NEON is optional on 32-bit ARM (unlike ARM64 where it's mandatory),
so we check for __ARM_NEON in addition to __arm__.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add targeted tests for corner cases relevant to SIMD optimization:

- test_hex_simd_boundaries: Test lengths around the 16-byte SIMD
  threshold (14, 15, 16, 17, 31, 32, 33, 64, 65 bytes)

- test_hex_nibble_boundaries: Test the 9/10 nibble value boundary
  where digits become letters, verifying the signed comparison
  optimization works correctly

- test_hex_simd_separator: Test SIMD separator insertion path
  (triggered when sep >= 8 and len >= 16) with various group
  sizes and both positive/negative bytes_per_sep

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@gpshead gpshead self-assigned this Jan 18, 2026
@gpshead gpshead added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jan 18, 2026
@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @gpshead for commit 5fc294c 🤖

Results will be shown at:

https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F143991%2Fmerge

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jan 18, 2026
@gpshead
Member Author

gpshead commented Jan 18, 2026

Buildbot failures are all unrelated: test_capi, test__interpreters, test_urllib2net, etc.

@gpshead gpshead changed the title gh-XXXXXX: Add portable SIMD optimization for bytes.hex() gh-144015: Add portable SIMD optimization for bytes.hex() Jan 18, 2026
@gpshead gpshead marked this pull request as ready for review January 18, 2026 19:01
Member

@serhiy-storchaka serhiy-storchaka left a comment


How important is the performance of this operation? What precedents are there for using SIMD instructions in CPython?

I think this is worth discussing with a larger audience. To me, the cost/benefit ratio seems too high.

@ndren

ndren commented Jan 19, 2026

Doing something like:

        *dst++ = high + '0' + ((high > 9) ? 39 : 0);
        *dst++ = low  + '0' + ((low  > 9) ? 39 : 0);

seems to benefit the small case. I don't think gcc can see the near-linear relationship in the char array, so it just emits a memory load. Godbolt
It looks about 3-4x faster for inputs between 16 bytes and 1 KB, without any slowdown in the smaller cases, so maybe a better code-complexity/performance tradeoff if manual SIMD is considered too complex?

@gpshead
Member Author

gpshead commented Jan 22, 2026

Doing something like:

        *dst++ = high + '0' + ((high > 9) ? 39 : 0);
        *dst++ = low  + '0' + ((low  > 9) ? 39 : 0);

seems to benefit the small case. I don't think gcc can see the near-linear relationship in the char array, so it just emits a memory load. Godbolt It looks about 3-4x faster for inputs between 16 bytes and 1 KB, without any slowdown in the smaller cases, so maybe a better code-complexity/performance tradeoff if manual SIMD is considered too complex?

I'm pretty sure I tried an iteration of this form and found no difference for the scalar code - but I'll recheck on a couple compilers and platforms.

@gpshead
Member Author

gpshead commented Jan 22, 2026

How important is the performance of this operation? What precedents are there for using SIMD instructions in CPython?

I think this is worth discussing with a larger audience. To me, the cost/benefit ratio seems too high.

Valid questions. I've expanded the PR description. In this case, these are SIMD features being generated by the compiler that do not need CPU feature flags to be checked. They're really old instructions always available on the targeted architectures.

We already use more advanced SIMD within the hashlib _hacl generated code for the built-in hash and hmac implementations. #125011 exists as something being investigated to centralize some of that existing detection logic, but it is not actually needed for this PR's purpose. If I go forward with a SIMD base64, I'll want to lean on that detection PR.

Importance? Hard to say, but it does speed up a common thing people do with hashlib hashes (those are quite often used in hex form), and the code is self-contained and unlikely to ever need changes. If it hadn't shown signs of being useful on data that small, I wouldn't have bothered turning it into a PR. The large-data case speeds up a lot, but realistically I don't know of common practical uses for generating hex of large data.

I'm happy to remove the special case for accelerating the sep= codepath. There's probably less value in that and I could see that one being more of a maintenance question mark.

@gpshead gpshead requested a review from picnixz January 22, 2026 18:44
@vstinner
Member

@serhiy-storchaka:

How important is the performance of this operation? What precedents are there for using SIMD instructions in CPython?
I think this is worth discussing with a larger audience. To me, the cost/benefit ratio seems too high.

I also doubt that the speedup is worth the maintenance cost.

SIMD path for separator groups >= 8 bytes.

I expect that separator groups of 1 byte are the most common configuration. I don't think it's worth optimizing for separator groups >= 8 bytes.


@cosmicexplorer cosmicexplorer left a comment


I would like to understand the maintenance cost model that @serhiy-storchaka and @vstinner have in mind when evaluating this change. First off, I concur with the sentiment that hexing data seems like a very niche use case without any specific performance constraint to meet. However, that also makes the implementation very self-contained, as I argue in https://github.com/python/cpython/pull/143991/changes#r2837989606. If we think that incorporating SIMD into CPython is a long-term workstream, then this kind of self-contained change seems rather ideal.

I have an "MVP" in mind for demonstrating SIMD in CPython (see #125022 (comment)), constituting a really minimal set of string and byte set literal search/match operations for use cases like URL quoting, which currently rely on the implicit semantics of transitive C function calls for performance.

The SIMD-accelerated separator insertion code (for bytes_per_sep >= 8)
adds maintenance complexity with less clear benefit. Remove it along
with the test_hex_simd_separator test per reviewer feedback.

The no-separator SIMD path remains, which is the primary performance
win for bytes.hex(), bytearray.hex(), and binascii.hexlify().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the hardcoded arch/compiler preprocessor checks with a
configure.ac probe that tests __builtin_shufflevector with 128-bit
vectors directly. This follows the existing HAVE_BUILTIN_ATOMIC
pattern and lets configure determine what the compiler actually
supports rather than maintaining a list of arch/compiler combos.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…CTOR

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@gpshead gpshead added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Feb 22, 2026
@bedevere-bot

bedevere-bot commented Feb 22, 2026

🤖 New build scheduled with the buildbot fleet by @gpshead for commit 7e67993 🤖

Results will be shown at:

https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F143991%2Fmerge

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

Triage:

  • buildbot/AMD64 FreeBSD PR: unrelated
    • 1 test altered the execution environment (env changed): test.test_asyncio.test_events

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Feb 22, 2026
@gpshead gpshead merged commit ad4ee7c into python:main Feb 23, 2026
116 of 118 checks passed
@gpshead
Member Author

gpshead commented Feb 23, 2026

Thanks for all the reviewing, everybody! Simplified and merged, with autoconf-based feature detection.

I know there were concerns about the maintenance burden from a couple of you; that's on me. Feel free to tag me on anything that ever comes up regarding this code. I believe this corner of the codebase will need essentially no maintenance going forward: it should just exist and work and not need changing. If those are famous last words, I'll happily eat them. :)
