
Add xsimd::get<>() for optimized compile-time element extraction#1294

Open
DiamonDinoia wants to merge 2 commits into xtensor-stack:master from DiamonDinoia:feat/optimize-elem-extraction

Conversation

@DiamonDinoia
Contributor

Add a free function xsimd::get<I>(batch) API mirroring std::get<I>(tuple) for fast compile-time element extraction from SIMD batches.

Per-architecture optimized kernel::get overloads using the fastest available intrinsics:

  • SSE2: shuffle/shift + scalar convert
  • SSE4.1: pextrd/pextrq/pextrb/pextrw, bitcast + pextrd for float
  • AVX: vextractf128/vextracti128 + SSE4.1 delegate
  • AVX-512: vextracti64x4/vextractf32x4 + AVX delegate
  • NEON: vgetq_lane_* (single instruction for all types)
  • NEON64: vgetq_lane_f64

Also fixes a latent bug in the common fallback for complex batch compile-time get (wrong buffer type).

@DiamonDinoia force-pushed the feat/optimize-elem-extraction branch from b7725d8 to 0b6d85f on April 13, 2026 15:40
@DiamonDinoia force-pushed the feat/optimize-elem-extraction branch from 0b6d85f to c6dd311 on April 14, 2026 14:38
RVV only had runtime get(batch, size_t, requires_arch<rvv>) which
became ambiguous with the new compile-time get(batch, index<I>,
requires_arch<common>) because index<I> (std::integral_constant)
implicitly converts to size_t. Add index<I> overloads that delegate
to the runtime versions, matching the pattern used by SSE/AVX/NEON.
@DiamonDinoia
Contributor Author

Nice, thanks for fixing CI!

This is ready for review. Once approved I will rewrite the history. I don't want to trigger a useless CI run.

@DiamonDinoia marked this pull request as ready for review on April 14, 2026 17:27