buffer: improve performance of multiple Buffer operations#61871
buffer: improve performance of multiple Buffer operations#61871thisalihassan wants to merge 9 commits intonodejs:mainfrom
Conversation
|
Review requested:
|
d2ba38f to
495feb5
Compare
| void FastSwap16(Local<Value> receiver, | ||
| Local<Value> buffer_obj, | ||
| // NOLINTNEXTLINE(runtime/references) | ||
| FastApiCallbackOptions& options) { | ||
| HandleScope scope(options.isolate); | ||
| ArrayBufferViewContents<char> buffer(buffer_obj); | ||
| CHECK(nbytes::SwapBytes16(const_cast<char*>(buffer.data()), buffer.length())); | ||
| } |
There was a problem hiding this comment.
These fast callbacks are non-identical to the conventional callbacks they shadow.
- The existing callbacks validate their argument and throws to JS if invalid, whereas your fast callbacks hard-crash the process. It might be better to validate in the JS layer, then use the same unwrapping logic on both sides.
- Your fast callback cannot have a different return convention to the conventional callback. You will need to remove the return value from the conventional callback.
There was a problem hiding this comment.
Was removing the validation instead of moving it to js intentional?
Also fast paths can keep the validation and fall back to slow i think, so we can still validate on the src side?
There was a problem hiding this comment.
Also fast paths can keep the validation and fall back to slow i think, so we can still validate on the src side?
This is outdated, the fast API doesn't use fallback any more (since Node.js v23.x).
However, any validation in a JS wrapper should be shadowed by a CHECK or DCHECK in the C++ binding.
There was a problem hiding this comment.
SPREAD_BUFFER_ARG already contains CHECK((val)->IsArrayBufferView()) on all Slow & Fast swap methods
| void FastSwap16(Local<Value> receiver, | ||
| Local<Value> buffer_obj, | ||
| // NOLINTNEXTLINE(runtime/references) | ||
| FastApiCallbackOptions& options) { | ||
| HandleScope scope(options.isolate); | ||
| ArrayBufferViewContents<char> buffer(buffer_obj); | ||
| CHECK(nbytes::SwapBytes16(const_cast<char*>(buffer.data()), buffer.length())); | ||
| } |
There was a problem hiding this comment.
Fast callbacks should include debug tracking and call tests.
There was a problem hiding this comment.
Thanks for sharing these I will update the code
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #61871 +/- ##
========================================
Coverage 89.65% 89.65%
========================================
Files 676 676
Lines 206231 206365 +134
Branches 39505 39534 +29
========================================
+ Hits 184898 185019 +121
- Misses 13463 13475 +12
- Partials 7870 7871 +1
🚀 New features to boost your workflow:
|
1395d2f to
01ba74f
Compare
src/node_buffer.cc
Outdated
| FastApiCallbackOptions& options) { | ||
| TRACK_V8_FAST_API_CALL("buffer.swap16"); | ||
| HandleScope scope(options.isolate); | ||
| ArrayBufferViewContents<char> buffer(buffer_obj); |
There was a problem hiding this comment.
ArrayBufferViewContents is wrong here, as buffer.data() may be a stack-allocated copy of the byte data rather than the data itself. SPREAD_BUFFER_ARG is the correct macro to use here, as per the conventional callback.
There was a problem hiding this comment.
@Renegade334 replaced ArrayBufferViewContents with SPREAD_BUFFER_ARG in all three fast swap callbacks
There was a problem hiding this comment.
For toHex, wait until #61609, which improves native perf significantly (more than Uint8Array.prototype.toHex)
See also #60249 (comment)
| } else if (value.length === 1) { | ||
| // Fast path: If `value` fits into a single byte, use that numeric value. | ||
| if (normalizedEncoding === 'utf8') { | ||
| if (normalizedEncoding === 'utf8' || normalizedEncoding === 'ascii') { |
There was a problem hiding this comment.
Currently, ascii behaves exactly like latin1
Unsure if by design or accidentally
There was a problem hiding this comment.
yes, this is safe by design I am just extending the existing single byte numeric optimization to cover ASCII, since the guard already constrains it to the valid ASCII range.
|
Note on toBase64 / toBase64url: I also tried replacing the C++ base64Slice/base64urlSlice bindings with V8's Uint8Array.prototype.toBase64() (similar to the toHex change) but it caused a 35-54% regression across all buffer sizes so I reverted base64/base64url and kept only the toHex optimization which showed a clear +26-37% win. |
|
@thisalihassan toHex doesn't show a win anymore with Instead, it's ~3x slower. See nodejs/nbytes#12 |
|
Hi @ChALkeR thanks for flagging I was not aware. I benchmarked the nibble approach locally and it's indeed a much bigger win (~3x vs my ~30% with toHex). Reverted the toHex path entirely the other changes in this PR are unaffected. Should I include the nbytes nibble HexEncode optimization in this PR or keep them as separate PRs?
|
Hi @ChALkeR I see that your changes landed, congratualtions on that huge win. Is there benchmark-ci that we can run on this PR? I can rebase it |
|
I re-ran the benchmark against the latest main (which contains the nibble improvement) and found out there are few regressions caused by my code:
I tried fixing this regression but unavoidable Attaching Details below: buffer-benchmark-all-rebased.csv -- raw benchmark CSV |
|
compare-R-output.txt-- full output from Rscript benchmark/compare.R (shared as a file to avoid cluttering the PR comment). |
- copyBytesFrom: calculate byte offsets directly instead of
slicing into an intermediate typed array
- toString('hex'): use V8 Uint8Array.prototype.toHex() builtin
- fill: add single-char ASCII fast path
- indexOf: use indexOfString directly for ASCII encoding
- swap16/32/64: add V8 Fast API functions
- Guard ensureUint8ArrayToHex against --no-js-base-64 flag by falling back to C++ hexSlice when toHex is unavailable - Remove THROW_AND_RETURN_UNLESS_BUFFER and return value from slow Swap16/32/64 to match fast path conventions (JS validates) - Add TRACK_V8_FAST_API_CALL to FastSwap16/32/64 - Add test/parallel/test-buffer-swap-fast.js for fast API verification
Remove V8 Uint8Array.prototype.toHex() path for Buffer.toString('hex')
in favour of the upcoming nbytes HexEncode improvement (nodejs/nbytes#12)
which is ~3x faster through the existing C++ hexSlice path.
Refs: nodejs/nbytes#12
|
PS: I resolved some conflicts |
|
Hi @ChALkeR @anonrig @Renegade334 is this PR ready to land? can you also please check the latest benchmarks i have posted, I have compiled PDF for this benchmark in one my comments above for more detail
|
Qard
left a comment
There was a problem hiding this comment.
Generally LGTM, but one small nit.
|
Results from the CI, however it tells a slightly different story than my local benchmark at n=30 |
| return this; | ||
| } | ||
| return _swap16(this); | ||
| _swap16(this); |
There was a problem hiding this comment.
what would happen if it's called on not a TypeArrayView now?
There was a problem hiding this comment.
i.e. Buffer.prototype.swap16.apply(new Array(128))
There was a problem hiding this comment.
Good catch, I will add a JS side guard
There was a problem hiding this comment.
It changes the method to not being callable on a typedarray that is not an uint8array, but the previous behavior was inconsistent in that case anyway as the js part swapped elements and the src part swapped bytes
So I think explicitly blocking non-uint8arr should be fine and not a semver-major?
(perhaps cc @nodejs/lts just in case)
There was a problem hiding this comment.
you are right about that, I was pointing about the inconsistency about the different paths. as JS swapped the elements but alright I will work on the feedback
Thanks
There was a problem hiding this comment.
I'm not against changing the current state of things, but IMO that's a separate concern and should be discussed separately. Currently Buffer.prototype.swap16.call(new Float32Array(128)) does not throw, so if we want to make it throw let's open a separate semver-major PR.
My opinion is we should rely on TypedArrayPrototypeGetByteLength doing the identity check for us in this PR, better for performance and not worse than the status quo for DX/UX.
There was a problem hiding this comment.
Yes that's pretty valid @aduh95, I have used TypedArrayPrototypeGetByteLength I will open a seperate issue on this.
There was a problem hiding this comment.
Currently Buffer.prototype.swap16.call(new Float32Array(128)) does not throw,
It does not throw but returns garbage (as in, result is not predictable / does not follow any logic)
So anything can be done with it I think
Making it swap bytes is an option. Making it throw is another option.
The current behavior is length-dependent:
- 129 (516 bytes) throws and complains about non-even number of bytes
- 128 (512 bytes) does not throw (as you pointed out) and swaps bytes
- 127 (508 bytes) throws and complains about non-even number of bytes
- 126 (504 bytes) does not throw and swaps elements (groups of 4 bytes)
That is a mere oversight and not a proper api, and, more significantly, no one should be using that realistically as it doesn't work.
I don't think that making that a hard error is a semver-major.
There was a problem hiding this comment.
Passing a non-Buffer this is not documented, so it implicitly is undefined behavior.
Garbage out is IMO a good tradeoff for a more performant happy path, although I didn't run any benchmark.
no one should be using that realistically as it doesn't work.
100% agreed – and I would bet that absolutely no one is passing Float32Arrays to that API. Still, I don't think it's worth changing in this PR, which is about improving performance.
Failed to start CI⚠ Commits were pushed since the last approving review: ⚠ - buffer: improve performance of multiple Buffer operations ⚠ - buffer: address PR feedback ⚠ - buffer: revert toHex in favour of nbytes HexEncode update ⚠ - resolve feedback ⚠ - resolve feedback on fast callbacks ⚠ - added comments and updated tests ⚠ - resolve feedback ✘ Refusing to run CI on potentially unsafe PRhttps://github.com/nodejs/node/actions/runs/22977319444 |
Co-authored-by: Antoine du Hamel <duhamelantoine1995@gmail.com>




Multiple performance improvements to Buffer operations, verified with benchmarks (15-30 runs, comparing old vs new binaries built from same tree).
Buffer.copyBytesFrom()(+100-210%)Avoid intermediate
TypedArrayPrototypeSliceallocation by calculating byte offsets directly into the source TypedArray's underlying ArrayBuffer.Buffer.prototype.fill("t", "ascii")(+26-37%)ASCII
indexOf(+14-46%)Call
indexOfStringdirectly for ASCII encoding instead of first converting the search value to a Buffer viafromStringFastand then callingindexOfBuffer. ASCII and Latin-1 share the same byte values for characters 0-127.swap16/32/64(+3-38%)Add V8 Fast API C++ functions (
FastSwap16/32/64) alongside the existing slow path. Largest gains at len=256 (+35%).Benchmark results
Key results (15-30 runs,
***= p < 0.001):Attaching Details below:
detail.pdf -- visual breakdown
buffer-benchmark-all-rebased.csv -- raw benchmark CSV
compare-R-output.txt-- full output