[Java Extension] Ported the C extension parser to the Java and remove ragel generated parser. #1004

Open
samyron wants to merge 2 commits into
ruby:master from
samyron:sm/jruby-parser-rewrite

@samyron

@samyron samyron commented Jun 15, 2026

Contributor

Overview

This PR ports the current C extension parser to Java and removes the Ragel-generated parser.

Implementation Notes

  1. This uses the same frame-based parse loop as the C parser. However, since Java doesn't support goto, it dispatches through a while/switch loop.
  2. Building a JSON object uses RubyHash#fastASet instead of RubyHash#op_aset, because strings and symbols used as keys are explicitly frozen in this module. This contributed a significant performance boost to this parser. The keys are frozen in cachedKey / internedKey, which are called via parseString when isName=true.
  3. The parser includes SWAR and Vector API StringScanner implementations. The Vector API implementation is loaded at runtime if the JDK supports it, the module is available, and the ruby.json.useVectorizedParser=true JVM property is set. If the Vector API is unavailable or explicitly disabled, the SWAR implementation is used.
  4. This also uses the same rvalue_cache-style cache as a quick cache for object keys. However, since the cache is heap-allocated in Java, the size is 128 entries.
  5. The SWAR consecutive-digits decoder was ported over to this Java parser.
  6. It retains the Java extension's Unicode character validation but moves the logic from the Ragel-generated parser to the StringDecoder.
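
As an illustration of the while/switch dispatch described in note 1, here is a minimal sketch. It is not the PR's code: the class name, the phase constants, and the toy grammar (nested arrays of non-negative integers) are assumptions made for the example. Each phase plays the role of a goto label in a C-style parse loop, and an explicit frame stack replaces recursion.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

final class FrameLoopSketch {
    // Hypothetical phases standing in for the goto labels of a C parse loop.
    private static final int PHASE_VALUE = 0;
    private static final int PHASE_AFTER_VALUE = 1;
    private static final int PHASE_DONE = 2;

    /** Parses nested arrays of non-negative integers, e.g. "[1,[2,3],[]]", without recursion. */
    static Object parse(String src) {
        ArrayDeque<List<Object>> frames = new ArrayDeque<>(); // one frame per open array
        Object result = null;
        int pos = 0;
        int phase = PHASE_VALUE;

        while (phase != PHASE_DONE) {
            switch (phase) {
                case PHASE_VALUE: {
                    if (pos >= src.length()) throw new IllegalArgumentException("unexpected end");
                    char c = src.charAt(pos);
                    if (c == '[') {
                        pos++;
                        frames.push(new ArrayList<>());
                        if (pos < src.length() && src.charAt(pos) == ']') {
                            pos++;                 // empty array: close it immediately
                            result = frames.pop();
                            phase = PHASE_AFTER_VALUE;
                        }
                        // otherwise stay in PHASE_VALUE to read the first element
                    } else if (c >= '0' && c <= '9') {
                        long n = 0;
                        while (pos < src.length() && src.charAt(pos) >= '0' && src.charAt(pos) <= '9') {
                            n = n * 10 + (src.charAt(pos) - '0');
                            pos++;
                        }
                        result = n;
                        phase = PHASE_AFTER_VALUE;
                    } else {
                        throw new IllegalArgumentException("unexpected '" + c + "' at " + pos);
                    }
                    break;
                }
                case PHASE_AFTER_VALUE: {
                    if (frames.isEmpty()) {
                        if (pos != src.length()) throw new IllegalArgumentException("trailing input at " + pos);
                        phase = PHASE_DONE;
                        break;
                    }
                    frames.peek().add(result);     // attach the finished value to its parent
                    if (pos >= src.length()) throw new IllegalArgumentException("unclosed array");
                    char c = src.charAt(pos++);
                    if (c == ',') {
                        phase = PHASE_VALUE;
                    } else if (c == ']') {
                        result = frames.pop();     // the closed array is itself a finished value
                    } else {
                        throw new IllegalArgumentException("expected ',' or ']' at " + (pos - 1));
                    }
                    break;
                }
                default:
                    throw new IllegalStateException("unknown phase " + phase);
            }
        }
        return result;
    }
}
```

Calling `FrameLoopSketch.parse("[1,[2,3],[]]")` returns a nested `List`. Because nesting depth lives in the frame stack rather than the call stack, deep documents cannot overflow the Java stack.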

Performance

These benchmarks were run on an M1 MacBook Air using JRuby 10.0.5.0 and OpenJDK 64-Bit Server VM 24.0.1+9-30.

With the SWAR StringScanner

== Parsing activitypub.json (58160 bytes)
jruby 10.0.5.0 (3.4.5) 2026-04-06 5db1ba72f3 OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +indy +jit [arm64-darwin]
Warming up --------------------------------------
               after   656.000 i/100ms
Calculating -------------------------------------
               after      6.719k (± 1.2%) i/s  (148.83 μs/i) -     34.112k in   5.076849s

Comparison:
before:     1351.3 i/s
 after:     6719.1 i/s - 4.97x  faster


== Parsing twitter.json (567916 bytes)
jruby 10.0.5.0 (3.4.5) 2026-04-06 5db1ba72f3 OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +indy +jit [arm64-darwin]
Warming up --------------------------------------
               after    69.000 i/100ms
Calculating -------------------------------------
               after    706.750 (± 1.1%) i/s    (1.41 ms/i) -      3.588k in   5.076758s

Comparison:
before:      113.1 i/s
 after:      706.8 i/s - 6.25x  faster


== Parsing citm_catalog.json (1727030 bytes)
jruby 10.0.5.0 (3.4.5) 2026-04-06 5db1ba72f3 OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +indy +jit [arm64-darwin]
Warming up --------------------------------------
               after    38.000 i/100ms
Calculating -------------------------------------
               after    390.007 (± 2.3%) i/s    (2.56 ms/i) -      1.976k in   5.066570s

Comparison:
before:       37.6 i/s
 after:      390.0 i/s - 10.39x  faster


== Parsing github_events.json (65130 bytes)
jruby 10.0.5.0 (3.4.5) 2026-04-06 5db1ba72f3 OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +indy +jit [arm64-darwin]
Warming up --------------------------------------
               after   744.000 i/100ms
Calculating -------------------------------------
               after      7.261k (± 5.0%) i/s  (137.72 μs/i) -     36.456k in   5.020636s

Comparison:
before:     1160.6 i/s
 after:     7261.2 i/s - 6.26x  faster


== Parsing semanticscholar-corpus.json (8493528 bytes)
jruby 10.0.5.0 (3.4.5) 2026-04-06 5db1ba72f3 OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +indy +jit [arm64-darwin]
Warming up --------------------------------------
               after     4.000 i/100ms
Calculating -------------------------------------
               after     53.332 (± 3.8%) i/s   (18.75 ms/i) -    268.000 in   5.025161s

Comparison:
before:        8.1 i/s
 after:       53.3 i/s - 6.62x  faster

With the Vector API based StringScanner

== Parsing activitypub.json (58160 bytes)
jruby 10.0.5.0 (3.4.5) 2026-04-06 5db1ba72f3 OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +indy +jit [arm64-darwin]
Warming up --------------------------------------
               after   671.000 i/100ms
Calculating -------------------------------------
               after      6.880k (± 1.3%) i/s  (145.34 μs/i) -     34.892k in   5.071152s

Comparison:
before:     1361.7 i/s
 after:     6880.5 i/s - 5.05x  faster


== Parsing twitter.json (567916 bytes)
jruby 10.0.5.0 (3.4.5) 2026-04-06 5db1ba72f3 OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +indy +jit [arm64-darwin]
Warming up --------------------------------------
               after    72.000 i/100ms
Calculating -------------------------------------
               after    736.863 (± 1.4%) i/s    (1.36 ms/i) -      3.744k in   5.080998s

Comparison:
before:      109.3 i/s
 after:      736.9 i/s - 6.74x  faster


== Parsing citm_catalog.json (1727030 bytes)
jruby 10.0.5.0 (3.4.5) 2026-04-06 5db1ba72f3 OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +indy +jit [arm64-darwin]
Warming up --------------------------------------
               after    38.000 i/100ms
Calculating -------------------------------------
               after    394.387 (± 1.5%) i/s    (2.54 ms/i) -      1.976k in   5.010313s

Comparison:
before:       38.8 i/s
 after:      394.4 i/s - 10.17x  faster


== Parsing github_events.json (65130 bytes)
jruby 10.0.5.0 (3.4.5) 2026-04-06 5db1ba72f3 OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +indy +jit [arm64-darwin]
Warming up --------------------------------------
               after   777.000 i/100ms
Calculating -------------------------------------
               after      7.681k (± 2.9%) i/s  (130.20 μs/i) -     38.850k in   5.058152s

Comparison:
before:     1097.3 i/s
 after:     7680.7 i/s - 7.00x  faster


== Parsing semanticscholar-corpus.json (8493528 bytes)
jruby 10.0.5.0 (3.4.5) 2026-04-06 5db1ba72f3 OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +indy +jit [arm64-darwin]
Warming up --------------------------------------
               after     5.000 i/100ms
Calculating -------------------------------------
               after     56.260 (± 3.6%) i/s   (17.77 ms/i) -    285.000 in   5.065794s

Comparison:
before:        7.3 i/s
 after:       56.3 i/s - 7.67x  faster

@samyron

samyron commented Jun 15, 2026

Contributor Author

Tagging @headius for a review on this PR.

@samyron

samyron commented Jun 15, 2026

Contributor Author

Additional thought: We should probably add additional JRuby versions to CI and include the Vector API system properties to ensure the VectorizedStringScanner and VectorizedStringEncoder (from a previous PR) are thoroughly tested.

@byroot byroot requested a review from headius June 15, 2026 06:40
@byroot

byroot commented Jun 15, 2026

Member

Is this the parser mentioned by @enebo in #983 (comment) or is it a concurrent effort?

@samyron

samyron commented Jun 15, 2026

Contributor Author

> Is this the parser mentioned by @enebo in #983 (comment) or is it a concurrent effort?

This is a concurrent effort. I missed that comment as I didn't refer back to that issue once #989 was opened.

@samyron

samyron commented Jun 16, 2026

Contributor Author

If we are willing to accept invalid UTF-8 bytes / broken strings (#138) in the Java parser, this parser can be even faster.

These benchmarks compare the current commit on this PR with some work-in-progress code I have locally. Some of the new code could also be applied to the commit on this branch, which may help a bit if we want to keep the broken-string validation.

== Parsing activitypub.json (58160 bytes)
jruby 10.0.5.0 (3.4.5) 2026-04-06 5db1ba72f3 OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +indy +jit [arm64-darwin]
Warming up --------------------------------------
               after   850.000 i/100ms
Calculating -------------------------------------
               after      8.785k (± 1.1%) i/s  (113.83 μs/i) -     44.200k in   5.031459s

Comparison:
before:     6572.8 i/s
 after:     8784.7 i/s - 1.34x  faster


== Parsing twitter.json (567916 bytes)
jruby 10.0.5.0 (3.4.5) 2026-04-06 5db1ba72f3 OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +indy +jit [arm64-darwin]
Warming up --------------------------------------
               after    78.000 i/100ms
Calculating -------------------------------------
               after    777.467 (± 1.4%) i/s    (1.29 ms/i) -      3.900k in   5.016292s

Comparison:
before:      716.2 i/s
 after:      777.5 i/s - 1.09x  faster

This would allow removing the JRuby guard around this method.

@headius

headius commented Jun 16, 2026

Contributor

This looks like great work! I will review today. It will be great to eliminate the ragel parser and finally get performance where it should be!

@byroot

byroot commented Jun 16, 2026

Member

> If we are willing to accept invalid UTF-8 bytes / broken strings (#138) in the Java parser,

I honestly don't really know what to do with #138. Logically we shouldn't validate the encoding of parsed strings, but I suspect a lot of people rely on this, so it would probably need to be an option like allow_duplicate_key etc.

So it might actually make sense to let the Java version parse invalid encoding like the C version, simply for compatibility.

@headius headius left a comment

Contributor

This is good work and I approve merging it. My review comments are mostly style-related, but we can get more performance out of this by doing some additional profiling, reusing the parser stack and wrapper objects, using VarHandle rather than ByteBuffer for long-stride reads, etc. That can all come after this lands, though.

// pretty-printed JSON is almost always followed by a run
// of indentation spaces, so skip them eight at a time.
while (cursor + 8 <= end) {
long x = chunks.getLong(cursor);

Contributor

This getLong and similar calls in StringScanner could potentially be replaced with VarHandle, which allows reading a byte[] as a long. However, VarHandle is only available on Java 9+, which would make this code unusable on JRuby 9.4 on Java 8 (technically EOL but still in wide use). Not sure if we are ready to make a hard break in json yet.

Contributor Author

I don't know if it would be worth it or how HotSpot would handle it, but I could create an interface, DataView or something, that uses VarHandle if available and otherwise falls back to ByteBuffer.

The call sites would be monomorphic once the implementation is selected so I would hope that HotSpot will handle that appropriately. Something I can test later...
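
A minimal sketch of the fallback idea discussed in this thread, assuming a hypothetical `DataView` interface with two implementations (none of these names come from the PR). `MethodHandles.byteArrayViewVarHandle` is the standard Java 9+ way to read a `byte[]` as `long`. On Java 8 the `VarHandleView` class would fail to link, which the selector treats as the signal to fall back.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical abstraction: read 8 bytes of a byte[] as one little-endian long. */
interface DataView {
    long getLong(byte[] bytes, int offset);

    /** Picks the VarHandle reader when its class can be loaded (Java 9+), else ByteBuffer. */
    static DataView select() {
        try {
            return new VarHandleView();
        } catch (LinkageError e) {
            // VarHandle is missing on Java 8, so use the portable path instead.
            return new ByteBufferView();
        }
    }
}

final class VarHandleView implements DataView {
    private static final VarHandle LONGS =
            MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

    @Override
    public long getLong(byte[] bytes, int offset) {
        // Plain get on a byte-array view tolerates unaligned offsets.
        return (long) LONGS.get(bytes, offset);
    }
}

final class ByteBufferView implements DataView {
    @Override
    public long getLong(byte[] bytes, int offset) {
        // Wrapping per call keeps the sketch stateless; a real scanner would reuse the buffer.
        return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getLong(offset);
    }
}
```

If the result of `DataView.select()` is stored in a `static final` field, only one implementation is ever seen at each call site, which is the monomorphic case mentioned above. A build that must still run on Java 8 would also need the VarHandle class compiled separately (for example in a multi-release jar). This sketch does not cover that packaging.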

Comment thread java/src/json/ext/Parser.java Outdated
byte[] ebuf = eb.getUnsafeBytes();
int ebeg = eb.begin();
for (int i = 0; i < len; i++) {
int cmp = (buf[off + i] & 0xFF) - (ebuf[ebeg + i] & 0xFF);

Contributor

Masking with 0xFF can be replaced with the less arcane Byte.toUnsignedInt or Byte.toUnsignedLong.
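
For illustration, the comparison loop from the snippet above could look like this with `Byte.toUnsignedInt` (available since Java 8). The class and method names are made up for this sketch.

```java
final class UnsignedCompareSketch {
    /** Lexicographically compares len bytes of a and b, treating each byte as 0..255. */
    static int compareUnsigned(byte[] a, int aOff, byte[] b, int bOff, int len) {
        for (int i = 0; i < len; i++) {
            // Same result as (a[i] & 0xFF) - (b[i] & 0xFF), but the intent is spelled out.
            int cmp = Byte.toUnsignedInt(a[aOff + i]) - Byte.toUnsignedInt(b[bOff + i]);
            if (cmp != 0) {
                return cmp;
            }
        }
        return 0;
    }
}
```

The unsigned view matters for UTF-8 keys: a lead byte such as 0xC3 is negative as a signed `byte`, so it would sort before ASCII letters without the conversion.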

Comment thread java/src/json/ext/Parser.java Outdated
// of indentation spaces, so skip them eight at a time.
while (cursor + 8 <= end) {
long x = chunks.getLong(cursor);
if (x == 0x2020202020202020L) {

Contributor

This and other important literal values should probably be in static final long constants.
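
A sketch of what that suggestion could look like for the indentation-skipping loop. The constant and method names are assumptions for the example, not the PR's identifiers.

```java
import java.nio.ByteBuffer;

final class WhitespaceSkipSketch {
    /** Eight ASCII spaces (0x20) packed into one long; the value is the same in either byte order. */
    private static final long EIGHT_SPACES = 0x2020202020202020L;
    private static final int STRIDE = Long.BYTES;

    /** Returns the index of the first byte at or after cursor that is not a space. */
    static int skipSpaces(byte[] src, int cursor, int end) {
        ByteBuffer chunks = ByteBuffer.wrap(src);
        // Fast path: consume whole 8-byte runs of spaces.
        while (cursor + STRIDE <= end && chunks.getLong(cursor) == EIGHT_SPACES) {
            cursor += STRIDE;
        }
        // Slow path: finish the remaining spaces one byte at a time.
        while (cursor < end && src[cursor] == ' ') {
            cursor++;
        }
        return cursor;
    }
}
```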

@headius

headius commented Jun 16, 2026

Contributor

> I honestly don't really know what to do with #138.

The strictness of the Java-based parser has not been a real barrier to anyone I know of.

We have had a few folks over the years switch to JRuby and discover they had bad string content coming in from some json source. Since we obviously never made the json parser allow such content, they all eventually fixed the bad source.

Do we really know how bad the breakage would be if the CRuby json ext started rejecting bad UTF-8 content? I'd rather not reduce the strictness of the Java parser just for a little bit of performance.

@byroot

byroot commented Jun 16, 2026

Member

> Do we really know how bad the breakage would be if the CRuby json ext started rejecting bad UTF-8 content?

#697 comes to mind, and it was a comparatively much simpler change.

But it's really hard to tell.

As for performance, on paper I agree with you, but history has shown that many users will likely use an alternative if it has a reputation for being faster, regardless of whether that really matters or not.

@headius

headius commented Jun 16, 2026

Contributor

> many users will likely use an alternative if it has a reputation for being faster

I'd like to be able to quantify that. @samyron have you already tested a version of the Java parser that is non-strict?

@byroot I assume when you suggest adding another configuration option, it would allow users to opt out of strictness.

Releasing a version that's strict by default would be the quickest way to find out how many people are affected, and having an opt-out config would give those users a temporary workaround. It would be nice for the Ruby json library to help eliminate bad json content rather than silently propagating it.

@byroot

byroot commented Jun 16, 2026

Member

> it would allow users to opt out of strictness.

No, opt in, with a warning by default.

> It would be nice for the Ruby json library to help eliminate bad json content rather than silently propagating it.

Given I'm on the receiving end of complaints, no thanks.

@samyron samyron force-pushed the sm/jruby-parser-rewrite branch from 16de692 to 9b3b483 Compare June 16, 2026 12:44
@samyron

samyron commented Jun 16, 2026

Contributor Author

> I'd like to be able to quantify that. @samyron have you already tested a version of the Java parser that is non-strict?

Yes, see the numbers in this comment. I have this work on a separate branch here.

There are two big changes on that branch:

  • Adding SWAR/Vectorized scanning to the StringDecoder#decode to skip the non-interesting bytes.
  • Removed the UTF-8 validation.

I no longer have these two changes isolated, but if I recall correctly, in UTF-8-heavy text the added SWAR/Vectorized overhead in StringDecoder#decode was pretty significant. That's why I decided to experiment with matching the C parser and removing that validation.
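
For readers unfamiliar with the technique, here is a minimal SWAR sketch of skipping the non-interesting bytes of a JSON string eight at a time. It assumes the interesting bytes are the closing quote, the backslash, and control characters below 0x20. The class and method names are hypothetical and this is not the code on the branch.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class SwarStringScanSketch {
    private static final long ONES = 0x0101010101010101L;
    private static final long HIGH_BITS = 0x8080808080808080L;
    private static final long QUOTES = ONES * '"';       // 0x22 in every byte
    private static final long BACKSLASHES = ONES * '\\'; // 0x5C in every byte
    private static final long SPACES = ONES * ' ';       // 0x20 in every byte

    /** Sets the high bit of each zero byte in x; the lowest flagged byte is always exact. */
    private static long zeroBytes(long x) {
        return (x - ONES) & ~x & HIGH_BITS;
    }

    /** Index of the first '"', '\\' or control byte (< 0x20) in [from, end), or end if none. */
    static int findSpecial(byte[] src, int from, int end) {
        ByteBuffer chunks = ByteBuffer.wrap(src).order(ByteOrder.LITTLE_ENDIAN);
        int i = from;
        while (i + Long.BYTES <= end) {
            long x = chunks.getLong(i);
            long mask = zeroBytes(x ^ QUOTES)
                      | zeroBytes(x ^ BACKSLASHES)
                      | ((x - SPACES) & ~x & HIGH_BITS); // bytes below 0x20
            if (mask != 0) {
                // Little-endian load: the lowest set bit belongs to the earliest byte.
                return i + (Long.numberOfTrailingZeros(mask) >>> 3);
            }
            i += Long.BYTES;
        }
        // Tail: fewer than eight bytes left, check them one by one.
        for (; i < end; i++) {
            int b = Byte.toUnsignedInt(src[i]);
            if (b == '"' || b == '\\' || b < 0x20) {
                return i;
            }
        }
        return end;
    }
}
```

Bytes at or above 0x80 (UTF-8 lead and continuation bytes) never set the mask, so multi-byte text is skipped at full stride. Validating those bytes is a separate pass, which is the cost discussed in this comment.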

@headius

headius commented Jun 16, 2026

Contributor

Definitely would like to get @enebo input here before merging this or combining with his parser.

@enebo

enebo commented Jun 16, 2026

Contributor

I am going to compare and contrast a bit today on this.

@enebo

enebo commented Jun 16, 2026

Contributor

Thus far, I think this version seems to be better with strings and my version is quite a bit faster for numbers (floats). The small array and hash benches favor my branch but this version uses a pretty large starting stack which might explain that difference. I am using raw byte[] and this uses ByteBuffer. I think we can mix some stuff together to get best of both worlds.

The other comment is since this maps more closely structurally to the C extension then it might have an advantage over mine from a maintenance standpoint. The parsers are pretty small overall so not a big problem but I can see if you are updating one it would be nicer for it to be closer.

I want to look a bit more but I think we can make something better than both versions as they stand. It should be trivial to plug in my numeric parsing stuff.

@enebo

enebo commented Jun 16, 2026

Contributor

I am just dumping broader JSON test results from https://github.com/nst/JSONTestSuite.git for the two new parsers. I think both are fine, as nearly all of the failures are undefined behavior and strange enough that I don't think anyone can reasonably expect any particular behavior. This PR has one (trailing raw //) which needs to get fixed, as I think it gets stuck in a loop, but that should be simple to fix.

Screenshot From 2026-06-16 10-02-35

This PR vs MRI:
Screenshot From 2026-06-16 10-01-04

My parser branch vs MRI:

Screenshot From 2026-06-16 10-02-02

@byroot

byroot commented Jun 16, 2026

Member

> I am just dumping JSON broader test results from https://github.com/nst/JSONTestSuite.git

Note that the Ruby harness is a bit broken: nst/JSONTestSuite#145

But I've meant to vendor that test suite in ruby/json for a while now. I'll try to do it soon.

@enebo

enebo commented Jun 16, 2026

Contributor

This PR running on my machine using the parser benchmark 'float parsing':

== Parsing float parsing (2251051 bytes)
jruby 10.1.1.0-SNAPSHOT (4.0.0) 2026-06-08 482a54b9cb OpenJDK 64-Bit Server VM 25.0.1+8-27 on 25.0.1+8-27 +indy +jit [x86_64-linux]
Warming up --------------------------------------
                json     7.000 i/100ms
Calculating -------------------------------------
                json     70.402 (± 1.4%) i/s   (14.20 ms/i) -    707.000 in  10.04508

but swapping in my parser's version of parseNumber:

== Parsing float parsing (2251051 bytes)
jruby 10.1.1.0-SNAPSHOT (4.0.0) 2026-06-08 482a54b9cb OpenJDK 64-Bit Server VM 25.0.1+8-27 on 25.0.1+8-27 +indy +jit [x86_64-linux]
Warming up --------------------------------------
                json    16.000 i/100ms
Calculating -------------------------------------
                json    175.713 (± 1.1%) i/s    (5.69 ms/i) -      1.760k in  10.018326s

Now what really confuses me is here is my parser running it:

== Parsing float parsing (2251051 bytes)
jruby 10.1.1.0-SNAPSHOT (4.0.0) 2026-06-08 482a54b9cb OpenJDK 64-Bit Server VM 25.0.1+8-27 on 25.0.1+8-27 +indy +jit [x86_64-linux]
Warming up --------------------------------------
                json    12.000 i/100ms
Calculating -------------------------------------
                json    124.782 (± 2.4%) i/s    (8.01 ms/i) -      1.248k in  10.006981s

So something magical is happening in this PR's parser that is not happening in mine. Yet mine edges out this one on most of the benchmarks until it hits multi-byte UTF-8 characters. Then this parser does quite a bit better again (citm_catalog and twitter).

@samyron and others watching this PR: I am sorry that we are in the position of having two complete implementations, but I think we will be better for it in the end. I am going to continue to examine the implementations today and probably tomorrow. I really like the idea of using this one because it is so close to the C version (which is good for syncing when new things happen). That said, my parser is doing a little better in most of the benchmarks (twitter and citm_catalog being the exceptions, and also probably the most real-world examples). I think if I can get this one's speed up to mine then we should use this one.

@enebo

enebo commented Jun 16, 2026

Contributor

The number-parsing enhancement branch is here: https://github.com/enebo/json/tree/jruby-parser-rewrite-numbers

@samyron

samyron commented Jun 16, 2026

Contributor Author

@enebo Are the screenshots backwards? I wasn't able to reproduce the {"a": "b"}// results locally. I ran the full test suite against my branch and CRuby and see this:

image

@enebo

enebo commented Jun 16, 2026

Contributor

@samyron My apologies. I did mix the two up. I thought I was using dates to tell which was which, and I tested my branch last week. I clearly reversed them. The new addition of tests for comments preceding an element forced me to rearrange some stuff, and this was fallout from that, I guess.

@samyron

samyron commented Jun 16, 2026

Contributor Author

> @samyron and others watching this PR: I am sorry that we are in the position of having two complete implementations, but I think we will be better for it in the end. I am going to continue to examine the implementations today and probably tomorrow. I really like the idea of using this one because it is so close to the C version (which is good for syncing when new things happen). That said, my parser is doing a little better in most of the benchmarks (twitter and citm_catalog being the exceptions, and also probably the most real-world examples). I think if I can get this one's speed up to mine then we should use this one.

My apologies to you, @enebo! Had I seen your comment I wouldn't have created this parser. That said, I'm totally cool with combining efforts in an attempt to get the fastest parser possible.

@samyron

samyron commented Jun 16, 2026

Contributor Author

> So something magical is happening in this PR's parser that is not happening in mine. Yet mine edges out this one on most of the benchmarks until it hits multi-byte UTF-8 characters. Then this parser does quite a bit better again (citm_catalog and twitter).

It might be worth an attempt to incorporate the StringScanner and VectorizedStringScanner into your string parsing function to see if that helps. I can try that later tonight. I'm assuming this is the branch?

@enebo

enebo commented Jun 16, 2026

Contributor

@samyron yeah. That's the branch.

The phase approach might actually improve my branch as well since I have a lot of state machine switch nonsense making sure I am not in the wrong place in many places. Seeing that and how I create array/hash at front vs back of element processing really tickles my brain. Like I spend cost of potentially regrowing an array as I add to it by using a RubyArray right away but the value stack in the PR means walking down it for hash sets where early access to it is probably quicker. Always fun to see different approaches.

@samyron

samyron commented Jun 16, 2026

Contributor Author

> @samyron yeah. That's the branch.
>
> The phase approach might actually improve my branch as well since I have a lot of state machine switch nonsense making sure I am not in the wrong place in many places. Seeing that and how I create array/hash at front vs back of element processing really tickles my brain. Like I spend cost of potentially regrowing an array as I add to it by using a RubyArray right away but the value stack in the PR means walking down it for hash sets where early access to it is probably quicker. Always fun to see different approaches.

I wonder if this accounts for some of the difference in the float parsing benchmark you mentioned above. Looking at the data in canada.json, there are arrays of arrays of arrays.

Looking at the "second level" array sizes:

irb(main):013> coords = data['features'].first['geometry']['coordinates']
...
irb(main):014> coords.each { |c| puts c.length } 
14
33
18
23
10
28
9
28
279
221
11
86
28
19
26
23
24
38
24

When this branch is decoding an array, we know exactly how many elements it contains and allocate an array of exactly that size and use System.arraycopy to build the RubyArray.

Additionally, the code only resets the top index in the value stack array, so once we hit that array with 279 elements, we don't need to reallocate the value stack until we see an array with more elements.
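
A minimal sketch of that value-stack pattern, using plain `Object[]` in place of RubyArray. The class and method names are hypothetical, and the starting capacity of 16 is arbitrary.

```java
import java.util.Arrays;

final class ValueStackSketch {
    private Object[] values = new Object[16];
    private int top = 0;

    /** Pushes one parsed value, doubling the backing array only when it is full. */
    void push(Object value) {
        if (top == values.length) {
            values = Arrays.copyOf(values, values.length * 2);
        }
        values[top++] = value;
    }

    /** Pops the newest count values into an exactly sized array, oldest first. */
    Object[] popFrame(int count) {
        Object[] out = new Object[count];
        System.arraycopy(values, top - count, out, 0, count);
        top -= count; // only the index moves; the backing array keeps its capacity
        return out;
    }

    int size() {
        return top;
    }

    int capacity() {
        return values.length;
    }
}
```

When the parser reaches `]`, it knows how many values were pushed since the matching `[` and calls `popFrame` with that count, so the resulting array never has to grow. Stale references above `top` stay reachable until they are overwritten, which is a trade-off of resetting only the index.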

@enebo

enebo commented Jun 16, 2026

Contributor

"When this branch is decoding an array, we know exactly how many elements it contains and allocate an array of exactly that size and use System.arraycopy to build the RubyArray."

Yeah. That is most definitely an additional cost, and it could explain why my faster floating-point parsing is quite a bit slower in my parser than when I moved it to yours. I could special-case arrays to use that technique, though. It is just an extra field. The bigger question to me, though, is whether there is a benefit for hashes in making a RubyHash upfront. If we waited, we could specify a "good" number of buckets (this PR has no bucket size provided but it could help). I am less clear on buckets because it becomes a space vs. time trade-off, and it is complicated by how well the keys distribute.

@enebo

enebo commented Jun 16, 2026

Contributor

@samyron I shoe-horned a value array for just making arrays and it did speed up. How I did it was a little gross (I push a Ruby fixnum instead of a RubyArray up front and use that when I hit ']' to make proper array), but it did not make it to the result on this branch. I am now at:

 ~/work/json_newest/benchmark new_jruby_parser * 360% ONLY=json ~/work/jruby-10.1/bin/jruby -J-XX:+UseParallelGC -Xcompile.invokedynamic -I../lib parser.rb
== Parsing float parsing (2251051 bytes)
jruby 10.1.1.0-SNAPSHOT (4.0.0) 2026-06-08 482a54b9cb OpenJDK 64-Bit Server VM 25.0.1+8-27 on 25.0.1+8-27 +indy +jit [x86_64-linux]
Warming up --------------------------------------
                json    12.000 i/100ms
Calculating -------------------------------------
                json    142.612 (± 1.4%) i/s    (7.01 ms/i) -      1.428k in  10.016141s

Still, it is showing promise.

@samyron

samyron commented Jun 16, 2026

Contributor Author

> (this PR has no bucket size provided but it could help).

🤦 Doh! This started out much simpler: I was initially parsing into Java objects while I was getting the overall shape of the parser into place. Objects were a HashMap for a while, with a capacity of (int) (size / 0.75f). When porting over to Ruby objects I must have dropped that. I need to dig into the implementation of RubyHash and figure out how to compute the number of buckets needed to avoid resizes.
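
For reference, the HashMap presizing mentioned above can be sketched as follows. This applies only to `java.util.HashMap`; how RubyHash sizes its buckets is not assumed here. The helper names are hypothetical, and the `+ 1` is a small safety margin added in this sketch rather than part of the formula quoted above.

```java
import java.util.HashMap;
import java.util.Map;

final class PresizedMapSketch {
    private static final float LOAD_FACTOR = 0.75f; // java.util.HashMap's default

    /** Initial capacity that lets expectedSize entries fit under the resize threshold. */
    static int capacityFor(int expectedSize) {
        return (int) (expectedSize / LOAD_FACTOR) + 1;
    }

    /** Creates a map that should not need to rehash while expectedSize entries are added. */
    static <K, V> Map<K, V> newMap(int expectedSize) {
        return new HashMap<>(capacityFor(expectedSize), LOAD_FACTOR);
    }
}
```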

@headius

headius commented Jun 16, 2026

Contributor

FWIW, JRuby 10.1 has been laying the groundwork to start replacing the old linked-buckets hash with a smaller, more efficient implementation. It may not be super valuable to try to game the buckets, but there may be some quick wins.

@samyron

samyron commented Jun 17, 2026

Contributor Author

@enebo I cherry-picked your parseNumber commit onto a local copy of this PR. It is significantly faster. I can push it here unless you'd like to contribute it directly.

Additionally, with respect to the other benchmarks... I don't believe your parser does UTF-8 validation. I have a branch that removes that validation. You can find those benchmarks in this comment above.
