[Java Extension] Ported the C extension parser to the Java and remove ragel generated parser.#1004
[Java Extension] Ported the C extension parser to the Java and remove ragel generated parser.#1004samyron wants to merge 2 commits into
Conversation
|
Tagging @headius for a review on this PR. |
|
Additional thought: We should probably add additional JRuby versions to CI and include the Vector API system properties to ensure the |
|
Is this the parser mentioned by @enebo in #983 (comment) or is it a concurrent effort? |
This is a concurrent effort. I missed that comment as I didn't refer back to that issue once #989 was opened. |
|
If we are willing to accept invalid UTF-8 bytes / broken strings(#138) in the Java parser, this parser can be even faster: These benchmarks compare the current commit on this PR with some work-in-progress code I have locally. Some of this new code could be applied to the commit on this branch which may help a bit too if we want to keep the broken string validation. This would allow removing the JRuby guard around this method. |
|
This looks like great work! I will review today. It will be great to eliminate the ragel parser and finally get performance where it should be! |
I honestly don't really know what to do with #138. Logically we shouldn't validate the parsed strings encoding, but I suspect a lot of people rely on this, so it probably would need to be an option like So it might actually make sense to let the Java version parse invalid encoding like the C version, simply for compatibility. |
headius
left a comment
There was a problem hiding this comment.
This is good work and I approve merging it. My review comments are mostly style-related but we can get more performance out of this by doing some additional profiling and reusing the parser stack and wrapper objects, using VarHandle to do long-stride reads rather than ByteBuffer, etc. That can all come after this lands, though.
| // pretty-printed JSON is almost always followed by a run | ||
| // of indentation spaces, so skip them eight at a time. | ||
| while (cursor + 8 <= end) { | ||
| long x = chunks.getLong(cursor); |
There was a problem hiding this comment.
This getLong and similar calls in StringScanner could potentially be replaced with VarHandle, which allows reading a byte[] as long. However VarHandle is only available on Java 9+ which would make this code unusable on JRuby 9.4 on Java 8 (technically EOL but still in wide use). Not sure if we are ready to make a hard break in json yet.
There was a problem hiding this comment.
I don't know if it would be worth it / how HotSpot would handle it but I could create an interface DataView or something that uses VarHandle if available and otherwise falls back to ByteBuffer if not.
The call sites would be monomorphic once the implementation is selected so I would hope that HotSpot will handle that appropriately. Something I can test later...
| byte[] ebuf = eb.getUnsafeBytes(); | ||
| int ebeg = eb.begin(); | ||
| for (int i = 0; i < len; i++) { | ||
| int cmp = (buf[off + i] & 0xFF) - (ebuf[ebeg + i] & 0xFF); |
There was a problem hiding this comment.
Masking with 0xFF can be replaced with the less arcane Byte.toUnsignedInt or Byte.toUnsignedLong.
| // of indentation spaces, so skip them eight at a time. | ||
| while (cursor + 8 <= end) { | ||
| long x = chunks.getLong(cursor); | ||
| if (x == 0x2020202020202020L) { |
There was a problem hiding this comment.
This and other important literal values should probably be in static final long constants.
The strictness of the Java-based parser has not been a real barrier to anyone I know of. We have had a few folks over the years switch to JRuby and discover they had bad string content coming in from some json source. Since we obviously never made the json parser allow such content, they all eventually fixed the bad source. Do we really know how bad the breakage would be if the CRuby json ext started rejecting bad UTF-8 content? I'd rather not reduce the strictness of the Java parser just for a little bit of performance. |
#697 comes to mind, and it was a comparatively much simpler change. But it's really hard to tell. As for performance, on paper I agree with you, but history have shown that many users will likely use an alternative if has the reputation to be faster, regardless of whether that really matters or not. |
I'd like to be able to quantify that. @samyron have you already tested a version of the Java parser that is non-strict? @byroot I assume when you suggest adding another configuration option, it would allow users to opt out of strictness. Releasing a version that's strict by default would be the quickest way to find out how many people are affected, and having an opt-out config would give those users a temporary workaround. It would be nice for the Ruby json library to help eliminate bad json content rather than silently propagating it. |
No in, with a warning by default.
Given I'm on the receiving end of complaints, no thanks. |
16de692 to
9b3b483
Compare
Yes, see the numbers in this comment. I have this work on a separate branch here. There are two big changes on that branch:
I no longer have these two changes isolated but if I recall correctly in UTF-8 heavy text the added SWAR/Vectorized overhead in the |
|
Definitely would like to get @enebo input here before merging this or combining with his parser. |
|
I am going to compare and contrast a bit today on this. |
|
Thus far, I think this version seems to be better with strings and my version is quite a bit faster for numbers (floats). The small array and hash benches favor my branch but this version uses a pretty large starting stack which might explain that difference. I am using raw byte[] and this uses ByteBuffer. I think we can mix some stuff together to get best of both worlds. The other comment is since this maps more closely structurally to the C extension then it might have an advantage over mine from a maintenance standpoint. The parsers are pretty small overall so not a big problem but I can see if you are updating one it would be nicer for it to be closer. I want to look a bit more but I think we can make something better than both versions as they stand. It should be trivial to plug in my numeric parsing stuff. |
|
I am just dumping JSON broader test results from https://github.com/nst/JSONTestSuite.git between the two new parsers. I think both are fine as nearly all of the failures are undefined behavior and strange enough I don't think anyone can reasonably expect any particular behavior. This PR has one (trailing raw //) which needs to get fixed as I think it gets stuck in a loop but that should be simple to fix.
My parser branch vs MRI:
|
Note that the Ruby harness is a bit broken: nst/JSONTestSuite#145 But I've meant to vendor that test suite in ruby/json for a while now. I'll try to do it soon. |
|
This PR running on my machine using the parser benchmark 'float parsing': but swapping in my parser's version of parseNumber: Now what really confuses me is here is my parser running it: So something magical is happening in this PRs parser which is not on mine. Yet, most of the benchmarks on mine will edges out this one until it hits multiple byte UTF-8 characters. Then this parser does quite a bit better again (citm_catalog and twitter). @samyron and others watching this PR. I am sorry that we are in the position of two complete implementations but I think we will be better for it in the end. I am going to continue to examine the impls today and probably tomorrow. I really like the idea of using this one because it is so close to the C version (which is good for synch'ing when new things happen). That said, my parser is doing a little better in most of the benchmarks (twitter and citm_catalog being the exceptions --- and also probably the most real-world examples). I think if I can get this one's speed up to mine then we should use this one. |
|
Parsing numbers enhancement branch is here https://github.com/enebo/json/tree/jruby-parser-rewrite-numbers |
|
@enebo Are the screenshots backwards? I wasn't able to reproduce the
|
|
@samyron My apologies. I did mix the two up. I thought I was using dates for which was which and I tested my branch last week. I clearly reversed them. The new addition of comment tests preceeding element had me have to arrange some stuff and this was fallout from that I guess. |
My apologies to you @enebo! Had I seen your comment I wouldn't have created this one Parser. That said.. I'm totally cool combining efforts in an attempt to get the fastest parser possible. |
It might be worth an attempt to incorporate the |
|
@samyron yeah. That's the branch. The phase approach might actually improve my branch as well since I have a lot of state machine switch nonsense making sure I am not in the wrong place in many places. Seeing that and how I create array/hash at front vs back of element processing really tickles my brain. Like I spend cost of potentially regrowing an array as I add to it by using a RubyArray right away but the value stack in the PR means walking down it for hash sets where early access to it is probably quicker. Always fun to see different approaches. |
I wonder if this accounts for some of the difference in the float parsing benchmark you mentioned above. Looking at the data in canada.json, there are arrays of arrays of arrays. Looking at the "second level" array sizes: When this branch is decoding an array, we know exactly how many elements it contains and allocate an array of exactly that size and use Additionally, the code only resets the top index in the value stack array, so we hit that array with 279 element, we don't need to reallocate the value stack until we have an array with more elements. |
|
"When this branch is decoding an array, we know exactly how many elements it contains and allocate an array of exactly that size and use System.arraycopy to build the RubyArray." Yeah. That is most definitely an additional cost and it could definitely explain why my faster floating point is quite a bit slower than when I moved it to your parser. I could definitely special case arrays to use that technique though. It is just an extra field. The bigger question to me though is there a benefit for hashes to make a RubyHash upfront? Like if we waited we could specify a "good" amount of buckets (this PR has no bucket size provided but it could help). I am less clear on buckets because it becomes a space vs time thing and it is complicated by how much the keys distribute. |
|
@samyron I shoe-horned a value array for just making arrays and it did speed up. How I did it was a little gross (I push a Ruby fixnum instead of a RubyArray up front and use that when I hit ']' to make proper array), but it did not make it to the result on this branch. I am now at: Still, it is showing promise. |
🤦 Doh! This started out much simpler and I was initially parsing Java objects while I was getting the overall shape of the parser put into place. Objects were a |
|
Fwiw JRuby 10.1 has been laying the groundwork to start replacing the old linked buckets hash with a smaller more efficient implementation. It may not be super valuable to try to game the buckets but there may be some quick wins. |
|
@enebo I cherry picked your parseNumber commit onto a local copy of this PR. It is significantly faster. I can push it here unless you'd like to contribute it directly. Additionally, with respect to the other benchmarks... I don't believe your parser does UTF-8 validation. I have a branch that removes that validation. You can find those benchmarks in this comment above. |




Overview
This PR is a port of the current C extension parser to Java. It removes the ragel generated parser.
Implementation Notes
RubyHash#fastASetinstead ofRubyHash#op_asetas strings or symbols used as keys are explicitly frozen in this module. This contributed a significant performance boost to this Parser. The keys are frozen incachedKey/internedKeywhich is called viaparseStringwhenisName=true.ruby.json.useVectorizedParser=trueJVM property is set. If the Vector API is not available or explicitly disabled the SWAR implementation is used.rvalue_cachestyle cache as a quick cache for object keys. However, since the cache is heap allocated in Java, the size is 128 entries.Performance
These benchmarks were run on an M1 Macbook Air using jruby 10.0.5.0 and OpenJDK 64-Bit Server VM 24.0.1+9-30.
With the SWAR StringScanner
With the Vector API based StringScanner