@@ -9,6 +9,9 @@ multiple GiB/s — several × faster than a `HashMap` fold table — while using
form, as defined by the Unicode [CaseFolding.txt][cf] data file restricted to
the **simple** (1-to-1) folds (statuses `C` and `S`). Full multi-character
folds (`F`, e.g. `ß` → `ss`) and Turkic locale folds (`T`) are not supported.
+ The crate also provides [`index_fold`](#single-byte-index-fold), which projects
+ every character (ASCII or multibyte) onto a single byte, a handy primitive for
+ case-insensitive n-gram indexing.

[cf]: https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt

@@ -25,6 +28,61 @@ assert_eq!(simple_fold("Hello, WORLD!".to_string()), "hello, world!");
assert_eq!(simple_fold("ÜBER".to_string()), "über");
```

+ To fold a single code point, use `simple_fold_char(c: char) -> char`, which
+ returns the same character `simple_fold` would emit for that input (ASCII is
+ lowercased; a character with a simple fold maps to its folded code point; every
+ other character is returned unchanged).
+
+ ## Single-byte index fold
+
+ `index_fold(s: String) -> Vec<u8>` applies the **same** simple fold as
+ `simple_fold`, then collapses **every character to exactly one byte**:
+
+ - ASCII characters become their plain lowercased byte (high bit clear).
+ - Every multibyte character becomes `0x80 | (cp & 0x7F)`: the low 7 bits of its
+   *folded* code point, with the high bit set. The high bit is set
+   unconditionally, so even a multibyte character that folds to ASCII (e.g.
+   U+212A KELVIN SIGN → `k`) yields `0x80 | b'k'`, never a bare ASCII byte.
+
+ ```rust
+ use casefold::index_fold;
+ assert_eq!(index_fold("Hi!".to_string()), b"hi!");
+ assert_eq!(index_fold("Ü".to_string()), &[0xFC]); // ü → 0x80 | (0xFC & 0x7F)
+ assert_eq!(index_fold("中".to_string()), &[0x80 | 0x2D]);
+ ```
+
+ The result is fixed-width (one byte per character) and is therefore **not**
+ valid UTF-8. To fold a single code point, use `index_fold_char(c: char) -> u8`,
+ which returns the same byte `index_fold` would emit for that character.
+
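The two projection rules above can be written out for a single character. This is a minimal sketch, not the crate's implementation: `project` is a hypothetical helper, and it assumes the caller has already computed the simple fold of the character (for example with `simple_fold_char`).

```rust
/// Hypothetical helper (not part of the crate): map one character to its
/// index byte, given the original character and its already-folded form.
fn project(original: char, folded: char) -> u8 {
    if original.is_ascii() {
        // ASCII input folds to ASCII, so the plain byte is kept (high bit clear).
        folded as u8
    } else {
        // Multibyte input: low 7 bits of the folded code point with the high
        // bit forced on, even when the fold itself is an ASCII character.
        0x80 | (((folded as u32) & 0x7F) as u8)
    }
}
```

The decision depends on the *original* character, not the folded one, which is why U+212A KELVIN SIGN ends up at `0x80 | b'k'` rather than at a bare `k`.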
+ ### Why one byte per character?
+
+ This is a building block for **case-insensitive n-gram indexing**. When every
+ character (ASCII or not) is reduced to a single byte, a fixed *k*-gram is just
+ *k* contiguous bytes: byte n-grams are trivial to slice, hash, and store, they
+ are already case-folded so lookups are case-insensitive for free, and a document
+ of *n* characters yields exactly *n* index bytes. ASCII keeps its natural byte,
+ and multibyte scripts are projected onto the high half (`0x80–0xFF`) so they
+ never collide with ASCII.
+
+ The projection is intentionally **lossy**: distinct code points that share the
+ same low 7 bits map to the same byte (most CJK, for instance, lands in a narrow
+ band). That is fine for an index: use `index_fold` as a cheap *candidate filter*
+ that never produces false negatives for a case-insensitive match, then verify
+ exact hits against the original text afterwards.
+
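As an illustration of the candidate-filter idea, here is a small sketch that works on byte slices already produced by `index_fold`. The helper names `kgrams` and `may_contain` are hypothetical and not part of the crate, and a real index would likely hash or pack the k-grams rather than store slices.

```rust
use std::collections::HashSet;

/// Collect every k-byte window of an index-folded buffer.
/// `k` must be non-zero, because `slice::windows` panics on a size of 0.
fn kgrams(index_bytes: &[u8], k: usize) -> HashSet<&[u8]> {
    index_bytes.windows(k).collect()
}

/// Candidate filter: true when every k-gram of the query also occurs in the
/// document's k-gram set. A `true` result may be a false positive and still
/// needs an exact check; a `false` result rules the document out.
/// A query shorter than `k` has no k-grams and therefore always passes.
fn may_contain(doc_grams: &HashSet<&[u8]>, query_bytes: &[u8], k: usize) -> bool {
    query_bytes.windows(k).all(|gram| doc_grams.contains(gram))
}
```

Because both sides are folded with the same function, a query that occurs in the document case-insensitively always passes the filter; only the verification step has to look at the original text.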
+ Mechanically it reuses the whole fold table; the only addition is a per-run
+ 7-bit `INDEX_DELTA`. By modular arithmetic the folded low 7 bits are
+ `((cp & 0x7F) + (delta & 0x7F)) mod 128`, so the fold is a single
+ `wrapping_add` with no UTF-8 reconstruction, no decode, and no encode (the stray
+ carry bit is overwritten by the unconditional `0x80 |`). Because the output is
+ never longer than the input, it runs fully in place in the input's own buffer,
+ and pure-ASCII input is returned untouched. It shares `simple_fold`'s
+ auto-vectorized ASCII pass (~46 GiB/s) and, since it emits one byte per
+ character, runs *faster* than `simple_fold` on folding-heavy input (e.g. ~1.9
+ vs ~1.3 GiB/s on length-changing folds, ~1.1 vs ~0.9 GiB/s on mixed BMP) and a
+ little slower on pure-reject CJK/symbols due to character collapsing.
+
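The modular-arithmetic claim can be checked on decoded code points. The sketch below is only an illustration of the arithmetic: `delta7` and `index_byte` are hypothetical names, and the crate itself operates on UTF-8 bytes in place with its per-run `INDEX_DELTA` table rather than on `char` values.

```rust
/// Hypothetical: the low 7 bits of the code-point delta for a fold `from -> to`.
fn delta7(from: char, to: char) -> u8 {
    ((to as u32).wrapping_sub(from as u32) & 0x7F) as u8
}

/// Index byte of a multibyte character from its code point and a 7-bit delta.
fn index_byte(cp: char, delta: u8) -> u8 {
    // Truncate to the low byte and add with wraparound. Bits 0..=6 now equal
    // ((cp & 0x7F) + (delta & 0x7F)) mod 128; bit 7 holds leftovers (a stray
    // carry or a code-point bit) and is overwritten by the `0x80 |`.
    0x80 | (cp as u32 as u8).wrapping_add(delta)
}
```

A character with no fold simply uses a delta of 0, so the same expression covers both folded and unfolded multibyte input.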
## Why does this crate exist?

Unicode 16.0 defines 1484 simple-fold mappings. Common ways to store them:
@@ -68,7 +126,7 @@ query:
 `wrapping_add`, one 4-byte store: no decode, no encode. Writing fewer/more
 bytes than were read handles length-changing folds (`K`→`k`, `Ⱥ`→`ⱥ`).

- ### Table layout (1776 B total)
+ ### Table layout (2014 B total)

| Component | Bytes |
|-------------------------------------------------|------:|
@@ -78,8 +136,11 @@ query:
| `RUN_END_LOW[238 + 8]: u8` (clean scan key, `end & 0x3F`; +8 SWAR pad) | 246 |
| `RUN_START_STRIDE[238]: u8` (`start & 0x3F` \| stride bit) | 238 |
| `BYTE_DELTA[238]: u32` (little-endian fold delta per run) | 952 |
- | **Total** | **1776** |
+ | `INDEX_DELTA[238]: u8` (7-bit per-run fold delta, `index_fold` only) | 238 |
+ | **Total** | **2014** |

+ The `simple_fold` path uses 1776 B of this; the 238 B `INDEX_DELTA` side table
+ powers [`index_fold`](#single-byte-index-fold) only.
(Splitting runs at byte-delta boundaries raises the run count from 227 to 238.)
The data file is parsed at build time by `build.rs`, which emits the packed
`static` tables to `OUT_DIR/table.rs`.