Commit d9e8db6

Merge pull request #124 from github/aneubeck/casefold
index folding
2 parents fe58e52 + d3cce8d commit d9e8db6

6 files changed

Lines changed: 843 additions & 354 deletions

crates/casefold/README.md

Lines changed: 63 additions & 2 deletions
@@ -9,6 +9,9 @@ multiple GiB/s — several × faster than a `HashMap` fold table — while using
 form, as defined by the Unicode [CaseFolding.txt][cf] data file restricted to
 the **simple** (1-to-1) folds (statuses `C` and `S`). Full multi-character
 folds (`F`, e.g. `ß` → `ss`) and Turkic locale folds (`T`) are not supported.
+The crate also provides [`index_fold`](#single-byte-index-fold), which projects
+every character — ASCII or multibyte — onto a single byte, a handy primitive for
+case-insensitive n-gram indexing.

 [cf]: https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt

@@ -25,6 +28,61 @@ assert_eq!(simple_fold("Hello, WORLD!".to_string()), "hello, world!");
 assert_eq!(simple_fold("ÜBER".to_string()), "über");
 ```

+To fold a single code point, use `simple_fold_char(c: char) -> char`, which
+returns the same character `simple_fold` would emit for that input (ASCII is
+lowercased; a character with a simple fold maps to its folded code point; every
+other character is returned unchanged).
+
+## Single-byte index fold
+
+`index_fold(s: String) -> Vec<u8>` applies the **same** simple fold as
+`simple_fold`, then collapses **every character to exactly one byte**:
+
+- ASCII characters become their plain lowercased byte (high bit clear).
+- Every multibyte character becomes `0x80 | (cp & 0x7F)` — the low 7 bits of its
+  *folded* code point, with the high bit set. The high bit is set
+  unconditionally, so even a multibyte character that folds to ASCII (e.g.
+  U+212A KELVIN SIGN → `k`) yields `0x80 | b'k'`, never a bare ASCII byte.
+
+```rust
+use casefold::index_fold;
+assert_eq!(index_fold("Hi!".to_string()), b"hi!");
+assert_eq!(index_fold("Ü".to_string()), &[0xFC]); // ü → 0x80 | (0xFC & 0x7F)
+assert_eq!(index_fold("".to_string()), &[0x80 | 0x2D]);
+```
+
+The result is fixed-width (one byte per character) and is therefore **not**
+valid UTF-8. To fold a single code point, use `index_fold_char(c: char) -> u8`,
+which returns the same byte `index_fold` would emit for that character.
+
+### Why one byte per character?
+
+This is a building block for **case-insensitive n-gram indexing**. When every
+character — ASCII or not — is reduced to a single byte, a fixed *k*-gram is just
+*k* contiguous bytes: byte n-grams are trivial to slice, hash, and store, they
+are already case-folded so lookups are case-insensitive for free, and a document
+of *n* characters yields exactly *n* index bytes. ASCII keeps its natural byte,
+and multibyte scripts are projected onto the high half (`0x80–0xFF`) so they
+never collide with ASCII.
+
+The projection is intentionally **lossy** — distinct code points that share the
+same low 7 bits map to the same byte (most CJK, for instance, lands in a narrow
+band). That is fine for an index: use `index_fold` as a cheap *candidate filter*
+that never produces false negatives for a case-insensitive match, then verify
+exact hits against the original text afterwards.
+
+Mechanically it reuses the whole fold table; the only addition is a per-run
+7-bit `INDEX_DELTA`. By modular arithmetic the folded low 7 bits are
+`((cp & 0x7F) + (delta & 0x7F)) mod 128`, so the fold is a single
+`wrapping_add` — no UTF-8 reconstruction, no decode, no encode (the stray carry
+bit is overwritten by the unconditional `0x80 |`). Because the output is never
+longer than the input, it runs fully in place in the input's own buffer, and
+pure-ASCII input is returned untouched. It shares `simple_fold`'s
+auto-vectorized ASCII pass (~46 GiB/s) and, since it emits one byte per
+character, runs *faster* than `simple_fold` on folding-heavy input (e.g. ~1.9
+vs ~1.3 GiB/s on length-changing folds, ~1.1 vs ~0.9 GiB/s on mixed BMP) and a
+little slower on pure-reject CJK/symbols due to character collapsing.
+
 ## Why does this crate exist?

 Unicode 16.0 defines 1484 simple-fold mappings. Common ways to store them:
@@ -68,7 +126,7 @@ query:
 `wrapping_add`, one 4-byte store — no decode, no encode. Writing fewer/more
 bytes than were read handles length-changing folds (`K` → `k`, `Ⱥ` → `ⱥ`).

-### Table layout (1776 B total)
+### Table layout (2014 B total)

 | Component | Bytes |
 |-------------------------------------------------|------:|
@@ -78,8 +136,11 @@ query:
 | `RUN_END_LOW[238 + 8]: u8` (clean scan key, `end & 0x3F`; +8 SWAR pad) | 246 |
 | `RUN_START_STRIDE[238]: u8` (`start & 0x3F` \| stride bit) | 238 |
 | `BYTE_DELTA[238]: u32` (little-endian fold delta per run) | 952 |
-| **Total** | **1776** |
+| `INDEX_DELTA[238]: u8` (7-bit per-run fold delta, `index_fold` only) | 238 |
+| **Total** | **2014** |

+The `simple_fold` path uses 1776 B of this; the 238 B `INDEX_DELTA` side table
+powers [`index_fold`](#single-byte-index-fold) only.
 (Splitting runs at byte-delta boundaries raises the run count from 227 to 238.)
 The data file is parsed at build time by `build.rs`, which emits the packed
 `static` tables to `OUT_DIR/table.rs`.
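
The n-gram use case described in the README changes above can be sketched without the crate. This is a minimal illustration, not the crate's API: `kgrams` and `may_contain` are hypothetical helpers, and the byte strings in `main` are written by hand following the stated rule (ASCII lowercased, `ü` projected to `0xFC`) rather than produced by `index_fold`.

```rust
use std::collections::HashSet;

/// Hypothetical helper: collects every k-gram (k contiguous bytes) of an
/// index-folded byte string. Returns an empty set when the input is shorter
/// than `k` or when `k` is zero.
fn kgrams(index_bytes: &[u8], k: usize) -> HashSet<&[u8]> {
    if k == 0 {
        // `windows(0)` would panic, so handle it explicitly.
        return HashSet::new();
    }
    index_bytes.windows(k).collect()
}

/// Hypothetical candidate filter: true when every k-gram of the folded query
/// also occurs in the document's k-gram set. A `true` result still has to be
/// verified against the original text, because the projection is lossy.
fn may_contain(doc_grams: &HashSet<&[u8]>, query: &[u8], k: usize) -> bool {
    kgrams(query, k).into_iter().all(|g| doc_grams.contains(g))
}

fn main() {
    // "Über uns" folded by hand: Ü -> ü -> 0xFC, the rest is lowercased ASCII.
    let mut doc = vec![0xFCu8];
    doc.extend_from_slice(b"ber uns");

    // "ÜBER" folded by hand gives the same leading bytes.
    let query = [0xFC, b'b', b'e', b'r'];

    let grams = kgrams(&doc, 3);
    assert!(may_contain(&grams, &query, 3));
    assert!(!may_contain(&grams, b"xyz", 3));
}
```

A query shorter than `k` has no k-grams, so the filter accepts it and leaves the decision to the verification step; that keeps the filter free of false negatives.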

crates/casefold/benchmarks/conversion.rs

Lines changed: 11 additions & 2 deletions
@@ -1,7 +1,8 @@
 //! Benchmarks for `casefold::simple_fold`, comparing it against several
-//! baselines on representative inputs. Each input is run through six variants:
+//! baselines on representative inputs. Each input is run through these variants:
 //!
 //! - `casefold::simple_fold` — the implementation under test.
+//! - `casefold::index_fold` — the one-byte-per-character index fold.
 //! - `HashMap::fold_into_bytes` — a HashMap-based case fold over raw UTF-8.
 //! - `str::to_lowercase` — straightforward Unicode lowercasing baseline.
 //! - `chars().flat_map(to_lowercase)` — the per-char flat-map variant.
@@ -13,7 +14,7 @@
 //! cases (e.g. `Σ` final-sigma context, `İ` → `i\u{0307}`). These benchmarks
 //! are about throughput on equivalent workloads, not output equality.

-use casefold::{simple_fold, utf8_len};
+use casefold::{index_fold, simple_fold, utf8_len};
 use casefold_benchmarks::{hashmap_fold_utf8, reference_map_utf8, FoldHashMap};
 use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion, Throughput};
 use std::hint::black_box;
@@ -156,6 +157,14 @@ fn bench_conversion(c: &mut Criterion, name: &str, input: &str) {
         },
     );

+    group.bench_function(BenchmarkId::new("Casefold::index_fold", input.len()), |b| {
+        b.iter_batched(
+            || input.to_string(),
+            |s| index_fold(black_box(s)),
+            criterion::BatchSize::SmallInput,
+        );
+    });
+
     let fold_map = reference_map_utf8();
     group.bench_function(
         BenchmarkId::new("HashMap::fold_into_bytes (UTF-8 u32)", input.len()),
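
The new benchmark uses `iter_batched` because `index_fold` consumes its `String`, so every iteration needs a fresh copy whose allocation should not count toward the measurement. The std-only sketch below shows that setup/routine split in miniature. It is not how Criterion is implemented, and `stand_in_fold` and `time_batched` are hypothetical names standing in for the real fold and harness.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a fold that consumes its input, such as
/// `index_fold(String) -> Vec<u8>`. It only lowercases ASCII in place.
fn stand_in_fold(s: String) -> Vec<u8> {
    let mut bytes = s.into_bytes();
    bytes.make_ascii_lowercase();
    bytes
}

/// Times only the routine; the per-iteration `to_string()` setup and the drop
/// of the output happen outside the timed region.
fn time_batched(input: &str, iters: u32) -> Duration {
    let mut total = Duration::ZERO;
    for _ in 0..iters {
        let fresh = input.to_string(); // setup, not timed
        let start = Instant::now();
        let out = black_box(stand_in_fold(black_box(fresh)));
        total += start.elapsed();
        drop(out); // teardown, not timed
    }
    total
}

fn main() {
    let elapsed = time_batched("Hello, WORLD!", 1_000);
    println!("routine time over 1000 iterations: {:?}", elapsed);
}
```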

crates/casefold/build.rs

Lines changed: 14 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -307,9 +307,21 @@ fn emit_tables(folds: &[Fold], runs: &[Run]) -> String {
307307
.max()
308308
.unwrap_or(0);
309309

310+
// Parallel 7-bit index deltas, one per run, for `index_fold`. The fold
311+
// collapses each code point to `cp & 0x7F`; by modular arithmetic the folded
312+
// low-7-bit value is `((cp & 0x7F) + (delta & 0x7F)) & 0x7F`, so storing the
313+
// code-point delta reduced mod 128 lets `index_fold` derive the folded index
314+
// byte with one `wrapping_add` + mask — no UTF-8 reconstruction. The high
315+
// bit is added unconditionally at write time, so only 7 bits are stored.
316+
let index_deltas: Vec<u8> = runs.iter().map(|r| (r.delta & 0x7F) as u8).collect();
317+
310318
// Sanity: size accounting (printed as build warnings for visibility).
311319
let index_bytes = page_bitmap.len() * 8 + popcnt_samples.len() + page_offset.len();
312-
let total = index_bytes + run_end_low.len() + run_start_stride.len() + byte_deltas.len() * 4;
320+
let total = index_bytes
321+
+ run_end_low.len()
322+
+ run_start_stride.len()
323+
+ byte_deltas.len() * 4
324+
+ index_deltas.len();
313325
if env::var_os("CASEFOLD_BUILD_INFO").is_some() {
314326
println!(
315327
"cargo:warning=casefold table: {} fold entries, {} runs, {} populated pages, {} bytes total ({:.2} bits/entry), max |delta| = {}, max |byte_delta| = {}",
@@ -338,6 +350,7 @@ fn emit_tables(folds: &[Fold], runs: &[Run]) -> String {
     emit_u8_array(&mut s, "RUN_END_LOW", &run_end_low);
     emit_u8_array(&mut s, "RUN_START_STRIDE", &run_start_stride);
     emit_u32_array(&mut s, "BYTE_DELTA", &byte_deltas);
+    emit_u8_array(&mut s, "INDEX_DELTA", &index_deltas);

     s
 }
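
The modular-arithmetic argument in the `build.rs` comment above can be checked in isolation. The sketch below assumes a signed code-point delta per fold; `index_byte` is a hypothetical helper and not the crate's lookup path, which reads the reduced delta from `INDEX_DELTA` instead of taking it as an argument.

```rust
/// Hypothetical helper, not part of the `casefold` crate: computes the index
/// byte for a non-ASCII code point `cp` whose simple fold is `cp + delta`.
fn index_byte(cp: u32, delta: i32) -> u8 {
    // Only the delta reduced mod 128 matters; for a negative `i32` the
    // two's-complement mask yields the same residue.
    let delta7 = (delta & 0x7F) as u8;
    let low7 = (cp & 0x7F) as u8;
    // The sum can carry into bit 7; the unconditional `0x80 |` overwrites it.
    0x80 | low7.wrapping_add(delta7)
}

fn main() {
    // U+00DC (Ü) folds to U+00FC (ü): delta = +0x20.
    assert_eq!(index_byte(0x00DC, 0x20), 0xFC);

    // U+212A KELVIN SIGN folds to U+006B (k): the result is 0x80 | b'k'.
    assert_eq!(index_byte(0x212A, 0x006B - 0x212A), 0x80 | b'k');

    // Cross-check against folding first and masking afterwards.
    let pairs = [
        (0x00DCu32, 0x00FCu32), // Ü -> ü
        (0x212A, 0x006B),       // KELVIN SIGN -> k
        (0x023A, 0x2C65),       // Ⱥ -> ⱥ
        (0x017F, 0x0073),       // ſ -> s (the 7-bit sum carries into bit 7)
    ];
    for &(cp, folded) in pairs.iter() {
        let delta = folded as i32 - cp as i32;
        assert_eq!(index_byte(cp, delta), 0x80 | (folded & 0x7F) as u8);
    }
}
```

Storing only `delta & 0x7F` is enough because addition commutes with reduction mod 128, which is why the side table needs one byte per run rather than a full 32-bit delta.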
