
Port 'Adjust placement of paragraph markers' from machine.py #435

Open. Copilot wants to merge 2 commits into master from copilot/port-adjust-placement-of-paragraph-markers

Copilot AI commented Jun 24, 2026

Ports machine.py#298 — after alignment-based placement of paragraph markers, apply small boundary adjustments to produce more natural splits (e.g. keeping a trailing comma with its sentence rather than letting it open the next paragraph).

New: SegmentBoundaryAdjuster

Two new classes in SegmentBoundaryAdjuster.cs:

  • TokenRejoiner — reconstructs token lists into strings with correct punctuation spacing (no space before ,/./closing quotes, no space after opening brackets/quotes).
  • SegmentBoundaryAdjuster — adjusts a segment boundary by:
    • Moving prohibited segment-starting characters (, ; . ? ! closing quotes/brackets) from the head of the next segment to the tail of the current one
    • Moving prohibited segment-ending characters (opening brackets/quotes) from the tail of the current segment to the head of the next one
    • Correcting late sentence starts (capitalized words that crossed the boundary too early)
    • Correcting early sentence ends (words + terminal punctuation that crossed the boundary too late)
    • Exposing AdjustTokenizedSegmentPairBoundaries(int boundary, IReadOnlyList<string> tokens), the token-index variant used by the handler
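
As a rough illustration of the two prohibited-character rules listed above, here is a minimal sketch. It is not the SegmentBoundaryAdjuster from this PR: the class name BoundarySketch, the method name AdjustBoundary, and both character sets are assumptions made for the example, and it assumes that boundary is the index of the first token of the next segment. The sentence-start and sentence-end corrections are left out.

```csharp
using System.Collections.Generic;

// Hypothetical sketch only; names and character sets are assumptions,
// not the implementation in SegmentBoundaryAdjuster.cs.
public static class BoundarySketch
{
    // Tokens assumed to be disallowed at the start of a segment.
    private static readonly HashSet<string> ProhibitedStarts = new HashSet<string>
    {
        ",", ";", ".", "?", "!", ")", "]", "}", "\u201D", "\u2019", "\u00BB"
    };

    // Tokens assumed to be disallowed at the end of a segment.
    private static readonly HashSet<string> ProhibitedEnds = new HashSet<string>
    {
        "(", "[", "{", "\u201C", "\u2018", "\u00AB"
    };

    // boundary is assumed to be the index of the first token of the next segment.
    public static int AdjustBoundary(int boundary, IReadOnlyList<string> tokens)
    {
        // Pull prohibited segment-starting tokens back into the current segment.
        while (boundary > 0 && boundary < tokens.Count
            && ProhibitedStarts.Contains(tokens[boundary]))
        {
            boundary++;
        }

        // Push prohibited segment-ending tokens forward into the next segment.
        while (boundary > 0 && boundary < tokens.Count
            && ProhibitedEnds.Contains(tokens[boundary - 1]))
        {
            boundary--;
        }

        return boundary;
    }
}
```

Because the two sets are disjoint, the second loop cannot undo the first: once it stops, the token at the boundary is either an opening character or an ordinary word, never a prohibited start.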

Change: PlaceMarkersUsfmUpdateBlockHandler

After PredictMarkerLocation, paragraph markers now go through AdjustTokenizedSegmentPairBoundaries before their string index is resolved:

// If inserting a paragraph marker, make small adjustments to place it in a more natural location
if (element.Type == UsfmUpdateBlockElementType.Paragraph)
{
    adjacentTargetToken = _segmentBoundaryAdjuster.AdjustTokenizedSegmentPairBoundaries(
        adjacentTargetToken,
        targetTokens
    );
}

Before: alignment places \p before , → paragraph opens with , y esta prueba…
After: comma stays in the preceding paragraph → Este texto está en inglés, / \p y esta prueba…
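
The before/after strings above can be reproduced from a token list with a small rejoining helper. The sketch below is only an approximation of what the description attributes to TokenRejoiner: the names RejoinSketch, Rejoin and Split are hypothetical, the punctuation sets are assumed, and the token list and boundary indices in the usage note are chosen to match the example rather than taken from the PR's tests.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical sketch only; not the TokenRejoiner class added in this PR.
public static class RejoinSketch
{
    // Assumed: no space is written before these tokens.
    private static readonly HashSet<string> NoSpaceBefore = new HashSet<string>
    {
        ",", ";", ".", "?", "!", ")", "]", "}", "\u201D", "\u2019"
    };

    // Assumed: no space is written after these tokens.
    private static readonly HashSet<string> NoSpaceAfter = new HashSet<string>
    {
        "(", "[", "{", "\u201C", "\u2018"
    };

    // Joins tokens into a string, skipping the space around punctuation.
    public static string Rejoin(IEnumerable<string> tokens)
    {
        var sb = new StringBuilder();
        string previous = "";
        foreach (string token in tokens)
        {
            bool needsSpace = sb.Length > 0
                && !NoSpaceBefore.Contains(token)
                && !NoSpaceAfter.Contains(previous);
            if (needsSpace)
                sb.Append(' ');
            sb.Append(token);
            previous = token;
        }
        return sb.ToString();
    }

    // Splits tokens at boundary (index of the first token of the next segment)
    // and rejoins each side separately.
    public static (string Current, string Next) Split(IReadOnlyList<string> tokens, int boundary)
    {
        return (Rejoin(tokens.Take(boundary)), Rejoin(tokens.Skip(boundary)));
    }
}
```

With the tokens Este, texto, está, en, inglés, ",", y, esta, prueba, calling Split at boundary 5 gives a next segment that opens with the comma, while boundary 6 keeps the comma at the end of the current segment, matching the Before and After lines.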



Copilot AI changed the title from "[WIP] Port relevant changes from PR #298 to machine" to "Port 'Adjust placement of paragraph markers' from machine.py" Jun 24, 2026
Copilot AI requested a review from ddaspit June 24, 2026 20:26
ddaspit marked this pull request as ready for review June 25, 2026 13:59