fix: consecutive partial numbers wrongly merged + ipynb string source loses title by Sahilalgo8 · Pull Request #2113 · microsoft/markitdown

Sahilalgo8 · 2026-06-12T10:29:57Z

Summary

Two unreported bugs found by code audit — neither appears in any existing issue or PR.

Bug 1: `_merge_partial_numbering_lines()` merges consecutive partial numbers together

File: packages/markitdown/src/markitdown/converters/_pdf_converter.py

When a PDF has two partial numbers on consecutive lines (e.g. .1 immediately followed by .2), the function merges .1 with .2, producing the nonsense token .1 .2 and assigning all subsequent text to the wrong numbered items.

Example — input:
.1 .2 Contractor shall furnish all materials. .3 Work shall comply with local codes.
Buggy output:
.1 .2 Contractor shall furnish all materials. .3 Work shall comply with local codes.

Root cause: Line 47 merges the current partial number with the next non-empty line, but never checks if that next line is itself a partial number.

Fix: One guard added — skip merging when the next non-empty line also matches PARTIAL_NUMBERING_PATTERN.
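
A minimal sketch of the guard, for illustration only. This is not the code in `_pdf_converter.py`: the regex behind `PARTIAL_NUMBERING_PATTERN` and the function body below are assumptions, written just to show how skipping the merge when the next non-empty line is itself a partial number keeps `.1` and `.2` separate.

```python
import re

# Assumption: a "partial number" is a line holding only a dot and digits, e.g. ".1".
# The real PARTIAL_NUMBERING_PATTERN in markitdown may differ.
PARTIAL_NUMBERING_PATTERN = re.compile(r"^\.\d+$")


def merge_partial_numbering_lines(text: str) -> str:
    """Hypothetical re-implementation: join a lone partial number with its text."""
    lines = text.split("\n")
    result = []
    i = 0
    while i < len(lines):
        stripped = lines[i].strip()
        if PARTIAL_NUMBERING_PATTERN.match(stripped):
            # Locate the next non-empty line.
            j = i + 1
            while j < len(lines) and not lines[j].strip():
                j += 1
            if j < len(lines):
                next_line = lines[j].strip()
                # Guard: do not merge when the next line is also a partial number.
                if not PARTIAL_NUMBERING_PATTERN.match(next_line):
                    result.append(f"{stripped} {next_line}")
                    i = j + 1
                    continue
        # Not a partial number, or nothing suitable to merge with: keep the line.
        result.append(lines[i])
        i += 1
    return "\n".join(result)
```

With the guard in place, `.1` stays on its own line and `.2` picks up the text that follows it; without the guard, the same loop would emit `.1 .2` as a single token.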

Bug 2: `IpynbConverter` silently loses document title when cell `source` is a string

File: packages/markitdown/src/markitdown/converters/_ipynb_converter.py

The nbformat spec allows cell `source` to be either a list of strings or a plain string. When `source` is a string, `for line in source_lines` iterates character by character, so `line.startswith('# ')` never matches and `result.title` is always `None`, silently, with no error raised.

Example:
```python
# source as list → title = 'My Report' ✓
'source': ['# My Report\n', '\n', 'Content']

# source as string → title = None ✗ (same content, different format)
'source': '# My Report\n\nContent'
```

Fix: Normalise string source to a list via splitlines(keepends=True) before processing.
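
The normalisation can be sketched as follows. The helper name `extract_title` and its heading-stripping logic are hypothetical and stand in for the relevant part of `IpynbConverter`; only the `splitlines(keepends=True)` step comes from the fix described above.

```python
from typing import List, Optional, Union


def extract_title(source: Union[str, List[str]]) -> Optional[str]:
    """Hypothetical helper: return the first '# ' heading in a cell's source."""
    # nbformat permits a plain string; split it into lines so the loop below
    # sees whole lines instead of single characters.
    if isinstance(source, str):
        source_lines = source.splitlines(keepends=True)
    else:
        source_lines = source

    for line in source_lines:
        if line.startswith("# "):
            # Drop the '# ' prefix and the trailing newline.
            return line[2:].strip()
    return None


as_list = extract_title(["# My Report\n", "\n", "Content"])
as_string = extract_title("# My Report\n\nContent")
```

Both calls now go through the same line-based loop, so the list and string forms of the same cell yield the same title.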

Tests

9 new tests in tests/test_bug_fixes.py covering both fixes and edge cases (consecutive numbers, mixed patterns, string vs list source parity, no-heading case).


Sahilalgo8 · 2026-06-12T10:34:44Z

@microsoft-github-policy-service agree

ericsondea-collab

ericson de almeirda teixeira

fix: consecutive partial numbers merging and ipynb string source titl…

c676647

…e loss

This was referenced Jun 12, 2026

bug: consecutive partial numbers (.1 followed by .2) wrongly merged into '.1 .2' #2114

Open

bug: IpynbConverter loses document title when cell source is a string instead of list #2115

Open

ericsondea-collab reviewed Jun 12, 2026

Sahilalgo8 wants to merge 1 commit into microsoft:main from Sahilalgo8:fix/consecutive-partial-numbering-and-ipynb-source-string
