Skip to content

fix: consecutive partial numbers wrongly merged + ipynb string source loses title#2113

Open
Sahilalgo8 wants to merge 1 commit into
microsoft:mainfrom
Sahilalgo8:fix/consecutive-partial-numbering-and-ipynb-source-string
Open

fix: consecutive partial numbers wrongly merged + ipynb string source loses title#2113
Sahilalgo8 wants to merge 1 commit into
microsoft:mainfrom
Sahilalgo8:fix/consecutive-partial-numbering-and-ipynb-source-string

Conversation

@Sahilalgo8

Copy link
Copy Markdown

Summary

Two unreported bugs found by code audit — neither appears in any existing issue or PR.

Bug 1: _merge_partial_numbering_lines() merges consecutive partial numbers together

File: packages/markitdown/src/markitdown/converters/_pdf_converter.py

When a PDF has two partial numbers on consecutive lines (e.g. .1 immediately followed by .2), the function merges .1 with .2, producing the nonsense token .1 .2 and assigning all subsequent text to the wrong numbered items.

Example — input:
.1 .2 Contractor shall furnish all materials. .3 Work shall comply with local codes.
Buggy output:
.1 .2 Contractor shall furnish all materials. .3 Work shall comply with local codes.

Root cause: Line 47 merges the current partial number with the next non-empty line, but never checks if that next line is itself a partial number.

Fix: One guard added — skip merging when the next non-empty line also matches PARTIAL_NUMBERING_PATTERN.


Bug 2: IpynbConverter silently loses document title when cell source is a string

File: packages/markitdown/src/markitdown/converters/_ipynb_converter.py

The nbformat spec allows cell source to be either a list of strings or a plain string. When source is a string, for line in source_lines iterates character-by-character, so line.startswith('# ') never matches and result.title is always None — silently, with no error raised.

Example:
`python

source as list → title = 'My Report' ✓

'source': ['# My Report\n', '\n', 'Content']

source as string → title = None ✗ (same content, different format)

'source': '# My Report\n\nContent'
`

Fix: Normalise string source to a list via splitlines(keepends=True) before processing.


Tests

9 new tests in tests/test_bug_fixes.py covering both fixes and edge cases (consecutive numbers, mixed patterns, string vs list source parity, no-heading case).

@Sahilalgo8

Copy link
Copy Markdown
Author

@microsoft-github-policy-service agree

@ericsondea-collab ericsondea-collab left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ericson de almeirda teixeira

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants