
Commit a01d74d

feat: Add Azure Content Understanding converter (#1865)
* initial version
* improve mime type detection
* prebuilt-image custom analyzer route to image
* enhance cu priority over di
* fix: apply black formatting
* update cache of known prebuilt name and README improvement
* add test cases, run black
* update readme and deriving content_type from the resolved file_type
* update readme
1 parent a51f725 commit a01d74d

7 files changed

Lines changed: 1667 additions & 1 deletion


README.md

Lines changed: 78 additions & 0 deletions
@@ -107,6 +107,7 @@ At the moment, the following optional dependencies are available:
 * `[pdf]` Installs dependencies for PDF files
 * `[outlook]` Installs dependencies for Outlook messages
 * `[az-doc-intel]` Installs dependencies for Azure Document Intelligence
+* `[az-content-understanding]` Installs dependencies for Azure Content Understanding
 * `[audio-transcription]` Installs dependencies for audio transcription of wav and mp3 files
 * `[youtube-transcription]` Installs dependencies for fetching YouTube video transcription

@@ -158,6 +159,83 @@ If no `llm_client` is provided the plugin still loads, but OCR is silently skipp

 See [`packages/markitdown-ocr/README.md`](packages/markitdown-ocr/README.md) for detailed documentation.

+### Azure Content Understanding
+
+[Azure Content Understanding](https://learn.microsoft.com/azure/ai-services/content-understanding/) provides higher-quality conversion with structured field extraction (YAML front matter), multi-modal support (documents, images, audio, video), and configurable analyzers.
+
+Install: `pip install 'markitdown[az-content-understanding]'`
+
+#### When to use Content Understanding
+
+Content Understanding is ideal when you need capabilities beyond what built-in or Document Intelligence converters provide:
+
+- **Audio and video files** — CU is the only option for video, and the higher-quality cloud option for audio. Built-in converters have no video support and only basic audio transcription.
+- **Structured field extraction** — [Prebuilt](https://learn.microsoft.com/azure/ai-services/content-understanding/concepts/prebuilt-analyzers) or [custom-built](https://learn.microsoft.com/azure/ai-services/content-understanding/how-to/customize-analyzer-content-understanding-studio?tabs=portal) analyzers extract domain-specific fields (invoice amounts, receipt dates, contract clauses) serialized as YAML front matter. Neither built-in nor Doc Intel integration exposes fields.
+- **Higher-quality document extraction** — Cloud-based layout analysis and OCR for scanned PDFs, complex tables, and multi-page documents.
+- **Single API for all modalities** — One `cu_endpoint` handles documents, images, audio, and video with automatic analyzer routing.
+
+| Capability | Built-in converters | Azure Document Intelligence | Azure Content Understanding |
+|------------|---------------------|-----------------------------|-----------------------------|
+| Document conversion | Offline, format-specific extraction | Cloud layout extraction | Cloud multimodal extraction |
+| Structured fields | Not available | Not exposed by this integration | YAML front matter from analyzer fields |
+| Custom analyzers | Not available | Not configurable in this integration | Supported with `cu_analyzer_id` |
+| Audio and video | Basic audio, no video | Not supported | Audio and video analyzers |
+| Cost | Local compute only | Billable Azure API calls | Billable Azure API calls |
+
+**CLI:**
+
+```bash
+markitdown path-to-file.pdf --use-cu --cu-endpoint "<content_understanding_endpoint>"
+```
+
+**Python API:**
+
+```python
+from markitdown import MarkItDown
+
+# Zero-config — auto-selects analyzer per file type
+md = MarkItDown(cu_endpoint="<content_understanding_endpoint>")
+result = md.convert("report.pdf") # documents → prebuilt-documentSearch
+result = md.convert("meeting.mp4") # video → prebuilt-videoSearch
+result = md.convert("call.wav") # audio → prebuilt-audioSearch
+print(result.markdown)
+```
+
+**With a custom analyzer** (for domain-specific field extraction):
+
+```python
+md = MarkItDown(
+    cu_endpoint="<content_understanding_endpoint>",
+    cu_analyzer_id="my-invoice-analyzer",
+)
+result = md.convert("invoice.pdf")
+print(result.markdown)
+# Output includes YAML front matter with extracted fields:
+# ---
+# contentType: document
+# fields:
+#   VendorName: CONTOSO LTD.
+#   InvoiceDate: '2019-11-15'
+# ---
+# <!-- page 1 -->
+# ...
+```
+
+When `cu_analyzer_id` is set, the converter automatically scopes it to compatible file types based on the analyzer's modality. Incompatible types (e.g., audio files with a document analyzer) auto-route to default prebuilt analyzers.
+
+**Cost note:** Each `convert()` call for a CU-routed format is a billable Azure API call. Use `cu_file_types` to restrict which formats route to CU:
+
+```python
+from markitdown.converters import ContentUnderstandingFileType
+
+md = MarkItDown(
+    cu_endpoint="<content_understanding_endpoint>",
+    cu_file_types=[ContentUnderstandingFileType.PDF], # only PDFs use CU
+)
+```
+
+More information about Azure Content Understanding can be found [here](https://learn.microsoft.com/azure/ai-services/content-understanding/).
+
 ### Azure Document Intelligence

 To use Microsoft Document Intelligence for conversion:
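
Note on the README change: the custom-analyzer example says extracted fields arrive as YAML front matter at the top of `result.markdown`. A minimal sketch of separating that front matter from the Markdown body, assuming the block is delimited by `---` lines as in the sample output. `split_front_matter` is a hypothetical helper, not part of markitdown, and the sample string stands in for a real conversion result:

```python
from typing import Optional, Tuple


def split_front_matter(markdown: str) -> Tuple[Optional[str], str]:
    """Split a leading '---' delimited block from the rest of the text.

    Hypothetical helper, not part of markitdown. Returns (front_matter, body);
    front_matter is None when the text does not start with a '---' line.
    """
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return None, markdown
    # Look for the closing delimiter after the opening one.
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            front_matter = "\n".join(lines[1:index])
            body = "\n".join(lines[index + 1:])
            return front_matter, body
    # No closing delimiter: treat the whole text as body.
    return None, markdown


# Sample text shaped like the README's example output (not a real API response).
sample = """---
contentType: document
fields:
  VendorName: CONTOSO LTD.
  InvoiceDate: '2019-11-15'
---
<!-- page 1 -->
Invoice body text
"""

front_matter, body = split_front_matter(sample)
```

The front matter string can then be passed to a YAML parser, for example PyYAML's `yaml.safe_load` if that package is installed.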

packages/markitdown/pyproject.toml

Lines changed: 3 additions & 0 deletions
@@ -47,6 +47,7 @@ all = [
   "SpeechRecognition",
   "youtube-transcript-api~=1.0.0",
   "azure-ai-documentintelligence",
+  "azure-ai-contentunderstanding>=1.2.0b1",
   "azure-identity",
 ]
 pptx = ["python-pptx"]
@@ -58,6 +59,8 @@ outlook = ["olefile"]
 audio-transcription = ["pydub", "SpeechRecognition"]
 youtube-transcription = ["youtube-transcript-api"]
 az-doc-intel = ["azure-ai-documentintelligence", "azure-identity"]
+# >=1.2.0b1 required for to_llm_input() helper used by ContentUnderstandingConverter
+az-content-understanding = ["azure-ai-contentunderstanding>=1.2.0b1", "azure-identity"]

 [project.urls]
 Documentation = "https://github.com/microsoft/markitdown#readme"
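
Note on the new extra: because `az-content-understanding` is optional, calling code may want to check whether the dependency is present before passing `cu_endpoint`. A minimal sketch, assuming the distribution is importable as `azure.ai.contentunderstanding` (an assumption based on Azure SDK naming, not confirmed by this diff); `has_module` is a hypothetical helper:

```python
import importlib.util


def has_module(module_name: str) -> bool:
    """Return True if module_name can be located.

    The target module itself is not imported, although find_spec does
    import parent packages of a dotted name.
    """
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package of a dotted name is missing.
        return False


# Assumed import path for the azure-ai-contentunderstanding distribution.
CU_MODULE = "azure.ai.contentunderstanding"

if has_module(CU_MODULE):
    message = "Content Understanding support is installed."
else:
    message = "Install it with: pip install 'markitdown[az-content-understanding]'"
```

This only checks for presence; it does not verify the `>=1.2.0b1` version constraint.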

packages/markitdown/src/markitdown/__main__.py

Lines changed: 59 additions & 1 deletion
@@ -4,6 +4,7 @@
 import argparse
 import sys
 import codecs
+from typing import Any, Dict
 from textwrap import dedent
 from importlib.metadata import entry_points
 from .__about__ import __version__
@@ -77,20 +78,47 @@ def main():
         help="Provide a hint about the file's charset (e.g, UTF-8).",
     )

-    parser.add_argument(
+    cloud_group = parser.add_mutually_exclusive_group()
+    cloud_group.add_argument(
         "-d",
         "--use-docintel",
         action="store_true",
         help="Use Document Intelligence to extract text instead of offline conversion. Requires a valid Document Intelligence Endpoint.",
     )

+    cloud_group.add_argument(
+        "--use-cu",
+        "--use-content-understanding",
+        action="store_true",
+        dest="use_cu",
+        help="Use Azure Content Understanding to extract text. Requires --cu-endpoint.",
+    )
+
     parser.add_argument(
         "-e",
         "--endpoint",
         type=str,
         help="Document Intelligence Endpoint. Required if using Document Intelligence.",
     )

+    parser.add_argument(
+        "--cu-endpoint",
+        type=str,
+        help="Content Understanding Endpoint. Required if using --use-cu.",
+    )
+
+    parser.add_argument(
+        "--cu-analyzer",
+        type=str,
+        help="Content Understanding analyzer ID. If not specified, auto-selects by file type.",
+    )
+
+    parser.add_argument(
+        "--cu-file-types",
+        type=str,
+        help="Comma-separated list of file types to route to Content Understanding (e.g., pdf,jpeg,mp4). If omitted, all supported types are routed.",
+    )
+
     parser.add_argument(
         "-p",
         "--use-plugins",
@@ -183,6 +211,36 @@ def main():
         markitdown = MarkItDown(
             enable_plugins=args.use_plugins, docintel_endpoint=args.endpoint
         )
+    elif args.use_cu:
+        if args.cu_endpoint is None:
+            _exit_with_error(
+                "Content Understanding Endpoint (--cu-endpoint) is required when using --use-cu."
+            )
+        elif args.filename is None:
+            _exit_with_error("Filename is required when using Content Understanding.")
+
+        cu_kwargs: Dict[str, Any] = {
+            "cu_endpoint": args.cu_endpoint,
+        }
+        if args.cu_analyzer is not None:
+            cu_kwargs["cu_analyzer_id"] = args.cu_analyzer
+        if args.cu_file_types is not None:
+            # Parse comma-separated file types into ContentUnderstandingFileType list
+            from .converters import ContentUnderstandingFileType
+
+            type_names = [
+                t.strip().lower() for t in args.cu_file_types.split(",") if t.strip()
+            ]
+            cu_types = []
+            for name in type_names:
+                # Try matching by value (e.g., "pdf", "jpeg", "mp4")
+                try:
+                    cu_types.append(ContentUnderstandingFileType(name))
+                except ValueError:
+                    _exit_with_error(f"Unknown file type: {name}")
+            cu_kwargs["cu_file_types"] = cu_types
+
+        markitdown = MarkItDown(enable_plugins=args.use_plugins, **cu_kwargs)
     else:
         markitdown = MarkItDown(enable_plugins=args.use_plugins)
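
Note on the CLI change: the two behaviors introduced here are that `--use-docintel` and `--use-cu` now conflict, and that `--cu-file-types` is split on commas and matched against enum values. A standalone sketch of both, runnable without markitdown installed. `FileType` is a hypothetical stand-in for `ContentUnderstandingFileType`, whose real members are not shown in this view; the three values come from the help text's example:

```python
import argparse
from enum import Enum
from typing import List


class FileType(str, Enum):
    """Hypothetical stand-in for ContentUnderstandingFileType."""

    PDF = "pdf"
    JPEG = "jpeg"
    MP4 = "mp4"


def parse_file_types(raw: str) -> List[FileType]:
    """Turn 'pdf, JPEG,mp4' into enum members; raises ValueError on unknown names."""
    names = [part.strip().lower() for part in raw.split(",") if part.strip()]
    return [FileType(name) for name in names]


def build_parser() -> argparse.ArgumentParser:
    """Build a reduced parser with only the options relevant to this sketch."""
    parser = argparse.ArgumentParser(prog="sketch")
    # Options in the same mutually exclusive group cannot be combined.
    cloud_group = parser.add_mutually_exclusive_group()
    cloud_group.add_argument("-d", "--use-docintel", action="store_true")
    cloud_group.add_argument(
        "--use-cu", "--use-content-understanding", action="store_true", dest="use_cu"
    )
    parser.add_argument("--cu-file-types", type=str)
    return parser


args = build_parser().parse_args(["--use-cu", "--cu-file-types", "pdf, JPEG,mp4"])
selected = parse_file_types(args.cu_file_types)
```

Passing both `-d` and `--use-cu` to this parser makes argparse exit with a usage error, which is the effect of moving `--use-docintel` into the group.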

packages/markitdown/src/markitdown/_markitdown.py

Lines changed: 23 additions & 0 deletions
@@ -38,6 +38,7 @@
     ZipConverter,
     EpubConverter,
     DocumentIntelligenceConverter,
+    ContentUnderstandingConverter,
     CsvConverter,
 )

@@ -225,6 +226,28 @@ def enable_builtins(self, **kwargs) -> None:
                     DocumentIntelligenceConverter(**docintel_args),
                 )

+            # Register Content Understanding converter at the top of the stack if endpoint is provided
+            cu_endpoint = kwargs.get("cu_endpoint")
+            if cu_endpoint is not None:
+                cu_args: Dict[str, Any] = {}
+                cu_args["endpoint"] = cu_endpoint
+
+                cu_credential = kwargs.get("cu_credential")
+                if cu_credential is not None:
+                    cu_args["credential"] = cu_credential
+
+                cu_analyzer_id = kwargs.get("cu_analyzer_id")
+                if cu_analyzer_id is not None:
+                    cu_args["analyzer_id"] = cu_analyzer_id
+
+                cu_file_types = kwargs.get("cu_file_types")
+                if cu_file_types is not None:
+                    cu_args["file_types"] = cu_file_types
+
+                self.register_converter(
+                    ContentUnderstandingConverter(**cu_args),
+                )
+
             self._builtins_enabled = True
         else:
             warn("Built-in converters are already enabled.", RuntimeWarning)
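
Note on `enable_builtins`: the added block forwards each `cu_*` keyword to the converter under a shorter name and skips any that are unset, so the converter's own defaults apply. The same pattern can be sketched as a table-driven helper. `build_cu_args` and `_CU_KWARG_MAP` are hypothetical names for illustration; the key names are read from the diff above:

```python
from typing import Any, Dict

# Mapping from MarkItDown keyword names to converter argument names,
# as read from the diff above.
_CU_KWARG_MAP = {
    "cu_endpoint": "endpoint",
    "cu_credential": "credential",
    "cu_analyzer_id": "analyzer_id",
    "cu_file_types": "file_types",
}


def build_cu_args(**kwargs: Any) -> Dict[str, Any]:
    """Hypothetical helper: collect converter arguments, skipping unset ones.

    Returns an empty dict when no endpoint is given, mirroring the diff's
    behavior of not registering the converter in that case.
    """
    if kwargs.get("cu_endpoint") is None:
        return {}
    return {
        target: kwargs[source]
        for source, target in _CU_KWARG_MAP.items()
        if kwargs.get(source) is not None
    }


cu_args = build_cu_args(
    cu_endpoint="https://example.invalid",  # placeholder endpoint
    cu_analyzer_id="my-invoice-analyzer",
    enable_plugins=False,  # unrelated keyword, ignored
)
```

Whether a table is clearer than four explicit `if` blocks is a style choice; the explicit form in the commit matches the existing Document Intelligence code next to it.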

packages/markitdown/src/markitdown/converters/__init__.py

Lines changed: 6 additions & 0 deletions
@@ -21,6 +21,10 @@
     DocumentIntelligenceConverter,
     DocumentIntelligenceFileType,
 )
+from ._cu_converter import (
+    ContentUnderstandingConverter,
+    ContentUnderstandingFileType,
+)
 from ._epub_converter import EpubConverter
 from ._csv_converter import CsvConverter

@@ -43,6 +47,8 @@
     "ZipConverter",
     "DocumentIntelligenceConverter",
     "DocumentIntelligenceFileType",
+    "ContentUnderstandingConverter",
+    "ContentUnderstandingFileType",
     "EpubConverter",
     "CsvConverter",
 ]
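
Note on analyzer routing: the `_cu_converter` module imported above is not included in this view, so its routing logic cannot be quoted. The README's zero-config example does describe the observable result: documents go to `prebuilt-documentSearch`, video to `prebuilt-videoSearch`, and audio to `prebuilt-audioSearch`. A rough sketch of that kind of dispatch, assuming routing by guessed MIME type; `pick_default_analyzer` is hypothetical, only PDF is treated as a document here, and images are omitted because their default analyzer is not named in the text:

```python
import mimetypes
from typing import Optional


def pick_default_analyzer(filename: str) -> Optional[str]:
    """Hypothetical router: choose an analyzer from the guessed MIME type.

    Analyzer names are taken from the README comments in this commit;
    the dispatch logic is an assumption made for illustration.
    """
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None:
        return None
    if mime_type == "application/pdf":
        return "prebuilt-documentSearch"
    major_type = mime_type.split("/", 1)[0]
    if major_type == "video":
        return "prebuilt-videoSearch"
    if major_type == "audio":
        return "prebuilt-audioSearch"
    # Anything else is left to other converters in this sketch.
    return None


choices = {
    name: pick_default_analyzer(name)
    for name in ("report.pdf", "meeting.mp4", "call.wav")
}
```

The commit message mentions improved MIME type detection, so the real converter likely does more than extension-based guessing.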
