You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
* inital version
* improve mime type detection
* prebuilt-image custom analzyer route to image
* enhance cu priority over di
* fix: apply black formatting
* update cache of known prebuilt name and README improvement
* add test cases, run black
* update readme and deriving content_type from the resolved file_type
* update readme
Copy file name to clipboardExpand all lines: README.md
+78Lines changed: 78 additions & 0 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -107,6 +107,7 @@ At the moment, the following optional dependencies are available:
107
107
*`[pdf]` Installs dependencies for PDF files
108
108
*`[outlook]` Installs dependencies for Outlook messages
109
109
*`[az-doc-intel]` Installs dependencies for Azure Document Intelligence
110
+
*`[az-content-understanding]` Installs dependencies for Azure Content Understanding
110
111
*`[audio-transcription]` Installs dependencies for audio transcription of wav and mp3 files
111
112
*`[youtube-transcription]` Installs dependencies for fetching YouTube video transcription
112
113
@@ -158,6 +159,83 @@ If no `llm_client` is provided the plugin still loads, but OCR is silently skipp
158
159
159
160
See [`packages/markitdown-ocr/README.md`](packages/markitdown-ocr/README.md) for detailed documentation.
160
161
162
+
### Azure Content Understanding
163
+
164
+
[Azure Content Understanding](https://learn.microsoft.com/azure/ai-services/content-understanding/) provides higher-quality conversion with structured field extraction (YAML front matter), multi-modal support (documents, images, audio, video), and configurable analyzers.
Content Understanding is ideal when you need capabilities beyond what built-in or Document Intelligence converters provide:
171
+
172
+
-**Audio and video files** — CU is the only option for video, and the higher-quality cloud option for audio. Built-in converters have no video support and only basic audio transcription.
173
+
-**Structured field extraction** — [Prebuilt](https://learn.microsoft.com/azure/ai-services/content-understanding/concepts/prebuilt-analyzers) or [custom-built](https://learn.microsoft.com/azure/ai-services/content-understanding/how-to/customize-analyzer-content-understanding-studio?tabs=portal) analyzers extract domain-specific fields (invoice amounts, receipt dates, contract clauses) serialized as YAML front matter. Neither built-in nor Doc Intel integration exposes fields.
174
+
-**Higher-quality document extraction** — Cloud-based layout analysis and OCR for scanned PDFs, complex tables, and multi-page documents.
175
+
-**Single API for all modalities** — One `cu_endpoint` handles documents, images, audio, and video with automatic analyzer routing.
result = md.convert("report.pdf") # documents → prebuilt-documentSearch
199
+
result = md.convert("meeting.mp4") # video → prebuilt-videoSearch
200
+
result = md.convert("call.wav") # audio → prebuilt-audioSearch
201
+
print(result.markdown)
202
+
```
203
+
204
+
**With a custom analyzer** (for domain-specific field extraction):
205
+
206
+
```python
207
+
md = MarkItDown(
208
+
cu_endpoint="<content_understanding_endpoint>",
209
+
cu_analyzer_id="my-invoice-analyzer",
210
+
)
211
+
result = md.convert("invoice.pdf")
212
+
print(result.markdown)
213
+
# Output includes YAML front matter with extracted fields:
214
+
# ---
215
+
# contentType: document
216
+
# fields:
217
+
# VendorName: CONTOSO LTD.
218
+
# InvoiceDate: '2019-11-15'
219
+
# ---
220
+
# <!-- page 1 -->
221
+
# ...
222
+
```
223
+
224
+
When `cu_analyzer_id` is set, the converter automatically scopes it to compatible file types based on the analyzer's modality. Incompatible types (e.g., audio files with a document analyzer) auto-route to default prebuilt analyzers.
225
+
226
+
**Cost note:** Each `convert()` call for a CU-routed format is a billable Azure API call. Use `cu_file_types` to restrict which formats route to CU:
227
+
228
+
```python
229
+
from markitdown.converters import ContentUnderstandingFileType
230
+
231
+
md = MarkItDown(
232
+
cu_endpoint="<content_understanding_endpoint>",
233
+
cu_file_types=[ContentUnderstandingFileType.PDF], # only PDFs use CU
234
+
)
235
+
```
236
+
237
+
More information about Azure Content Understanding can be found [here](https://learn.microsoft.com/azure/ai-services/content-understanding/).
238
+
161
239
### Azure Document Intelligence
162
240
163
241
To use Microsoft Document Intelligence for conversion:
0 commit comments