feat: Add Azure Content Understanding converter (#1865)

chienyuanchang · web-flow · commit a01d74dda70d · 2026-05-21T21:59:41.000-07:00
* initial version

* improve mime type detection

* route prebuilt-image custom analyzers to image

* give CU priority over DI

* fix: apply black formatting

* update cache of known prebuilt names and improve README

* add test cases, run black

* update README and derive content_type from the resolved file_type

* update readme
diff --git a/README.md b/README.md
@@ -107,6 +107,7 @@ At the moment, the following optional dependencies are available:
 * `[pdf]` Installs dependencies for PDF files
 * `[outlook]` Installs dependencies for Outlook messages
 * `[az-doc-intel]` Installs dependencies for Azure Document Intelligence
+* `[az-content-understanding]` Installs dependencies for Azure Content Understanding
 * `[audio-transcription]` Installs dependencies for audio transcription of wav and mp3 files
 * `[youtube-transcription]` Installs dependencies for fetching YouTube video transcription
 
@@ -158,6 +159,83 @@ If no `llm_client` is provided the plugin still loads, but OCR is silently skipp
 
 See [`packages/markitdown-ocr/README.md`](packages/markitdown-ocr/README.md) for detailed documentation.
 
+### Azure Content Understanding
+
+[Azure Content Understanding](https://learn.microsoft.com/azure/ai-services/content-understanding/) provides higher-quality conversion with structured field extraction (YAML front matter), multi-modal support (documents, images, audio, video), and configurable analyzers.
+
+Install: `pip install 'markitdown[az-content-understanding]'`
+
+#### When to use Content Understanding
+
+Content Understanding is ideal when you need capabilities beyond what built-in or Document Intelligence converters provide:
+
+- **Audio and video files** — CU is the only option for video, and the higher-quality cloud option for audio. Built-in converters have no video support and only basic audio transcription.
+- **Structured field extraction** — [Prebuilt](https://learn.microsoft.com/azure/ai-services/content-understanding/concepts/prebuilt-analyzers) or [custom-built](https://learn.microsoft.com/azure/ai-services/content-understanding/how-to/customize-analyzer-content-understanding-studio?tabs=portal) analyzers extract domain-specific fields (invoice amounts, receipt dates, contract clauses) serialized as YAML front matter. Neither built-in nor Doc Intel integration exposes fields.
+- **Higher-quality document extraction** — Cloud-based layout analysis and OCR for scanned PDFs, complex tables, and multi-page documents.
+- **Single API for all modalities** — One `cu_endpoint` handles documents, images, audio, and video with automatic analyzer routing.
+
+| Capability | Built-in converters | Azure Document Intelligence | Azure Content Understanding |
+|------------|---------------------|-----------------------------|-----------------------------|
+| Document conversion | Offline, format-specific extraction | Cloud layout extraction | Cloud multimodal extraction |
+| Structured fields | Not available | Not exposed by this integration | YAML front matter from analyzer fields |
+| Custom analyzers | Not available | Not configurable in this integration | Supported with `cu_analyzer_id` |
+| Audio and video | Basic audio, no video | Not supported | Audio and video analyzers |
+| Cost | Local compute only | Billable Azure API calls | Billable Azure API calls |
+
+**CLI:**
+
+```bash
+markitdown path-to-file.pdf --use-cu --cu-endpoint "<content_understanding_endpoint>"
+```
+
+**Python API:**
+
+```python
+from markitdown import MarkItDown
+
+# Zero-config — auto-selects analyzer per file type
+md = MarkItDown(cu_endpoint="<content_understanding_endpoint>")
+result = md.convert("report.pdf")   # documents → prebuilt-documentSearch
+result = md.convert("meeting.mp4")  # video → prebuilt-videoSearch
+result = md.convert("call.wav")     # audio → prebuilt-audioSearch
+print(result.markdown)
+```
+
+**With a custom analyzer** (for domain-specific field extraction):
+
+```python
+md = MarkItDown(
+    cu_endpoint="<content_understanding_endpoint>",
+    cu_analyzer_id="my-invoice-analyzer",
+)
+result = md.convert("invoice.pdf")
+print(result.markdown)
+# Output includes YAML front matter with extracted fields:
+# ---
+# contentType: document
+# fields:
+#   VendorName: CONTOSO LTD.
+#   InvoiceDate: '2019-11-15'
+# ---
+# <!-- page 1 -->
+# ...
+```
+
+When `cu_analyzer_id` is set, the converter automatically scopes it to compatible file types based on the analyzer's modality. Incompatible types (e.g., audio files with a document analyzer) auto-route to default prebuilt analyzers.
+
+**Cost note:** Each `convert()` call for a CU-routed format is a billable Azure API call. Use `cu_file_types` to restrict which formats route to CU:
+
+```python
+from markitdown.converters import ContentUnderstandingFileType
+
+md = MarkItDown(
+    cu_endpoint="<content_understanding_endpoint>",
+    cu_file_types=[ContentUnderstandingFileType.PDF],  # only PDFs use CU
+)
+```
+
+More information about Azure Content Understanding can be found [here](https://learn.microsoft.com/azure/ai-services/content-understanding/).
+
 ### Azure Document Intelligence
 
 To use Microsoft Document Intelligence for conversion:
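Reviewer note on the README hunk above: the custom-analyzer example shows extracted fields as YAML front matter at the top of `result.markdown`. A minimal sketch of separating that block from the body with plain string operations follows. `split_front_matter` is a hypothetical helper, not part of markitdown, and it assumes the front matter is delimited by `---` lines exactly as in the sample output.

```python
from typing import Tuple


def split_front_matter(markdown: str) -> Tuple[str, str]:
    """Split a '---'-delimited front matter block from the rest of the text.

    Returns (front_matter, body). front_matter is "" when the text does not
    start with a '---' line or when the block is never closed.
    """
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", markdown
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            front = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:])
            return front, body
    # Opening delimiter without a closing one: treat everything as body.
    return "", markdown


# Sample shaped like the README output; the values are illustrative only.
sample = "\n".join(
    [
        "---",
        "contentType: document",
        "fields:",
        "  VendorName: CONTOSO LTD.",
        "---",
        "<!-- page 1 -->",
        "Invoice body",
    ]
)
front, body = split_front_matter(sample)
print(front)
print(body)
```

The returned `front` string could then be handed to a YAML parser (for example PyYAML's `yaml.safe_load`, if that package is installed) to obtain the fields as a dictionary.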
diff --git a/packages/markitdown/pyproject.toml b/packages/markitdown/pyproject.toml
@@ -47,6 +47,7 @@ all = [
   "SpeechRecognition",
   "youtube-transcript-api~=1.0.0",
   "azure-ai-documentintelligence",
+  "azure-ai-contentunderstanding>=1.2.0b1",
   "azure-identity",
 ]
 pptx = ["python-pptx"]
@@ -58,6 +59,8 @@ outlook = ["olefile"]
 audio-transcription = ["pydub", "SpeechRecognition"]
 youtube-transcription = ["youtube-transcript-api"]
 az-doc-intel = ["azure-ai-documentintelligence", "azure-identity"]
+# >=1.2.0b1 required for to_llm_input() helper used by ContentUnderstandingConverter
+az-content-understanding = ["azure-ai-contentunderstanding>=1.2.0b1", "azure-identity"]
 
 [project.urls]
 Documentation = "https://github.com/microsoft/markitdown#readme"
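Reviewer note on the pyproject hunk above: because `az-content-understanding` is an optional extra, calling code may want to confirm the distribution is present before passing `cu_endpoint`. A minimal sketch using the standard library's `importlib.metadata`; `installed_version` is a hypothetical helper, not markitdown API.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version string of a distribution, or None."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


cu_version = installed_version("azure-ai-contentunderstanding")
if cu_version is None:
    print("Not installed; try: pip install 'markitdown[az-content-understanding]'")
else:
    print(f"azure-ai-contentunderstanding {cu_version} is installed")
```

This only reports presence and the raw version string. Enforcing the `>=1.2.0b1` constraint would need a version-aware comparison, which this sketch deliberately leaves out.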
diff --git a/packages/markitdown/src/markitdown/__main__.py b/packages/markitdown/src/markitdown/__main__.py
@@ -4,6 +4,7 @@
 import argparse
 import sys
 import codecs
+from typing import Any, Dict
 from textwrap import dedent
 from importlib.metadata import entry_points
 from .__about__ import __version__
@@ -77,20 +78,47 @@ def main():
         help="Provide a hint about the file's charset (e.g, UTF-8).",
     )
 
-    parser.add_argument(
+    cloud_group = parser.add_mutually_exclusive_group()
+    cloud_group.add_argument(
         "-d",
         "--use-docintel",
         action="store_true",
         help="Use Document Intelligence to extract text instead of offline conversion. Requires a valid Document Intelligence Endpoint.",
     )
 
+    cloud_group.add_argument(
+        "--use-cu",
+        "--use-content-understanding",
+        action="store_true",
+        dest="use_cu",
+        help="Use Azure Content Understanding to extract text. Requires --cu-endpoint.",
+    )
+
     parser.add_argument(
         "-e",
         "--endpoint",
         type=str,
         help="Document Intelligence Endpoint. Required if using Document Intelligence.",
     )
 
+    parser.add_argument(
+        "--cu-endpoint",
+        type=str,
+        help="Content Understanding Endpoint. Required if using --use-cu.",
+    )
+
+    parser.add_argument(
+        "--cu-analyzer",
+        type=str,
+        help="Content Understanding analyzer ID. If not specified, auto-selects by file type.",
+    )
+
+    parser.add_argument(
+        "--cu-file-types",
+        type=str,
+        help="Comma-separated list of file types to route to Content Understanding (e.g., pdf,jpeg,mp4). If omitted, all supported types are routed.",
+    )
+
     parser.add_argument(
         "-p",
         "--use-plugins",
@@ -183,6 +211,36 @@ def main():
         markitdown = MarkItDown(
             enable_plugins=args.use_plugins, docintel_endpoint=args.endpoint
         )
+    elif args.use_cu:
+        if args.cu_endpoint is None:
+            _exit_with_error(
+                "Content Understanding Endpoint (--cu-endpoint) is required when using --use-cu."
+            )
+        elif args.filename is None:
+            _exit_with_error("Filename is required when using Content Understanding.")
+
+        cu_kwargs: Dict[str, Any] = {
+            "cu_endpoint": args.cu_endpoint,
+        }
+        if args.cu_analyzer is not None:
+            cu_kwargs["cu_analyzer_id"] = args.cu_analyzer
+        if args.cu_file_types is not None:
+            # Parse comma-separated file types into ContentUnderstandingFileType list
+            from .converters import ContentUnderstandingFileType
+
+            type_names = [
+                t.strip().lower() for t in args.cu_file_types.split(",") if t.strip()
+            ]
+            cu_types = []
+            for name in type_names:
+                # Try matching by value (e.g., "pdf", "jpeg", "mp4")
+                try:
+                    cu_types.append(ContentUnderstandingFileType(name))
+                except ValueError:
+                    _exit_with_error(f"Unknown file type: {name}")
+            cu_kwargs["cu_file_types"] = cu_types
+
+        markitdown = MarkItDown(enable_plugins=args.use_plugins, **cu_kwargs)
     else:
         markitdown = MarkItDown(enable_plugins=args.use_plugins)
 
diff --git a/packages/markitdown/src/markitdown/_markitdown.py b/packages/markitdown/src/markitdown/_markitdown.py
@@ -38,6 +38,7 @@
     ZipConverter,
     EpubConverter,
     DocumentIntelligenceConverter,
+    ContentUnderstandingConverter,
     CsvConverter,
 )
 
@@ -225,6 +226,28 @@ def enable_builtins(self, **kwargs) -> None:
                     DocumentIntelligenceConverter(**docintel_args),
                 )
 
+            # Register Content Understanding converter at the top of the stack if endpoint is provided
+            cu_endpoint = kwargs.get("cu_endpoint")
+            if cu_endpoint is not None:
+                cu_args: Dict[str, Any] = {}
+                cu_args["endpoint"] = cu_endpoint
+
+                cu_credential = kwargs.get("cu_credential")
+                if cu_credential is not None:
+                    cu_args["credential"] = cu_credential
+
+                cu_analyzer_id = kwargs.get("cu_analyzer_id")
+                if cu_analyzer_id is not None:
+                    cu_args["analyzer_id"] = cu_analyzer_id
+
+                cu_file_types = kwargs.get("cu_file_types")
+                if cu_file_types is not None:
+                    cu_args["file_types"] = cu_file_types
+
+                self.register_converter(
+                    ContentUnderstandingConverter(**cu_args),
+                )
+
             self._builtins_enabled = True
         else:
             warn("Built-in converters are already enabled.", RuntimeWarning)
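Reviewer note on the `_markitdown.py` hunk above: the four `kwargs.get(...)` blocks all follow the same pattern of renaming a `cu_*` keyword and dropping it when unset. A table-driven sketch of that mapping is shown below as one possible alternative shape, not as the commit's code; `build_cu_args` and `_CU_KWARG_MAP` are hypothetical names.

```python
from typing import Any, Dict, Optional

# MarkItDown keyword -> converter constructor keyword, as used in the hunk.
_CU_KWARG_MAP = {
    "cu_endpoint": "endpoint",
    "cu_credential": "credential",
    "cu_analyzer_id": "analyzer_id",
    "cu_file_types": "file_types",
}


def build_cu_args(**kwargs: Any) -> Optional[Dict[str, Any]]:
    """Return converter kwargs, or None when no CU endpoint was given."""
    if kwargs.get("cu_endpoint") is None:
        return None
    return {
        target: kwargs[source]
        for source, target in _CU_KWARG_MAP.items()
        if kwargs.get(source) is not None
    }


# The endpoint below is a placeholder, not a real service URL.
cu_args = build_cu_args(
    cu_endpoint="https://example.invalid",
    cu_analyzer_id="my-invoice-analyzer",
    cu_file_types=None,
)
print(cu_args)
```

Keeping the mapping in one dictionary means a new `cu_*` option needs a single added entry, at the cost of being slightly less explicit than the step-by-step version in the diff.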
diff --git a/packages/markitdown/src/markitdown/converters/__init__.py b/packages/markitdown/src/markitdown/converters/__init__.py
@@ -21,6 +21,10 @@
     DocumentIntelligenceConverter,
     DocumentIntelligenceFileType,
 )
+from ._cu_converter import (
+    ContentUnderstandingConverter,
+    ContentUnderstandingFileType,
+)
 from ._epub_converter import EpubConverter
 from ._csv_converter import CsvConverter
 
@@ -43,6 +47,8 @@
     "ZipConverter",
     "DocumentIntelligenceConverter",
     "DocumentIntelligenceFileType",
+    "ContentUnderstandingConverter",
+    "ContentUnderstandingFileType",
     "EpubConverter",
     "CsvConverter",
 ]
diff --git a/packages/markitdown/src/markitdown/converters/_cu_converter.py b/packages/markitdown/src/markitdown/converters/_cu_converter.py
diff --git a/packages/markitdown/tests/test_cu_converter.py b/packages/markitdown/tests/test_cu_converter.py