Skip to content

Do not retry on certification errors#211

Open
ashwgit wants to merge 1 commit into
release-engineering:mainfrom
ashwinifork:fail-on-cert-error
Open

Do not retry on certification errors#211
ashwgit wants to merge 1 commit into
release-engineering:mainfrom
ashwinifork:fail-on-cert-error

Conversation

@ashwgit

@ashwgit ashwgit commented Jun 18, 2026

Copy link
Copy Markdown
Collaborator

@lslebodn @JAVGan PTAL. This one prevents cloudpub from retrying when there's a certification error.

Summary by Sourcery

Handle Azure certification failures as non-retriable errors and adjust error modeling accordingly.

Bug Fixes:

  • Prevent Azure job polling and publish operations from retrying when failures are caused by certification errors.

Enhancements:

  • Model Azure configure job errors as structured dictionaries and introduce helpers to detect certification-related failures.
  • Introduce a dedicated CertificationError type and integrate it into Azure job status and publish flows to surface clearer failure reasons.

Tests:

  • Extend Azure service and utils tests with fixtures and cases covering certification error detection, non-retry behavior, and updated error formats.

@sourcery-ai

sourcery-ai Bot commented Jun 18, 2026

Copy link
Copy Markdown

Reviewer's Guide

Introduce explicit handling of Azure Marketplace certification failures by detecting structured certification errors in job responses, preventing retries for these permanent failures, and updating models and tests to work with structured error objects instead of plain strings.

Sequence diagram for Azure publish flow without retries on certification errors

sequenceDiagram
    actor Caller
    participant RetryWrapper
    participant AzureService as MsAzureService
    participant AzureAPI as AzureMarketplace

    Caller->>RetryWrapper: _publish_live(product, product_name)
    RetryWrapper->>AzureService: _publish_live(product, product_name)
    AzureService->>AzureAPI: submit_to_status(product_id, status='live')
    AzureAPI-->>AzureService: ConfigureStatus(job_result='failed', errors)
    AzureService->>AzureService: is_certification_error(errors)
    alt [errors indicate certification failure]
        AzureService->>RetryWrapper: raise CertificationError
        RetryWrapper-->>Caller: propagate CertificationError (no retry)
    else [non certification failure]
        AzureService->>RetryWrapper: raise RuntimeError
        RetryWrapper->>AzureService: retry _publish_live(product, product_name)
    end
Loading

File-Level Changes

Change Details Files
Detect Azure certification errors from structured job error payloads and expose a helper API for callers.
  • Add a recursive helper _contains_certification_error to inspect error dicts and nested details for invalidState certification messages.
  • Add is_certification_error(errors) utility that returns True when any error indicates a certification failure.
  • Add unit test coverage for is_certification_error with certification and non-certification error inputs.
cloudpub/ms_azure/utils.py
tests/ms_azure/test_utils.py
Prevent retries and raise a dedicated CertificationError when Azure jobs fail due to certification issues during status polling and publish operations.
  • Extend query_job_status to raise CertificationError instead of InvalidStateError when job_details.errors indicate certification failure.
  • Update _publish_preview and _publish_live to stop joining errors as strings, use the structured errors list in failure messages, and raise CertificationError on certification failures.
  • Configure tenacity retry decorators on _publish_preview and _publish_live to not retry when CertificationError is raised.
  • Adjust publish preview/live failure tests to work with structured error objects and to assert behavior with and without certification errors, including no-retry semantics and expected exception types.
cloudpub/ms_azure/service.py
tests/ms_azure/test_service.py
Represent Azure configure job errors as structured objects and introduce a dedicated CertificationError exception type.
  • Change ConfigureStatus.errors from List[str] to List[Dict[str, Any]] to align with Azure error payloads and support nested error details.
  • Introduce CertificationError as a subclass of InvalidStateError to model permanent certification failures.
  • Add fixtures for certification-style error payloads used across tests.
cloudpub/models/ms_azure.py
cloudpub/error.py
tests/ms_azure/conftest.py

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 4 issues, and left some high level feedback:

  • There is a leftover print("errors : ", res.errors) in _publish_preview; consider removing or replacing it with proper logging to avoid noisy stdout in production.
  • Now that ConfigureStatus.errors is a List[Dict[str, Any]], error messages in _publish_preview and _publish_live rely on the raw list str() representation; consider adding a small formatter to produce clearer, stable error strings instead of exposing the raw structure.
  • In _publish_preview you use self._raise_error(CertificationError, ...) while _publish_live directly raises CertificationError; consider using a consistent error-raising path in both methods for easier handling and logging.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- There is a leftover `print("errors : ", res.errors)` in `_publish_preview`; consider removing or replacing it with proper logging to avoid noisy stdout in production.
- Now that `ConfigureStatus.errors` is a `List[Dict[str, Any]]`, error messages in `_publish_preview` and `_publish_live` rely on the raw list `str()` representation; consider adding a small formatter to produce clearer, stable error strings instead of exposing the raw structure.
- In `_publish_preview` you use `self._raise_error(CertificationError, ...)` while `_publish_live` directly raises `CertificationError`; consider using a consistent error-raising path in both methods for easier handling and logging.

## Individual Comments

### Comment 1
<location path="cloudpub/ms_azure/service.py" line_range="201-202" />
<code_context>
         """
         job_details = self._query_job_details(job_id=job_id)
         if job_details.job_result == "failed":
             error_message = f"Job {job_id} failed: \n{job_details.errors}"
+            if is_certification_error(job_details.errors):
+                self._raise_error(CertificationError, error_message)
             self._raise_error(InvalidStateError, error_message)
</code_context>
<issue_to_address>
**issue:** Using the raw list of error dicts in the failure message may hurt readability and regress prior behavior.

`job_details.errors` is now a list of dicts, so interpolating it directly will show the raw Python dicts instead of a readable message and may change existing logs/UX. Consider converting the list to a human-readable string (e.g., extracting the `message` fields) before formatting, and align this behavior with any other call sites that include `errors` in messages.
</issue_to_address>

### Comment 2
<location path="cloudpub/ms_azure/service.py" line_range="729-735" />
<code_context>
             product.id, state="preview"
         ):
-            errors = "\n".join(res.errors)
+            errors = res.errors
+            print("errors : ", res.errors)
             failure_msg = (
                 f"Failed to submit the product {product_name} ({product.id}) to preview. "
                 f"Status: {res.job_result} Errors: {errors}"
             )
+            if is_certification_error(res.errors):
+                self._raise_error(CertificationError, failure_msg)
             raise RuntimeError(failure_msg)
</code_context>
<issue_to_address>
**issue:** The plain `print` call and raw `errors` formatting are inconsistent with the surrounding logging behavior.

This block now uses `print("errors : ", res.errors)` and includes the raw `errors` list (list of dicts) directly in the message. Please avoid `print` in this context and use the existing logging approach (e.g., `log.*`) or rely on `failure_msg`. If you need to surface structured error details, consider logging a JSON-serialized version or selected key fields instead of the full list of dicts.
</issue_to_address>

### Comment 3
<location path="cloudpub/ms_azure/service.py" line_range="735-744" />
<code_context>
                 f"Failed to submit the product {product_name} ({product.id}) to preview. "
                 f"Status: {res.job_result} Errors: {errors}"
             )
+            if is_certification_error(res.errors):
+                self._raise_error(CertificationError, failure_msg)
             raise RuntimeError(failure_msg)
</code_context>
<issue_to_address>
**suggestion:** Use of `self._raise_error` for `CertificationError` would align behavior with other failure paths.

`query_job_status` uses `self._raise_error(CertificationError, error_message)`, but here `_publish_live` calls `CertificationError` via `_raise_error` as well. To keep error handling consistent and ensure any shared logic in `_raise_error` (logging, metrics, error mapping) always applies, use `self._raise_error(CertificationError, failure_msg)` instead of raising `CertificationError` directly.

Suggested implementation:

```python
            if is_certification_error(res.errors):
                self._raise_error(CertificationError, failure_msg)
            raise RuntimeError(failure_msg)

```

```python
    def _publish_live(self, product: Product, product_name: str) -> None:
        res = self.submit_to_status(product_id=product.id, status='live')

        # Example structure of the failure handling logic that should
        # now be using `self._raise_error` for CertificationError:
        if res.job_result != 'succeeded' or not self.get_submission_state(
            product.id, state="live"
        ):
            errors = res.errors
            failure_msg = (
                f"Failed to submit the product {product_name} ({product.id}) to live. "
                f"Status: {res.job_result} Errors: {errors}"
            )
            if is_certification_error(res.errors):
                self._raise_error(CertificationError, failure_msg)
            raise RuntimeError(failure_msg)

```

I only see the beginning of `_publish_live` in your snippet, so you will need to adjust the replacement block to match your existing implementation. Specifically:

1. Locate the existing error-handling code inside `_publish_live` (likely similar to the preview path) that currently does something like:
   ```python
   if is_certification_error(res.errors):
       raise CertificationError(failure_msg)
   ```
2. Replace that `raise CertificationError(...)` with:
   ```python
   self._raise_error(CertificationError, failure_msg)
   ```
3. Ensure the `failure_msg` variable name and the condition (`res.job_result`, `get_submission_state`, etc.) match your actual implementation; the block I provided is a template to show how to integrate `_raise_error`, not a strict structural change to your business logic.

The key behavioral change is: all `CertificationError` raises in `_publish_live` should now go through `self._raise_error`.
</issue_to_address>

### Comment 4
<location path="tests/ms_azure/test_service.py" line_range="989-994" />
<code_context>
         azure_service._publish_preview.retry.sleep = mock.Mock()  # type: ignore
         expected_err = (
             f"Failed to submit the product test-product \\({product_obj.id}\\) to preview. "
-            "Status: failed Errors: failure1\nfailure2"
</code_context>
<issue_to_address>
**suggestion (testing):** Tighten the regex assertion to verify the actual error payload rather than using a broad wildcard

In `test_publish_live_fail_on_retry`, the regex `Status: failed Errors: *` is too permissive and will match almost any error content. Since `ConfigureStatus.errors` is now a structured list of dicts, make the assertion stricter by matching the serialized errors (e.g., using `re.escape(str(err_resp.errors))`, as in the preview test) or by including a specific portion of the error structure in the pattern so regressions in formatting or type are caught.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment on lines 201 to +202
error_message = f"Job {job_id} failed: \n{job_details.errors}"
if is_certification_error(job_details.errors):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: Using the raw list of error dicts in the failure message may hurt readability and regress prior behavior.

job_details.errors is now a list of dicts, so interpolating it directly will show the raw Python dicts instead of a readable message and may change existing logs/UX. Consider converting the list to a human-readable string (e.g., extracting the message fields) before formatting, and align this behavior with any other call sites that include errors in messages.

Comment on lines +729 to +735
errors = res.errors
print("errors : ", res.errors)
failure_msg = (
f"Failed to submit the product {product_name} ({product.id}) to preview. "
f"Status: {res.job_result} Errors: {errors}"
)
if is_certification_error(res.errors):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: The plain print call and raw errors formatting are inconsistent with the surrounding logging behavior.

This block now uses print("errors : ", res.errors) and includes the raw errors list (list of dicts) directly in the message. Please avoid print in this context and use the existing logging approach (e.g., log.*) or rely on failure_msg. If you need to surface structured error details, consider logging a JSON-serialized version or selected key fields instead of the full list of dicts.

Comment on lines +735 to 744
if is_certification_error(res.errors):
self._raise_error(CertificationError, failure_msg)
raise RuntimeError(failure_msg)

@retry(
wait=wait_fixed(wait=60),
stop=stop_after_attempt(3),
retry=retry_if_not_exception_type(CertificationError),
reraise=True,
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Use of self._raise_error for CertificationError would align behavior with other failure paths.

query_job_status uses self._raise_error(CertificationError, error_message), but here _publish_live calls CertificationError via _raise_error as well. To keep error handling consistent and ensure any shared logic in _raise_error (logging, metrics, error mapping) always applies, use self._raise_error(CertificationError, failure_msg) instead of raising CertificationError directly.

Suggested implementation:

            if is_certification_error(res.errors):
                self._raise_error(CertificationError, failure_msg)
            raise RuntimeError(failure_msg)
    def _publish_live(self, product: Product, product_name: str) -> None:
        res = self.submit_to_status(product_id=product.id, status='live')

        # Example structure of the failure handling logic that should
        # now be using `self._raise_error` for CertificationError:
        if res.job_result != 'succeeded' or not self.get_submission_state(
            product.id, state="live"
        ):
            errors = res.errors
            failure_msg = (
                f"Failed to submit the product {product_name} ({product.id}) to live. "
                f"Status: {res.job_result} Errors: {errors}"
            )
            if is_certification_error(res.errors):
                self._raise_error(CertificationError, failure_msg)
            raise RuntimeError(failure_msg)

I only see the beginning of _publish_live in your snippet, so you will need to adjust the replacement block to match your existing implementation. Specifically:

  1. Locate the existing error-handling code inside _publish_live (likely similar to the preview path) that currently does something like:
    if is_certification_error(res.errors):
        raise CertificationError(failure_msg)
  2. Replace that raise CertificationError(...) with:
    self._raise_error(CertificationError, failure_msg)
  3. Ensure the failure_msg variable name and the condition (res.job_result, get_submission_state, etc.) match your actual implementation; the block I provided is a template to show how to integrate _raise_error, not a strict structural change to your business logic.

The key behavioral change is: all CertificationError raises in _publish_live should now go through self._raise_error.

Comment on lines 989 to 994
expected_err = (
f"Failed to submit the product test-product \\({product_obj.id}\\) to preview. "
"Status: failed Errors: failure1\nfailure2"
f"Status: failed Errors: {re.escape(str(err_resp.errors))}"
)

# Test

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (testing): Tighten the regex assertion to verify the actual error payload rather than using a broad wildcard

In test_publish_live_fail_on_retry, the regex Status: failed Errors: * is too permissive and will match almost any error content. Since ConfigureStatus.errors is now a structured list of dicts, make the assertion stricter by matching the serialized errors (e.g., using re.escape(str(err_resp.errors)), as in the preview test) or by including a specific portion of the error structure in the pattern so regressions in formatting or type are caught.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant