Do not retry on certification errors#211
Conversation
Reviewer's GuideIntroduce explicit handling of Azure Marketplace certification failures by detecting structured certification errors in job responses, preventing retries for these permanent failures, and updating models and tests to work with structured error objects instead of plain strings. Sequence diagram for Azure publish flow without retries on certification errorssequenceDiagram
actor Caller
participant RetryWrapper
participant AzureService as MsAzureService
participant AzureAPI as AzureMarketplace
Caller->>RetryWrapper: _publish_live(product, product_name)
RetryWrapper->>AzureService: _publish_live(product, product_name)
AzureService->>AzureAPI: submit_to_status(product_id, status='live')
AzureAPI-->>AzureService: ConfigureStatus(job_result='failed', errors)
AzureService->>AzureService: is_certification_error(errors)
alt [errors indicate certification failure]
AzureService->>RetryWrapper: raise CertificationError
RetryWrapper-->>Caller: propagate CertificationError (no retry)
else [non certification failure]
AzureService->>RetryWrapper: raise RuntimeError
RetryWrapper->>AzureService: retry _publish_live(product, product_name)
end
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've found 4 issues, and left some high level feedback:
- There is a leftover
print("errors : ", res.errors)in_publish_preview; consider removing or replacing it with proper logging to avoid noisy stdout in production. - Now that
ConfigureStatus.errorsis aList[Dict[str, Any]], error messages in_publish_previewand_publish_liverely on the raw liststr()representation; consider adding a small formatter to produce clearer, stable error strings instead of exposing the raw structure. - In
_publish_previewyou useself._raise_error(CertificationError, ...)while_publish_livedirectly raisesCertificationError; consider using a consistent error-raising path in both methods for easier handling and logging.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- There is a leftover `print("errors : ", res.errors)` in `_publish_preview`; consider removing or replacing it with proper logging to avoid noisy stdout in production.
- Now that `ConfigureStatus.errors` is a `List[Dict[str, Any]]`, error messages in `_publish_preview` and `_publish_live` rely on the raw list `str()` representation; consider adding a small formatter to produce clearer, stable error strings instead of exposing the raw structure.
- In `_publish_preview` you use `self._raise_error(CertificationError, ...)` while `_publish_live` directly raises `CertificationError`; consider using a consistent error-raising path in both methods for easier handling and logging.
## Individual Comments
### Comment 1
<location path="cloudpub/ms_azure/service.py" line_range="201-202" />
<code_context>
"""
job_details = self._query_job_details(job_id=job_id)
if job_details.job_result == "failed":
error_message = f"Job {job_id} failed: \n{job_details.errors}"
+ if is_certification_error(job_details.errors):
+ self._raise_error(CertificationError, error_message)
self._raise_error(InvalidStateError, error_message)
</code_context>
<issue_to_address>
**issue:** Using the raw list of error dicts in the failure message may hurt readability and regress prior behavior.
`job_details.errors` is now a list of dicts, so interpolating it directly will show the raw Python dicts instead of a readable message and may change existing logs/UX. Consider converting the list to a human-readable string (e.g., extracting the `message` fields) before formatting, and align this behavior with any other call sites that include `errors` in messages.
</issue_to_address>
### Comment 2
<location path="cloudpub/ms_azure/service.py" line_range="729-735" />
<code_context>
product.id, state="preview"
):
- errors = "\n".join(res.errors)
+ errors = res.errors
+ print("errors : ", res.errors)
failure_msg = (
f"Failed to submit the product {product_name} ({product.id}) to preview. "
f"Status: {res.job_result} Errors: {errors}"
)
+ if is_certification_error(res.errors):
+ self._raise_error(CertificationError, failure_msg)
raise RuntimeError(failure_msg)
</code_context>
<issue_to_address>
**issue:** The plain `print` call and raw `errors` formatting are inconsistent with the surrounding logging behavior.
This block now uses `print("errors : ", res.errors)` and includes the raw `errors` list (list of dicts) directly in the message. Please avoid `print` in this context and use the existing logging approach (e.g., `log.*`) or rely on `failure_msg`. If you need to surface structured error details, consider logging a JSON-serialized version or selected key fields instead of the full list of dicts.
</issue_to_address>
### Comment 3
<location path="cloudpub/ms_azure/service.py" line_range="735-744" />
<code_context>
f"Failed to submit the product {product_name} ({product.id}) to preview. "
f"Status: {res.job_result} Errors: {errors}"
)
+ if is_certification_error(res.errors):
+ self._raise_error(CertificationError, failure_msg)
raise RuntimeError(failure_msg)
</code_context>
<issue_to_address>
**suggestion:** Use of `self._raise_error` for `CertificationError` would align behavior with other failure paths.
`query_job_status` uses `self._raise_error(CertificationError, error_message)`, but here `_publish_live` calls `CertificationError` via `_raise_error` as well. To keep error handling consistent and ensure any shared logic in `_raise_error` (logging, metrics, error mapping) always applies, use `self._raise_error(CertificationError, failure_msg)` instead of raising `CertificationError` directly.
Suggested implementation:
```python
if is_certification_error(res.errors):
self._raise_error(CertificationError, failure_msg)
raise RuntimeError(failure_msg)
```
```python
def _publish_live(self, product: Product, product_name: str) -> None:
res = self.submit_to_status(product_id=product.id, status='live')
# Example structure of the failure handling logic that should
# now be using `self._raise_error` for CertificationError:
if res.job_result != 'succeeded' or not self.get_submission_state(
product.id, state="live"
):
errors = res.errors
failure_msg = (
f"Failed to submit the product {product_name} ({product.id}) to live. "
f"Status: {res.job_result} Errors: {errors}"
)
if is_certification_error(res.errors):
self._raise_error(CertificationError, failure_msg)
raise RuntimeError(failure_msg)
```
I only see the beginning of `_publish_live` in your snippet, so you will need to adjust the replacement block to match your existing implementation. Specifically:
1. Locate the existing error-handling code inside `_publish_live` (likely similar to the preview path) that currently does something like:
```python
if is_certification_error(res.errors):
raise CertificationError(failure_msg)
```
2. Replace that `raise CertificationError(...)` with:
```python
self._raise_error(CertificationError, failure_msg)
```
3. Ensure the `failure_msg` variable name and the condition (`res.job_result`, `get_submission_state`, etc.) match your actual implementation; the block I provided is a template to show how to integrate `_raise_error`, not a strict structural change to your business logic.
The key behavioral change is: all `CertificationError` raises in `_publish_live` should now go through `self._raise_error`.
</issue_to_address>
### Comment 4
<location path="tests/ms_azure/test_service.py" line_range="989-994" />
<code_context>
azure_service._publish_preview.retry.sleep = mock.Mock() # type: ignore
expected_err = (
f"Failed to submit the product test-product \\({product_obj.id}\\) to preview. "
- "Status: failed Errors: failure1\nfailure2"
</code_context>
<issue_to_address>
**suggestion (testing):** Tighten the regex assertion to verify the actual error payload rather than using a broad wildcard
In `test_publish_live_fail_on_retry`, the regex `Status: failed Errors: *` is too permissive and will match almost any error content. Since `ConfigureStatus.errors` is now a structured list of dicts, make the assertion stricter by matching the serialized errors (e.g., using `re.escape(str(err_resp.errors))`, as in the preview test) or by including a specific portion of the error structure in the pattern so regressions in formatting or type are caught.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| error_message = f"Job {job_id} failed: \n{job_details.errors}" | ||
| if is_certification_error(job_details.errors): |
There was a problem hiding this comment.
issue: Using the raw list of error dicts in the failure message may hurt readability and regress prior behavior.
job_details.errors is now a list of dicts, so interpolating it directly will show the raw Python dicts instead of a readable message and may change existing logs/UX. Consider converting the list to a human-readable string (e.g., extracting the message fields) before formatting, and align this behavior with any other call sites that include errors in messages.
| errors = res.errors | ||
| print("errors : ", res.errors) | ||
| failure_msg = ( | ||
| f"Failed to submit the product {product_name} ({product.id}) to preview. " | ||
| f"Status: {res.job_result} Errors: {errors}" | ||
| ) | ||
| if is_certification_error(res.errors): |
There was a problem hiding this comment.
issue: The plain print call and raw errors formatting are inconsistent with the surrounding logging behavior.
This block now uses print("errors : ", res.errors) and includes the raw errors list (list of dicts) directly in the message. Please avoid print in this context and use the existing logging approach (e.g., log.*) or rely on failure_msg. If you need to surface structured error details, consider logging a JSON-serialized version or selected key fields instead of the full list of dicts.
| if is_certification_error(res.errors): | ||
| self._raise_error(CertificationError, failure_msg) | ||
| raise RuntimeError(failure_msg) | ||
|
|
||
| @retry( | ||
| wait=wait_fixed(wait=60), | ||
| stop=stop_after_attempt(3), | ||
| retry=retry_if_not_exception_type(CertificationError), | ||
| reraise=True, | ||
| ) |
There was a problem hiding this comment.
suggestion: Use of self._raise_error for CertificationError would align behavior with other failure paths.
query_job_status uses self._raise_error(CertificationError, error_message), but here _publish_live calls CertificationError via _raise_error as well. To keep error handling consistent and ensure any shared logic in _raise_error (logging, metrics, error mapping) always applies, use self._raise_error(CertificationError, failure_msg) instead of raising CertificationError directly.
Suggested implementation:
if is_certification_error(res.errors):
self._raise_error(CertificationError, failure_msg)
raise RuntimeError(failure_msg) def _publish_live(self, product: Product, product_name: str) -> None:
res = self.submit_to_status(product_id=product.id, status='live')
# Example structure of the failure handling logic that should
# now be using `self._raise_error` for CertificationError:
if res.job_result != 'succeeded' or not self.get_submission_state(
product.id, state="live"
):
errors = res.errors
failure_msg = (
f"Failed to submit the product {product_name} ({product.id}) to live. "
f"Status: {res.job_result} Errors: {errors}"
)
if is_certification_error(res.errors):
self._raise_error(CertificationError, failure_msg)
raise RuntimeError(failure_msg)I only see the beginning of _publish_live in your snippet, so you will need to adjust the replacement block to match your existing implementation. Specifically:
- Locate the existing error-handling code inside
_publish_live(likely similar to the preview path) that currently does something like:if is_certification_error(res.errors): raise CertificationError(failure_msg)
- Replace that
raise CertificationError(...)with:self._raise_error(CertificationError, failure_msg)
- Ensure the
failure_msgvariable name and the condition (res.job_result,get_submission_state, etc.) match your actual implementation; the block I provided is a template to show how to integrate_raise_error, not a strict structural change to your business logic.
The key behavioral change is: all CertificationError raises in _publish_live should now go through self._raise_error.
| expected_err = ( | ||
| f"Failed to submit the product test-product \\({product_obj.id}\\) to preview. " | ||
| "Status: failed Errors: failure1\nfailure2" | ||
| f"Status: failed Errors: {re.escape(str(err_resp.errors))}" | ||
| ) | ||
|
|
||
| # Test |
There was a problem hiding this comment.
suggestion (testing): Tighten the regex assertion to verify the actual error payload rather than using a broad wildcard
In test_publish_live_fail_on_retry, the regex Status: failed Errors: * is too permissive and will match almost any error content. Since ConfigureStatus.errors is now a structured list of dicts, make the assertion stricter by matching the serialized errors (e.g., using re.escape(str(err_resp.errors)), as in the preview test) or by including a specific portion of the error structure in the pattern so regressions in formatting or type are caught.
@lslebodn @JAVGan PTAL. This one prevents cloudpub from retrying when there's a certification error.
Summary by Sourcery
Handle Azure certification failures as non-retriable errors and adjust error modeling accordingly.
Bug Fixes:
Enhancements:
Tests: