☑️ Ensure Workflow retry policy catches relevant errors only
retryStrategy
is an optional field of the Workflow CRD that provides control over how to retry a workflow step. One of its fields is retryPolicy
, which defines the policy of NodePhase statuses that will be retried (NodePhase is the condition of a node at the current time).
The value of retryPolicy
can be either: Always, OnError, or OnTransientError. In addition, the user can use the expression key to control more of the retries.
What’s the catch?
retryPolicy=Always is too much. The user only wants to retry on system-level errors (eg, the node dying or being preempted), but not on errors occurring in user-level code since these failures indicate a bug. In addition, this option is more suitable for long-running containers than workflows which are jobs.
retryPolicy=OnError doesn't handle preemptions: retryPolicy=OnError handles some system-level errors like the node disappearing or the pod being deleted. However, during graceful Pod termination, the kubelet assigns a Failed status and a Shutdown reason to the terminated Pods. As a result, node preemptions result in node status "Failure", not "Error" so preemptions aren't retried.
retryPolicy=OnError doesn't handle transient errors: Classify preemption failure message as a transient error is allowed however, this requires retryPolicy=OnTransientError. (see also, TRANSIENT_ERROR_PATTERN).
It is recommended to use Always, along with this expression that filters out unnecessary errors:
lastRetry.status == "Error" or (lastRetry.status == "Failed" and asInt(lastRetry.exitCode) not in [0])
Targeted objects by this rule (types of kind): Workflow
Complexity: medium (What does this mean?)
Policy as code identifier: ARGO_WORKFLOW_ENSURE_RETRY_ON_BOTH_ERROR_AND_TRANSIENT_ERROR
This rule will fail
If the retryPolicy key is set with a value of Always, but the aforementioned expression is not set:
kind: Workflow
spec:
templates:
- retryStrategy:
retryPolicy: "Always"
Rule output in the CLI
$ datree test *.yaml
>> File: failExample.yaml
❌ Ensure Workflow retry policy catches relevant errors only [1 occurrence]
💡 Incorrect value for key `retryPolicy` - the expression should include retry on steps that failed either on transient or Argo controller errors
How to fix this failure
When using a retryPolicy
of Always
, set the expression
key with the following value:
kind: Workflow
spec:
templates:
- retryStrategy:
retryPolicy: "Always"
expression: 'lastRetry.status == "Error" or (lastRetry.status == "Failed" and asInt(lastRetry.exitCode) not in [0])'