Rust Foundation Series: A Case for Effective Error Handling!

Context

As a refresher: an error is a distress signal raised by a function or a method to express a prohibitive state that prevented it from producing the expected outcome. The occurrence of an error affects system reliability, while an inadequately defined error hurts both clarity and a consistent experience. In this article, I present an approach to building foundational support for effective error handling, using the constructs offered by Rust’s standard library.

Initial Guidance

Before proceeding further, I suggest reviewing the following initial guidance:

  • If you are looking for ready-made external support, do check out the existing crates (i.e., libraries).
  • However, if you have specific needs, such as driving better quantitative observability metrics, that call for a custom error handling framework, do continue.

Expectation

The following formatted, JSON-like text represents a sample of the expected serialized form of an error:

gBoxError {
    reason: "Expected a valid input",
    category: Functional,
    cause: ParseIntError { kind: InvalidDigit },
    code: "TEST_001",
    location: Location {
        file: "tests/gBoxTestResultWithErr.rs",
        line: 47,
        col: 17
    },
    timeOfOccurrence: SystemTime {
        tv_sec: 1746469109,
        tv_nsec: 988007986
    }
}

Subsequent sections will describe the composite parts of the above error composition, and how each of them can help with better error handling.

Requirements

A well-defined error, both by its type and by its constituent parts, provides actionable input: it makes the cause clear and helps implement effective error handling measures. End users require clarity about the cause of a failure, while reliability controls require type-based quantitative insights.

The following are some commonly expected definition requirements of a well-defined error:

  • Concise Reasoning: An adequately concise description of the error cause. For example: “Failed to login. Expected valid token. Ref – UUID”.
  • Error Classification: Tag the error with high-level categorization labels such as Systemic, Functional, etc. This helps produce roll-up aggregates or track errors by category.
  • Error Code: A recommended approach for programmatic usage. For example, a code such as DbWriteFailure-404 indicates an attempt to write to a non-existent table.
  • Time of Occurrence: A system time indicating when the error was raised. Along with additional metadata, it is a helpful input for deriving reliability metrics such as mean time between failures (MTBF).
  • Signature Clarity: Clear signatures, not just in naming but with type specificity on both inputs and return values, help bake in compile-time type safety and let call-sites cover the possible terminal states. Avoid ambiguity!
  • Immutability: Errors are snapshots generated to represent the underlying cause. As they get exchanged, their state must not be open to modification.
  • Serialization Support: The ability to serialize to user-friendlier formats such as JSON text helps surface errors with a better experience for those consuming them.
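To illustrate how a time of occurrence can feed a reliability metric, here is a minimal sketch that averages the gaps between recorded error timestamps. The function name `mean_gap_between_failures` is hypothetical, and treating the average gap between consecutive occurrences as an MTBF estimate is a simplifying assumption; a production metric would also account for operating time.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical helper: average gap between consecutive failure timestamps.
/// Returns `None` when fewer than two occurrences are available.
fn mean_gap_between_failures(occurrences: &[SystemTime]) -> Option<Duration> {
    if occurrences.len() < 2 {
        return None;
    }
    // Sort a copy so the caller's ordering does not matter.
    let mut sorted = occurrences.to_vec();
    sorted.sort();
    let first = *sorted.first()?;
    let last = *sorted.last()?;
    // Total observed span divided by the number of gaps.
    let span = last.duration_since(first).ok()?;
    Some(span / (sorted.len() as u32 - 1))
}
```

Given three occurrences at 0, 10 and 30 seconds, the two gaps average out to 15 seconds.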

Semantic Leverage

Rust’s standard library offers the following two ways to signal an error state:

  • The `panic!` macro: often suggested for quick prototyping, it offers a simpler approach to raising an error, where the default behavior is to print the cause and terminate the panicking thread (and the process, when that is the main thread). The standard library also supports registering custom panic hooks!
  • The Result type: offers a graceful representation of two possible terminal states, a success or a failure. For those coming from Scala, this is like the Try[_] monad.

The following is some quick guidance about choosing one over the other:

  • The error state cannot be recovered: raise a panic! This is similar to an unhandled exception in Java or Scala, where System.exit(...) is invoked with an appropriate terminal code. By registering a custom hook, a panic can be intercepted for reporting, and a recovery can be attempted (stopping an unwinding panic additionally requires std::panic::catch_unwind).
  • A recovery is possible, or you’d like to delegate the error handling task to the call-site: return the Result type.
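The two choices can be contrasted in a short sketch. The function names `parse_port` and `require_non_empty` are hypothetical and exist only for illustration; the first delegates the decision to the caller through Result, the second treats a broken invariant as unrecoverable.

```rust
use std::num::ParseIntError;

/// Recoverable: the caller decides what to do with a malformed input.
fn parse_port(raw: &str) -> Result<u16, ParseIntError> {
    raw.trim().parse::<u16>()
}

/// Treated as unrecoverable in this sketch: an empty name is a broken
/// invariant, so the current thread panics instead of returning a value.
fn require_non_empty(name: &str) -> &str {
    if name.is_empty() {
        panic!("configuration name must not be empty");
    }
    name
}
```

A call-site can pattern-match or chain on `parse_port`, whereas a failure in `require_non_empty` only surfaces through the panic machinery.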

Some additional thoughts when scoping for an error handling foundation:

  • From a reliability standpoint, all errors must be accounted for. Both approaches make that possible! However, Result is more expressive for composing clear contracts that are resolved statically at compile-time. Panics must be examined for specifics by the registered hook.
  • Panics are a nice tool to leverage in the early prototyping phase. However, I recommend baking in a minimalistic error handling approach from the get-go!
  • I suggest having a panic interceptor (a system-wide hook for the duration of the runtime) to deal with external dependencies that can cause panics.
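A panic interceptor along these lines could be sketched with the standard library’s `std::panic::set_hook` and `std::panic::catch_unwind`. The helper names `install_panic_reporter` and `run_guarded` are assumptions made for this sketch, and it presumes the default unwinding panic strategy (with `panic = "abort"` nothing can be caught).

```rust
use std::panic;

/// Hypothetical helper: installs a process-wide hook that reports every
/// panic. The hook only observes the panic; it does not stop it.
fn install_panic_reporter() {
    panic::set_hook(Box::new(|info| {
        let location = info
            .location()
            .map(|l| format!("{}:{}:{}", l.file(), l.line(), l.column()))
            .unwrap_or_else(|| "unknown location".to_string());
        eprintln!("[panic-interceptor] panic at {location}");
    }));
}

/// Hypothetical helper: runs `task` and turns a panic into an `Err`
/// carrying the panic message, so the caller can attempt a recovery.
fn run_guarded<T>(task: impl FnOnce() -> T + panic::UnwindSafe) -> Result<T, String> {
    panic::catch_unwind(task).map_err(|payload| {
        // Panic payloads are usually a `&str` or a `String`.
        if let Some(msg) = payload.downcast_ref::<&str>() {
            msg.to_string()
        } else if let Some(msg) = payload.downcast_ref::<String>() {
            msg.clone()
        } else {
            "unknown panic payload".to_string()
        }
    })
}
```

Wrapping a call into an external dependency with `run_guarded` keeps a panic in that dependency from taking down the calling thread, while the hook still records where it happened.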

Note: subsequent sections focus on a design approach based on the Result type. Later efforts will blend in panic handling, building on these initial foundations.

Result with Standard Error

Rust is a statically typed language, where types are resolved at compile-time. While this holds for the Result type and its variants, the definition itself doesn’t impose any Error trait bound on the error variant. For effective error handling, constraining the error variant’s type to the standard Error trait helps enforce expected behaviors, such as the ability to describe the error, produce mix-ins based on call-site needs, or recognize certain error types in order to define type-specific fallback actions.
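The difference between an unbounded error variant and one constrained by the Error trait can be seen in a small sketch. Both function names are hypothetical.

```rust
use std::error::Error;

/// `Result` places no bound on its error variant, so a bare `u8` is accepted.
fn unbounded() -> Result<(), u8> {
    Err(7)
}

/// Only accepts results whose error type implements the standard `Error` trait.
fn describe<T, E: Error>(result: &Result<T, E>) -> Option<String> {
    // `Error` requires `Display`, so every bounded error can describe itself.
    result.as_ref().err().map(|e| e.to_string())
}

// `describe(&unbounded())` would not compile: `u8` does not implement `Error`.
```

The bound moves the check to compile-time: a call-site cannot pass an arbitrary value as the failure and still use the shared error behaviors.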

Error Trait – Core or Std lib?

Before progressing further, let’s briefly discuss the reason for choosing the Error trait from the standard library (std) namespace rather than the one from core. As explained in the language’s code base here, error handling support was placed under the standard library, rather than in the core library where it was initially present, as a workaround for challenges associated with coherence checks on trait impls. Another motivation is its wide usage across the language’s code base. For example, std::io::Error implements the standard Error trait. The online reference Rust By Example, a commonly used addendum to the primary language references, suggests its use as a trait bound when wrapping errors for a generic representation.

Building Blocks

To define a foundational layer for effective error handling, I’ve come up with the following building blocks:

  • gBoxError<eT>: A generic error defined as a struct type, where `eT` is some concrete error type. Its composite parts (i.e., fields) are:
      1. Reason – a concise description of the failure cause.
      2. Category – to help classify an error into a high-level bucket.
      3. Code – a short system-usable code. Example – DbWriteFailure-404.
      4. Location – the place in code that raised the error.
      5. Time of Occurrence – local OS time when the error was raised.
  • gBoxErrorBuilder: A progressive builder to initialize and build an error instance of type gBoxError<eT>, where eT is either the underlying root cause type or the self-type, i.e., an error all by itself without an underlying cause that triggered the failure.
  • gBoxErrorCategory: An enumeration type with a few common error categories such as Systemic, Functional, etc.
  • gBoxResult<rT, eT>: A type alias for the Result type, where the error variant’s type is gBoxError<eT>. Being an alias, it supports all standard operations of a Result type, including use of the question-mark (?) operator.
  • AugmentErr: A trait that introduces interface support to pad a Result type with additional error context. The augmented result is returned as a Result<rT, gBoxError<eT>> type.
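To make the first four building blocks concrete, here is a minimal sketch of how they might fit together. It is an illustration under assumptions, not the actual implementation: the type names are upper-camel-case renderings of the names above, the field layout follows the sample output shown earlier (including a `cause` field), and the use of `std::panic::Location` with `#[track_caller]` for the location is my own choice.

```rust
use std::error::Error;
use std::fmt;
use std::panic::Location;
use std::time::SystemTime;

/// High-level buckets used for roll-up aggregates.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GBoxErrorCategory {
    Systemic,
    Functional,
}

/// Generic error that keeps the concrete cause type `E`.
/// Fields are private and read-only, so a built error stays immutable.
#[derive(Debug)]
pub struct GBoxError<E: Error> {
    reason: String,
    category: GBoxErrorCategory,
    code: String,
    cause: E,
    location: &'static Location<'static>,
    time_of_occurrence: SystemTime,
}

impl<E: Error> GBoxError<E> {
    pub fn reason(&self) -> &str { &self.reason }
    pub fn category(&self) -> GBoxErrorCategory { self.category }
    pub fn code(&self) -> &str { &self.code }
    pub fn cause(&self) -> &E { &self.cause }
    pub fn location(&self) -> &'static Location<'static> { self.location }
    pub fn time_of_occurrence(&self) -> SystemTime { self.time_of_occurrence }
}

impl<E: Error> fmt::Display for GBoxError<E> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "[{}] {} (cause: {})", self.code, self.reason, self.cause)
    }
}

impl<E: Error + 'static> Error for GBoxError<E> {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.cause)
    }
}

/// Progressive builder: start from the cause, add context, then build.
pub struct GBoxErrorBuilder<E: Error> {
    error: GBoxError<E>,
}

impl<E: Error> GBoxErrorBuilder<E> {
    #[track_caller] // records the file, line and column of the caller
    pub fn new(cause: E) -> Self {
        Self {
            error: GBoxError {
                reason: String::new(),
                category: GBoxErrorCategory::Systemic, // assumed default
                code: String::new(),
                cause,
                location: Location::caller(),
                time_of_occurrence: SystemTime::now(),
            },
        }
    }
    pub fn reason(mut self, reason: &str) -> Self { self.error.reason = reason.into(); self }
    pub fn category(mut self, category: GBoxErrorCategory) -> Self { self.error.category = category; self }
    pub fn code(mut self, code: &str) -> Self { self.error.code = code.into(); self }
    pub fn build(self) -> GBoxError<E> { self.error }
}

/// Alias that fixes the error variant to `GBoxError<E>`.
pub type GBoxResult<T, E> = Result<T, GBoxError<E>>;
```

Because the builder consumes itself at every step and the built error exposes only getters, the error cannot be modified once it has been handed to a call-site.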

Behind the Scenes

For context, let’s consider the following usage example:

[Figure: gBoxResult Sample]

A few implicit (i.e., non-functional) requirements considered here include:

  • Preserve the out-of-the-box support: Retain the existing idiomatic expressiveness of the Result type, and the ability to propagate potential errors up the call-chain.
  • Propagate errors during result extraction: In certain cases, as in the above closure definition, an attempt to lift the positive value using the question-mark operator results in an inferred Result type with an `{unknown}` error variant type. Check this Rust language GitHub Issue, where a similar scenario using the question mark is discussed. While the main issue is still open, I’ve shared a comment with a workaround that is used here. The approach replaces the expression Ok(<Result>?) with an associated method, extract. Its sole purpose is to preserve concrete types for both variants of the Result type and eliminate the need to explicitly annotate the types to avoid inference errors.
  • Allow padding the error with additional context: As you’d notice, line#47 invokes a utility method, introduced by adding implementation support over the Result type, to augment the cause with additional triage pointers.
  • Support coercions: Enable implicit coercion from Result<rT, eT> –> Result<rT, gBoxError<eT>>, where the source concrete type information is preserved for runtime. Similarly, any error that implements the standard Error trait can be coerced to gBoxError<eT>.
  • Progressive builder: As you’d notice in line#47 above, a call to the errorContext method is made. Internally, it invokes a progressive error builder that consumes the provided contextual references and returns a Result with an error variant of type gBoxError<eT>.
  • Preserve source error type: This is the primary reason for building all of the components listed here with generics. Having access to concrete type specifics at runtime can be very helpful for building code that is sound and safe.
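The coercion, extraction and context-padding requirements can be sketched as follows. This is an assumed reading of the design, not the original code: the error type is reduced to two fields, `error_context` is a snake-case stand-in for errorContext, and the exact signatures of `extract` and `error_context` are guesses.

```rust
use std::error::Error;
use std::num::ParseIntError;

/// Reduced stand-in for the generic error: only a reason and the cause.
#[derive(Debug)]
pub struct GBoxError<E: Error> {
    reason: String,
    cause: E,
}

impl<E: Error> GBoxError<E> {
    pub fn reason(&self) -> &str { &self.reason }
    pub fn cause(&self) -> &E { &self.cause }
}

pub type GBoxResult<T, E> = Result<T, GBoxError<E>>;

/// Coercion: any standard error becomes a `GBoxError` that keeps its type.
impl<E: Error> From<E> for GBoxError<E> {
    fn from(cause: E) -> Self {
        GBoxError { reason: cause.to_string(), cause }
    }
}

pub trait AugmentErr<T, E: Error> {
    /// Lifts the result while naming both variant types concretely.
    fn extract(self) -> GBoxResult<T, E>;
    /// Lifts the result and pads the error with call-site context.
    fn error_context(self, reason: &str) -> GBoxResult<T, E>;
}

impl<T, E: Error> AugmentErr<T, E> for Result<T, E> {
    fn extract(self) -> GBoxResult<T, E> {
        self.map_err(GBoxError::from)
    }
    fn error_context(self, reason: &str) -> GBoxResult<T, E> {
        self.map_err(|cause| GBoxError { reason: reason.to_string(), cause })
    }
}

/// `?` goes through the `From` impl above to coerce the error.
fn fn_parse(input: &str) -> GBoxResult<i8, ParseIntError> {
    let value = input.parse::<i8>()?;
    Ok(value)
}

fn fn_parse_with_context(input: &str) -> GBoxResult<i8, ParseIntError> {
    input.parse::<i8>().error_context("Expected a valid input")
}

fn parse_all(inputs: &[&str]) -> Vec<GBoxResult<i8, ParseIntError>> {
    // No return-type annotation on the closure: `extract` fixes both variants.
    let lift = |raw: &str| raw.parse::<i8>().extract();
    inputs.iter().map(|raw| lift(*raw)).collect()
}
```

In all three functions the concrete `ParseIntError` stays visible in the signature, so a call-site can still reach its API through `cause()`.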

Being explicit about the contracts in the signature has two immediate benefits:

  • Clarity on the possible terminal state types, which are resolved at compile-time. There is no need to box into some dynamic error type reference.
  • Well-formed types benefit the downstream code that consumes the return values. This avoids call-site ambiguity and more or less enforces developer discipline, unless one chooses to panic 😊.

Avoiding explicit annotations

The extract() method introduced earlier removes the need to explicitly annotate the return value types: both variant types are resolved at compile-time, and there’s no ambiguity that can fail compilation or hinder runtime interpretation. Moreover, the behavior of the question-mark operator cannot be replaced with a custom impl of sorts. As detailed here, it’s a unary postfix operator supported for the Result and Option types. It invokes an implicit conversion based on the From impls available in scope (Into is derived from From). An attempt to customize it lends itself to an ambiguous scenario, as shown in the following snapshot:

[Figure: Ambiguous `From` Impl scenario]

The above ambiguous scenario arises when `From` conversion support is added over a Result type, and an extraction is attempted as shown in the following code snippet:

[Figure: ?-operator requiring explicit annotation support]
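The role of the annotation can be seen in a small sketch that uses only standard library types. The function name `all_valid` and the closure are hypothetical and are not a reconstruction of the snippet referenced above.

```rust
use std::num::ParseIntError;

fn all_valid(inputs: &[&str]) -> bool {
    // The explicit return type is what lets this closure compile. `?` may
    // convert the error through any matching `From` impl, and nothing else
    // in this function pins the error type, so without the annotation the
    // compiler typically reports "type annotations needed".
    let square = |raw: &str| -> Result<i16, ParseIntError> {
        let n = i16::from(raw.parse::<i8>()?);
        Ok(n * n)
    };
    inputs.iter().copied().all(|raw| square(raw).is_ok())
}
```

A method that returns a fully concrete Result type avoids the annotation, because the error type is then named by the method’s signature and not left to inference.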

Call-site Examples

Case-match

Here’s an example of a case match, where the returned value is resolved by inference at compile-time:

[Figure: Error handling - Case-match example]
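A case match over a result with a concrete error type might look like the following sketch. The reduced `GBoxError` struct, the `fn_parse` body and the `outcome` function are stand-ins assumed for illustration.

```rust
use std::num::{IntErrorKind, ParseIntError};

/// Reduced stand-in for the generic error: a code plus the concrete cause.
#[derive(Debug)]
struct GBoxError<E> {
    code: &'static str,
    cause: E,
}

type GBoxResult<T, E> = Result<T, GBoxError<E>>;

fn fn_parse(input: &str) -> GBoxResult<i8, ParseIntError> {
    input
        .trim()
        .parse::<i8>()
        .map_err(|cause| GBoxError { code: "TEST_001", cause })
}

fn outcome(input: &str) -> String {
    // Both variant types are known statically, so no annotation is needed
    // and the cause's own API (`kind`) is available inside the match.
    match fn_parse(input) {
        Ok(value) => format!("parsed {value}"),
        Err(e) if matches!(e.cause.kind(), IntErrorKind::Empty) => {
            format!("{}: empty input", e.code)
        }
        Err(e) => format!("{}: {}", e.code, e.cause),
    }
}
```

The guard on the second arm is only possible because the cause is a `ParseIntError` and not a boxed `dyn Error`.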

Chaining with Result

[Figure: Chaining with Result]

As you’d notice, the return value resolves to a Result type, with a success value of type i8 and a failure of type gBoxError<ParseIntError>. The map_err method applies a function to the error only if the result holds the error variant. In that case, the logic demonstrates the ability to register or log errors to some external sink. If the Result holds the success variant, map_err passes it through untouched to the map invocation. In line#68, the outcome of map is extracted with an else clause to avoid any potential panic. As you’d notice, this is a failure case: the expectation is that the call to fnParse yields an error, and not the value 16 (which would be the case if the input string were the valid numeral 4).
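The chaining described above can be approximated with the following sketch. The reduced `GBoxError` struct, the `square_or` function and the `Vec<String>` sink are assumptions made for illustration; `fn_parse` stands in for fnParse.

```rust
use std::num::ParseIntError;

/// Reduced stand-in for the generic error: a reason plus the concrete cause.
#[derive(Debug)]
struct GBoxError<E> {
    reason: String,
    cause: E,
}

fn fn_parse(input: &str) -> Result<i8, GBoxError<ParseIntError>> {
    input.parse::<i8>().map_err(|cause| GBoxError {
        reason: format!("Expected a valid input, got {input:?}"),
        cause,
    })
}

/// Squares the parsed input. A failure is recorded in `sink` and replaced
/// by `fallback`, so no branch of the chain can panic.
fn square_or(input: &str, fallback: i8, sink: &mut Vec<String>) -> i8 {
    fn_parse(input)
        // Runs only for the error variant: register it, then pass it on.
        .map_err(|e| {
            sink.push(format!("{}: {}", e.reason, e.cause));
            e
        })
        // Runs only for the success variant.
        .map(|n| n.saturating_mul(n))
        // The "else" clause: supply a value when the chain ended in an error.
        .unwrap_or_else(|_| fallback)
}
```

With the input "4" the chain yields 16 and the sink stays empty; with an invalid input the sink receives one entry and the fallback is returned.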

Trace the root-cause

Continuing with the previous code sample, one can fetch a serialized error trace as shown below:

traceSink.iter().for_each(|trace| println!("{trace}"));

The error trace is returned as a Vec type, where each element is a String reference. Values are registered in bottom-up order, where the bottom (i.e., the end) of the vector references the root cause of the failure. The trace is lazily computed on demand. Until then, the only reference held in the gBoxError state is the cause field, which is some standard error type.

[Figure: Error trace]

While the above trace retains the specificity, the source method of a standard error doesn’t provide a concrete type as its return value. Instead, it returns an Option<&(dyn Error + 'static)>. Internally, the trace function provided by gBoxError uses the describe interface of each error along the trace sequence. The details of each error are rendered by the respective error’s Display support. These details meet the objectives of a downstream observability use case that needs some degree of quantifiable data. Besides the above serialized output, support for concrete type-based introspection of the gBoxError type helps identify and extract quantitative metrics. One example is the SystemTime, which can be used to derive time-based metrics such as the frequency of a certain error’s occurrence.
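Walking the source chain to build such a trace could be sketched as below. The free function `trace` and the reduced `GBoxError` struct are assumptions for illustration; the ordering (outermost error first, root cause last) follows the description above.

```rust
use std::error::Error;
use std::fmt;

/// Reduced stand-in for the generic error: a reason plus the concrete cause.
#[derive(Debug)]
struct GBoxError<E: Error> {
    reason: String,
    cause: E,
}

impl<E: Error> fmt::Display for GBoxError<E> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(&self.reason)
    }
}

impl<E: Error + 'static> Error for GBoxError<E> {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.cause)
    }
}

/// Follows `source()` links and renders each error through `Display`.
/// The outermost error comes first; the root cause is the last element.
fn trace(error: &(dyn Error + 'static)) -> Vec<String> {
    let mut lines = vec![error.to_string()];
    let mut current = error.source();
    while let Some(cause) = current {
        lines.push(cause.to_string());
        current = cause.source();
    }
    lines
}
```

Since `source()` only yields trait objects, the trace carries text alone; type-specific data such as the time of occurrence has to be read from the concrete error type.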

Looking forward!

Error handling doesn’t stop at defining foundations based on graceful types. The next step is to blend in panics as the last-resort recovery method. For example, use the panic reason as a hand-off to recognize the type of recovery, attempt it and, if that fails, terminate the process with an expected and configured exit code!