Understanding data quality checks in LUSID

This article explains how data quality checks (DQ checks) work in LUSID, including the key components to configure a check, and the types of results you can expect from a run.

Configuring a check definition

Each DQ check in LUSID consists of the following key components:

Check definition

A check definition is a LUSID entity identified by a unique scope and code. It specifies:

  • The type of data the check applies to

  • One or more rulesets containing rules that define your validation logic

Ruleset

A ruleset is a collection of rules with an optional filter that lets you segment the data being checked.

Each ruleset contains:

  • One or more rules

  • Optionally, a filter to specify which items the ruleset applies to

For example, to apply a set of rules only to equity instruments, you could set the following filter:

"ruleSetFilter": "instrumentDefinition.instrumentType eq 'Equity'"

Rule

A rule is a formula that LUSID checks entities against. Each rule contains:

  • A formula that uses derived property syntax to determine a value of true or false for the data check

  • A numeric severity level indicating the importance of the rule

For example, to check that an entity is decorated with the Instrument/industry/nace property, you could set the following formula:

"ruleFormula": "Properties[Instrument/industry/nace] exists"

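To see how the three components nest, the following is a minimal Python sketch of a check definition assembled as a plain dictionary. Only the ruleSetFilter and ruleFormula field names come from the examples above; every other key (id, dataType, ruleSets, rules, severity), the scope and code values, and the validate_shape helper are hypothetical and may not match the real request schema.

```python
import json

# Hypothetical layout: only "ruleSetFilter" and "ruleFormula" are field names
# taken from this article; all other keys and values are illustrative assumptions.
check_definition = {
    "id": {"scope": "dq-demo", "code": "equity-nace-check"},  # unique scope and code
    "dataType": "Instrument",  # the type of data the check applies to
    "ruleSets": [
        {
            # Optional filter restricting this ruleset to equity instruments
            "ruleSetFilter": "instrumentDefinition.instrumentType eq 'Equity'",
            "rules": [
                {
                    # Formula that evaluates to true (pass) or false (breach)
                    "ruleFormula": "Properties[Instrument/industry/nace] exists",
                    "severity": 1,  # numeric severity level
                }
            ],
        }
    ],
}


def validate_shape(definition):
    """Return a list of structural problems found in a check definition dict."""
    problems = []
    rule_sets = definition.get("ruleSets", [])
    if not rule_sets:
        problems.append("check definition has no rulesets")
    for i, rule_set in enumerate(rule_sets):
        rules = rule_set.get("rules", [])
        if not rules:
            problems.append(f"ruleset {i} has no rules")
        for j, rule in enumerate(rules):
            if not rule.get("ruleFormula"):
                problems.append(f"ruleset {i} rule {j} has no formula")
            if not isinstance(rule.get("severity"), (int, float)):
                problems.append(f"ruleset {i} rule {j} has no numeric severity")
    return problems


if __name__ == "__main__":
    print(json.dumps(check_definition, indent=2))
    print(validate_shape(check_definition))
```

The helper only checks the structure described in this article (at least one ruleset, at least one rule per ruleset, a formula and a numeric severity per rule); it does not validate the filter or formula syntax.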
Running a DQ check

Once you have defined a check, you can run it via workers in the Workflow Service, or via the API. When running a check, you must provide:

  • The check definition scope and code

  • Details of the instruments to be checked, for example:

    • The instrument scope

    • An asAtModifiedSince date, so LUSID only checks instruments that have been modified since a particular date (such as the previous check run)

    • A selector attribute and value to use as an explicit “check me” tag, such as a telemetry property added during an integration run

    • A preferred identifier to return for instruments that breach rules

  • A limit for the number of breaches per rule, to avoid the inefficiency of checking every instrument for a widespread issue (for example, if a property is missing from every instrument in the latest integration run)

    Note

    Above the limit, further breaches are grouped into a single result.
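The run inputs and the breach limit can be sketched in Python as follows. This is an illustration only: the RunCheckRequest class, its field names, the default limit and the cap_breaches function are all hypothetical, and the sketch assumes that breaches up to the limit are reported individually while the remainder are rolled into one summary result.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunCheckRequest:
    """Hypothetical container for the inputs listed above (not the real API model)."""
    check_scope: str
    check_code: str
    instrument_scope: str
    as_at_modified_since: Optional[str] = None  # only check items modified since this date
    selector_attribute: Optional[str] = None    # explicit "check me" tag attribute
    selector_value: Optional[str] = None        # value the tag must have
    preferred_identifier: Optional[str] = None  # identifier returned for breaching items
    breach_limit_per_rule: int = 100            # assumed default, for illustration


def cap_breaches(breaching_ids, limit):
    """Report up to `limit` breaches individually, then one summary for the rest."""
    results = [
        {"resultType": "Rule breached", "instrument": instrument_id}
        for instrument_id in breaching_ids[:limit]
    ]
    overflow = len(breaching_ids) - limit
    if overflow > 0:
        # All breaches above the limit are grouped into a single result
        results.append({"resultType": "Rule breaches over limit", "count": overflow})
    return results


if __name__ == "__main__":
    request = RunCheckRequest(
        check_scope="dq-demo",
        check_code="equity-nace-check",
        instrument_scope="default",
        breach_limit_per_rule=3,
    )
    breaches = ["inst-1", "inst-2", "inst-3", "inst-4", "inst-5"]
    for result in cap_breaches(breaches, request.breach_limit_per_rule):
        print(result)
```

With five breaching instruments and a limit of three, the sketch yields three individual results followed by one summary covering the remaining two.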

(Coming soon) Follow our tutorial on running a simple DQ check using the CheckDefinitions API. Note that running checks via the API is only recommended for testing a DQ check before including it in your workflow.

LUSID provides a pre-built DQ check worker that you can use in your task definitions, mapping the results to exception tasks for further investigation.

Understanding DQ check results

DQ checks return four types of results:

| Result type | Description | LUSID returns… | Example |
|---|---|---|---|
| Ruleset invalid | Returned if LUSID could not evaluate a ruleset. | One result per invalid ruleset | A property key referenced in the filter string was deleted. |
| Rule invalid | Returned if LUSID could not evaluate an individual rule within a valid ruleset. | One result per invalid rule | A property key referenced in the rule formula was deleted. |
| Rule breached | This is the primary expected result for a breach (that is, the rule formula evaluates to false). | One result per item in the dataset that breaches a rule | An instrument does not have a property required by the check. |
| Rule breaches over limit | Returned if the number of breaches for any rule exceeds the limit specified in the run request. | One summary result for all breaches exceeding the limit | A property required by the check is missing for more than 100 instruments. |
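One way to act on these result types is to separate configuration problems (invalid rulesets or rules) from data problems (breaches). The Python sketch below assumes each result is a dictionary with a resultType key holding one of the four labels above. That shape, the triage function and the sample data are hypothetical, and the values returned by the real API may be named differently.

```python
from collections import Counter

# Labels copied from the result types above; the real API values may differ.
CONFIG_PROBLEMS = {"Ruleset invalid", "Rule invalid"}
DATA_PROBLEMS = {"Rule breached", "Rule breaches over limit"}


def triage(results):
    """Count results by type and split them into config and data problems."""
    counts = Counter(r["resultType"] for r in results)
    config = [r for r in results if r["resultType"] in CONFIG_PROBLEMS]
    data = [r for r in results if r["resultType"] in DATA_PROBLEMS]
    return counts, config, data


# Hypothetical sample results for illustration only
sample_results = [
    {"resultType": "Rule invalid", "detail": "property key in formula was deleted"},
    {"resultType": "Rule breached", "instrument": "inst-1"},
    {"resultType": "Rule breached", "instrument": "inst-2"},
    {"resultType": "Rule breaches over limit", "count": 40},
]

if __name__ == "__main__":
    counts, config, data = triage(sample_results)
    print(dict(counts))
    print(f"{len(config)} result(s) need the check definition fixed")
    print(f"{len(data)} result(s) point to data issues")
```

Splitting the results this way reflects the different fixes they call for: invalid results mean the check definition needs correcting, while breach results mean the underlying data needs investigation.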