LUSID service continuity

LUSID is a mission-critical service for our clients, demanding 24x7 availability. Our goal is zero downtime, which we aim to achieve through:

Resilience: LUSID is architected to be robust to individual failures.
Contingency: If the system does suffer a problem, we have established plans for recovery.
Testing: We regularly validate and rehearse our service continuity strategy.
Improvement: We learn from our experience and evaluate new technology and best practices.

Approach

LUSID is architected to have no single points of failure and exhibit comprehensive redundancy.

LUSID is hosted in AWS or Azure, and we distribute our estate across all “availability zones” (AZs) in the regions in which we operate. All data is replicated, and LUSID is tested to tolerate multiple failure modes, including individual container failures, loss of server instances, network degradation, entire AZ outages, and external DDoS attacks.

We routinely inject failures into our environments - a practice known as “chaos engineering” - and verify that there is no client impact by examining our monitoring and telemetry sources.

In addition, we regularly test our automated recovery mechanisms, including database failover and AZ annexation.

Finally, we test our disaster recovery processes, which cover restoring the entire service in a new region from backups in the event of a catastrophic geographical outage.

Monitoring

We have extensive monitoring and telemetry that gives us a real-time view of the health of LUSID. This data feeds into our internal alerting tools, ensuring we are immediately aware of any service degradation.

In addition, we track the availability of all LUSID APIs using externally hosted tools that regularly and systematically exercise the various features of the API. The results of these tests are used to determine our uptime statistics, which we make available to all clients.
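As an illustration only, an uptime figure could be derived from external probe results along the following lines. This is a minimal sketch: the ProbeResult type, its fields, and the simple pass-ratio formula are assumptions made for the example, not a description of how LUSID's published statistics are actually calculated.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ProbeResult:
    """One run of an external API check (hypothetical structure)."""
    timestamp: float  # seconds since the epoch
    ok: bool          # True if the check passed


def availability_percent(results: Sequence[ProbeResult]) -> float:
    """Return the percentage of probe runs that passed."""
    if not results:
        raise ValueError("at least one probe result is required")
    passed = sum(1 for r in results if r.ok)
    return 100.0 * passed / len(results)
```

For example, 999 passing checks out of 1,000 would give an availability of 99.9%.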

Data durability and backup

LUSID maintains real-time replicas of all data, to ensure all data changes are durable and resilient to failures in our data storage systems. These replicas exist in all availability zones in the primary region, and an asynchronous replica is maintained in the disaster recovery region.

In addition to these replicas, we also take backups of the data via two mechanisms:

Real-time backups which facilitate point-in-time recovery to any time in the past 30 days.
Daily snapshots which are moved to a different cloud provider region, to facilitate disaster recovery.

These provisions ensure that once data is successfully written to LUSID, we have very high confidence in its ongoing availability.

Performance and capacity management

We target the following SLAs for LUSID API performance:

99% of all requests to return in under 2 seconds.
99.9% of all requests to return in under 5 seconds.
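A client wishing to compare its own observed latencies against the two targets above could use a check like the following. This is a sketch under stated assumptions: the function names are hypothetical, latencies are assumed to be measured in seconds, and "under" is read as strictly less than the threshold.

```python
from typing import Sequence

# (threshold in seconds, minimum fraction of requests under it)
LATENCY_TARGETS = [(2.0, 0.99), (5.0, 0.999)]


def fraction_under(latencies_s: Sequence[float], threshold_s: float) -> float:
    """Return the fraction of requests that completed in under threshold_s."""
    if not latencies_s:
        raise ValueError("at least one latency sample is required")
    return sum(1 for x in latencies_s if x < threshold_s) / len(latencies_s)


def meets_latency_targets(latencies_s: Sequence[float]) -> bool:
    """Return True if the sample satisfies both targets."""
    return all(
        fraction_under(latencies_s, threshold) >= minimum
        for threshold, minimum in LATENCY_TARGETS
    )
```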

To achieve these targets, LUSID dynamically adjusts the number of available servers to respond to changes in client workload. If necessary, LUSID throttles incoming requests to temper surges in load, or to constrain ‘bad actors’, in order to ensure the stability of the service. This manifests as “429 - Too Many Requests” response codes being returned by our APIs.
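Client code can cope with throttling by retrying 429 responses after a delay. The sketch below shows one common pattern, exponential backoff with jitter. It is not taken from a LUSID SDK: the send callable, the parameter names, and the default values are hypothetical, and whether LUSID returns a Retry-After header is an assumption (the code simply honours it if present).

```python
import random
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str]]  # (status code, response headers)


def call_with_backoff(
    send: Callable[[], Response],
    max_attempts: int = 5,
    base_delay_s: float = 0.5,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send() and retry while it returns HTTP 429.

    Returns the first non-429 response, or the last 429 response
    once max_attempts calls have been made.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, headers = send()
        if status != 429 or attempt == max_attempts - 1:
            return status, headers
        retry_after = headers.get("Retry-After")
        if retry_after is not None and retry_after.isdigit():
            # Assumption: the server suggests a wait in whole seconds.
            delay = min(float(retry_after), max_delay_s)
        else:
            # Exponential backoff with full jitter, capped at max_delay_s.
            delay = random.uniform(0, min(max_delay_s, base_delay_s * 2 ** attempt))
        sleep(delay)
    raise AssertionError("unreachable")
```

Here send would wrap a single API request. The jitter spreads retries out so that many throttled clients do not all retry at the same moment.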

We validate the scalability of the platform using stress-test simulations which are run as part of our release cycle.

Service levels

The table below outlines our Recovery Time and Recovery Point Objectives for various failure modes.

Recovery Time Objective (RTO) describes the maximum time it should take to bring the service back to full operational health.
Recovery Point Objective (RPO) describes the maximum amount of data loss incurred in the case of that failure (measured in time).

Failure mode	Impact	Recovery action	RTO	RPO
Container crash	Failure of any in-flight API requests.	Restart container.	Zero	Zero
EC2 server failure	Failure of multiple containers.	Reprovision server.	Zero	Zero
Network degradation	Slow and/or disrupted API requests.	Annex affected infra.	Zero	Zero
AZ issue	Temporary loss of capacity. Potential disruption to requests.	Annex AZ. Increase capacity in other AZs.	Zero	Zero
AZ outage	Failure of any in-flight API requests. Temporary loss of capacity.	Increase capacity in other AZs.	Zero	Zero
Database failure	Service disrupted.	Failover to replica.	< 60 seconds	Zero
Cloud provider regional outage	Service disrupted.	Invoke DR procedure. Restore service to different region.	< 4 hours	Seconds
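To make the RTO and RPO definitions concrete, the sketch below computes the achieved recovery time and data-loss window for a single incident or recovery drill, and compares them with the objectives. The function and parameter names are hypothetical, and the timestamps in the usage note are invented for illustration only.

```python
from datetime import datetime, timedelta
from typing import Tuple


def recovery_metrics(
    failure_at: datetime,
    restored_at: datetime,
    last_durable_write_at: datetime,
) -> Tuple[timedelta, timedelta]:
    """Return (recovery time, data-loss window) for one incident.

    Recovery time is how long the service took to return to health.
    The data-loss window is the gap between the last write that
    survived the failure and the failure itself.
    """
    recovery_time = restored_at - failure_at
    data_loss_window = max(failure_at - last_durable_write_at, timedelta(0))
    return recovery_time, data_loss_window


def within_objectives(
    recovery_time: timedelta,
    data_loss_window: timedelta,
    rto: timedelta,
    rpo: timedelta,
) -> bool:
    """Return True if both measured values are within the objectives.

    Uses <= so that a "Zero" objective can be met; a strict "< 60 seconds"
    objective would need a strict comparison instead.
    """
    return recovery_time <= rto and data_loss_window <= rpo
```

For instance, a database failover that restores service 45 seconds after the failure with no lost writes would sit within the "Database failure" row, passing rto=timedelta(seconds=60) and rpo=timedelta(0).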