Skip to main content

Disaster Recovery and High Availability for OutSystems Platform servers

OutSystems

Disaster Recovery and High Availability for OutSystems Platform servers

Once you design your infrastructure to supportOutSystems you need to understand if you are going to have an High Availability setup and Disaster Recovery plan or not.

High Availability vs Disaster Recovery

To understand the difference in each approach, mind the following:

  • High Availability - is a system that is expected to be continuously delivering for a long time. Availability can be measured relative to "100% operational" or "never failing". A widely-held but difficult-to-achieve standard of availability for a system or product is known as "five 9s" (99.999 percent) availability.
  • Disaster Recovery - defines a set of policies and procedures to assure the recovery or continuation of vital technology infrastructure and systems, following a natural or human-induced disaster.

These setups can live together but they require different planning and different implementations.

High Availability

High Availability is all about uptime when facing hazardous events. The goal is to create a redundancy in your infrastructure that can respond to events, through horizontal scalability, preventing them from impacting your business. Depending on your configuration, it can respond to events at a server level, an entire datacenter or even in geographical areas. Is your business risk level and customer landscape that will determine to what point OutSystems has to scale, but we present the following examples to show typical scenarios that can be put in place.

Keep in mind that this examples are technology agnostic. They just show a simple layout on how to achieve High Availability.

 

Example 1 - Localized High Availability Design

This design example prevents localized system events. By having redundancy you are able to balance your load to respond to any existing events.

                                                                                       image3.png

 

Example 2 - Geographic High Availability Design

This design example prevents geographic system events. By having redundancy both at the datacenter level and within the datacenter you can cope with geographical system events:

 

                         image5.png

Disaster Recovery

A disaster recovery happens after you had any kind of disaster. Here you are defining how fast can you recover from that disaster and what is the amount of information you are willing to lose in it. To help you creating the right decision there are two important concepts to keep in mind:

  • Recovery point objective (RPO) - is defined by business continuity planning. It is the maximum targeted period in which data might be lost from an IT service due to a major incident.
  • Recovery time objective (RTO) - is the targeted time frame and service level within a business process that it must be restored after a disaster (or disruption) in order to avoid unacceptable consequences associated with a break in business continuity.

                                                  image6.png

Optimally, the more critical is your business process the less RPO vs RTO is to be aimed. This is achieved through a conjunction of an High Availability design and a set of policies and procedures contained in a Disaster Recovery Plan.

To help you achieving success, whether there’s an existing business continuity plan or a new one is to be aligned, we share a very simple disaster recovery plan for an OutSystems Infrastructure.

Building a Disaster Recovery Plan

The following plan is a baseline that helps you to define your own plan. Keep in mind that this is the minimum acceptable plan for a disaster recovery situation and you should invest in creating a plan tailored to your specific needs.

Communication Plan

All communications before and after each work item are to be done by e-mail. The handover from a action item to the next action item should be done by phone to the next action item responsible contact or by the department 24/7 contact.

Personnel

List of people/departments that take part in the plan.

Responsibility Name E-mail Contact Department 24/7 contact
Network Diane Man diane.man@acme.corp +1 432 7434 +1 432 7434
Database Jack Lang jack.lang@ext.acme.corp +1 378 4320 +1 479 4321
Server Operations Lee Connan lee.conan@ext.acme.corp +62 444 9254 +62 229 5478
Global Infrastructure Rob Roy rob.roy@acme.corp +1 334 7845 +1 334 3402
Applications Audrey Branch audrey.branch@acme.corp +1 434 1121 +1 432 001
Backups Cindy Krall cindy.krall@acme.corp +1 432 5009 +1 332 887

 

Systems

List of devices that directly take part of the OutSystems Factory

System

Hostname

IP Address

Responsibility

Load Balancer 1

syslb1

81.253.167.3

Network

Frontend 1

srvappout1

10.56.1.11

Server Operations

Frontend 2

srvappout2

10.56.1.12

Server Operations

Frontend 3

srvappout3

10.59.1.21

Server Operations

Frontend 4

srvappout4

10.56.1.22

Server Operations

Database 1

oradbout1

10.80.1.33

Database

 

Support Contracts

Information regarding supplier’s support and SLA over the systems.

System

Quantity

Brand

Support Type

Support Contact

OutSystems

-

OutSystems

24/7

+1 888 707 2957

Storage

1

Eximax

24x7

0800 569 112

Virtual Machine

6

Microsoul

24x7

+1 378 4320

Database

3

Oracle

24x7

+62 444 9254

Load Balancer

2

RapidFire

8x5

+1 434 1121

 

Backups

You must configure the following automatic backups for both Servers and Databases.

Servers

  • Daily virtual machine snapshot at 4am

  • 5 snapshots kept (5 days)

Database

  • Oracle Archive Logs every 2 hours starting at 0h

  • Daily Differential at 1am

  • Weekly Full at 1am

  • Monthly Full at 1am

 

Front-End Recovery

Action items required to recover from an hazard event affecting a server or servers.

#

Action

Responsibility

SLA

1

Remove affected server from load balancer

Network Operations

15m

2

Restore virtual machine snapshot

Server Operations

2h

3

Access the server and start all OutSystems services.

Server Operations

15

4

Verify OutSystems status by accessing http://localhost/ServiceCenter

Server Operations

5m

5

Confirm internal application access

Applications

15m

 

Database Restore

Action items required to recover from a database hazard event.

#

Action

Responsibility

SLA

1

Setup maintenance web page

Server Operations

15m

2

Access each frontend and stop all OutSystems services

Server Operations

30m

3

Database Restore

Backups

4h

4

Verify database

Database

 

5

Access each frontend and start all OutSystems services

Server Operations

1h

6

Verify runtime of applications and execute smoke tests

Applications

1h

7

Unset maintenance web page and internet access test.

Server Operations

15m

 

Full Recovery

Action items required to achieve full recovery.

#

Action

Responsibility

SLA

1

Redirect Load Balancers to a web server with maintenance web page

Server Operations

30m

2

Database Restore

Backups

4h

3

Verify database

Database

15m

4

Restore virtual machine snapshot

Server Operations

4h

5

Access the server and start all OutSystems services.
 

Server Operations

1h

6

Verify OutSystems status by accessing http://localhost/ServiceCenter from each restored server

Server Operations

30m

7

Verify runtime of applications and execute smoke tests

Applications

1h

8

Unset maintenance web page and internet access test.

Server Operations

15

More Information

To learn more about how to set up your OutSystems Platform check the Designing OutSystems infrastructures guide.

Important: The information in this article applies only to OutSystems Platform on-premises or private cloud deployments.