Back-to-Basics: Understanding Backup and Recovery plus High Availability(Part I)

The constant evolution of IT has, among other things, affected the role of a database administrator (DBA). Today the DBA is not merely a Database Administrator anymore but is morphing more into the Database Architect role. If you want to become a successful DBA and be more competitive in the market, you should have a different skill set than what was normally required in the past. You need to have a wide range of understanding in architectural design, network, storage, licensing, and much more. The more knowledge you have, the better opportunities you will find.

The main idea of this back-to-basics post is to introduce you to some basic concepts regarding backup and recovery plus high availability, giving you a general overview of the most important methods and tools available for you to achieve your data availability goals. Therefore, in this first post, we will cover the following topics:

Understanding the need for creating backups
Getting familiar with the different backup types
An overview of backup strategy

Purpose of Backup and Recovery

As a DBA, you are the person responsible for recovering the data and guarding
the business continuity of your organization. Consequently, you have the key responsibility for developing, deploying, and managing an efficient backup and recovery strategy for your institution or clients, allowing them to easily recover from any possible disastrous situation. Remember, data is one of your company’s most important assets. Most organizations would not survive after the loss of this important asset.

Testing Backups

It’s incredible how many corporations worldwide do not have a proper Disaster Recovery Plan (DRP) in place, and what is worse, many DBAs never even test their backups. Most of the time when auditing Oracle environments for clients, I ask the following question to the DBA team:

Are you 100 percent sure that you can trust your backups? For this question I generally receive answers like:

I’m not 100 percent sure since we do not recover from backups too often;
We do not test our backups, so I cannot guarantee the recoverability of them.

Another good question is the following:

Do you know how long a full restore and recovery of your database will take? Common responses to this question are:

Probably anything between 6 and 12 hours;
I don’t know, because I’ve never fully restored and recovered my database.

As you can see, a simple implementation of a procedure to proactively test your backups randomly will allow you to:

Test your backups and ensure that they are valid and recoverable:
I have been called several times to help clients because their current backups are not viable. Once I was called to help a client and discovered that their backup-to-disk started every night at 10 P.M. and ending by 2 A.M. Afterwards, the backup files are copied to a tape manually by a system administrator every morning at 4 A.M. The problem here was that when this process was implemented, the database size was only 500 GB. Still, after a few months, the database’s size had grown to over 1 TB. Consequently, the backup that was initially finishing before 2 A.M. was now finishing at 5 A.M., but the copy to a tape was still being triggered at 4 A.M. by the system administrator. As a result, all backups to a tape were unusable.
Know your recovery process in detail: If you test your backups, you will have the knowledge to answer questions regarding how long full restore and recovery will take. Answering that your full recovery will take around three and a half hours, but you prefer to say five hours just in case of any unexpected problem that you will come across, showing your professionalism. This will let me know that you really know what you are talking about.

Document and improve your recovery process: The complete process needs to be documented. Suppose the process is documented and you also allow your team to practice on a rotation basis. In that case, this will ensure that they are familiar with the course of action and will have all the knowledge necessary to know what to do in case of a disaster. You will now be able to rest in your home at night without being disturbed because now you are not the only person in the team with the experience required to perform this important task.

Good for you if you have a solid backup and recovery plan in place. But have you tested that plan? Have you verified your ability to recover?

Protecting Data

As the main person responsible for the recovery and availability of the data, you need to fully understand how to protect your data against all possible situations you could come across in your daily job. The most common situations you could see are:

Media failure
Hardware failure
Human error
Application error

Lets take a closer look at each of these situations.

Media Failure

Media failure occurs when a system is unable to write or read from a physical storage device such as disk or a tape due to a defect on the recording surface. This kind of failure can be easily overcome by ensuring that your data is saved on more than one disk (mirrored) using a solution such as RAID (Redundant Array of Independent Disks) or ASM (Automatic Storage Management). In the case of tapes, ensure that your backups are saved in more than one tape and as mentioned earlier, test their recoverability. Another good advantage of ASM is that it uses raw disks, giving you some extra protection against ransomware as it cannot corrupt raw disks like ASM. Another good way to protect your environment against this type of failure is by having access to the Automatic Media Block Recovery” functionality that comes with using Oracle Active Data Guard technology (Check my blog post about this topic here https://oraclemaa.com/playing-with-active-data-guard-dml-redirection-and-automatic-block-media-recovery/)

Hardware Failure

Hardware failure is when a failure occurs on a physical component of your hardware such as when your server motherboard, CPU, or any other component stops working. To overcome this kind of situation, you will need to have a high-availability solution in place as part of your disaster and recovery strategy. This could include solutions such as Oracle RAC, a standby database (Using Data Guard or Active Data Guard), using Oracle engineering systems (they are fully hardware redundant) or even having replacement hardware on the premises.

Human Errors

Human error, also known as user error, is when a user interacting directly or through an application causes damage to the data stored in the database or to the database itself. The most frequent examples of human error involve changing or deleting data and even files by mistake. It is likely that this kind of error is the greatest single cause of database downtime in a company.

No one is immune to user error. Even an experienced DBA or system administrator can delete a redo log file with the extension .log as a mistake when taking it as a simple log file to be deleted to release space. Fortunately, user error can often be solved by using physical backups, logical backups, and even Oracle Flashback technology.

Application Error

An application error happens when a software malfunction causes data corruption at the logical or physical levels. A bug in the code can easily damage data or even corrupt a data block. This kind of problem can be solved using Oracle block media recovery and Edition-Based Redefinition (EBR, a revolutionary capability that allows online application upgrades with uninterrupted availability of the application). It is important to have a proper test done before promoting an application change to any production environment.

Tip: Always do a backup before and after a change is implemented in a production environment. A before backup will allow you to roll back to the previous state in case something goes wrong. An after-the-backup will protect you to avoid redoing the change in case of a failure, as it was not included in the previous backup available.

Types of Backups

Now that you understand all types of possible failures that could affect your database, let’s take a closer look at the definition of backup and the types of backups available to ensure our data’s recoverability.

A backup is a real and consistent copy of data from a database that could be used to reconstruct the data after an incident. Consequently, there are two different types of backups available, which are:

Physical backups
Logical backups

A Physical Backup

A physical backup is a copy of all the physical database files that are required to perform the recovery of a database. These include datafiles, control files, parameter files, and archived redo log files. As an Oracle DBA, we have different options to make a physical backup of our database. Backups can be taken using user-managed backup techniques or using Recovery Manager (RMAN). Both techniques will be discussed in more detail later in this book. Physical backups are the foundation of any serious backup and recovery strategy.

A Logical Backup

Oracle uses Oracle Data Pump to allow us to generate a logical backup that can be used to migrate data or even do a partial or full recovery of our database. The utilities available are the Data Pump Export program (expdp) and the Data Pump Import program (impdp).Many people have a misconception of these tools in thinking that they can only be used to move data. Data Pump is a very flexible and powerful tool that can easily become a DBA’s best friend if well utilized. It is not just for moving data. It can also play a crucial role in your backup and recovery strategy.

Tip: The old Import and Export utilities – In the previous versions of Oracle we used to work with similar utilities called exp and imp. The exp utility is deprecated since Oracle 11g, but the imp utility is still currently supported by Oracle. The imp utility allows us to recover any backup generated by the old exp program. Just keep in mind that the use of exp is not supported anymore by Oracle and using it can bring future trouble to your environment.

Backup Strategies

A backup and recovery strategy mainly aims to protect a database against data loss, and this document will contain all steps required to reconstruct the database after a disaster strikes. As the person responsible for your company’s data, it is very important to have a correct backup strategy in place to allow you to recover from any possible disaster.

Before you create a strategy, you will need to understand clearly all the Service Level Agreements (SLAs) in place within your organization regarding this topic. To that end, you will need to ask some simple questions to the owners of the data:

How much data can the company lose in case of a disaster? (RPO)
How much time could the business wait to have the data restored and available again? (RTO)
How much will it cost the company for the loss of one hour of data?

What retention periods are required by law for the company data?After receiving the answers to all these questions, you will be able to implement a proper backup and recovery strategy according to your real company needs and SLAs in place.For example, if your company can only afford to lose three hours of data (RPO) but it can have the database down for up to 24 hours for a recovery process (RTO), all you will need to do to fulfill your SLA is to have a full backup of your database made daily. You will also need to make backups of all your archive logs every three hours to a tape or another network location to allow you to have all your data protected.

As part of creating a strategy, it is important to properly understand the concepts known as Recovery Point Objective (RPO) and Recovery Time Objective (RTO). As you can see in the following figure, the RPO reflects how much data might be lost without incurring a significant risk or loss to the business. The RTO is the maximum amount of time allowed to reestablish the service after an incident without seriously affecting the company.

Restore versus Recovery

Several people have asked me about the differences between restoration and recovery. Due to these questions, I will take this opportunity to explain the difference in some simple words to make it clear:

Restore: It is the act that involves the restoration of all files that will be required to recover your database to a consistent state, for example, copying all backup files from a secondary location such as tape or storage to your stage area
Recovery: It is the process of applying all transactions recorded in your archive logs, rolling your database forward to a point-in-time or until the last transaction is applied, thus recovering your database to the point-in-time you need.

In the next part of this article (Part II), we will take a closer look at what redo logs are, the two possible modes your database could operate in, and much more. Learning more about Redo will help you understand in a bit more in-depth what type of backup and recovery you could use in your environment. Be tuned for part II coming soon!

Kind Regards,

Francisco Munoz Alvarez

Want to learn more?

Check these amazing content about Oracle MAA and MSA available for you:

Oracle Maximum Availability Architecture (MAA) Main Page – https://bit.ly/3ysinDY
Database Security Main Page – https://bit.ly/3NSJWw3
Cyber Security Blog – https://bit.ly/3yPVZpG
Cyber Security Technical Paper – https://bit.ly/3akX1Ao
Securing the Oracle Database – a technical primer (free fourth edition) – https://bit.ly/3bUYQVa
Security Free Labs – https://bit.ly/3NLjCDW
Disaster and Recovery Free Labs – https://bit.ly/3yms6vK

References:

Oracle Backup and Recovery Survival Guide – https://www.amazon.com.au/Oracle-Database-Backup-Recovery-Survival/dp/1782171207

Back-to-Basics: Understanding Backup and Recovery plus High Availability(Part I)

Related

Leave a Reply Cancel reply