Slinger's Thoughts

August 30, 2013

SharePoint Governance and Disaster Recovery

Filed under: Disaster Recovery, SharePoint — slingeronline @ 12:25 pm

Those of us who talk about SharePoint Disaster Recovery often mention that any Disaster Recovery strategy needs to be part of an overall SharePoint Governance plan, but I don’t think any of us have gone into detail about what that governance plan should include. If any bloggers have, please reference their blog posts in addition to this one. In this post I am going to show the disaster recovery section of a governance plan that I am trying to implement at our organization. Please keep in mind that this is part of a much larger, more detailed governance plan for a SharePoint environment.

The first thing I do is establish some guidelines about how our DR strategy fits into the big picture at our company. We are all about best practices (at least we read them; I don’t think we’ve implemented one yet), so I reference some of the guidelines that have already been established. The “other disaster scenarios” mentioned in this section refer to something I call the “Tiny Disaster.” I’ll have a blog post about those in the future.

7.3 Disaster Recovery Procedure

Best practice guidelines for SharePoint 2013 Disaster Recovery procedures specifically reference large datacenter installations and only cover scenarios of total failure. For this reason, we will selectively choose the appropriate portions of Microsoft’s recommended Best Practices with regard to Disaster Recovery Strategy, and apply our own methodologies, based on personal experience, to other disaster scenarios.

http://technet.microsoft.com/en-us/library/ee663490.aspx

http://technet.microsoft.com/en-us/library/cc261687.aspx

http://technet.microsoft.com/en-us/library/ff628971.aspx

Now we discuss how we will implement a disaster recovery solution. We don’t yet have all of the details from the business about what the RTO/RPO should be, but we can fill those in later. For those that don’t know, RTO and RPO are metrics used for establishing disaster recovery guidelines. RTO is the Recovery Time Objective: the maximum time allowed, after a disaster strikes, to restore service. RPO is the Recovery Point Objective: how much data, measured as time before the disaster, you can afford to lose. You can look at these values as a timeline. From the moment the disaster strikes, the RTO extends forward in time and the RPO extends back. Instead of using one RTO/RPO for the entire organization, we recognize that each group has different needs and will likely have a different RTO/RPO metric.
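To make the timeline idea concrete, here is a minimal Python sketch. The function name `recovery_window` and the sample disaster time are my own illustrations, not part of the governance plan; the one week / one day values match the Central Administration row of the table further down.

```python
from datetime import datetime, timedelta


def recovery_window(disaster_time, rpo, rto):
    """Return (oldest acceptable restore point, recovery deadline).

    The RPO reaches back in time from the disaster; the RTO reaches forward.
    """
    oldest_restore_point = disaster_time - rpo
    recovery_deadline = disaster_time + rto
    return oldest_restore_point, recovery_deadline


# Hypothetical disaster time, used only for illustration.
disaster = datetime(2013, 8, 30, 12, 25)
oldest, deadline = recovery_window(
    disaster, rpo=timedelta(weeks=1), rto=timedelta(days=1)
)
print("Newest usable backup must be from on or after:", oldest)
print("Service must be restored by:", deadline)
```

Any backup taken before the first timestamp would lose more data than the RPO allows, and any recovery that finishes after the second timestamp misses the RTO.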

7.3.1 Disaster Recovery Implementation

Backup procedures will be broken into two categories: Farm Component backups and Content Backups. Farm Component backups will include content databases, service application databases, search index files, and various other SharePoint-specific files from the file system of each server in the SharePoint farm. Content Backups will include all content accessible to users through the default SharePoint interface, for all Web Applications in the SharePoint farm.

Farm Component backups will occur on a schedule determined by the RTO/RPO of all content for all Site Collections in a given database. Farm components that do not contain SharePoint content, such as Service Application databases, will be backed up on a weekly basis. The Enterprise Search Service application, including Search databases and Index files, will have a differential backup created daily in addition to weekly backups.

Each Site Collection will have its own RTO/RPO metric. This is in order to prevent redundant data backups while still allowing the most effective use of disaster recovery resources. Third-party software will be used to facilitate a more refined and more robust disaster recovery methodology, and to allow for a shorter RTO when data needs to be recovered. Third-party Disaster Recovery software will be installed according to the recommendations of the software vendor.

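| Site Collection                 | Recovery Point Objective | Recovery Time Objective |
|---------------------------------|--------------------------|-------------------------|
| Central Administration          | One Week                 | One Day                 |
| MySites Web Application         | One Week                 | One Week                |
| Corporate                       |                          |                         |
| Accounting                      |                          |                         |
| Engineering                     |                          |                         |
| Finance, Strategy, and Analysis |                          |                         |
| Operations                      |                          |                         |
| Procurement                     |                          |                         |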
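Since a Farm Component backup schedule is driven by the RTO/RPO of every Site Collection in a database, one way to read that rule is "the strictest objective wins." The Python sketch below shows that reading. It is my own assumption about how to combine the values, and the site collection names and objectives in it are hypothetical placeholders rather than figures from our plan.

```python
from datetime import timedelta


def database_objectives(site_collections):
    """Return the (RPO, RTO) a content database has to meet.

    site_collections maps a site collection name to its (RPO, RTO) pair.
    Assumption: the database must satisfy the shortest RPO and the
    shortest RTO of any site collection it hosts.
    """
    if not site_collections:
        raise ValueError("the database hosts no site collections")
    strictest_rpo = min(rpo for rpo, _ in site_collections.values())
    strictest_rto = min(rto for _, rto in site_collections.values())
    return strictest_rpo, strictest_rto


# Hypothetical site collections sharing one content database.
shared_database = {
    "Site Collection A": (timedelta(days=1), timedelta(hours=8)),
    "Site Collection B": (timedelta(weeks=1), timedelta(hours=4)),
}
rpo, rto = database_objectives(shared_database)
print("Database RPO:", rpo)
print("Database RTO:", rto)
```

A side effect worth noticing: putting one demanding site collection in a database full of rarely changing ones drags the whole database onto the demanding schedule, which is one reason to group site collections with similar objectives together.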

Here we describe and define the different types of backup we will use for our DR strategy. There are other backup types that your organization may want to consider, such as incremental. We describe some of the limitations of backing up our SharePoint sites, such as site locking, so that no one in the business is surprised by it. We also lay out some guidelines, since some users may want their data that only changes once a year backed up daily. (We aren’t planning on that kind of data storage requirement.) These aren’t necessarily set in stone, and we can adjust these as needed by the business.

Backups of SharePoint content will be classified as either “Full Backups” or “Differential Backups.” A Differential Backup contains only the changes made to the content since the last Full Backup was created. Full Backups of SharePoint content will be scheduled to occur during expected “off peak” hours of SharePoint utilization. When Full Backups are created, or when Differential Backups are created outside business hours, “Site Locking” will be enabled in order to ensure data consistency. While a Site Collection is “locked,” the entire Site Collection will be “read only” for ALL users. When Differential Backups are created during regular business hours, “Site Locking” will be disabled so that users can continue working while the backup is being created. In cases where site locking is disabled, differential backups will be scheduled more frequently in order to ensure as much data consistency as possible. Where possible, multiple backups of SharePoint content will run simultaneously in order to reduce the expected downtime.
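The locking rule above boils down to a small decision, sketched here in Python. The business hours of 8:00 to 18:00 are an assumption made purely for illustration (the plan does not define them), weekends are ignored to keep the sketch short, and the function names are hypothetical.

```python
from datetime import time

# Assumed business hours; the governance plan does not specify them.
BUSINESS_START = time(8, 0)
BUSINESS_END = time(18, 0)


def is_business_hours(start_time):
    """True if a backup starting at start_time falls inside business hours."""
    return BUSINESS_START <= start_time < BUSINESS_END


def site_locking_enabled(backup_type, start_time):
    """Decide whether the Site Collection is set read-only for the backup.

    Full backups always lock. Differential backups lock only when they
    run outside business hours.
    """
    if backup_type == "full":
        return True
    if backup_type == "differential":
        return not is_business_hours(start_time)
    raise ValueError(f"unknown backup type: {backup_type!r}")


print(site_locking_enabled("full", time(2, 0)))            # True
print(site_locking_enabled("differential", time(13, 0)))   # False
print(site_locking_enabled("differential", time(23, 30)))  # True
```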

The frequency of content changes will also affect how often backups are created, depending on the RTO/RPO metric for each Site Collection.

| Frequency of Change | Full Backup Schedule | Differential Backup Schedule |
|---------------------|----------------------|------------------------------|
| Hourly              | Daily                | No less than every 6 hours   |
| Daily               | Weekly               | No more than every 6 hours   |
| Weekly              | Monthly              | Daily or Weekly              |
| Monthly             | Semi Annually        | Weekly or Monthly            |
| Yearly              | Annually             | Semi Annually or Never       |
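If you want to hand this schedule to a script rather than a person, it is just a lookup table. Here is a minimal Python sketch; the names `BACKUP_SCHEDULE` and `schedule_for` are hypothetical, and the values are copied verbatim from the table above.

```python
# Frequency of change -> (full backup schedule, differential backup schedule).
BACKUP_SCHEDULE = {
    "hourly": ("Daily", "No less than every 6 hours"),
    "daily": ("Weekly", "No more than every 6 hours"),
    "weekly": ("Monthly", "Daily or Weekly"),
    "monthly": ("Semi Annually", "Weekly or Monthly"),
    "yearly": ("Annually", "Semi Annually or Never"),
}


def schedule_for(frequency_of_change):
    """Look up the backup schedules for how often the content changes."""
    key = frequency_of_change.strip().lower()
    if key not in BACKUP_SCHEDULE:
        raise ValueError(f"no schedule defined for {frequency_of_change!r}")
    return BACKUP_SCHEDULE[key]


full, differential = schedule_for("Weekly")
print("Full backup:", full)
print("Differential backup:", differential)
```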

Well, we have a backup strategy in place, and we have backups. We’re done, right? Not quite. We need to know that our backups work, and that we can get data back in case we ever lose it, so we test. And we test often enough that we can sleep peacefully. We test every aspect of our disaster recovery strategy, not just one or two pieces of it. We make sure to vary our testing so that no part of our SharePoint farm is overlooked. In case any part of our disaster recovery strategy fails, we describe an emergency procedure to ensure that our data is backed up in some way. (We don’t pull our reserve chute unless our main parachute fails.)

7.3.2 Disaster Recovery Testing Procedure

On the first weekend of each month, the ability to restore content will be tested. Comparisons will be made with tools such as binary file comparison tools where appropriate.

One content database will be restored out of place and then compared to the in-service database for any discrepancies. The database selected will be one that has not been tested over the previous 4 to 6 months. Some inconsistencies are expected, such as time stamp information. If data content and data structure are not consistent, the disaster recovery process will be reviewed and the cause of the discrepancy determined. In that case, a full backup will be created manually using native SharePoint tools in order to prevent any data loss.

One Service Application database will be restored out of place and then compared to the in-service database for any discrepancies. The database selected will be one that has not been tested over the previous 4 to 6 months. Some inconsistencies are expected, such as time stamp information. If data content and data structure are not consistent, the disaster recovery process will be reviewed and the cause of the discrepancy determined. In that case, a full backup will be created manually using native SharePoint tools in order to prevent any data loss.

Three SharePoint-specific server files, such as a file from the “SharePoint Hive”, will be selected at random, restored out of place, and then compared to the in-service file content for any discrepancies. Each file must be from a different server in the SharePoint farm. The files selected will be ones that have not been tested over the previous 4 to 6 months. Some inconsistencies are expected, such as time stamp information. If data content and data structure are not consistent, the disaster recovery process will be reviewed and the cause of the discrepancy determined. In that case, a full backup will be created manually using native SharePoint tools in order to prevent any data loss.

Between 5 and 15 documents and/or list items will be selected at random, restored out of place, and then compared to the in-service content for any discrepancies. The content selected will not include anything that has been selected over the previous two years. Preferably, the selection of files to be tested will be provided by business users and not by a member of the IT department. If the content is not consistent, the disaster recovery process will be reviewed and the cause of the discrepancy determined. In that case, a complete granular backup will be created manually using the native SharePoint tools in order to prevent any data loss.
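For the file-level checks, a binary comparison can be as simple as hashing both copies and comparing the digests. The Python sketch below is one way to do it, not the tool we actually use. The function names are hypothetical, and the demo writes throwaway files to a temporary directory instead of touching a real restore.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restored_copy_matches(in_service_path, restored_path):
    """True when the restored file is byte-for-byte identical in content."""
    return file_digest(in_service_path) == file_digest(restored_path)


# Demo with throwaway files standing in for an in-service and a restored copy.
with tempfile.TemporaryDirectory() as workdir:
    in_service = Path(workdir) / "in_service.bin"
    restored = Path(workdir) / "restored.bin"
    in_service.write_bytes(b"example content")
    restored.write_bytes(b"example content")
    print(restored_copy_matches(in_service, restored))  # True
    restored.write_bytes(b"example content, altered")
    print(restored_copy_matches(in_service, restored))  # False
```

Hashing only looks at file contents, so expected differences such as file system time stamps do not cause a mismatch. Databases are a different story: a restored database will rarely be byte-identical to the live one, so those comparisons need to happen at the data and structure level.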

Well, we have the whole backup strategy laid out, and we test our backups. The only thing we have left is to define what to do in case we actually have a disaster.  This only handles the Tiny disasters for our end users, and describes the process for recovering information in SharePoint from a list or library item up to restoring an entire Site. We’ll get to the larger disasters after this.

7.3.3 Disaster Recovery End User Process

If a user needs an item restored because it was deleted, corrupted, or otherwise affected, a Service Ticket must be created using the current Help Desk system. The user must provide the specific item name, the URL to the location of the item to be restored, and the date of the version to be restored. If the item to be restored still exists at the location, the requesting user may request either that the item be restored to another, temporary location or that it overwrite or be merged with the existing content. Merging is not available when restoring list/library items. The item should be restored within the RTO specified for that Site Collection.

Users will be able to request the restoration of list/library items, lists/libraries, and sites. Restoring a list/library or site will require approval of the owner of the Site or Site Collection as appropriate. If approval is not received before one half of the RTO has elapsed, the item will be deemed not critical to the business. The RTO requirement will be voided, and the item will be recovered at the earliest convenience of the SharePoint Administrator responsible for recovering it.
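The approval rule is easy to get wrong in practice, so here is a minimal Python sketch of it. It assumes the RTO clock starts when the Service Ticket is opened, which the plan does not actually state, and the function names and dates are hypothetical.

```python
from datetime import datetime, timedelta


def approval_deadline(ticket_opened, rto):
    """Owner approval must arrive before half of the RTO has elapsed."""
    return ticket_opened + rto / 2


def rto_still_applies(ticket_opened, rto, approved_at):
    """True if approval arrived in time, so the RTO is still binding.

    approved_at is None when no approval has been received.
    """
    if approved_at is None:
        return False
    return approved_at <= approval_deadline(ticket_opened, rto)


# Hypothetical ticket against a Site Collection with a one day RTO.
opened = datetime(2013, 9, 2, 9, 0)
rto = timedelta(days=1)
print(approval_deadline(opened, rto))                                 # 2013-09-02 21:00:00
print(rto_still_applies(opened, rto, datetime(2013, 9, 2, 15, 0)))    # True
print(rto_still_applies(opened, rto, datetime(2013, 9, 3, 8, 0)))     # False
```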

We’ve covered what happens with the smaller issues. Now we need to address what happens when the whole thing blows up. We don’t want to use this process if we can avoid it, so the first thing we do is troubleshoot the issue. If it is a bad web.config file, or something of that nature, we don’t necessarily want to re-image the entire server. Sometimes we can fix an issue without resorting to a restore. If we can’t, then we will restore the failing component. Our absolute last-ditch effort will be to re-image the servers. We want the downtime for our end users to be as short as possible, so we do whatever we can to keep it to a minimum. Once we do restore something, we need to find out why it needed to be restored in the first place, so that we don’t run into the issue again.

7.3.4 Disaster Recovery Process

If a Site Collection, Web Application, Service Application, Database, or other major component of the SharePoint farm needs to be restored for any reason, all end users will be notified of the outage and of the RTO/RPO timeframe. No approval is required. The only reason any of these components should be restored from a backup is a complete failure at that level, such as a corrupt database or an error in the SharePoint farm that cannot be remedied using normal troubleshooting procedures. Every attempt will be made to correct the issue without having to perform a restore operation. The failing component will be restored using the appropriate methods and media. Once the failing component is restored, the environment will be verified by affected end users to ensure that all content is intact. After the component is restored and users are able to access the necessary content again, a post mortem will be performed to determine the cause of the failure and, if possible, steps will be taken to prevent the issue from occurring again.

So there you have it in a nutshell: the Disaster Recovery portion of our SharePoint governance plan. I should note that the plan is not yet complete and has not yet been approved by anyone in the business. Please keep in mind that this may not apply to your business. Your DR strategy around SharePoint might need to be more robust, or may not need to be as detailed. Some key things that your DR strategy should have, though, are clearly defined backup policies, clearly defined restore policies, a clearly defined test policy, and a metric to measure against. Feel free to use what I have as a basic guide for your own governance plans.
