Disaster Recovery Testing with the Luminex Channel Gateway/Data Domain Mainframe Virtual Tape Solution With Deduplication

1. Introduction

The Channel Gateway/Data Domain mainframe virtual tape solution with deduplication is the ideal solution for complete backup and disaster recovery. With the high capacity and low-bandwidth replication features of Data Domain deduplication storage, data written with the Channel Gateway system is available at a disaster recovery site within minutes, as opposed to the hours, or even days, required with traditional physical tape and virtual tape systems.

This document describes various backup and disaster recovery configurations and the ideal use of these configurations so that recovery from true disasters, as well as disaster recovery tests, proceeds quickly and flawlessly. The configurations described here are actual configurations at Channel Gateway/Data Domain customer sites.

For consistency, the equipment at the primary site will be termed “primary” and the disaster recovery equipment termed “remote”. Additionally, this document will assume a configuration consisting of just 1 Channel Gateway and 1 Data Domain appliance at the primary and remote sites. Typically, an installation will consist of 2 or more Channel Gateways to provide a highly available solution, and at least 1 Data Domain appliance, depending on capacity requirements.

2. Background Information

As data is written by a primary Channel Gateway server to a primary Data Domain appliance, the data is deduplicated and then replicated over an existing WAN to the remote Data Domain appliance. As the data arrives at the remote Data Domain appliance, it is immediately accessible by the remote Channel Gateway. There is no database or other data synchronization needed, as the data is self-defining. Typically, once a backup process is completed, the tape catalog is also written to a virtual tape that is likewise replicated to the remote system. Upon a disaster or DR test, the systems programmer must have prior knowledge of the VOLSER of that catalog. Since this is a standard practice, it should not be an issue.

To understand some considerations for a Disaster Recovery test, one needs to understand some workings of the Channel Gateway system. The Channel Gateway appears as a number of tape drives to the mainframe. Each device may be configured differently with respect to scratch pools, storage location, and other operating parameters. Typically, though, a range of 16 devices is configured identically, and in most configurations all devices are configured identically.

The Channel Gateway devices can be configured to write to multiple storage locations. The determination of which storage location to use occurs upon a request to mount a virtual tape. The Channel Gateway will first look to see if the data associated with the virtual tape already exists in one of the storage locations. If so, that location is used. This also applies to scratched virtual tapes. Since, by default, the data is not deleted, the data associated with a scratched virtual tape will not change storage location and will simply be overwritten when that scratched tape is used. If the virtual tape data does not exist, the Channel Gateway will determine which available storage location to use based on different algorithms. The basic algorithm chooses the storage location with the most available capacity.
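To make the selection logic concrete, the following is a minimal sketch, in Python, of the behavior described above. The names used (StorageLocation, mount_point, free_bytes, select_location) are illustrative assumptions, not actual Channel Gateway interfaces.

    from dataclasses import dataclass
    from pathlib import Path

    @dataclass
    class StorageLocation:
        mount_point: Path   # e.g., an NFS mount of a Data Domain directory
        free_bytes: int     # available capacity reported for this location

    def select_location(volser, locations):
        # If data for this virtual tape already exists in a location, reuse it.
        # This also covers scratched tapes whose data has not been deleted.
        for loc in locations:
            if (loc.mount_point / volser).exists():
                return loc
        # Otherwise, apply the basic algorithm: choose the location with the
        # most available capacity.
        return max(locations, key=lambda loc: loc.free_bytes)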

In most Channel Gateway/Data Domain systems, the Channel Gateway is configured for just 1 storage location, as an NFS mount point.

3. Configuration #1 – No Remote Mainframe

In this configuration there is no remote mainframe. In this case, the remote site consists only of a remote Data Domain appliance. The primary purpose of this solution is to provide for a backup of the data. The remote Data Domain appliance is most likely at a secondary site of the user.

The diagram below shows this configuration.

In the event of a catastrophic failure of the primary Data Domain storage appliance, there are 2 options to gain access to the virtual tape data.

3.1. Option 1

In the above configuration, the primary Data Domain appliance must be either repaired or replaced. Then the data in the remote Data Domain appliance must be replicated back to the primary Data Domain appliance. Once complete, the Channel Gateway can resume normal operation.

3.2. Option 2

Another solution is to have a connection between the primary Channel Gateway and the remote Data Domain appliance. The diagram below shows this configuration.

In this configuration, a catastrophic failure of the primary Data Domain appliance results in, at most, a temporary loss of access to data. The primary Channel Gateway, either through an automatic or manual failover process, can access data from the remote Data Domain appliance. Initially this data access will be “read-only” until replication is broken between the Data Domain appliances. In this situation, the access rate will be reduced due to the extended connection between the primary Channel Gateway and the remote Data Domain appliance. This configuration can remain in place until the primary Data Domain appliance has been repopulated with the data from the remote site.

4. Configuration #2 – Full Disaster Recovery Site

In this configuration, there is a mainframe at the remote site. This will require the use of remote Channel Gateways so data can be accessed by the remote mainframe. This is the most typical Disaster Recovery configuration. The number of Channel Gateways need not be the same at the primary and remote sites. This depends on performance requirements at each site.

The diagram below shows this basic configuration.

This configuration is designed to handle a significant disaster at the primary site. The advantage of Configuration 2 is that a total loss of a site does not result in the loss of operations.  The options for recovery of data in Configuration 1 still apply to this configuration. 

5. Disaster Recovery Testing

Disaster recovery testing requires some forethought on the configuration and the specific requirements of the disaster recovery test. The handling of scratch tape pools, the ability to write to the remote Data Domain appliance, and the disposition of any data written during the test must be considered. If no data is to be written during the test, then no additional considerations are needed: the system will work without any special configurations or actions, and the remainder of this document does not apply. But if data is to be written, the following items must be considered.

5.1. Writing to the Remote Data Domain Appliance

Replication between the primary and remote Data Domain appliances must be broken to allow writing to the remote Data Domain appliance. This can be performed by the customer Data Domain administrator. After the DR test, the replication must be resumed.

5.2. Replicating Written Data back to Primary Data Domain Appliance

As of this writing, no user of the Channel Gateway/Data Domain solution requires the data written at the remote site to be replicated back to the primary site; the data is discarded after the test. If data written during the DR test is to be replicated back to the primary site, no additional considerations are required other than the temporary breaking of replication during the test. The customer Data Domain administrator will be responsible for replicating the written data back to the primary site. In this situation, the user is responsible for maintaining the integrity of the virtual tape data. If previously written virtual tape data is not to be modified, the user must be sure that the remote tape management system is fully synchronized with that of the primary site.

5.3. Discarding Test Data

If data written during the DR test is to be discarded, this requires some thought on the configuration of the remote site. Ideally, the data should be segregated such that it can be easily removed later. There are two ways to segregate the new data that is written.

5.3.1. Device Addresses for Read-Only and Write-Only

The system can be configured such that a range of devices has access only to the previously written primary data. Access to this data will require the use of only these devices. Additional devices can be configured to access only the new area for writing. If the newly written data is to be read, these write devices must also be used to read it. This option will require significant pre-planning by the systems programmer with respect to device allocations. The simplest configuration is the use of the Channel Gateway Read-Only feature described below. But if tapes to be written are requested by specific mount requests, as opposed to scratch requests, the use of different devices for read-only and writable access is required.
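As an illustration only, the sketch below shows one way such a split might be expressed; the device addresses, directory names, and dictionary layout are hypothetical examples, not an actual Channel Gateway configuration file.

    # Hypothetical example: one range of devices reads the replicated primary
    # data, a second range writes (and reads back) data created during the test.
    READ_ONLY_LOCATION = "/dd/backup"         # replicated primary data
    WRITE_LOCATION = "/dd/backup/write"       # data written during the DR test

    device_config = {}
    for addr in range(0x0900, 0x0910):        # devices 0900-090F: read existing data
        device_config["%04X" % addr] = {"locations": [READ_ONLY_LOCATION]}
    for addr in range(0x0910, 0x0920):        # devices 0910-091F: new test data
        device_config["%04X" % addr] = {"locations": [WRITE_LOCATION]}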

5.3.2. Channel Gateway Read-Only Feature

The Channel Gateway has the ability to read and write to multiple storage locations. Typically in a Data Domain solution, the devices have access to a single storage location. For more information on how this works, see the Background Information section at the beginning of this document. When this feature is implemented, the devices are configured to utilize 2 storage locations, but one will be considered “read-only” storage. This is determined by a naming convention of the NFS mount point. For example, the NFS mount point for the read-only storage will be associated with the Data Domain directory /backup, while another NFS mount point will be associated with a read/write Data Domain directory named /backup/write.
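The following sketch illustrates how such a naming convention could be interpreted when new data must be placed. The /backup and /backup/write directories follow the example above; the function and variable names are assumptions for illustration only.

    from pathlib import Path

    def is_read_only(mount_point):
        # In this example, the writable location is the one associated with the
        # Data Domain directory ending in "write"; all others are read-only.
        return Path(mount_point).name != "write"

    def location_for_new_data(mount_points):
        # Existing tapes are still read from wherever their data resides; only
        # brand-new data is steered to a writable storage location.
        writable = [m for m in mount_points if not is_read_only(m)]
        if not writable:
            raise RuntimeError("no writable storage location configured")
        return writable[0]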

By itself, this feature does not preclude writing to the “read-only” directory if a tape is overwritten or modified by the mainframe; it only causes a writable storage location to be selected for new data. Note: If replicated data is to be protected, the Data Domain administrator must set this directory to read-only status on the Data Domain appliance. A future feature of the Channel Gateway is to provide the user the ability to set the NFS mount of the Data Domain appliance as read-only. When implemented, this feature can be used instead of requiring the setting on the Data Domain appliance.

Use of the Read-Only Feature for Disaster Recovery testing requires considerations with respect to how scratch tapes are managed and determined. This will be described below.

5.4. Scratch Tape Handling

The Channel Gateway maintains a list of scratch tapes that is periodically updated by the scratch update utility. For more information on this process, refer to the separate document describing that utility. By default, the Channel Gateway will use the most recently scratched tape. This facilitates the reclamation of storage as the data is overwritten. Without special handling of scratch requests at the remote site, VOLSERs that already have associated data will be used in response to a scratch request. This would result in the “read-only” storage location being chosen and written to, which would not have the desired result of segregating written data. Additionally, the primary copy of the tape would then not match that of the remote site. To prevent this from occurring, only scratch tapes that have never been used should be mounted in response to a scratch request during a DR test. There are two methods by which this can be achieved.
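A minimal sketch of this restriction follows; scratch_list and data_dir stand in for the Channel Gateway's scratch pool file and the Data Domain storage directory, and are assumptions for illustration only.

    from pathlib import Path

    def pick_dr_scratch_volser(scratch_list, data_dir):
        # Default behavior reuses the most recently scratched tape, which would
        # select a VOLSER that already has data in the read-only location.
        # For a DR test, skip any VOLSER that already has associated data.
        for volser in scratch_list:
            if not (Path(data_dir) / volser).exists():
                return volser
        raise RuntimeError("no never-used scratch VOLSERs available")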

5.4.1. Different Scratch Pool for DR Test

The remote Channel Gateway devices can be configured to use a different scratch pool and VOLSERs. Specific requests for the previously written data will still be satisfied, while scratch requests will result in the use of the new scratch pool. This method would require special considerations by the systems programmer with respect to the configuration of the tape management system.

5.4.2. Segregation of the Scratch Pool by Channel Gateway

This method allows the use of a single scratch pool. In this configuration, the scratch pool is split between the two sites. For example, if the scratch pool consists of VOLSERs A00000-A09999, the primary site Channel Gateways will be configured to use A00000-A09000, and the remote Channel Gateways are configured to use A09001-A09999. In this way, the remote Channel Gateway(s) are guaranteed to use VOLSERs that have no associated data. Note: The scratch pool file at the remote site must have a different name than the primary site pool file, and the remote Channel Gateway devices must point to that different pool file. This scratch pool file must be created and initialized on the primary system.

Since the scratch update utility sends the entire list of scratch tapes, the primary Channel Gateway must also be configured with Scratch Pool Filtering such that only A00000-A09000 are considered valid scratch tapes and can be added to the scratch pool file when scratched. VOLSERs outside that range will not be added to the scratch pool file, and therefore will not be available for scratch mount requests.
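The sketch below illustrates both the range split and the primary-side filtering, using the example VOLSER ranges above; the function and variable names are illustrative assumptions.

    PRIMARY_RANGE = ("A00000", "A09000")   # used by the primary Channel Gateways
    REMOTE_RANGE = ("A09001", "A09999")    # reserved for the remote Channel Gateways

    def in_range(volser, volser_range):
        low, high = volser_range
        return low <= volser <= high       # fixed-width VOLSERs compare correctly as strings

    def filter_primary_scratch(scratched_volsers):
        # Scratch Pool Filtering on the primary site: only VOLSERs in the
        # primary range are added to the primary scratch pool file, so the
        # remote range is never reused at the primary site and stays data-free.
        return [v for v in scratched_volsers if in_range(v, PRIMARY_RANGE)]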

After the DR test, once replication is restored, the primary copy of the remote scratch pool file will be replicated back to the remote site. This will cause the remote pool to again contain its entire range of VOLSERs for the next DR test.

5.5. Deletion of Test Data

After the DR test, the data can be deleted. This can be performed either by the Data Domain administrator or by Luminex personnel. Currently there is no user-directed feature to perform this operation.

6. Summary

The versatile and flexible Channel Gateway/Data Domain Mainframe Virtual Tape solution with deduplication provides several means by which Disaster Recovery tests can be easily and successfully performed. In the case where data written during the test is to be deleted, forethought on the configuration is required. Ideally, the written data is segregated from the primary replicated data, and the primary data is not overwritten or modified. This is achieved by implementing “read-only” and “writable” storage locations. The ideal solution splits the scratch pool range of VOLSERs such that the remote site uses a subset that is not used by the primary site. In this manner, all new writes will go to the writable directory.