Database Archiving for Long-term Data Retention

By Craig S. Mullins


This article is adapted from the second edition of Craig’s book, Database Administration: The Complete Guide to DBA Practices and Procedures, Addison-Wesley (2013).

Organizations are generating and keeping more data now than at any time in history. Many factors contribute to this reality, one of which is sheer data growth: according to industry analysts, enterprise data is more than doubling every year. Additionally, as much as 80% of that data is not actively used to conduct business.

Why are we producing so much data? Advances in technology have better enabled our ability to capture and store data. But technology alone is not sufficient to account for the current rate of data growth.

Data is retained for both internal and external reasons. Of course, when an organization requires data to conduct business and make money, that data will be retained. And today’s organizations are storing more data for longer periods of time for internal reasons as well, most commonly to enable analytical processes to be conducted on the data. As such, businesses are inclined to keep data around for longer periods of time.

But external reasons, typically driven by the mandate to comply with legal and governmental regulations, also compel businesses to store additional data. Indeed, data retention is a significant aspect of regulatory compliance that requires focus and attention. The need to retain data is driven not just by the usual culprits, such as Sarbanes-Oxley and HIPAA, but also by over 150 international, federal, and local laws that govern how long data must be retained. Organizations need to develop plans for archiving data from their operational databases as data retention requirements expand over longer and longer periods of time.

The Lifecycle of Data

As data moves throughout its useful lifecycle, it progresses through five distinct phases: creation, operational, reference, archived, and discard. This is shown in Figure 1. Data is created at some point, usually by means of a transaction. For a period of time after creation, the data is in an operational state: it is required to conduct business. The operational state is followed by the reference state, during which the data is still needed for reporting and query purposes: internal reports, external statements, or simply in case a customer asks about it. Then, after some additional time passes, the data is no longer needed for business purposes and is no longer being queried, but it must be saved for regulatory and legal purposes. This is the archive state.
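The five phases can be sketched as a simple classification by record age. The cutoff ages below are purely hypothetical assumptions for illustration; the actual boundaries depend entirely on your business and regulatory requirements.

```python
from enum import Enum

class DataPhase(Enum):
    """The five lifecycle phases described above."""
    CREATION = 1
    OPERATIONAL = 2
    REFERENCE = 3
    ARCHIVED = 4
    DISCARD = 5

def phase_for_age(age_in_months):
    """Classify a record by its age, using hypothetical cutoffs."""
    if age_in_months < 1:
        return DataPhase.CREATION
    if age_in_months < 12:
        return DataPhase.OPERATIONAL
    if age_in_months < 24:
        return DataPhase.REFERENCE
    if age_in_months < 120:   # e.g., a ten-year retention mandate
        return DataPhase.ARCHIVED
    return DataPhase.DISCARD
```

In practice these boundaries would be driven by the policy metadata discussed later in this article, not hard-coded.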

After a designated period of time, the data is no longer needed at all and must be discarded. This phase is often relegated to an afterthought as data continues to pile up but is not used. When there is no legal requirement to maintain data, you should demand that such data be destroyed. If data that is not required for business or regulatory purposes is kept it becomes a liability instead of an asset. Any data that you keep becomes discoverable and can be used against you in a lawsuit. So why enable anyone to use your data against you if it is not a legal requirement to do so?

The data lifecycle states nothing about where the data is stored or what technology is used. But it makes sense to move archived data out of the operational database for many reasons.

Figure 1. The data lifecycle

Database Archiving

Database archiving is the process of removing selected data records that are not expected to be referenced again from operational databases, and storing them in an archive data store where they can be retrieved if needed.

Before moving on, let’s break down this definition to make sure it is understood. We say removing because the data is deleted from the operational database when it is moved to the archive. If the data is still required for operational purposes, it is not ready to be archived. When data moves into the archive state, query and access are no longer anticipated, so removing it is not problematic.

Next, we say selected records. We do not want to archive database data at the file or table level. We need those specific pieces of data that are no longer needed by the business, together with the related data that makes them understandable. The archive must be able to selectively choose particular pieces of related data for archival… not the whole database, not an entire table, and not just an isolated row. Instead, all of the data that represents a business object is archived at the same time. For example, if we choose to archive order data, we would also want to archive the specifics about each item on that order. This data likely spans multiple constructs within the database.

The next interesting piece of the definition is this: and storing them (the data) in an archive data store. Technicians sometimes confuse data archiving with purging data, but the two are very different (see the sidebar “Archive Versus Purge”). Archived data is stored separately from the operational database and requires neither the original DBMS nor the applications. Archived data is separate and independent from the production systems from which it was moved. Doing so confers several benefits upon the database and your environment, including better performance of production systems (because less data needs to be searched) and better protection for archived data (because it is protected in a separate data store).

————————————————————————————————————————————————————-

Archive versus Purge

Archiving data and purging data are two very different processes. Archived data is removed from the operational data store and maintained in an archived data store. Purged data is removed from the operational data store and discarded.

————————————————————————————————————————————————————-
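The distinction drawn in the sidebar can be sketched in a few lines, with simple in-memory dictionaries standing in for the two data stores; both operations remove a record from the operational store, but only the archive keeps it.

```python
# Illustrative in-memory stand-ins for the two data stores.
operational_store = {1: "order-1", 2: "order-2", 3: "order-3"}
archive_store = {}

def archive(record_id):
    """Move a record from the operational store to the archive."""
    archive_store[record_id] = operational_store.pop(record_id)

def purge(record_id):
    """Remove a record from the operational store and discard it."""
    del operational_store[record_id]

archive(1)   # order-1 now lives only in the archive
purge(2)     # order-2 is gone for good
```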

The final component of the definition that warrants clarification is… where they can be retrieved if needed. The whole purpose of archiving is to maintain the data in case it is required for some purpose. So the data must be readily accessible without requiring a lot of manual intervention.

Determining the Scope of Data Retention

Before you can archive data from your operational databases, you must first determine exactly what must be retained. In other words, data must be archived at the business object level. This is important because not only the specific pieces of data being archived must be retained, but any related data that makes the archived data understandable must accompany it into the archive. The archive needs to be able to selectively choose particular pieces of related data for archival; all of the data that represents a business object is archived at the same time.

For example, if you choose to archive order data, you would also want to archive the specifics about each item on that order. Think about the order entry and management systems you’ve encountered in the past. You cannot model the database such that all the data is in one table, at least not if you want to avoid producing junk. There will be an ORDER table and a one-to-many relationship to the ITEM table, because an order can consist of more than one item. If you go to amazon.com and order three books, there is one order with three items on it, right?

Furthermore, your system will likely track products and customers, so we’ll have a CUSTOMER table with a one-to-many relationship going from CUSTOMER to ORDER. This means a customer can place more than one order, but each order is for a single customer only. We’ll also have a PRODUCT table with a one-to-many relationship going from PRODUCT to ITEM. This means a product can be on multiple orders, but each individual order will have only one entry for the product; if the customer orders multiples, that is specified in the quantity field of the item.

Just to make it interesting, let’s also include a table called LOCATION that stores inventory information for products, such as bin number and quantity on hand. We’d have a one-to-many relationship going from PRODUCT to LOCATION. This means a product may be stored in more than one bin, but each bin contains only a single product type. Because a picture is worth a thousand words, refer to Figure 2 for an image of this data model.

Figure 2. Archive data model
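For a relational target such as DB2 or Oracle, the Figure 2 model might be rendered as DDL along the following lines. SQLite is used here only so the sketch is self-contained and runnable; column names beyond the keys are illustrative assumptions, and ORDER must be quoted because it is a reserved word.

```python
import sqlite3

# DDL sketch of the five-table model: CUSTOMER -< ORDER -< ITEM >- PRODUCT -< LOCATION.
ddl = """
CREATE TABLE customer (custno INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE product  (prodno INTEGER PRIMARY KEY, descr TEXT);
CREATE TABLE location (prodno INTEGER REFERENCES product,
                       bin_no INTEGER, qty_on_hand INTEGER);
CREATE TABLE "order"  (orderno INTEGER PRIMARY KEY,
                       custno  INTEGER REFERENCES customer,
                       ship_date TEXT);
CREATE TABLE item     (orderno INTEGER REFERENCES "order",
                       prodno  INTEGER REFERENCES product,
                       quantity INTEGER);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(ddl)
```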

The data spans multiple constructs within the database, meaning tables for DB2 or Oracle and segments and/or databases for IMS. What do we archive? Let’s say we need to retain order details for ten years after the order was shipped. Do we really need all of that order data clogging up our operational databases for the full ten years? Probably not. For the purposes of this business, the data is operationally relevant for perhaps a month or two. It might be referenced for reporting or by customers for a year to 18 months after that. So, we decide to keep it in the production databases for two years after the product ships, and archive it after that.

But what needs to be archived? If we just archive the ORDER and ITEM tables, we’ll be missing pertinent information in the archive. We also need to bring along customer and product information for reference. That is, we have to make sure that the archive contains more than just the CUSTNO in ORDER and the PRODNO in ITEM. We also need the customer name and contact details. What good would CUSTNO 18123546 be if that customer is no longer in the operational database? And would PRODNO 99 be of any use without the remaining product details?

So, we need to set up our archive policy to archive orders from the ORDER table along with the related items in the ITEM table, deleting that data from the operational database when we put it into the archive. We also bring along the related products from PRODUCT and customers from CUSTOMER; this data is copied to the archive with the ORDER and ITEM data, but it is not deleted from the operational database. After all, those products may appear on other order items that we are not archiving, and those customers may have other orders that we are not archiving. So we would need to create separate archive plans for our customer and product data.
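As a sketch of this archive policy in action, the following uses two in-memory SQLite databases to stand in for the operational and archive data stores. Orders shipped before a cutoff are moved (copied, then deleted) along with their items; the customers and products they reference are copied into the archive but left in place. Table and column names follow the example, except that ORDER is spelled ORDERS here because ORDER is a reserved word.

```python
import sqlite3

SCHEMA = """
CREATE TABLE customer (custno INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE product  (prodno INTEGER PRIMARY KEY, descr TEXT);
CREATE TABLE orders   (orderno INTEGER PRIMARY KEY, custno INTEGER, ship_date TEXT);
CREATE TABLE item     (orderno INTEGER, prodno INTEGER, quantity INTEGER);
"""

op = sqlite3.connect(":memory:")      # the operational database
op.executescript(SCHEMA + """
INSERT INTO customer VALUES (18123546, 'Acme Corp');
INSERT INTO product  VALUES (99, 'Widget');
INSERT INTO orders   VALUES (1, 18123546, '2010-01-15');
INSERT INTO item     VALUES (1, 99, 3);
""")

arch = sqlite3.connect(":memory:")    # the archive data store
arch.executescript(SCHEMA)

def archive_orders_shipped_before(cutoff):
    """Archive each order (and its items) shipped before the cutoff date."""
    old_orders = op.execute(
        "SELECT orderno, custno, ship_date FROM orders WHERE ship_date < ?",
        (cutoff,)).fetchall()
    for orderno, custno, ship_date in old_orders:
        arch.execute("INSERT INTO orders VALUES (?,?,?)",
                     (orderno, custno, ship_date))
        items = op.execute(
            "SELECT orderno, prodno, quantity FROM item WHERE orderno = ?",
            (orderno,)).fetchall()
        for item in items:
            arch.execute("INSERT INTO item VALUES (?,?,?)", item)
            # Bring the related product along for reference...
            prod = op.execute("SELECT prodno, descr FROM product WHERE prodno = ?",
                              (item[1],)).fetchone()
            arch.execute("INSERT OR IGNORE INTO product VALUES (?,?)", prod)
        # ...and the related customer, too.
        cust = op.execute("SELECT custno, name FROM customer WHERE custno = ?",
                          (custno,)).fetchone()
        arch.execute("INSERT OR IGNORE INTO customer VALUES (?,?)", cust)
        # Only the ORDER and ITEM rows are deleted from the operational
        # database; CUSTOMER and PRODUCT rows may back other orders.
        op.execute("DELETE FROM item WHERE orderno = ?", (orderno,))
        op.execute("DELETE FROM orders WHERE orderno = ?", (orderno,))

archive_orders_shipped_before('2012-01-01')
```

Note the design choice: the delete is confined to the business object being archived, while reference data is merely duplicated into the archive so the archived orders remain understandable on their own.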

Developing accurate and useful database archiving policies is not a trivial matter. Defining what is to be archived and how it is to be archived requires a mix of database skills, business acumen, and knowledge of legal and government regulations. By archiving your data as soon as it is no longer required for operational purposes, you can minimize the risk of data breaches. Of course, this may not be your primary reason to archive data, which is likely to be to preserve your data for long periods of time in a storage construct designed for long-term retention and preservation.

Database Archiving Requirements

Let’s examine the many capabilities required of a database archiving solution. Perhaps the most important consideration is that the archived data must be hardware and software independent. Independence is crucial because of the duration over which the archived data must exist. With a lifespan of decades (or longer)[1], it is likely that the production system from which the data was archived will no longer exist – at least not in the same form, and perhaps not at all. Think about the changes that your production applications have undergone over the past ten or twenty years. It is unreasonable to expect that today’s operational environment will still exist to enable access to archived data. We constantly change our databases, and the archive must be able to support multiple variations of the data structure as it changes.

The archive solution must be able to store large amounts of data. As we store more data, we will archive more data. And when we combine this with long, regulatory-mandated data retention periods, we have an explosive combination.

The archive must be able to manage data for very long time periods. Many data retention requirements are stated in decades, so the archived data will outlive the systems and the programmers that generated it. The archive also will outlive the media it is stored on. No media lasts forever: consider that the useful lifespan of tape is about 7 years (see the sidebar “Obsolete Hardware”). So the archive must be able to migrate the archived data from one type of media to another, and ideally it should do this automatically as the media reaches the end of its useful life.

————————————————————————————————————————————————————-
Obsolete Hardware

When planning for the storage and management of data over long periods of time, be sure to factor in hardware obsolescence as a challenge. It is possible that data may be stored on media that cannot be read by modern hardware.

One consultant tells of a SOX recovery trial run that met such a fate. They called their archival vendor and had them deliver 7 year old archive tapes. When they went to mount the tapes, they discovered that they no longer had that model tape drive anywhere in the data center. In fact, they were not commercially available any more. They had to buy several used but working units off of eBay.

————————————————————————————————————————————————————-

To support regulatory compliance, data must remain unchanged once it is archived, so the archive must be able to protect against data modification. Only read access should be available to the archived data (with the exception of periodic administration). Archived data must be guaranteed to be authentic, and mechanisms to prevent surreptitious modification are necessary, too.
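One common way to make surreptitious modification detectable is to fingerprint each record at archive time and verify the fingerprint on retrieval. The sketch below is only illustrative; real solutions would add write-once storage and cryptographically signed digests on top of a plain hash.

```python
import hashlib
import json

def fingerprint(record: dict) -> str:
    """Return a deterministic SHA-256 digest of a record.

    Keys are sorted so the same record always yields the same digest.
    """
    canonical = json.dumps(record, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical archived row from the order example.
row = {"orderno": 1, "custno": 18123546, "ship_date": "2010-01-15"}
digest = fingerprint(row)           # stored alongside the archived row

assert fingerprint(row) == digest   # unchanged data verifies
row["custno"] = 99999999
assert fingerprint(row) != digest   # any modification is detected
```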

Finally, the archive requires metadata to be useful: both metadata defining the archived data and metadata defining what to archive and when. The archive must be able to store multiple versions of the first type of metadata; as the operational schema changes, the archive must track and function across these variations in schema. The second type of metadata controls which data is archived, when, and from where; this is the metadata that drives and defines the archive itself. Both types of metadata are needed for the archive to operate.
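A minimal sketch of the first type of metadata follows, with hypothetical column lists: the archive keeps one schema definition per version, so records archived under an older schema remain interpretable after the operational schema changes.

```python
# Hypothetical versioned schema metadata for the archived ORDER data.
# Version 2 reflects a column added to the operational schema later.
schema_versions = {
    1: ["orderno", "custno", "ship_date"],
    2: ["orderno", "custno", "ship_date", "ship_method"],
}

def interpret(record_values, schema_version):
    """Pair archived values with the column names in force when archived."""
    return dict(zip(schema_versions[schema_version], record_values))

old = interpret((1, 18123546, "2010-01-15"), 1)
new = interpret((2, 18123546, "2013-06-01", "AIR"), 2)
```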

Taking all of these considerations into account, then, a secure, durable archive data store must be used to retain data that is no longer needed for operational purposes, and it must enable query retrieval of the archived data in a meaningful format until it is discarded.

Operational databases are no place to maintain historical data over long periods of time. Database archiving will become more prevalent over time and wise organizations will start planning their database archiving needs today.

Components of a Database Archiving Solution

The diagram in Figure 3 depicts the necessary components of a database archiving solution. Down the left side, starting with the operational databases, is the extract portion; up the right side is the data recall portion. The extract removes data from the operational database, and the recall restores archived data back to the operational database.

[1] In 2007, the SNIA (Storage Networking Industry Association) Data Forum published the 100 Year Archive Requirements Survey. The survey validates the need for long-term data retention. 80% of the survey’s respondents declared they have information they must keep over 50 years and 68% of respondents said they must keep it over 100 years. http://www.sresearch.com/100_Year_Archive_Requirements_Survey_and_Report.html


Figure 3. Database archival components

The whole process requires metadata to operate. You must capture, validate, and enhance the metadata that drives the archive process: you need to know the structure of the operational database and the structure of the archive. Furthermore, a robust archiving solution will be policy based. This means that policy rules are written to dictate what data is archived, when it is archived, and how long it must remain in the archive before it is discarded. This policy-based metadata must be maintained and monitored against the archive on an ongoing basis.
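A minimal sketch of such a policy rule follows, with illustrative field names and the periods taken from the order example earlier in this article (archive two years after shipment, retain for ten).

```python
from datetime import date, timedelta

# Hypothetical policy-based metadata for one business object.
policy = {
    "business_object": "ORDER",                 # root table of the object
    "related": ["ITEM"],                        # moved with the root
    "reference_only": ["CUSTOMER", "PRODUCT"],  # copied, not deleted
    "archive_after_days": 730,                  # two years after shipment
    "retain_days": 3650,                        # ten-year retention mandate
}

def discard_date(archived_on: date, policy: dict) -> date:
    """Compute when an archived business object may be discarded."""
    return archived_on + timedelta(days=policy["retain_days"])

due = discard_date(date(2013, 1, 1), policy)
```

A production archive would evaluate such rules continuously and trigger both the archival extract and the eventual discard from them.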

A query capability allowing direct reads against the archive is important, too. Query against an archive data store will not necessarily be the most efficient access because of differences in the metadata over time. However, queries against archived data typically are not very performance-sensitive, at least not like typical transaction data.

Finally, it is also important to have an ongoing maintenance capability for the archived data. This encompasses administrative tasks such as security, access auditing, administration of the structures (such as reorganization), backup and recovery, and so on.

The Impact of e-Discovery on DBA

One of the looming issues facing data management professionals is preparation for e-discovery. Not only do regulations mandate that we retain data longer; rules also dictate when and how organizations must access and produce the data that is retained. After all, the only reason data is being kept is that there may be a need to see it again.

The ability to produce retained data upon request is typically driven by lawsuits. You probably can recall examples of courtroom showdowns on television where truckloads of paper documents were required during the discovery process of the lawsuit. But times have changed. Increasingly, the data required during the discovery process is electronic, not written. That is, the data is stored on a computer, and much of that data is stored in a database management system.

Which brings me to the Federal Rules of Civil Procedure (FRCP), which are the rules used by US district courts to govern legal proceedings. One of the items in this set of rules dictates policies governing discovery. Discovery is the phase of a lawsuit before the trial occurs during which each party can request documents and other evidence from other parties or can compel the production of evidence.

The FRCP underwent changes in late 2006 to focus more on electronic data. Rule 34(b) of the FRCP was changed to state that “A party who produces documents for inspection shall produce them . . . as they are kept in the usual course of business . . .” This change clearly compels organizations to improve their ability to produce electronic data.

And Rule 37 of the FRCP adds a new section, Rule 37(f), which provides a safe harbor from sanctions arising from spoliation. According to this section, “absent exceptional circumstances, a court may not impose sanctions under these rules on a party for failing to provide electronically stored information as a result of the routine, good faith operation of an electronic information system.” Basically, this section shines a spotlight on the need for organizations to develop a clearly articulated, well-executed, and uniformly enforced records retention program. And that program should include database data. Instituting policies and procedures for how data is treated for long-term retention can provide some level of protection from “adverse inference” rulings arising from spoliation.

There are likely to be additional implications arising from the changes to the FRCP, especially when coupled with the growing list of data breaches and the growing regulations being voted into law by federal and state government. It means that we will be forced to treat data as the corporate asset that it is — instead of just saying that we treat it that way.

About the Author

Craig S. Mullins is a data management strategist and principal consultant with Mullins Consulting, Inc., in Sugar Land, TX. Craig has more than two decades of experience in all facets of database systems development, including developing and teaching DB2 and SQL Server classes, systems analysis and design, database and system administration, and data analysis and modeling. He has worked with DB2 on the mainframe since Version 1 and also has experience with SQL Server, Sybase, and IMS. Craig is also the author of two books: DB2 Developer’s Guide, currently in its sixth edition, and Database Administration: The Complete Guide to DBA Practices and Procedures.

You can contact Craig via his web site at http://www.craigsmullins.com.



 
