Getting Your Test Data Right
If you are a tester or a developer, you most likely work in some manner or another with test data on a daily basis. Many of us take for granted that test data is just there to be used and that any data is good data. What has been learned the hard way over time is that not all test data is created equal. Provisioning test data can be a time-consuming endeavor. Doing it wrong not only wastes time, but bad data can have a real negative impact on the quality of your application, your deployment timelines, and your team resourcing.
This article outlines the ten things everyone should know about test data. The goal is to improve your overall understanding of test data and hopefully, in the process, provide some insights that will allow your teams to acquire the best test data possible in the least amount of time and improve the overall timeliness and quality of your releases.
1. Production Data Is Your Best Test Data...
This should come as no surprise to anyone, unless you are new to the game. The reason is that your production data probably has naturally occurring complexities and anomalies that result from the processing done by your applications. Those complexities and anomalies are very hard to anticipate and replicate. Hence, manufactured and synthetic data often fall short of top-quality test data.
2. …but, Testing With Production Data Can Be Risky
This is especially true in healthcare, financial services, banking, insurance or other areas where you work with personally identifiable data (PHI, PCI, PII). These compliance issues should not be ignored. In fact, the risks around this type of data are growing as more and more states adopt privacy laws. Data obfuscation methods should be put in place to ensure customer privacy and regulatory compliance.
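As a concrete illustration of obfuscation, here is a minimal masking sketch. All field names and the salt are hypothetical; a real TDM tool would do far more, but the two core ideas are shown: deterministic pseudonymization (so the same input always maps to the same token, preserving joins) and outright replacement of free-text PII.

```python
import hashlib

# Assumption: a per-refresh secret salt, rotated with each data refresh.
SALT = "rotate-me-per-refresh"

def pseudonymize(value: str) -> str:
    """Hash a sensitive value so the same input always maps to the same
    token, preserving referential integrity without exposing the original."""
    digest = hashlib.sha256((SALT + value).encode()).hexdigest()
    return "cust_" + digest[:12]

def mask_record(record: dict) -> dict:
    """Mask the PII fields of one (hypothetical) customer record."""
    masked = dict(record)
    masked["ssn"] = "***-**-" + record["ssn"][-4:]   # keep last 4 for realism
    masked["email"] = pseudonymize(record["email"]) + "@example.test"
    masked["name"] = pseudonymize(record["name"])
    return masked

row = {"name": "Ada Lovelace", "ssn": "123-45-6789", "email": "ada@real.example"}
print(mask_record(row))
```

Because the pseudonym is derived from the value itself, the same customer masks to the same token in every table, which keeps cross-table joins testable.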
3. Lack of Test Data Is a Big Issue
If you have ever been stuck waiting for someone to give you the data you need to complete a test, you know what I mean. The World Quality Report (WQR), recently published by Capgemini, says that “…the lack of test environment and data is the number one challenge our respondents face when applying testing to Agile development.” The result is that teams tend to use agile approaches to define and develop but revert to waterfall for testing and deployments. This nullifies many of the benefits that the investment in agile was meant to deliver.
4. 62% of Companies Still Use a Copy of Production
To solve the lack of test data, guess what? Despite the risks, a majority of respondents in the WQR stated that they still resort to simply making a copy of their production database(s). Given the solutions available in the market, this is a bit hard to understand. Rolling the dice on something that could be the source of a major breach does not seem like a well-thought-out strategy.
5. Test Data Can Get Stale Fast
We all know that data ages. Let’s say that your billing application has special processing for accounts with a late payment in the last 90 days. To test this, you want 100 records that satisfy that criterion. As time passes, those 100 records will no longer all meet it. Your test data set needs to be continually refreshed. If you haven’t figured out how to do that efficiently, you will be forced to live with less than up-to-date data or will be stuck waiting.
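The aging effect is easy to demonstrate. The sketch below uses the billing scenario above (the schema is hypothetical: each account carries a last-late-payment date). Because the 90-day rule is evaluated against "today," records that qualified on the day the snapshot was taken silently stop qualifying as the calendar moves on.

```python
from datetime import date, timedelta

def qualifies(last_late_payment: date, as_of: date) -> bool:
    """True if the account had a late payment within 90 days of as_of."""
    return as_of - last_late_payment <= timedelta(days=90)

snapshot_day = date(2024, 1, 1)
# Three accounts whose late payments were 10, 50, and 85 days old at snapshot time.
accounts = [snapshot_day - timedelta(days=d) for d in (10, 50, 85)]

# All three qualify on the day the test set was built...
print(sum(qualifies(a, snapshot_day) for a in accounts))  # prints 3

# ...but 40 days later, the 85-day-old record has aged out of the window.
later = snapshot_day + timedelta(days=40)
print(sum(qualifies(a, later) for a in accounts))  # prints 2
```

The fix is not to hand-edit dates but to re-select (or programmatically refresh) the qualifying records on each test cycle.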
6. Good Test Data Avoids Extra Development Work
How many times have you found a bug in the code only to find that it wasn’t a bug after all but rather an artifact of bad test data? A surprisingly large percentage of all bugs discovered during testing are caused by bad test data. The result of this is unnecessary time spent analyzing and troubleshooting the issue, time that would have been better spent writing new code or testing other features.
7. Good Test Data Avoids Production Issues and Costs
The corollary to item #6 is that bad test data can cause a false negative, meaning it can mask a code issue. This is also the basis of point #1. The result here is additional time spent fixing production issues, which, as we know, is typically more costly than catching something in test. For example, IBM estimates that a bug costing $100 to fix during requirements gathering would cost $1,500 in the testing phase and $10,000 after being introduced into production. Not to mention that it could cause customer inconvenience, impact your brand and ultimately the bottom line if it is significant enough.
8. Provisioning Test Data Can Be Resource-Intensive
OK, so hopefully by now you get it: Production data is good for testing but risky, and poorly constructed test data has its own set of issues. Unless your database is really small and simple, manually creating data with a spreadsheet (as done by 66% of the WQR respondents) is probably not going to be helpful. Based on this knowledge, you decide to do it right. You’re going to use production data but mask it in some manner so that it’s risk-free. You will likely quickly find that this is hard and has its own set of challenges.
This is especially true when you add any amount of complexity, such as data being shared across multiple databases or databases with hundreds of tables. It is one thing to mask records in one database, but let’s say you have date-related data in three databases, and you need to obfuscate date records. Ensuring your scripts offset all the dates by the same amount is tricky; varying the offset is trickier; and ensuring that the offset does not result in unintended processing errors is trickier still. Oh, and don’t forget, you now have to maintain those scripts.
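One common approach to the consistent-offset problem is to derive the shift from the entity itself rather than hard-coding it. The sketch below (all names hypothetical) derives a stable per-customer offset from the customer ID, so every database shifts that customer's dates by the same amount without sharing any state, and intervals between dates survive obfuscation.

```python
import hashlib
from datetime import date, timedelta

def entity_offset(entity_id: str, max_days: int = 365) -> timedelta:
    """Derive a stable pseudo-random offset (1..max_days) from the entity id,
    so the same customer shifts by the same amount in every database."""
    digest = hashlib.sha256(entity_id.encode()).digest()
    days = int.from_bytes(digest[:4], "big") % max_days + 1
    return timedelta(days=days)

def shift(d: date, entity_id: str) -> date:
    """Obfuscate a date by subtracting the entity's fixed offset."""
    return d - entity_offset(entity_id)

# Two dates for one customer, stored in two different databases:
opened, last_payment = date(2023, 3, 1), date(2023, 6, 1)

# The interval between them is preserved after shifting (92 days either way),
# so date-driven processing rules still behave sensibly.
print(shift(last_payment, "cust42") - shift(opened, "cust42"))  # 92 days
```

The remaining hard part, as the article notes, is validating that shifted dates never trip business rules (e.g. a payment date landing before an account-open date in another, unshifted system), which is exactly where hand-rolled scripts tend to break down.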
9. Yes, You Can Have Too Much Test Data
This may sound kind of counter-intuitive after all the talk about needing to get good test data and the effort it can take. The reality is, many teams, especially those that make a copy of their production database (obfuscated or not), copy the ENTIRE database. There are several problems with that.
First, unless you are doing load or performance testing, it’s unlikely that you need all that data. You can shorten the time to load (and reload) the environment by using a subset of the database. You can also lower your storage costs with a smaller footprint. With the right tooling, you can create subsets that are much smaller but still representative of the production data set.
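The trick with subsetting is referential integrity: you cannot just take the first N rows of every table, because child rows would point at parents that were left behind. A minimal sketch of the idea, using hypothetical in-memory customer and order tables: pick a slice of parent rows, then pull only the child rows that reference them.

```python
# Hypothetical tables: 100 customers, 500 orders (5 per customer).
customers = [{"id": i, "region": "EU" if i % 2 else "US"} for i in range(1, 101)]
orders = [{"id": 1000 + i, "customer_id": (i % 100) + 1} for i in range(500)]

def subset(customers, orders, keep_every=10):
    """Keep every Nth customer, then keep only the orders that belong
    to a kept customer, so no order dangles."""
    kept = customers[::keep_every]
    kept_ids = {c["id"] for c in kept}
    kept_orders = [o for o in orders if o["customer_id"] in kept_ids]
    return kept, kept_orders

small_customers, small_orders = subset(customers, orders)
print(len(small_customers), len(small_orders))  # prints: 10 50
```

In a real database this walk has to follow every foreign-key chain (and sampling every Nth row is a crude stand-in for stratified selection), which is precisely why dedicated subsetting tools exist.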
10. Not Solving Your Test Data Issues Increases Your QA Costs
There are hidden QA costs for firms that haven’t solved this. For example, if you can’t figure out how to load less than the full copy of your production database per item #9 above, you are incurring additional storage costs. If your teams are struggling to find and acquire good test data, your timelines are likely to be impacted.
QA delays are one of the root causes of over-staffing. The delays often cause resources to be sub-optimally deployed, meaning people work on lower-priority tasks while they wait on test data or on the testing team.
Test Data Can Be Provisioned Faster, Safer and Easier
You can acquire production-quality test data without the risk, without the manual effort, without complicated scripts to maintain, and without a full copy of your production database.
The solution is to invest in a test data management (TDM) tool suite. There are a wide variety of offerings in the marketplace. You will find everything from simple tools that only work with a specific database type to enterprise-class suites that can handle the most complex environments, including multiple data sources and data types. Even at that level, there is a lot of variability both in features and in licensing (cost and models). If you are doing DevOps, there are special challenges that come with finding a tool that can integrate into your environment seamlessly.