Monday 29 December 2008

DDP

Getting started with DDP
You probably already have the information you need to calculate your own DDP - the number of defects found in testing, and the number that were found later (e.g. a later stage of testing or reported to Support).

For example, if system testing (ST) found 50 bugs, and user acceptance testing (UAT) found 25, the DDP of ST was 67% (50 / 75).
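If it helps to see the arithmetic written down, here is a minimal sketch in Python (the 50 and 25 are just the example figures above; substitute your own counts):

def ddp(found_in_stage, found_later):
    # DDP = defects found in this test stage, as a percentage of all
    # defects known for this stage plus everything found afterwards.
    total = found_in_stage + found_later
    if total == 0:
        raise ValueError("no defects recorded - DDP is not meaningful")
    return 100.0 * found_in_stage / total

# System testing found 50 bugs, UAT found 25 afterwards:
print(round(ddp(50, 25)))  # 67 - ST's DDP, as in the example above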

Take a project that finished 3 to 6 months ago and calculate its DDP. Now you have a starting point for seeing how your testing changes over time.

Not all bugs are created equal! When you start measuring DDP, you might want to use only the highest one or two severity ratings. Or you could have one DDP for high severity, and one for all bugs.
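One simple way to do that split, sketched below with made-up severity names and records, is to compute DDP twice over the same list of defects: once restricted to the top severities and once over everything. (The list is assumed to contain everything found in the stage of interest and in all later stages.)

# Each record: (severity, stage where it was found). Illustrative data only.
defects = [
    ("high", "ST"), ("high", "UAT"), ("medium", "ST"), ("medium", "ST"),
    ("low", "ST"), ("low", "UAT"), ("medium", "UAT"), ("low", "ST"),
]

def ddp_for_stage(records, stage, severities=None):
    # DDP of one test stage, optionally restricted to certain severities.
    if severities is not None:
        records = [r for r in records if r[0] in severities]
    found_here = sum(1 for _, where in records if where == stage)
    return 100.0 * found_here / len(records) if records else None

print(ddp_for_stage(defects, "ST", severities={"high"}))  # high-severity DDP of ST
print(ddp_for_stage(defects, "ST"))                       # DDP over all bugs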

If you have any questions about DDP please ask!

9 comments:

Anonymous said...

Hi,

We use DDP to check the effectiveness of quality-enhancing measures in our software construction.
However, I wonder whether there are benchmarks (industry averages) available?

Dot Graham said...

Hi Bernhard,

Good question - thanks - and glad to hear that you are using DDP!

It is very natural to want to know how we compare to other organisations, but I actually don't think that it is very useful for DDP, for a number of reasons:

1) Different organisations have different objectives for testing and therefore different appropriate levels of DDP. For example, safety-critical applications would want a higher DDP than a game, where time to market is more important.

2) If you found a company very similar to yours, they might well be your competitor, so their DDP information may be difficult to get or unreliable.

3) Different organisations record defects and severity levels in different ways, so you may not be comparing like with like. For example, if you count all defects but another organisation eliminates duplicate bugs before calculating DDP, the two figures would not be comparable.

Having said this, I have seen some general trends (not really benchmarks, more ball-park figures). If you are not very good at testing or don't take it very seriously, DDP may be 50% to 60%. If you are trying to do testing well but perhaps don't have a lot of expertise or training, DDP may be 70% to 80%. If you are using techniques well and have adequate time for testing, then DDP might be 90% to 95% or even higher.

BUT - if company A has a DDP of 60% and B has 80%, that doesn't mean that B is doing better testing, even though they find a greater percentage of defects than A. Why?

1) B may have many times more defects than A for a similar size system (finding 800 of 1,000 defects, leaving 200 still in) - it may not have been worth the cost of finding the remaining 4 defects (having found 6 out of 10) for A.

2) 60% DDP might be perfectly adequate for company A, with great features, quick delivery and a few bugs. Even though company B found 80% of the defects, its software may not meet stakeholder needs or may be unusable. DDP isn't the only measure!

The main benefit of using DDP is to compare to yourself as you were in the past, to see if your testing process is improving at finding defects. This is what I recommend.

All the best in your future testing!

Paul Herzlich said...

Dot - you make it sound deceptively simple. You can't do DDP without taking into account the seriousness of defects. One production failure attributed to a bug that should have been found earlier is worth thousands (not exaggerated) of less serious bugs found earlier. So you cannot simply divide 50 by 75 or whatever. Although the data might exist in the organisation to calculate DDP, it may not be yours as test manager. In reality, DDP is as hard to work as any other metric. Also, you don't seem to like the code-oriented metrics, but for many code-oriented shops (and I find testing is going that way), those work fine. DDP is not as superior as you make it out to be. It all depends on one's orientation.

Dot Graham said...

Hi Paul,

Thanks for your comment - you raise a number of challenging and important points.

1) Seriousness of defects
Yes I agree that the seriousness of defects needs to be taken into account, and that one bug may be worth thousands of lesser bugs.

DDP is most useful when there are a reasonably large number of bugs of similar severity. Although it counts the defects from specific projects or stages of testing, DDP is really giving information about how the testing process is doing. So you’re right that one "super-major" bug would be counted for DDP in the same way as other major/serious bugs. (If you have lots of super-major bugs, I suppose you could measure DDP for that category, but I don't think that would make sense, as hopefully there won't be very many of them!)

When you come to decide which bugs to fix, then you would look at the seriousness of the defects, so the super-major would be first to be fixed. So knowing the relative severity is important, but that doesn’t mean that DDP isn’t valuable.

So I think that you can divide 50 by 75, because that does give you useful information, even if it doesn't give you all of the information you need to work with. Yes, you are treating all the bugs the same way by lumping them together to be counted, but all ways of seeing the world involve some kind of simplification. I don't remember who first said it (it is usually attributed to the statistician George Box):
"All models are wrong, but some models are useful."

2) Data may not be yours as test manager
Sometimes it can be very difficult for the test manager to get access to the defect statistics, particularly those reported by users.

However, difficult is not the same as impossible! Sometimes all the test manager needs to do is simply to ask for that data. The support people may be quite pleased that someone else is interested in data they have collected.

Other times it can take some effort to try and find out; sometimes the data is collected in a different form or with different definitions of severity, or on a different bug database. So, yes, there can be significant problems here. However, it is worth at least trying to get the data.

If it truly is impossible, or it would take too much time and effort to get the data, then you would have to admit defeat and you wouldn’t be able to calculate DDP using that data.

However, you can still measure DDP of different test stages that are within your control, which can give you interesting information.

3) DDP is as hard to work as other metrics.
I think DDP is easier to work than most, but as with other metrics, there are a lot of things to take into account - there are a lot of "tunes to play" on the basic melody.

It is important to understand exactly what it is that DDP is measuring, but it is a simple enough concept that it can be explained quickly (even to high level managers) ;-)

As with other metrics, DDP can fairly easily be sabotaged. For example, if testers report each bug they find 10 times (perhaps with slightly different wording), but the bugs found afterwards are reported normally, then their DDP will look much better than it really is. But why would they do this? This behaviour normally occurs when there is fear, which means that the metric is perceived as a vehicle for punishment rather than improvement. This type of blame culture will affect any metric, not just DDP.

4) Code-oriented metrics
Ah, but I do like code-oriented metrics! They are great for developers, and it is great that developers are now more interested in testing. I tend to use system and acceptance test as examples for DDP, as it is more easily applicable to those levels. If existing code-oriented metrics are working fine, great! DDP is useful if you want to monitor your effectiveness at finding defects, over time. Other metrics are useful for other purposes.

One related thing that I am sometimes asked: should developers use DDP on their own code? Yes, if they want to know how effective they are as individuals at removing their own defects - as a personal measure. If they know how many defects they found in their own testing and how many were found afterwards, then they can measure their own DDP. But a word of warning: use this personal measure only privately, to measure your own progress. Do not ever let this measure be used, in any form whatsoever, as a personal assessment of individuals by anyone! (I could go into more detail, but not now.)

5) Superiority of DDP?
Superior - compared to what? If an organisation has no idea how well their testing process finds defects, and if finding defects is an objective for the testers, then, yes, I do think DDP is a very good metric. But of course if those don't apply, or if you don't have many defects, then DDP would not be a good one to use. I certainly am not saying this is the only metric to use, but lots of people have found it very useful, and I like it because it is relatively simple!

So I agree that context/orientation is important, but I do still think DDP is a superior metric!

One question for you - what happened when you used DDP, or tried to use it?

I look forward to your response.

Unknown said...

Hi Dorothy,
the DDP is a tool to improve testing, but have you used the DDP to improve other disciplines within the software development process? Is DDP a measurement to use whenever you do process improvements (e.g. requirements)? Should any improvement affect the DDP positively?!

Dot Graham said...

Hi Ann-Charlotte,

Thanks very much for your comment! Sorry I didn't reply earlier.

You start by saying that DDP is a tool to improve testing. I may be nit-picking here, but this is actually not the case.

An analogy: measuring your fuel economy doesn't get you better mileage - driving more gently does, and the fuel economy reflects this.

DDP is a way of measuring the testing. To improve the testing, you would need to change the way you test, for example by using more techniques. Any improvement would then be reflected in an improved DDP.

I see DDP as mainly a measure of the testing process, rather than of any software development process, with one exception: DDP could be used as a measure of a reviewing process, for example. Basically we are looking at how good a defect-detecting process is (given a reasonable set of existing bugs to detect from).

If you improve your requirements (or other software development processes), the main result should be that you put fewer bugs into the system, so the main effect is on the number of defects, not DDP.

Of course, reducing the number of defects could have an effect on DDP indirectly, but it could go either way.

1) With fewer bugs in the "pool" to be detected (or not), it may take longer to find them, so perhaps we would actually find fewer bugs, so the DDP could get worse!

2) On the other hand, if the requirements are more testable, then maybe the bugs that are there would be easier to find, so the DDP would get better.

So I would say that DDP is not useful for any software development improvement, only for testing and reviews or other defect finding activities.

Hope this is helpful.

Michael Bolton http://www.developsense.com said...

"Although it counts the defects from specific projects or stages of testing, DDP is really giving information about how the testing process is doing."

Is that the only explanation?

Let's look at a few extreme cases for the purpose of demonstrating the problem.

a) Imagine that a product never ships. The number of defects found in testing will probably be some number. The number of defects found after release will be zero. Does this mean that the test effort was infinitely good?

b) Imagine that we have a very nervous project manager, who decides to wait for a year of testing before he ships the product. The testers have obtained superb coverage, so no problems are found in the field. However, it turns out that the testers did no work at all in the last six months. DDP here doesn't tell us about the effectiveness of the testing in the last six months.

c) Imagine that we have an aggressive test manager who decides to ship the product after a day of testing. The testers are experts and do great work in that first day, yet many problems are found in the field. DDP is low.

d) Because of the results in (c), the testers are fired and replaced by outsourced incompetents from Elbonia. The programmers, alarmed by the last release, decide to improve their code. (Something very much like this triggered the Agile movement.) The Elbonians are given a week. They discover a handful of cosmetic bugs. No problems are found after release (but that's because the programmers did extra work). DDP is high. The Elbonians end up looking better than the now-redundant expert testers.

e) On the next release, the programmers are so worn out that they quit. Desperate, management hires more incompetents from Elbonia. The testers find lots of problems in the product, but after a while they stop finding them (they're incompetent, remember?). The product ships with far more undiscovered problems. Customers are so appalled that they don't even report the problems to tech support. Some just call customer service for a refund; others don't even bother to do that.

Okay, so those examples are extreme, but they're intended to illustrate the following problems:

1) DDP doesn't make sense for projects that get cancelled or dramatically changed in mid-stream.

2) DDP doesn't account for the role of project management in deciding when and why to ship. That's always a business decision, not a technical one. Project managers might decide to ship the product early because of market pressures or contractual obligations.

3) DDP ignores the role of the programmers and the quality of the code.

4) DDP doesn't account for variance in the role of the customer and their relationships with the product and the company.

5) DDP doesn't account for variations in the relationship between technical support and the testing organization.

6) DDP presumes that a bug is an object--classic reification error. A bug is not a thing in the world, like a fork or a truck; a bug is a relationship between the product and some person. Bugs found in testing may not bother the end user; bugs found in the field may not have been possible to find in testing due to odd configurations or unexpected field use.

7) DDP doesn't take timing into account. At one of my clients, a product had great DDP numbers, even though lots of bugs were found in the field. Why? Enterprise customers are conservative. It took 18 months to deploy the product. Defect escapes were only counted in the first six months after release. Plus, by the time the real problem reports started coming in, the test, programming, and management teams had all changed.

I recommend that people take a serious look at the list of ten questions to ask about metrics, found in this paper by Kaner and Bond, and apply them to DDP, just as the paper applies them to MTBF. I contend that DDP is as close to an invalid metric as you can get.

---Michael B.

Dot Graham said...

Hi Michael,

Thanks for your comments – it’s great to have your input to this!

Part 1
“All models are wrong; some models are useful.”

I suggest a corollary:
All measures are flawed; some measures are useful;
and all measures can be sabotaged.

Your comments have stimulated me to spell out the “rules” for when DDP is appropriate to use (which are described in the downloadable slides). Here they are more concisely:

DDP is appropriate to use when:
1) you keep track of defects found during testing
2) you keep track of defects found afterwards
3) there are a reasonable number of defects – for both 1) and 2)
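
A small sanity check along those lines might look like this in Python (a sketch only; the threshold for "a reasonable number" is my own arbitrary choice, not part of the measure):

def ddp_if_meaningful(found_in_testing, found_afterwards, min_each=5):
    # Returns (ddp, message). DDP is only quoted when the three rules hold.
    # min_each is an arbitrary illustrative threshold for "a reasonable number".
    if found_in_testing is None:
        return None, "rule 1: defects found during testing were not tracked"
    if found_afterwards is None:
        return None, "rule 2: defects found afterwards were not tracked"
    if found_in_testing < min_each or found_afterwards < min_each:
        return None, "rule 3: too few defects for DDP to be meaningful"
    total = found_in_testing + found_afterwards
    return 100.0 * found_in_testing / total, "ok"

print(ddp_if_meaningful(50, 25))    # (66.67, 'ok')
print(ddp_if_meaningful(50, 0))     # rule 3: too few found afterwards
print(ddp_if_meaningful(50, None))  # rule 2: no field data at all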

To respond to your points:

a) Product never ships. If the product is never released or used, the denominator will always equal the numerator, so DDP will be 100%. Does this mean the testing was good?
- We don’t know - this violates rules 2 and 3 above.

b) Test for 6 months, no testing for 6 months, then ship; measure DDP from the 2nd 6 months.
- This violates rule 1, as you have not measured the defects found while testing was happening.
- Actually if NO defects were found in the field, it also violates rule 3.

c) Ship after one day of testing, many problems in the field, DDP is low.

YES – this is exactly when DDP is useful because it demonstrates the consequences of shipping too early.

You can even use past DDP as a way to predict (roughly) the number of defects that your customers will find. You can say to your manager, “Remember last time, when we shipped at a similar point in the testing? Remember all the calls from angry customers about all the bugs they were finding? If we stop testing now, our DDP is likely to be similar - this means that the 250 we have already found is probably only about 60% of the bugs, so around 170 more will be found in the field.”

Predicting the number of defects missed this time, based on the DDP last time is not an exact science, but it gives a reasonable “ball-park” figure if the testing process has not been dramatically changed.
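As a rough sketch of that sum (using the 250 defects and 60% DDP from the example above):

def predict_remaining(found_so_far, past_ddp_percent):
    # Ball-park only: assumes this release's DDP will be close to last time's.
    estimated_total = found_so_far / (past_ddp_percent / 100.0)
    return round(estimated_total - found_so_far)

# 250 found so far, past DDP around 60%:
print(predict_remaining(250, 60))  # 167 - i.e. roughly 170 more defects in the field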

d) Testers fired, Elbonian testers incompetent; developers improve software so fewer defects are now produced. Elbonian testers find a “handful” of bugs, none in the field. DDP high.
- Actually DDP would be 100% if none are found in the field. This and a “handful” of bugs violate rule 3.
- In fact the DDP is high in this case - the testers have found all the bugs that were found both during and after testing. The cause of DDP being either high or low has nothing to do with the measure itself.

e) Programmers quit, loads of bugs, not found by testers, customers appalled and don’t report any. (Have you thought of writing for Coronation Street?)
- Having no defects reported doesn't mean there are no defects. But DDP works on reported defects; if none are reported, then this violates rule 2.


Here is another story along those lines (this is a true one, not Elbonian). There were a series of releases of a product, and they kept track of the DDP for each release. Most of them were between 90% and 95% which was reasonable in their context. But there were two releases that appeared to have 100% DDP – the most recent and one about 4 releases back. What was the cause?

The most recent one hadn’t been released yet, so they shouldn’t have included the DDP for that one anyway since they had no data from the field (rule 2).

The one from 4 releases back had been released. However, it turned out that none of the users had actually installed that release – they had all skipped it and gone for the next release. (Also rule 2 but they hadn’t realized it at the time.)

Moral of the story: the measure gives you information but doesn’t tell you why it has the value it does. You need to investigate the reasons, especially for unusual or unexpected values.

Continued in next comment ...

Dot Graham said...

Response to Michael, Part 2

In the second part of your posting, you describe a number of “problems”. I think that many of these are nothing much to do with DDP.

1) Project cancelled or dramatically changed mid-stream.
- I would agree that DDP would not make sense here.

2) DDP doesn’t account for Project Manager’s decision to ship.
- DDP doesn’t “account” for anything except the proportion of defects found relative to the known total.
The PM’s decision to ship for whatever reason will have consequences in terms of the defects that testing has not had time to find, and DDP will reflect that. The measurement of DDP is not one that can influence a “ship now” decision for the current project, because it uses data that is only available after the project has shipped. However if the PM learns from past experience, DDP from previous releases may help (as mentioned in Part 1).

The next three comments are a bit confusing to me – they don’t seem to be related to DDP, or only rather indirectly.
3) DDP ignores programmers and quality of the code
- I suppose DDP is related to the quality of the code because it is defined in terms of defects (in code), but that’s not the only measure of quality for code or programmers.
4) DDP doesn’t account for variance in customer relationships
- Different customers would be satisfied with different DDPs, depending on their defect tolerance.
5) DDP doesn’t account for variance in tech support – tester relationship.
- Perhaps some relevance here if customer-reported defects are difficult to find out about?

In your point 6, you raise two interesting but different points.
6) a) Presumes a bug is an object – it is a relationship with a user. Some bugs won’t bother a user.
- Agreed. If a user reports it, then it is something that bothers them, so DDP actually reflects that.

6) b) Bugs in the field may not have been possible to find in testing.
- This is one of the options to choose when deciding how you will measure DDP - one of the "tunes" that can be played on this measure.

There are two options here: one is to do an analysis of each bug found in the field to determine whether or not testing could have found it. If so, then it is counted in the denominator for DDP; if not, then it is not counted.

What does this give you?
You have a more accurate value for DDP, reflecting only those bugs that testing should have found.

But what has it cost you? The analysis time may be significant – is the additional accuracy worth the time? Usually, it isn’t – a few percentage points is not worth hours of effort.

The other (simpler) alternative is just to count them all anyway.

Yes, this is slightly unfair, and not a totally accurate measure. But as long as you are unfair in the same way all the time, you have something that can tell you how your testing is doing compared to last time.
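
A sketch of the two options (the could_have_been_found flag is hypothetical; in practice it would be set by analysing each field report):

# Hypothetical field defect records, flagged during the (optional) analysis.
field_defects = [
    {"id": 101, "could_have_been_found": True},
    {"id": 102, "could_have_been_found": False},  # e.g. an odd customer configuration
    {"id": 103, "could_have_been_found": True},
]

def ddp_excluding_unfindable(found_in_testing, field_defects):
    # Option 1: only count field defects that testing could have found.
    countable = sum(1 for d in field_defects if d["could_have_been_found"])
    return 100.0 * found_in_testing / (found_in_testing + countable)

def ddp_counting_all(found_in_testing, field_defects):
    # Option 2 (simpler): count every field defect, findable or not.
    return 100.0 * found_in_testing / (found_in_testing + len(field_defects))

print(ddp_excluding_unfindable(45, field_defects))  # a few points higher
print(ddp_counting_all(45, field_defects))          # a little "unfair", but consistent over time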

7) Timing. DDP was measured after 6 months, even though customers took 18 months or more to report defects.
- This violates rule 2, as it doesn’t allow sufficient time for the defect count to accumulate to give a sensible measure. DDP is measured at some point after release, but whether that point is 1, 3, 6, 18, or 24 months, or the next Sprint, depends on your context.

I have read Kaner’s paper (and your Better Software articles) on measurement and have applied the 10 questions to DDP, but this post is long enough!

Is DDP “as close to an invalid metric as you can get”? Kaner seems to say that all metrics are invalid, but I contend that some metrics are still useful, including DDP.

It builds on something most people already have, and is a relatively simple measure to do. It is based on one objective for testing (finding defects), but is a way to compare testing efforts for different projects, with varying numbers of defects, different development technologies, etc. It measures what all testing efforts have in common – finding (and missing) defects; hence it is a measure of the testing process.

Thanks for your comments – I hope I have answered your points.