Proxy NPS: Getting staff-level NPS when sample sizes are small
Net Promoter Score® is the gold standard for companies looking to understand, with a single simple measure, how well they are building customer advocacy and loyalty. A company's NPS is a single aggregate score built from one question, asked of a sample of customers: how likely are they to recommend the company to friends and family? Respondents are classed as promoters, passives or detractors, and the score is the percentage of promoters minus the percentage of detractors. NPS as a metric now has significant research support showing a positive correlation between high NPS, increased customer retention and share price growth.
Building on NPS, companies now regularly use Transactional NPS. This asks a similar survey question, but soon after a service interaction, and the question is often modified to ask the customer to consider that recent interaction in their response. On the face of it, this looks like a logical step – if it is possible to get a metric direct from customers that tells a team whether it is getting better, that has to be good, right?
To understand what the problems might be with the Transactional NPS approach, imagine a team which takes 5000 calls per week. Let's simplify the experiment and designate each call as either a good (promoter) call or a bad (detractor) call from the surveyed customer. In the first week of our experiment there were 2200 good calls and 2800 bad ones; in the second week things improve, with 2800 good calls and 2200 bad. The NPS for the first week is (2200 − 2800) / 5000 × 100 = −12, and by the same calculation the second week scores +12, but these scores are based on all the calls in the week. Obviously it is not possible to survey 5000 customers each week, so how many customers do we need to survey to be sure we know things have got better (or worse)?
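To make the calculation concrete, here is a minimal Python sketch of the week-by-week arithmetic (the function name and figures are ours, purely for illustration):

```python
def nps(promoters: int, detractors: int, total: int) -> float:
    """Net Promoter Score: % promoters minus % detractors."""
    return 100 * (promoters - detractors) / total

print(nps(2200, 2800, 5000))  # week 1: -12.0
print(nps(2800, 2200, 5000))  # week 2:  12.0
```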
To better understand the problem, it helps to visualise each call as a marble: white for good calls and black for bad ones. Now imagine the two weeks as two buckets: one bucket (week 1) holds 2200 white and 2800 black marbles, and the second (week 2) holds 2800 white and 2200 black. The question "How many customers do I need to survey to confidently know that things really got better?" is equivalent to asking "How many marbles should I take from each bucket to know which is the better bucket?".
How many marbles do I need to take to know this is the better bucket?
If I take 10 marbles from each bucket, it is perfectly possible, and in fact quite likely, that just by chance I will pick more white marbles from the week 1 (bad) bucket. If I take 20 marbles from each bucket, I am more likely to get more white marbles from the "good" bucket, but how much more likely?
What the NPS standard suggests is that we should take the number of marbles necessary to have better than a 19 in 20 (95%) chance that the marble collection (sample) with more white marbles really did come from the good bucket. That is, that the difference between the samples reflects the real difference between the buckets.
In this example a surprisingly large number of marbles is needed to get to this level of certainty. In fact, taking 100 marbles from each bucket is not enough to give 19 in 20 confidence, or even 11 in 12 confidence, that the sample with more white marbles came from the good bucket! There is a significant chance that the sample will tell us things got worse, when in fact they got better.
There is a significant chance the sample tells us things got worse when they got better.
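You can check the marble arithmetic for yourself with a short simulation. The sketch below (our own Python, using the bucket contents from the example above and an arbitrary trial count) estimates how often a sample of n marbles per bucket correctly identifies week 2 as the better week:

```python
import random

def sample_nps(white: int, black: int, n: int) -> float:
    """Draw n marbles without replacement and return the sample NPS."""
    bucket = [1] * white + [-1] * black   # +1 = promoter, -1 = detractor
    return 100 * sum(random.sample(bucket, n)) / n

def prob_correct(n: int, trials: int = 20_000) -> float:
    """Estimate the chance the week-2 sample outscores the week-1 sample."""
    wins = sum(
        sample_nps(2800, 2200, n) > sample_nps(2200, 2800, n)
        for _ in range(trials)
    )
    return wins / trials

for n in (10, 20, 100, 200):
    print(f"n = {n:3d}: P(correct ordering) ~ {prob_correct(n):.2f}")
```

Note that ties between the two samples are counted as failures here, which matches the cautious reading of the 19 in 20 rule.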
This has serious implications for using Transactional NPS at a team level: if, as a manager, you cannot be confident in your metrics, then actions you take from those metrics can quickly lose credibility. Indeed, where a team achieves a sample of perhaps only 50 Transactional NPS responses in a week, and where NPS typically shifts 15 points or less between weeks, the chance that the measured shift represents what has really happened is so close to random that the figure becomes almost meaningless. In this situation, conversations about why things got better or worse are actively damaging. Improvement initiatives may have made things better, but the data can easily tell the opposite story; CSRs (customer service representatives) may feel they had a bad week and slipped in their service responses, but the data may say they got better! Effective initiatives can get killed and CSRs can become demoralised.
Increasing sample sizes to 200 or 300 surveyed calls per week for each team is impractical: customers will not stand for that level of surveying, never mind the increased cost. Our suggestion for dealing with this challenge is therefore to derive a proxy quality score, aligned to Transactional NPS, for every conversation. All 5000 calls would then be scored, and a selection of these proxy scores would be continually checked against real Transactional NPS scores (which you should continue to sample regularly).
We have implemented CSR self-assessed NPS proxy scores for call evaluation. There are challenges with this, but they can be overcome by using modern regret-aversion techniques to avoid gaming, and by ensuring that call self-assessment skills are continually monitored and trained for. The accuracy of the proxy NPS on calls is measurable (by comparing it against real Transactional NPS scores for the same calls) and becomes a new metric to track. There is some additional effort, but the increase in clarity and team motivation makes it worthwhile.
Implement a CSR self-assessed NPS proxy score for call evaluation
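To illustrate what that accuracy check might look like, here is a small Python sketch comparing proxy scores with real Transactional NPS responses on the subset of calls that were actually surveyed (the data and coding scheme are invented for illustration, not taken from a real deployment):

```python
# Calls for which we hold both a CSR self-assessed proxy score and a real
# Transactional NPS response, coded +1 (promoter), 0 (passive), -1 (detractor).
# (proxy, actual) pairs -- illustrative values only.
scored_calls = [
    (+1, +1), (+1, +1), (-1, -1), (+1, 0), (-1, -1),
    (0, 0), (+1, +1), (-1, 0), (+1, +1), (-1, -1),
]

def nps(scores: list[int]) -> float:
    """NPS in points for +1/0/-1 coded responses."""
    promoters = sum(1 for s in scores if s == 1)
    detractors = sum(1 for s in scores if s == -1)
    return 100 * (promoters - detractors) / len(scores)

proxy = [p for p, _ in scored_calls]
actual = [a for _, a in scored_calls]

# Two useful accuracy views: per-call agreement and aggregate-score gap.
agreement = sum(p == a for p, a in scored_calls) / len(scored_calls)
print(f"per-call agreement: {agreement:.0%}")
print(f"proxy NPS {nps(proxy):+.0f} vs actual NPS {nps(actual):+.0f}")
```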
Using a well-validated proxy NPS score in this way will allow you to operate effectively where teams receive only 30 to 50 Transactional NPS survey responses per week. You will be able to trust shifts in measured proxy NPS of 3 or 4 points, where Transactional NPS shifts of 30 points would be unreliable. This approach will keep teams much more closely aligned with your change initiative, and will cut out wasteful conversations based on invalid conclusions drawn from too-sparse data.
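The arithmetic behind that claim can be sketched with the same marble model. Treating each call as +1 or −1, the standard error of a weekly NPS shrinks with the square root of the sample size, so scoring all 5000 calls instead of 50 surveyed ones tightens the score by a factor of ten (a back-of-envelope sketch using the 56% good-call rate from the example weeks):

```python
import math

def nps_standard_error(p_good: float, n: int) -> float:
    """Standard error in NPS points when each call scores +1 or -1."""
    mean = 2 * p_good - 1            # expected NPS as a fraction
    variance = 1 - mean ** 2         # variance of a +1/-1 outcome
    return 100 * math.sqrt(variance / n)

for n in (50, 5000):
    se = nps_standard_error(0.56, n)
    # A week-on-week shift compares two noisy scores, so its standard
    # error is sqrt(2) times that of a single week's score.
    print(f"n = {n:4d}: score SE ~{se:.1f} pts, shift SE ~{se * math.sqrt(2):.1f} pts")
```

On these assumptions a 50-response week has a shift standard error of roughly 20 points, so a 30-point swing is only about one and a half standard errors, while with all 5000 calls proxy-scored the shift standard error is about 2 points, and a 3 or 4 point move carries at least as much evidential weight as a 30-point swing in the sparse survey data.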