Thursday, October 22, 2009

Data: Screening, Diagnosis and Treatment

Screening Phase

Examine data for five different kinds of possible errors:
  1. Lack of data – Do some questions have far fewer answers than surrounding questions?
  2. Excess of data – Are there duplicate responses?
  3. Outliers/inconsistencies – Are there values that are so far beyond the typical that they seem potentially erroneous?
  4. Strange patterns – Are there patterns that imply cheating rather than honest answers? For instance, does a respondent alternate between ratings of 4 and 5 from one topic to the next in a matrix question?
  5. Suspect analysis results – Do the answers to some questions seem counterintuitive or extremely unlikely?
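Several of these screening checks can be automated. The following is a minimal sketch, not a definitive implementation: the `responses` data, the field names, and the outlier rule (flagging values more than 5 median absolute deviations from the median) are all assumptions made for illustration. It covers checks 1 through 4; check 5 still requires reading the analysis results with a critical eye.

```python
from statistics import median

# Hypothetical responses: one dict per respondent; None marks an unanswered question.
responses = [
    {"id": 1, "q1": 4, "q2": 5, "q3": 4, "q4": 5, "hours": 40},
    {"id": 2, "q1": 3, "q2": 3, "q3": None, "q4": 2, "hours": 38},
    {"id": 3, "q1": 5, "q2": 2, "q3": None, "q4": 4, "hours": 100},
    {"id": 4, "q1": 3, "q2": 3, "q3": None, "q4": 2, "hours": 38},
    {"id": 5, "q1": 2, "q2": 4, "q3": 3, "q4": 3, "hours": 45},
]
matrix = ["q1", "q2", "q3", "q4"]  # topics of one matrix question


def answer_counts(rows, questions):
    """1. Lack of data: number of answers received per question."""
    return {q: sum(r[q] is not None for r in rows) for q in questions}


def duplicate_ids(rows):
    """2. Excess of data: IDs of responses that exactly repeat an earlier one."""
    seen, dupes = set(), []
    for r in rows:
        key = tuple((k, v) for k, v in sorted(r.items()) if k != "id")
        if key in seen:
            dupes.append(r["id"])
        seen.add(key)
    return dupes


def mad_outliers(rows, field, k=5):
    """3. Outliers: IDs whose value lies more than k MADs from the median.

    The MAD rule and k=5 are illustrative choices, not a standard
    prescribed by the text.
    """
    values = [r[field] for r in rows if r[field] is not None]
    mid = median(values)
    mad = median(abs(v - mid) for v in values)
    if mad == 0:
        return []  # no spread to measure against; flag nothing
    return [r["id"] for r in rows
            if r[field] is not None and abs(r[field] - mid) > k * mad]


def alternates(answers):
    """4. Strange patterns: True if the answers strictly alternate a, b, a, b, ..."""
    if len(answers) < 4 or None in answers:
        return False
    a, b = answers[0], answers[1]
    return a != b and all(v == (a if i % 2 == 0 else b)
                          for i, v in enumerate(answers))


alternating_ids = [r["id"] for r in responses
                   if alternates([r[q] for q in matrix])]
```

Each function only flags data for review. Whether a flagged value is actually wrong is a separate question.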

Diagnosis Phase

The Screening Phase has highlighted data that needs investigation. To clarify suspect data, you often must review all of a respondent's answers to determine whether the data makes sense in context. Sometimes you must review a cross-section of different respondents' answers to identify issues such as a skip pattern that was specified incorrectly.

With this research complete, what is the true nature of the data that you've highlighted? The authors give five possible diagnoses:
  1. Missing data – Answers omitted by the respondent or questions skipped over 
  2. Errors – Typos or answers that indicate the question was misunderstood
  3. True extreme – An answer that seems high but can be justified by other answers (e.g., the respondent working 100 hours a week because they work a full-time job and two part-time jobs)
  4. True normal – A valid answer 
  5. No diagnosis, still suspect – The jury is still out on this "idiopathic" data. When it comes time for the Treatment Phase, you may need to make a judgment call on how to treat this data.
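To make the five diagnoses concrete, here is a rough sketch based on the weekly-hours example above. The function `diagnose_hours`, the per-job hours used as context, and the thresholds (`typical_max`, `tolerance`) are hypothetical; real diagnosis usually depends on human review rather than fixed rules.

```python
from enum import Enum


class Diagnosis(Enum):
    MISSING = "missing data"
    ERROR = "error"
    TRUE_EXTREME = "true extreme"
    TRUE_NORMAL = "true normal"
    SUSPECT = "no diagnosis, still suspect"


def diagnose_hours(total, job_hours, typical_max=60, tolerance=5):
    """Diagnose a weekly-hours answer using the respondent's other answers.

    total       -- reported hours worked per week (None if unanswered)
    job_hours   -- hours reported for each individual job (assumed available)
    typical_max -- assumed cutoff above which an answer was flagged in screening
    tolerance   -- assumed allowable gap between total and the sum of job hours
    """
    if total is None:
        return Diagnosis.MISSING
    if total > 168:  # a week only has 168 hours, so this must be an error
        return Diagnosis.ERROR
    if total <= typical_max:
        return Diagnosis.TRUE_NORMAL
    if job_hours and abs(sum(job_hours) - total) <= tolerance:
        return Diagnosis.TRUE_EXTREME  # high, but justified by other answers
    return Diagnosis.SUSPECT
```

For example, `diagnose_hours(100, [40, 30, 30])` yields `Diagnosis.TRUE_EXTREME`, while the same total with no supporting job data stays `Diagnosis.SUSPECT`.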

Treatment Phase

You've screened the data and tried to come to a verdict on whether suspect data is guilty or innocent. You have three choices for what to do with suspect data:
  1. Leave it unchanged – The most conservative course of action is to accept this data as a valid response and make no change to it. The larger your sample size, the less that one suspect response will affect the analysis; the smaller your sample size, the more difficult the decision.
  2. Correct the data – If the respondent's original intent can be determined, then I am in favor of fixing their answer. For instance, perhaps it is clear from the respondent's explanation for their ratings that they reversed the scale in their minds; you can invert each of their answers to this question to correct the issue. Some statisticians will argue for imputation, which replaces the answers with imputed values such as the mean for that variable. The techniques for imputation can become quite elaborate, though, and are best left to professional statisticians.
  3. Delete the data – The data seems illogical and the value is so far from the norm that it will affect descriptive or inferential statistics. What to do? Delete just this response or delete the entire record? Whenever you begin to toss out data, it raises the possibility that you are "cherry picking" the data to get the answer you want.

However you choose to treat the data, make sure to document in your survey report what steps you took, how many responses were affected and for which questions.
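The corrections and the record-keeping described above might be sketched as follows. This is an illustration under stated assumptions: the 1-to-5 scale, the function names, and the log format are all hypothetical, and the mean imputation shown is only the simplest variant, not a recommendation.

```python
from statistics import mean


def reverse_scale(answers, low=1, high=5):
    """Invert ratings for a respondent who reversed the scale (1 <-> 5, 2 <-> 4)."""
    return [low + high - a for a in answers]


def impute_mean(values):
    """Replace missing values (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


treatment_log = []  # one entry per treatment step, for the survey report


def log_treatment(question, action, respondent_ids, reason):
    """Record what was done, to which question, and to how many responses."""
    treatment_log.append({
        "question": question,
        "action": action,
        "responses_affected": len(respondent_ids),
        "respondent_ids": list(respondent_ids),
        "reason": reason,
    })


# Example: respondent 17 (hypothetical) reversed the scale on question q7.
corrected = reverse_scale([1, 2, 5])
log_treatment("q7", "corrected", [17], "explanation shows scale was reversed")
```

Writing the log entry at the moment you change the data, rather than reconstructing it afterward, makes it much easier to report how many responses were affected and for which questions.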