This is a continuation of my thoughts about this subject.
Statistical analysis is extremely important for understanding cause & effect! A very strong factor in this issue is the way the human mind interprets data. Daniel Kahneman, the Nobel laureate psychologist, is a great expert on this subject, and I strongly recommend his fantastic book, Thinking, Fast and Slow. I'd like to review the book in much more detail later, but as a start I will say that it clearly shows how the mind is loaded with powerful biases, which cause us to form rapid but erroneous impressions about cause & effect, largely because a statistical treatment of information is outside the capacity of the rapid, reflexive intuition which dominates our moment-to-moment cognitions. And, of course, a lack of education about statistics and probability leaves the more rational part of our minds poorly equipped to overrule the reflexive, intuitive side. Much of Kahneman's work has to do with how the mind intrinsically attempts to make sense of statistical information, often reaching incorrect conclusions. The implication here is that we must coolly calculate probabilities in order to interpret a body of data, and resist the urge to use "intuition," especially in a research study.
I do believe that a formal statistical treatment of data is much more common now in published research. But I am now going to argue for something that seems entirely contradictory to what I've just said above! I'll proceed by way of a fictitious example:
Suppose 1000 people are sampled (the sample size being carefully chosen using a statistical calculation, so that a true effect of a meaningful size would very likely be detected, with only a small probability of a chance result being mistaken for a real effect), all of whom have a DSM diagnosis of major depressive disorder and HAM-D scores between 25 and 30. And suppose they are divided into two groups of 500, matched for gender, demographics, severity, chronicity, etc. Then suppose one group is given a treatment such as psychotherapy or a medication, and the other group is given a placebo treatment. This could continue for 3 months, then the groups could be switched, so that every person in the study would at some point receive the active treatment and at another point the placebo.
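To make the "statistical calculation" behind a sample size a little more concrete, here is a minimal sketch of one common textbook approximation. It is only an illustration, not the method any particular study uses: it assumes a simple two-arm comparison of means (not the crossover design described above), a standardized effect size, a two-sided test, and a normal approximation. The function name n_per_group and the example effect sizes are my own hypothetical choices.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate participants needed per arm for a two-arm comparison of means.

    effect_size is the standardized difference (difference in means divided
    by the standard deviation). Uses the normal approximation
    n = 2 * ((z_alpha + z_power) / effect_size) ** 2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for a two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Hypothetical effect sizes, chosen only for illustration.
for d in (0.2, 0.3, 0.5):
    print(f"effect size {d}: about {n_per_group(d)} people per group")
```

The smaller the effect one hopes to detect, the larger the group has to be, which is why the sample size has to be planned around a specific effect size.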
This is a typical design for treatment studies, and I think it is very strong. If the result of the study is positive, this is very clear evidence that the active treatment is useful.
But suppose the result of the study is negative. What could this mean? Most of us would conclude that the active treatment is therefore not useful. But I believe this is an incorrect conclusion!
Suppose, yet again, that this is a study of people complaining of severe headaches, carefully matched for severity, chronicity, etc. And suppose the treatment offered was neurosurgery or placebo. I think that the results, carefully summarized by a statistical statement, would show that neurosurgery does not exceed placebo for treatment of headache (in fact, I'll bet the neurosurgery group would do a lot worse!).
Yet in this group of 1000 people, it is possible that 1 or 2 of these headache sufferers were having headaches due to a surgically curable brain tumor or a hematoma. These 1 or 2 patients would have a high chance of being cured by a surgical procedure, whereas some other therapy effective for most other headache sufferers (e.g. a triptan for migraine, or an analgesic, or relaxation exercises, etc.) would have either no effect or a spurious benefit (relaxation might make the headache pain from a tumor temporarily better, and ironically would delay a definitive cure!).
Likewise, in a psychiatric treatment study, it may be possible that subtypes exist (perhaps based on genotype or some other factor currently not well understood) which respond very well to specific therapies, even though the majority of people in the group, who share similar symptoms, do not respond well to these same therapies. For example, some individual depressed patients may have a unique characteristic (either biological or psychological) which might make them respond to a treatment that would have no useful effect for the majority.
With the most common statistical analyses done and presented in psychiatric and other medical research studies, there would usually be no way to detect this phenomenon: negative studies would influence practitioners to abandon the treatment strategy for the whole group.
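A rough calculation shows how thoroughly a rare subgroup can be diluted in a group average. The sketch below is only an illustration under assumptions of my own: 2 true responders per 1000 people, a 20-point benefit for those responders, no benefit for anyone else, a standard deviation of 6 points, and a simple two-arm comparison of means with a normal approximation. All of these numbers and the function name power_two_arm are hypothetical.

```python
from statistics import NormalDist


def power_two_arm(mean_diff, sd, n_per_arm, alpha=0.05):
    """Approximate chance that a two-sided test on group means reaches p < alpha."""
    z = NormalDist()
    standard_error = sd * (2 / n_per_arm) ** 0.5
    shift = abs(mean_diff) / standard_error
    critical = z.inv_cdf(1 - alpha / 2)
    # Probability of landing beyond either critical value.
    return z.cdf(shift - critical) + z.cdf(-shift - critical)


# Hypothetical numbers, chosen only for illustration.
responder_fraction = 2 / 1000  # 1 or 2 people per 1000 truly respond
responder_benefit = 20.0       # points of improvement for a responder
sd = 6.0                       # spread of symptom change between people

# Averaged over the whole group, the benefit is tiny.
average_benefit = responder_fraction * responder_benefit
power = power_two_arm(average_benefit, sd, n_per_arm=500)

print(f"average benefit across the group: {average_benefit:.2f} points")
print(f"chance the study comes out 'positive': {power:.3f}")
```

Under these assumptions the chance of a positive result is barely above the 5% false-positive rate, so a study like this would almost always be reported as negative even though the treatment cures the few people it is suited for.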
How can this be remedied? I think the simplest remedy would be almost trivial: all research studies should include in the publication every single piece of data gathered! If there is a cohort of 1000 people, there should be a chart or a graph showing the symptom changes over time of every single individual. This would be a messy graph with 1000 lines on it (which is one reason it is not done, of course!) but there would be much less risk that an interesting outlier would be missed! If most of the thousand individuals had no change in symptoms, there would be a huge mass of flat lines across the middle of the chart. But if a few individuals had a total, remarkable cure of symptoms, these individuals would stand out prominently on such a chart. Ironically, in order to detect such phenomena, we would have to temporarily leave aside the statistical tools which we had intended to use, and "eyeball" the data. So intuition could still have a very important role to play in statistics & research!
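If the complete per-person data were published, even a very simple script could do a first pass of this "eyeballing" by listing the individuals whose scores changed dramatically. The sketch below uses made-up data, not results from any real study: 1000 simulated people with roughly flat scores around 27, plus two hypothetical responders (IDs 17 and 512) whose scores fall steadily to about 3. The function names and the 15-point threshold are my own assumptions.

```python
import random


def simulate_trajectories(n_people=1000, responder_ids=(17, 512), weeks=12, seed=0):
    """Build made-up weekly symptom scores for each person (illustration only)."""
    rng = random.Random(seed)
    data = {}
    for pid in range(n_people):
        # Responders drift from 27 down to 3; everyone else stays near 27.
        target = 3.0 if pid in responder_ids else 27.0
        scores = []
        for week in range(weeks + 1):
            expected = 27.0 + (target - 27.0) * week / weeks
            scores.append(expected + rng.uniform(-3, 3))  # bounded weekly noise
        data[pid] = scores
    return data


def flag_large_improvers(data, min_drop=15.0):
    """Return IDs of people whose score fell by at least min_drop from start to end."""
    return sorted(pid for pid, scores in data.items()
                  if scores[0] - scores[-1] >= min_drop)


data = simulate_trajectories()
print("people with a dramatic improvement:", flag_large_improvers(data))
```

A list like this is only a prompt for a closer look at those individuals. It is not evidence of a treatment effect on its own, since a few large changes can also occur by chance or on placebo.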
After "eyeballing" the complete set of data from every individual, I do agree that any apparent pattern would have to lead to another formal hypothesis, which would subsequently have to be tested using a different study design, intended specifically to pick up such outliers. A formal statistical calculation would then have to be used to evaluate whether the treatment is effective for this group (e.g. the tiny group of headache sufferers specifically with a mass evident on a CT brain scan could enter a neurosurgery treatment study, to clearly show whether the surgery is better than placebo for this group).
I suspect that in many psychiatric conditions, there are subtypes which are not currently known or not well characterized by DSM categorization. Genome studies should be an interesting area in future decades, to further subcategorize patients who share identical symptoms but who might respond very differently to specific treatment strategies.
In the meantime, though, I think it is important to recognize that a negative study, even if done with very good study design and statistical analysis, does not prove that the treatment in question is ineffective for EVERYONE with a particular symptom cluster. There might be individuals who would respond well to such a treatment. We could assess this possibility better if the COMPLETE set of results for each individual patient were published with all research studies.
Another complaint I have about the statistics & research culture has to do with the term "significant." I believe that "significance" is a construct that contradicts the whole point of doing a careful statistical analysis, because it requires a pronouncement of some particular probability range being called "significant" and others "insignificant." Often, a p-value less than 0.05 is considered "significant". The trouble with this is that the p-value speaks for itself; it does not require a human interpretive construct or threshold to call something "significant" or not. I believe that studies should simply report the p-value, and not call the results "significant" or not. This way, 2 studies which yield p-values of 0.04 and 0.07 could be seen to show much more similar results than if you called the first study "significant" and the second "insignificant." There may be some instances in which a p-value less than 0.25 could still usefully guide a long-shot trial of therapy. In such a case the exact p-value would be very useful to know, rather than simply reading that this was a "very insignificant" result. Similarly, other types of treatments might demand that the p-value be less than 0.0001 in order to safely guide a decision. Having a research culture in which p < 0.05 means "significant" dilutes the power and meaning of the analysis, in my opinion, and arbitrarily introduces a type of cultural judgment which is out of place for careful scientists.
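One way to see how close p-values of 0.04 and 0.07 really are is to convert each back into the size of the underlying test statistic. The sketch below is a rough illustration that assumes a two-sided test with a normally distributed test statistic, which may not match the test used in any given study. The function name z_from_p is my own.

```python
from statistics import NormalDist


def z_from_p(p_value):
    """Absolute z-score that would give this two-sided p-value under a normal test."""
    return NormalDist().inv_cdf(1 - p_value / 2)


for p in (0.04, 0.05, 0.07):
    print(f"p = {p:.2f}  ->  |z| = {z_from_p(p):.2f}")
```

Under that assumption the two studies differ by only about a quarter of a standard error, yet the usual threshold at p = 0.05 places them on opposite sides of the "significant" label.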