This blog contains short commentaries with personal views on statistical issues that often arise in research reports and grant applications. One reason why methodological problems are common is that, despite the central role of statistical methodology in medical research, many researchers lack adequate training in statistics. As a result, there are many misconceptions about both statistical terminology and statistical inference. Methodological superficiality, confusion and reliance on rules of thumb are common in research reports submitted for publication in medical journals. Therefore, many editors try to improve the situation with dedicated methodological reviews by professional statisticians. The purpose of this blog is to contribute to the understanding of these problems by describing common phenomena.
The blog can be followed here and on Telegram: https://t.me/jonasranstamschannel
### Menu
- [[#Statistical terminology]]
- [[#The illusion of knowing]]
- [[#Statistical significance]]
- [[#Undue emphasis on p-values]]
- [[#Statistical modeling]]
- [[#Meta-analyses]]
- [[#Normality assessments]]
- [[#Proof, evidence, faith, and dogma]]
- [[#P-values and bias]]
- [[#Selection bias]]
- [[#Placebo effects, regression to the mean, and responders]]
- [[#Didn't like the review comments?]]
- [[#Reviewer or author, different roles]]
# Reviewer or author, different roles
Authors and reviewers of scientific manuscripts face the same methodological problems when writing or reviewing, but their roles in the publishing drama are different. Authors want to get their manuscripts published, and reviewers want to promote research based on sound scientific evidence and demote spin and flawed reports. Medical research manuscripts are often written by medical doctors, whereas statistical reviewers are usually professional statisticians, and their views on what makes a good manuscript are not always the same.
From the statistician's point of view, many medical authors are amateurs who misuse statistical terminology and misunderstand statistical inference. Some authors are aware of this, and try to demonstrate their statistical expertise by being overly technical. Others seem to think that statistical evidence is unnecessary in research.
In any case, a good reviewer will provide the authors with a rationale and explanation for critical comments, often including suggestions for a sound methodological approach. However, it is important to remember that the roles of the reviewer and the author are different. Too much advice and too specific suggestions can turn the reviewer into an anonymous co-author, and this is not a good thing.
[[#Menu]]
# Didn't like the review comments?
There are at least two ways to look at scientific reports. You choose how you want to see your work.
If you see your report as a subjective assessment of a particular research problem, and you believe that your report represents the perfect solution to that problem, you are unlikely to appreciate the statistical reviewer's comments on methodological errors, terminological confusions, necessary clarifications, and specification of the limitations and uncertainties of your findings.
If, on the other hand, you see your report as your contribution to the collective knowledge of objective evidence that is important to consider when trying to solve a particular research problem, you may find that the statistical reviewer is your best friend and the review comments something that helps you to improve your report considerably.
In any case, you don't have to agree with the reviewer. You might even be able to change the reviewer's mind. A polite approach and rational arguments usually go a long way.
[[#Menu]]
# Placebo effects, regression to the mean, and responders
One form of selection bias that is linked with information bias is called regression-to-the-mean (RTM). The phenomenon occurs in groups of people that are selected because they are abnormal, and it shows up as a tendency to return to normality. For example, if people with abnormally high blood pressure are selected for follow-up, these patients' average blood pressure will tend to appear lower the next time the blood pressure is measured. The apparent change in mean blood pressure is a statistical phenomenon caused by random variation in combination with selection based on abnormality, but it is often mistaken for a magic response to some inert exposure, such as placebo pills or superstitious ceremonies.
The RTM phenomenon explains the importance of designing trials with control groups that have the same inclusion criteria as the index groups, and why results from within-group comparisons are misleading as estimates of treatment effects. It also explains why responder analyses based on within-group measurements of continuous variables are potentially misleading. RTM alone can cause a subgroup of apparent "responders" to be identified even when no treatment effect exists.
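A minimal simulation sketch (in Python, with made-up numbers) illustrates how selection on abnormal baseline values alone produces an apparent improvement and a subgroup of apparent "responders", even though nothing has been done to the patients:

```python
import numpy as np

# Sketch: regression to the mean with no treatment effect at all.
# All numbers are hypothetical and chosen only for illustration.
rng = np.random.default_rng(1)

n = 100_000
true_bp = rng.normal(140, 15, n)            # each person's underlying mean blood pressure
baseline = true_bp + rng.normal(0, 10, n)   # first measurement (with random variation)
follow_up = true_bp + rng.normal(0, 10, n)  # second measurement, identical conditions

selected = baseline > 160                   # "abnormal" patients selected for follow-up
change = follow_up[selected] - baseline[selected]

print(f"mean change in selected group: {change.mean():.1f} mmHg")            # clearly negative
print(f"apparent 'responders' (fall > 10 mmHg): {(change < -10).mean():.0%}")
```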
Similar problems may affect the interpretation of change in patient-reported outcomes with floor and ceiling effects, such as the EQ-5D.
Placebo effects have been discussed since 1955 (Beecher HK. The powerful placebo. JAMA 1955;159:1602-6). Several intriguing psychological explanations have been published, and suggestions on how to use the placebo effect for the benefit of patients are commonly discussed. However, no credible empirical evidence supporting the existence of real and robust placebo effects, justifying the use of placebos outside the context of clinical trials, has yet been presented.
[[#Menu]]
# Selection bias
Selection bias is a systematic error that occurs when the sample selected for a study does not accurately represent the population. This problem can occur in many ways, often referred to by different names.
For example, when recruiting subjects for a follow-up study of health hazards in certain workplaces, comparisons of illnesses and deaths are usually made with the whole population, adjusted for age and sex. However, some subjects may be excluded from the workforce because of diseases or accidents but remain included in the reference population. If the excluded individuals differ in disease risk or death risk, the comparison will be biased. The phenomenon is known as the "healthy worker effect".
A similar "healthy screenee effect" has been discussed as a consequence of healthier individuals being more likely to participate in screening projects, which may give a false impression of the project's effectiveness.
In randomised trials, selection bias is usually prevented by treatment allocation that is concealed from both the clinician and the patient, for example by using opaque envelopes, until the patient is enrolled in the trial. Predictable treatment allocation may discourage some patients with a preconceived opinion about the treatment from participating in the trial. Treatment allocation is often masked to the patient or physician or both (double-blinded) to avoid selective reporting of true or presumed treatment effects.
Selection bias can also occur in the statistical analysis of a well-designed trial. Patients who are selected by surviving a waiting list before the treatment starts (e.g. in a pacemaker trial), or by surviving the time until the treatment is complete (e.g. in a radiation sensitiser lung cancer trial), are likely to yield biased comparisons with the unselected patients randomised to a control group. This phenomenon is also known as "immortal time bias".
[[#Menu]]
# P-values and bias
The accuracy of measurements and statistical estimates has two components, precision and validity. For example, a pistol shooter trying to hit the bull's-eye may shake his hand when firing (random error, i.e. poor precision) or have the rear sight poorly set (systematic error, i.e. poor validity).
P-values, confidence intervals and statistical significance are the precision measures used in statistical inference. Systematic errors (validity problems) are known as bias. The statistical precision of a hypothesis test or effect size estimate is related to the sample size.
The validity of p-values or confidence intervals depends on how the data have been collected or measured. For example, observational screening studies are generally considered to be more susceptible to bias than randomised trials because screening participants in observational studies are not randomly assigned to screening but are self-selected, leading to healthy screenee bias if subjects with symptoms of the condition being screened decline to participate.
An investigator can avoid foreseeable bias by taking it into account in the design of a trial, for example by randomising treatment allocation and masking allocated treatments. However, this is not possible in observational studies.
There are many forms of bias, but they can be grouped into three main categories: selection bias, information bias and confounding bias. The first category relates to the selection of study subjects, the second to the collection of information from the subjects, such as recall bias, and the third to analysis problems, such as confounding by association, confounding by indication, effect modification and adjustment bias.
The inferential uncertainty of a research finding, as shown by p-values or confidence intervals, does not include the uncertainty about the consequences of bias.
[[#Menu]]
# Proof, evidence, faith, and dogma
The difference between facts and beliefs is that facts are based on objective evidence and can be verified or disproved through experimentation, observation and logical reasoning. Proof is evidence that is considered so conclusive that it establishes a fact beyond reasonable doubt.
Faith and dogma are based not on evidence but on beliefs, which may or may not reflect a commitment to certain moral, religious or political principles. However, whereas faith is personal and open to subjective interpretation, dogma is institutionalised and rigid, considered authoritative and not to be questioned.
Dogma can be central to the doctor's treatment of individual patients (Bellomo R. The dangers of dogma in medicine. Medical Journal of Australia 2011 Oct;195(7):372-373), but it is not useful in developing evidence to advance human knowledge. On the contrary, scientific progress typically involves identifying weaknesses in established beliefs and challenging them. Science is never settled.
The focus of empirical research is therefore on both developing and questioning evidence. Statistical inference plays a key role in these endeavours.
[[#Menu]]
# Normality assessments
The statistical section of research reports often includes information on how the distributional properties of the data have been assessed. Summaries of non-normally distributed variables are then presented using median and min-max instead of mean and standard deviation, and hypothesis testing is done using distribution-free tests instead of asymptotic tests. Several tests have been developed to test hypotheses about the distributional properties of a variable, such as the Kolmogorov-Smirnov test and the Shapiro-Wilk test. However, it is unclear what benefit the authors expect.
The hypothesis tested in these distributional tests is not about the distribution of the observed data, but about the distribution of the variable in the population from which the sample was drawn. It needs to be explained why the distribution of a variable in a fictitious population is relevant for how observed data are described.
Some investigators may claim that many tests, such as Student's t-test and ANOVA, are based on an assumption of a normal distribution, which is true but only relevant for small samples. The Central Limit Theorem implies that for large sample sizes, say 30 or more, mean values can be reliably tested regardless of the distribution of the tested variables.
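A small simulation sketch (hypothetical data, Python) illustrates the point: even with clearly skewed, exponential data and about 30 observations per group, the t-test's type I error stays close to its nominal level when the null hypothesis is true:

```python
import numpy as np
from scipy import stats

# Sketch: type I error of Student's t-test on clearly non-normal (exponential) data.
# The null hypothesis is true (both groups come from the same distribution).
rng = np.random.default_rng(2)
n_per_group, n_sim, alpha = 30, 20_000, 0.05

rejections = 0
for _ in range(n_sim):
    x = rng.exponential(scale=1.0, size=n_per_group)
    y = rng.exponential(scale=1.0, size=n_per_group)
    if stats.ttest_ind(x, y).pvalue < alpha:
        rejections += 1

print(f"empirical type I error: {rejections / n_sim:.3f}")  # close to the nominal 0.05
```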
Distribution-free (a.k.a. exact or non-parametric) tests may be useful for performing hypothesis tests with small samples, but while these tests can yield statistically significant results, they are not useful for estimating effect size, which is necessary when evaluating the clinical significance of an effect or difference.
Furthermore, simulation studies show that relying on the testing of assumptions before applying a Student's t-test can even be counterproductive, as these distribution tests often lead to wrong decisions. The use of Satterthwaite's or Welch's t-tests should be considered instead (see Rasch et al. The two-sample t test: pre-testing its assumptions does not pay off. Stat Papers 2011 Feb 1;52(1):219-231).
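In practice, applying Welch's t-test directly, without any pre-testing, is straightforward; a minimal sketch with made-up data:

```python
import numpy as np
from scipy import stats

# Sketch: Welch's t-test applied directly, with made-up data.
treatment = np.array([5.1, 6.3, 4.8, 7.0, 5.9, 6.4, 5.5, 6.1])
control   = np.array([4.2, 3.9, 5.0, 4.6, 4.4, 4.8, 4.1, 4.9])

# equal_var=False gives Welch's test; no pre-test of normality or equal variances is needed.
result = stats.ttest_ind(treatment, control, equal_var=False)
print(f"Welch t = {result.statistic:.2f}, p = {result.pvalue:.4f}")
```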
[[#Menu]]
# Meta-analyses
This note addresses some important aspects of meta-analysis that are often overlooked.
Observational studies differ from randomised trials in that an observational study cannot be designed to prevent validity problems by randomisation, concealed allocation, and masking. The statistical analysis therefore needs to be based on special considerations regarding internal validity and include adjustments to reduce bias. How well these issues have been addressed needs to be considered in detail and taken into account when conducting a meta-analysis (see also Faber et al. Meta-analyses including non-randomised studies of therapeutic interventions: a methodological review. BMC Medical Research Methodology 2016:35).
The same goes for multiplicity issues in meta-analyses of confirmatory randomised trials (see Bender et al. Attention should be given to multiplicity issues in systematic reviews. Journal of Clinical Epidemiology 2008;61(9):857-865).
Unlike fixed-effect models, which estimate a common effect, random-effects models estimate an average effect. The variability of the effects represented by their average may have consequences for the clinical interpretation of the findings. It can therefore be recommended to include a prediction interval in forest plots to describe this variability (see also IntHout et al. Plea for routinely presenting prediction intervals in meta-analysis. BMJ Open 2016;6:e010247).
The choice between fixed-effect and random-effects models is often based on I², i.e. the percentage of variability due to heterogeneity across studies rather than sampling error. However, I² depends on sample size, which can be misleading. A clinically relevant definition of the degree of between-study variability, measured by τ², would be more appropriate for this purpose (see Rücker et al. Undue reliance on I2 in assessing heterogeneity may mislead. BMC Med Res Methodol 2008 Nov 27;8:79).
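As a sketch of these quantities, the following Python code computes the DerSimonian-Laird between-study variance τ², I², the pooled random-effects estimate, and a 95% prediction interval for a handful of made-up study results (the effect sizes and standard errors are purely illustrative):

```python
import numpy as np
from scipy import stats

# Sketch: DerSimonian-Laird random-effects meta-analysis with a 95% prediction interval.
# The effect sizes and standard errors below are made up for illustration only.
yi = np.array([0.30, 0.10, 0.45, 0.25, -0.05, 0.35])   # study effect estimates (e.g. log odds ratios)
sei = np.array([0.12, 0.15, 0.20, 0.10, 0.18, 0.14])   # their standard errors
k = len(yi)

# Fixed-effect weights and Cochran's Q
w = 1 / sei**2
q = np.sum(w * (yi - np.sum(w * yi) / np.sum(w))**2)

# DerSimonian-Laird between-study variance tau^2 and I^2
c = np.sum(w) - np.sum(w**2) / np.sum(w)
tau2 = max(0.0, (q - (k - 1)) / c)
i2 = max(0.0, (q - (k - 1)) / q)

# Random-effects pooled estimate and its standard error
w_star = 1 / (sei**2 + tau2)
mu = np.sum(w_star * yi) / np.sum(w_star)
se_mu = np.sqrt(1 / np.sum(w_star))

# 95% prediction interval for the effect in a new, comparable study
t_crit = stats.t.ppf(0.975, df=k - 2)
pi = (mu - t_crit * np.sqrt(tau2 + se_mu**2), mu + t_crit * np.sqrt(tau2 + se_mu**2))

print(f"tau^2 = {tau2:.3f}, I^2 = {i2:.0%}, pooled effect = {mu:.2f} (SE {se_mu:.2f})")
print(f"95% prediction interval: {pi[0]:.2f} to {pi[1]:.2f}")
```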
[[#Menu]]
# Statistical modeling
Multiple regression analysis is often used to fit statistical models in analyses involving several variables. The use of such models is often problematic, both terminologically (as discussed under [[#Statistical terminology]]) and in terms of the purpose of the analysis. The British statistician George Box coined the phrase, "All models are wrong, but some are useful". In clinical and epidemiological research, three main uses are common.
First, in observational studies, multiple variables are included in a statistical model to adjust effect size estimates for confounding bias. This is an explanatory analysis, which requires assumptions about cause-effect relationships between the variables included in the analysis to produce valid estimates. Which variables to include in the analysis depends on what is known or suspected about the disease being studied, and developing the statistical model can be methodologically complicated (see Shrier I, Platt RW. Reducing bias with directed acyclic graphs. BMC Medical Research Methodology 2008 Oct 30;8:70). An alternative method is to develop a propensity score that predicts treatment allocation and stratify on this instead of on individual variables. This alternative also requires careful variable selection (see, for example, Sjölander A. Propensity Scores and M-Structures. Statistics in Medicine 2009;28(9):1416-20). The problem to be avoided is residual confounding.
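A minimal simulation sketch (made-up numbers, a single confounder, and a true treatment effect of zero) illustrates the difference between a crude comparison and a regression-adjusted one:

```python
import numpy as np

# Sketch: confounding and adjustment in a simulated observational study.
# All effects and numbers are hypothetical; the true treatment effect is set to zero.
rng = np.random.default_rng(3)
n = 50_000

age = rng.normal(60, 10, n)                          # confounder
p_treat = 1 / (1 + np.exp(-(age - 60) / 5))          # older patients are treated more often
treated = rng.random(n) < p_treat
outcome = 0.0 * treated + 0.05 * age + rng.normal(0, 1, n)   # outcome depends on age only

# Crude (unadjusted) comparison: biased by confounding
crude = outcome[treated].mean() - outcome[~treated].mean()

# Adjusted comparison: ordinary least squares with age in the model
x = np.column_stack([np.ones(n), treated.astype(float), age])
beta, *_ = np.linalg.lstsq(x, outcome, rcond=None)

print(f"crude difference:    {crude:.3f}")       # clearly non-zero
print(f"adjusted difference: {beta[1]:.3f}")     # close to the true value, 0
```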
Second, statistical models are also used to analyse randomised trials, but not to adjust for confounding, as this is dealt with in the study design. Instead, the model is used to adjust for stratification factors used in the randomisation, to analyse centre-specific effects in multicentre trials, and to estimate change from baseline in continuous variables. The study design and the trial protocol define the variables to be included in these statistical models. The problem to avoid is unnecessarily low precision, i.e. p-values that are too high and confidence intervals that are too wide.
Third, if the focus is not on parameter estimates but on prediction, for example in developing a prognostic score, data-driven modelling (e.g. forward or backward stepwise regression or lasso regression) can be used. In this case, the goal is not valid and precise effect size estimates but optimal predictive accuracy, for example in terms of sensitivity and specificity. The analysis problem to avoid is overfitting, i.e. adaptation to random variation, which yields high predictive accuracy in the dataset used to develop the model but low accuracy in other datasets.
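A small sketch of the overfitting problem, using a crude data-driven selection of pure-noise predictors (the selection rule and all numbers are hypothetical):

```python
import numpy as np

# Sketch: overfitting from data-driven predictor selection when no real signal exists.
# 50 candidate predictors of pure noise; outcome is an unrelated coin flip.
rng = np.random.default_rng(4)
n, p = 100, 50

x_dev = rng.normal(size=(n, p))          # development dataset
y_dev = rng.integers(0, 2, n)            # outcome unrelated to the predictors
x_new = rng.normal(size=(n, p))          # independent validation dataset
y_new = rng.integers(0, 2, n)

# "Data-driven" selection: keep the 5 predictors most correlated with the outcome
corrs = np.abs([np.corrcoef(x_dev[:, j], y_dev)[0, 1] for j in range(p)])
keep = np.argsort(corrs)[-5:]

# Fit a linear score on the selected predictors (least squares on the 0/1 outcome)
xk = np.column_stack([np.ones(n), x_dev[:, keep]])
beta, *_ = np.linalg.lstsq(xk, y_dev, rcond=None)

def accuracy(x, y):
    score = np.column_stack([np.ones(len(y)), x[:, keep]]) @ beta
    return np.mean((score > 0.5) == y)

print(f"accuracy in development data: {accuracy(x_dev, y_dev):.2f}")  # optimistic
print(f"accuracy in new data:         {accuracy(x_new, y_new):.2f}")  # near 0.50 (chance)
```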
Many publications confuse the purpose of modelling and the presentation of results. The most common problem is probably the combination of data-driven model development and presentation of effect size estimates. See Ramspek CL, Steyerberg EW, Riley RD, Rosendaal FR, Dekkers OM, Dekker FW, et al. Prediction or causality? A scoping review of their conflation in current observational research. Eur J Epidemiol. 2021 Sep;36(9):889-98.
[[#Menu]]
# Undue emphasis on p-values
As indicated in another post, statistical significance and p-values are often misunderstood, and this is not a new problem. Frank Yates, one of the pioneers of 20th century statistics, stated as early as 1951 that the most common weakness is the failure to recognise that estimates of treatment effects, together with estimates of the errors to which they are subject, are the quantities of primary interest in clinical research, not p-values (Yates F. The influence of Statistical Methods for Research Workers on the development of the science of statistics. J Am Stat Assoc 1951;46:19-34).
What Yates was referring to is that a standard p-value can be used to test whether a treatment effect exists, regardless of the size of the effect, but this is often done, especially in observational studies, without knowing the risk of a false positive or false negative test result. Furthermore, p-values depend on the sample size. In a large sample, such as in a registry study, negligible effects may be statistically significant, and in a small sample, even major effects may be statistically nonsignificant. In addition, because of sampling variability, the effect size observed in a sample does not tell us much about the true size of a statistically significant or nonsignificant effect. Finally, describing a statistically nonsignificant test result as "no effect", which is an annoying habit in medical research reports, completes the mistake.
A better approach is to try to distinguish between clinically relevant effects and negligible or nonexistent effects using a confidence interval. Confidence intervals describe the inferential uncertainty of an estimate and are presented with lower and upper limits. A clinically important effect is indicated by a confidence interval that lies entirely above the minimum clinically important difference, regardless of whether or not the effect is statistically significant. Similar approaches are used in the evaluation of equivalence trials and non-inferiority trials.
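A simulation sketch (with made-up numbers and a hypothetical minimum clinically important difference of 5 units) illustrates how a negligible effect can be statistically significant in a large sample while the confidence interval shows that it is clinically unimportant:

```python
import numpy as np
from scipy import stats

# Sketch: a negligible effect in a very large sample, with a hypothetical minimum
# clinically important difference (MCID) of 5 units. All numbers are made up.
rng = np.random.default_rng(5)
n = 50_000
mcid = 5.0

treatment = rng.normal(100.5, 15, n)   # true difference of 0.5 units: negligible
control = rng.normal(100.0, 15, n)

res = stats.ttest_ind(treatment, control)
diff = treatment.mean() - control.mean()
se = np.sqrt(treatment.var(ddof=1) / n + control.var(ddof=1) / n)
ci = (diff - 1.96 * se, diff + 1.96 * se)

print(f"p = {res.pvalue:.2g}")                                  # typically < 0.05: "significant"
print(f"difference = {diff:.2f}, 95% CI {ci[0]:.2f} to {ci[1]:.2f}")
print(f"entire CI below the MCID of {mcid}: {ci[1] < mcid}")    # the effect is negligible
```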
[[#Menu]]
# Statistical significance
A common but flawed view of medical research is that all you need is a data set and the ability to run statistical tests. In the past, when tests had to be calculated by hand or on a mainframe, it took statisticians to do the tests. Today, with personal computers and easy-to-use software, anyone can calculate p-values.
A p-value is the probability of drawing a sample with a characteristic that is at least as extreme as a certain value (for example, an apparent effect observed in a particular sample of patients) from a (fictitious) population with a property defined by the hypothesis being tested (in this example, that no such effect exists). The smaller the p-value, the less likely the sample has been drawn from such a population.
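This definition can be mirrored directly in a simulation sketch (made-up data): the p-value is approximated by the proportion of samples, drawn from a population in which the tested hypothesis is true, whose mean is at least as extreme as the observed one:

```python
import numpy as np
from scipy import stats

# Sketch: a p-value as the probability of a sample at least as extreme as the one
# observed, under a population in which the tested hypothesis (no effect) is true.
rng = np.random.default_rng(6)
n = 25
observed = rng.normal(0.6, 1.0, n)          # hypothetical observed sample
observed_mean = observed.mean()

# Draw many samples from a null population (mean 0, same spread) and count how
# often the sample mean is at least as extreme (in either direction) as observed.
null_means = rng.normal(0.0, 1.0, (100_000, n)).mean(axis=1)
p_simulated = np.mean(np.abs(null_means) >= abs(observed_mean))

print(f"simulated p-value: {p_simulated:.4f}")
print(f"one-sample t-test p-value: {stats.ttest_1samp(observed, 0.0).pvalue:.4f}")
```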
A p-value of less than 0.05 is usually described as statistically significant and interpreted as evidence against the tested hypothesis, i.e. that the finding is unlikely to be explained by sampling variation and that rejection of the hypothesis can be considered to have empirical support.
However, there is nothing to say that a statistically significant finding is clinically relevant, and inventing biologically interpretable explanations for statistically significant findings post hoc, after having screened a dataset for low p-values (also known as data dredging), is not meaningful and is misleading if the tests are presented as pre-specified. Unfortunately, many journals will publish this type of research as long as the APC (article processing charge) is paid.
There is more to a good study than p-values and statistical significance. Developing a study design, analysis plan, and data collection procedure to investigate a particular phenomenon is more challenging than most authors and reviewers realize. Just searching for p-values is a poor substitute for scientific reasoning.
[[#Menu]]
# The illusion of knowing
"How do we know that cigarettes cause lung cancer?" the professor asked, going on to say, "It has never been tested in a clinical trial. The implication was that a clinical trial was needed to know for sure, and that observational studies could at best provide suggestions for further research. There was no discussion of the ethical and practical problems of conducting clinical trials to assess harmful effects on participants.
Clinical trials, unlike observational studies, can be designed to reduce the uncertainty of research results by eliminating selection bias and confounding through randomisation, allocation concealment and treatment masking, but they cannot provide completely certain results. The uncertainty introduced by sampling from a population is related to sample size and is impossible to eliminate when the population being studied is infinite (i.e. today's medical research is usually done for tomorrow's patients). However, a well-designed clinical trial can, at least in principle, provide information that suffers only from aleatory uncertainty (i.e. the inherent randomness of sampling). Systematic reviews, which combine the results of several independent clinical trials, are generally considered to provide the most reliable evidence.
The same is not true for the results of observational studies. These rely on statistical adjustments made by the investigators according to their assumptions, and whether or not these assumptions are met is usually unknown. Another problem is that the data needed to make the adjustments are not always available, and for practical reasons simplifications in the calculations are usually necessary. Consequently, the results suffer from both aleatory and epistemic uncertainty (i.e. lack of knowledge about something that could, in principle, be known).
All empirical research results, whether experimental or observational, are uncertain to some degree. Yet many of them form the basis of what we take to be known. The willingness of society or groups to accept some uncertain evidence as indicating truth and other uncertain evidence as reflecting error is not easy to explain. It doesn't seem to be directly related to the degree of uncertainty itself, and as evidence accumulates over time, opinions can change. However, economic, social, political or other factors are likely to be important in determining what is considered to be true. A critical interpretation of proclaimed "truths" and a reminder of the American statistician Carroll D. Wright's 1889 statement that "figures do not lie, but liars figure" are crucial to avoid being deceived.
Phrases such as "we have demonstrated" and "we have shown" are often used in research reports, too often without empirical support. The primary purpose of scientific journals is to document the studies that have been done, their results, and the uncertainty of those results. My advice to authors is to accept uncertainty and focus on objective evidence rather than trying to convince the reader with subjective ideas.
[[#Menu]]
# Statistical terminology
Michael Healy, Professor of Medical Statistics at the London School of Hygiene and Tropical Medicine, described clinical research in a Fisher Memorial Lecture in 1995 as "a largely amateur pursuit carried out by doctors". Whether or not this statement is true today, misuse of well-defined statistical terms is a sure way of appearing amateurish. The problem can easily be avoided by checking the definitions of the terms used; I recommend consulting The Oxford Dictionary of Statistical Terms (ISI; New York: Oxford University Press, 2003). Here are a few examples of the most common errors.
Tertiles, quartiles and quintiles are quantiles that divide sorted data into equal parts. Two tertiles are used to divide the data into three parts, three quartiles into four parts and four quintiles into five parts. However, medical publications are full of results presented with three tertiles, four quartiles and five quintiles.
The range and interquartile range describe the difference (a single value) between the largest and smallest values of a variable and between the third and first quartiles respectively, not the largest and smallest values or the first and third quartiles themselves (two values each).
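A small numerical illustration with made-up data shows the number of cut points involved, and that the range and interquartile range are single values:

```python
import numpy as np

# Small illustration with made-up data: the number of cut points, and the range and
# interquartile range as single values.
x = np.array([3, 7, 8, 5, 12, 14, 21, 13, 18, 2, 9, 11])

tertiles  = np.quantile(x, [1/3, 2/3])            # two values divide the data into three parts
quartiles = np.quantile(x, [0.25, 0.5, 0.75])     # three values divide the data into four parts
quintiles = np.quantile(x, [0.2, 0.4, 0.6, 0.8])  # four values divide the data into five parts

data_range = x.max() - x.min()                      # a single value
iqr = np.quantile(x, 0.75) - np.quantile(x, 0.25)   # a single value

print(f"two tertiles: {tertiles}, three quartiles: {quartiles}")
print(f"range = {data_range}, interquartile range = {iqr}")
```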
A non-parametric hypothesis can be tested using a distribution-free test (often referred to as a non-parametric test), but it is nonsense to describe the data as being non-parametric.
Multiple regression analysis can be used to fit a multivariable model with one or more explanatory variables, but a multivariate model is based on the assumption of a multivariate probability distribution, which implies a statistical model with more than one response variable. Thus, a multivariable model, like a univariable model, can be univariate or multivariate, and the two terms should not be used interchangeably.
The word 'correlation' may seem more scientific than the simpler 'relation', but 'correlation' is one of the most misused statistical terms, and 'relation' is often a more appropriate term because not all relations are correlations. Correlation implies a linear relationship, and even some closely related variables are not correlated.
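A tiny sketch of the last point: a variable and its square are perfectly related, yet their correlation is essentially zero when the variable is symmetric around zero:

```python
import numpy as np

# Sketch: a perfect (non-linear) relation with essentially zero correlation.
x = np.linspace(-1, 1, 101)
y = x**2                       # y is completely determined by x
print(f"correlation(x, y) = {np.corrcoef(x, y)[0, 1]:.3f}")   # approximately 0
```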
Even worse, the use of trial-specific terminology such as primary endpoint, interim analysis, and intention-to-treat in an observational study can be interpreted as spin, an attempt to mislead the reader about the level of evidence of the results.
[[#Menu]]