In regression analysis, heteroscedasticity (sometimes spelled heteroskedasticity) refers to the unequal scatter of residuals or error terms. Specifically, it refers to the case where there is a systematic change in the spread of the residuals over the range of measured values.
Heteroscedasticity is a problem because ordinary least squares (OLS) regression assumes that the residuals come from a population with homoscedasticity, meaning they have constant variance.
When heteroscedasticity is present in a regression analysis, the results of the analysis become hard to trust. Specifically, heteroscedasticity increases the variance of the regression coefficient estimates, but the regression model doesn’t pick up on this.
This makes it much more likely for a regression model to declare that a term in the model is statistically significant, when in fact it is not.
This tutorial will explain how to detect heteroscedasticity and what causes it. It also explains possible solutions to the problem.
How to Detect Heteroscedasticity
The simplest way to detect heteroscedasticity is with a fitted value vs. residual plot.
Once you have fit a regression line to a set of data, you can create a scatterplot that shows the fitted values of the model plotted against the residuals.
The scatterplot below shows a typical fitted value vs. residual plot in which heteroscedasticity is present.
As the fitted values increase, the residuals get more dispersed. This “cone” shape is a telltale sign of heteroscedasticity.
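This pattern can also be checked numerically. The sketch below uses NumPy with made-up data whose error spread grows with the predictor; the specific numbers and variable names are illustrative, not from any real dataset. It fits a simple OLS line and compares the spread of the residuals in the lower versus upper half of the fitted values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: the error standard deviation grows with x (illustrative only)
n = 500
x = rng.uniform(1, 10, n)
y = 2.0 * x + rng.normal(0, x)  # noise scale proportional to x

# Fit a simple OLS line with numpy
slope, intercept = np.polyfit(x, y, 1)
fitted = slope * x + intercept
residuals = y - fitted

# Compare residual spread in the lower vs. upper half of the fitted values
order = np.argsort(fitted)
low_spread = residuals[order[: n // 2]].std()
high_spread = residuals[order[n // 2 :]].std()
print(low_spread, high_spread)  # the "cone" shape: spread grows with fitted values
```

If the residual spread in the upper half is clearly larger than in the lower half, that is the same cone shape you would see in the plot.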
What Causes Heteroscedasticity?
Heteroscedasticity occurs naturally in datasets where there is a large range of observed data values. For example:
- Consider a dataset that includes the annual income and expenses of 100,000 people across the United States. For individuals with lower incomes, there will be lower variability in the corresponding expenses, since these individuals likely only have enough money to pay for the necessities. For individuals with higher incomes, there will be higher variability in the corresponding expenses, since these individuals have more money to spend if they choose to: some will spend most of their income, while others will be frugal and spend only a portion of it.
- Consider a dataset that includes the populations and the counts of flower shops in 1,000 different cities across the United States. Small cities may have only one or two flower shops, but cities with larger populations will show much greater variability in the number of flower shops: these cities might have anywhere between 10 and 100 shops. This means that when we run a regression analysis using population to predict the number of flower shops, there will inherently be greater variability in the residuals for the cities with larger populations.
Datasets like these, with a wide range of observed values, are simply more prone to heteroscedasticity than others.
How to Fix Heteroscedasticity
There are three common ways to fix heteroscedasticity:
1. Transform the dependent variable
One way to fix heteroscedasticity is to transform the dependent variable in some way. The log of the dependent variable is one common transformation.
For example, if we are using population size (independent variable) to predict the number of flower shops in a city (dependent variable), we may instead try to use population size to predict the log of the number of flower shops in a city.
Using the log of the dependent variable, rather than the original dependent variable, often causes heteroskedasticity to go away.
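As a sketch of why this works, the code below uses NumPy with a made-up multiplicative relationship (the exponential form and all parameter values are assumptions chosen purely for illustration). In levels, the noise scales with the fitted value; after taking the log of the dependent variable, the noise becomes additive with roughly constant spread:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data with multiplicative noise (all values illustrative)
pop = rng.uniform(1, 100, 400)                       # e.g. population in thousands
shops = np.exp(0.02 * pop) * rng.lognormal(0, 0.3, 400)

def residual_spread_ratio(x, y):
    """Std of residuals in the upper half of fitted values / lower half."""
    slope, intercept = np.polyfit(x, y, 1)
    fitted = slope * x + intercept
    resid = y - fitted
    order = np.argsort(fitted)
    half = len(x) // 2
    return resid[order[half:]].std() / resid[order[:half]].std()

raw_ratio = residual_spread_ratio(pop, shops)          # well above 1: cone shape
log_ratio = residual_spread_ratio(pop, np.log(shops))  # much closer to 1
print(raw_ratio, log_ratio)
```

A ratio near 1 after the transformation suggests the residual spread no longer grows with the fitted values.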
2. Redefine the dependent variable
Another way to fix heteroscedasticity is to redefine the dependent variable. It is common to use a rate instead of the raw value to redefine the dependent variable.
For example, instead of using the population size to predict the number of flower shops in a city, we may instead use population size to predict the number of flower shops per capita.
In most cases, this reduces the variability that naturally occurs among larger populations since we’re measuring the number of flower shops per person, rather than the sheer amount of flower shops.
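The sketch below illustrates the idea with hypothetical city data (the population range, the roughly one-shop-per-10,000-residents rate, and the noise level are all made-up assumptions). Dividing the raw count by population yields a per-capita rate whose spread no longer grows with city size:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical city data: shop counts scale with population, with multiplicative noise
pop = rng.uniform(10_000, 1_000_000, 300)
shops = pop / 10_000 * rng.lognormal(0, 0.4, 300)   # roughly 1 shop per 10k residents

per_capita = shops / pop   # redefined dependent variable: shops per resident

# Spread of the raw counts grows with population; the per-capita rate does not
order = np.argsort(pop)
half = 150
raw_ratio = shops[order[half:]].std() / shops[order[:half]].std()
rate_ratio = per_capita[order[half:]].std() / per_capita[order[:half]].std()
print(raw_ratio, rate_ratio)
```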
3. Use weighted regression
Another way to fix heteroscedasticity is to use weighted regression. This type of regression assigns a weight to each data point based on the variance of its fitted value.
Essentially, this gives small weights to data points that have higher variances, which shrinks their squared residuals. When the proper weights are used, this can eliminate the problem of heteroscedasticity.
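A minimal weighted least squares sketch, assuming (for illustration) that the error variance is known to scale as x squared, so each point gets weight 1/x². In practice the variance structure usually has to be estimated; here it is simply assumed:

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up data where the error standard deviation is proportional to x
n = 1000
x = rng.uniform(1, 10, n)
y = 1.5 * x + 4.0 + rng.normal(0, x)

# Assumed variance structure: Var(error) proportional to x**2, so weight = 1/x**2
w = 1.0 / x**2
X = np.column_stack([x, np.ones(n)])

# Solve weighted least squares by rescaling each row by sqrt(weight)
sw = np.sqrt(w)
beta_wls, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print("WLS (slope, intercept):", beta_wls)
print("OLS (slope, intercept):", beta_ols)
```

Both estimators are unbiased here, but the weighted fit is more efficient because it downweights the noisy high-variance points.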
FAQ:
Q1. What is heteroskedasticity?
A1. Heteroskedasticity is a statistical term that refers to the situation where the variance of the errors in a regression model is not constant across observations.
Q2. What causes heteroskedasticity?
A2. Factors such as outliers, measurement errors, and omitted variables can cause heteroskedasticity.
Q3. What are the consequences of heteroskedasticity?
A3. Heteroskedasticity leaves the OLS coefficient estimates unbiased, but it biases their estimated standard errors, which leads to unreliable confidence intervals and incorrect hypothesis testing results.
Q4. How can I detect heteroskedasticity?
A4. You can detect heteroskedasticity by examining the residuals of the regression model using graphical methods or statistical tests.
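One common statistical test is the Breusch-Pagan test. The sketch below implements a simplified single-regressor version from scratch in NumPy (a helper function written for this example, not a library API): regress the squared residuals on the regressor and compare n times the R² of that auxiliary regression to a chi-square critical value:

```python
import numpy as np

rng = np.random.default_rng(4)

def breusch_pagan_lm(x, y):
    """LM statistic of a simplified Breusch-Pagan test with one regressor."""
    X = np.column_stack([x, np.ones(len(x))])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # Auxiliary regression: squared residuals on the regressor
    u2 = resid**2
    g, *_ = np.linalg.lstsq(X, u2, rcond=None)
    aux_fitted = X @ g
    r2 = 1 - ((u2 - aux_fitted) ** 2).sum() / ((u2 - u2.mean()) ** 2).sum()
    return len(x) * r2  # compare to chi-square(1); 5% critical value is about 3.84

x = rng.uniform(1, 10, 500)
y_het = 2 * x + rng.normal(0, x, 500)  # heteroscedastic errors
y_hom = 2 * x + rng.normal(0, 3, 500)  # homoscedastic errors
lm_het = breusch_pagan_lm(x, y_het)    # large: rejects constant variance
lm_hom = breusch_pagan_lm(x, y_hom)    # typically small
print(lm_het, lm_hom)
```

In practice, a library implementation such as the one in statsmodels would normally be used instead of hand-rolling the test.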
Q5. What are some methods for fixing heteroskedasticity?
A5. Some methods for fixing heteroskedasticity include transforming the dependent variable, using weighted least squares regression, and using robust standard errors.
Conclusion
Heteroscedasticity is a fairly common problem when it comes to regression analysis because so many datasets are inherently prone to non-constant variance.
It can be quite easy to spot heteroscedasticity using a fitted values vs. residuals plot.
The problem of heteroscedasticity is often eliminated by transforming the dependent variable, redefining the dependent variable, or using weighted regression.