How to make predictions from data

Imagine having an equation that describes the relationship between two or more variables and helps you make predictions. How handy would it be to have the math that describes how weight varies with height, for example?

Well, you do. Regression is an incredibly powerful and flexible statistical tool. Regression supercharges correlation by quantifying relationships. Regression coefficients form a mathematical model of a relationship, and you can use that model to make predictions. The simplest regression equation is y = ax + b, where a and b are regression coefficients.
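To make y = ax + b concrete, here is a minimal pure-Python sketch of fitting that line by least squares. The height/weight numbers are made up for illustration (and deliberately perfectly linear, so the fit recovers the exact slope and intercept):

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b using the closed form:
    slope a = cov(x, y) / var(x), intercept b = mean(y) - a * mean(x)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sxy / sxx
    b = my - a * mx
    return a, b

# Hypothetical data: heights in inches, weights in pounds.
heights = [60, 65, 70, 75]
weights = [5 * h - 200 for h in heights]  # a perfectly linear toy relationship

a, b = fit_line(heights, weights)
print(a, b)             # recovers slope 5.0 and intercept -200.0
predicted = a * 68 + b  # predict the weight of someone 68 inches tall
```

Real data are noisy, so the fitted line won't pass through every point; least squares finds the a and b that minimize the total squared prediction error.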

Regression analysis can unscramble very intricate problems. It is the workhorse of scientific research and business analytics. Let’s list some examples of the kinds of questions you can answer with regression.

  1. Do socio-economic status and race affect educational achievement?
  2. Do exercise and diet affect health?
  3. How does a customer’s annual income affect the sales of my product?
  4. How do price and promotion affect sales?
  5. How do leadership demographics affect a company’s financial performance?
  6. Is there an optimal number of people on a software development team?
  7. Is there a relationship between a job candidate’s background and qualifications and how they will perform?

All these questions involve multiple entwined independent variables, each of which can influence the dependent variable. Regression tells you which variables are statistically significant and what role each one plays. It’s the great detangler.

Let’s take a small diversion to explain independent and dependent variables.

Independent variables are the ones you include to explain or predict the dependent variable. They stand alone, other variables in the model do not influence them, and you aren’t trying to understand what causes them to change. You can also call them predictors, factors, treatment variables, input variables, explanatory variables, and x-variables. They are usually placed on the x-axis of a chart. They are always on the right-hand side of the equals sign in an equation. In machine learning, they are known as features. Figuring out which ones to include is called model specification.

The dependent variable is what you want to explain or predict. It depends on the other variables. You’ll usually find it on the y-axis and on the left of the equals sign.

Crucially, regression controls for every variable in your model. It isolates the role of each variable by estimating the effect that changing one independent variable has on the dependent variable while all other variables are held constant. Part of regression’s power is that it isolates the effect of each variable simply by including it in the model.

One way to develop an intuition for how controlling variables works is to imagine the model being something in the physical world. Charles Wheelan makes this intuitive with an example.

Imagine we want to isolate the effect on weight of a single variable, say, education. Imagine a big, diverse group of people meeting in a single place. Now imagine separating men from women. Then imagine that both men and women are further separated by height. Now imagine subdividing the groups by income.

Eventually there will be many rooms, each containing individuals who are identical in all respects except for education and weight, the two we care about. Our goal is to see how much of the variation in weight in each room can be explained by education. The clever, clever thing about regression is that it calculates a single coefficient for education that it can use in every room. It is the best explanation of a linear relationship between education and weight for this group of people when sex, height, and income are held constant.
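To see this in code rather than rooms, here is a minimal ordinary-least-squares sketch in pure Python using the normal equations. The data are synthetic and noise-free, generated from known coefficients, so the model recovers exactly the education coefficient that holds height constant (a real dataset would, of course, be noisy):

```python
def solve(A, b):
    # Gaussian elimination with partial pivoting for a small square system.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def ols(X, y):
    # Ordinary least squares via the normal equations: (X^T X) beta = X^T y.
    n, k = len(X), len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)]
           for a in range(k)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    return solve(XtX, Xty)

# Synthetic people: height (inches) and education (years) vary independently.
heights = [60, 62, 64, 66, 68, 70]
education = [10, 12, 14, 16, 12, 18]
# Weights generated from known coefficients: weight = 100 + 3*height - 2*education.
weights = [100 + 3 * h - 2 * e for h, e in zip(heights, education)]

X = [[1.0, h, e] for h, e in zip(heights, education)]  # column of 1s = intercept
intercept, b_height, b_edu = ols(X, weights)
# b_edu ≈ -2.0: each extra year of education predicts 2 fewer pounds,
# holding height constant -- one coefficient that applies in every "room".
```

The single `b_edu` coefficient plays the role of the one education coefficient shared across all the rooms in Wheelan’s thought experiment.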

This example gives you a sense of why big data is so valuable. We can control for multiple factors while still having a lot of observations in each “room.” Not only does a computer think in more dimensions than a human, it can do it all in a nanosecond without “herding thousands of people into different rooms.”

The outputs of a regression include regression coefficients and p-values. When you have a low p-value (typically less than 0.05), the independent variable is statistically significant. The coefficients represent the average change in the dependent variable given a single-unit change in the independent variable (while controlling for the other independent variables, of course).

For example, in a regression on one of Wheelan’s datasets that includes both height and age, the output is:

WEIGHT = -145 + 4.6 * (HEIGHT IN INCHES) + 0.1 * (AGE IN YEARS)

The 0.1 coefficient for age means that for every additional year in age, the model will predict that a person will weigh 0.1 additional pounds, holding height constant. This coefficient is significant at the 0.05 level; that is, the p-value is less than or equal to 0.05.
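Once you have the coefficients, prediction is just arithmetic. A short sketch that plugs values into the equation above:

```python
def predict_weight(height_in, age_yr):
    # Wheelan's fitted model: WEIGHT = -145 + 4.6*HEIGHT + 0.1*AGE
    return -145 + 4.6 * height_in + 0.1 * age_yr

print(predict_weight(70, 30))  # -145 + 322 + 3 ≈ 180 pounds

# Holding height fixed, one extra year of age adds the age coefficient:
print(predict_weight(70, 31) - predict_weight(70, 30))  # ≈ 0.1
```

The second print is the coefficient interpretation in action: the difference between two predictions that vary only in age is exactly the age coefficient.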

Regression coefficients and r-squared are both measures used in regression analysis to evaluate the fit of a model and the relationship between the independent and dependent variables. However, they are different measures with different interpretations.

Regression coefficients are estimated values that describe the relationship between each independent variable and the dependent variable in a regression model. Each independent variable in the model is associated with its own regression coefficient, which represents the change in the dependent variable associated with a unit change in the independent variable, holding all other independent variables constant. Regression coefficients are used to make predictions about the dependent variable based on changes in the independent variables.

R-squared (R^2), on the other hand, is a statistical measure that represents the proportion of the variance in the dependent variable that is explained by the independent variables in a regression model. R-squared values range from 0 to 1, with higher values indicating a better fit between the model and the data. R-squared is a summary measure that provides information about the overall goodness-of-fit of the model. An unexpectedly low R-squared can also prompt a closer look for outliers or influential observations that may be affecting the model fit.

A high R-squared value indicates that the independent variables explain a large portion of the variation in the dependent variable, suggesting the model is a good fit for the data. A low R-squared value indicates they explain only a small portion, suggesting the model may not fit the data well.
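Concretely, R-squared is one minus the ratio of residual variation (what the line fails to explain) to total variation. A pure-Python sketch with made-up data:

```python
def r_squared(xs, ys):
    # Fit y = a*x + b by least squares, then compute R^2 = 1 - SS_res / SS_tot.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))  # unexplained
    ss_tot = sum((y - my) ** 2 for y in ys)                       # total
    return 1 - ss_res / ss_tot

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]      # toy data: a loose upward trend
print(r_squared(xs, ys))  # ≈ 0.6 -- the line explains about 60% of the variation
```

If the points fell exactly on the line, `ss_res` would be zero and R-squared would be 1; if the line did no better than simply predicting the mean of y, R-squared would be 0.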

As miraculous as regression is, there are important things to keep in mind if you want a trustworthy result. First, you have to specify the model correctly. If you fail to include all important variables, the results can be biased.

It’s important to model responsibly—check for technical problems such as overfitting and, as much as you are able, start with a theory. Modern data mining can be a potent starting point, but blindly automating feature selection can end up automating randomness.

Crucially, correlation doesn’t imply causation. When there are many other elements involved (what statisticians call confounding variables) or when there is excessive correlation between independent variables (called multicollinearity), the model can mislead.

Models don’t have to be causal to be insightful. This is why machine learning and AI are used by the biggest and most valuable companies in the world. Machines can perform far more complex regressions and correlations on vast datasets than humans ever could.

Basic understanding of regression can give you an intuition for an important statistical analysis tool that has wide application in business, policy, and science. Machine learning has supersized regression. Many AI models are effectively huge and sophisticated regression analyses which change and adapt to data on their own. It’s vital to have minimum viable math and sound critical reasoning because people can make mistakes with modeling inputs and in how they interpret results.