Overview

Scatterplots and Regression is a medium-to-hard topic in the Problem-Solving and Data Analysis domain on the digital SAT. Questions ask students to identify the direction and strength of the association between two variables, interpret the slope and y-intercept of a line of best fit in a real-world context, compute and interpret residuals, distinguish among linear, quadratic, and exponential models, and recognize the difference between correlation and causation. For the May 2026 SAT, test-takers can expect 2–3 regression questions per Math module. A calculator is allowed throughout the Math section, and the built-in Desmos graphing calculator can fit regression lines.

Key Points

1. Types of Association

Association | Pattern on Scatterplot
Positive | Points trend upward left to right
Negative | Points trend downward left to right
No association | Points scattered with no pattern
Linear | Points cluster around a straight line
Nonlinear | Points cluster around a curve

Strength of association: the closer the points cluster to the trend line or curve, the stronger the association.

2. Line of Best Fit

The line of best fit (least-squares regression line) minimizes the sum of squared vertical distances from all points to the line.

Equation form: y = mx + b

Parameter | Meaning in Context
Slope (m) | Predicted change in y for each 1-unit increase in x
y-intercept (b) | Predicted value of y when x = 0

Slope interpretation (3-step method):

  1. Calculate or read the slope value
  2. Identify the units of both axes
  3. Write: “For every 1 [x-unit] increase in [x-variable], the predicted [y-variable] [increases/decreases] by [|m|] [y-unit]”

Example: If a line of best fit for (study hours, test score) has slope = 4.5: “For every additional hour of study, the predicted test score increases by 4.5 points.”
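To make the least-squares idea concrete, here is a minimal Python sketch that computes the slope and y-intercept by hand. The function name `least_squares_line` and the data set are assumptions made up for illustration; the numbers were chosen so the slope matches the 4.5 in the study-hours example above.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = sum of (x - mean_x)(y - mean_y) divided by sum of (x - mean_x)^2
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y)
    b = mean_y - m * mean_x
    return m, b

# Hypothetical data (made up for illustration): study hours and test scores
hours = [1, 2, 3, 4, 5]
scores = [62, 68, 71, 77, 80]

m, b = least_squares_line(hours, scores)
print(f"slope = {m:.1f}, y-intercept = {b:.1f}")
# prints: slope = 4.5, y-intercept = 58.1
```

Read in context, the slope says the predicted score rises by 4.5 points per additional hour of study, and the y-intercept says the predicted score for 0 hours of study is 58.1 points.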

3. Residuals

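The residual for a data point is the actual y-value minus the y-value predicted by the line: Residual = Actual − Predicted.

Residual sign | Meaning
Positive | Actual point is ABOVE the line
Negative | Actual point is BELOW the line
Zero | Actual point is exactly ON the line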

A residual plot (residuals vs. x-values) with random scatter around zero indicates a good model fit. A clear pattern in the residual plot, such as a curve, suggests that a different type of model would fit the data better.
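The sign rule can be checked with a short Python sketch that applies Residual = Actual − Predicted to each point. The function name `residuals`, the line y = 2x + 1, and the three points are all hypothetical, chosen so that one point falls above the line, one on it, and one below it.

```python
def residuals(xs, ys, m, b):
    """Return actual - predicted for each point, using the line y = m*x + b."""
    return [y - (m * x + b) for x, y in zip(xs, ys)]

# Hypothetical line and points (made up for illustration)
m, b = 2.0, 1.0
xs = [1, 2, 3]
ys = [4, 5, 6]   # the line predicts 3, 5, 7 at these x-values

for x, res in zip(xs, residuals(xs, ys, m, b)):
    # A positive residual means the actual point sits above the line
    position = "above" if res > 0 else "below" if res < 0 else "on"
    print(f"x = {x}: residual = {res:+.1f}, point is {position} the line")
```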

4. Correlation Coefficient r

  • r ranges from −1 to +1
  • r close to +1 → strong positive linear association
  • r close to −1 → strong negative linear association
  • r close to 0 → weak or no linear association
  • |r| ≥ 0.8 → generally considered a strong linear association (a common rule of thumb, not a fixed cutoff)

r only measures the strength of linear association. A curved relationship may have r ≈ 0 even though the association is very strong.
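The following Python sketch illustrates that last point by computing r from its standard definition for two small data sets. The helper name `correlation` and both data sets are assumptions for illustration: one set lies exactly on a line, the other exactly on a parabola that is symmetric about x = 0.

```python
from math import sqrt

def correlation(xs, ys):
    """Return the correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = [-2, -1, 0, 1, 2]
line_ys = [2 * x + 1 for x in xs]   # points exactly on a line
curve_ys = [x ** 2 for x in xs]     # points exactly on a symmetric parabola

r_line = correlation(xs, line_ys)
r_curve = correlation(xs, curve_ys)
print(f"line: r = {r_line:.2f}")
print(f"parabola: r = {r_curve:.2f}")
```

For the line, r comes out to 1. For the parabola, r comes out to 0 even though y is completely determined by x, because the downward trend on the left cancels the upward trend on the right.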

5. Choosing the Right Model

Model | Shape | Context clues
Linear | Straight line | Constant rate of change; “increases by X per unit”
Quadratic | Parabola (U or arch) | Projectile motion; “slows then reverses”
Exponential | J-curve or decay | Growth/decay rates; “doubles every,” “half-life”

Desmos regression (for the digital SAT): enter the data in a table with columns x₁ and y₁, then type one of the following:

  • Linear: type y₁ ~ mx₁ + b
  • Quadratic: type y₁ ~ ax₁² + bx₁ + c
  • Exponential: type y₁ ~ ab^x₁
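When a question gives exact table values at equally spaced x-values rather than a noisy scatterplot, the context clues in the model table can be checked numerically: constant differences point to a linear model, constant ratios to an exponential model, and constant second differences to a quadratic model. The Python sketch below is one way to express that check. The function name `suggest_model` and the tolerance are assumptions for illustration, and the check is not a substitute for regression on real, noisy data.

```python
def suggest_model(ys, tol=1e-9):
    """Guess a model type from exact y-values at equally spaced x-values."""
    first_diffs = [b - a for a, b in zip(ys, ys[1:])]
    second_diffs = [b - a for a, b in zip(first_diffs, first_diffs[1:])]
    # Ratios are only defined when no y-value is zero
    ratios = [b / a for a, b in zip(ys, ys[1:])] if all(y != 0 for y in ys) else []

    def is_constant(values):
        return len(values) > 0 and all(abs(v - values[0]) <= tol for v in values)

    if is_constant(first_diffs):
        return "linear"        # constant rate of change
    if is_constant(ratios):
        return "exponential"   # constant multiplier, e.g. "doubles every"
    if is_constant(second_diffs):
        return "quadratic"     # differences themselves change at a constant rate
    return "unclear"

print(suggest_model([3, 5, 7, 9]))     # linear
print(suggest_model([3, 6, 12, 24]))   # exponential
print(suggest_model([1, 4, 9, 16]))    # quadratic
```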

6. Correlation vs. Causation

A correlation between X and Y does NOT mean X causes Y.

Possible explanations when X and Y are correlated:

  • X causes Y
  • Y causes X
  • A third variable Z causes both X and Y (confounding/lurking variable)
  • Coincidence

The SAT frequently presents a scenario and asks what can or cannot be concluded. Key language: “cannot be concluded from this study” or “the data suggest but do not prove.”
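The third-variable explanation can be illustrated with a small Python sketch. All names and numbers here are hypothetical and made up for illustration: daily temperature plays the role of Z, and two other quantities are each built from temperature alone, so neither one causes the other.

```python
from math import sqrt

def correlation(xs, ys):
    """Return the correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical lurking variable Z: daily high temperature
temperature = [20, 22, 25, 27, 30, 33]

# X and Y each depend only on Z plus a small made-up offset.
# Neither formula uses the other variable.
ice_cream_sales = [3 * t + d for t, d in zip(temperature, [1, -1, 2, 0, -2, 1])]
sunburn_cases = [2 * t + d for t, d in zip(temperature, [-1, 1, 0, 2, -1, 0])]

r_xy = correlation(ice_cream_sales, sunburn_cases)
print(f"r between ice cream sales and sunburn cases = {r_xy:.2f}")
```

The printed r is close to +1, yet by construction ice cream sales do not cause sunburn or vice versa; both simply rise with temperature.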

7. Interpolation vs. Extrapolation

Type | Definition | Reliability
Interpolation | Prediction within the observed data range | More reliable
Extrapolation | Prediction beyond the observed data range | Less reliable; model may not hold
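The distinction comes down to a range check on the x-value being predicted, as in this minimal Python sketch. The function name `classify_prediction` and the observed values are assumptions for illustration.

```python
def classify_prediction(x_new, observed_xs):
    """Label a prediction by whether x_new lies inside the observed x-range."""
    if min(observed_xs) <= x_new <= max(observed_xs):
        return "interpolation"
    return "extrapolation"

# Hypothetical observed study hours (made up for illustration)
observed_hours = [1, 2, 3, 4, 5]

print(classify_prediction(3.5, observed_hours))   # interpolation
print(classify_prediction(12, observed_hours))    # extrapolation
```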

Pitfalls and Common Mistakes

Mistake 1: Confusing correlation with causation. A strong r between two variables does not prove that one causes the other. Fix: Look for a possible third-variable explanation; for observational data, the correct SAT answer typically states only that “an association exists” without claiming causation.

Mistake 2: Misidentifying the sign of a residual. Students confuse which direction is positive. Fix: Residual = Actual − Predicted. If the point is above the line, actual > predicted, so residual > 0.

Mistake 3: Interpreting the y-intercept as meaningful when x = 0 is outside the data range. For a model relating adult height to age, the y-intercept (age = 0) may not make real-world sense. Fix: Check whether x = 0 falls within the data range; if not, the y-intercept is a mathematical artifact, not a meaningful prediction.

Mistake 4: Assuming a high |r| means a linear model is the best choice. r measures linear association only, so curved data can still produce a fairly high |r|, and a quadratic or exponential model may fit better than a line. Fix: Always inspect the shape of the scatterplot before choosing a model.

Mistake 5: Extrapolating far beyond the data and trusting the prediction. The line of best fit may not hold outside the data range. Fix: Flag any prediction as extrapolation when it is outside the observed x-values, and treat it with caution.

Quick Reference Card

Concept | Formula / Rule
Residual | Actual − Predicted (positive = above line)
Slope interpretation | Predicted change in y per 1-unit increase in x
r range | −1 ≤ r ≤ +1; closer to ±1 = stronger linear association
Correlation ≠ causation | Association only; establishing causation requires a controlled experiment
Linear model | y = mx + b (constant rate)
Exponential model | y = ab^x (percent-based growth/decay)
Interpolation | Within data range; more reliable
Extrapolation | Beyond data range; less reliable
Desmos linear regression | y₁ ~ mx₁ + b