Overview
Scatterplots and Regression is a medium-to-hard topic in the Problem-Solving and Data Analysis domain on the digital SAT. Questions require students to identify the direction and strength of association between two variables, interpret the slope and y-intercept of a line of best fit in real-world context, compute and interpret residuals, distinguish between linear, quadratic, and exponential models, and recognize the difference between correlation and causation. For the May 2026 SAT, test-takers can expect 2–3 regression questions per Math module, and Desmos can assist with fitting regression lines. These are calculator-active questions.
Key Points
1. Types of Association
| Association | Pattern on Scatterplot |
|---|---|
| Positive | Points trend upward left to right |
| Negative | Points trend downward left to right |
| No association | Points scattered with no pattern |
| Linear | Points cluster around a straight line |
| Nonlinear | Points cluster around a curve |
Strength of association: the closer the points cluster to the trend line or curve, the stronger the association.
2. Line of Best Fit
The line of best fit (least-squares regression line) minimizes the sum of squared vertical distances from all points to the line.
Equation form: y = mx + b
| Parameter | Meaning in Context |
|---|---|
| Slope (m) | Predicted change in y for each 1-unit increase in x |
| y-intercept (b) | Predicted value of y when x = 0 |
Slope interpretation (3-step method):
- Calculate or read the slope value
- Identify the units of both axes
- Write: “For every 1 [x-unit] increase in [x-variable], the predicted [y-variable] [increases/decreases] by [|m|] [y-unit]”
Example: If a line of best fit for (study hours, test score) has slope = 4.5: “For every additional hour of study, the predicted test score increases by 4.5 points.”
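To make the least-squares idea and the slope sentence concrete, here is a minimal Python sketch. The data set (`hours`, `scores`) and the helper name `least_squares_line` are made up for illustration; they do not come from an actual SAT question. On the test you would read the slope from the given equation or let Desmos compute it.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared x-deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: hours studied and test scores
hours = [1, 2, 3, 4, 5]
scores = [62, 68, 71, 77, 80]

m, b = least_squares_line(hours, scores)
direction = "increases" if m >= 0 else "decreases"
print(f"score = {m:.1f} * hours + {b:.1f}")
print(f"For every additional hour of study, the predicted score "
      f"{direction} by {abs(m):.1f} points.")
```

The intercept here is the predicted score at 0 hours of study, which matches the "predicted value of y when x = 0" reading in the table above.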
3. Residuals
A residual is the difference between an observed y-value and the value the line of best fit predicts at the same x: Residual = Actual − Predicted.
| Residual sign | Meaning |
|---|---|
| Positive | Actual point is ABOVE the line |
| Negative | Actual point is BELOW the line |
| Zero | Actual point is exactly ON the line |
A residual plot (residuals vs. x-values) with random scatter indicates a good model fit. A pattern in the residual plot suggests the model is not the best fit.
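The sign rule in the table can be checked with a short Python sketch. The data and the line y = 4.5x + 58.1 are assumptions chosen for illustration, and `residuals` is a hypothetical helper name.

```python
def residuals(xs, ys, slope, intercept):
    """Residual = actual y minus predicted y, one value per data point."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


# Hypothetical data and an assumed line of best fit
hours = [1, 2, 3, 4, 5]
scores = [62, 68, 71, 77, 80]
slope, intercept = 4.5, 58.1

res = residuals(hours, scores, slope, intercept)
for x, y, r in zip(hours, scores, res):
    r = round(r, 1)  # round away floating-point noise before checking the sign
    position = "above" if r > 0 else "below" if r < 0 else "on"
    print(f"x={x}: actual={y}, residual={r:+.1f} ({position} the line)")
```

Plotting `res` against `hours` would give the residual plot described above; in this small example the signs alternate with no clear curve, which is consistent with a linear model being reasonable.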
4. Correlation Coefficient r
- r ranges from −1 to +1
- r close to +1 → strong positive linear association
- r close to −1 → strong negative linear association
- r close to 0 → weak or no linear association
- |r| ≥ 0.8 → generally considered a strong linear association (a rule of thumb, not a fixed cutoff)
r only measures the strength of linear association. A curved relationship may have r ≈ 0 even though the association is very strong.
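The point about curved relationships can be illustrated with a small Python sketch. The function `pearson_r` is a hand-written helper (an assumption for this example, not a Desmos or SAT tool), and both data sets are invented.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]   # points exactly on a rising line
parabola = [x ** 2 for x in xs]    # points exactly on a symmetric parabola

print(f"r for the line:     {pearson_r(xs, linear):.2f}")
print(f"r for the parabola: {pearson_r(xs, parabola):.2f}")
```

The parabola's y-values are completely determined by x, yet its r is 0 because the left half falls exactly as much as the right half rises. That is why r should be read only as a measure of linear association.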
5. Choosing the Right Model
| Model | Shape | Context clues |
|---|---|---|
| Linear | Straight line | Constant rate of change; “increases by X per unit” |
| Quadratic | Parabola (U or arch) | Projectile motion; “slows then reverses” |
| Exponential | J-curve or decay | Growth/decay rates; “doubles every,” “half-life” |
Desmos regression (for the digital SAT):
- Linear: type y₁ ~ mx₁ + b
- Quadratic: type y₁ ~ ax₁² + bx₁ + c
- Exponential: type y₁ ~ ab^x₁
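When a question gives a table of exact values at equally spaced x-values, the context clues above translate into a quick numeric check: constant first differences suggest a linear model, constant second differences a quadratic one, and constant ratios an exponential one. The sketch below assumes exact, noise-free values; `suggest_model` is a hypothetical helper, and real scatterplot data would need a regression instead.

```python
def suggest_model(ys, tol=1e-9):
    """Guess a model type from y-values listed at equally spaced x-values.

    Use at least 4 values: any 3 points fit some parabola exactly.
    """
    first = [b - a for a, b in zip(ys, ys[1:])]
    second = [b - a for a, b in zip(first, first[1:])]

    # Constant first differences: constant rate of change
    if all(abs(d - first[0]) <= tol for d in first):
        return "linear"
    # Constant second differences: quadratic
    if all(abs(d - second[0]) <= tol for d in second):
        return "quadratic"
    # Constant ratio between consecutive values: exponential
    if all(y != 0 for y in ys):
        ratios = [b / a for a, b in zip(ys, ys[1:])]
        if all(abs(r - ratios[0]) <= tol for r in ratios):
            return "exponential"
    return "unclear"


print(suggest_model([3, 7, 11, 15]))     # increases by 4 each step
print(suggest_model([1, 4, 9, 16, 25]))  # perfect squares
print(suggest_model([5, 10, 20, 40]))    # doubles each step
```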
6. Correlation vs. Causation
A correlation between X and Y does NOT mean X causes Y.
Possible explanations when X and Y are correlated:
- X causes Y
- Y causes X
- A third variable Z causes both X and Y (confounding/lurking variable)
- Coincidence
The SAT frequently presents a scenario and asks what can or cannot be concluded. Key language: “cannot be concluded from this study” or “the data suggest but do not prove.”
7. Interpolation vs. Extrapolation
| Type | Definition | Reliability |
|---|---|---|
| Interpolation | Prediction within the observed data range | More reliable |
| Extrapolation | Prediction beyond the observed data range | Less reliable; model may not hold |
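The distinction in the table comes down to a range check on x. A minimal Python sketch, using made-up observed x-values and a hypothetical helper `classify_prediction`:

```python
def classify_prediction(x_new, observed_xs):
    """Label a prediction by whether x_new lies inside the observed x-range."""
    lowest, highest = min(observed_xs), max(observed_xs)
    if lowest <= x_new <= highest:
        return "interpolation"
    return "extrapolation"


# Hypothetical observed x-values (hours studied)
hours = [1, 2, 3, 4, 5]

print(classify_prediction(3.5, hours))  # inside the range 1 to 5
print(classify_prediction(12, hours))   # well beyond the largest observed x
```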
Pitfalls and Common Mistakes
Mistake 1: Confusing correlation with causation. A strong correlation (|r| close to 1) between two variables does not prove that one causes the other. Fix: Look for a possible third-variable explanation; for observational data, the correct answer typically states that “an association exists” without claiming causation.
Mistake 2: Misidentifying the sign of a residual. Students confuse which direction is positive. Fix: Residual = Actual − Predicted. If the point is above the line, actual > predicted, so residual > 0.
Mistake 3: Interpreting the y-intercept as meaningful when x = 0 is outside the data range. For a model of adult heights vs. ages, the y-intercept (age = 0) may not make real-world sense. Fix: Note whether x = 0 falls within the data range; if not, the y-intercept is a mathematical artifact, not a meaningful prediction.
Mistake 4: Assuming a high |r| means a linear model is the best fit. r measures linear association only, so it cannot confirm that the pattern is a straight line; a quadratic or exponential model may fit better even when |r| is moderate or high. Fix: Always inspect the shape of the scatterplot before choosing a model.
Mistake 5: Extrapolating far beyond the data and trusting the prediction. The line of best fit may not hold outside the data range. Fix: Flag any prediction as extrapolation when it is outside the observed x-values, and treat it with caution.
Quick Reference Card
| Concept | Formula / Rule |
|---|---|
| Residual | Actual − Predicted (positive = above line) |
| Slope interpretation | Change in y per 1-unit increase in x |
| r range | −1 ≤ r ≤ +1; closer to ±1 = stronger |
| Correlation ≠ causation | Association alone does not show causation; a cause-and-effect claim requires a controlled experiment |
| Linear model | y = mx + b (constant rate) |
| Exponential model | y = ab^x (percent-based growth/decay) |
| Interpolation | Within data range: more reliable |
| Extrapolation | Beyond data range: less reliable |
| Desmos linear regression | y₁ ~ mx₁ + b |