Suppose we know that two paired variables have a linear relationship.
The decision can be made by the p-value method or by the traditional method based on critical values. A Normal distribution table will be needed.
The general testing procedure is like that used in the cases of σ1 and σ2 unknown with σ1 ≠ σ2, and σ1 and σ2 unknown with σ1 = σ2.
Linear regression is a tool for quantitatively describing the linear relationship between two variables.
This is done by finding the graph and equation of the straight line that best represents the relationship.
The best-fitting straight line is called the regression line, and its equation is called the regression equation.
Equation of Linear Regression
For two paired variables with perfect linear correlation, we know from algebra that there is a straight line on which all paired values lie exactly, with the equation shown below:
y=mx+b
Basic Concepts of Regression
In most cases, however, two variables do not have perfect linear correlation. When we find the best-fitting straight line, not all paired values can lie exactly on it.
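Given a collection of paired sample data, the regression equation (the equation of the best-fitting line) is expressed as:
ŷ = mx + b
where ŷ is the predicted value of y, m is the slope, and b is the y-intercept.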
Residuals
For a pair of sample x and y values, the residual is the difference between the observed sample value of y and the value ŷ that is predicted by using the regression equation:
Residual = y − ŷ
Example:
Given x = 1, 2, 4, 5 and y = 4, 24, 8, 32, find the regression equation.
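The example can be worked through with a short Python sketch. It assumes the standard least-squares formulas in deviation form, and the function name `fit_line` is hypothetical, chosen only for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx
    b = y_bar - m * x_bar
    return m, b


xs = [1, 2, 4, 5]
ys = [4, 24, 8, 32]

m, b = fit_line(xs, ys)
predicted = [m * x + b for x in xs]
# Residual = observed y minus predicted y-hat
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]

print(f"y-hat = {m:g}x + {b:g}")
print("residuals:", residuals)
print("sum of squared residuals:", sum(e ** 2 for e in residuals))
```

For this sample the fitted line is ŷ = 4x + 5, the residuals are −5, 11, −13 and 7, and the sum of their squares is 364. No other straight line gives a smaller sum for these four points.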
How to find the regression line and equation
The least-squares method
The regression line is the line whose equation makes the sum of the squares of the residuals as small as possible. For a given sample of paired variables x and y, the slope m and the intercept b of the regression equation can be found with the following equations:
m = (nΣxy − (Σx)(Σy)) / (nΣx² − (Σx)²)
b = ((Σy)(Σx²) − (Σx)(Σxy)) / (nΣx² − (Σx)²)
The slope and the intercept of the regression equation can also be computed with the following equations:
m = r (sy / sx)
where r is the linear correlation coefficient, sy is the standard deviation of the observed y values, and sx is the standard deviation of the observed x values.
b = ȳ − m x̄
where ȳ is the mean of the observed y values and x̄ is the mean of the observed x values.
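As a rough check that the two routes to the slope and intercept agree, the sketch below computes them both ways for the small sample x = 1, 2, 4, 5 and y = 4, 24, 8, 32. The function names are hypothetical. The code assumes the usual sum-based least-squares formulas for the first route, and sample standard deviations (divisor n − 1) together with the Pearson correlation coefficient for the second.

```python
import statistics


def slope_intercept_from_sums(xs, ys):
    """Slope and intercept from n, sum(x), sum(y), sum(x^2), sum(xy)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    denom = n * sum_x2 - sum_x ** 2
    m = (n * sum_xy - sum_x * sum_y) / denom
    b = (sum_y * sum_x2 - sum_x * sum_xy) / denom
    return m, b


def slope_intercept_from_r(xs, ys):
    """Slope as r * s_y / s_x, intercept as y_bar - m * x_bar."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample std devs
    # Pearson linear correlation coefficient
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    m = r * s_y / s_x
    b = y_bar - m * x_bar
    return m, b


xs = [1, 2, 4, 5]
ys = [4, 24, 8, 32]
print(slope_intercept_from_sums(xs, ys))
print(slope_intercept_from_r(xs, ys))
```

Both calls should give a slope of 4 and an intercept of 5, up to floating-point rounding in the second one.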
Example
In New York in the 1960s, it was noted that the cost of a slice of pizza was the same as the cost of a subway ride. Over the years it was also noted that the two costs seemed to increase by about the same amounts.
Year | Pizza Slice | Subway Fare |
---|---|---|
1960 | $0.15 | $0.15 |
1973 | $0.35 | $0.35 |
1986 | $1.00 | $1.00 |
1995 | $1.25 | $1.35 |
2002 | $1.75 | $1.50 |
2003 | $2.00 | $2.00 |
Cost of Pizza Slice (x) | Subway Fare (y) | x² | xy |
---|---|---|---|
$0.15 | $0.15 | 0.0225 | 0.0225 |
$0.35 | $0.35 | 0.1225 | 0.1225 |
$1.00 | $1.00 | 1.0000 | 1.0000 |
$1.25 | $1.35 | 1.5625 | 1.6875 |
$1.75 | $1.50 | 3.0625 | 2.6250 |
$2.00 | $2.00 | 4.0000 | 4.0000 |
Σx = $6.50 | Σy = $6.35 | Σx² = 9.77 | Σxy = 9.4575 |
n = 6
Σx = 6.50
Σy = 6.35
Σx² = 9.77
Σxy = 9.4575
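Calculations:
Slope: m = (6 × 9.4575 − 6.50 × 6.35) / (6 × 9.77 − 6.50²) = 15.47 / 16.37 ≈ 0.9450
Intercept: b = ȳ − m x̄ = (6.35 / 6) − 0.9450 × (6.50 / 6) ≈ 0.0346
ŷ = 0.9450x + 0.0346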
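One way to check the hand calculation is to recompute the sums and the coefficients directly from the six data pairs. The sketch below assumes the standard sum-based least-squares formula for the slope and takes the intercept as ȳ − m x̄. The variable names are illustrative only.

```python
# Pizza slice cost (x) and subway fare (y), in dollars, from the table
pizza = [0.15, 0.35, 1.00, 1.25, 1.75, 2.00]
subway = [0.15, 0.35, 1.00, 1.35, 1.50, 2.00]

n = len(pizza)
sum_x = sum(pizza)
sum_y = sum(subway)
sum_x2 = sum(x * x for x in pizza)
sum_xy = sum(x * y for x, y in zip(pizza, subway))

# Slope from the sums, intercept from the two sample means
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = sum_y / n - m * (sum_x / n)

print(f"n = {n}, sum x = {sum_x:.2f}, sum y = {sum_y:.2f}")
print(f"sum x^2 = {sum_x2:.4f}, sum xy = {sum_xy:.4f}")
print(f"slope m = {m:.4f}")
print(f"intercept b = {b:.4f}")
```

Printing to four decimal places makes it easy to compare the output with the rounded values used in the regression equation.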
Requirements for the sample
The sample of paired (x, y) data is a random sample of independent quantitative data.
Visual examination of the scatter-plot must confirm that the points approximate a straight-line pattern.
Outliers known to be errors should be removed, and the effects of outliers that are not errors should be considered.
Using Regression for Predictions
Considerations:
Use the regression equation for predictions only if the linear correlation coefficient r indicates that there is a linear correlation between the two variables.
Use the regression line for predictions only if the x value used does not go much beyond the range of the available sample data.
Only make predictions if the regression line fits the points on the graph reasonably well.
If the regression equation does not appear to be useful for making predictions, the best predicted value of a variable is its point estimate, which is its sample mean.
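The first and last considerations above can be sketched as a small decision rule. The function name `best_predicted_value` and its `r_critical` parameter are hypothetical. The sketch assumes the caller supplies a critical value of r taken from a standard table for the chosen significance level and sample size, and it does not check whether the new x value lies within the range of the sample or how well the line fits the scatter-plot.

```python
import math


def best_predicted_value(xs, ys, x_new, r_critical):
    """Predict y at x_new from the regression line if |r| exceeds r_critical;
    otherwise fall back to the sample mean of y."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)  # linear correlation coefficient

    if abs(r) > r_critical:
        m = sxy / sxx
        b = y_bar - m * x_bar
        return m * x_new + b
    return y_bar  # no evidence of linear correlation: use the mean of y


pizza = [0.15, 0.35, 1.00, 1.25, 1.75, 2.00]
subway = [0.15, 0.35, 1.00, 1.35, 1.50, 2.00]

# 0.811 is assumed here as the tabulated critical value of r
# for n = 6 at the 0.05 significance level.
fare = best_predicted_value(pizza, subway, x_new=1.50, r_critical=0.811)
print(f"predicted subway fare: ${fare:.2f}")
```

Passing the critical value in as an argument keeps the sketch independent of any particular table or significance level. A fuller version might also refuse to predict for x values far outside the sample range.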
University of Exeter LibGuide is licensed under CC BY 4.0