Regression is used to examine the relationship between variables.
Linear regression uses the least squares method.
The idea is to draw a straight line through the plotted data points. The line is positioned so that it minimizes the sum of the squared vertical distances between the line and the data points.
Each of these distances is called a "residual" or "error".
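To make the least squares idea concrete, here is a minimal sketch in plain Python that fits a line and computes the residuals by hand. The toy data and the helper name fit_line are assumptions made for illustration; they are not part of the health data set used below.

```python
# Hypothetical toy data, not taken from data.csv
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def fit_line(x, y):
    """Return (slope, intercept) of the least squares line for x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(x, y)

# Residual = observed y minus the y value predicted by the line
residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]

# The least squares line is the one that makes this sum as small as possible
sse = sum(res ** 2 for res in residuals)

print(slope, intercept, sse)
```

For this toy data the slope is about 0.6 and the intercept about 2.2. Any other straight line would give a larger sum of squared residuals.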
The main steps are:
Import the modules you need: pandas, Matplotlib and SciPy
Isolate Average_Pulse as x. Isolate Calorie_burnage as y
Get important key values with: slope, intercept, r, p, std_err = stats.linregress(x, y)
Create a function that uses the slope and intercept values to return a new value. This new value represents where on the y-axis the corresponding x value will be placed
Run each value of the x array through the function. This will result in a new array with new values for the y-axis: mymodel = list(map(myfunc, x))
Draw the original scatter plot: plt.scatter(x, y)
Draw the line of linear regression: plt.plot(x, mymodel)
Define the minimum and maximum values of the axes
Label the axes: "Average_Pulse" and "Calorie_Burnage"
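import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# Load the data set; the first row holds the column names
full_health_data = pd.read_csv("data.csv", header=0, sep=",")

# Explanatory variable (x) and response variable (y)
x = full_health_data["Average_Pulse"]
y = full_health_data["Calorie_Burnage"]

# Fit the least squares line
slope, intercept, r, p, std_err = stats.linregress(x, y)

def myfunc(x):
    # Predicted y value for a given x value
    return slope * x + intercept

# Predicted y value for every x value in the data
mymodel = list(map(myfunc, x))

# Scatter plot of the data with the regression line on top
plt.scatter(x, y)
plt.plot(x, mymodel)
plt.ylim(ymin=0, ymax=2000)
plt.xlim(xmin=0, xmax=200)
plt.xlabel("Average_Pulse")
plt.ylabel("Calorie_Burnage")
plt.show()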
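If data.csv is not at hand, the same fit can be tried on a few made-up values. The sketch below assumes SciPy is installed; the pulse and calorie numbers and the name predict are hypothetical and only illustrate how the slope and intercept returned by stats.linregress can be used to predict a new value.

```python
from scipy import stats

# Hypothetical values for illustration only, not taken from data.csv
average_pulse = [80, 90, 100, 110, 120]
calorie_burnage = [240, 270, 310, 330, 350]

# linregress returns the slope, intercept, correlation coefficient r,
# p-value and standard error of the slope
slope, intercept, r, p, std_err = stats.linregress(average_pulse, calorie_burnage)

def predict(pulse):
    # Predicted calorie burnage for a given average pulse
    return slope * pulse + intercept

predicted = predict(105)

print("slope:", slope)
print("intercept:", intercept)
print("r:", r)
print("prediction for pulse 105:", predicted)
```

With these made-up numbers the slope is about 2.8 and the intercept about 20, so a pulse of 105 gives a predicted calorie burnage of roughly 314.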
References:
https://www.w3schools.com/datascience/ds_linear_regression.asp