Tuesday, April 19, 2022

Dataframe tips

You can get the type of each entry in a column with map:

df['ABC'].map(type)

To find all values that are not stored as str, compare the result with str. This gives a boolean mask you can filter with:

df['ABC'].map(type) != str
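As a minimal sketch, the mask can be passed back to the dataframe to select the non-string rows. The frame and its values are made up for illustration; only the column name ABC comes from the text above.

```python
import pandas as pd

# Hypothetical data: one column that mixes strings with other types.
df = pd.DataFrame({"ABC": ["2022-04-19", 42, "hello", 3.5]})

# Boolean mask: True where the entry is not a str.
not_str = df["ABC"].map(type) != str

# Index the frame with the mask to keep only the non-string rows.
non_string_rows = df[not_str]
print(non_string_rows)
```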


If, however, you just want to check whether some of the rows contain a string in a particular format (such as a date), you can use a regex like:

df['ABC'].str.match('[0-9]{4}-[0-9]{2}-[0-9]{2}')


Of course, this is not an exact date check. For example, it would also return True for values like 0000-13-91, and because str.match only anchors at the start of the string, trailing characters are accepted too. It was only meant to give you the idea.
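If a stricter check is wanted, one option (not the only one) is to let Python's datetime.strptime validate each entry. The helper name is_iso_date and the sample frame below are hypothetical, and the sketch assumes the dates are expected in YYYY-MM-DD form.

```python
from datetime import datetime

import pandas as pd


def is_iso_date(value):
    """Return True only for strings that parse as a real YYYY-MM-DD date."""
    if not isinstance(value, str):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


# Hypothetical data: a valid date, an impossible date, a number, and plain text.
df = pd.DataFrame({"ABC": ["2022-04-19", "0000-13-91", 42, "hello"]})

# Apply the check to every entry of the column.
is_date = df["ABC"].map(is_iso_date)
print(is_date)
```

Unlike the regex, this rejects 0000-13-91, at the cost of running a Python function on every row.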



Monday, April 11, 2022

Linear regression: how to use scipy linregress

Regression is used when we want to examine the relationship between variables.

Linear regression uses the least squares method.

The concept is to draw a straight line through the plotted data points. The line is positioned so that it minimizes the sum of the squared vertical distances to the data points.

These distances are called "residuals" or "errors".
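To make the idea of residuals concrete, here is a small sketch with made-up numbers. It uses numpy.polyfit for the fit rather than scipy, purely to keep the example short, and compares the fitted line with a slightly tilted one.

```python
import numpy as np

# Made-up data points for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least squares fit of a straight line (a degree-1 polynomial).
slope, intercept = np.polyfit(x, y, 1)

# Residuals: vertical distance from each point to the fitted line.
residuals = y - (slope * x + intercept)
sse_fit = np.sum(residuals ** 2)

# Any other line, here one with a slightly steeper slope, gives a larger sum of squares.
sse_other = np.sum((y - ((slope + 0.1) * x + intercept)) ** 2)
print(sse_fit, sse_other)
```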

Below are the main steps involved:

Import the modules you need: Pandas, matplotlib and Scipy

Isolate Average_Pulse as x. Isolate Calorie_burnage as y

Get important key values with: slope, intercept, r, p, std_err = stats.linregress(x, y)

Create a function that uses the slope and intercept values to return a new value. This new value represents where on the y-axis the corresponding x value will be placed

Run each value of the x array through the function. This will result in a new array with new values for the y-axis: mymodel = list(map(myfunc, x))

Draw the original scatter plot: plt.scatter(x, y)

Draw the line of linear regression: plt.plot(x, mymodel)

Define the maximum and minimum values of the axes

Label the axes: "Average_Pulse" and "Calorie_Burnage"

import pandas as pd

import matplotlib.pyplot as plt

from scipy import stats

full_health_data = pd.read_csv("data.csv", header=0, sep=",")

x = full_health_data["Average_Pulse"]

y = full_health_data["Calorie_Burnage"]

slope, intercept, r, p, std_err = stats.linregress(x, y)


def myfunc(x):
    return slope * x + intercept


mymodel = list(map(myfunc, x))


plt.scatter(x, y)

plt.plot(x, mymodel)

plt.ylim(bottom=0, top=2000)

plt.xlim(left=0, right=200)

plt.xlabel("Average_Pulse")

plt.ylabel("Calorie_Burnage")

plt.show()




References:

https://www.w3schools.com/datascience/ds_linear_regression.asp

 

What is matplotlib.pyplot figure.add_subplot(111)?

These are subplot grid parameters encoded as a single integer. For example, "111" means "1x1 grid, first subplot" and "234" means "2x3 grid, 4th subplot".

An alternative form of add_subplot(111) is add_subplot(1, 1, 1).
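A short sketch of the two forms side by side. It assumes a non-interactive backend (Agg) so it runs without a display, and it reads the grid back through get_subplotspec().get_geometry(), which reports rows, columns, and a zero-based cell index.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

# Integer form: 2 rows, 3 columns, 4th cell (the first cell of the second row).
fig_a = plt.figure()
ax_a = fig_a.add_subplot(234)
geometry_a = ax_a.get_subplotspec().get_geometry()

# Three-argument form of the same request.
fig_b = plt.figure()
ax_b = fig_b.add_subplot(2, 3, 4)
geometry_b = ax_b.get_subplotspec().get_geometry()

print(geometry_a, geometry_b)
```

The three-argument form is the only option once any of the numbers exceeds 9.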


References:

https://stackoverflow.com/questions/3584805/in-matplotlib-what-does-the-argument-mean-in-fig-add-subplot111


Tuesday, April 5, 2022

What are Beta0 and Beta1 in linear regression?

β0 and β1 are unknown parameters called regression coefficients. β0 is the intercept (the value of E(Y) when X = 0); β1 is the slope, indicating the average change in Y when X increases by one unit.

In many real-world situations, the response of interest (profit, for instance) cannot be explained perfectly by a deterministic model, which is why the coefficients describe the mean of Y rather than each individual value.
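The roles of the two coefficients can be checked numerically. The values 3.0 and 2.0 below are assumed "true" coefficients chosen for illustration, and the noise term stands in for everything a deterministic model cannot explain.

```python
import numpy as np

beta0, beta1 = 3.0, 2.0  # assumed true coefficients, for illustration only


def expected_y(x):
    """Mean response E(Y) = beta0 + beta1 * x."""
    return beta0 + beta1 * x


at_zero = expected_y(0.0)                  # equals beta0
step = expected_y(5.0) - expected_y(4.0)   # equals beta1

# Noisy observations around the line, then a least squares fit to estimate the coefficients.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = expected_y(x) + rng.normal(0.0, 1.0, size=x.size)
b1_hat, b0_hat = np.polyfit(x, y, 1)
print(at_zero, step, b0_hat, b1_hat)
```

The estimates land near, but not exactly on, the assumed values, because the noise is not part of the line.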


What is the 95% confidence interval for the regression parameter β1?

A 95% confidence interval for βi can be described in two equivalent ways. It is the set of values for βi that a hypothesis test at the 5% level would not reject. Equivalently, it is produced by a procedure that, over repeated samples, covers the true value of βi 95% of the time.
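One common way to compute such an interval for the slope is estimate ± t × standard error, with the t quantile taken at n - 2 degrees of freedom. The sketch below does this with scipy on simulated data; the data, seed, and variable names are assumptions made for the example.

```python
import numpy as np
from scipy import stats

# Simulated data: a line with slope 0.5 plus random noise.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 40)
y = 1.0 + 0.5 * x + rng.normal(0.0, 1.0, size=x.size)

result = stats.linregress(x, y)

# Two-sided 95% interval: t quantile with n - 2 degrees of freedom.
t_crit = stats.t.ppf(0.975, df=x.size - 2)
ci_low = result.slope - t_crit * result.stderr
ci_high = result.slope + t_crit * result.stderr
print(ci_low, ci_high)

# Link to the hypothesis test: 0 lies outside the interval
# exactly when the test of slope = 0 has p < 0.05.
zero_excluded = ci_low > 0 or ci_high < 0
print(zero_excluded, result.pvalue < 0.05)
```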



For simple linear regression, the chief null hypothesis is H0: β1 = 0, and the corresponding alternative hypothesis is H1: β1 ≠ 0. If this null hypothesis is true, then from E(Y) = β0 + β1x we can see that the population mean of Y is β0 for every x value, which tells us that x has no effect on Y.



A p-value is the probability of obtaining results at least as extreme as those observed, assuming that the null hypothesis is true. The lower the p-value, the stronger the evidence against the null hypothesis. A p-value of 0.05 or lower is conventionally considered statistically significant.
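In scipy, the p value returned by linregress is for the null hypothesis that the slope is zero. A rough sketch on simulated data (all values here are made up) contrasts a response that depends on x with one that does not.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 100)

# Response that really depends on x (slope 2) versus pure noise.
y_related = 2.0 * x + rng.normal(0.0, 1.0, size=x.size)
y_unrelated = rng.normal(0.0, 1.0, size=x.size)

p_related = stats.linregress(x, y_related).pvalue
p_unrelated = stats.linregress(x, y_unrelated).pvalue

# p_related is tiny, so H0 (slope = 0) is rejected.
# p_unrelated is usually large, although roughly 5% of random draws
# would still fall below 0.05 by chance.
print(p_related, p_unrelated)
```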


What is variance in linear regression?

In terms of linear regression, variance is a measure of how far the observed values differ from the mean of the predicted values.