Machine learning is a field of computer science that gives computer systems the ability to learn from data without being explicitly programmed.

There are many algorithms available in Python for machine learning. Linear regression is one of them.

We all know that linear regression is a popular technique, and you may well have seen its mathematical equation: y = mx + b, where m is the slope of the line and b is the y-intercept.

Here, however, we are going to use the Python implementation of linear regression.

To understand linear regression, you should have some basic knowledge of statistics.

Data is the most important thing in machine learning. If we do not have enough data, we cannot predict a result or make a decision, even if our algorithm is implemented correctly.

Once we have the dataset, we need to identify our features and label.

Here, a feature is an input variable, and the label is the validated output: the target value we want to predict.

We are going to use the diabetes dataset available in the sklearn library.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

diabetes_data = datasets.load_diabetes()

You can get dataset information using the following code:

diabetes_data.keys(), diabetes_data.data.shape

You can list all the feature names available in the dataset:

diabetes_data.feature_names

Here, we create a data frame of our diabetes data using pandas:

di = pd.DataFrame(diabetes_data.data)

We map the dataset's feature names to the columns of our data frame:

di.columns = diabetes_data.feature_names

We create a new column, target, holding the diabetes dataset's target values:

di['target'] = diabetes_data.target

Here we store the independent variables in x by dropping the target column:

x = di.drop('target', axis=1)

Now we create the object of our linear model:

rm = linear_model.LinearRegression()

You will get the following output (the exact representation depends on your scikit-learn version):

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

We can fit our linear regression model using the dataset we have already prepared:

rm.fit(x, di.target)

The following returns the intercept and coefficients of our linear model. Each coefficient describes the relationship between a feature and the target output.

A positive coefficient means the target tends to increase as that feature increases; a negative coefficient means it tends to decrease.

The magnitude of a coefficient, not its sign, reflects the strength of the relationship.

print(rm.intercept_)
print(rm.coef_)
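To see which feature each coefficient belongs to, we can pair them with the feature names. A minimal sketch, using the same objects built above:

```python
from sklearn import datasets, linear_model
import pandas as pd

diabetes_data = datasets.load_diabetes()
di = pd.DataFrame(diabetes_data.data, columns=diabetes_data.feature_names)
rm = linear_model.LinearRegression()
rm.fit(di, diabetes_data.target)

# Pair each feature name with its learned coefficient
for name, coef in zip(diabetes_data.feature_names, rm.coef_):
    print(f"{name}: {coef:.2f}")
```

This makes it easy to spot which features push the prediction up and which pull it down.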

Once we have fitted the model, we can predict values using the predict() method available in sklearn. Here, we predict values for the first 10 records:

rm.predict(x)[:10]

We can use as many features as we think are needed to get our desired result.

We can use matplotlib to plot our points around the regression line and get a visual idea of our model.

matplotlib is a Python 2D plotting library that produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. It can be used in Python scripts, the Python and IPython shells, Jupyter notebooks, web application servers, and several graphical user interface toolkits.
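As a sketch of such a plot (the styling choices here are illustrative), we can compare the model's predictions with the actual targets; points close to the diagonal indicate good predictions:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets, linear_model

diabetes_data = datasets.load_diabetes()
x = pd.DataFrame(diabetes_data.data, columns=diabetes_data.feature_names)
rm = linear_model.LinearRegression()
rm.fit(x, diabetes_data.target)
predicted = rm.predict(x)

# Scatter actual vs. predicted; the dashed diagonal marks perfect predictions
plt.scatter(diabetes_data.target, predicted, alpha=0.5)
plt.plot([diabetes_data.target.min(), diabetes_data.target.max()],
         [diabetes_data.target.min(), diabetes_data.target.max()], 'r--')
plt.xlabel('Actual target')
plt.ylabel('Predicted target')
plt.show()
```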

Training and Testing Dataset

Here, we first need to split our data into training and testing sets. The training dataset will always be bigger than the testing dataset.

With the training dataset, we measure the difference between the actual output and our model's output, and train the model to reduce that difference.

Once we finish training our model, we test it with the test data and check the model's score.

You should select the training and testing datasets randomly: if you hand-pick them, the training set may contain one type of data and the testing set an entirely different type. In that situation, even a properly trained model may never achieve the desired outcome. For this example, we are selecting the dataset manually.
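In practice, scikit-learn's train_test_split helper does this random selection for you. A minimal sketch on the diabetes data (the test_size and random_state values here are illustrative):

```python
import pandas as pd
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split

diabetes_data = datasets.load_diabetes()
x = pd.DataFrame(diabetes_data.data, columns=diabetes_data.feature_names)

# Hold out 20% of the rows for testing; random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(
    x, diabetes_data.target, test_size=0.2, random_state=0)

rm = linear_model.LinearRegression()
rm.fit(x_train, y_train)
print(rm.score(x_test, y_test))  # R^2 on data the model has never seen
```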

Now we need to validate our regression model. Here, we will use the MSE (mean squared error).

The mean squared error tells you how close a regression line is to a set of points. It does this by taking the distances from the points to the regression line (these distances are the “errors”) and squaring them. The squaring is necessary to remove any negative signs. It also gives more weight to larger differences. It’s called the mean squared error as you’re finding the average of a set of errors.
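The mean_squared_error and r2_score helpers we imported earlier compute these metrics directly. A sketch on the full dataset, reusing the fitted model from above:

```python
import pandas as pd
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

diabetes_data = datasets.load_diabetes()
x = pd.DataFrame(diabetes_data.data, columns=diabetes_data.feature_names)
rm = linear_model.LinearRegression()
rm.fit(x, diabetes_data.target)
predicted = rm.predict(x)

# Average of the squared residuals, and the R^2 score of the fit
mse = mean_squared_error(diabetes_data.target, predicted)
r2 = r2_score(diabetes_data.target, predicted)
print(mse, r2)
```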

You can also use numpy to compute the MSE (note that the whole residual, actual minus predicted, is squared):

np.mean((di.target - rm.predict(x)) ** 2)

You can use the scripts above to get started with linear regression in Python.

Thanks for reading!

At BoTree Technologies, we build enterprise applications with our Python team of 15+ engineers.

We also specialize in RPA, AI, Django, JavaScript and ReactJS.

Consulting is free – let us help you grow!