Artificial Intelligence has two main branches: Symbolic AI and Non-Symbolic AI (Machine Learning). Machine Learning refers to how a system identifies hidden patterns in data, learns from that data, and makes decisions based on what it has learned.
Linear regression is one of the most widely used predictive modeling techniques. Here we identify one variable as the dependent variable (the target) and one or more other variables as independent variables (the features). The model consists of coefficients and an intercept. There is one coefficient per independent variable, so the number of coefficients depends on the number of features, not on the number of observations.
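For reference, a linear regression model with p independent variables is conventionally written as below. The symbols are notation chosen here for illustration, not taken from the article.

```latex
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p
```

Here \beta_0 is the intercept, \beta_1 to \beta_p are the coefficients (one per feature x_1 to x_p), and \hat{y} is the predicted value, which in this article is the apparent temperature.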
In this article, linear regression is used to predict the apparent temperature.
Data Set: https://www.kaggle.com/budincsevity/szeged-weather
Environment: Google Colab (https://colab.research.google.com/)
Let’s see how we can do this.
First of all, all the required libraries should be imported.
I used Google Colab for this process and loaded the data set from Google Drive. The data set can be read with read_csv().
w_df.head(10) shows the first 10 rows of the data set, which confirms that the weather data set was loaded successfully.
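The loading step might look roughly like the sketch below. The Drive path in the comments is hypothetical, the exact CSV headers are assumed from the column names mentioned in this article, and the three rows are made-up stand-ins so the snippet runs outside Colab.

```python
import io

import pandas as pd

# In Colab the CSV would normally come from a mounted Drive folder, e.g.
#   from google.colab import drive
#   drive.mount("/content/drive")
#   w_df = pd.read_csv("/content/drive/MyDrive/weatherHistory.csv")  # hypothetical path
# A tiny in-memory stand-in with made-up values is used here instead.
sample_csv = io.StringIO(
    "Formatted Date,Precip Type,Humidity,Loud Cover,Apparent Temperature (C)\n"
    "2006-04-01 00:00:00.000 +0200,rain,0.89,0.0,7.4\n"
    "2006-04-01 01:00:00.000 +0200,rain,0.86,0.0,7.2\n"
    "2006-04-01 02:00:00.000 +0200,rain,0.89,0.0,9.4\n"
)

# read_csv accepts either a file path or a file-like object.
w_df = pd.read_csv(sample_csv)

# head(10) returns at most the first 10 rows.
print(w_df.head(10))
print(w_df.shape)
```

With the real file, only the argument to read_csv() changes; the rest of the walkthrough keeps working on w_df.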
Next, we should handle the missing values.
First of all, we should remove the duplicate values. There shouldn’t be more than one record for the same time on the same date.
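One way to enforce a single record per timestamp is sketched below. Deduplicating on the Formatted Date column alone is an assumption about how the step was done (dropping fully identical rows with a plain drop_duplicates() is another option), and the rows are made up.

```python
import pandas as pd

# Made-up rows: the second and third share the same timestamp.
w_df = pd.DataFrame({
    "Formatted Date": ["2006-04-01 00:00", "2006-04-01 01:00", "2006-04-01 01:00"],
    "Humidity": [0.89, 0.86, 0.86],
})

# Count the rows that repeat an earlier timestamp.
print(w_df.duplicated(subset=["Formatted Date"]).sum())

# Keep the first record for each timestamp and renumber the index.
w_df = w_df.drop_duplicates(subset=["Formatted Date"], keep="first").reset_index(drop=True)
print(len(w_df))
```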
Next, we can see a summarized description of the data using the w_df.describe() command.
Here we can see that all the values of ‘Loud Cover’ are 0. A column with a single constant value carries no information for the model, so the ‘Loud Cover’ feature can be removed.
Formatted Date is treated here as having no effect on the Apparent Temperature, so it can also be removed.
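The summary and the two column removals could be sketched as follows. The data frame is a made-up stand-in, and the min-equals-max check is just one illustrative way to spot a constant column such as ‘Loud Cover’.

```python
import pandas as pd

# Made-up stand-in for the weather data.
w_df = pd.DataFrame({
    "Formatted Date": ["2006-04-01 00:00", "2006-04-01 01:00", "2006-04-01 02:00"],
    "Humidity": [0.89, 0.86, 0.89],
    "Loud Cover": [0.0, 0.0, 0.0],
})

# describe() reports count, mean, std, min, quartiles and max per numeric column.
print(w_df.describe())

# A numeric column whose min equals its max is constant.
numeric = w_df.select_dtypes(include="number")
constant_cols = [c for c in numeric.columns if numeric[c].min() == numeric[c].max()]
print(constant_cols)

# Drop the constant column and the timestamp column.
w_df = w_df.drop(columns=["Loud Cover", "Formatted Date"])
print(list(w_df.columns))
```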
We can check whether each column contains any null values by using w_df.isnull().any(). It shows that Precip Type contains null values. We can then check the percentage of null values per column by using 100*w_df.isnull().sum()/len(w_df).
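Both checks are shown below on a small made-up frame, so the null share here is not the one from the real data set.

```python
import pandas as pd

# Made-up stand-in: one missing Precip Type out of four rows.
w_df = pd.DataFrame({
    "Precip Type": ["rain", None, "snow", "rain"],
    "Humidity": [0.89, 0.86, 0.70, 0.91],
})

# One boolean per column: does the column contain at least one null?
has_null = w_df.isnull().any()
print(has_null)

# Percentage of null values per column.
null_pct = 100 * w_df.isnull().sum() / len(w_df)
print(null_pct)
```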
Only 0.5% of the values are null. Rather than making any kind of change to them (for example, filling them in), it is better to drop the affected rows.
Let’s check again whether any null values remain after dropping them.
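Dropping the rows and re-checking can be sketched like this, again on made-up data. Using dropna() across all columns is an assumption; restricting it to Precip Type with the subset argument would work as well.

```python
import pandas as pd

# Made-up stand-in: one missing Precip Type out of four rows.
w_df = pd.DataFrame({
    "Precip Type": ["rain", None, "snow", "rain"],
    "Humidity": [0.89, 0.86, 0.70, 0.91],
})

rows_before = len(w_df)

# Drop every row that has a null in any column, then renumber the index.
w_df = w_df.dropna().reset_index(drop=True)

print(rows_before - len(w_df))  # number of rows removed
print(w_df.isnull().any())      # re-check: every column should now report False
```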
The next step is handling outliers.
First of all, boxplots should be created for each numerical feature.
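A minimal sketch of drawing one boxplot per numerical feature with matplotlib is given below. The helper names numeric_columns and plot_boxplots are hypothetical, the Wind Speed (km/h) column name is assumed, and the values are made up.

```python
import pandas as pd


def numeric_columns(df):
    """Return the names of the numerical columns, in their original order."""
    return list(df.select_dtypes(include="number").columns)


def plot_boxplots(df):
    """Draw one boxplot per numerical column, side by side."""
    # Imported here so numeric_columns() works even without matplotlib.
    import matplotlib.pyplot as plt

    cols = numeric_columns(df)
    # squeeze=False keeps `axes` two-dimensional even for a single column.
    fig, axes = plt.subplots(1, len(cols), figsize=(4 * len(cols), 4), squeeze=False)
    for ax, col in zip(axes[0], cols):
        ax.boxplot(df[col].dropna().to_numpy())
        ax.set_title(col)
    fig.tight_layout()
    return fig


# Made-up stand-in with one categorical and two numerical columns.
w_df = pd.DataFrame({
    "Precip Type": ["rain", "snow", "rain"],
    "Humidity": [0.89, 0.0, 0.91],
    "Wind Speed (km/h)": [14.1, 3.9, 11.0],
})
print(numeric_columns(w_df))
```

In a notebook, calling plot_boxplots(w_df) draws the figure. Points plotted beyond the whiskers are the candidates to inspect as outliers.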
Here we can see that Humidity has 0 values. From domain knowledge, we know that humidity cannot be 0. We call this a contextual outlier. Therefore the rows with a Humidity of 0 can be removed.
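Removing the zero-humidity rows comes down to a boolean filter, sketched here on made-up values.

```python
import pandas as pd

# Made-up stand-in containing two impossible zero readings.
w_df = pd.DataFrame({"Humidity": [0.89, 0.0, 0.91, 0.0, 0.75]})

# Count the zero readings before removing them.
print((w_df["Humidity"] == 0).sum())

# Keep only the rows with a non-zero humidity and renumber the index.
w_df = w_df[w_df["Humidity"] != 0].reset_index(drop=True)
print(w_df["Humidity"].min())
```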
Let’s go ahead with the other boxplots.
Here, too, we can see an outlier. It can be identified as a global outlier, so those values can be removed.
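The article does not state which feature or threshold was used for this removal, so the following is only a generic sketch. It assumes the common 1.5 × IQR rule that boxplots use to flag points, and both the helper drop_outside_fences and the column name "feature" are hypothetical.

```python
import pandas as pd


def drop_outside_fences(df, column, k=1.5):
    """Keep only rows whose `column` value lies within the k * IQR fences."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() is inclusive on both ends by default.
    return df[df[column].between(lower, upper)].reset_index(drop=True)


# Made-up values with one point far away from the rest.
demo = pd.DataFrame({"feature": [10.0, 11.0, 12.0, 13.0, 14.0, 100.0]})
cleaned = drop_outside_fences(demo, "feature")
print(cleaned["feature"].tolist())
```

A fixed, domain-based cutoff (as with the zero humidity values) is an equally valid choice when the implausible range is known in advance.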
We have to check three more boxplots.