Stock Price Prediction Using Machine Learning in Python

Chanon Sumpantapong
5 min read · Nov 26, 2022

This article covers stock price prediction with an Elon Musk theme, using machine learning in Python. We perform the analysis and build a predictive model on stock price data for Twitter (TWTR), Tesla (TSLA), and PayPal (PYPL). We use OHLC ('Open', 'High', 'Low', 'Close') data for these stocks from 1 January 2014 to 27 October 2022, a span of more than eight years.

Example of the dataset used for the analysis and the model.
The shape of an array is the number of elements in each dimension.

Dataset explanation: All three datasets have the same 7 columns, but the number of rows differs because the stocks do not cover exactly the same dates. TWTR and TSLA each have 2222 rows, while PYPL has 1843 rows available for the predictive model. We start the Exploratory Data Analysis (EDA) by plotting a time series chart for each stock.
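
As a rough sketch of the shape check described above, the snippet below builds synthetic stand-ins for the three datasets and prints their shapes. The helper `make_demo_ohlc` and the random prices are assumptions made for illustration only; the row counts and the 7 columns come from the text, and the real data would be loaded from the downloaded files instead.

```python
import numpy as np
import pandas as pd


def make_demo_ohlc(n_rows: int, seed: int = 0) -> pd.DataFrame:
    """Hypothetical helper: synthetic data with the 7 columns described above."""
    rng = np.random.default_rng(seed)
    dates = pd.bdate_range("2014-01-01", periods=n_rows)
    # A positive random-walk price series, not real market data.
    close = 50 * np.exp(np.cumsum(rng.normal(0, 0.01, n_rows)))
    open_ = close * (1 + rng.normal(0, 0.005, n_rows))
    high = np.maximum(open_, close) * 1.01
    low = np.minimum(open_, close) * 0.99
    return pd.DataFrame({
        "Date": dates,
        "Open": open_,
        "High": high,
        "Low": low,
        "Close": close,
        "Adj Close": close,
        "Volume": rng.integers(1_000_000, 50_000_000, n_rows),
    })


# Row counts taken from the text; the contents are placeholders.
stocks = {
    "TWTR": make_demo_ohlc(2222, seed=1),
    "TSLA": make_demo_ohlc(2222, seed=2),
    "PYPL": make_demo_ohlc(1843, seed=3),
}

# DataFrame.shape is (number of rows, number of columns).
shapes = {ticker: df.shape for ticker, df in stocks.items()}
for ticker, shape in shapes.items():
    print(ticker, shape)
```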

Twitter (TWTR)

The plot of the closing price shows that the price of Twitter stock has been unstable.

Tesla (TSLA)

The plot of the closing price shows that the price of Tesla stock follows an upward trend.

PayPal (PYPL)

The plot of the closing price shows that the price of PayPal stock follows an upward trend from 2014 to 2022.

TWTR + TSLA + PYPL

In this graph, TSLA clearly outperforms the other two stocks in 2021.

If we observe carefully, the data in the 'Close' column and the data in the 'Adj Close' column appear to be the same. Let's check whether this is the case for every row. The check confirms that 'Close' and 'Adj Close' contain the same data in all rows. Close and adjusted close are defined differently in general, but in this dataset the column is redundant and will not help, so we drop 'Adj Close' before further analysis.
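
A minimal sketch of that row-by-row check, assuming the data sits in a pandas DataFrame with 'Close' and 'Adj Close' columns. The three rows below are made-up values for illustration.

```python
import pandas as pd

# Placeholder values; the real DataFrame holds the downloaded stock data.
df = pd.DataFrame({
    "Close": [10.0, 10.5, 10.2],
    "Adj Close": [10.0, 10.5, 10.2],
    "Volume": [100, 120, 90],
})

# True only if the two columns match in every single row.
same_in_every_row = bool((df["Close"] == df["Adj Close"]).all())
print(same_in_every_row)

# Drop the redundant column only when the check passes.
if same_in_every_row:
    df = df.drop(columns=["Adj Close"])
```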

The TWTR distribution of OHLC data

In the TWTR distribution plot of OHLC data, we can see two peaks, which means the data has varied significantly in two regions. The Volume data is left-skewed.

The TSLA distribution of OHLC data

In the TSLA distribution plot of OHLC data, we can see one peak, which means the data has varied significantly in one region. The Volume data is left-skewed.

The PYPL distribution of OHLC data

In the PYPL distribution plot of OHLC data, we can see one peak, which means the data has varied significantly in one region. The Volume data is left-skewed.
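
The skew direction read from a distribution plot can be cross-checked numerically. The sketch below uses pandas' `Series.skew()`, where a negative value indicates a left-skewed distribution (long tail to the left) and a positive value a right-skewed one. The two small series are made-up examples, not the stock data.

```python
import pandas as pd

# Made-up samples: one with a long left tail, one with a long right tail.
left_tailed = pd.Series([1, 8, 9, 9, 10, 10, 10])
right_tailed = pd.Series([1, 1, 1, 2, 2, 3, 10])

skews = {
    "left_tailed": left_tailed.skew(),    # negative: left-skewed
    "right_tailed": right_tailed.skew(),  # positive: right-skewed
}
for name, value in skews.items():
    print(name, round(value, 2))
```

Applying the same call to the 'Volume' column of each stock would give a quick numeric confirmation of what the plots show.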

Outlier checking

TWTR

From the above TWTR boxplots, we can conclude that all columns contain outliers, especially the Volume boxplot.

TSLA

From the above TSLA boxplots, we can conclude that all columns contain outliers, especially the Volume boxplot.

PYPL

From the above PYPL boxplots, we can conclude that all columns contain outliers, especially the Volume boxplot.
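
A boxplot conventionally marks as outliers the points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below counts such points; `count_iqr_outliers` is a hypothetical helper and the sample values are invented for illustration.

```python
import pandas as pd


def count_iqr_outliers(values: pd.Series) -> int:
    """Count points outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return int(((values < lower) | (values > upper)).sum())


# Invented volume-like sample with one extreme value.
volume = pd.Series([10, 11, 12, 11, 10, 12, 11, 100])
n_outliers = count_iqr_outliers(volume)
print(n_outliers)
```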

TWTR

From the above bar graph, we can conclude that the stock price increased significantly from 2017 to 2021.

Here are some important observations from the above grouped data: Prices are lower in quarter-end months than in non-quarter-end months. The volume of trades is also lower in quarter-end months.
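
The yearly and quarter-end groupings behind these observations could be sketched as follows. The column names `year`, `month`, and `is_quarter_end`, the rule "month divisible by 3", and the four sample rows are all assumptions for illustration.

```python
import pandas as pd

# Invented rows; the real DataFrame holds the full price history.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-15", "2020-03-16", "2021-06-15", "2021-07-15"]),
    "Close": [10.0, 8.0, 9.0, 11.0],
    "Volume": [100, 80, 70, 110],
})

df["year"] = df["Date"].dt.year
df["month"] = df["Date"].dt.month
# March, June, September, and December close a calendar quarter.
df["is_quarter_end"] = (df["month"] % 3 == 0).astype(int)

# Average close per year (the basis for a yearly bar graph).
by_year = df.groupby("year")["Close"].mean()

# Average close and volume for quarter-end vs. other months.
by_quarter_end = df.groupby("is_quarter_end")[["Close", "Volume"]].mean()
print(by_year)
print(by_quarter_end)
```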

Above, we added some more columns that will help in training our model. We also added the target feature, a signal indicating whether to buy or not, and we will train our model to predict only this. Before proceeding, let's use a pie chart to check whether the target is balanced. The target is 1 if close(t-1) is greater than close(t), and 0 otherwise.

From the pie chart: class 0 is 50.9% and class 1 is 49.1%, which means close(t) was lower than close(t-1) on 49.1% of the days.
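
A minimal sketch of the target rule as stated above (1 when the previous close is higher than the current close), followed by the class shares that a pie chart would display. The five closing prices are made up, and the first row, which has no previous close, falls into class 0 here by assumption.

```python
import pandas as pd

# Invented closing prices for illustration.
close = pd.Series([10.0, 9.0, 9.5, 9.2, 9.8], name="Close")

# shift(1) aligns each row with the previous day's close, close(t-1).
# The first comparison is against NaN and evaluates to False, so it becomes 0.
target = (close.shift(1) > close).astype(int)

# Share of each class, i.e. the numbers shown in the pie chart.
balance = target.value_counts(normalize=True)
print(target.tolist())
print(balance)
```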

Correlation checking

From the above heatmap, we can say that there is a high correlation between the OHLC columns, which is pretty obvious, and that the added features are not highly correlated with each other or with the previously provided features. This means we are good to go and can build our model.
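
The numbers behind such a heatmap come from a correlation matrix. The sketch below computes one with `DataFrame.corr()` on synthetic prices and flags pairs above a 0.9 threshold. The threshold, the `open-close` feature name, and the random data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic random-walk close price, not real market data.
close = 100 + np.cumsum(rng.normal(0, 1, 500))
df = pd.DataFrame({
    "Open": close + rng.normal(0, 0.2, 500),
    "High": close + rng.uniform(0.2, 1.0, 500),
    "Low": close - rng.uniform(0.2, 1.0, 500),
    "Close": close,
})
# Hypothetical engineered feature: gap between open and close.
df["open-close"] = df["Open"] - df["Close"]

corr = df.corr()
# Boolean matrix marking strongly correlated pairs.
highly_correlated = corr.abs() > 0.9
print(highly_correlated)
```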

Data Splitting and Normalization

After selecting the features to train the model on, we normalize the data, because normalized data leads to stable and fast training of the model. The whole dataset is then split into two parts with a 90/10 ratio so that we can evaluate the performance of our model on unseen data. The resulting shapes are (1999, 3) for the training set and (223, 3) for the validation set.
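
A sketch of this step with scikit-learn, assuming `StandardScaler` for the normalization and `train_test_split` for the 90/10 split. The feature matrix and target are random placeholders with the same number of rows (2222) and features (3) as in the text, and the `random_state` value is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
features = rng.normal(size=(2222, 3))   # placeholder for the 3 selected features
target = rng.integers(0, 2, size=2222)  # placeholder 0/1 target

# Normalize each feature to zero mean and unit variance.
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

# Hold out 10% of the rows for validation.
X_train, X_valid, y_train, y_valid = train_test_split(
    features_scaled, target, test_size=0.1, random_state=2022
)
print(X_train.shape, X_valid.shape)  # (1999, 3) (223, 3)
```

This follows the order described above, normalizing first and splitting second. A common alternative is to fit the scaler on the training part only, so that no validation statistics influence training.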

Model Development and Evaluation

Now it is time to train some machine learning models (Logistic Regression, Support Vector Machine, XGBClassifier) and then, based on their performance on the training and validation data, choose which model serves the purpose at hand better. For the evaluation metric, we use ROC-AUC. Why? Instead of predicting hard labels, that is 0 or 1, we would like the model to predict soft probabilities, which are continuous values between 0 and 1. With soft probabilities, ROC-AUC is generally used to measure the quality of the predictions.
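
The training and evaluation loop could look roughly like the sketch below. It covers only the two scikit-learn models to stay dependency-free; an `XGBClassifier` from the xgboost package could be added to the `models` dictionary in the same way. The synthetic data, the polynomial kernel, and the other settings are assumptions, not the configuration used for the results in this article.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
# Synthetic target with a weak signal from the first feature.
y = (X[:, 0] + rng.normal(scale=2.0, size=600) > 0).astype(int)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.1, random_state=0
)

models = {
    "LogisticRegression": LogisticRegression(),
    # probability=True enables predict_proba for the SVM.
    "SVC": SVC(kernel="poly", probability=True),
}

auc_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Column 1 of predict_proba is the soft probability of class 1.
    train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    valid_auc = roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1])
    auc_scores[name] = (train_auc, valid_auc)
    print(name, "train AUC:", train_auc, "validation AUC:", valid_auc)
```

A large gap between the training and validation scores of a model is a hint that it overfits the training data.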

Note: we ran the same method on the other two stocks, and the results are shown below.

TSLA

From the above bar graph, we can conclude that the stock price increased significantly from 2020 to 2022.
Here are some important observations from the above grouped data: Prices are higher in quarter-end months than in non-quarter-end months. The volume of trades is lower in quarter-end months.

Pie chart: the target is 1 if close(t-1) is greater than close(t), and 0 otherwise.

PYPL

From the above bar graph, we can conclude that the stock price increased significantly from 2020 to 2022.
Here are some important observations from the above grouped data: Prices are higher in quarter-end months than in non-quarter-end months. The volume of trades is lower in quarter-end months.
From the pie chart: class 0 is 52.3% and class 1 is 47.7%, which means close(t) was lower than close(t-1) on 47.7% of the days.

In all the pie charts, class 0 and class 1 each make up about 50 percent. This suggests that the price of these stocks cannot be predicted reliably, possibly because the data contains too little information to predict the stock price. Looking at the three models, the XGBClassifier results show that the best accuracy belongs to Twitter and Tesla, since their testing accuracy is higher than PayPal's. Put simply, testing accuracy measures how well a model that learned from the training dataset adapts to unseen data.


Written by Chanon Sumpantapong

Business strategist | Design Engineer | Data analysis Engineer | interested in finance 💵 & Data journalism 📊
