Stock Prediction using LSTM: The Basics
Stock price prediction has long been both a source of fascination and a challenging task in the data science community. By analyzing historical stock prices, we can attempt to forecast future movements using machine learning models. In this article, I will walk you through the process of building an extremely basic stock price prediction model using Long Short-Term Memory (LSTM) networks. We will be using Python along with libraries such as TensorFlow, scikit-learn, and matplotlib to prepare, visualize, and model the stock price data. By the end of this project, we achieve a Root Mean Squared Error (RMSE) of ~0.678, indicating that the model tracks future stock prices quite closely on this data.
A brief introduction to LSTM
LSTMs are a type of Recurrent Neural Network (RNN) designed to capture patterns in sequential data, like our stock price data. Unlike traditional neural networks, LSTMs have a unique structure that enables them to retain information over long time periods. This makes them particularly effective for time series forecasting, where understanding past trends and dependencies is crucial.
The key innovation in LSTMs is the use of memory cells that control how information flows through the network. These cells consist of gates (forget, input, and output) that regulate which information to discard, keep, and update, effectively addressing the vanishing gradient problem commonly found in standard RNNs.
Together, these properties make LSTMs highly valuable for stock price prediction, as the network can learn from historical price movements, capturing both short-term and long-term fluctuations to make more accurate future predictions.
Let us now walk through the procedure of implementing an LSTM model for stock predictions. We use synthetic data for this project to keep things easy to understand. The dataset, along with the code notebook for the following sections, can be found in my GitHub repository. The images in the following sections are snapshots from the Google Colab session where I wrote the code.
Procedure
- We mount Google Drive using `drive.mount` and load the libraries we think we might need during the execution of this project. The `numpy` and `pandas` libraries are used for numerical computations and data handling, `matplotlib` is used for visualizing stock price trends, and `MinMaxScaler` from `sklearn.preprocessing` is used to scale the data. TensorFlow's `Sequential`, `LSTM`, and `Dense` classes are used to build and train the neural network.
Next, as seen in Figure 1 above, we load our data into a dataframe and view the first few rows to get an understanding of the data and what columns are included.
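If you want to follow along outside the notebook, here is a minimal sketch of this setup. The Drive path and the file name `stock_data.csv` are placeholders; substitute whatever the repository actually uses:

```python
# Mount Google Drive (Colab-specific) and import the libraries used below
from google.colab import drive
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

drive.mount('/content/drive')

# Placeholder path and file name; adjust to your own Drive layout
df = pd.read_csv('/content/drive/MyDrive/stock_data.csv')
df.head()
```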
- Next, we process the dataset to make it suitable for time-series analysis. This includes renaming and indexing: the ‘Index’ column is renamed to ‘Date’, converted to date-time format, and set as the index. Moreover, columns like ‘Open’, ‘High’, ‘Low’, and ‘Volume’ are dropped to focus solely on the ‘Close’ prices, which we want to predict. After this, as standard procedure, we sort the data chronologically to ensure that our analysis respects the natural time order.
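A sketch of that preprocessing, assuming the column names match the synthetic dataset described above:

```python
# Rename 'Index' to 'Date', parse it as datetime, and use it as the index
df = df.rename(columns={'Index': 'Date'})
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')

# Keep only the 'Close' column and sort chronologically
df = df.drop(columns=['Open', 'High', 'Low', 'Volume'])
df = df.sort_index()
```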
- We visualize the closing prices over the given time period using `matplotlib` to get a sense of the trends. There are no null values in our dataset since it is synthetic, and it would be outside the scope of this project to analyze strategies for filling null values. We shall do that in another project some time soon.
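The visualization step might look something like this (axis labels and figure size are my choices, not prescribed by the project):

```python
# Quick look at the closing-price trend over time
plt.figure(figsize=(12, 6))
plt.plot(df.index, df['Close'])
plt.xlabel('Date')
plt.ylabel('Close Price')
plt.title('Closing Prices Over Time')
plt.show()

# Sanity check: the synthetic dataset should contain no nulls
print(df.isnull().sum())
```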
- We split the dataset into training and testing sets and then scale the data to optimize model performance. Using `train_test_split` from the `sklearn.model_selection` module, we divide the ‘Close’ prices into 70% training and 30% testing data. We set `shuffle=False` to maintain the temporal order of the data, which is essential for time-series forecasting.
We use `MinMaxScaler` to normalize the values between 0 and 1. This is crucial, as it helps the LSTM model learn more effectively and recognize patterns. It is important to fit the scaler on the training data alone to avoid data leakage (simpler: the scaler should not learn the maximum and minimum values from the test set).
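A sketch of the split-and-scale step; note that only the training data is passed to `fit_transform`:

```python
from sklearn.model_selection import train_test_split

# 70/30 chronological split; shuffle=False preserves temporal order
train_data, test_data = train_test_split(
    df[['Close']], test_size=0.3, shuffle=False
)

# Fit the scaler on the training data only, to avoid leaking
# test-set min/max information into training
scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler.fit_transform(train_data)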
- LSTM models need data to be in sequences to recognize patterns over time. Here, we set a `sequence_length` of 60, meaning the model will look at the previous 60 days’ closing prices to predict the next day’s price. This can be altered according to personal strategies and biases.
We use a loop to create these sequences and their corresponding target values in `X_train` and `y_train` respectively. Each sequence consists of 60 past prices as input, with the subsequent day’s closing price as the target. After gathering all the sequences, we convert them to arrays and reshape `X_train` into the format `(samples, timesteps, features)`, resulting in a shape of (710, 60, 1).
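The windowing loop could look like this, continuing from the `train_scaled` array in the earlier sketch:

```python
sequence_length = 60  # look-back window of 60 days

X_train, y_train = [], []
for i in range(sequence_length, len(train_scaled)):
    X_train.append(train_scaled[i - sequence_length:i, 0])  # past 60 prices
    y_train.append(train_scaled[i, 0])                      # next day's price

X_train, y_train = np.array(X_train), np.array(y_train)

# LSTM layers expect input shaped as (samples, timesteps, features)
X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))
print(X_train.shape)  # (710, 60, 1) for this dataset
```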
- We prepare the test data to be fed into the LSTM model for making predictions by first applying the `MinMaxScaler` we fitted on the training dataset. This ensures that the test set is normalized using the same parameters as the training set.
Next, we create sequences for the test set similar to the prior step, using a loop to generate `X_test` from the past 60 days’ closing prices and storing the subsequent day’s closing price as the target in `y_test`. We then convert these lists into arrays. It should be noted that these windows are formed on a rolling basis.
Finishing off, we reshape `X_test` to match the input shape required by the LSTM model, as we did in the prior step. The result, in the format `(samples, timesteps, features)`, gives us a shape of (270, 60, 1).
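And the equivalent preparation for the test set, reusing the `scaler` and `sequence_length` from the earlier sketches:

```python
# Transform (not fit) the test set with the scaler fitted on training data
test_scaled = scaler.transform(test_data)

X_test, y_test = [], []
for i in range(sequence_length, len(test_scaled)):
    X_test.append(test_scaled[i - sequence_length:i, 0])  # rolling 60-day window
    y_test.append(test_scaled[i, 0])

X_test, y_test = np.array(X_test), np.array(y_test)
X_test = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))
print(X_test.shape)  # (270, 60, 1) for this dataset
```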
- The model starts with an LSTM layer of 50 units with `return_sequences=True`, allowing the layer to pass its sequence output to the next LSTM layer. The `input_shape` is set to `(X_train.shape[1], 1)`, which corresponds to 60 timesteps with one feature, the closing price.
A dropout layer with a rate of 20% is applied after each LSTM layer to randomly drop 20% of the neurons, helping the model avoid overfitting and generalize better.
Another LSTM layer with 50 units is added with `return_sequences=False`, indicating that this is the final LSTM layer, which outputs a single vector for each sequence.
The final Dense layer is a fully connected layer with a single neuron acting as the output layer to predict the stock price. A sketch of the full model follows this list.
- The model uses the `adam` optimizer, an adaptive learning rate method designed to optimize model training efficiently by adjusting learning rates based on the loss function’s progress. The `mean_squared_error` loss function is used because it measures the average squared difference between predicted and actual values.
- The model is trained for 50 epochs (actually too many for this use case, but this can be varied depending on how your loss behaves), meaning the training process iterates over the entire dataset 50 times. More epochs allow for better learning, but may lead to overfitting.
A batch size of 32 means that the model updates its weights after processing 32 samples or rows. This helps in efficient training, allowing the model to learn from multiple examples before updating the weights of the neurons.
- During training, the model uses `X_train` and `y_train` for learning, and `X_test` and `y_test` for validation. The `val_loss` represents the model’s performance on unseen data, helping monitor overfitting during training.
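Putting the architecture, compilation, and training configuration together, a sketch along these lines (the exact notebook code may differ slightly):

```python
model = Sequential([
    # First LSTM layer returns the full sequence for the next LSTM layer
    LSTM(50, return_sequences=True, input_shape=(X_train.shape[1], 1)),
    Dropout(0.2),  # randomly drop 20% of units to reduce overfitting
    # Final LSTM layer returns a single vector per sequence
    LSTM(50, return_sequences=False),
    Dropout(0.2),
    # Single-neuron output layer predicting the next closing price
    Dense(1),
])

model.compile(optimizer='adam', loss='mean_squared_error')

history = model.fit(
    X_train, y_train,
    epochs=50,
    batch_size=32,
    validation_data=(X_test, y_test),  # val_loss tracks unseen-data performance
)
```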
- The model makes predictions on the test data. First, `model.predict(X_test)` generates the predicted values, which are still in the scaled format. We then inverse-transform these predictions to convert them back to their original scale for comparison with the actual stock prices.
Next, we inverse transform `y_test` to the original scale and calculate the RMSE using `mean_squared_error` from `sklearn`. The RMSE, approximately 0.68, measures the average error between the predicted values and the actual values, indicating that the model performs well in forecasting these stock prices. This means that, on average, the model’s predictions deviate from the actual stock prices by about 0.68 units in the original scale of the data.
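The evaluation step, assuming the `scaler`, `model`, and `y_test` from the earlier sketches:

```python
from sklearn.metrics import mean_squared_error

# Predict, then map predictions and targets back to the original scale
predictions = model.predict(X_test)
predictions = scaler.inverse_transform(predictions)
y_test_actual = scaler.inverse_transform(y_test.reshape(-1, 1))

# RMSE is the square root of the mean squared error
rmse = np.sqrt(mean_squared_error(y_test_actual, predictions))
print(f'RMSE: {rmse:.3f}')  # roughly 0.68 in the original run
```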
- In the final step of this project, we plot the model’s predicted stock prices against the actual stock prices to visualize its performance. The blue line represents the actual closing prices, while the red line shows the predicted prices.
We can see that the model tracks the stock movement well, even though its predictions seem a bit lagged and unable to capture the most fine-grained fluctuations. Overall, though, this looks like a good model based solely on the closing prices.
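A possible version of that final plot, reusing the variables from the earlier sketches (the first `sequence_length` test dates are skipped because they have no corresponding prediction):

```python
plt.figure(figsize=(12, 6))
plt.plot(test_data.index[sequence_length:], y_test_actual,
         color='blue', label='Actual')
plt.plot(test_data.index[sequence_length:], predictions,
         color='red', label='Predicted')
plt.xlabel('Date')
plt.ylabel('Close Price')
plt.legend()
plt.show()
```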
While the results look promising with an RMSE of 0.68, it’s not perfect. Let’s be honest — stock prediction is tricky business, and there might be some pitfalls along the way. Perhaps the model missed some crucial market signals, or maybe I missed some sneaky data leakage hiding in the shadows.
Hopefully this is where some readers come in to spot any glaring mistakes, areas where I might have taken a shortcut, or any other inefficiencies. I’m open to any feedback, suggestions, collaborations (I have some more ideas that I am currently working on as well), or even the occasional roast of the work! After all, the stock market waits for no one, and neither should a good critique xD