Stock Prediction using LSTM: The Basics

Nikhil Gauba

(Image generated by OpenAI)

Stock price prediction has long been a topic of fascination, and a challenging task, in the data science community. By analyzing historical stock prices, we can attempt to forecast future movements using machine learning models. In this article, I will walk you through the process of building an extremely basic stock price prediction model using Long Short-Term Memory (LSTM) networks. We will be using Python, along with libraries such as TensorFlow, scikit-learn, and matplotlib, to prepare, visualize, and model the stock price data. By the end of this project, the model achieves a Root Mean Squared Error (RMSE) of ~0.678 on the test set, a reasonably low error for this dataset.

A brief introduction to LSTM

LSTMs are a type of Recurrent Neural Network (RNN) designed to capture patterns in sequential data, like our stock price data. Unlike traditional neural networks, LSTMs have a unique structure that enables them to retain information over long time periods. This makes them particularly effective for time-series forecasting, where understanding past trends and dependencies is crucial.

The key innovation in LSTMs is the use of memory cells that control how information flows through the network. These cells use gates (forget, input, and output) to regulate which information to discard, keep, and update, effectively mitigating the vanishing-gradient problem commonly found in standard RNNs.

Together, these properties make LSTMs highly valuable for stock price prediction, as the network can learn from historical price movements, capturing both short-term and long-term fluctuations to make more accurate predictions.

Let us now walk through the procedure of implementing an LSTM model for stock prediction. We use synthetic data for this project to keep things easy to understand. The dataset, along with the code notebook for the following sections, can be found in my GitHub repository. The images in the following sections are snapshots from the Google Colab session where I wrote the code.

Procedure

Figure 1 — Importing libraries and reading data
  • We mount Google Drive using drive.mount to access the data and import the libraries we expect to need during this project. The numpy and pandas libraries are used for numerical computations and data handling, matplotlib for visualizing stock price trends, and MinMaxScaler from sklearn.preprocessing for scaling the data. TensorFlow's Sequential model and its LSTM and Dense layers are used to build and train the neural network.
    Next, as seen in Figure 1 above, we load our data into a dataframe and view the first few rows to get an understanding of the data and what columns are included.
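A minimal sketch of this setup (the CSV file name and Drive path below are placeholders, not the actual ones from my notebook):

```python
# Mount Google Drive (Colab) and import the libraries used throughout.
from google.colab import drive
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

drive.mount('/content/drive')

# Placeholder path: point this at your own copy of the dataset.
df = pd.read_csv('/content/drive/MyDrive/stock_data.csv')
df.head()
```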
Figure 2 — Custom Index and sorting
  • Next, we process the dataset to make it suitable for time-series analysis. This includes renaming and indexing: the 'Index' column is renamed to 'Date', converted to date-time format, and set as the index. Moreover, columns like 'Open', 'High', 'Low', and 'Volume' are dropped to focus solely on the 'Close' prices, which we want to predict. After this, as standard procedure, we sort the data chronologically to ensure that our analysis respects the natural time order.
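In code, this step looks roughly like the following (column names taken from the description above):

```python
# Rename 'Index' to 'Date', parse it as datetime, and make it the index.
df = df.rename(columns={'Index': 'Date'})
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')

# Keep only the 'Close' column, which is what we want to predict.
df = df.drop(columns=['Open', 'High', 'Low', 'Volume'])

# Sort chronologically so sequence order reflects real time.
df = df.sort_index()
```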
Figure 3 — Time-series plot of closing prices
  • We visualize the closing prices over time using matplotlib to get a feel for the trends. There are no null values in our dataset since it is synthetic, and it would be outside the scope of this project to cover filling null values. We shall do that in another project some time soon.
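A minimal version of the plot in Figure 3:

```python
# Plot the closing prices over time to eyeball the trend.
plt.figure(figsize=(12, 6))
plt.plot(df.index, df['Close'])
plt.title('Closing Price Over Time')
plt.xlabel('Date')
plt.ylabel('Close')
plt.show()
```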
Figure 4 — Splitting and Scaling training data
  • We split the dataset into training and testing sets and then scale the data to optimize model performance. Using train_test_split from the sklearn.model_selection module, we divide the 'Close' prices into 70% training and 30% testing data. We set shuffle=False to maintain the temporal order of the data, which is essential for time-series forecasting.
    We scale using the MinMaxScaler to normalize the values between 0 and 1, which helps the LSTM model learn more effectively and recognize patterns. It is important to fit the scaler on the training data only, to avoid data leakage (simpler: the model should never get a peek at the test set's minimum and maximum values).
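Sketched in code (variable names are mine, following the earlier snippets):

```python
# Chronological 70/30 split; shuffle=False preserves the time order.
train_data, test_data = train_test_split(
    df[['Close']], train_size=0.7, shuffle=False
)

# Fit the scaler on the training data ONLY, so no information about
# the test set's range leaks into training.
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_train = scaler.fit_transform(train_data)
```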
Figure 5 — Creating training sequences and verification
  • LSTM models need data to be in sequences to recognize patterns over time. Here, we set a sequence_length of 60, meaning the model looks at the previous 60 days' closing prices to predict the next day's price. This window can be altered according to personal strategies and biases.
    We use a loop to create these sequences and their corresponding target values in X_train and y_train respectively. Each sequence consists of 60 past prices as input, with the subsequent day’s closing price as the target. After gathering all the sequences, we convert them to arrays and reshape X_train into the format (samples, timesteps, features), resulting in a shape of (710, 60, 1).
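A reconstruction of that loop (scaled_train is the scaled training array from the previous step):

```python
sequence_length = 60  # look back 60 days to predict day 61

X_train, y_train = [], []
for i in range(sequence_length, len(scaled_train)):
    X_train.append(scaled_train[i - sequence_length:i, 0])  # 60 past prices
    y_train.append(scaled_train[i, 0])                      # next day's price

X_train, y_train = np.array(X_train), np.array(y_train)

# Reshape to (samples, timesteps, features), as the LSTM layer expects.
X_train = X_train.reshape(X_train.shape[0], X_train.shape[1], 1)
print(X_train.shape)  # (710, 60, 1) for this dataset
```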
Figure 6 — Scaling the test data
  • We prepare the test data to be fed into the LSTM model by first scaling it with the MinMaxScaler fitted on the training data. Applying the already-fitted scaler (a transform, not a new fit) ensures that the test set is normalized using the same parameters as the training set.
    Next, we create sequences for the test set, similar to the prior step, using a loop to generate X_test from the past 60 days' closing prices and storing each subsequent day's closing price as the target in y_test. We then convert these lists into arrays. It should be noted that these windows are formed on a rolling basis.
    Finishing off, we reshape X_test to match the input shape required by the LSTM model, as we did in the prior step. The result, in the format (samples, timesteps, features), gives us a shape of (270, 60, 1).
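The same pattern as before, applied to the test set (a sketch under the same naming assumptions):

```python
# Transform (do NOT re-fit) the test data with the training scaler.
scaled_test = scaler.transform(test_data)

X_test, y_test = [], []
for i in range(sequence_length, len(scaled_test)):
    X_test.append(scaled_test[i - sequence_length:i, 0])  # rolling 60-day window
    y_test.append(scaled_test[i, 0])

X_test, y_test = np.array(X_test), np.array(y_test)
X_test = X_test.reshape(X_test.shape[0], X_test.shape[1], 1)
print(X_test.shape)  # (270, 60, 1)
```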
Figure 7 — Building and Training the LSTM model
  • The model starts with an LSTM layer of 50 units with return_sequences=True, allowing the layer to pass its full sequence output to the next LSTM layer. The input_shape is set to (X_train.shape[1], 1), which corresponds to 60 timesteps with one feature, the closing price.
    A dropout layer with a rate of 20% is applied after each LSTM layer, randomly dropping 20% of the units during training to reduce overfitting and help the model generalize better.
    Another LSTM layer with 50 units is added with return_sequences=False, indicating that this is the final LSTM layer; it outputs only the last timestep's hidden state for each sequence.
    The final Dense layer is a fully connected layer with a single neuron, acting as the output layer that predicts the stock price.
  • The model uses the Adam optimizer, an adaptive learning-rate method that adjusts the learning rate for each parameter during training, making optimization efficient.
    The mean_squared_error loss function is used because it measures the average squared difference between predicted and actual values.
  • The model is trained for 50 epochs (honestly too many for this use case, but this can be varied depending on how your loss behaves), meaning the training process iterates over the entire dataset 50 times. More epochs allow for more learning, but may lead to overfitting.
    A batch size of 32 means that the model updates its weights after processing 32 samples. This helps in efficient training, allowing the model to learn from multiple examples before updating the weights of the neurons.
  • During training, the model uses X_train and y_train for learning, and X_test and y_test for validation. The val_loss represents the model's performance on unseen data, helping monitor overfitting during training.
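Putting those pieces together, the model definition and training call look roughly like this (a sketch based on the description above, not a verbatim copy of the notebook):

```python
model = Sequential([
    # First LSTM layer returns the full sequence for the next LSTM layer.
    LSTM(50, return_sequences=True, input_shape=(X_train.shape[1], 1)),
    Dropout(0.2),  # drop 20% of units to reduce overfitting
    # Final LSTM layer returns only its last-timestep output.
    LSTM(50, return_sequences=False),
    Dropout(0.2),
    Dense(1),  # single output neuron: the predicted (scaled) closing price
])

model.compile(optimizer='adam', loss='mean_squared_error')

history = model.fit(
    X_train, y_train,
    epochs=50,
    batch_size=32,
    validation_data=(X_test, y_test),
)
```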
Figure 8 — Making predictions and Model Evaluation
  • The model then makes predictions on the test data. First, model.predict(X_test) generates the predicted values, which are still in the scaled format. We then inverse-scale these predictions to convert them back to their original scale for comparison with the actual stock prices.
    Next, we inverse-transform y_test to the original scale and calculate the RMSE using mean_squared_error from sklearn. The RMSE, approximately 0.68, measures the average error between the predicted values and the actual values, indicating that the model performs well in forecasting these (synthetic) stock prices. This means that, on average, the model's predictions deviate from the actual stock prices by about 0.68 units in the original scale of the data.
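In code, the evaluation step looks something like this:

```python
from sklearn.metrics import mean_squared_error

# Predictions come out in the scaled [0, 1] range.
predictions = model.predict(X_test)

# Invert the scaling to get back to original price units.
predictions = scaler.inverse_transform(predictions)
y_test_actual = scaler.inverse_transform(y_test.reshape(-1, 1))

# RMSE is the square root of the mean squared error.
rmse = np.sqrt(mean_squared_error(y_test_actual, predictions))
print(f'RMSE: {rmse:.3f}')  # ~0.68 on this synthetic dataset
```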
Figure 9 — Prediction vs. Actual Stock prices
  • In the final step of this project, we plot the model’s predicted stock prices against the actual stock prices to visualize its performance. The blue line represents the actual closing prices, while the red line shows the predicted prices.
    We can see that the model tracks the stock movement well, even though the predictions look slightly lagged and fail to capture the sharpest fluctuations. Overall, though, this looks like a good model based solely on the closing prices.
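For completeness, the comparison plot can be reconstructed like this (the date slicing assumes the rolling windows from the earlier snippets):

```python
# Prediction dates start sequence_length days into the test set.
test_dates = test_data.index[sequence_length:]

plt.figure(figsize=(12, 6))
plt.plot(test_dates, y_test_actual, color='blue', label='Actual')
plt.plot(test_dates, predictions, color='red', label='Predicted')
plt.title('Predicted vs. Actual Closing Prices')
plt.xlabel('Date')
plt.ylabel('Close')
plt.legend()
plt.show()
```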

While the results look promising with an RMSE of 0.68, the model is not perfect. Let's be honest: stock prediction is tricky business, and there might be some pitfalls along the way. Perhaps the model missed some crucial market signals, or maybe some sneaky data leakage is hiding in the shadows.
Hopefully this is where some readers come through to spot any glaring mistakes, areas where I might have taken a shortcut, or any other inefficiencies. I'm open to any feedback, suggestions, collaborations (I have some more ideas I am currently working on as well), or even the occasional roast of the work! After all, the stock market waits for no one, and neither should a good critique xD
