Overview
Problem Addressed
One of the largest needs I encounter daily in portfolio management is evaluating U.S. stocks to determine a fair value as a price target and whether they should be overweighted, neutral weighted, or underweighted. This information is essential not only to stock selection but to portfolio management as a whole: it drives whether a stock should be purchased or sold, and it shapes the desired exposure to that stock in a portfolio benchmarked to an index, where it determines the weighting of the security.
The assumption underlying all of this is that stock prices are efficient in the long term but suffer price displacements in the short term. Over time, we expect the price of a stock to fully reflect all available information, but in the short term, frictions, inefficiencies, and uncertainty provide opportunities to seek outperformance.
There are many ways to arrive at a price target for a stock, including fundamental analysis with cash flow or dividend discount models, technical analysis, and quantitative modeling. It would be ideal to conduct fundamental analysis and build a cash flow model for every investible stock and use those to drive valuation, but limited time and resources make such a significant task impossible, especially at a small firm. What is needed is a method to quantitatively evaluate stocks, predicting price and assigning a rating from available data, to indicate which stocks are the best candidates for allocating limited research resources toward a fundamental review. The data must also be timely, as outperformance in investing is often found by seeking mispricings in markets, so it is even more essential to have a tool that can evaluate stocks quickly and in bulk.
This leads to the need for a quantitative model for stock price prediction. Commercial options are available, as is research from firms such as investment banks. The difficulty is that commercially available models are costly, a burden for firms with limited resources, and widely used, so any informational edge they provide is likely to be quickly incorporated into market pricing. I see a proprietary stock price prediction model as a way to address the need for prediction and rating while avoiding the issues of cost and broad usage.
Approach
My approach for my capstone project was to use FactSet and Nasdaq stock datasets covering the trailing 10-year period to train a set of machine learning models for stock price prediction, deploy the best of them, and generate a rating for each stock based on its predicted price. I first downloaded the FactSet datasets, merged them together using PostgreSQL, and applied an initial filter that removed significant portions of the data irrelevant to this analysis. I then exported the data from the output table as a .csv file.
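To illustrate the export step, below is a minimal sketch of pulling the merged output table into a .csv file with pandas and SQLAlchemy. The connection string, table name, and filter are hypothetical placeholders, not the actual ones used.

```python
# Minimal sketch of exporting the merged PostgreSQL table to a .csv file.
# Connection details and the table name are illustrative placeholders.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/stocks")

query = """
    SELECT *
    FROM merged_factset   -- hypothetical name for the merged output table
    -- initial filtering of irrelevant rows would go here
"""
df = pd.read_sql(query, engine)
df.to_csv("factset_merged.csv", index=False)
```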
I conducted the next portion of the project in a Jupyter notebook using Python, where I cleaned and analyzed the data and trained and evaluated machine learning models. I imported the .csv file exported from PostgreSQL and verified that the data loaded successfully into the Python environment. I then conducted preliminary data cleaning, removing rows that were not individual stocks, not listed U.S. stocks, or missing critical datapoints like price, and addressing formatting issues to make the data uniform. I identified that the dataset did not include descriptive stock data such as sector, industry, or market capitalization, so I sourced another dataset from the Nasdaq stock screening tool, converted dollar market capitalization to ordinal market cap categories, and merged select columns into my existing data. The next step was to convert prices to log price returns, which addresses issues with the price distribution and is potentially better suited to machine learning. Finally, I added columns for the prior 12 months of lagged log price returns, since past price changes are likely to be correlated with the current price change.
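Below is a minimal sketch of the return transformation and lag construction, assuming hypothetical column names ('ticker', 'date', 'price'):

```python
# Sketch of converting prices to monthly log returns and adding 12 lagged
# return columns per stock. Column names are assumed, not the actual ones.
import numpy as np
import pandas as pd

df = pd.read_csv("factset_merged.csv", parse_dates=["date"])
df = df.sort_values(["ticker", "date"])

# Log price return: ln(P_t / P_{t-1}), computed within each ticker.
df["log_return"] = df.groupby("ticker")["price"].transform(
    lambda p: np.log(p / p.shift(1))
)

# Prior 12 months of lagged log returns as candidate predictors.
for lag in range(1, 13):
    df[f"log_return_lag_{lag}"] = df.groupby("ticker")["log_return"].shift(lag)
```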
Next, I addressed missing values in the dataset, dropping rows missing critical datapoints like price or the lagged prices (which removed the earliest year of data) and removing all columns with substantial missing data, which I defined as greater than 25% missing values. I then removed duplicate columns that resulted from merging multiple data sources. Finally, I reviewed the remaining columns and removed those without a fundamental basis for a relationship to price or those substantially similar to other columns.
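A minimal sketch of this missing-value and duplicate-column handling, continuing with the DataFrame from the previous sketch and using the 25% threshold described above:

```python
# Drop rows missing the target or lagged returns (removes the earliest year of data).
critical = ["log_return"] + [f"log_return_lag_{lag}" for lag in range(1, 13)]
df = df.dropna(subset=critical)

# Drop columns with more than 25% missing values.
missing_share = df.isna().mean()
df = df.drop(columns=missing_share[missing_share > 0.25].index)

# Drop duplicate columns (identical contents) introduced by merging multiple sources.
df = df.loc[:, ~df.T.duplicated()]
```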
The next step was further analysis of the data using descriptive statistics and graphical exploration. I first examined column descriptive statistics. Next, I looked at correlations among the numeric columns and found strong correlation in the data. This makes intuitive sense, as many variables on a company's balance sheet directly drive other values; enterprise value, for example, is driven by the company's assets and debt. I then explored the data for outliers using the interquartile range (IQR) method, defining outliers as values more than 1.5 × IQR outside the middle 50% of the data and extreme outliers as values more than 3 × IQR outside the middle 50%. I found a strong presence of outliers in the data. I believe these outliers are largely driven by the wide breadth of companies with different focuses and needs, so they mostly represent intrinsic variation in the data and should not be removed. I found a lesser degree of extreme outliers and believe these are potentially due to erroneous data or unusual situations, such as earnings manipulation, that do not represent the underlying fundamentals. I chose to clip these values at the extreme-outlier cutoff so they are still included but their impact is minimized. I then split the data into training and testing sets, stratified on market cap and sector, as I saw significant distribution disparities across those categories. Finally, I plotted the data extensively to look for underlying relationships.
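A minimal sketch of the outlier clipping and stratified split, assuming clipping at the 3 × IQR bounds, a 20% test set, and hypothetical stratification columns ('market_cap_category', 'sector'):

```python
from sklearn.model_selection import train_test_split

# Clip extreme outliers (beyond 3 * IQR outside the middle 50%) in numeric columns.
numeric_cols = df.select_dtypes(include="number").columns
for col in numeric_cols:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    df[col] = df[col].clip(lower=q1 - 3 * iqr, upper=q3 + 3 * iqr)

# Stratify the train/test split on a combined market-cap / sector label
# (assumes each combination has at least two rows).
strata = df["market_cap_category"].astype(str) + "_" + df["sector"].astype(str)
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=strata, random_state=42
)
```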
I then processed my data using a pipeline that combines processing for numeric, categorical, and ordinal variables, handling encoding, imputing, and scaling for each as needed. For numeric variables, I imputed remaining missing values with the median and scaled the variables. For categorical variables, I imputed missing values with the most frequent value and one-hot encoded the variables, dropping the first level to avoid collinearity. For ordinal variables, I imputed missing values with the most frequent value and ordinally encoded the data. Because of the strong correlation I had noted in the data, I also created a parallel dataset processed with principal component analysis (PCA), retaining 95% of the variation, which I believe may help address those correlations.
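A minimal sketch of such a preprocessing pipeline in scikit-learn; which columns fall into each group is assumed, and the PCA variant is shown as a parallel pipeline retaining 95% of the variance:

```python
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Illustrative feature groupings; the actual column lists differ.
numeric_features = [f"log_return_lag_{lag}" for lag in range(1, 13)]
categorical_features = ["sector", "industry"]
ordinal_features = ["market_cap_category"]

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    # Dense output so PCA can be applied downstream (scikit-learn >= 1.2).
    ("onehot", OneHotEncoder(drop="first", sparse_output=False)),
])
ordinal_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OrdinalEncoder()),
])

preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_features),
    ("cat", categorical_pipe, categorical_features),
    ("ord", ordinal_pipe, ordinal_features),
])

# Parallel version that also applies PCA, retaining 95% of the variance.
preprocess_pca = Pipeline([
    ("preprocess", preprocess),
    ("pca", PCA(n_components=0.95)),
])
```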
I then created and trained a set of machine learning models on the processed data: a neural network, a neural network on the PCA-transformed data, a random forest, a random forest on the PCA-transformed data, a linear regression, a linear regression on the PCA-transformed data, and a lasso regression. For the neural networks, random forests, and lasso model, I optimized the models using k-fold cross-validation in a grid search to tune hyperparameters. Once the models were trained, I used them to make predictions on the testing data and calculated model statistics to compare them and determine which model I believe to be best. Lastly, I exported the models as pickle files for preservation and deployment of the best model.
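As an illustration of the tuning and export steps, here is a minimal sketch for the random forest, reusing the preprocessing pipeline and train/test split names from the earlier sketches; the hyperparameter grid, scoring choice, and file name are placeholders:

```python
import pickle
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

rf_pipe = Pipeline([
    ("preprocess", preprocess),                        # from the previous sketch
    ("model", RandomForestRegressor(random_state=42)),
])

# Illustrative grid; the actual grid searched differs.
param_grid = {
    "model__n_estimators": [200, 500],
    "model__max_depth": [None, 10, 20],
}

# 5-fold cross-validated grid search over the hyperparameters.
search = GridSearchCV(rf_pipe, param_grid, cv=5, scoring="neg_mean_absolute_error")
search.fit(train_df.drop(columns="log_return"), train_df["log_return"])

# Persist the best estimator for preservation and deployment.
with open("random_forest_model.pkl", "wb") as f:
    pickle.dump(search.best_estimator_, f)
```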
Results
Model Metrics
Linear regression: MAE 0.0898, MSE 0.0155, RMSE 0.1245, R2 0.1096
Linear regression (PCA): MAE 0.0898, MSE 0.0155, RMSE 0.1244, R2 0.1107
Lasso regression: MAE 0.0919, MSE 0.0161, RMSE 0.1268, R2 0.0763
Random forest: MAE 0.0858, MSE 0.0145, RMSE 0.1206, R2 0.1653
Random forest (PCA): MAE 0.0892, MSE 0.0153, RMSE 0.1237, R2 0.1217
Neural network: MAE 0.0868, MSE 0.0146, RMSE 0.1209, R2 0.1605
Neural network (PCA): MAE 0.0866, MSE 0.0146, RMSE 0.1207, R2 0.1625
In evaluating the models, I considered mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), and R2. To put the metrics in context, the mean of the price change was -0.0026 and the standard deviation was 0.1319.

MAE is the average of the absolute errors of the model predictions and is in units of the predicted value, here price change. Lower values indicate relatively better model performance. Of the models in my analysis, the random forest had the lowest and best MAE of 0.0858, representing a mean error of 8.58 percentage points for price change. The results were similar for MSE and RMSE, which are average error metrics that penalize large errors more heavily than small ones. RMSE is the square root of MSE, so RMSE is in the units of price change while MSE is in those units squared; smaller values indicate a relatively better model for each. The random forest model had the lowest MSE and RMSE, 0.0145 and 0.1206 respectively, which corresponds to roughly a 12 percentage point error on average. The final metric, R2, indicates the share of variation explained by the model, with larger values indicating a better model. Again, the random forest model had the largest and best value at 0.1653, indicating that about 16.5% of the variation in price change is explained by the model.

Overall, these values indicate that even the best model is a relatively poor predictor of price change, which raises strong concerns about the effectiveness and value of applying the model as a component of stock selection and portfolio management. However, the model does explain some of the variation in price change, about 16%, and being only slightly better than average can lead to outperformance, so there may be value in using the model as one component of the mosaic of information analysts use in stock evaluation, provided its predictions in practice give analysts even a slight analytical advantage. Although the model itself is a poor predictor of price change, I believe it may provide enough additional information to an analyst's mosaic to move the needle and have a positive impact on an investment process. In order to preserve the commercial application of the best model, I have deployed the second-best model, the neural network model, below.
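For reference, a minimal sketch of how these metrics can be computed with scikit-learn, reusing the fitted model and test set names from the earlier sketches:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = test_df["log_return"]
y_pred = search.best_estimator_.predict(test_df.drop(columns="log_return"))

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)
print(f"MAE {mae:.4f}  MSE {mse:.4f}  RMSE {rmse:.4f}  R2 {r2:.4f}")
```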
Model
Download Upload Template
Please input the completed template below with stock data to generate predictions.
Predictions will take a moment to be computed.
Once calculated, a file with your predictions will appear below the input form for download.
*This is an academic project done for educational purposes and is not investment advice or a recommendation to buy or sell any security.*