Full Machine Learning Experiment Using Historical IPO Data - Source Code Included

By Shawn Hainsworth posted 08-07-2019 23:19

I have created an in-depth legal machine learning experiment using historical initial public offering (IPO) data. The experiment is written in Python, and the full source code is available on GitHub.

The goal of this experiment is to teach by example. I cover all of the major steps in a data science experiment:

  • Data Import and Cleansing
  • Data Exploration and Visualization
  • Feature Engineering
  • Training and Evaluating a Model

Business Objective

The business objective is to predict whether or not an IPO will be under-priced, using only the information that is available before the offering. An offering is considered under-priced if the offering price is lower than the closing price at the end of the first day of trading. Given the small size of the data set and the difficulty of the business problem, our primary objective is to understand the process and techniques of a data science experiment rather than to generate a significant model.
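As a minimal sketch of how such a target could be labeled, the snippet below marks an offering as under-priced when its first-day closing price is above its offering price. The column names (`offer_price`, `first_day_close`) and the sample values are hypothetical and are not taken from the actual data files.

```python
import pandas as pd

# Hypothetical column names and made-up values, for illustration only.
ipos = pd.DataFrame({
    "offer_price": [10.0, 15.0, 8.0],
    "first_day_close": [12.5, 14.0, 8.0],
})

# First-day return relative to the offering price.
ipos["first_day_return"] = (
    ipos["first_day_close"] - ipos["offer_price"]
) / ipos["offer_price"]

# Binary target: 1 if under-priced (the stock closed above the offering price).
ipos["underpriced"] = (ipos["first_day_close"] > ipos["offer_price"]).astype(int)

print(ipos)
```

Only the label is derived from first-day trading data. The features used for prediction would have to come from information known before the offering.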

This experiment is based on the work of Jay R. Ritter, the Joseph B. Cordell Eminent Scholar in the Department of Finance at the University of Florida, who has kindly made his data files available online. The data covers the years 1975 to 1984.

Part 1 - Data Preparation

The first step in our legal machine learning experiment is to import, clean, transform and visualize data using Pandas, NumPy and Matplotlib. This will include the following steps:

  • Import data from Excel files
  • Clean and transform data
  • Merge data frames on a key column
  • Create calculated columns (feature engineering)
  • Visualize the data and remove outliers
  • Write cleaned and merged files to XDF files

The following blog post details the data preparation steps for this experiment.
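As a rough illustration of the merge, calculated-column and outlier-removal steps, here is a small Pandas sketch. The in-memory frames stand in for the imported Excel sheets, the column names are hypothetical, and the 1.5 × IQR outlier rule is an assumption chosen for the example rather than the rule used in the experiment.

```python
import pandas as pd

# Stand-ins for two imported Excel sheets. In practice each frame would be
# loaded with pd.read_excel; the column names here are hypothetical.
offerings = pd.DataFrame({
    "ipo_id": [1, 2, 3, 4, 5, 6],
    "offer_price": [10.0, 12.0, 8.0, 15.0, 9.0, 11.0],
    "shares_offered": [1_000_000, 500_000, 750_000, 2_000_000, 600_000, 900_000],
})
trading = pd.DataFrame({
    "ipo_id": [1, 2, 3, 4, 5, 6],
    "first_day_close": [11.0, 12.5, 7.5, 16.0, 45.0, 11.2],
})

# Merge the two data frames on the shared key column.
merged = offerings.merge(trading, on="ipo_id", how="inner")

# Calculated columns.
merged["proceeds"] = merged["offer_price"] * merged["shares_offered"]
merged["first_day_return"] = merged["first_day_close"] / merged["offer_price"] - 1.0

# Remove outliers with the 1.5 * IQR rule on the first-day return
# (an assumed rule, used here only for illustration).
q1, q3 = merged["first_day_return"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = merged["first_day_return"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = merged[in_range].reset_index(drop=True)

print(cleaned)
```

In this made-up sample, the offering with a first-day close of 45.0 falls far outside the range and is dropped, leaving five rows.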

Part 2 - Feature Engineering

Feature selection is used to identify which features are most relevant for building a model. Removing irrelevant features from our model can have the following benefits:

  • Improves the accuracy of our model
  • Reduces over-fitting
  • Reduces training time

The following blog post details the feature engineering steps for this experiment.
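To make the idea of feature selection concrete, the sketch below uses scikit-learn's `SelectKBest` with ANOVA F-scores on synthetic data. This is one common approach picked for illustration; it is an assumption, not necessarily the method used in the experiment, and the data set is generated rather than real.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data: 10 features, only 3 of which are informative.
X, y = make_classification(
    n_samples=300,
    n_features=10,
    n_informative=3,
    n_redundant=0,
    random_state=0,
)

# Keep the 3 features with the highest ANOVA F-scores against the target.
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X, y)

# Column indices of the features that were kept.
selected_indices = selector.get_support(indices=True)
print("Selected feature indices:", selected_indices)
print("Shape before:", X.shape, "after:", X_selected.shape)
```

Univariate scoring like this is fast but evaluates each feature on its own, so it can miss features that are only useful in combination.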

Part 3 - Training and Evaluating a Model

In Part 3, I train and evaluate a number of different binary classification models using both scikit-learn and the Microsoft revoscalepy and microsoftml packages. I use hyperparameter optimization with cross-validation to tune the random forest classification algorithm, and I also implement ensemble learning.
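The scikit-learn side of this can be sketched roughly as follows: a grid search with 5-fold cross-validation over a random forest, followed by a soft-voting ensemble. The synthetic data, the grid values and the choice of logistic regression as the second ensemble member are all assumptions made for the example, not details from the experiment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data standing in for the IPO features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hyperparameter search for the random forest with 5-fold cross-validation.
# The grid values are illustrative only.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=5,
)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)

# Soft-voting ensemble: averages the predicted probabilities of its members.
ensemble = VotingClassifier(
    estimators=[
        ("rf", search.best_estimator_),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)
test_accuracy = ensemble.score(X_test, y_test)
print("Ensemble test accuracy:", test_accuracy)
```

Holding out a test set that the grid search never sees keeps the final score from being inflated by the tuning process.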

I use the receiver operating characteristic (ROC) curve and the area under the curve (AUC) for model evaluation.
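For readers new to these metrics, here is a tiny worked example using scikit-learn. The labels and scores are made up for illustration.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up true labels and predicted probabilities for the positive class.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# ROC curve: false positive rate vs. true positive rate across thresholds.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# AUC: area under that curve.
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```

The AUC can be read as the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. Here three of the four positive/negative pairs are ranked correctly, which gives 0.75. A value of 0.5 corresponds to random guessing.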

The following blog post details the model training and evaluation steps for this experiment.

Technical Notes

This experiment is written in Python. The source code includes a Jupyter notebook as well as a separate Python script. The Jupyter notebook uses standard Python packages such as scikit-learn. The Python script uses features included with Microsoft Machine Learning Server version 9.3 and the revoscalepy package.

For more information on the Microsoft Machine Learning Server, see my Pluralsight course, Scalable Machine Learning using the Microsoft Machine Learning Server.