Data Mining Techniques from Perl

This page is a demonstration of ETL (extract, transform, load) techniques applied to stock market data. The data, while reasonably current, should be checked against its original source. For definitions of the data columns, refer to Yahoo Finance, of which this data is only a subset.

  1. Extraction - LWP Perl modules, using techniques described by Clinton Wong.
  2. Transformation - The CPAN module Statistics::Descriptive provides numerical organization to the dataset.
  3. Load - CSV data is loaded into Excel worksheets using John McNamara's Perl module.
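
As a rough illustration of the transformation step, the sketch below summarizes one numeric column using only core Perl. It is a stand-in for the CPAN statistics module rather than the code behind this page; the describe function and the sample figures are hypothetical.

```perl
use strict;
use warnings;
use List::Util qw(sum min max);

# Summarize one numeric column; non-numeric cells (e.g. "N/A") are skipped.
sub describe {
    my @values = grep { defined($_) && /^-?\d+(?:\.\d+)?$/ } @_;
    return unless @values;
    my $n    = @values;
    my $mean = sum(@values) / $n;
    # Sample variance (n - 1 in the denominator); zero for a single value.
    my $var  = $n > 1 ? sum(map { ($_ - $mean) ** 2 } @values) / ($n - 1) : 0;
    return {
        count => $n,
        mean  => $mean,
        sd    => sqrt($var),
        min   => min(@values),
        max   => max(@values),
    };
}

# Hypothetical figures for one column of one industry.
my $stats = describe(12.5, 18.0, 'N/A', 9.5, 20.0);
printf "n=%d mean=%.2f sd=%.2f min=%.1f max=%.1f\n",
    @{$stats}{qw(count mean sd min max)};
```

Skipping non-numeric cells is a design choice made here because downloaded financial data often contains placeholders; a stricter version could report them instead.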

Submitting a sector from the Demonstration pulldown creates a bullet for each industry in that sector. Each bullet item is a link to that industry's Excel-formatted data. The Stars column is simply a count of how many columns were highlighted (top 30% of all columns for that row) for that company. Be sure not to miss the legend worksheet (there are 2 worksheets), which gives a numerical explanation of the highlighting.
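
The star count can be sketched as follows. This is only one reading of the highlighting rule: it assumes a cell is highlighted when its value is among the top 30% of values in its column across the industry, that larger is better in every column, and that ties at the cutoff are also highlighted. The function names, column names, and ticker symbols are hypothetical.

```perl
use strict;
use warnings;

# Return the smallest value that still falls within the top $fraction of @values.
sub top_cutoff {
    my ($fraction, @values) = @_;
    my @sorted = sort { $b <=> $a } @values;
    my $keep   = int(@sorted * $fraction + 0.5) || 1;   # always keep at least one
    return $sorted[$keep - 1];
}

# For each company, count the columns whose value meets that column's cutoff.
sub count_stars {
    my ($rows, $columns, $fraction) = @_;
    my %cutoff;
    for my $col (@$columns) {
        $cutoff{$col} = top_cutoff($fraction, map { $_->{$col} } @$rows);
    }
    my %stars;
    for my $row (@$rows) {
        $stars{ $row->{symbol} } = scalar grep { $row->{$_} >= $cutoff{$_} } @$columns;
    }
    return \%stars;
}

# Hypothetical industry data: four companies, three columns.
my @rows = (
    { symbol => 'AAA', growth => 15, margin => 40, yield => 1.0 },
    { symbol => 'BBB', growth => 8,  margin => 25, yield => 3.5 },
    { symbol => 'CCC', growth => 15, margin => 10, yield => 2.0 },
    { symbol => 'DDD', growth => 5,  margin => 30, yield => 0.5 },
);
my $stars = count_stars(\@rows, [qw(growth margin yield)], 0.30);
print "$_: $stars->{$_}\n" for sort keys %$stars;
```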

Once the data is available as comma-delimited files, many data preparation methods become possible. One approach is to create a Microsoft Access database that uses these CSV files as externally linked tables, then write SQL queries to find the best companies. Finally, ODBC can run these queries automatically to create static HTML, using modules like HTML::Table or HTML::Dashboard to present colorized cells similar to the Excel demonstration above.
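
The last step, turning query results into colorized HTML, might look roughly like the sketch below. It uses plain Perl string building as a stand-in for HTML::Table or HTML::Dashboard and assumes simple CSV with no quoted or embedded commas. The csv_to_html function, the thresholds, and the colors are hypothetical choices for illustration.

```perl
use strict;
use warnings;

# Escape the characters that are special in HTML text.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Build an HTML table from CSV lines (first line is the header).
# Cells in column $color_col at or above $green_min are shaded green;
# cells at or above $yellow_min are shaded yellow.
sub csv_to_html {
    my ($lines, $color_col, $green_min, $yellow_min) = @_;
    my @rows   = map { [ split /,/, $_, -1 ] } @$lines;
    my $header = shift @rows;
    my $html   = "<table>\n<tr>"
               . join('', map { '<th>' . escape_html($_) . '</th>' } @$header)
               . "</tr>\n";
    for my $row (@rows) {
        $html .= '<tr>';
        for my $i (0 .. $#$row) {
            my $style = '';
            if ($i == $color_col) {
                $style = $row->[$i] >= $green_min  ? ' style="background-color: lightgreen"'
                       : $row->[$i] >= $yellow_min ? ' style="background-color: yellow"'
                       :                             '';
            }
            $html .= "<td$style>" . escape_html($row->[$i]) . '</td>';
        }
        $html .= "</tr>\n";
    }
    return $html . "</table>\n";
}

# Hypothetical query result: symbol and star count.
print csv_to_html([ 'Symbol,Stars', 'AAA,3', 'BBB,1', 'CCC,0' ], 1, 3, 1);
```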

Even without a formal database, the Perl module Data::Table provides an easy-to-use SQL interface that works well in conjunction with DBI::DBD for creating dynamic HTML on this Apache server.

A more complete Industry Browser demonstrates the use of these various preparation methods. Each method the Browser calls to examine an industry's data organizes that data in a slightly different way. For a rating (number of stars), a larger number is better; for a ranking, a smaller number is better. In every case, however, the data is derived from the same comma-delimited files using Perl with modules like Data::Table or DBI. For colors, Green is typically best, followed by Yellow.
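
To illustrate the difference between a rating and a ranking, the sketch below turns star counts (larger is better) into ranks (smaller is better). It shows one possible scheme, not necessarily the one the Browser uses: tied scores share a rank and the next rank is skipped. The rank_by_score function and the ticker symbols are hypothetical.

```perl
use strict;
use warnings;

# Rank names by a score where larger is better: the best score gets
# rank 1, and tied scores share the same rank.
sub rank_by_score {
    my (%score) = @_;
    my @ordered = sort { $score{$b} <=> $score{$a} || $a cmp $b } keys %score;
    my (%rank, $prev);
    for my $i (0 .. $#ordered) {
        my $name = $ordered[$i];
        $rank{$name} = (defined $prev && $score{$name} == $score{$prev})
            ? $rank{$prev}
            : $i + 1;
        $prev = $name;
    }
    return \%rank;
}

# Hypothetical star counts for four companies.
my $rank = rank_by_score(AAA => 2, BBB => 1, CCC => 1, DDD => 0);
print "$_: rank $rank->{$_}\n" for sort keys %$rank;
```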


George Elgin
Questions or Comments