The Naval Postgraduate School Machine Learning Library (NPSML)



Welcome to the Naval Postgraduate School Machine Learning Library (NPSML) project page.




1. Installation Instructions

To install NPSML, first obtain the source code from the hosted Mercurial repository. Installation instructions are given in the INSTALL file in the NPSML project root directory.
Do `cat INSTALL` to print the instructions to your terminal, or open the file in your favorite text editor.
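
For example, a first checkout and read of the instructions might look like the following sketch. The repository URL below is only a placeholder; substitute the address of the hosted Mercurial repository.

$ hg clone https://hg.example.org/npsml npsml   # placeholder URL; use the real repository address
$ cd npsml
$ cat INSTALL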


2. Guide to the Source Directory Structure

Below is a list of some of the major subdirectories in the source tree. All libraries are written in C unless stated otherwise.

src/py           Python modules
src/bio          BIO library: utilities for storing numerical data on disk in a binary file format.
src/b-util       Command-line utilities for working with BIO files.
src/classifiers  The machine learning classifiers.
src/num          Numerical libraries.
src/num/blas     Build of our own 64-bit BLAS.
src/nps          General-purpose libraries, including common data structures and the C object system.
src/scr          Utilities written in Python or other scripting languages.
src/error        A C library for error-code encapsulation and related messages.
src/io           Input/output-related C libraries.
src/util         Command-line utilities to get various jobs done.


3. Data Formats

3.1. Classifier Data Format

The format used by NPSML programs and utilities that expect classifier training/test data is similar to the formats used by MEGAM and LIBSVM. We have deviated from these pre-existing file formats for several reasons. On one hand, we wanted a single file format that would be reusable across a variety of machine learning algorithms. On the other hand, we wanted a simple file format in which incorrectly encoded data could be easily detected. Our experience teaching beginners to use data mining tools is that the feature extraction step is the single most likely point at which technical errors are introduced. These mistakes are particularly difficult to catch when they are caused by non-printing characters or the wrong kind of white space (e.g. tabs where spaces are expected). For this reason, we have developed a file format that imposes tight constraints on the use of white space and non-printing characters. Just as a "type-safe" programming language can prevent certain kinds of programming errors by limiting the set of "allowed" programs, our file format and file format checking tools limit the set of "allowed" data to filter out some particularly common error cases.

The input must be in UTF-8, with each document occupying its own line and with each element of a line delimited by a single space character. No extra space is allowed before, after, or between lines or the elements within lines. Comments within a data file are not supported. The following characters are illegal: Unicode code point 127 (delete) and all code points less than 32 (the code point of the space character). This eliminates the non-printable ASCII characters and all ASCII white space characters except the space itself.

The BNF for the file format follows:

< syntax > ::= < line >
< line > ::= < line > < line > | < required-columns > < feature-value-list >
< required-columns > ::= < instance-id > < space > < instance-weight > < space > < instance-category >
< feature-value-list > ::= < space > feature-string < space > feature-value < feature-value-list > | < EOL >

The < space > symbol is defined as exactly the ASCII character 0x20 ("Space").


Each line must begin with three columns:
[instance ID] [instance weight] [instance category]
instance ID: A unique ID given to the document described by the list of features that follows. This can be a number, a name, or any other distinguishing label, as long as it does not contain white space or other illegal characters.
instance weight: The weight of the instance. If you are unfamiliar with instance weighting, you should probably set this value to 1.0. This number must be positive. The decimal part is not necessary if the weight is an integer (e.g. 1 is the same as 1.0), but using the float representation can help humans distinguish this column from the other two required columns when visually inspecting the file.
instance category: The category that this document belongs to. This column is used during training and accuracy calculation. Although a value is always required by the file format, it is ignored when generating predictions. Integers are permitted as category labels; however, using integers this way is ill-advised because of the way the NPSML file format may integrate with third-party tools for which we have built adapters.

After the required columns, the document's features are listed one after another, alternating a feature name and its value. Each element is delimited by a single space character (tabs are not allowed). Feature names cannot contain white space, punctuation marks, or any non-printable characters.

For example, the following would describe three documents, two from one category and one from another, with features foo, bar, and baz:
a1 1.0 CatA foo 6 bar 3
a2 1.0 CatA foo 8.0 bar 4.0 baz 1.0
b1 1 CatB foo 2.0 baz 8

In the example we intentionally varied the use of integer and floating point numbers to show the variety that is permitted. It is a very good idea to verify that there are no problems or typos in the formatting by passing the input to nps-dataCheck before using it with other scripts or utilities that expect data in the NPSML classifier file format.
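
For example, if the three lines above were saved to a file (the name document-data.npsml is only an illustration), a quick check might look like the following sketch. We assume here that nps-dataCheck takes the data file as its argument; consult the tool itself for its exact usage.

$ cat > document-data.npsml <<'EOF'
a1 1.0 CatA foo 6 bar 3
a2 1.0 CatA foo 8.0 bar 4.0 baz 1.0
b1 1 CatB foo 2.0 baz 8
EOF
$ bin/nps-dataCheck document-data.npsml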

The NPSML document data format can be translated to a format readable by MEGAM and LIBSVM with the nps2megam and nps2libSVM utilities, respectively.
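
As a sketch, and reusing the hypothetical file name from above, a MEGAM conversion writes to standard output, as also shown in the cross-validation example in Section 6 (nps2libSVM takes additional arguments, a positive category and a feature-map file, illustrated in that same example):

$ bin/nps2megam document-data.npsml > document-data.megam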


3.2. BIO

When document data is used to construct a feature model, such as the output of nb-learn, the model is stored in the BIO format. If there is a need to inspect the contents of a BIO file, the b-print command-line tool does this; alternatively, the Python module bio.py provides a programmatic interface.
The extension of a BIO file specifies the length of each row and the size of each cell.
BIO files are in a binary format; attempting to open them in a plain text editor will not produce meaningful output.
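
As a minimal sketch of inspecting a model, assuming b-print takes the BIO file as its only argument (the model file name here is hypothetical):

$ bin/b-print nb-model.bio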


4. Classifiers

The library currently provides two classifiers: naive Bayes and an (averaged) perceptron. The perceptron implementation handles the multi-category case as well. Both classifiers use the NPSML classifier file format described above.

4.1 Naive Bayes

The naive Bayes algorithm is implemented in two executables: nb-learn and nb-classify. Both are distributed with man pages that describe their features.

4.2 Averaged Linear Perceptron

The perceptron algorithm is implemented in two executables: perc-learn and perc-classify. Both are distributed with man pages that describe their features. The same executables handle binary and multi-category classification.
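
Both classifier pairs follow the same train-then-classify pattern. The sketch below is hypothetical: the file names, argument order, and output handling are assumptions, and the actual interfaces are documented in the tools' man pages.

$ bin/nb-learn train.npsml nb-model.bio         # hypothetical: train a naive Bayes model
$ bin/nb-classify nb-model.bio test.npsml       # hypothetical: predict categories for test data
$ bin/perc-learn train.npsml perc-model.bio     # hypothetical: train an averaged perceptron
$ bin/perc-classify perc-model.bio test.npsml   # hypothetical: predict with the perceptron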

4.3 Logistic Regression

Logistic regression is under development.

5. Utilities

We have developed a number of useful utilities for working with the inputs and outputs of the classifiers and other tools. These utilities are initially written in Python and then ported to C as they mature.
Unless stated otherwise, it should be assumed that these utilities print to standard output. If a large output is expected, it is probably a good idea to redirect it to a file, e.g.
$ bin/nps-shuffl document-data.npsml > document-data-shuffled.npsml

The following is a brief summary of some of the utilities included and their purpose:
nps-moments: Reports statistical data about a list of values.
nps2megam: Converts from the NPSML classifier data format to one recognized by MEGAM.
nps2libSVM: Converts from the NPSML classifier data format to one recognized by libSVM.
nps-shuffl: Randomly shuffles the lines of a file.
nps-tTSplit: Divides a data set into training sets and test sets.
nps-bTTSplit: Divides a data set into training sets and test sets proportionally by category.
nps-tfIdf: Reports TF/IDF scores for the features in a data set.
nps-stripe: Manages parallel execution of shell commands.
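
For example, nps-moments can summarize any single column of numbers extracted from a data file. The sketch below assumes nps-moments reads one value per line from a file argument, as in the cross-validation example in Section 6; the file names are illustrative.

$ cut -d' ' -f 2 document-data.npsml > weights.txt   # extract the instance-weight column
$ bin/nps-moments weights.txt                        # report statistics for the weights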


6. NPSML In Action (A Cross Validation Example)

A simple data classification task can be performed in the following manner:

 #!/bin/bash
 # this file is formatted to a 100-character width
 set -e  # stop immediately on errors
 
 # the original SRCFILE contains all the data in a single file
 SRCFILE=~/repo/nps/projects/class_data/out/md_sentiment/dvd/all.lr.txt
 WORK=./work
 MEGAM_REPEAT=1  # IMPORTANT: change to 100 once the script is ready for the final run
 rm -rf $WORK    # CAREFUL!
 mkdir -p $WORK
 
 # map categories to indices so they are constant across all runs of cross
 # validation; this file is important for the nps2libSVM converter
 cat $SRCFILE | cut -d' ' -f 3 | sort -u > $WORK/categories.txt
 NUMCAT=$(wc -l < $WORK/categories.txt)
 TOPCAT=$(head -1 $WORK/categories.txt)  # pick the first category to be positive for SVM
 
 # shuffle the data to ensure a random permutation
 nps-shuffl $SRCFILE > $WORK/data.shuffled.txt
 # create a 10-fold cross-validation split with balanced categories
 nps-bTTSplit -c2 10 $WORK $WORK/data.shuffled.txt
 # at this point stop and inspect the files to check your work
 
 for f in $WORK/train.* ; do  # were you careful to prefix only the appropriate files with train?
     N=${f#*train.}
     echo "Performing fold $N"
     MEGAMTRAIN=$WORK/megam.train.$N
     MEGAMTEST=$WORK/megam.test.$N
     nps2megam $f > $MEGAMTRAIN
     nps2megam $WORK/test.$N > $MEGAMTEST
     megam -nc -fvals -repeat $MEGAM_REPEAT multiclass $MEGAMTRAIN > $WORK/weights.$N
     megam -nc -fvals -predict $WORK/weights.$N multiclass $MEGAMTEST > $WORK/megam_results.$N
 
     SVMTRAIN=$WORK/svm.train.$N
     SVMTEST=$WORK/svm.test.$N
     rm -f $WORK/svm_features.txt
     nps2libSVM -a $TOPCAT $WORK/svm_features.txt $f > $SVMTRAIN
     nps2libSVM -r $TOPCAT $WORK/svm_features.txt $WORK/test.$N > $SVMTEST
     # libSVM training and prediction on $SVMTRAIN and $SVMTEST are left to be filled in here
 done
 
 
Here is an example of a shell script that tabulates the results:
 #!/bin/bash
 # this script tabulates the results
 set -e
 WORK=./work
 MEGAM_REPEAT=1  # change to 100
 
 RESULT=$WORK/megam.cross_val.accuracy.txt
 rm -f $RESULT
 
 for f in $WORK/train.* ; do  # were you careful to prefix only the appropriate files with train?
     N=${f#*train.}
     cut -f 3 -d' ' $WORK/test.$N > $WORK/column.1.txt      # TRUTH
     cut -f 1 $WORK/megam_results.$N > $WORK/column.2.txt   # PREDICTION
     paste $WORK/column.1.txt $WORK/column.2.txt > $WORK/answers.txt
     nps-accuracy $WORK/answers.txt >> $RESULT
 done
 rm -f $WORK/column.1.txt
 rm -f $WORK/column.2.txt
 rm -f $WORK/answers.txt
 
 nps-moments $RESULT
 

NPSML Authors

Reproduced from the AUTHORS file found in the NPSML source root:

 Andrew I. Schein
 Research Assistant Professor
 Naval Postgraduate School
 aischein@nps.edu
 
 Constantine V. Perepelitsa
 Student Intern
 Naval Postgraduate School
 cvperepe@nps.edu

Copyright Information

Reproduced from the COPYING file found in the NPSML source root:

 The software in this source code repository has been entered into the
 public domain (as interpreted in the United States).  Thus, there is no 
 copyright holder.  
 
 THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
 APPLICABLE LAW. THE PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT
 WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT
 LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
 A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND
 PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE
 DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR
 CORRECTION.
 
 As there is no copyright, there is no need to ask anyone's permission to use 
 the software in any fashion. 
 
 If you use this software in an academic context, citation will be 
 greatly appreciated by its authors.  However, there is no obligation 
 whatsoever to perform attribution.  
 
 Patches or external contributions to the NPSML library will only be accepted on the 
 condition that they be given copyright-free.