The format used by NPSML programs and utilities that expect classifier training/test data is similar to that used by MEGAM and LIBSVM. We have deviated from these pre-existing file formats for several reasons. On one hand, we wanted a single file format that would be re-usable across a variety of machine learning algorithms. On the other hand, we wanted a simple file format that could be easily checked for incorrectly encoded data. Our experience in teaching beginners to use data mining tools is that the feature extraction step is the single most likely point at which technical errors are introduced. These mistakes are particularly difficult to catch when they are caused by non-printing characters or the wrong kind of white space (e.g. tabs where spaces are expected). For this reason, we have developed a file format that imposes tight constraints on the use of white space and non-printing characters. Just as a "type-safe" programming language can prevent certain classes of programming errors by limiting the set of "allowed" programs, our file format and file format checking tools limit the set of "allowed" data to filter out some particularly common error cases.
The input must be in UTF-8, with each document occupying its own line and with the elements of a line delimited by single space characters. No extra space is allowed before, after, or between lines or the elements within lines. Comments within a data file are not supported. The following characters are illegal: Unicode code point 127 (the delete character) and all code points less than 32 (the space character is code point 32 and therefore remains legal). This happens to eliminate the non-printable ASCII characters and all ASCII white space characters other than the space itself.
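As an illustration of these constraints, the following minimal Python sketch flags illegal characters and stray white space in a data file. It is not part of NPSML; the script, its function names, and its messages are our own invention, and the real format-checking tools may apply additional rules.

#!/usr/bin/env python
# check_format.py (hypothetical, not part of NPSML): flag lines that violate
# the character and white-space constraints described above.
import sys

def check_line(line, lineno):
    problems = []
    if line == "":
        problems.append("line %d: empty line" % lineno)
    for ch in line:
        # Code point 127 (delete) and everything below 32 are illegal;
        # this also catches tabs and carriage returns.
        if ord(ch) < 32 or ord(ch) == 127:
            problems.append("line %d: illegal code point %d" % (lineno, ord(ch)))
    if line.startswith(" ") or line.endswith(" ") or "  " in line:
        problems.append("line %d: elements must be separated by exactly one space" % lineno)
    return problems

def main(path):
    status = 0
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            try:
                line = raw.decode("utf-8").rstrip("\n")
            except UnicodeDecodeError:
                print("line %d: not valid UTF-8" % lineno)
                status = 1
                continue
            for problem in check_line(line, lineno):
                print(problem)
                status = 1
    return status

if __name__ == "__main__":
    sys.exit(main(sys.argv[1]))

A sketch like this could be run as, for example, python check_format.py document-data.npsml.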
The BNF for the file format follows:
< syntax >             ::= < line >
< line >               ::= < line > < line > | < required-columns > < feature-value-list >
< required-columns >   ::= < instance-id > < space > < instance-weight > < space > < instance-category >
< feature-value-list > ::= < space > feature-string < space > feature-value < feature-value-list > | < EOL >
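For illustration, a purely hypothetical data line conforming to this grammar might look as follows, with an instance id (doc42), an instance weight (1.0), an instance category (positive), and two feature/value pairs:

doc42 1.0 positive great 2 plot 1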
When document data is used to construct a feature model, such as the output of nb-learn, the model is stored in the BIO format. If there is a need to inspect the contents of a BIO file, the command-line tool b-print does this. Alternatively, the Python module bio.py provides a programmatic interface.
The extension of the BIO file specifies the length of each row and the size of each cell.
BIO files are in a binary format; attempting to open them in a plain text editor will not produce meaningful output.
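For example, a model produced by nb-learn could be inspected from the command line roughly as follows; the model file name is a placeholder, and we are assuming here that b-print takes the path of the BIO file as its only argument:

$ bin/b-print some-model.bio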
We have developed a number of useful utilities for working with inputs and outputs to the classifiers and other tools. These utilities are written initially in Python and then ported over to C as they mature.
Unless stated otherwise, it should be assumed that these utilities print to standard output. If a large output is expected, it is probably a good idea to redirect it to a file, e.g.
$ bin/nps-shuffl document-data.npsml > document-data-shuffled.npsml
The following is a brief summary of some of the utilities included and their purpose; a few representative invocations follow the list:
nps-moments: Reports statistical data about a list of values.
nps2megam: Converts from the NPSML classifier data format to one recognized by MEGAM.
nps2libSVM: Converts from the NPSML classifier data format to one recognized by libSVM.
nps-shuffl: Randomly shuffles the lines of a file.
nps-tTSplit: Divides a data set into training sets and test sets.
nps-bTTSplit: Divides a data set into training sets and test sets proportionally by category.
nps-tfIdf: Reports TF/IDF scores for the features in a data set.
nps-stripe: Manages parallel execution of shell commands.
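A few representative invocations, mirroring the way these tools are used in the scripts below (the directory, file names, and fold number are placeholders):

$ bin/nps-bTTSplit -c2 10 ./work ./work/data.shuffled.txt     # 10-fold train/test split balanced by category
$ bin/nps2megam ./work/train.0 > ./work/megam.train.0         # convert one fold to MEGAM's input format
$ bin/nps-moments ./work/megam.cross_val.accuracy.txt         # summary statistics over the per-fold accuracies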
A simple data classification task can be performed in the following manner:
#!/bin/bash
# this file is formatted to 100 characters width
set -e  # stop immediately on errors

# original SRCFILE contains all the data in a single file
SRCFILE=~/repo/nps/projects/class_data/out/md_sentiment/dvd/all.lr.txt
WORK=./work
MEGAM_REPEAT=1  # IMPORTANT: change to 100 once the script is ready for final run

rm -rf $WORK  # CAREFUL!
mkdir -p $WORK

# map categories to indices so they are constant across all runs of cross validation;
# this file is important for the nps2svm converter
cat $SRCFILE | cut -d' ' -f 3 | sort -u > $WORK/categories.txt
NUMCAT=`wc -l < $WORK/categories.txt`
TOPCAT=`head -1 $WORK/categories.txt`  # pick the first category to be positive for SVM

# shuffle the data to ensure a random permutation
nps-shuffl $SRCFILE > $WORK/data.shuffled.txt

# create 10-fold cross-validation with balanced categories
nps-bTTSplit -c2 10 $WORK $WORK/data.shuffled.txt

# at this point stop, inspect the files to check your work

for f in ./work/train.* ; do
    # were you careful to only prefix appropriate files with train?
    N=${f#*train.}
    echo "Performing fold $N"

    MEGAMTRAIN=$WORK/megam.train.$N
    MEGAMTEST=$WORK/megam.test.$N
    nps2megam $f > $MEGAMTRAIN
    nps2megam ./work/test.$N > $MEGAMTEST
    megam -nc -fvals -repeat $MEGAM_REPEAT multiclass $MEGAMTRAIN > $WORK/weights.$N
    megam -nc -fvals -predict $WORK/weights.$N multiclass $MEGAMTEST > $WORK/megam_results.$N

    SVMTRAIN=$WORK/svm.train.$N
    SVMTEST=$WORK/svm.test.$N
    rm -f $WORK/svm_features.txt
    nps2libSVM -a $TOPCAT $WORK/svm_features.txt $f > $SVMTRAIN
    nps2libSVM -r $TOPCAT $WORK/svm_features.txt ./work/test.$N > $SVMTEST
    # I'll let someone else fill in libSVM detail beyond this point (not much experience)
done

Here is an example of a shell script to tabulate the results:
#!/bin/bash
# this script tabulates the results
set -e

WORK=./work
MEGAM_REPEAT=1  # change to 100
RESULT=$WORK/megam.cross_val.accuracy.txt
rm -f $RESULT

for f in ./work/train.* ; do
    # were you careful to only prefix appropriate files with train?
    N=${f#*train.}
    cut -f 3 -d' ' $WORK/test.$N > $WORK/column.1.txt     # TRUTH
    cut -f 1 $WORK/megam_results.$N > $WORK/column.2.txt  # PREDICTION
    paste $WORK/column.1.txt $WORK/column.2.txt > $WORK/answers.txt
    nps-accuracy $WORK/answers.txt >> $RESULT
done

rm -f $WORK/column.1.txt
rm -f $WORK/column.2.txt
rm -f $WORK/answers.txt

nps-moments ./work/megam.cross_val.accuracy.txt
Reproduced from the AUTHORS file found in the NPSML source root:
Andrew I. Schein
Research Assistant Professor
Naval Postgraduate School
aischein@nps.edu

Constantine V. Perepelitsa
Student Intern
Naval Postgraduate School
cvperepe@nps.edu
Reproduced from the COPYING file found in the NPSML source root:
The software in this source code repository has been entered into the public domain (as interpreted in the United States). Thus, there is no copyright holder.

THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. THE PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.

As there is no copyright, there is no need to ask anyone's permission to use the software in any fashion. If you use this software in an academic context, citation will be greatly appreciated by its authors. However, there is no obligation whatsoever to perform attribution.

Patches or external contributions to the NPSML library will only be accepted on the condition that they be given copyright-free.