The format used by NPSML programs and utilities that expect classifier training/test data is similar to that used by MEGAM and LIBSVM. We have deviated from these pre-existing file formats for two reasons. First, we wanted a single file format that would be reusable across a variety of machine learning algorithms. Second, we wanted a simple file format in which incorrectly encoded data could be detected easily. Our experience in teaching data mining tools to beginners is that the feature extraction step is the single most likely point at which technical errors are introduced. These mistakes are particularly difficult to catch when they result from non-printing characters or the wrong type of white space (e.g. tabs where spaces are expected). For this reason, we have developed a file format that imposes tight constraints on the use of white space and non-printing characters. Just as a "type-safe" programming language can prevent certain types of programming errors by limiting the set of "allowed" programs, our file format and format-checking tools limit the set of "allowed data" to filter out some particularly common error cases.
The input must be in UTF-8, with each document taking up its own line, and with the elements of each line delimited by a single space character. No extra space is allowed before, after, or between lines or the elements within lines. Comments within a data file are not supported. The following characters are illegal: Unicode code point 127 (the delete character) and all code points less than 32 (code point 32 is the space character, which is allowed). This eliminates the non-printable ASCII characters and all ASCII white space characters except the space.
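The character and spacing rules above can be sketched as a small check on a single line. This is only an illustration of the stated constraints, not the format checker shipped with NPSML; the function name `check_line` and the sample data line are hypothetical.

```python
def check_line(line):
    """Return a list of problems found in one data line.

    The line is expected without its trailing newline. Only the rules
    stated in the text are checked: no code points below 32, no code
    point 127, no leading, trailing, or repeated spaces, no empty lines.
    """
    problems = []
    for pos, ch in enumerate(line):
        code = ord(ch)
        if code < 32 or code == 127:
            problems.append(f"illegal code point {code} at column {pos}")
    if line.startswith(" ") or line.endswith(" "):
        problems.append("leading or trailing space")
    if "  " in line:
        problems.append("two or more consecutive spaces")
    if line == "":
        problems.append("empty line")
    return problems


if __name__ == "__main__":
    # Hypothetical data line: id, weight, category, then feature/value pairs.
    print(check_line("doc1 1 pos good 2 bad 1"))  # no problems expected
    print(check_line("doc1\t1 pos good 2"))       # a tab is reported
```

Reading the file with Python's strict UTF-8 decoding (the default for `open(path, encoding="utf-8")`) would additionally reject byte sequences that are not valid UTF-8.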
The BNF for the file format follows:
<syntax> ::= <line>
<line> ::= <line> <line> | <required-columns> <feature-value-list>
<required-columns> ::= <instance-id> <space> <instance-weight> <space> <instance-category>
<feature-value-list> ::= <space> feature-string <space> feature-value <feature-value-list> | <EOL>
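As a rough illustration of the grammar, the sketch below splits one line into its three required columns and a list of feature/value pairs, reading the recursive production as "zero or more pairs followed by end of line". The function name `parse_line` is hypothetical, and all fields are kept as strings because the grammar does not say which numeric types the weight and feature values use.

```python
def parse_line(line):
    """Split one data line into (instance_id, weight, category, features).

    features is a list of (feature-string, feature-value) pairs. Every
    field is returned as a string; converting weights or values to
    numbers is left to the caller.
    """
    tokens = line.split(" ")
    if any(tok == "" for tok in tokens):
        raise ValueError("empty element: extra or misplaced space")
    if len(tokens) < 3:
        raise ValueError("need instance-id, instance-weight and instance-category")
    instance_id, weight, category = tokens[:3]
    rest = tokens[3:]
    if len(rest) % 2 != 0:
        raise ValueError("feature without a value")
    # Pair up alternating feature strings and feature values.
    features = list(zip(rest[0::2], rest[1::2]))
    return instance_id, weight, category, features


if __name__ == "__main__":
    # Hypothetical data line used only for illustration.
    print(parse_line("doc1 1 pos good 2 bad 1"))
```

Splitting on a single space, rather than on arbitrary white space, makes doubled or stray spaces show up as empty elements instead of being silently absorbed.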
When document data is used to construct a feature model, such as the output of nb-learn, the model is stored in the BIO format. If you need to inspect the contents of a BIO file, the command-line tool b-print prints them. Alternatively, the Python module bio.py provides a programmatic interface.
The extension of the BIO file specifies the length of each row and the size of each cell.
BIO files are in a binary format; attempting to open them in a plain text editor will not produce meaningful output.
We have developed a number of useful utilities for working with inputs and outputs to the classifiers and other tools. These utilities are written initially in Python and then ported over to C as they mature.
Unless stated otherwise, it should be assumed that these utilities print to standard output. If a large output is expected, it is probably a good idea to redirect it to a file, e.g.
$ bin/nps-shuffl document-data.npsml > document-data-shuffled.npsml
The following is a brief summary of some of the included utilities and their purpose:
nps-moments: Reports statistical data about a list of values.
nps2megam: Converts from the NPSML classifier data format to one recognized by MEGAM.
nps2libSVM: Converts from the NPSML classifier data format to one recognized by libSVM.
nps-shuffl: Randomly shuffles the lines of a file.
nps-tTSplit: Divides a data set into training sets and test sets.
nps-bTTSplit: Divides a data set into training sets and test sets proportionally by category.
nps-tfIdf: Reports TF/IDF scores for the features in a data set.
nps-stripe: Manages parallel execution of shell commands.
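To make "proportionally by category" concrete, here is a minimal sketch of a balanced k-fold split over lines in the NPSML data format. It is not the implementation of nps-bTTSplit, whose options and output naming are not described here; the function names, the `seed` parameter, and the round-robin dealing strategy are all assumptions chosen for illustration.

```python
import random
from collections import defaultdict


def balanced_folds(lines, k, seed=0):
    """Assign lines to k folds so each category is spread evenly across folds."""
    by_category = defaultdict(list)
    for line in lines:
        category = line.split(" ")[2]  # third column is the instance category
        by_category[category].append(line)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for category in sorted(by_category):
        members = by_category[category]
        rng.shuffle(members)
        for i, line in enumerate(members):
            folds[i % k].append(line)  # deal each category round-robin
    return folds


def train_test_pairs(folds):
    """Yield (train, test) lists, using each fold once as the test set."""
    for i, test in enumerate(folds):
        train = [line for j, fold in enumerate(folds) if j != i for line in fold]
        yield train, test


if __name__ == "__main__":
    # Hypothetical data: 10 "pos" and 10 "neg" documents.
    data = [f"p{i} 1 pos f 1" for i in range(10)] + [f"n{i} 1 neg f 1" for i in range(10)]
    for train, test in train_test_pairs(balanced_folds(data, 5)):
        print(len(train), len(test))
```

Dealing each category separately keeps the category proportions of every test fold close to those of the full data set, which a plain random split does not guarantee for small or skewed data.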
A simple data classification task can be performed in the following manner:
#!/bin/bash
#this file is 100 characters wide
set -e # stop immediately on errors
#the original SRCFILE contains all the data in a single file
SRCFILE=~/repo/nps/projects/class_data/out/md_sentiment/dvd/all.lr.txt
WORK=./work
MEGAM_REPEAT=1 #IMPORTANT: change to 100 once the script is ready for final run
rm -rf $WORK #CAREFUL!
mkdir -p $WORK
#map categories to indices so they are constant across all runs of cross
#validation; this file is important for the nps2libSVM converter
cat $SRCFILE | cut -d' ' -f 3 | sort -u > $WORK/categories.txt
NUMCAT=`wc -l < $WORK/categories.txt` #read from stdin so only the count is captured
TOPCAT=`head -1 $WORK/categories.txt` #pick the first category to be positive for SVM
#shuffle the data to ensure a random permutation
nps-shuffl $SRCFILE > $WORK/data.shuffled.txt
#create 10 fold cross-validation with balanced categories
nps-bTTSplit -c2 10 $WORK $WORK/data.shuffled.txt
#at this point stop, inspect the file to check your work
for f in ./work/train.* ; do #were you careful to only prefix appropriate files with train?
    N=${f#*train.}
    echo "Performing fold $N"
    MEGAMTRAIN=$WORK/megam.train.$N
    MEGAMTEST=$WORK/megam.test.$N
    nps2megam $f > $MEGAMTRAIN
    nps2megam ./work/test.$N > $MEGAMTEST
    megam -nc -fvals -repeat $MEGAM_REPEAT multiclass $MEGAMTRAIN > $WORK/weights.$N
    megam -nc -fvals -predict $WORK/weights.$N multiclass $MEGAMTEST > $WORK/megam_results.$N
    SVMTRAIN=$WORK/svm.train.$N
    SVMTEST=$WORK/svm.test.$N
    rm -f $WORK/svm_features.txt
    nps2libSVM -a $TOPCAT $WORK/svm_features.txt $f > $SVMTRAIN
    nps2libSVM -r $TOPCAT $WORK/svm_features.txt ./work/test.$N > $SVMTEST
    #I'll let someone else fill in libSVM detail beyond this point (not much experience)
done
Here is an example shell script that tabulates the results:
#!/bin/bash
#this script tabulates the results
set -e
WORK=./work
MEGAM_REPEAT=1 #change to 100
RESULT=$WORK/megam.cross_val.accuracy.txt
rm -f $RESULT
for f in ./work/train.* ; do #were you careful to only prefix appropriate files with train?
    N=${f#*train.}
    cut -f 3 -d' ' $WORK/test.$N > $WORK/column.1.txt #TRUTH
    cut -f 1 $WORK/megam_results.$N > $WORK/column.2.txt #PREDICTION
    paste $WORK/column.1.txt $WORK/column.2.txt > $WORK/answers.txt
    nps-accuracy $WORK/answers.txt >> $RESULT
done
rm -f $WORK/column.1.txt
rm -f $WORK/column.2.txt
rm -f $WORK/answers.txt
nps-moments ./work/megam.cross_val.accuracy.txt
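For readers who want to see what the tabulation amounts to, the following sketch computes a per-fold accuracy from a truth/prediction file and then summarizes the accuracies across folds. It is not the code of nps-accuracy or nps-moments, whose exact output is not documented here; the function names and the choice of mean and sample standard deviation as the reported statistics are assumptions. It also assumes that truth and prediction labels are written in the same form.

```python
import statistics


def accuracy(answer_lines):
    """Fraction of lines whose truth and prediction columns match.

    Each line is 'truth<TAB>prediction', the layout produced by paste.
    """
    pairs = [line.rstrip("\n").split("\t")[:2] for line in answer_lines]
    if not pairs:
        raise ValueError("no answers to score")
    correct = sum(1 for truth, predicted in pairs if truth == predicted)
    return correct / len(pairs)


def summarize(values):
    """Return count, mean and sample standard deviation of the values."""
    stdev = statistics.stdev(values) if len(values) > 1 else 0.0
    return {"n": len(values), "mean": statistics.mean(values), "stdev": stdev}


if __name__ == "__main__":
    # Hypothetical answers for two folds.
    fold_1 = ["pos\tpos", "neg\tpos", "neg\tneg", "pos\tpos"]
    fold_2 = ["pos\tneg", "neg\tneg"]
    print(summarize([accuracy(fold_1), accuracy(fold_2)]))
```

Reporting the spread across folds alongside the mean gives a sense of how much the accuracy depends on the particular split.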
Reproduced from the AUTHORS file found in the NPSML source root:
Andrew I. Schein
Research Assistant Professor
Naval Postgraduate School
aischein@nps.edu
Constantine V. Perepelitsa
Student Intern
Naval Postgraduate School
cvperepe@nps.edu
Reproduced from the COPYING file found in the NPSML source root:
The software in this source code repository has been entered into the public domain (as interpreted in the United States). Thus, there is no copyright holder.
THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. THE PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
As there is no copyright, there is no need to ask anyone's permission to use the software in any fashion. If you use this software in an academic context, citation will be greatly appreciated by its authors. However, there is no obligation whatsoever to perform attribution. Patches or external contributions to the NPSML library will only be accepted on the condition that they be given copyright-free.