py::bio Namespace Reference
Functions
def find
def sfx
def parseSfx
def mkSfx
def mapP
def resize
def mapW
def save
def load
def Str
def strSet
def ldRep
def isNA
def cvt
Variables
E = os.environ.get
list tys = ['b','B','h','H','i','I','l','L','f','d']
dictionary chToTy
dictionary tySz
dictionary tyToCh
dictionary NA
tuple path = E("BIO_PATH","./")
Detailed Description
bio.py - Binary Input/Output Library
Author: Andrew Schein
BIO -- Library for storing vectors and matrices of basic C type data
in self-describing binary files.
Briefly, this module helps solve 3 problems:
1. How to store binary data of basic C types on disk in a tool-neutral
format.
2. How to represent missing values.
3. How to access large binary data without incredibly large executable
start-up costs or heavy-handed software architecture (e.g. lazy
loading of disk pages).
In greater detail...
This module provides a python numpy/ndarray interface to what is
ultimately a C standard (on-going work) for representing binary data
of basic C types on disk. Type descriptors are given by the BIO
suffix. File size (combined with the size of the underlying C type)
is used to infer row span. For matrices, the BIO suffix provides
column span information. Eventually, the library will provide
facility for cubes and other n-dimensional structures consisting of
basic C numeric types. Character strings are encoded via bytes (as in
C).
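The row-span inference described above can be sketched as follows. This is only an illustration under stated assumptions: `infer_rows` is a hypothetical helper, not part of bio.py, and it takes the column count and type character as explicit arguments rather than parsing them from a BIO suffix. The size table is copied from tySz.

```python
import os
import tempfile

import numpy as np

# Bytes per element for each BIO type character (copied from the tySz table).
TY_SZ = {'b': 1, 'B': 1, 'h': 2, 'H': 2,
         'i': 4, 'I': 4, 'l': 8, 'L': 8,
         'f': 4, 'd': 8}


def infer_rows(path, cols, ty_ch):
    """Hypothetical helper: derive the row count from the file size."""
    row_bytes = cols * TY_SZ[ty_ch]
    size = os.path.getsize(path)
    if size % row_bytes != 0:
        raise ValueError("file size is not a whole number of rows")
    return size // row_bytes


# Write 2 rows x 3 columns of doubles as raw bytes, then recover the row count.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
np.arange(6, dtype=np.float64).reshape(2, 3).tofile(demo_path)
demo_rows = infer_rows(demo_path, cols=3, ty_ch='d')
```

Because the file holds nothing but raw elements, the only metadata needed to recover the shape is the element type and the column span.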
BIO files can be memory-mapped using numpy's memory map interface,
providing convenient loading of data as it is actually used. The BIO
library provides mapP (memory map 'private'--disk copy is unaltered)
and mapW (memory map 'writable'--changes are stored on disk) functions
for this purpose, in addition to a more conventional save routine for
storing unmapped numpy ndarrays to the file system.
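The difference between a private and a writable mapping can be illustrated directly with numpy.memmap. This sketch does not call mapP or mapW; it assumes they correspond roughly to numpy's copy-on-write mode ("c") and read/write mode ("r+"), which the source does not confirm. The file name is made up for the demonstration.

```python
import os
import tempfile

import numpy as np

# Create a small raw file of four int32 zeros to map.
path = os.path.join(tempfile.mkdtemp(), "vec.bin")
np.zeros(4, dtype=np.int32).tofile(path)

# Private mapping: mode "c" is copy-on-write, so the file is never modified.
private = np.memmap(path, dtype=np.int32, mode="c")
private[0] = 7
after_private = np.fromfile(path, dtype=np.int32)

# Writable mapping: mode "r+" writes changes through to the file.
writable = np.memmap(path, dtype=np.int32, mode="r+")
writable[1] = 9
writable.flush()
after_writable = np.fromfile(path, dtype=np.int32)
```

In both cases pages are read from disk only when they are touched, which is what keeps start-up cost low for large files.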
A desired property of a database is the representation of missing
values, e.g. NULL values in a SQL database. Python's numpy has no
such facility or standard, and so BIO establishes a convention for
each type. The missing code is called NA ('not applicable'). For
floats and doubles, NAN values will suffice. For signed integer types,
BIO establishes the largest magnitude negative number as the NA code.
For unsigned integer types, BIO establishes the largest magnitude
number as the NA code. Note that standard numpy type conversions
(.astype()) will not convert NA codes properly, and so BIO provides
NA-aware ndarray conversions (see cvt).
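A minimal sketch of the NA convention and of an NA-aware conversion follows. The helper names (`na_value`, `is_na`, `convert_na_aware`) are hypothetical and are not the library's isNA or cvt; the NA codes are derived from the rule stated above (NaN for floats, the most negative value for signed integers, the largest value for unsigned integers) rather than taken from the module's NA table.

```python
import numpy as np


def na_value(dtype):
    """NA code for a dtype, following the convention described above."""
    dt = np.dtype(dtype)
    if dt.kind == "f":
        return np.nan
    info = np.iinfo(dt)
    return info.min if dt.kind == "i" else info.max


def is_na(arr):
    """Boolean mask marking the NA entries of arr."""
    if arr.dtype.kind == "f":
        return np.isnan(arr)
    return arr == na_value(arr.dtype)


def convert_na_aware(arr, new_dtype):
    """Cast arr to new_dtype, mapping NA codes to the target type's NA code."""
    mask = is_na(arr)
    safe = arr.copy()
    safe[mask] = 0  # placeholder so the cast never sees NaN or an NA code
    out = safe.astype(new_dtype)
    out[mask] = na_value(new_dtype)
    return out


a = np.array([1, -128, 3], dtype=np.int8)   # -128 is the int8 NA code
naive = a.astype(np.int16)                  # NA becomes an ordinary -128
as_float = convert_na_aware(a, np.float32)  # NA becomes NaN
as_int16 = convert_na_aware(as_float, np.int16)
```

The plain `.astype` call keeps -128 as a regular value in the wider type, which is exactly the loss of NA information the paragraph above warns about. The sketch does not range-check ordinary values that do not fit the target type.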
Function Documentation
def py::bio::cvt(b, newTy)
INPUT: an ndarray b, type specifier newTy.
POST: convert b to newTy while preserving NA values.
def py::bio::find(p, wrn = True)
INPUT: (potentially partial) directory path p, warning toggle wrn.
POST: look up filename p as absolute path (if it has BIO suffix) or else in BIO_PATH.
INPUT: an ndarray b
POST: returns byte ndarray describing NA structure of b
INPUT: file name f
POST: load BIO array f into matrix and return matrix
INPUT: path to file.
POST: if p contains a BIO suffix, map p privately. Otherwise, use find to locate the file
and map that one. Raises an error if the file can't be found.
def py::bio::mapW(p, rows, replace = True, fill = True, default = NA)
INPUT:
p: absolute path to file for storage including suffix (used to infer size).
rows: the number of rows in the file.
cols: the column structure as a list. Currently only 1 or 2 dimensions are supported.
replace: do we eliminate data in current file (if present).
fill: do we fill newly allocated space with the default value.
default: value to fill matrix when fill is set. Defaults to type-specific NA value
POST: create space on disk for the ndarray, memory-map it, and return the ndarray.
Status: first draft does not support resizing or re-using files. To be added.
def py::bio::mkSfx(shape, typCh)
INPUT: numpy ndarray b with >= 1 dimensions.
POST: construct BIO suffix. Raise error if ndarray type is not in BIO set.
def py::bio::parseSfx(sfx)
INPUT: BIO suffix sfx, e.g. .B1000d
POST: parses sfx into column span and type information
def py::bio::resize(p, rows, default = NA)
Resizing code is activated by mapW.
def py::bio::save(f, b)
INPUT: file name prefix f, ndarray b
POST: write b to f + BIO suffix.
def py::bio::sfx(fname)
INPUT: fname -- a string.
POST: sfx returns a tuple (fname_pre,suffix) which splits fname into everything before
and after the suffix.
INPUT: ndarray of bytes or ubytes representing strings.
POST: returns string
def py::bio::strSet(b, string, nullTerm = True)
INPUT: a byte/ubyte ndarray b intended to receive string.
POST: guarantees null termination by default. Truncates string as necessary.
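The string handling described for Str and strSet might look roughly like the sketch below. The functions `str_set` and `to_str` are hypothetical stand-ins written for illustration, not the library's code, and they assume ASCII text stored C-style in a uint8 buffer.

```python
import numpy as np


def str_set(buf, string, null_term=True):
    """Copy string into the byte ndarray buf, truncating if it does not fit."""
    data = string.encode("ascii")
    # Reserve the last byte for the terminator when null_term is set.
    limit = max(buf.size - 1, 0) if null_term else buf.size
    data = data[:limit]
    buf[:] = 0  # clear old contents; unused bytes stay zero
    buf[:len(data)] = np.frombuffer(data, dtype=np.uint8)
    return buf


def to_str(buf):
    """Read a C-style string back: everything before the first zero byte."""
    return buf.tobytes().split(b"\0", 1)[0].decode("ascii")


field = np.zeros(6, dtype=np.uint8)
str_set(field, "hello world")  # truncated to 5 characters plus terminator
truncated = to_str(field)
str_set(field, "hi")
short = to_str(field)
```

With null termination disabled, all six bytes may hold characters, and a reader must then rely on the field width instead of a terminator.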
Variable Documentation
dictionary py::bio::chToTy
Initial value:
{ 'b' : N.int8   , 'B' : N.uint8 , 'h' : N.int16, 'H' : N.uint16,
  'i' : N.int32  , 'I' : N.uint32, 'l' : N.int64, 'L' : N.uint64,
  'f' : N.float32, 'd' : N.float64}
dictionary py::bio::NA
Initial value:
{ N.int8    : -1 << 7 , N.uint8   : (1 << 8) , N.int16 : -1<<15, N.uint16 : 1 << 16,
  N.int32   : -1 << 31, N.uint32  : (1 << 32), N.int64 : -1<<63, N.uint64 : 1 << 64,
  N.float32 : N.nan   , N.float64 : N.nan}
dictionary py::bio::tySz
Initial value:
{ 'b' : 1 , 'B' : 1, 'h' : 2, 'H' : 2,
  'i' : 4 , 'I' : 4, 'l' : 8, 'L' : 8,
  'f' : 4 , 'd' : 8}
dictionary py::bio::tyToCh
Initial value:
{ N.int8    : 'b' , N.uint8   : 'B', N.int16 : 'h', N.uint16 : 'H',
  N.int32   : 'i' , N.uint32  : 'I', N.int64 : 'l', N.uint64 : 'L',
  N.float32 : 'f' , N.float64 : 'd'}
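The type tables are meant to agree with one another, and a short check can confirm that. The sketch below copies the character-to-type and size tables from the initial values above, imports numpy as `np` rather than the module's `N` alias, and rebuilds the reverse table by inversion; the check itself is an illustration, not part of bio.py.

```python
import numpy as np

# Type character to numpy type, as listed in the initial values above.
chToTy = {'b': np.int8, 'B': np.uint8, 'h': np.int16, 'H': np.uint16,
          'i': np.int32, 'I': np.uint32, 'l': np.int64, 'L': np.uint64,
          'f': np.float32, 'd': np.float64}

# Element size in bytes for each type character.
tySz = {'b': 1, 'B': 1, 'h': 2, 'H': 2,
        'i': 4, 'I': 4, 'l': 8, 'L': 8,
        'f': 4, 'd': 8}

# The reverse table can be derived by inverting chToTy.
tyToCh = {ty: ch for ch, ty in chToTy.items()}

# Every declared size should equal numpy's own item size for that type.
sizes_agree = all(np.dtype(ty).itemsize == tySz[ch] for ch, ty in chToTy.items())

# Mapping a character to its type and back should be the identity.
round_trip = all(tyToCh[chToTy[ch]] == ch for ch in chToTy)
```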