Msort sorts files in sophisticated ways. Records may be fixed size,
newline-separated blocks, or terminated by any specified character.
Key fields may be selected by position, tag, or character range. For
each key, distinct exclusions, multigraphs, substitutions, and a sort
order may be defined, or locale collation rules may be used. Comparisons may
be lexicographic, numeric, numeric string, hybrid, random, by string
length, angle, date, time, month name, or ISO8601 timestamp. Keys may
be reversed so as to generate reverse dictionaries. Optional keys are
supported. Unicode is supported, including full case-folding. Msort
itself has a somewhat complex command-line interface, but may be
driven by an optional GUI.
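
As a rough illustration of two of the comparison options mentioned above,
the sketch below builds a reverse dictionary by sorting on each word spelled
backwards, and sorts by string length. It is plain Python, not msort's own
interface; the function names are hypothetical and chosen for this example.

```python
def reverse_dictionary(words):
    """Sort words by their reversed spelling, so shared endings group together."""
    return sorted(words, key=lambda w: w[::-1])


def by_length(words):
    """Sort words by string length; ties fall back to lexicographic order."""
    return sorted(words, key=lambda w: (len(w), w))


words = ["nation", "cat", "station", "bat", "ration"]
print(reverse_dictionary(words))
print(by_length(words))
```

Sorting on the reversed key places words with the same suffix next to each
other, which is the property a reverse dictionary relies on.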
This class provides methods to validate:
- ISBN (International Standard Book Number)
- ISSN (International Standard Serial Number)
- ISMN (International Standard Music Number)
- ISRC (International Standard Recording Code)
- EAN/UCC-8 number
- EAN/UCC-13 number
- EAN/UCC-14 number
- UCC-12 (U.P.C.) ID number
- SSCC (Serial Shipping Container Code)
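
To show the kind of check such validators perform, here is a minimal Python
sketch of two of the checksum rules: the mod-10 rule used by EAN/UCC-13 and
the mod-11 rule used by ISBN-10. The function names are hypothetical and do
not reflect this class's actual methods.

```python
def _only_ascii_digits(text):
    """True if text is non-empty and every character is one of 0-9."""
    return bool(text) and all(c in "0123456789" for c in text)


def ean13_is_valid(code):
    """Check an EAN/UCC-13 number: weights 1, 3, 1, 3, ... over the first 12 digits."""
    digits = code.replace("-", "").replace(" ", "")
    if len(digits) != 13 or not _only_ascii_digits(digits):
        return False
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    return (10 - total % 10) % 10 == int(digits[12])


def isbn10_is_valid(code):
    """Check an ISBN-10: weights 10 down to 1, sum divisible by 11, final X means 10."""
    chars = code.replace("-", "").replace(" ", "").upper()
    if len(chars) != 10 or not _only_ascii_digits(chars[:9]):
        return False
    if chars[9] not in "0123456789X":
        return False
    values = [int(c) for c in chars[:9]]
    values.append(10 if chars[9] == "X" else int(chars[9]))
    return sum(v * w for v, w in zip(values, range(10, 0, -1))) % 11 == 0


print(ean13_is_valid("4006381333931"))
print(isbn10_is_valid("0-306-40615-2"))
```

ISBN-13 numbers use the same mod-10 rule as EAN/UCC-13, so the first function
covers them as well.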
This library can read and write CSV files, including all extensions used by
Excel, e.g. quotes, newlines, and 8-bit characters in fields, the "0
sequence, etc.
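
The Excel-style conventions listed above can be seen in a short round trip.
This sketch uses Python's standard csv module rather than the library
described here, and the sample rows are made up for illustration.

```python
import csv
import io

# Fields containing a quote, an embedded newline, and a non-ASCII character.
rows = [
    ["id", "note"],
    ["1", 'She said "hi"'],
    ["2", "line one\nline two"],
    ["3", "caf\u00e9"],
]

# newline="" stops the text layer from translating line endings itself.
buffer = io.StringIO(newline="")
csv.writer(buffer, dialect="excel").writerows(rows)
text = buffer.getvalue()

# Reading the same text back should reproduce the original fields.
parsed = list(csv.reader(io.StringIO(text, newline=""), dialect="excel"))
print(text)
print(parsed == rows)
```

Quotes inside a field are written doubled, and a field containing a newline
is wrapped in quotes so that it still counts as a single record.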
An OCaml wrapper for the Expat XML parsing library.
PXP is a validating XML parser for OCaml. It strictly complies
with the XML-1.0 standard.
The parser is simple to call; usually only one statement (function
call) is sufficient to parse an XML document and to represent it
as an object tree.
Once the document is parsed, it can be accessed using a class
interface. The interface allows arbitrary access including
transformations. One of the features of the document representation
is its polymorphic nature; it is simple to add custom methods to
the document classes. Furthermore, the parser can be configured
such that different XML elements are represented by objects created
from different classes. This is a very powerful feature, because
it simplifies the structure of programs processing XML documents.
OCaml-Text is a library for dealing with "text", i.e. a sequence of Unicode
characters, in a convenient way.
OpenToken is a facility for performing token analysis and parsing within
the Ada language. It is designed to provide all the functionality of a
traditional lexical analyzer/parser generator, such as lex/yacc. However,
thanks to inheritance and runtime polymorphism, it is implemented
entirely in Ada as withed-in code. No precompilation step is required, and
no messy tool-generated source code is created. The tradeoff is that the
grammar is generated at runtime.
AI::Categorizer is a framework for automatic text categorization. It
consists of a collection of Perl modules that implement common
categorization tasks, and a set of defined relationships among those
modules. The various details are flexible - for example, you can choose
what categorization algorithm to use, what features (words or otherwise)
of the documents should be used (or how to automatically choose these
features), what format the documents are in, and so on.
The basic process of using this module will typically involve obtaining a
collection of pre-categorized documents, creating a "knowledge set"
representation of those documents, training a categorizer on that
knowledge set, and saving the trained categorizer for later use. There are
several ways to carry out this process. The top-level AI::Categorizer
module provides an umbrella class for high-level operations, or you may
use the interfaces of the individual classes in the framework.
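
The process described above (pre-categorized documents, a knowledge set,
training, saving for later use) can be sketched in a few lines of Python.
This is a toy multinomial naive Bayes categorizer with add-one smoothing,
written only to illustrate the workflow; the class, its methods, and the
sample documents are hypothetical and are not AI::Categorizer's interface.

```python
import json
import math
from collections import Counter, defaultdict


class NaiveBayesCategorizer:
    """Toy multinomial naive Bayes text categorizer with add-one smoothing."""

    def __init__(self):
        self.doc_counts = Counter()              # documents seen per category
        self.word_counts = defaultdict(Counter)  # word frequencies per category
        self.vocabulary = set()

    @staticmethod
    def features(text):
        # Features here are simply lowercase whitespace-separated words.
        return text.lower().split()

    def train(self, knowledge_set):
        """knowledge_set is an iterable of (text, category) pairs."""
        for text, category in knowledge_set:
            self.doc_counts[category] += 1
            for word in self.features(text):
                self.word_counts[category][word] += 1
                self.vocabulary.add(word)

    def categorize(self, text):
        """Return the most probable category, or None if nothing was trained."""
        total_docs = sum(self.doc_counts.values())
        best_category, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            denom = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_docs / total_docs)
            for word in self.features(text):
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_category, best_score = category, score
        return best_category

    def save(self):
        """Serialize the trained counts to a JSON string."""
        return json.dumps({"doc_counts": self.doc_counts,
                           "word_counts": self.word_counts})

    @classmethod
    def load(cls, data):
        """Rebuild a categorizer from a string produced by save()."""
        state = json.loads(data)
        model = cls()
        model.doc_counts.update(state["doc_counts"])
        for category, counts in state["word_counts"].items():
            model.word_counts[category].update(counts)
            model.vocabulary.update(counts)
        return model


training = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch meeting on monday", "ham"),
]
model = NaiveBayesCategorizer()
model.train(training)
restored = NaiveBayesCategorizer.load(model.save())
print(restored.categorize("buy cheap pills"))
```

Swapping the features method or the scoring loop corresponds to the kind of
flexibility the framework describes: the feature selection and the
categorization algorithm can vary independently of the overall workflow.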
A simple sample script that reads a training corpus, trains a categorizer,
and tests the categorizer on a test corpus is distributed as eg/demo.pl.
Perl library that provides several modules to compute or validate check digits.