This library can read and write CSV files, including all extensions used by
Excel, e.g. quotes, newlines, 8-bit characters in fields, "0, etc.
Sphinx is a full-text search engine, distributed under GPL version
2. A commercial license is also available for embedded use.
Generally, it's a standalone search engine, meant to provide fast,
size-efficient and relevant full-text search functions to other
applications. Sphinx was specially designed to integrate well with SQL
databases and scripting languages. The current built-in data sources
support fetching data either via a direct connection to MySQL or from
an XML pipe.
As for the name, Sphinx is an acronym which is officially decoded as
SQL Phrase Index.
An OCaml wrapper for the Expat XML parsing library.
PXP is a validating XML parser for OCaml. It strictly complies
with the XML 1.0 standard.
The parser is simple to call: usually a single statement (function
call) is sufficient to parse an XML document and represent it
as an object tree.
Once the document is parsed, it can be accessed using a class
interface. The interface allows arbitrary access including
transformations. One of the features of the document representation
is its polymorphic nature; it is simple to add custom methods to
the document classes. Furthermore, the parser can be configured
such that different XML elements are represented by objects created
from different classes. This is a very powerful feature, because
it simplifies the structure of programs processing XML documents.
OCaml-Text is a library for dealing with "text", i.e. a sequence of Unicode
characters, in a convenient way.
Libtextcat is a library with functions that implement the classification
technique described in Cavnar & Trenkle, "N-Gram-Based Text Categorization" [1].
It was primarily developed for language guessing, a task on which it is known to
perform with near-perfect accuracy.
The central idea of the Cavnar & Trenkle technique is to calculate a
"fingerprint" of a document with an unknown category, and compare this with the
fingerprints of a number of documents of which the categories are known. The
categories of the closest matches are output as the classification. A
fingerprint is a list of the most frequent n-grams occurring in a document,
ordered by frequency. Fingerprints are compared with a simple out-of-place
metric.
[1] The document that started it all: William B. Cavnar & John M. Trenkle (1994)
N-Gram-Based Text Categorization, <http://citeseer.ist.psu.edu/68861.html>.
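The fingerprint comparison described above can be sketched in a few lines of
Python. This is an illustrative sketch only, not libtextcat's API: the function
names, the limits max_n=5 and top=300, the "_" word-boundary padding, and the
fixed penalty for n-grams missing from a category fingerprint are all
assumptions made for the example.

```python
from collections import Counter


def fingerprint(text, max_n=5, top=300):
    """Return the most frequent character n-grams of text, most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"  # assumed convention: mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Order by descending frequency; ties are broken alphabetically so that
    # the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(doc_fp, category_fp):
    """Sum how far each n-gram's rank in doc_fp is from its rank in category_fp."""
    rank_in_category = {gram: rank for rank, gram in enumerate(category_fp)}
    # Assumed penalty for an n-gram that the category fingerprint lacks.
    max_penalty = len(category_fp)
    return sum(
        abs(rank - rank_in_category[gram]) if gram in rank_in_category else max_penalty
        for rank, gram in enumerate(doc_fp)
    )


def classify(text, category_fps):
    """Return the name of the category whose fingerprint is closest to text."""
    doc_fp = fingerprint(text)
    return min(category_fps, key=lambda name: out_of_place(doc_fp, category_fps[name]))
```

In use, one fingerprint would be built per known category from sample text, and
classify() would return the name with the smallest out-of-place distance.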
TyXML is an OCaml library that allows you to build XML trees whose validity is
ensured by the type checker. It supports XHTML 1.0 and 1.1, HTML5 and SVG
(partial).
odt2txt is a command-line tool which extracts the text out of OpenDocument Texts
produced by LibreOffice, OpenOffice, StarOffice, KOffice and others.
odt2txt can also extract text from some file formats similar to OpenDocument
Text, such as OpenOffice.org XML, which was used by OpenOffice.org version 1.x
and older StarOffice versions. To a lesser extent, odt2txt may be useful to
extract content from OpenDocument spreadsheets and OpenDocument presentations.
odt2txt:
- is small
- supports multiple output encodings
- adapts to your locale
- can substitute common characters which the output charset does not contain
  with ASCII look-alikes
- is written in C and has few dependencies
- is portable (runs on Linux, Mac OS X, Windows, *BSD, Cygwin, Solaris, HP-UX)
OpenFTS (Open Source Full Text Search engine) is an advanced
PostgreSQL-based search engine that provides online indexing of data
and relevance ranking for database searching.
Close integration with the database allows metadata to be used to restrict
search results.