Lingua::Identify identifies the language a given string or file is
written in.
Lingua::Ispell.pm - a module encapsulating access to the Ispell program.
ispell, when reporting on misspelled words, indicates the string it was
unable to verify, as well as its starting offset in the input line.
No such information is returned for words which are deemed to be
correctly spelled.
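The reporting behaviour described above can be sketched as a small line parser. This is not the Lingua::Ispell interface; it assumes the line format that ispell emits in its `-a` pipe mode (`& word count offset: guesses`, `? word count offset: guesses`, `# word offset`), and the names `Miss` and `parse_ispell_line` are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Miss(NamedTuple):
    """One word ispell could not verify (hypothetical record type)."""
    word: str
    offset: int
    suggestions: List[str]


def parse_ispell_line(line: str) -> Optional[Miss]:
    """Parse one result line, assuming ispell's `-a` pipe-mode format.

    Returns None for accepted words ('*', '+', '-') and blank lines,
    since no word or offset is reported for those.
    """
    line = line.rstrip("\n")
    if not line or line[0] not in "&?#":
        return None
    head, _, tail = line.partition(":")
    fields = head.split()
    if fields[0] == "#":
        # "# word offset": no suggestions available
        return Miss(fields[1], int(fields[2]), [])
    # "& word count offset: s1, s2" or "? word count offset: g1, g2"
    guesses = [s.strip() for s in tail.split(",") if s.strip()]
    return Miss(fields[1], int(fields[3]), guesses)
```

A caller would feed each output line to `parse_ispell_line` and keep only the non-None results.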
This module provides a way for the user to specify possible languages
in order of preference, and then to pick the best language of those
available. Different 'dialects' given by the 'territory' part of the
language specifier (such as en, en_GB, and en_US) are also supported.
Seamus Venasse <svenasse@polaris.ca>
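As an illustration of picking the best language from an ordered preference list, here is a minimal sketch. It is not the module's own algorithm: the fallback order (exact match, then the bare language, then any dialect of that language) is an assumption, and `pick_language` and `normalize` are hypothetical names.

```python
from typing import Iterable, Optional


def normalize(tag: str) -> str:
    """Fold case and separators so 'en-GB' and 'en_gb' compare equal."""
    return tag.strip().lower().replace("-", "_")


def pick_language(preferred: Iterable[str],
                  available: Iterable[str]) -> Optional[str]:
    """Return the first available language matching the preference order.

    Assumed fallback per preferred tag: exact match, then the bare
    language (en for en_GB), then any available dialect of it.
    """
    avail = {normalize(a): a for a in available}
    for want in preferred:
        key = normalize(want)
        if key in avail:
            return avail[key]
        base = key.split("_")[0]
        if base in avail:
            return avail[base]
        for candidate, original in avail.items():
            if candidate.split("_")[0] == base:
                return original
    return None
```

With preferences `["en_GB", "fr"]` and available languages `["fr", "en_US"]`, this sketch returns `en_US`, because a dialect of the first preference outranks an exact match on the second.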
IDNA::Punycode is a module that encodes Unicode strings into Punycode
and decodes them back. Punycode is an efficient ASCII encoding of
Unicode for use with IDNA.
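To make the encoding concrete, the following sketch round-trips a label through Punycode using Python's built-in `punycode` and `idna` codecs rather than IDNA::Punycode itself. The helper names are hypothetical, and the `idna` codec follows the older IDNA 2003 rules.

```python
def to_punycode(label: str) -> str:
    """Encode a Unicode label as raw Punycode (no 'xn--' prefix)."""
    return label.encode("punycode").decode("ascii")


def from_punycode(encoded: str) -> str:
    """Decode raw Punycode back into a Unicode label."""
    return encoded.encode("ascii").decode("punycode")


def to_idna_label(label: str) -> str:
    """Encode a label for use in a domain name (adds the 'xn--' prefix)."""
    return label.encode("idna").decode("ascii")


print(to_punycode("münchen"))       # mnchen-3ya
print(from_punycode("mnchen-3ya"))  # münchen
print(to_idna_label("münchen"))     # xn--mnchen-3ya
```

The basic ASCII characters are copied first, and the non-ASCII characters are described after the final hyphen.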
Lingua::Stem - Stemming of words
This routine applies stemming algorithms to its parameters, returning the
stemmed words as appropriate to the selected locale.
Currently supported locales are:
EN - English (also EN-US and EN-UK)
DA - Danish
DE - German
GL - Galician
IT - Italian
NO - Norwegian
PT - Portuguese
SV - Swedish
Smi is a Simple Markup Interpreter, a filter for a simplified markup
dialect. Fed Markdown text, smi returns HTML output. Fed HTML, it returns
the markup translated to entities. I use smi as a filter for devel/cgit to
parse the README.md files, returning HTML output. I also use it to mark up
pages for a git-backed wiki. The use cases are limited only by your
imagination.
Strip whitespace and comments from JavaScript code
This class knows how to read two treebank formats, the Penn format
and the Chomsky Normal Form (CNF) format. These formats differ in
how they handle terminal nodes. The Penn format places pre-terminal
part-of-speech tags in the left-hand position of a
parenthesis-delimited pair, just as it does for non-terminal nodes.
The CNF format attaches pre-terminal tags to the word with an
underscore.
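The difference between the two conventions can be shown with a short sketch that reads a Penn-style bracketing and rewrites its pre-terminals in the underscore style. This is not the class's own reader: the function names are hypothetical, and the `word_TAG` ordering is an assumption, since the description says only that the tag is attached to the word with an underscore.

```python
import re


def parse_penn(text: str):
    """Parse a Penn-style bracketing into (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok  # a bare word (terminal)
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            children.append(node())
        pos += 1  # consume ")"
        return (label, children)

    return node()


def to_underscore_style(tree) -> str:
    """Render a Penn-style tree with pre-terminals written as word_TAG.

    The word_TAG order is an assumption made for this sketch.
    """
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return f"{children[0]}_{label}"  # pre-terminal: (DT the) -> the_DT
    parts = [label] + [to_underscore_style(child) for child in children]
    return "(" + " ".join(parts) + ")"


penn = "(S (NP (DT the) (NN dog)) (VP (VBD barked)))"
print(to_underscore_style(parse_penn(penn)))
# (S (NP the_DT dog_NN) (VP barked_VBD))
```

Only the pre-terminal layer changes. Non-terminal nodes keep the same parenthesised form in both conventions.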
MARC::Charset allows you to turn MARC-8 encoded strings into UTF-8
strings. MARC-8 is a single-byte character encoding that predates
Unicode, and allows you to put non-Roman scripts in MARC bibliographic
records.
MARC::Lint provides a mechanism for validating MARC records.