BULLETIN
Programming Languages for Library and Textual Processing

by Howard Fosdick

Howard Fosdick has written many technical articles and several books, founded two software users groups and been an independent computer consultant since 1989. He has programmed in the majority of the languages mentioned in this article. His most recent book is Rexx Programmer's Reference (Wiley, 2005). Email him at hfosdick@compuserve.com.

Which programming language is best for library and textual processing? Which is best for information science? This article addresses these questions through a historical, evolutionary approach. To understand where we are today, we must understand where we have been.

The Requirements
Different
programming languages address different kinds of problems. To
understand which are most suitable for library and information
science (LIS), first define the kinds of programming problems these
areas present. LIS
programming divides into two classes: problems that are similar to
those of business and government, and those that are unique to
library and information science. The former consist of front-office
applications, such as word processing, spreadsheets and
desktop computing, and back-office
applications, comprising data processing functions like
accounts payable and receivable, payroll and other business
operations. Together they comprise information technology (IT).
For IT applications, LIS employs the same programming languages as
other organizations. Their goals are the same. LIS
also presents unique requirements. These derive from the need for text processing – the ability to analyze, process and reformat
text. Examples include products created by computer text
manipulation, such as concordances, indexes, bibliographies and
citation maps. Information
retrieval (IR) is another area. The goal is to rapidly
retrieve relevant information by applying Boolean logic to keywords
and searching databases optimized for textual storage and retrieval.
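This retrieval model can be made concrete with a small sketch. The Python fragment below is a toy example, not drawn from any particular IR system: the documents and keywords are invented, and real systems add ranking, stemming and scale. It builds an inverted index and applies Boolean logic as set operations:

```python
# Toy Boolean retrieval over an inverted index (illustrative only).
docs = {
    1: "cataloging rules for serials",
    2: "serials acquisition and cataloging",
    3: "reference interviews in public libraries",
}

# Build the inverted index: keyword -> set of document IDs.
index = {}
for doc_id, text in docs.items():
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

# Boolean logic maps directly onto set operations.
hits_and = index["cataloging"] & index["serials"]   # AND: intersection
hits_or = index["cataloging"] | index["reference"]  # OR: union

print(sorted(hits_and))  # -> [1, 2]
print(sorted(hits_or))   # -> [1, 2, 3]
```

AND maps to set intersection and OR to set union, which is why databases optimized for retrieval precompute exactly this kind of index.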
A third area is linguistic
research and natural
language processing. Textual analysis answers all manner of
research questions. The classic is: Who was Shakespeare and were all
his works written by one person? A more recent example attempted to
predict the effectiveness of the 2004 presidential candidates by
analyzing how they expressed concepts in their speeches. These
problems all require text processing.
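Analyses like these ultimately reduce to counting and comparing textual features. A minimal sketch in Python (the passage is illustrative, and real stylometric studies use far richer features than raw word counts):

```python
from collections import Counter

# Tally word frequencies -- a basic building block of concordances,
# indexes and authorship comparisons.
text = "To be or not to be that is the question"
frequencies = Counter(word.lower() for word in text.split())

print(frequencies.most_common(2))  # the two most frequent words
```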
Text consists of words (and structural syntax) combined into
larger units like sentences, paragraphs and documents. A sequence of characters is a character string, or simply a string. String manipulation
includes several operations. Strings must be parsed or scanned for specific sub-strings and split into their
constituent sub-strings. Pattern
matching refers to how strings are inspected and their contents
analyzed. Bifurcation
splits strings into components. Concatenation
joins two or more strings into one. Languages
implement string processing through operators
that manipulate strings within expressions.
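These operations can be sketched in Python, one of the scripting languages surveyed later in this article (the strings here are arbitrary examples):

```python
# Basic string operations: concatenation, splitting (bifurcation),
# and scanning for a specific sub-string.
title = "Rexx Programmer's" + " " + "Reference"  # concatenation
words = title.split()                            # split into sub-strings
found = "Reference" in title                     # scan for a sub-string
position = title.find("Reference")               # where the sub-string starts

print(words)            # -> ['Rexx', "Programmer's", 'Reference']
print(found, position)  # -> True 18
```

Concatenation, splitting and scanning cover most day-to-day string work; pattern matching generalizes the scanning step.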
Functions or object methods
perform additional string operations. External or callable libraries contain additional functions or object methods. Whether
a programming language is suitable for LIS depends on how well that
language processes text. A suitable language means smaller, simpler
programs. Fewer errors occur. Programs are easier to enhance and
maintain. An unsuitable language means more code, more effort and
greater likelihood of error. Programs are difficult to enhance and
maintain.

Beginnings
Early
programmers worked in native computer code or machine
language. This was
tedious, error-prone and labor-intensive. The first higher-level
languages leveraged computer power to address the problem. Each
reflected a different understanding of how to conceptualize
programming problems, a different programming
paradigm. FORTRAN views
problems through the prism of numeric formulas and calculations.
COBOL processes records and thereby forms the basis for library data
processing. LISP resolves problems by manipulating lists. The
first language based on a string-processing paradigm was COMIT.
Defects marked it as an early effort. SNOBOL (StriNg Oriented
SymBOlic Language) was the first viable string processing language.
It reached its final form by 1969 as SNOBOL4. SNOBOL
features exceptionally powerful pattern matching, string
substitution and replacement. Problems that might require hundreds
of lines of code in unsuitable languages (say, COBOL or Pascal) can
be resolved in small, easily understood SNOBOL programs. These
advantages won SNOBOL its role as the primary language for
specialized text analysis and research during the 1970s and 1980s.
But SNOBOL remained outside of mainstream computing. The language is
resource-intensive, runs only on certain computers and offers meager
facilities outside of string processing. Its input/output model,
mathematical abilities and control constructs are limited. In
contrast to SNOBOL’s specialization, PL/I was the first
general-purpose programming language to offer strong string
manipulation capabilities. It attained mainstream popularity by
combining the formulaic abilities of FORTRAN with the record
processing features of COBOL and integrating string processing. PL/I
first proved the efficacy of the text-processing paradigm to many in
computing. Through
the 1970s and 1980s, PL/I was the primary text processing language
for librarians and information scientists, while SNOBOL held sway
among those in specialized text analysis and university research.

Diaspora
Technology
trends in the 1980s upset this consensus. Personal computers became
ubiquitous and led to rising popularity for new programming
languages such as BASIC, Pascal and C. The
view from LIS was that BASIC and Pascal could be quite good at
string processing – but only if they went beyond the official
language definitions, which included minimal string manipulation. C
has many string functions and good text processing capability but is
a low-level language; it
takes more code to do the same amount of work with this
detail-oriented language. C and its descendants (C++, Objective C
and C#) never gained much popularity for LIS applications. As
these new languages became popular, PL/I declined. Its mainframe
popularity did not transfer to newer platforms like personal
computers and Unix-based servers. SNOBOL
declined when its primary author developed a new language called
Icon. Icon had the string processing capability of SNOBOL plus new
features, but it was an entirely
new and different language. The consensus on SNOBOL for
advanced text analysis fragmented as some researchers stayed with
SNOBOL, some went on to Icon and others drifted off to other
languages. The
early 1990s saw the rise of a new programming paradigm – object-oriented
programming or OOP.
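The paradigm is easy to see in a few lines of code. A minimal Python sketch, with a class and method names that are hypothetical, echoing the library example given in the text:

```python
# A library Book object bundles data with the methods applied to it
# (hypothetical names for illustration).
class Book:
    def __init__(self, title):
        self.title = title
        self.checked_out = False

    def check_out(self):
        self.checked_out = True

    def check_in(self):
        self.checked_out = False

book = Book("The Rexx Language")
book.check_out()
print(book.title, book.checked_out)
```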
OOP identifies entities or objects
and analyzes the processing or methods
one applies to them. For example, the library Book object
might include the methods Check_In, Check_Out, Add_to_Collection,
Mark_as_Lost, and the like. The premier object-oriented programming
languages in IT today are Java and C++. Neither has achieved much
popularity in LIS programming. Another big trend that started in the 1990s was open source software.

The Rise of Scripting Languages
Another
major trend in programming languages has been a bit of a sleeper.
Its slow, steady unfolding has not been fully appreciated within the
computer science community. This is the trend towards scripting
languages. The
difference between scripting languages and traditional programming
languages is one of degree, not of kind. Agreement on what
constitutes a scripting language is not complete (there are even a
few languages about whose categorization reasonable people will
disagree). The
most important characteristic of scripting languages is that they
are high level – each line of code in a scripting language does more
than a line of code in a traditional, lower-level programming
language. Scripting languages can be characterized as follows:

· Glue languages. Scripting ties together existing software components, including operating system commands, widgets, objects, functions, programs and service routines. Leveraging existing code yields greater productivity.

· Interpreted rather than compiled. Statements are dynamically converted to machine language and executed one line at a time. This means a shorter development cycle, as developers get immediate feedback on errors and can interactively debug programs.

· Automatically manage variables. Scripting transfers some of the programming burden from person to machine by automatically managing variables. Many scripting languages do not require defining variables before use, declaring their data types, managing their lengths or defining the maximum size of tables or arrays.

The
strengths of scripting languages play directly to LIS programming
requirements because their dynamic nature fits the needs of text
processing. For example, a classic text processing issue in many
programming languages is keeping track of the lengths of character
strings. While compiled languages rely on pre-defined variables and
fixed-size data, most scripting languages automatically manage
string lengths. Scripting is flexible and dynamic. While
string processing features were retroactively added to many older
programming languages, scripting centers on string processing. In
fact, two prominent scripting languages, Tcl/Tk and Rexx, consider all program variables to be variable-length strings. The
trend towards scripting languages benefits LIS programming because
the basic nature of scripting addresses the core needs of this
community.

Scripting for LIS
The
sidebar lists selection criteria for LIS scripting languages.
Let’s evaluate prominent scripting languages according to these
criteria. These include the Unix shell languages, PHP, Ruby, Python,
Tcl/Tk, Perl and Rexx. All run on any mainstream platform. The
Unix shell languages are an entire family of scripting languages
with origins in the Unix operating system. Many have since become
open source or free. Among the more popular today are the Korn and
the Bourne-Again shells. Shell languages offer powerful string
processing features, but this power is purchased at the cost of
tortured syntax. Scripts are cryptic and are often hard to read due
to the many special
characters that have unique meanings within the language.
Perhaps because of this, the Unix shell languages are rarely used outside of Unix-derived operating systems.

PHP is designed for developing websites and dynamic Web pages. It has excellent string manipulation facilities but is a special-purpose language. It is rarely used outside of the Web page code (HTML) within which it is embedded.

Ruby is a pure object-oriented scripting language that “…has many features to process text files…,” as its creator relates in his website language summary at www.ruby-lang.org/en/. Ruby has received some publicity but is new and not widely used.

Python
is a general-purpose, object-oriented scripting language that has
been applied to a wide variety of problems. It includes strong
string manipulation features in its string module, which supplies
many of the string-oriented functions of the language. Python
supports regular
expressions, a way to succinctly describe string patterns for
parsing and searching. The book Text
Processing in Python (Addison-Wesley, 2003) describes the
language’s string-manipulation facilities at length.

Tcl/Tk
is an embeddable,
extensible command language for issuing commands to text editors,
debuggers, illustrators and shells. Its Tk toolkit provides a
popular graphic user interface (GUI) that programmers use to create
windowing interfaces for scripts written in Tcl and other languages.
Tcl stores all data as character strings. The introduction to Tcl at
the Tcl Developer Xchange states, “The easiest way to understand
the Tcl interpreter is to remember that everything is just an
operation on a string.” That Tcl is based on the string-processing
paradigm underscores this trend in scripting languages. Originally
designed with a rather narrow focus, newer versions of Tcl/Tk add
features that generalize the language. Tcl applies to an
increasingly wide range of programming problems.

Perl
and Rexx are the two
open-source scripting languages that are most suitable for LIS
programming. These two languages address both classes of LIS
problems (IT problems and those unique to LIS). They are general
purpose, yet string manipulation capabilities are at their core.
Both run on nearly any platform. Their strong standards ensure that
they operate consistently wherever they run. Perl and Rexx are quite
popular worldwide, with a couple million Perl programmers and up to
one million Rexx developers. Given their importance to LIS, we now
describe them in detail.

Perl
Perl
is a general purpose scripting language that has superior text
processing capabilities. Its creator states this objective in the
opening sentence to his book, Programming
Perl: “Perl is a language for easily manipulating text, files
and processes.” The basic piece of data in Perl is the scalar,
which holds either a string or a number. (Perl converts between the
two as necessary). Perl has the standard string operators backed up
by built-in functions for string manipulation, data conversion and
the like. Perl’s
pattern matching and string substitution capabilities are
outstanding because Perl implements them through regular
expressions, a method of succinctly describing string
patterns with maximum flexibility. Regular expressions give Perl
string processing power on the order of SNOBOL’s. Perl’s
several million programmers have created the open source Comprehensive Perl Archive Network (CPAN). This resource includes
thousands of free function libraries (or modules) that support almost any imaginable interface or
functionality. The
downside to Perl is its complexity. The language has torturous
syntax it absorbed from its Unix antecedents. It employs a dizzying
array of special variables and operators with the result that no
possible keystroke is left without some special meaning. Clear
programs can be developed in Perl. But the language and its culture
encourage clever, pithy programs. In text processing one can use
regular expressions to devise very brief scripts that might require
many times more code in other languages – and that can never be
understood by anyone other than the person who wrote them.

Rexx
Rexx
was invented for IBM mainframes in the early 1980s with the goal of
combining power and ease of use. Once limited to the IBM universe,
Rexx has found new popularity within the burgeoning open source
movement. Today there are at least eight free Rexx interpreters in
worldwide use. While all meet the Rexx language standard, each
presents different strengths. For example, some run on specific
platforms or support object-oriented programming. Rexx
is a general purpose programming language rooted in the
string-processing paradigm. The only
data type in Rexx is the string. Variables contain strings
and their values are manipulated according to their contents. For
example, two strings containing digits can be subject to arithmetic
operations. Or they may be treated as character strings and
manipulated by any character string operation. Strings and tables
(or arrays) are inherently assumed variable-length, and Rexx
internally manages lengths for the programmer. The string-processing
paradigm provides exceptional flexibility. Rexx
includes operators and instructions for concatenation, parsing and
standard string processing. It includes three dozen built-in
functions designed for string processing and many that operate on words
(character groups enclosed within spaces). Rexx
comes with free external function libraries and interfaces for
nearly any purpose. These offerings are less numerous than Perl’s
CPAN but comprehensive enough that nothing required for LIS
programming is missing. Rexx’s
outstanding characteristic is its ease of use. Its creator states in
the first sentence of his book The
Rexx Language: “Rexx has been designed with just one
objective… to make programming easier than it was before.” Rexx
programs are quick to develop, easy to debug and contain fewer
errors. The programs can easily be altered, enhanced and maintained.
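Pattern matching is the pivot of the comparison that follows, so a brief illustration may help. Rexx parses with templates rather than regular expressions; here is what a regular expression does, sketched with Python's re module (the citation line and pattern are invented for this example):

```python
import re

# Extract author and year from a citation-like string.
# The sample line and the pattern are invented for illustration.
line = "Cowlishaw, Michael. The Rexx Language. Prentice-Hall, 1990."
match = re.search(r"^(\w+),\s+(\w+)\..*?(\d{4})", line)

author_last, author_first, year = match.groups()
print(author_last, author_first, year)  # -> Cowlishaw Michael 1990
```

A single pattern performs the scanning, splitting and validation that would otherwise take several explicit steps.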
Perl Versus Rexx
With
regular expressions for pattern matching, the huge CPAN library and
a larger worldwide user base, Perl offers several advantages over
Rexx. Perl is a popular
general-purpose language, yet it offers the string manipulation
power once found only in off-the-beaten-path SNOBOL. But
Perl has difficult syntax, many special variables and operators,
default variables and other features that make it a difficult
language to work in. Perl programs can be written clearly, but many
are not. Many are difficult to enhance and maintain. Perl works best
for professional developers who program full-time and researchers
who delve deeply into the language.

Rexx
lacks Perl’s regular expressions for pattern matching but is fully
rooted in the string-processing paradigm. It provides all the
text-processing features LIS programming requires. While
Perl is one of the more difficult languages to learn, read and
understand, Rexx is one of the easiest. Rexx is the most accessible
string processing language, while Perl shares the power (and
sometimes the obtuseness) of SNOBOL.

Trends Today
LIS
presents unique requirements for programming languages. Text
processing is central to those requirements. For years, researchers
favored SNOBOL, the language that proved a wide range of programming
problems could be solved through the string-processing paradigm. LIS
practitioners preferred the more mainstream PL/I, which proved the
central role of string processing in general-purpose programming
languages. Today
we witness a major shift in mainstream computing towards the
string-processing paradigm. The new open-source scripting languages
drive this trend. The most popular scripting language in the world,
Perl, is a superior text-processing language. Rexx also provides
excellent string processing yet is much easier to learn and use.
Scripting languages like Rexx and Tcl/Tk highlight the ascendancy of
the string-processing paradigm in that they recognize only one kind
of variable – the string.

Trends
in data representation parallel those in programming languages. EXtensible Markup Language or XML changes data into
self-descriptive text. XML stores all data as text with descriptive
identifiers or tags and requires that programs perform string
manipulation to process the data. The string-processing paradigm
thus spreads from programming languages to the data itself. Text processing, once the ugly duckling of computer science, has become the belle of the ball.

Sidebar: Selection Criteria for LIS Programming Languages

Selection criteria vary by project, but these are core criteria common to most LIS projects:

· String-oriented – String processing facilities are critical, and full grounding in the string-processing paradigm is preferable.

· General purpose – General-purpose languages with excellent string processing facilities are preferable to specialized languages because they apply to library IT problems as well as text processing and analysis.

· Mainstream – Popular, mainstream languages offer the benefit of a larger user community, more add-on products and external libraries, better support and a larger labor pool.

· Universal – Languages that run on all platforms preserve code investment as equipment changes or is upgraded.

· Easy – Languages that are easy to learn, use and maintain lead to fewer errors, higher reliability, higher productivity, and easier maintenance and enhancement.

· Open source – Not only are open source languages free, they free their users from restrictive license agreements and from vendor attempts at planned obsolescence and forced upgrades.

· High level – Higher-level languages are more productive than detail-oriented, lower-level languages. They also result in smaller, less complex programs that are easier to enhance and maintain.

· Standardized – Languages that enjoy standards from the American National Standards Institute (ANSI) enhance the value of scripts because the code is a known quantity that can more easily be maintained, ported, enhanced and upgraded.

Language Resources

Standard
Reference Works

SNOBOL: Griswold, Ralph. The SNOBOL4 Programming Language. NJ, Prentice-Hall, 1971, 2nd Ed.
ICON: Griswold, Ralph. The ICON Programming Language. Peer to Peer Publishing, 2000, 3rd Ed.
PL/I: Hughes, Joan. PL/I Structured Programming. NY, Wiley Text Books, 1986, 3rd Ed.
Korn: Korn, David and Morris Bolsky. The New KornShell Command and Programming Language. NJ, Prentice-Hall, 1995, 2nd Ed.
Bourne-Again Shell: Newham, Cameron and B. Rosenblatt. Learning the bash Shell. Sebastopol, CA, O’Reilly, 1998, 2nd Ed.
PHP: Moulding, Peter. PHP Black Book. CA, Paraglyph Publishing, 2001.
Ruby: Matsumoto, Yukihiro. Ruby in a Nutshell. Sebastopol, CA, O’Reilly, 2001.
Tcl/Tk: Ousterhout, John. Tcl and the Tk Toolkit. Reading, Addison-Wesley, 1994.
Python: van Rossum, Guido. The Python Language Reference. New Theory Ltd, 2003.
Perl: Wall, Larry. Programming Perl. Sebastopol, CA, O’Reilly, 1996, 2nd Ed.
Rexx: Cowlishaw, Michael. The Rexx Language: A Practical Approach to Programming. NJ, Prentice-Hall, 1990, 2nd Ed.

Websites

These
are key websites for the free and open source programming languages
discussed in this article.

SNOBOL: www.snobol4.org/, www.snobol4.com
ICON: www.cs.arizona.edu/icon/, www.engin.umd.umich.edu/CIS/course.des/cis400/icon/icon.html
PL/I: www-306.ibm.com/software/awdtools/pli/, http://home.nycap.rr.com/pflass/pli.htm, http://pl1gcc.sourceforge.net/
Korn: www.cs.mun.ca/~michael/pdksh/, www.osxgnu.org/software/Shells/pdksh/
Bourne-Again Shell: www.gnu.org/software/bash/bash.html, www.tldp.org/HOWTO/Bash-Prog-Intro-HOWTO.html, http://tille.soti.org/training/bash/
PHP: www.php.net/, http://php.resourceindex.com/, www.phpbuilder.com/
Tcl/Tk: www.scriptics.com/, http://dev.scriptics.com/software/tcltk/, http://hegel.ittc.ukans.edu/topics/tcltk/
Python: www.python.org/, www.pythonware.com/, www.activestate.com
Perl: www.perl.org/, www.perl.com/, www.cpan.org/
Rexx: http://regina-rexx.sourceforge.net/, www-306.ibm.com/software/awdtools/rexx/language/, http://users.comlab.ox.ac.uk/ian.collier/Rexx/rexximc.html

Selected Bibliography

Information Retrieval
Baeza-Yates, Ricardo and Berthier Ribeiro-Neto. Modern Information Retrieval. Reading, Addison-Wesley, 1999.
Berry, Michael. Survey of Text Mining: Clustering, Classification and Retrieval. NY, Springer Verlag, 2003.
Chakrabarti, Soumen. Mining the Web: Analysis of Hypertext and Semi-Structured Data. San Francisco, CA, Morgan Kaufmann, 2002.
Witten, Ian, Alistair Moffat, and Timothy Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. San Francisco, CA, Morgan Kaufmann, 1999.

LIS Programming

Brown, Martin. XML Processing with Perl, Python and PHP. Alameda, CA, Sybex, 2001.
Cooper, Michael D. Design of Library Automation Systems: File Structures, Data Structures, and Tools. NY, John Wiley & Sons, 1996.
Davis, Charles H. Illustrative Computer Programming for Libraries: Selected Examples for Information Specialists. Westport, CT, Greenwood Press, 1974.
Davis, Charles H., Gerald W. Lundeen, and Deborah Shaw. Pascal Programming for Libraries: Illustrative Examples for Information Specialists. Westport, CT, Greenwood Press, 1988.
Fosdick, H. Structured PL/I Programming: For Textual and Library Processing. Littleton, CO, Libraries Unlimited, 1982.
Mertz, David. Text Processing in Python. Reading, Addison-Wesley, 2003.
Salton, Gerald. Automatic Text Processing. Reading, Addison-Wesley, 1989.

Linguistics Programming

Grishman, Ralph. Computational Linguistics. Cambridge, Cambridge University Press, 1986.
Hausser, Roland. Foundations of Computational Linguistics: Human-Computer Communication in Natural Language. NY, Springer Verlag, 2001.
Lawler, John and Helen Dry. Using Computers in Linguistics: A Practical Guide. Routledge, 1998.
McEnery, Tony and Andrew Wilson. Corpus Linguistics. Edinburgh, Edinburgh University Press, 2001, 2nd Ed.

Natural Language Processing

Allen, James. Natural Language Understanding. Addison-Wesley, 1994, 2nd Ed.
Iwanska, Lucja and Stuart Shapiro. Natural Language Processing and Knowledge Representation. AAAI Press, 2000.
Jackson, Peter and Isabelle Moulinier. Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization. Amsterdam, John Benjamins Pub Co, 2002.
Manning, Christopher and Hinrich Schütze. Foundations of Statistical Natural Language Processing. Boston, MIT Press, 1999.

Programming for the Humanities

Corre, Alan. Icon Programming for Humanists. NJ, Prentice-Hall, 1990.
Hockey, S. Electronic Texts in the Humanities: Principles and Practice. Oxford, Oxford University Press, 2001.
Hockey, S. A Guide to Computer Applications in the Humanities. Johns Hopkins University Press, 1983.
Hockey, S. SNOBOL Programming for the Humanities. Oxford, Clarendon Press, 1985.
Copyright © 2005, American Society for Information Science and Technology