Bulletin of the American Society for Information Science and Technology   Vol. 31, No. 6   August/September 2005


Programming Languages for Library and Textual Processing

by Howard Fosdick

Howard Fosdick has written many technical articles and several books, founded two software users groups and been an independent computer consultant since 1989. He has programmed in the majority of languages mentioned in this article. His most recent book is Rexx Programmer’s Reference (Wiley, 2005). Email him at hfosdick@compuserve.com.

Which programming language is best for library and textual processing? Which is best for information science? This article addresses these questions through an historical/evolutionary approach. To understand where we are today, we must understand where we have been.

The Requirements

Different programming languages address different kinds of problems. To understand which are most suitable for library and information science (LIS), first define the kinds of programming problems these areas present.

LIS programming divides into two classes: problems that are similar to those of business and government, and those that are unique to library and information science. The former consist of front-office applications, such as word processing, spreadsheets and desktop computing, and back-office applications, comprising data processing functions like accounts payable and receivable, payroll and other business operations. Together they comprise information technology (IT). For IT applications, LIS employs the same programming languages as other organizations. Their goals are the same.

LIS also presents unique requirements. These derive from the need for text processing – the ability to analyze, process and reformat text. Examples include products created by computer text manipulation, such as concordances, indexes, bibliographies and citation maps. Information retrieval (IR) is another area. The goal is to rapidly retrieve relevant information by applying Boolean logic to keywords and searching databases optimized for textual storage and retrieval. A third area is linguistic research and natural language processing. Textual analysis answers all manner of research questions. The classic is: Who was Shakespeare and were all his works written by one person? A more recent example attempted to predict the effectiveness of the 2004 presidential candidates by analyzing how they expressed concepts in their speeches.
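The Boolean keyword retrieval described above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the three-document collection and the function names build_index, search_and and search_or are invented here, and a real IR system would add stemming, stop words and persistent storage.

```python
from collections import defaultdict

# Hypothetical mini-collection: document id -> text (illustrative data only).
DOCUMENTS = {
    1: "Indexing and retrieval of library catalog records",
    2: "Text analysis of Shakespeare plays",
    3: "Catalog records and text retrieval for libraries",
}


def build_index(documents):
    """Map each lowercase keyword to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search_and(index, *keywords):
    """Boolean AND: ids of documents that contain every keyword."""
    sets = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*sets) if sets else set()


def search_or(index, *keywords):
    """Boolean OR: ids of documents that contain at least one keyword."""
    return set().union(*(index.get(k.lower(), set()) for k in keywords))


INDEX = build_index(DOCUMENTS)
print(sorted(search_and(INDEX, "catalog", "retrieval")))
print(sorted(search_or(INDEX, "shakespeare", "libraries")))
```

The index is built once and each query then reduces to set intersection or union, which is why keyword lookup stays fast as the collection grows.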

These problems all require text processing. Text consists of words (and structural syntax) combined into larger units like sentences, paragraphs and documents. A single sequence of characters is a character string or string. String manipulation includes several operations. Parsing or scanning searches a string for specific sub-strings and splits it into its constituent sub-strings. Pattern matching inspects a string and analyzes its contents against a pattern. Bifurcation splits a string in two. Concatenation joins two or more strings into one.
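As a rough illustration of these operations, the following Python sketch scans, splits, pattern-matches and concatenates a single citation string. The citation is taken from the reference list at the end of this article (shortened); the variable names are arbitrary.

```python
import re

citation = "Griswold, Ralph. The SNOBOL4 Programming Language. Prentice-Hall, 1971."

# Scanning: locate a specific substring (find returns -1 when absent).
position = citation.find("SNOBOL4")

# Splitting: break the string into its constituent parts at each ". ".
author, title, imprint = citation.split(". ")

# Pattern matching: pull out a four-digit year wherever it occurs.
match = re.search(r"\b(\d{4})\b", citation)
year = match.group(1) if match else None

# Concatenation: join pieces into a new string.
short_form = author.split(",")[0] + " (" + year + ")"

print(position, title, year, short_form, sep=" | ")
```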

Languages implement string processing through operators that manipulate strings within expressions. Functions or object methods perform additional string operations. External or callable libraries contain additional functions or object methods.

Whether a programming language is suitable for LIS depends on how well that language processes text. A suitable language means smaller, simpler programs. Fewer errors occur. Programs are easier to enhance and maintain. An unsuitable language means more code, more effort and greater likelihood of error. Programs are difficult to enhance and maintain.

Beginnings

Early programmers worked in native computer code or machine language. This was tedious, error-prone and labor-intensive. The first higher-level languages leveraged computer power to address the problem. Each reflected a different understanding of how to conceptualize programming problems, a different programming paradigm. FORTRAN views problems through the prism of numeric formulas and calculations. COBOL processes records and thereby forms the basis for library data processing. LISP resolves problems by manipulating lists.

The first language based on a string-processing paradigm was COMIT. Defects marked it as an early effort. SNOBOL (StriNg Oriented SymBOlic Language) was the first viable string processing language. It reached its final form by 1969 as SNOBOL4.

SNOBOL features exceptionally powerful pattern matching, string substitution and replacement. Problems that might require hundreds of lines of code in unsuitable languages (say, COBOL or Pascal) can be resolved in small, easily understood SNOBOL programs.

These advantages won SNOBOL its role as the primary language for specialized text analysis and research during the 1970s and 1980s. But SNOBOL remained outside of mainstream computing. The language is resource-intensive, runs only on certain computers and offers meager facilities outside of string processing. Its input/output model, mathematical abilities and control constructs are limited.

In contrast to SNOBOL’s specialization, PL/I was the first general-purpose programming language to offer strong string manipulation capabilities. It attained mainstream popularity by combining the formulaic abilities of FORTRAN with the record processing features of COBOL and integrating string processing. PL/I first proved the efficacy of the text-processing paradigm to many in computing.

Through the 1970s and 1980s, PL/I was the primary text processing language for librarians and information scientists, while SNOBOL held sway among those in specialized text analysis and university research.

Diaspora

Technology trends in the 1980s upset this consensus. Personal computers became ubiquitous and led to rising popularity for new programming languages such as BASIC, Pascal and C.

The view from LIS was that BASIC and Pascal could be quite good at string processing – but only if they went beyond the official language definitions, which included minimal string manipulation. C has many string functions and good text processing capability but is a low-level language; it takes more code to do the same amount of work with this detail-oriented language. C and its descendants (C++, Objective-C and C#) never gained much popularity for LIS applications.

As these new languages became popular, PL/I declined. Its mainframe popularity did not transfer to newer platforms like personal computers and Unix-based servers.

SNOBOL declined when its primary author developed a new language called Icon. Icon had the string processing capability of SNOBOL plus new features, but it was an entirely new and different language. The consensus on SNOBOL for advanced text analysis fragmented as some researchers stayed with SNOBOL, some went on to Icon and others drifted off to other languages.

The early 1990s saw the rise of a new programming paradigm – object-oriented programming or OOP. OOP identifies entities or objects and analyzes the processing or methods one applies to them. For example, the library Book object might include the methods Check_In, Check_Out, Add_to_Collection, Mark_as_Lost, and the like. The premier object-oriented programming languages in IT today are Java and C++. Neither has achieved much popularity in LIS programming.
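The Book example might look something like this in Python. It is only a sketch: the status values and the rules about which transitions are allowed are assumptions made for illustration, and the method names are the ones mentioned above rewritten in Python's lowercase style.

```python
class Book:
    """A library book whose status changes only through its methods."""

    def __init__(self, title):
        self.title = title
        self.status = "uncataloged"  # status values are illustrative assumptions

    def add_to_collection(self):
        self.status = "available"

    def check_out(self):
        if self.status != "available":
            raise ValueError(f"{self.title!r} cannot be checked out while {self.status}")
        self.status = "on loan"

    def check_in(self):
        if self.status != "on loan":
            raise ValueError(f"{self.title!r} is not on loan")
        self.status = "available"

    def mark_as_lost(self):
        self.status = "lost"


book = Book("The Rexx Language")
book.add_to_collection()
book.check_out()
print(book.status)
```

The point of the paradigm is visible in check_out: the object itself guards its own data, so a lost or uncataloged book cannot be lent by mistake.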

Another big trend that started in the 1990s was the open source software (OSS) movement. The defining characteristic of open source software is that it’s free. Restrictions on free use typically apply only if OSS is used to develop and sell proprietary, closed source software. OSS profoundly impacted programming. Today, free versions of nearly any programming language are available for any mainstream computer.

The Rise of Scripting Languages

Another major trend in programming languages has been a bit of a sleeper. Its slow, steady unfolding has not been fully appreciated within the computer science community. This is the trend towards scripting languages.

The difference between scripting languages and traditional programming languages is one of degree, not of kind. There is no complete agreement on what constitutes a scripting language, and reasonable people disagree about how to categorize a few languages.

The most important characteristic of scripting languages is that they are high level – each line of code in a scripting language does more than a line of code in a traditional, lower-level programming language. Scripting languages can be characterized as follows:

Glue languages. Scripting ties together existing software components. These include operating system commands, widgets, objects, functions, commands, programs and service routines. Leveraging existing code yields greater productivity.

Interpreted rather than compiled. Statements are dynamically converted to machine language and executed one line at a time. This feature means a shorter development cycle as developers get immediate feedback on errors and interactively debug programs.

Automatically manage variables. Scripting transfers some of the programming burden from person to machine by automatically managing variables. Many scripting languages do not require defining variables before use, declaring the data types and managing the lengths of variables, or defining the maximum size of tables or arrays.

The strengths of scripting languages play directly to LIS programming requirements because their dynamic nature fits the needs of text processing. For example, a classic text processing issue in many programming languages is keeping track of the lengths of character strings. While compiled languages rely on pre-defined variables and fixed-size data, most scripting languages automatically manage string lengths. Scripting is flexible and dynamic.
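A short Python sketch shows what automatic management of variables means in practice. Nothing here is declared in advance: the string grows as needed and the table of word counts has no fixed maximum size. The sample text is arbitrary.

```python
from collections import Counter

text = "to be or not to be that is the question"

# No declarations: the name, its type and its size are all managed for us.
phrase = "to be"
phrase = phrase + ", or not to be"   # the string simply grows
phrase_length = len(phrase)

# A table of word counts with no maximum size fixed in advance.
counts = Counter(text.split())

print(phrase, phrase_length)
print(counts.most_common(2))
```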

While string processing features were retroactively added to many older programming languages, scripting centers on string processing. In fact, two prominent scripting languages, Tcl/Tk and Rexx, consider all program variables to be variable-length strings.

The trend towards scripting languages benefits LIS programming because the basic nature of scripting addresses the core needs of this community.

Scripting for LIS

The sidebar lists selection criteria for LIS scripting languages. Let’s evaluate prominent scripting languages according to these criteria. These include the Unix shell languages, PHP, Ruby, Python, Tcl/Tk, Perl and Rexx. All run on any mainstream platform.

The Unix shell languages are an entire family of scripting languages with origins in the Unix operating system. Many have since become open source or free. Among the more popular today are the Korn and the Bourne-Again shells. Shell languages offer powerful string processing features, but this power is purchased at the cost of tortured syntax. Scripts are cryptic and are often hard to read due to the many special characters that have unique meanings within the language. Perhaps because of this, the Unix shell languages are rarely used outside of Unix-derived operating systems.

PHP is designed for developing websites and dynamic Web pages. It has excellent string manipulation facilities but is a special-purpose language. It is rarely used outside of the Web page code (HTML) within which it is embedded.

Ruby is a pure object-oriented scripting language that “…has many features to process text files…,” as its creator relates in his website language summary at www.ruby-lang.org/en/. Ruby has received some publicity but is new and not widely used.

Python is a general-purpose, object-oriented scripting language that has been applied to a wide variety of problems. It includes strong string manipulation features in its string module, which supplies many of the string-oriented functions of the language. Python supports regular expressions, a way to succinctly describe string patterns for parsing and searching. The book Text Processing in Python (Addison-Wesley, 2003) describes the language’s string-manipulation facilities at length.
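As a small illustration of regular expressions used for parsing, the sketch below uses Python's standard re module to pull a year and a title out of catalog-style lines. The sample lines and their layout (a call number ending in a four-digit year, at least two spaces, then a title) are invented for this example and do not follow any real catalog format.

```python
import re

# Hypothetical catalog lines, invented for illustration.
records = [
    "QA76.73.S6 G75 1971  The SNOBOL4 Programming Language",
    "QA76.73.R24 C69 1990  The Rexx Language",
    "no call number here",
]

# Assumed layout: call number ending in a four-digit year, 2+ spaces, then a title.
pattern = re.compile(r"^(?P<call>\S.*?(?P<year>\d{4}))\s{2,}(?P<title>.+)$")

parsed = []
for line in records:
    m = pattern.match(line)
    if m:  # lines that do not fit the pattern are skipped
        parsed.append((m.group("year"), m.group("title")))

print(parsed)
```

Named groups such as (?P<year>...) keep the pattern readable, which matters when someone other than the author has to maintain it.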

Tcl/Tk is an embeddable, extensible command language for issuing commands to text editors, debuggers, illustrators and shells. Its Tk toolkit provides a popular graphic user interface (GUI) that programmers use to create windowing interfaces for scripts written in Tcl and other languages. Tcl stores all data as character strings. The introduction to Tcl at the Tcl Developer Xchange states, “The easiest way to understand the Tcl interpreter is to remember that everything is just an operation on a string.” That Tcl is based on the string-processing paradigm underscores this trend in scripting languages. Originally designed with a rather narrow focus, newer versions of Tcl/Tk add features that generalize the language. Tcl applies to an increasingly wide range of programming problems.

Perl and Rexx are the two open-source scripting languages that are most suitable for LIS programming. These two languages address both classes of LIS problems (IT problems and those unique to LIS). They are general purpose, yet string manipulation capabilities are at their core. Both run on nearly any platform. Their strong standards ensure that they operate consistently wherever they run. Perl and Rexx are quite popular worldwide, with a couple million Perl programmers and up to one million Rexx developers. Given their importance to LIS, we now describe them in detail.

Perl

Perl is a general purpose scripting language that has superior text processing capabilities. Its creator states this objective in the opening sentence to his book, Programming Perl: “Perl is a language for easily manipulating text, files and processes.” The basic piece of data in Perl is the scalar, which holds either a string or a number. (Perl converts between the two as necessary). Perl has the standard string operators backed up by built-in functions for string manipulation, data conversion and the like.

Perl’s pattern matching and string substitution capabilities are outstanding because Perl implements them through regular expressions, a method of succinctly describing string patterns with maximum flexibility. Regular expressions give Perl string processing power on the order of SNOBOL’s.
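Substitution with regular expressions can be sketched as follows. The example is written in Python rather than Perl to stay consistent with the other examples in this article; Python's re module uses a Perl-style pattern syntax, so the patterns themselves would look much the same in Perl. The sample entry is adapted from the reference list below, with extra spaces added deliberately.

```python
import re

entry = "Korn,  David   and Morris Bolsky.   The New KornShell"

# Collapse every run of whitespace to a single space.
tidy = re.sub(r"\s+", " ", entry)

# Swap "Surname, Forename" into "Forename Surname" using capture groups.
swapped = re.sub(r"^(\w+), (\w+)", r"\2 \1", tidy)

print(tidy)
print(swapped)
```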

Perl’s programmers have created the open source Comprehensive Perl Archive Network (CPAN). This resource includes thousands of free function libraries (or modules) that support almost any imaginable interface or functionality.

The downside to Perl is its complexity. The language has tortuous syntax it absorbed from its Unix antecedents. It employs a dizzying array of special variables and operators with the result that no possible keystroke is left without some special meaning.

Clear programs can be developed in Perl. But the language and its culture encourage clever, pithy programs. In text processing one can use regular expressions to devise very brief scripts that might require many times more code in other languages – and that can never be understood by anyone other than the person who wrote them.

Rexx

Rexx was invented for IBM mainframes in the early 1980s with the goal of combining power and ease of use. Once limited to the IBM universe, Rexx has found new popularity within the burgeoning open source movement. Today there are at least eight free Rexx interpreters in worldwide use. While all meet the Rexx language standard, each presents different strengths. For example, some run on specific platforms or support object-oriented programming.

Rexx is a general purpose programming language rooted in the string-processing paradigm. The only data type in Rexx is the string. Variables contain strings and their values are manipulated according to their contents. For example, two strings containing digits can be subject to arithmetic operations. Or they may be treated as character strings and manipulated by any character string operation. Strings and tables (or arrays) are inherently assumed variable-length, and Rexx internally manages lengths for the programmer. The string-processing paradigm provides exceptional flexibility.

Rexx includes operators and instructions for concatenation, parsing and standard string processing. It includes three dozen built-in functions designed for string processing and many that operate on words (character groups delimited by spaces).

Rexx comes with free external function libraries and interfaces for nearly any purpose. These offerings are less numerous than Perl’s CPAN but comprehensive enough that nothing required for LIS programming is missing.

Rexx’s outstanding characteristic is its ease of use. Its creator states in the first sentence of his book The Rexx Language: “Rexx has been designed with just one objective… to make programming easier than it was before.” Rexx programs are quick to develop, easy to debug and contain fewer errors. The programs can easily be altered, enhanced and maintained.

Perl Versus Rexx

With regular expressions for pattern matching, the huge CPAN library and a larger worldwide user base, Perl offers several advantages over Rexx.  Perl is a popular general-purpose language, yet it offers the string manipulation power once found only in off-the-beaten-path SNOBOL.

But Perl has difficult syntax, many special variables and operators, default variables and other features that make it a difficult language to work in. Perl programs can be written clearly, but many are not. Many are difficult to enhance and maintain. Perl works best for professional developers who program full-time and researchers who delve deeply into the language.

Rexx lacks Perl’s regular expressions for pattern matching but is fully rooted in the string-processing paradigm. It provides all the text-processing features LIS programming requires.

While Perl is one of the more difficult languages to learn, read and understand, Rexx is one of the easiest. Rexx is the most accessible string processing language, while Perl shares the power (and sometimes the obtuseness) of SNOBOL.

Trends Today

LIS presents unique requirements for programming languages. Text processing is central to those requirements. For years, researchers favored SNOBOL, the language that proved a wide range of programming problems could be solved through the string-processing paradigm. LIS practitioners preferred the more mainstream PL/I, which proved the central role of string processing in general-purpose programming languages.

Today we witness a major shift in mainstream computing towards the string-processing paradigm. The new open-source scripting languages drive this trend. The most popular scripting language in the world, Perl, is a superior text-processing language. Rexx also provides excellent string processing yet is much easier to learn and use. Scripting languages like Rexx and Tcl/Tk highlight the ascendancy of the string-processing paradigm in that they recognize only one kind of variable – the string.

Trends in data representation parallel those in programming languages. EXtensible Markup Language or XML changes data into self-descriptive text. XML stores all data as text with descriptive identifiers or tags and requires that programs perform string manipulation to process the data. The string-processing paradigm thus spreads from programming languages to the data itself.
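A minimal Python sketch makes the point concrete. The record and its element names are invented for illustration; the parsing uses the standard xml.etree.ElementTree module. Every value, including the year, arrives as a string, and the program decides how to treat it.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; the element names are invented for illustration.
xml_text = """
<book>
  <author>Cowlishaw, Michael</author>
  <title>The Rexx Language</title>
  <year>1990</year>
</book>
"""

root = ET.fromstring(xml_text)

# Collect each child element's tag and text; all values are strings.
fields = {child.tag: child.text for child in root}

# Ordinary string manipulation and conversion then take over.
surname = fields["author"].split(",")[0]
year = int(fields["year"])

print(fields)
print(surname, year)
```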

Text processing, once the ugly duckling of computer science, has become the belle of the ball.

Sidebar

Selection Criteria for LIS Programming Languages

Selection criteria vary by project, but these are core criteria common to most LIS projects:

· String-oriented – String processing facilities are critical, and full grounding in the string processing paradigm is preferable.

· General purpose – General-purpose languages with excellent string processing facilities are preferable to specialized languages because they apply to library IT problems as well as text processing and analysis.

· Mainstream – Popular, mainstream languages offer the benefit of a larger user community, more add-on products and external libraries, better support and a larger labor pool.

· Universal – Languages that run on all platforms preserve code investment as equipment changes or is upgraded.

· Easy – Languages that are easy to learn, use and maintain lead to fewer errors, higher reliability, higher productivity, easier maintenance and enhancement.

· Open Source – Not only are open source languages free, they free their users from restrictive license agreements and vendor attempts at planned obsolescence and forced upgrades.

· High level – Higher-level languages are more productive than detail or lower-level languages. They also result in smaller, less complex programs that are easier to enhance and maintain.

· Standardized – Languages that enjoy standards from the American National Standards Institute (ANSI) enhance the value of scripts because the code is a known quantity that can more easily be maintained, ported, enhanced and upgraded.


Language Resources

Standard Reference Works

Language                    Reference

SNOBOL                       Griswold, Ralph. The SNOBOL4 Programming Language. NJ, Prentice-Hall, 1971, 2nd Ed.

ICON                            Griswold, Ralph. The ICON Programming Language. Peer to Peer Publishing, 2000, 3rd Ed.

PL/I                              Hughes, Joan. PL/I Structured Programming. NY, Wiley Text Books, 1986, 3rd Ed.

Korn                             Korn, David and Morris Bolsky. The New KornShell Command and Programming Language. NJ, Prentice-Hall, 1995, 2nd Ed.

Bourne-Again Shell        Newham, Cameron and Rosenblatt, B. Learning the bash Shell. Sebastopol, CA, O’Reilly, 1998, 2nd Ed.

PHP                             Moulding, Peter. PHP Black Book. CA, Paraglyph Publishing, 2001.

Ruby                             Matsumoto, Yukihiro. Ruby in a Nutshell. Sebastopol, CA, O’Reilly, 2001.

Tcl/Tk                           Ousterhout, John. Tcl and the Tk Toolkit. Reading, Addison-Wesley, 1994.

Python                          van Rossum, Guido. The Python Language Reference. New Theory Ltd, 2003.

Perl                              Wall, Larry. Programming Perl. Sebastopol, CA, O’Reilly, 1996, 2nd Ed.

Rexx                             Cowlishaw, Michael. The Rexx Language: A Practical Approach to Programming. NJ, Prentice-Hall, 1990, 2nd Ed.

Websites

These are key websites for the free and open source programming languages discussed in this article.

Language                   Website

SNOBOL                       www.snobol4.org/, www.snobol4.com

ICON                            www.cs.arizona.edu/icon/, www.engin.umd.umich.edu/CIS/course.des/cis400/icon/icon.html

PL/I                              www-306.ibm.com/software/awdtools/pli/, http://home.nycap.rr.com/pflass/pli.htm, http://pl1gcc.sourceforge.net/

Korn                             www.cs.mun.ca/~michael/pdksh/, www.osxgnu.org/software/Shells/pdksh/

Bourne-Again Shell        www.gnu.org/software/bash/bash.html, www.tldp.org/HOWTO/Bash-Prog-Intro-HOWTO.html, http://tille.soti.org/training/bash/

PHP                             www.php.net/, http://php.resourceindex.com/, www.phpbuilder.com/

Ruby                             www.ruby-lang.org/en/

Tcl/Tk                           www.scriptics.com/, http://dev.scriptics.com/software/tcltk/, http://hegel.ittc.ukans.edu/topics/tcltk/

Python                          www.python.org/, www.pythonware.com/, www.activestate.com

Perl                              www.perl.org/, www.perl.com/, www.cpan.org/

Rexx                             http://regina-rexx.sourceforge.net/, www-306.ibm.com/software/awdtools/rexx/language/, http://users.comlab.ox.ac.uk/ian.collier/Rexx/rexximc.html

Selected Bibliography

Information Retrieval

Baeza-Yates, Ricardo and Berthier Ribeiro-Neto. Modern Information Retrieval. Reading, Addison-Wesley, 1999.

Berry, Michael. Survey of Text Mining: Clustering, Classification and Retrieval. NY, Springer Verlag, 2003.

Chakrabarti, Soumen. Mining the Web: Analysis of Hypertext and Semi-Structured Data. San Francisco, CA, Morgan Kaufmann, 2002.

Witten, Ian, Alistair Moffat, and Timothy Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. San Francisco, CA, Morgan Kaufmann, 1999.

LIS Programming

Brown, Martin. XML Processing with Perl, Python and PHP, Alameda, CA, Sybex, 2001.

Cooper, Michael D. Design of Library Automation Systems: File Structures, Data Structures, and Tools. NY: John Wiley & Sons, 1996.

Davis, Charles H. Illustrative Computer Programming for Libraries: Selected Examples for Information Specialists. Westport, CT, Greenwood Press, 1974.

Davis, Charles H., Gerald W. Lundeen, and Deborah Shaw. Pascal Programming for Libraries: Illustrative Examples for Information Specialists. Westport, CT, Greenwood Press, 1988.

Fosdick, H. Structured PL/I Programming: For Textual and Library Processing. Littleton, CO, Libraries Unlimited, 1982.

Mertz, David. Text Processing in Python. Reading, Addison-Wesley, 2003.

Salton, Gerald. Automatic Text Processing. Reading, Addison-Wesley, 1989.

Linguistics Programming

Grishman, Ralph. Computational Linguistics. Cambridge, Cambridge University Press, 1986.

Hausser, Roland. Foundations of Computational Linguistics: Human-Computer Communication in Natural Language. NY, Springer Verlag, 2001.

Lawler, John and Helen Dry. Using Computers in Linguistics: A Practical Guide. Routledge, 1998.

McEnery, Tony and Andrew Wilson. Corpus Linguistics. Edinburgh, Edinburgh University Press, 2001, 2nd Ed.

Natural Language Processing

Allen, James. Natural Language Understanding. Addison-Wesley, 1994, 2nd Ed.

Iwanska, Lucja and Stuart Shapiro. Natural Language Processing and Knowledge Representation. AAAI Press, 2000.

Jackson, Peter and Isabelle Moulinier. Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization. Amsterdam, John Benjamins Pub Co, 2002.

Manning, Christopher and Hinrich Schütze. Foundations of Statistical Natural Language Processing. Cambridge, MA, MIT Press, 1999.

Programming for the Humanities

Corre, Alan. Icon Programming for Humanists. NJ, Prentice-Hall, 1990.

Hockey, S. Electronic Texts in the Humanities: Principles and Practice. Oxford, Oxford University Press, 2001.

Hockey, S. A Guide to Computer Applications in the Humanities. Johns Hopkins University Press, 1983.

Hockey, S. SNOBOL Programming for the Humanities. Oxford, Clarendon Press, 1985.



American Society for Information Science and Technology
8555 16th Street, Suite 850, Silver Spring, Maryland 20910, USA
Tel. 301-495-0900, Fax: 301-495-0810 | E-mail: asis@asis.org

Copyright © 2005, American Society for Information Science and Technology