Introduction To Computer Languages (2)

Introduction

Many years ago I wrote an article that tried to provide an overview of different computer languages. It was a small success, but I was never very happy with it. This is a new attempt - probably shorter and simpler. The original can still be found at, for example, the Internet archive.

The aim here is the same as before - to give someone fairly new at programming a clearer view of the range of ideas that are out there. I am not trying to catalogue every different language so much as trying to give an idea of the variety that exists.

Unlike the previous version, I have not explained every term or provided a list of languages that illustrate particular ideas or paradigms. This is only intended as a high-level starting point for people who can then go on to search for more information (links are google searches for the appropriate term). Think of this document as a fairly readable, concise summary of what you may not know: it's good to know what you don't know.

I have arranged things by dividing the discussion into five "families" of languages; these are not formal groups and many languages use ideas from several. After these families I discuss a few general points.

Family 1 - Object Oriented Languages

I guess that object-oriented (OO) languages are the most common these days, and probably what most people learn first. Objects bundle together values and commands in a single entity - programs are then collections of interacting objects.

Typically an object represents some concept with a program. For example, perhaps someone's bank account. Such an object might have a method (a collection of commands) that can modify the bank account balance (a value stored within the same object). The general idea is that each object manages its own values using the code in its methods; to change values in a separate object the methods in that object should be called. The idea that each object is uniquely responsible for managing its own values is called encapsulation.

Classes / Templates

Above we discussed using an object to represent a bank account. A banking program might deal with many bank accounts, each with their own balance, and so a programming language has to address how best to manage many similar objects (a programmer does not want to have to write separate code for each bank account when they share so much in common). One way of doing this is to use classes, which include all the methods (commands) common to all related objects (called class instances). However, this is not the only approach - it is also possible to create new objects by using other, existing objects as templates.

In addition to providing a way for creating many identical objects with a single description, classes can typically extend other classes. This is called inheritance.

Inheritance introduces the possibility of complex behaviour when one class extends another with a method whose name is identical to a method that has already been defined. Code in "earlier" classes that refers to the method by name might invoke either the "old" or "new" method. Using the new method gives a powerful mechanism for the programmer to "customise" existing classes, but can also introduce subtle errors. The rules for controlling which method is invoked vary from language to language.

In a simple class based system, a method is associated with a single class. The method invoked when a particular name is used then depends on the name of the method and the object's class. This is "single dispatch". In such languages the program text is often written in a way that identifies the single object associated with the method in a special way. Typically the object name precedes the method name and then other method parameters are written as function parameters; within the method the object is implicit or referred to with a special keyword like "self" or "this". Alternatively, methods may be associated with combinations of several classes. This is multiple dispatch and uses a syntax in which all method arguments are treated like function parameters (as I write this I can only think of one language that is multiple dispatch - Common Lisp).

Classes as Types

Not all entities within a program are the same; the way in which an entity can be used is reflected in its type. So, for example, a number and a string are usually different types since arithmetic operations do not apply to strings. Programming languages vary in how they use information about the types of entities; those that use the information to validate some properties of a program before it runs are called statically typed languages; the logical structure that underlies the types (and so indicates whether or not a program is valid) is called a type system.

In OO languages, different objects can be used in different ways (depending on what methods they have). In statically typed languages the objects must be integrated within the type system. One approach - conceptually perhaps the simplest, since it mirrors the typical runtime behaviour in a program that is not statically typed - is to treat each method name in isolation. So a program is invalid if at some point the programmer attempts to use an object with a method name that is undefined (for that object).

In practice duck typing (the process just described) is not used much in statically typed languages (perhaps because it is difficult to track what methods are associated with each object without additional annotation in the program, and because additional annotation for each object is too verbose). Instead, validation is typically performed using classes: each object is associated with a particular class and only methods defined for that class can be invoked on the object. This reduces the complexity of the type inference and/or annotation that is required to validate the code, but makes classes take a more critical role in the language (it is harder to have unusual - perhaps dynamically changing - objects that do not exactly match any one class).

An important aspect of type systems is how generic uses are validated. For example, a process may require only that entities support some simple operations (like sorting), or a collection (array or list) may store a variety of different types. OO languages can use inheritance to help manage this. Sorting may be defined using a class with simple requirements (perhaps a single method for comparing two values), and can then be used with any subclass (any class that extends that class). Allowing a class to extend many different classes (in parallel) is necessary to use this "trick" with more than one simple type and is called "multiple inheritance". However, multiple inheritance can introduce other problems (particularly method resolution when two different parent classes share a common superclass); some languages therefore allow only single inheritance of classes, but allow multiple inheritance of purely abstract classes or "interfaces" (ie. class signatures with method names, parameters, and return types but no implementation).

Family 2 - Procedural Languages

This family of languages has fallen out of favour recently. They are perhaps best described as OO languages without objects. Examples include C, Fortran, Algol, Cobol, and Pascal.

Historically (in very simple terms) OO languages can be seen as an attempt to solve the problems that were experienced with procedural languages, where changing values could introduce subtle bugs that were difficult to track down. In practice it is possible to program in an "OO style" with many procedural languages. This can be done simply by following careful conventions or it can involve exploiting other language features to enforce encapsulation. In C, for example, structs can be used with function pointers to create ad-hoc objects (without inheritance).

Family 3 - Functional Languages

Mutable State

Earlier I mentioned how OO was one attempt to manage changing values (mutable state) within a program. Functional programs are a separate line of attack which work by not allowing values to change "at all". This seems counter-intuitive (or impossible), but there are two details that make it work. First, you can always calculate a new value and pass that on to some other part of the program - it is only "old" values that are fixed. Second, each time a function is called, any variables introduced within the function can take on new values. Together these mean that a program can still work, but that things similar to both "global variables" and objects are impossible.

So programs written in functional languages tend to be very explicit about the values that can change. These values are identified clearly and the program is structured around them (by, for example, writing functions that modify those values).

The above isn't a very good explanation; I don't know how to explain the constraint on mutable state better and only understood the idea myself by writing some programs using functional languages. Perhaps this will help: one way in which I view functional programs is something like a workshop that makes a sculpture. The sculpture sits in the middle of the workshop and various tools remove or add parts until we have a final work of art; the sculpture is the data, the workshop is the program, and the tools are functions.

In practice you can write in a "functional style" in a procedural or OO language by taking care to be explicit about managing data (in which case OO doesn't add much to a basic procedural language). However, functional languages add a number of tools that help simplify the work involved. First, functional languages provide many ways to build data structures and to access parts of those structures in a simple way. Second, they support efficient tail recursion - this allows functions to call themselves (or a set of closely related functions to call each other) repeatedly without running out of memory. This is useful because recursion is used instead of iteration (for example instead of changing a counter value within a loop, which is not allowed, a function might call itself, passing in the current counter value "plus one" for each loop).

From an explanation alone it is almost certainly unclear how disallowing mutable state will improve a program. In practice, making the state explicit tends to simplify the program logic, while the extra tools available to manipulate data help make the code clearer. On the other hand, it can sometimes be annoying to constantly have to handle state. As a consequence some languages do allow mutable state as a special case (typically with an ugly syntax) - these are called "impure" functional languages.

First Class and Higher Order Functions

A second characteristic of functional programming languages is the emphasis on the use of functions (this is related to the theoretical basis for functional languages, lambda calculus). Functions are "first class" which means that they can be manipulated within the program in a similar way to other entities (passed as values, stored in collections, etc). They can also be "higher order" - a related term that describes a function that takes another function as an argument.

A language that supports first class functions must be concerned with variable scope. This defines how are variables are bound to values when a function is evaluated and becomes important when a function can be passed to elsewhere in the program before it is evaluated.

Most languages use lexical scope (the alternative, static scope, was used in some early Lisp implementations). This means that the values used are those available to the function when it was defined. So if a function uses the value of a variable "x", for example, that value is the value of the variable called "x" that was defined around where the function was defined and not any other variable that "by chance" happens to be called "x" near where the function is evaluated.

This may seem an obscure point, but it requires closures for implementation which gives programs written using functional languages a powerful tool for encapsulating data. In non-strict functional languages closures can be used in a very similar way to objects in OO languages.

Declarative

Programs written in functional languages are usually considered more declarative than the equivalent in procedural or OO languages. What does this mean and why does it happen?

"Declarative" means that the program places more emphasis on describing the relationship between entities than on the fine details of how to manipulate those entities. This will become clearer when we look at logic languages (below).

This shift in emphasis is partly due to reasons already mentioned. More powerful tools for managing data structures mean that less of the program is spent in low-level details. The use of recursion instead of iteration also gives a more "logical" feel to code. First class and higher order functions allow actions to be expressed in concise, flexible ways.

Lazy Evaluation

All(?) OO and procedural languages, and most functional programming languages are "eager". This means that function calls are evaluated when they are first mentioned. Some functional languages, however, are "lazy" - functions are only evaluated when the result is "needed" (eg. for display to the user). This allows expressions to name results before they are calculated and so handles infinite expressions; it also allows infinite series to be handled in a simple way. The disadvantage is that the order of evaluation (and so the memory use) becomes less easy to predict. This explains why lazy evaluation is only practical in purely functional languages - in a language with side effects the "unpredictable" evaluation strategy would make global state unreliable.

Because lazy evaluation "automatically" determines execution order it helps make programs declarative (the order in which operations occur becomes more implicit, leaving the "fixed relations" between entities).

Type Systems

Perhaps because functional languages have tended to be associated with academia, much of the research into type systems has been made with functional languages. So statically typed functional languages tend to have some of the most expressive and complex type systems. These may address issues like modularity, polymorphism, type classes, dependent types, etc.

One important idea, handling generic types, was already discussed in the context of OO languages with inheritance. Statically typed functional languages typically use a more general form of polymorphism to solve the same problem and this has filtered back into some OO languages as "generics" (C++ templates are perhaps more closely related to code generation, which I will discuss elsewhere).

Note that not all type systems require annotations. Some may infer (perhaps only partially) information from the source code (for example, Hindley Miller). And some type systems are so expressive they are Turing complete.

Multi-Paradigm Languages

Much of this section has been rather vague. This is because, in my opinion, the value of functional programming is not that closely tied to the definition of functional programming as a lack of mutable state (and first class functions can often be approximated using objects or function pointers). So the tools and programming style that characterise functional programming can often be used in other (OO and procedural) languages. This is good news because it means that many of the advantages of functional programming languages can be exploited more widely. And I think this is happening for a number of reasons: the work done in extending virtual machines originally intended for OO to support a wider range of languages (CLR, Da Vinci); the investment Microsoft has made in the Functional Programming (FP) community; the increased popularity of languages that support several paradigms (main examples are probably Scala and OCaml, but Java and C# are also adopting ideas developed first in functional languages).

Family 4 - Logic Languages

(I am on less familiar ground here as I have little experience with logic languages.)

I mentioned earlier that functional languages are considered more declarative that OO or procedural ones, and that lazy evaluation also helps. The idea can be taken further by automating the process of searching for a solution to relationships defined by the programmer. In other words, a program consists of a list of desired relationships; executing the program is equivalent to finding a solution that satisfies all the relationships.

There are various ways this can be implemented - depth or breadth first search, or by enumerating all possibilities. Some languages allow the programmer to improve efficiency and/or control memory use by, for example, indicating where backtracking is not needed.

Since the execution order is unpredictable the programmer will not always know which variables have values and which are still undetermined. Within the language this is expressed using logic variables - these are either undetermined or bound to a value.

One interesting area in which logic languages prove to be powerful (apart from the obvious rule-based and database software) is implementing languages (for example, Erlang was first implemented in Prolog). This can be partly understood by considering the relationship between recursive descent parsing and depth first search (then, I think, something similar applies to interpreting the AST).

As with other paradigms, logic languages may be statically typed (eg. Mercury).

Multi-Paradigm Languages

Oz/Mozart is an interesting language/system that attempts to integrate logic programming with other paradigms. It combines both explicit execution flow (in a multithreaded environment where logic variables are used for synchronisation; threads block on accessing an undefined value) and the possibility to search over values (I strongly recommend Peter Van Roy's book, Concepts, Techniques, and Models of Computer Programming).

Family 5 - Stack Languages

Standard Implementation Techniques

(This sub-section is general background; it is not about stack languages but I wanted to clarify how stacks are used across most language implementations.)

Most languages use a stack to store local variables (those created within a function which can be discarded when the function returns). The stack increases with each function call and decreases with each return. In this way variables are automatically managed with the correct lifetime. Values that must exist beyond the scope of the function are located separately (on the "heap").

There are some complications: continuations require stacks to be preserved (often meaning that the stack becomes a tree with backwards references, called a spaghetti stack); closures may require values to be moved from the stack to the heap; values on the heap require more careful management, since the memory is not freed automatically (this may be manual or automatic, via "garbage collection").

Stack Languages

In stack languages the management of variables via the stack is made explicit. Access to values is made by relative position on the stack rather than via names. Such languages may be simpler and more compact, but the loss of direct naming of local variables - perhaps the most fundamental abstraction - can make programming more difficult.

Other Topics

Code Generation

In some languages the source code itself can be manipulated. Languages in the Lisp family exploit a uniform, logically consistent syntax that unifies code and data. Prolog has similar abilities. Statically typed languages present more of a challenge, but significant progress has been made via staged evaluation.

Concurrency

Many languages allow multiple "processes" (in a general sense) to run at once. Typically this is based on threads, but coroutines are one alternative approach. When threads are used they may be managed directly by the language, or exploit functionality provided by the operating service. Threads managed by the language tend to be lighter weight, but are less likely to run on multiple processors.

One of the most challenging aspects of programming with multiple threads is dealing with shared mutable state. Traditionally access to shared mutable values are controlled by mutexes, but this can easily lead to deadlock. Alternative approaches include software transactional memory, explicit message passing, or avoiding the issue with pure functional languages.

Erlang is particularly notable in how it addresses not just concurrency, but related issues of reliability and availability.