Introduction to Definite Clause Grammars

Theoretical Linguistics is concerned with articulating the regularities in a natural language; the generalizations it arrives at are formulated without concern as to how those rules might be utilised. To take a particularly simple example,

1) a sentence can be a noun phrase followed by a verb phrase.

As a generalisation about sentences of English, it holds irrespectively of whether one considers their construction, by speakers, or their interpretation by an audience. Still less is it concerned with how a computer might process sentences.

On the conventional, symbolist, approach to natural language processing, at some level the computer will be following instructions, such as one might render informally as

2) to test if a List of words is a sentence :-
      split the List into a Front and Back, and
      test if the Front is a noun phrase, and
      test if the Back is a verb phrase.  

At this level, all sorts of procedural issues must be resolved: how to represent sentences (e.g. as lists of words), in what direction to parse them (e.g. from left to right, and top-down). Care may be needed in the formulation of certain rules to avoid sending the machine into an infinite loop. Although Computational Linguistics must be appraised of all such issues, its ideal level of representation is one intermediate between the abstract principles of Theoretical Linguistics and the lower level of a programming language. The latter is cluttered with procedural detail which can obscure the essential linguistic principles.

We want to be able to formulate the principles in a format which can be utilised by a computer, but without worrying all the time about the details of their execution. This is a familiar strategy in Artificial Intelligence; linguistic knowledge is a form of knowledge, so this exemplifies the idea of employing special knowledge representation formalisms to represent different kinds of knowledge - in this case, a formalism dedicated for the expression of linguistic principles in a form which can be interpreted directly by a machine. Definite Clause Grammars, or DCGs, are one such formalism: they provide us with a high level grammar formalism, but one which is efficiently interpreted. Most Prolog programming environments have fully integrated DCG interpreters. In DCG format, rule 1) can be stated

sentence --> 
  noun_phrase, 
  verb_phrase .

or more briefly,

s --> np, vp. 

s "goes to" np, vp; i.e. a sentence can be realised as an np followed by a vp. We will start with a simple exercise: to write a DCG which can handle the following examples, in the sense that it will recognize them as grammatical sentences. (We shall employ Prolog lists to record sentences - square brackets enclosing 0 or more items, each separated from the next by a comma.)

[fred, wriggled]
[mary, employs, john]
[she, paid, him]
[the, bat, ate, a, spider]
[some, lecturer, taught, the, students]

One thing which will be required is a record of the words our grammar is to know about, classifying each one under the most basic linguistic category into which it falls: a lexicon. In DCG format these basic lexical records might take the form:

Common Noun --> 'woman'.

(A common noun can be realised by the string 'woman' - or, the string 'woman' can be classified as belonging to the category Common Noun.) In our case this will actually take the form,

cn --> [woman] . 

Other syntactic categories required for this fragment are those of Proper Name, Pronoun, Determiner, Transitive and Intransitive Verb. Examples of each of these are:

pn --> [john] .
pro --> [she].
det --> [every] .
tv --> [employs] .
iv --> [wriggled] .

Proper names and pronouns are both instances of a more inclusive category, that of noun phrase. That is, words from either of these two categories can play exactly the same syntactic roles within an English sentence; they can sit in the same places within a sentence. This means that one such word can always be substituted for another without affecting grammatical acceptability. (The issue is not the truth or falsity of any sentence produced in this way, nor even its meaningfulness, but simply whether it is grammatically acceptable to replace one constituent by another.) We would miss some significant generalisations if we did not include this piece of linguistic taxonomy - that both pns and pros can be re-classified under the higher category of np.

np --> pn .
np --> pro .

They are the simplest examples of noun phrases, syntactically unstructured (strictly speaking this is an over-simplification, since proper names can be structured e.g. by first name and surname). But there are also structured phrases which can occupy these positions we have labelled 'noun phrase', those such as [the, students].

np --> det, cn .
vp --> iv .
vp --> tv, np .

At the Prolog level, English phrases are held as difference lists. This is a prime example of a programming detail which the DCG formalism abstracts from. Whenever the DCG employs s/0, the underlying Prolog interpretation of this is in terms of the 2-place predicate s(List1, List2). These difference pairs take care of the splicing of lists of words automatically. This means that we can (and must) abstract from this detail at the level of the DCG; conversely, if we want to check things at the Prolog level, we must specify the two extra list arguments. (Thus e.g. to get Prolog to run through all sentences that this grammar generates, we could query all solutions to s(X, []).)

We can define our own predicate to translate between Prolog and the grammatical categories of the DCG:

parse(List) :-
  s(List, []) .

We can then test to see if particular sequences of words can be parsed as sentences:

?- parse( [mary, employs, john] ).

Alternatively, we can use the built-in predicate phrase:

?- phrase(X, [mary, employs, john]).

?- phrase(vp, X).

etc.

 


Exercises

Complete the DCG so that it can parse all the sentences given above. (Suggestion: you can copy code out of this document and paste it into a program window, but you should re-arrange the order so that the grammar is nicely presented. All the structural rules should be grouped together. Likewise, all the entries for each grammatical category should occur together - keep all pns together, all pros, etc. Save it as dcg1.)


© INGENIO.co.uk