# ftwmaxent Tutorial

This page walks you through using the ftwmaxent module to fit and test some simple models.
First install the software as described here: install.html. If you can type the following:
```user@host \$ python
Python 2.4.1 (#1, Apr  1 2005, 13:34:43)
[GCC 3.2] on linux2
>>> import ftwmaxent
>>>

```
then you are ready to go. (This tutorial doesn't require the pypar module for parallel processing, so you can ignore any messages about this.)

## 2  Machine translation example of Berger et al.

The paper of Berger et al. (1996) gives the following example:
We wish to model the translation of the English preposition `in' into French. We observe parallel corpora and note the following:
1. The translation is always one of dans, en, à, au cours de, or pendant.
2. 30% of the time the translation is either dans or en.
3. 50% of the time the translation is either dans or à.
We will suppose these three facts represent `truth' and constrain our model p as follows:
p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
p(dans) + p(en) = 0.3
p(dans) + p(à) = 0.5
You can do this with the following commands:
```import math
import ftwmaxent, ftwutils

a_grave = u'\u00e0'
samplespace = ['dans', 'en', a_grave, 'au cours de', 'pendant']

def f0(x):
return x in samplespace

def f1(x):
return x=='dans' or x=='en'

def f2(x):
return x=='dans' or x==a_grave

f = [f0, f1, f2]
model = ftwmaxent.Model(f, samplespace)
K = [1.0, 0.3, 0.5]

```
Now we have created a model with the specified list of feature functions to constrain in the mean. Note that the features are indicator functions, returning True (1) or False (0), and constraining their expectations is equivalent to the constraints (1-3) on probabilities.
To fit the maximum entropy model subject to expectation constraints Efi(X) = Ki, we use the command:
```model.fit(K)

```
This will iteratively modify the parameter vector model.theta from its initial value of zero. After successful termination the constraints are satisfied to within the default absolute tolerance of 10-5 and the model has maximal entropy (most `uniformity') among all possible models satisfying the constraints.
To show the model parameters, type:
```>>>model.theta

```
To display the model probabilities assigned to each word in the sample space, type the following:
```>>> p = model.probdist()
>>> print "Fitted distribution is:"
>>> for x in model.samplespace:
>>>     print ("\tx = %-15s" %(x + ":",) + " p(x) = "+str(p[x])).encode('utf-8')
>>>

```
The output will be similar to:
```Fitted distribution is:
x = dans:           p(x) = 0.185859147419
x = en:             p(x) = 0.114145831992
x = Ã :              p(x) = 0.314137927881
x = au cours de:    p(x) = 0.192928546354
x = pendant:        p(x) = 0.192928546354

```
We have now fit a constrained maxent model on a small sample space. This procedure could be used for many natural language applications of involving tagging, chunking, and conditional (word-by-word) language modelling. For a detailed account of different natural language applications using this framework for inference, see the PhD thesis of Ratnaparkhi [2]. For conditional language modelling, see the paper [3] by Rosenfeld.

## 3  A larger example: modelling sentences

We now consider a modelling task on a larger (discrete) sample space-the space of sentences. There are too many possible strings of words to compute feature expectations by summing over them all, which makes fitting a model more difficult. It is necessary to determine a model's feature expectations at each iteration during fitting, so we will estimate them by importance sampling: we will sample random strings of words x1,x2,... and use these to estimate the feature expectations of the current model.
We consider a simple model of sentences with two types of constraint: unigram and sentence-length. We have the following `corpus' saved as alice1.txt:
Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought Alice `without pictures or conversation?'
So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.
There are only 78 unique words in these two paragraphs, but the number of possible `sentences' up to ten words in length is more than 7810 » 8×1018.
We read unigram (word-token) frequencies from disk with the following commands:
```>>> import string,re
>>> import ftwmaxent,ftwutils
>>> corpus = [re.split("[" + string.whitespace + string.punctuation + "]+", line.strip()) for line in open("alice1.txt") if line.strip()]
>>> unigrams = {}
>>> for s in corpus:
>>>     for w in s:
>>>         unigrams[w] = unigrams.get(w,0) + 1
>>>

```
The contents of unigrams is now a dictionary (hash) of word frequencies:
```{'and': 4, 'own': 1, 'chain': 1, ...

```
A slow but intuitive way to impose unigram frequency constraints on the model is to define a function for each of the 78 unique words indicating how many occurrences of that word is in the sentence:
```>>> fi = lambda w: lambda s: s.count(w)
>>> f = [fi(w) for w in unigrams]

```
We can define the desired expectations K = (Ki) of these functions f = (fi) as the relative frequencies in the corpus:
```>>> mean = lambda l: sum(l) * 1.0 / len(l)
>>> K = [mean([fi(line) for line in corpus]) for fi in f]

```
Suppose we add one more constraint: the average number of words per sentence in the model should equal 56.5, the mean sentence length in the corpus:
```>>> f.append(lambda s: len(s))
>>> K.append(mean([f[-1](line) for line in corpus]))

```
Each value Ki now represents the average number of occurrences of the word wi per sentence, where wi are the keys in the unigram dictionary.
We now have constraints Efi(X) = Ki; the final task before we can fit a model is to define the sample space of the sentences X. This is where this example differs from the simpler translation example above: before the sample space could be enumerated (dans, en, etc. )-now it cannot. We will instead define a probability distribution that generates sentences x according to some probability distribution p0(x). Each iteration of the algorithm will then sample many sentences until the feature functions fi under the current model pl can be estimated with sufficient accuracy. The closer the auxiliary distribution p0 is to pl |fi| (normalized) the smaller the variance of the resulting estimates.
We will first define a function that generates random words:
```>>> sampler = ftwutils.GenericDictSampler(unigrams)
>>> sampler.next()
>>> sampler.next()
('as', -4.0517849478033048)
>>> sampler.next()
('of', -3.1354942159291497)

```
The resulting sampler function is a Python 2.3+ generator that samples a word w from the distribution p0 and returns its log probability logp0(w), where p0 is the auxiliary distribution that with equal probabilities to the frequencies of words in the unigram table.
To generate entire sentences of average length 56.5:
```>>> ssampler = ftwutils.MultipleSampler(sampler, 56.5)
>>> ssampler.next()
(['of', 'or', 'making', 'into', 'pink', 'nothing', 'on', 'book', 'a', 'of', 'sitting', 'and', 'for', 'hot', 'when', 'of', 'she', 'and', 'of', 'when', 'and', 'up', 'daisy'], -94.547820172694415)
>>> ssampler.next()
(['do', 'by', 'and', 'chain', 'chain', 'no', 'chain', 'was', 'considering', 'trouble', 'White', 'on', 'ran', 'she', 'considering', 'suddenly', 'getting', 'making', 'considering', 'the', 'sister', 'well', 'sister', 'get', 'sleepy', 'do', 'the', 'conversations', 'into', 'to', 'use', 'her', 'the', 'was', 'tired', 'she', 'made', 'very', '', 'could', 'eyes', '', 'and', 'of', 'had', ... ], -547.51732838585872)
>>> ssampler.next()
(['a'], -3.6463198396951406)
>>>

```
<To be continued>

## References

[1]
Adam L. Berger, Stephen A. Della Pietra, and Vincent J. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39-71, 1996.
[2]
Adwait Ratnaparkhi. Maximum Entropy Models for Natural Language Ambiguity Resolution. PhD thesis, University of Pennsylvania, Philadelphia, PA, 1998.
[3]
Ronald Rosenfeld. A maximum entropy approach to adaptive statistical language modeling. Computer, Speech and Language, 10:187-228, 1996.

File translated from TEX by TTH, version 3.66.
On 21 Apr 2005, 18:09.