{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## April 15th\n", "\n", "Natural Language Processing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# TEXT\n", "\n", "This notebook serves as supporting material for topics covered in **Chapter 22 - Natural Language Processing** from the book *Artificial Intelligence: A Modern Approach*. This notebook uses implementations from [text.py](https://github.com/aimacode/aima-python/blob/master/text.py)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from text import *\n", "from utils import open_data\n", "from notebook import psource" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## CONTENTS\n", "\n", "* Text Models\n", "* Viterbi Text Segmentation\n", "* Information Retrieval\n", "* Information Extraction\n", "* Decoders" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TEXT MODELS\n", "\n", "Before we start analyzing text processing algorithms, we will need to build some language models. Those models serve as a look-up table for character or word probabilities (depending on the type of model). These models can give us the probabilities of words or character sequences appearing in text. Take as example \"the\". Text models can give us the probability of \"the\", *P(\"the\")*, either as a word or as a sequence of characters (\"t\" followed by \"h\" followed by \"e\"). The first representation is called \"word model\" and deals with words as distinct objects, while the second is a \"character model\" and deals with sequences of characters as objects. Note that we can specify the number of words or the length of the char sequences to better suit our needs. So, given that number of words equals 2, we have probabilities in the form *P(word1, word2)*. For example, *P(\"of\", \"the\")*. For char models, we do the same but for chars.\n", "\n", "It is also useful to store the conditional probabilities of words given preceding words. That means, given we found the words \"of\" and \"the\", what is the chance the next word will be \"world\"? More formally, *P(\"world\"|\"of\", \"the\")*. Generalizing, *P(Wi|Wi-1, Wi-2, ... , Wi-n)*.\n", "\n", "We call the word model *N-Gram Word Model* (from the Greek \"gram\", the root of \"write\", or the word for \"letter\") and the char model *N-Gram Character Model*. In the special case where *N* is 1, we call the models *Unigram Word Model* and *Unigram Character Model* respectively.\n", "\n", "In the `text` module we implement the two models (both their unigram and n-gram variants) by inheriting from the `CountingProbDist` from `learning.py`. Note that `CountingProbDist` does not return the actual probability of each object, but the number of times it appears in our test data.\n", "\n", "For word models we have `UnigramWordModel` and `NgramWordModel`. We supply them with a text file and they show the frequency of the different words. We have `UnigramCharModel` and `NgramCharModel` for the character models.\n", "\n", "Execute the cells below to take a look at the code." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", " \n", " \n", " \n", "\n", "\n", "

\n", "\n", "

class UnigramWordModel(CountingProbDist):\n",
       "\n",
       "    """This is a discrete probability distribution over words, so you\n",
       "    can add, sample, or get P[word], just like with CountingProbDist. You can\n",
       "    also generate a random text, n words long, with P.samples(n)."""\n",
       "\n",
       "    def __init__(self, observations, default=0):\n",
       "        # Call CountingProbDist constructor,\n",
       "        # passing the observations and default parameters.\n",
       "        super(UnigramWordModel, self).__init__(observations, default)\n",
       "\n",
       "    def samples(self, n):\n",
       "        """Return a string of n words, random according to the model."""\n",
       "        return ' '.join(self.sample() for i in range(n))\n",
       "\n",
       "\n",
       "class NgramWordModel(CountingProbDist):\n",
       "\n",
       "    """This is a discrete probability distribution over n-tuples of words.\n",
       "    You can add, sample or get P[(word1, ..., wordn)]. The method P.samples(n)\n",
       "    builds up an n-word sequence; P.add_cond_prob and P.add_sequence add data."""\n",
       "\n",
       "    def __init__(self, n, observation_sequence=None, default=0):\n",
       "        # In addition to the dictionary of n-tuples, cond_prob is a\n",
       "        # mapping from (w1, ..., wn-1) to P(wn | w1, ... wn-1)\n",
       "        CountingProbDist.__init__(self, default=default)\n",
       "        self.n = n\n",
       "        self.cond_prob = defaultdict()\n",
       "        self.add_sequence(observation_sequence or [])\n",
       "\n",
       "    # __getitem__, top, sample inherited from CountingProbDist\n",
       "    # Note that they deal with tuples, not strings, as inputs\n",
       "\n",
       "    def add_cond_prob(self, ngram):\n",
       "        """Build the conditional probabilities P(wn | (w1, ..., wn-1)"""\n",
       "        if ngram[:-1] not in self.cond_prob:\n",
       "            self.cond_prob[ngram[:-1]] = CountingProbDist()\n",
       "        self.cond_prob[ngram[:-1]].add(ngram[-1])\n",
       "\n",
       "    def add_sequence(self, words):\n",
       "        """Add each tuple words[i:i+n], using a sliding window."""\n",
       "        n = self.n\n",
       "\n",
       "        for i in range(len(words) - n + 1):\n",
       "            t = tuple(words[i:i + n])\n",
       "            self.add(t)\n",
       "            self.add_cond_prob(t)\n",
       "\n",
       "    def samples(self, nwords):\n",
       "        """Generate an n-word sentence by picking random samples\n",
       "        according to the model. At first pick a random n-gram and\n",
       "        from then on keep picking a character according to\n",
       "        P(c|wl-1, wl-2, ..., wl-n+1) where wl-1 ... wl-n+1 are the\n",
       "        last n - 1 words in the generated sentence so far."""\n",
       "        n = self.n\n",
       "        output = list(self.sample())\n",
       "\n",
       "        for i in range(n, nwords):\n",
       "            last = output[-n+1:]\n",
       "            next_word = self.cond_prob[tuple(last)].sample()\n",
       "            output.append(next_word)\n",
       "\n",
       "        return ' '.join(output)\n",
       "\n",
       "\n",
       "class UnigramCharModel(NgramCharModel):\n",
       "    def __init__(self, observation_sequence=None, default=0):\n",
       "        CountingProbDist.__init__(self, default=default)\n",
       "        self.n = 1\n",
       "        self.cond_prob = defaultdict()\n",
       "        self.add_sequence(observation_sequence or [])\n",
       "\n",
       "    def add_sequence(self, words):\n",
       "        for word in words:\n",
       "            for char in word:\n",
       "                self.add(char)\n",
       "\n",
       "\n",
       "class NgramCharModel(NgramWordModel):\n",
       "    def add_sequence(self, words):\n",
       "        """Add an empty space to every word to catch the beginning of words."""\n",
       "        for word in words:\n",
       "            super().add_sequence(' ' + word)\n",
       "

\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "psource(UnigramWordModel, NgramWordModel, UnigramCharModel, NgramCharModel)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we build our models. The text file we will use to build them is *Flatland*, by Edwin A. Abbott. We will load it from [here](https://github.com/aimacode/aima-data/blob/a21fc108f52ad551344e947b0eb97df82f8d2b2b/EN-text/flatland.txt). In that directory you can find other text files we might get to use here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Getting Probabilities\n", "\n", "Here we will take a look at how to read text and find the probabilities for each model, and how to retrieve them.\n", "\n", "First the word models:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", " \n", " \n", " \n", "\n", "\n", "

\n", "\n", "

def words(text, reg=re.compile('[a-z0-9]+')):\n",
       "    """Return a list of the words in text, ignoring punctuation and\n",
       "    converting everything to lowercase (to canonicalize).\n",
       "    >>> words("``EGAD!'' Edgar cried.")\n",
       "    ['egad', 'edgar', 'cried']\n",
       "    """\n",
       "    return reg.findall(text.lower())\n",
       "

\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "psource(words)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[(2081, 'the'), (1479, 'of'), (1021, 'and'), (1008, 'to'), (850, 'a')]\n", "[(368, ('of', 'the')), (152, ('to', 'the')), (152, ('in', 'the')), (86, ('of', 'a')), (80, ('it', 'is'))]\n", "0.0036724740723330495\n", "0.00114584557527324\n" ] } ], "source": [ "flatland = open_data(\"EN-text/flatland.txt\").read()\n", "wordseq = words(flatland)\n", "\n", "P1 = UnigramWordModel(wordseq)\n", "P2 = NgramWordModel(2, wordseq)\n", "\n", "print(P1.top(5))\n", "print(P2.top(5))\n", "\n", "print(P1['an'])\n", "print(P2[('i', 'was')])" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "193395 34037\n" ] } ], "source": [ "print (len(flatland), len(wordseq))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many unique words?" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4629" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(set(wordseq))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that the most used word in *Flatland* is 'the', with 2081 occurrences, while the most used sequence is 'of the' with 368 occurrences. Also, the probability of 'an' is approximately 0.003, while for 'i was' it is close to 0.001. Note that the strings used as keys are all lowercase. For the unigram model, the keys are single strings, while for n-gram models we have n-tuples of strings.\n", "\n", "Below we take a look at how we can get information from the conditional probabilities of the model, and how we can generate the next word in a sequence." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Conditional Probabilities Table: {'in': 2, 'simulating': 1, 'surprised': 1, 'sitting': 1, 'rapt': 1, 'afraid': 1, 'standing': 1, 'covered': 1, 'not': 4, 'now': 2, 'myself': 1, 'by': 1, 'approaching': 1, 'wearied': 1, 'certain': 2, 'to': 2, 'intoxicated': 1, 'rapidly': 1, 'once': 2, 'unusually': 1, 'will': 1, 'glad': 1, 'at': 2, 'quite': 1, 'considered': 1, 'keenly': 1, 'describing': 1, 'allowed': 1, 'pleased': 1, 'crushed': 1} \n", "\n", "Conditional Probability of 'once' given 'i was': 0.05128205128205128 \n", "\n", "Next word after 'i was': rapt\n" ] } ], "source": [ "flatland = open_data(\"EN-text/flatland.txt\").read()\n", "wordseq = words(flatland)\n", "\n", "P3 = NgramWordModel(3, wordseq)\n", "\n", "print(\"Conditional Probabilities Table:\", P3.cond_prob[('i', 'was')].dictionary, '\\n')\n", "print(\"Conditional Probability of 'once' given 'i was':\", P3.cond_prob[('i', 'was')]['once'], '\\n')\n", "print(\"Next word after 'i was':\", P3.cond_prob[('i', 'was')].sample())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we print all the possible words that come after 'i was' and the times they have appeared in the model. Next we print the probability of 'once' appearing after 'i was', and finally we pick a word to proceed after 'i was'. Note that the word is picked according to its probability of appearing (high appearance count means higher chance to get picked)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a look at the two character models:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[(19208, 'e'), (13965, 't'), (12069, 'o'), (11702, 'a'), (11440, 'i')]\n", "[(5364, (' ', 't')), (4573, ('t', 'h')), (4063, (' ', 'a')), (3654, ('h', 'e')), (2967, (' ', 'i'))]\n", "0.0006028715031814578\n", "0.0032371578540395666\n" ] } ], "source": [ "flatland = open_data(\"EN-text/flatland.txt\").read()\n", "wordseq = words(flatland)\n", "\n", "P1 = UnigramCharModel(wordseq)\n", "P2 = NgramCharModel(2, wordseq)\n", "\n", "print(P1.top(5))\n", "print(P2.top(5))\n", "\n", "print(P1['z'])\n", "print(P2[('g', 'h')])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The most common letter is 'e', appearing more than 19000 times, and the most common sequence is \"\\_t\". That is, a space followed by a 't'. Note that even though we do not count spaces for word models or unigram character models, we do count them for n-gram char models.\n", "\n", "Also, the probability of the letter 'z' appearing is close to 0.0006, while for the bigram 'gh' it is 0.003." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generating Samples\n", "\n", "Apart from reading the probabilities for n-grams, we can also use our model to generate word sequences, using the `samples` function in the word models." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "direction so become diminished collision i feel the would or\n", "a line but since words of benefits duty after my\n", "are and at our university in some of the word\n" ] } ], "source": [ "flatland = open_data(\"EN-text/flatland.txt\").read()\n", "wordseq = words(flatland)\n", "\n", "P1 = UnigramWordModel(wordseq)\n", "P2 = NgramWordModel(2, wordseq)\n", "P3 = NgramWordModel(3, wordseq)\n", "\n", "print(P1.samples(10))\n", "print(P2.samples(10))\n", "print(P3.samples(10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the unigram model, we mostly get gibberish, since each word is picked according to its frequency of appearance in the text, without taking into consideration preceding words. As we increase *n* though, we start to get samples that do have some semblance of coherency and do remind a little bit of normal English. As we increase our data, these samples will get better.\n", "\n", "Let's try it. We will add to the model more data to work with and let's see what comes out." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "credulous her demands are exorbitant but she knew not how to admire an old bachelor\n", "because of his unevenness you reply that for that very reason because he cannot help\n", "would be better for all parties if the sum were diminished one half five hundred\n", "it was impossible for them if they spoke at all to keep clear of every\n" ] } ], "source": [ "data = open_data(\"EN-text/flatland.txt\").read()\n", "data += open_data(\"EN-text/sense.txt\").read()\n", "\n", "wordseq = words(data)\n", "\n", "P3 = NgramWordModel(3, wordseq)\n", "P4 = NgramWordModel(4, wordseq)\n", "P5 = NgramWordModel(5, wordseq)\n", "P7 = NgramWordModel(7, wordseq)\n", "\n", "print(P3.samples(15))\n", "print(P4.samples(15))\n", "print(P5.samples(15))\n", "print(P7.samples(15))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how the samples start to become more and more reasonable as we add more data and increase the *n* parameter. We are still a long way to go though from realistic text generation, but at the same time we can see that with enough data even rudimentary algorithms can output something almost passable." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## VITERBI TEXT SEGMENTATION\n", "\n", "### Overview\n", "\n", "We are given a string containing words of a sentence, but all the spaces are gone! It is very hard to read and we would like to separate the words in the string. We can accomplish this by employing the `Viterbi Segmentation` algorithm. It takes as input the string to segment and a text model, and it returns a list of the separate words.\n", "\n", "The algorithm operates in a dynamic programming approach. It starts from the beginning of the string and iteratively builds the best solution using previous solutions. It accomplishes that by segmentating the string into \"windows\", each window representing a word (real or gibberish). It then calculates the probability of the sequence up that window/word occurring and updates its solution. When it is done, it traces back from the final word and finds the complete sequence of words." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Implementation" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", " \n", " \n", " \n", "\n", "\n", "

\n", "\n", "

def viterbi_segment(text, P):\n",
       "    """Find the best segmentation of the string of characters, given the\n",
       "    UnigramWordModel P."""\n",
       "    # best[i] = best probability for text[0:i]\n",
       "    # words[i] = best word ending at position i\n",
       "    n = len(text)\n",
       "    words = [''] + list(text)\n",
       "    best = [1.0] + [0.0] * n\n",
       "    # Fill in the vectors best words via dynamic programming\n",
       "    for i in range(n+1):\n",
       "        for j in range(0, i):\n",
       "            w = text[j:i]\n",
       "            curr_score = P[w] * best[i - len(w)]\n",
       "            if curr_score >= best[i]:\n",
       "                best[i] = curr_score\n",
       "                words[i] = w\n",
       "    # Now recover the sequence of best words\n",
       "    sequence = []\n",
       "    i = len(words) - 1\n",
       "    while i > 0:\n",
       "        sequence[0:0] = [words[i]]\n",
       "        i = i - len(words[i])\n",
       "    # Return sequence of best words and overall probability\n",
       "    return sequence, best[-1]\n",
       "

\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "psource(viterbi_segment)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function takes as input a string and a text model, and returns the most probable sequence of words, together with the probability of that sequence.\n", "\n", "The \"window\" is `w` and it includes the characters from *j* to *i*. We use it to \"build\" the following sequence: from the start to *j* and then `w`. We have previously calculated the probability from the start to *j*, so now we multiply that probability by `P[w]` to get the probability of the whole sequence. If that probability is greater than the probability we have calculated so far for the sequence from the start to *i* (`best[i]`), we update it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example\n", "\n", "The model the algorithm uses is the `UnigramTextModel`. First we will build the model using the *Flatland* text and then we will try and separate a space-devoid sentence." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sequence of words is: ['it', 'is', 'easy', 'to', 'read', 'words', 'without', 'spaces']\n", "Probability of sequence is: 2.273672843573388e-24\n" ] } ], "source": [ "flatland = open_data(\"EN-text/flatland.txt\").read()\n", "wordseq = words(flatland)\n", "P = UnigramWordModel(wordseq)\n", "text = \"itiseasytoreadwordswithoutspaces\"\n", "\n", "s, p = viterbi_segment(text,P)\n", "print(\"Sequence of words is:\",s)\n", "print(\"Probability of sequence is:\",p)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['c', 'at', 's', 'and', 'do', 'g', 's'] 0.0\n" ] } ], "source": [ "s, p = viterbi_segment('catsanddogs',P)\n", "print (s,p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The algorithm correctly retrieved the words from the string. It also gave us the probability of this sequence, which is small, but still the most probable segmentation of the string." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## INFORMATION RETRIEVAL\n", "\n", "### Overview\n", "\n", "With **Information Retrieval (IR)** we find documents that are relevant to a user's needs for information. A popular example is a web search engine, which finds and presents to a user pages relevant to a query. Information retrieval is not limited only to returning documents though, but can also be used for other type of queries. For example, answering questions when the query is a question, returning information when the query is a concept, and many other applications. An IR system is comprised of the following:\n", "\n", "* A body (called corpus) of documents: A collection of documents, where the IR will work on.\n", "\n", "* A query language: A query represents what the user wants.\n", "\n", "* Results: The documents the system grades as relevant to a user's query and needs.\n", "\n", "* Presententation of the results: How the results are presented to the user.\n", "\n", "How does an IR system determine which documents are relevant though? We can sign a document as relevant if all the words in the query appear in it, and sign it as irrelevant otherwise. We can even extend the query language to support boolean operations (for example, \"paint AND brush\") and then sign as relevant the outcome of the query for the document. This technique though does not give a level of relevancy. All the documents are either relevant or irrelevant, but in reality some documents are more relevant than others.\n", "\n", "So, instead of a boolean relevancy system, we use a *scoring function*. There are many scoring functions around for many different situations. One of the most used takes into account the frequency of the words appearing in a document, the frequency of a word appearing across documents (for example, the word \"a\" appears a lot, so it is not very important) and the length of a document (since large documents will have higher occurrences for the query terms, but a short document with a lot of occurrences seems very relevant). We combine these properties in a formula and we get a numeric score for each document, so we can then quantify relevancy and pick the best documents.\n", "\n", "These scoring functions are not perfect though and there is room for improvement. For instance, for the above scoring function we assume each word is independent. That is not the case though, since words can share meaning. For example, the words \"painter\" and \"painters\" are closely related. If in a query we have the word \"painter\" and in a document the word \"painters\" appears a lot, this might be an indication that the document is relevant but we are missing out since we are only looking for \"painter\". There are a lot of ways to combat this. One of them is to reduce the query and document words into their stems. For example, both \"painter\" and \"painters\" have \"paint\" as their stem form. This can improve slightly the performance of algorithms.\n", "\n", "To determine how good an IR system is, we give the system a set of queries (for which we know the relevant pages beforehand) and record the results. The two measures for performance are *precision* and *recall*. Precision measures the proportion of result documents that actually are relevant. Recall measures the proportion of relevant documents (which, as mentioned before, we know in advance) appearing in the result documents." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Implementation\n", "\n", "You can read the source code by running the command below:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", " \n", " \n", " \n", "\n", "\n", "

\n", "\n", "

class IRSystem:\n",
       "\n",
       "    """A very simple Information Retrieval System, as discussed in Sect. 23.2.\n",
       "    The constructor s = IRSystem('the a') builds an empty system with two\n",
       "    stopwords. Next, index several documents with s.index_document(text, url).\n",
       "    Then ask queries with s.query('query words', n) to retrieve the top n\n",
       "    matching documents. Queries are literal words from the document,\n",
       "    except that stopwords are ignored, and there is one special syntax:\n",
       "    The query "learn: man cat", for example, runs "man cat" and indexes it."""\n",
       "\n",
       "    def __init__(self, stopwords='the a of'):\n",
       "        """Create an IR System. Optionally specify stopwords."""\n",
       "        # index is a map of {word: {docid: count}}, where docid is an int,\n",
       "        # indicating the index into the documents list.\n",
       "        self.index = defaultdict(lambda: defaultdict(int))\n",
       "        self.stopwords = set(words(stopwords))\n",
       "        self.documents = []\n",
       "\n",
       "    def index_collection(self, filenames):\n",
       "        """Index a whole collection of files."""\n",
       "        prefix = os.path.dirname(__file__)\n",
       "        for filename in filenames:\n",
       "            self.index_document(open(filename).read(),\n",
       "                                os.path.relpath(filename, prefix))\n",
       "\n",
       "    def index_document(self, text, url):\n",
       "        """Index the text of a document."""\n",
       "        # For now, use first line for title\n",
       "        title = text[:text.index('\\n')].strip()\n",
       "        docwords = words(text)\n",
       "        docid = len(self.documents)\n",
       "        self.documents.append(Document(title, url, len(docwords)))\n",
       "        for word in docwords:\n",
       "            if word not in self.stopwords:\n",
       "                self.index[word][docid] += 1\n",
       "\n",
       "    def query(self, query_text, n=10):\n",
       "        """Return a list of n (score, docid) pairs for the best matches.\n",
       "        Also handle the special syntax for 'learn: command'."""\n",
       "        if query_text.startswith("learn:"):\n",
       "            doctext = os.popen(query_text[len("learn:"):], 'r').read()\n",
       "            self.index_document(doctext, query_text)\n",
       "            return []\n",
       "\n",
       "        qwords = [w for w in words(query_text) if w not in self.stopwords]\n",
       "        shortest = argmin(qwords, key=lambda w: len(self.index[w]))\n",
       "        docids = self.index[shortest]\n",
       "        return heapq.nlargest(n, ((self.total_score(qwords, docid), docid) for docid in docids))\n",
       "\n",
       "    def score(self, word, docid):\n",
       "        """Compute a score for this word on the document with this docid."""\n",
       "        # There are many options; here we take a very simple approach\n",
       "        return (log(1 + self.index[word][docid]) /\n",
       "                log(1 + self.documents[docid].nwords))\n",
       "\n",
       "    def total_score(self, words, docid):\n",
       "        """Compute the sum of the scores of these words on the document with this docid."""\n",
       "        return sum(self.score(word, docid) for word in words)\n",
       "\n",
       "    def present(self, results):\n",
       "        """Present the results as a list."""\n",
       "        for (score, docid) in results:\n",
       "            doc = self.documents[docid]\n",
       "            print(\n",
       "                ("{:5.2}|{:25} | {}".format(100 * score, doc.url,\n",
       "                                            doc.title[:45].expandtabs())))\n",
       "\n",
       "    def present_results(self, query_text, n=10):\n",
       "        """Get results for the query and present them."""\n",
       "        self.present(self.query(query_text, n))\n",
       "

\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "psource(IRSystem)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `stopwords` argument signifies words in the queries that should not be accounted for in documents. Usually they are very common words that do not add any significant information for a document's relevancy.\n", "\n", "A quick guide for the functions in the `IRSystem` class:\n", "\n", "* `index_document`: Add document to the collection of documents (named `documents`), which is a list of tuples. Also, count how many times each word in the query appears in each document.\n", "\n", "* `index_collection`: Index a collection of documents given by `filenames`.\n", "\n", "* `query`: Returns a list of `n` pairs of `(score, docid)` sorted on the score of each document. Also takes care of the special query \"learn: X\", where instead of the normal functionality we present the output of the terminal command \"X\".\n", "\n", "* `score`: Scores a given document for the given word using `log(1+k)/log(1+n)`, where `k` is the number of query words in a document and `k` is the total number of words in the document. Other scoring functions can be used and you can overwrite this function to better suit your needs.\n", "\n", "* `total_score`: Calculate the sum of all the query words in given document.\n", "\n", "* `present`/`present_results`: Presents the results as a list.\n", "\n", "We also have the class `Document` that holds metadata of documents, like their title, url and number of words. An additional class, `UnixConsultant`, can be used to initialize an IR System for Unix command manuals. This is the example we will use to showcase the implementation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example\n", "\n", "First let's take a look at the source code of `UnixConsultant`." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", " \n", " \n", " \n", "\n", "\n", "

\n", "\n", "

class UnixConsultant(IRSystem):\n",
       "\n",
       "    """A trivial IR system over a small collection of Unix man pages."""\n",
       "\n",
       "    def __init__(self):\n",
       "        IRSystem.__init__(self, stopwords="how do i the a of")\n",
       "\n",
       "        import os\n",
       "        aima_root = os.path.dirname(__file__)\n",
       "        mandir = os.path.join(aima_root, 'aima-data/MAN/')\n",
       "        man_files = [mandir + f for f in os.listdir(mandir)\n",
       "                     if f.endswith('.txt')]\n",
       "\n",
       "        self.index_collection(man_files)\n",
       "

\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "psource(UnixConsultant)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['pico.txt',\n", " 'pine.txt',\n", " 'ps.txt',\n", " 'pwd.txt',\n", " 'python.txt',\n", " 'rm.txt',\n", " 'rmdir.txt',\n", " 'scp.txt',\n", " 'shred.txt',\n", " 'sort.txt',\n", " 'tail.txt',\n", " 'tar.txt',\n", " 'touch.txt',\n", " 'tr.txt',\n", " 'vi.txt',\n", " 'wc.txt',\n", " 'who.txt',\n", " 'whoami.txt',\n", " 'whois.txt',\n", " 'yes.txt',\n", " 'zip.txt',\n", " 'cut.txt',\n", " 'date.txt',\n", " 'dd.txt',\n", " 'df.txt',\n", " 'diff.txt',\n", " 'lpr.txt',\n", " 'ls.txt',\n", " 'man.txt',\n", " 'mkdir.txt',\n", " 'more.txt',\n", " 'mv.txt',\n", " 'passwd.txt',\n", " 'paste.txt',\n", " 'perl.txt',\n", " 'gzip.txt',\n", " 'head.txt',\n", " 'info.txt',\n", " 'jar.txt',\n", " 'join.txt',\n", " 'kill.txt',\n", " 'ln.txt',\n", " 'login.txt',\n", " 'cat.txt',\n", " 'cd.txt',\n", " 'chmod.txt',\n", " 'chown.txt',\n", " 'cp.txt',\n", " 'du.txt',\n", " 'echo.txt',\n", " 'expr.txt',\n", " 'find.txt',\n", " 'finger.txt',\n", " 'grep.txt']" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import os\n", "os.listdir(\"./aima-data/MAN\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The class creates an IR System with the stopwords \"how do i the a of\". We could add more words to exclude, but the queries we will test will generally be in that format, so it is convenient. After the initialization of the system, we get the manual files and start indexing them.\n", "\n", "Let's build our Unix consultant and run a query:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.7682667868462166 aima-data/MAN/rm.txt\n" ] } ], "source": [ "uc = UnixConsultant()\n", "\n", "q = uc.query(\"how do I remove a file\")\n", "\n", "top_score, top_doc = q[0][0], q[0][1]\n", "print(top_score, uc.documents[top_doc].url)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We asked how to remove a file and the top result was the `rm` (the Unix command for remove) manual. This is exactly what we wanted! Let's try another query:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.7546722691607105 aima-data/MAN/diff.txt\n" ] } ], "source": [ "q = uc.query(\"how do I delete a file\")\n", "\n", "top_score, top_doc = q[0][0], q[0][1]\n", "print(top_score, uc.documents[top_doc].url)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Even though we are basically asking for the same thing, we got a different top result. The `diff` command shows the differences between two files. So the system failed us and presented us an irrelevant document. Why is that? Unfortunately our IR system considers each word independent. \"Remove\" and \"delete\" have similar meanings, but since they are different words our system will not make the connection. So, the `diff` manual which mentions a lot the word `delete` gets the nod ahead of other manuals, while the `rm` one isn't in the result set since it doesn't use the word at all." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## INFORMATION EXTRACTION\n", "\n", "**Information Extraction (IE)** is a method for finding occurrences of object classes and relationships in text. Unlike IR systems, an IE system includes (limited) notions of syntax and semantics. While it is difficult to extract object information in a general setting, for more specific domains the system is very useful. One model of an IE system makes use of templates that match with strings in a text.\n", "\n", "A typical example of such a model is reading prices from web pages. Prices usually appear after a dollar and consist of numbers, maybe followed by two decimal points. Before the price, usually there will appear a string like \"price:\". Let's build a sample template.\n", "\n", "With the following regular expression (*regex*) we can extract prices from text:\n", "\n", "`[$][0-9]+([.][0-9][0-9])?`\n", "\n", "Where `+` means 1 or more occurrences and `?` means atmost 1 occurrence. Usually a template consists of a prefix, a target and a postfix regex. In this template, the prefix regex can be \"price:\", the target regex can be the above regex and the postfix regex can be empty.\n", "\n", "A template can match with multiple strings. If this is the case, we need a way to resolve the multiple matches. Instead of having just one template, we can use multiple templates (ordered by priority) and pick the match from the highest-priority template. We can also use other ways to pick. For the dollar example, we can pick the match closer to the numerical half of the highest match. For the text \"Price $90, special offer $70, shipping $5\" we would pick \"$70\" since it is closer to the half of the highest match (\"$90\")." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above is called *attribute-based* extraction, where we want to find attributes in the text (in the example, the price). A more sophisticated extraction system aims at dealing with multiple objects and the relations between them. When such a system reads the text \"$100\", it should determine not only the price but also which object has that price.\n", "\n", "Relation extraction systems can be built as a series of finite state automata. Each automaton receives as input text, performs transformations on the text and passes it on to the next automaton as input. An automata setup can consist of the following stages:\n", "\n", "1. **Tokenization**: Segments text into tokens (words, numbers and punctuation).\n", "\n", "2. **Complex-word Handling**: Handles complex words such as \"give up\", or even names like \"Smile Inc.\".\n", "\n", "3. **Basic-group Handling**: Handles noun and verb groups, segmenting the text into strings of verbs or nouns (for example, \"had to give up\").\n", "\n", "4. **Complex Phrase Handling**: Handles complex phrases using finite-state grammar rules. For example, \"Human+PlayedChess(\"with\" Human+)?\" can be one template/rule for capturing a relation of someone playing chess with others.\n", "\n", "5. **Structure Merging**: Merges the structures built in the previous steps." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finite-state, template based information extraction models work well for restricted domains, but perform poorly as the domain becomes more and more general. There are many models though to choose from, each with its own strengths and weaknesses. Some of the models are the following:\n", "\n", "* **Probabilistic**: Using Hidden Markov Models, we can extract information in the form of prefix, target and postfix from a given text. Two advantages of using HMMs over templates is that we can train HMMs from data and don't need to design elaborate templates, and that a probabilistic approach behaves well even with noise. In a regex, if one character is off, we do not have a match, while with a probabilistic approach we have a smoother process.\n", "\n", "* **Conditional Random Fields**: One problem with HMMs is the assumption of state independence. CRFs are very similar to HMMs, but they don't have the latter's constraint. In addition, CRFs make use of *feature functions*, which act as transition weights. For example, if for observation $e_{i}$ and state $x_{i}$ we have $e_{i}$ is \"run\" and $x_{i}$ is the state ATHLETE, we can have $f(x_{i}, e_{i}) = 1$ and equal to 0 otherwise. We can use multiple, overlapping features, and we can even use features for state transitions. Feature functions don't have to be binary (like the above example) but they can be real-valued as well. Also, we can use any $e$ for the function, not just the current observation. To bring it all together, we weigh a transition by the sum of features.\n", "\n", "* **Ontology Extraction**: This is a method for compiling information and facts in a general domain. A fact can be in the form of `NP is NP`, where `NP` denotes a noun-phrase. For example, \"Rabbit is a mammal\"." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## DECODERS\n", "\n", "### Introduction\n", "\n", "In this section we will try to decode ciphertext using probabilistic text models. A ciphertext is obtained by performing encryption on a text message. This encryption lets us communicate safely, as anyone who has access to the ciphertext but doesn't know how to decode it cannot read the message. We will restrict our study to Monoalphabetic Substitution Ciphers. These are primitive forms of cipher where each letter in the message text (also known as plaintext) is replaced by another another letter of the alphabet.\n", "\n", "### Shift Decoder\n", "\n", "#### The Caesar cipher\n", "\n", "The Caesar cipher, also known as shift cipher is a form of monoalphabetic substitution ciphers where each letter is shifted by a fixed value. A shift by `n` in this context means that each letter in the plaintext is replaced with a letter corresponding to `n` letters down in the alphabet. For example the plaintext `\"ABCDWXYZ\"` shifted by `3` yields `\"DEFGZABC\"`. Note how `X` became `A`. This is because the alphabet is cyclic, i.e. the letter after the last letter in the alphabet, `Z`, is the first letter of the alphabet - `A`." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DEFGZABC\n" ] } ], "source": [ "plaintext = \"ABCDWXYZ\"\n", "ciphertext = shift_encode(plaintext, 3)\n", "print(ciphertext)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Decoding a Caesar cipher\n", "\n", "To decode a Caesar cipher we exploit the fact that not all letters in the alphabet are used equally. Some letters are used more than others and some pairs of letters are more probable to occur together. We call a pair of consecutive letters a bigram." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['th', 'hi', 'is', 's ', ' i', 'is', 's ', ' a', 'a ', ' s', 'se', 'en', 'nt', 'te', 'en', 'nc', 'ce']\n" ] } ], "source": [ "print(bigrams('this is a sentence'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use `CountingProbDist` to get the probability distribution of bigrams. In the latin alphabet consists of only only `26` letters. This limits the total number of possible substitutions to `26`. We reverse the shift encoding for a given `n` and check how probable it is using the bigram distribution. We try all `26` values of `n`, i.e. from `n = 0` to `n = 26` and use the value of `n` which gives the most probable plaintext." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", " \n", " \n", " \n", "\n", "\n", "

\n", "\n", "

class ShiftDecoder:\n",
       "\n",
       "    """There are only 26 possible encodings, so we can try all of them,\n",
       "    and return the one with the highest probability, according to a\n",
       "    bigram probability distribution."""\n",
       "\n",
       "    def __init__(self, training_text):\n",
       "        training_text = canonicalize(training_text)\n",
       "        self.P2 = CountingProbDist(bigrams(training_text), default=1)\n",
       "\n",
       "    def score(self, plaintext):\n",
       "        """Return a score for text based on how common letters pairs are."""\n",
       "\n",
       "        s = 1.0\n",
       "        for bi in bigrams(plaintext):\n",
       "            s = s * self.P2[bi]\n",
       "\n",
       "        return s\n",
       "\n",
       "    def decode(self, ciphertext):\n",
       "        """Return the shift decoding of text with the best score."""\n",
       "\n",
       "        return argmax(all_shifts(ciphertext), key=lambda shift: self.score(shift))\n",
       "

\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "psource(ShiftDecoder)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example\n", "\n", "Let us encode a secret message using Caeasar cipher and then try decoding it using `ShiftDecoder`. We will again use `flatland.txt` to build the text model" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The code is \"Guvf vf n frperg zrffntr\"\n" ] } ], "source": [ "plaintext = \"This is a secret message\"\n", "ciphertext = shift_encode(plaintext, 13)\n", "print('The code is', '\"' + ciphertext + '\"')" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The decoded message is \"This is a secret message\"\n" ] } ], "source": [ "flatland = open_data(\"EN-text/flatland.txt\").read()\n", "decoder = ShiftDecoder(flatland)\n", "\n", "decoded_message = decoder.decode(ciphertext)\n", "print('The decoded message is', '\"' + decoded_message + '\"')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Permutation Decoder\n", "Now let us try to decode messages encrypted by a general mono-alphabetic substitution cipher. The letters in the alphabet can be replaced by any permutation of letters. For example, if the alphabet consisted of `{A B C}` then it can be replaced by `{A C B}`, `{B A C}`, `{B C A}`, `{C A B}`, `{C B A}` or even `{A B C}` itself. Suppose we choose the permutation `{C B A}`, then the plain text `\"CAB BA AAC\"` would become `\"ACB BC CCA\"`. We can see that Caesar cipher is also a form of permutation cipher where the permutation is a cyclic permutation. Unlike the Caesar cipher, it is infeasible to try all possible permutations. The number of possible permutations in Latin alphabet is `26!` which is of the order $10^{26}$. We use graph search algorithms to search for a 'good' permutation." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", " \n", " \n", " \n", "\n", "\n", "

\n", "\n", "

class PermutationDecoder:\n",
       "\n",
       "    """This is a much harder problem than the shift decoder. There are 26!\n",
       "    permutations, so we can't try them all. Instead we have to search.\n",
       "    We want to search well, but there are many things to consider:\n",
       "    Unigram probabilities (E is the most common letter); Bigram probabilities\n",
       "    (TH is the most common bigram); word probabilities (I and A are the most\n",
       "    common one-letter words, etc.); etc.\n",
       "    We could represent a search state as a permutation of the 26 letters,\n",
       "    and alter the solution through hill climbing. With an initial guess\n",
       "    based on unigram probabilities, this would probably fare well. However,\n",
       "    I chose instead to have an incremental representation. A state is\n",
       "    represented as a letter-to-letter map; for example {'z': 'e'} to\n",
       "    represent that 'z' will be translated to 'e'."""\n",
       "\n",
       "    def __init__(self, training_text, ciphertext=None):\n",
       "        self.Pwords = UnigramWordModel(words(training_text))\n",
       "        self.P1 = UnigramWordModel(training_text)  # By letter\n",
       "        self.P2 = NgramWordModel(2, words(training_text))  # By letter pair\n",
       "\n",
       "    def decode(self, ciphertext):\n",
       "        """Search for a decoding of the ciphertext."""\n",
       "        self.ciphertext = canonicalize(ciphertext)\n",
       "        # reduce domain to speed up search\n",
       "        self.chardomain = {c for c in self.ciphertext if c != ' '}\n",
       "        problem = PermutationDecoderProblem(decoder=self)\n",
       "        solution = search.best_first_graph_search(\n",
       "            problem, lambda node: self.score(node.state))\n",
       "\n",
       "        solution.state[' '] = ' '\n",
       "        return translate(self.ciphertext, lambda c: solution.state[c])\n",
       "\n",
       "    def score(self, code):\n",
       "        """Score is product of word scores, unigram scores, and bigram scores.\n",
       "        This can get very small, so we use logs and exp."""\n",
       "\n",
       "        # remake code dictionary to contain translation for all characters\n",
       "        full_code = code.copy()\n",
       "        full_code.update({x: x for x in self.chardomain if x not in code})\n",
       "        full_code[' '] = ' '\n",
       "        text = translate(self.ciphertext, lambda c: full_code[c])\n",
       "\n",
       "        # add small positive value to prevent computing log(0)\n",
       "        # TODO: Modify the values to make score more accurate\n",
       "        logP = (sum(log(self.Pwords[word] + 1e-20) for word in words(text)) +\n",
       "                sum(log(self.P1[c] + 1e-5) for c in text) +\n",
       "                sum(log(self.P2[b] + 1e-10) for b in bigrams(text)))\n",
       "        return -exp(logP)\n",
       "

\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "psource(PermutationDecoder)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each state/node in the graph is represented as a letter-to-letter map. If there is no mapping for a letter, it means the letter is unchanged in the permutation. These maps are stored as dictionaries. Each dictionary is a 'potential' permutation. We use the word 'potential' because every dictionary doesn't necessarily represent a valid permutation since a permutation cannot have repeating elements. For example the dictionary `{'A': 'B', 'C': 'X'}` is invalid because `'A'` is replaced by `'B'`, but so is `'B'` because the dictionary doesn't have a mapping for `'B'`. Two dictionaries can also represent the same permutation e.g. `{'A': 'C', 'C': 'A'}` and `{'A': 'C', 'B': 'B', 'C': 'A'}` represent the same permutation where `'A'` and `'C'` are interchanged and all other letters remain unaltered. To ensure that we get a valid permutation, a goal state must map all letters in the alphabet. We also prevent repetitions in the permutation by allowing only those actions which go to a new state/node in which the newly added letter to the dictionary maps to previously unmapped letter. These two rules together ensure that the dictionary of a goal state will represent a valid permutation.\n", "The score of a state is determined using word scores, unigram scores, and bigram scores. Experiment with different weightages for word, unigram and bigram scores and see how they affect the decoding." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\"ahed world\" decodes to \"shed could\"\n", "\"ahed woxld\" decodes to \"shew atiow\"\n" ] } ], "source": [ "ciphertexts = ['ahed world', 'ahed woxld']\n", "\n", "pd = PermutationDecoder(canonicalize(flatland))\n", "for ctext in ciphertexts:\n", " print('\"{}\" decodes to \"{}\"'.format(ctext, pd.decode(ctext)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As evident from the above example, permutation decoding using best first search is sensitive to initial text. This is because not only the final dictionary, with substitutions for all letters, must have good score but so must the intermediate dictionaries. You could think of it as performing a local search by finding substitutions for each letter one by one. We could get very different results by changing even a single letter because that letter could be a deciding factor for selecting substitution in early stages which snowballs and affects the later stages. To make the search better we can use different definitions of score in different stages and optimize on which letter to substitute first." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 1 }