{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# NATURAL LANGUAGE PROCESSING\n", "\n", "This notebook covers chapters 22 and 23 from the book *Artificial Intelligence: A Modern Approach*, 3rd Edition. The implementations of the algorithms can be found in [nlp.py](https://github.com/aimacode/aima-python/blob/master/nlp.py).\n", "\n", "Run the below cell to import the code from the module and get started!" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import nlp\n", "from nlp import Page, HITS\n", "from nlp import Lexicon, Rules, Grammar, ProbLexicon, ProbRules, ProbGrammar\n", "from nlp import CYK_parse, Chart\n", "\n", "from notebook import psource" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## CONTENTS\n", "\n", "* Overview\n", "* Languages\n", "* HITS\n", "* Question Answering\n", "* CYK Parse\n", "* Chart Parsing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## OVERVIEW\n", "\n", "**Natural Language Processing (NLP)** is a field of AI concerned with understanding, analyzing and using natural languages. This field is considered a difficult yet intriguing field of study, since it is connected to how humans and their languages work.\n", "\n", "Applications of the field include translation, speech recognition, topic segmentation, information extraction and retrieval, and a lot more.\n", "\n", "Below we take a look at some algorithms in the field. Before we get right into it though, we will take a look at a very useful form of language, **context-free** languages. Even though they are a bit restrictive, they have been used a lot in research in natural language processing." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## LANGUAGES\n", "\n", "Languages can be represented by a set of grammar rules over a lexicon of words. Different languages can be represented by different types of grammar, but in Natural Language Processing we are mainly interested in context-free grammars.\n", "\n", "### Context-Free Grammars\n", "\n", "A lot of natural and programming languages can be represented by a **Context-Free Grammar (CFG)**. A CFG is a grammar that has a single non-terminal symbol on the left-hand side. That means a non-terminal can be replaced by the right-hand side of the rule regardless of context. An example of a CFG:\n", "\n", "```\n", "S -> aSb | ε\n", "```\n", "\n", "That means `S` can be replaced by either `aSb` or `ε` (with `ε` we denote the empty string). The lexicon of the language is comprised of the terminals `a` and `b`, while with `S` we denote the non-terminal symbol. In general, non-terminals are capitalized while terminals are not, and we usually name the starting non-terminal `S`. The language generated by the above grammar is the language anbn for n greater or equal than 1." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Probabilistic Context-Free Grammar\n", "\n", "While a simple CFG can be very useful, we might want to know the chance of each rule occurring. Above, we do not know if `S` is more likely to be replaced by `aSb` or `ε`. **Probabilistic Context-Free Grammars (PCFG)** are built to fill exactly that need. 
Each rule has a probability, given in brackets, and the probabilities of all the alternatives of a rule sum up to 1:\n", "\n", "```\n", "S -> aSb [0.7] | ε [0.3]\n", "```\n", "\n", "Now we know it is more likely for `S` to be replaced by `aSb` than by `ε`.\n", "\n", "An issue with *PCFGs* is how to assign the various probabilities to the rules. We could use our knowledge as humans to assign the probabilities, but that is a laborious and error-prone task. Instead, we can *learn* the probabilities from data. Data is categorized as labeled (with correctly parsed sentences, usually called a **treebank**) or unlabeled (given only lexical and syntactic category names).\n", "\n", "With labeled data, we can simply count the occurrences. For the above grammar, if we have 100 occurrences of `S` rules and 30 of them are of the form `S -> ε`, we assign a probability of 0.3 to that transformation.\n", "\n", "With unlabeled data we have to learn both the grammar rules and the probability of each rule. There are many approaches; one of them is the **inside-outside** algorithm. It uses dynamic programming: it first finds the probability of a substring being generated by each rule, and then estimates the probability of each rule." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Chomsky Normal Form\n", "\n", "A grammar is in Chomsky Normal Form (or **CNF**, not to be confused with *Conjunctive Normal Form*) if each of its rules has one of the following three forms:\n", "\n", "* `X -> Y Z`\n", "* `A -> a`\n", "* `S -> ε`\n", "\n", "Here *X*, *Y*, *Z*, *A* are non-terminals, *a* is a terminal, *ε* is the empty string and *S* is the start symbol (the start symbol should not appear on the right-hand side of any rule). Note that there can be multiple rules for each left-hand side non-terminal, as long as each alternative follows one of the above forms. For example, a rule for *X* might be: `X -> Y Z | A B | a | b`.\n", "\n", "Of course, we can also have a CNF grammar with probabilities.\n", "\n", "This type of grammar may seem restrictive, but it can be proven that any context-free grammar can be converted to CNF." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lexicon\n", "\n", "The lexicon of a language is defined as a list of allowable words. These words are grouped into the usual classes: `verbs`, `nouns`, `adjectives`, `adverbs`, `pronouns`, `names`, `articles`, `prepositions` and `conjunctions`. For the first five classes it is impossible to list all words, since new words are continuously being added to them. Recently "google" was added to the list of verbs, and words like that will continue to pop up and get added to the lists. For that reason, these first five categories are called **open classes**. The rest of the categories have far fewer words and change far less. Words like "thou" were commonly used in the past but have almost completely fallen out of use; such changes take many decades or centuries to manifest, so we can safely assume these categories will remain static for the foreseeable future. 
Thus, these categories are called **closed classes**.\n", "\n", "An example lexicon for a PCFG (note that other classes can also be used depending on the language, like `Digit`, or `RelPro` for relative pronouns):\n", "\n", "```\n", "Verb -> is [0.3] | say [0.1] | are [0.1] | ...\n", "Noun -> robot [0.1] | sheep [0.05] | fence [0.05] | ...\n", "Adjective -> good [0.1] | new [0.1] | sad [0.05] | ...\n", "Adverb -> here [0.1] | lightly [0.05] | now [0.05] | ...\n", "Pronoun -> me [0.1] | you [0.1] | he [0.05] | ...\n", "RelPro -> that [0.4] | who [0.2] | which [0.2] | ...\n", "Name -> john [0.05] | mary [0.05] | peter [0.01] | ...\n", "Article -> the [0.35] | a [0.25] | an [0.025] | ...\n", "Preposition -> to [0.25] | in [0.2] | at [0.1] | ...\n", "Conjunction -> and [0.5] | or [0.2] | but [0.2] | ...\n", "Digit -> 1 [0.3] | 2 [0.2] | 0 [0.2] | ...\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Grammar\n", "\n", "With grammars we combine words from the lexicon into valid phrases. A grammar consists of **grammar rules**. Each rule transforms the left-hand side of the rule into the right-hand side. For example, `A -> B` means that `A` transforms into `B`. Let's build a grammar for the language we started with the lexicon above. We will use a PCFG.\n", "\n", "```\n", "S -> NP VP [0.9] | S Conjunction S [0.1]\n", "\n", "NP -> Pronoun [0.3] | Name [0.1] | Noun [0.1] | Article Noun [0.25] |\n", " Article Adjs Noun [0.05] | Digit [0.05] | NP PP [0.1] |\n", " NP RelClause [0.05]\n", "\n", "VP -> Verb [0.4] | VP NP [0.35] | VP Adjective [0.05] | VP PP [0.1] |\n", " VP Adverb [0.1]\n", "\n", "Adjs -> Adjective [0.8] | Adjective Adjs [0.2]\n", "\n", "PP -> Preposition NP [1.0]\n", "\n", "RelClause -> RelPro VP [1.0]\n", "```\n", "\n", "Some valid phrases the grammar produces: \"`mary is sad`\", \"`you are a robot`\" and \"`she likes mary and a good fence`\".\n", "\n", "What if we wanted to check if the phrase \"`mary is sad`\" is actually a valid sentence? We can use a **parse tree** to constructively prove that a string of words is a valid phrase in the given language and even calculate the probability of generating the sentence.\n", "\n", "\n", "\n", "The probability of the whole tree can be calculated by multiplying the probabilities of each individual rule transformation: `0.9 * 0.1 * 0.05 * 0.05 * 0.4 * 0.05 * 0.3 = 0.00000135`.\n", "\n", "To conserve space, we can also write the tree in linear form:\n", "\n", "[S [NP [Name **mary**]] [VP [VP [Verb **is**]] [Adjective **sad**]]]\n", "\n", "Unfortunately, the current grammar **overgenerates**, that is, it creates sentences that are not grammatically correct (according to the English language), like \"`the fence are john which say`\". It also **undergenerates**, which means there are valid sentences it does not generate, like \"`he believes mary is sad`\"." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Implementation\n", "\n", "In the module we have implementations for both probabilistic and non-probabilistic grammars. Both implementations follow the same format: there are functions for the lexicon and the rules, which can be combined to create a grammar object.\n", "\n", "#### Non-Probabilistic\n", "\n", "Execute the cell below to view the implementations:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "def Lexicon(**rules):\n",
" """Create a dictionary mapping symbols to alternative words.\n",
" >>> Lexicon(Article = "the | a | an")\n",
" {'Article': ['the', 'a', 'an']}\n",
" """\n",
" for (lhs, rhs) in rules.items():\n",
" rules[lhs] = [word.strip() for word in rhs.split('|')]\n",
" return rules\n",
"\n",
"\n",
"def Rules(**rules):\n",
" """Create a dictionary mapping symbols to alternative sequences.\n",
" >>> Rules(A = "B C | D E")\n",
" {'A': [['B', 'C'], ['D', 'E']]}\n",
" """\n",
" for (lhs, rhs) in rules.items():\n",
" rules[lhs] = [alt.strip().split() for alt in rhs.split('|')]\n",
" return rules\n",
"\n",
"\n",
"class Grammar:\n",
"\n",
" def __init__(self, name, rules, lexicon):\n",
" """A grammar has a set of rules and a lexicon."""\n",
" self.name = name\n",
" self.rules = rules\n",
" self.lexicon = lexicon\n",
" self.categories = defaultdict(list)\n",
" for lhs in lexicon:\n",
" for word in lexicon[lhs]:\n",
" self.categories[word].append(lhs)\n",
"\n",
" def rewrites_for(self, cat):\n",
" """Return a sequence of possible rhs's that cat can be rewritten as."""\n",
" return self.rules.get(cat, ())\n",
"\n",
" def isa(self, word, cat):\n",
" """Return True iff word is of category cat"""\n",
" return cat in self.categories[word]\n",
"\n",
" def cnf_rules(self):\n",
" """Returns the tuple (X, Y, Z) for rules in the form:\n",
" X -> Y Z"""\n",
" cnf = []\n",
" for X, rules in self.rules.items():\n",
" for (Y, Z) in rules:\n",
" cnf.append((X, Y, Z))\n",
"\n",
" return cnf\n",
"\n",
" def generate_random(self, S='S'):\n",
" """Replace each token in S by a random entry in grammar (recursively)."""\n",
" import random\n",
"\n",
" def rewrite(tokens, into):\n",
" for token in tokens:\n",
" if token in self.rules:\n",
" rewrite(random.choice(self.rules[token]), into)\n",
" elif token in self.lexicon:\n",
" into.append(random.choice(self.lexicon[token]))\n",
" else:\n",
" into.append(token)\n",
" return into\n",
"\n",
" return ' '.join(rewrite(S.split(), []))\n",
"\n",
" def __repr__(self):\n",
" return '<Grammar {}>'.format(self.name)\n",
"
def ProbLexicon(**rules):\n",
" """Create a dictionary mapping symbols to alternative words,\n",
" with probabilities.\n",
" >>> ProbLexicon(Article = "the [0.5] | a [0.25] | an [0.25]")\n",
" {'Article': [('the', 0.5), ('a', 0.25), ('an', 0.25)]}\n",
" """\n",
" for (lhs, rhs) in rules.items():\n",
" rules[lhs] = []\n",
" rhs_separate = [word.strip().split() for word in rhs.split('|')]\n",
" for r in rhs_separate:\n",
" prob = float(r[-1][1:-1]) # remove brackets, convert to float\n",
" word = r[:-1][0]\n",
" rhs_rule = (word, prob)\n",
" rules[lhs].append(rhs_rule)\n",
"\n",
" return rules\n",
"\n",
"\n",
"def ProbRules(**rules):\n",
" """Create a dictionary mapping symbols to alternative sequences,\n",
" with probabilities.\n",
" >>> ProbRules(A = "B C [0.3] | D E [0.7]")\n",
" {'A': [(['B', 'C'], 0.3), (['D', 'E'], 0.7)]}\n",
" """\n",
" for (lhs, rhs) in rules.items():\n",
" rules[lhs] = []\n",
" rhs_separate = [alt.strip().split() for alt in rhs.split('|')]\n",
" for r in rhs_separate:\n",
" prob = float(r[-1][1:-1]) # remove brackets, convert to float\n",
" rhs_rule = (r[:-1], prob)\n",
" rules[lhs].append(rhs_rule)\n",
"\n",
" return rules\n",
"\n",
"\n",
"class ProbGrammar:\n",
"\n",
" def __init__(self, name, rules, lexicon):\n",
" """A grammar has a set of rules and a lexicon.\n",
" Each rule has a probability."""\n",
" self.name = name\n",
" self.rules = rules\n",
" self.lexicon = lexicon\n",
" self.categories = defaultdict(list)\n",
"\n",
" for lhs in lexicon:\n",
" for word, prob in lexicon[lhs]:\n",
" self.categories[word].append((lhs, prob))\n",
"\n",
" def rewrites_for(self, cat):\n",
" """Return a sequence of possible rhs's that cat can be rewritten as."""\n",
" return self.rules.get(cat, ())\n",
"\n",
" def isa(self, word, cat):\n",
" """Return True iff word is of category cat"""\n",
" return cat in [c for c, _ in self.categories[word]]\n",
"\n",
" def cnf_rules(self):\n",
" """Returns the tuple (X, Y, Z, p) for rules in the form:\n",
" X -> Y Z [p]"""\n",
" cnf = []\n",
" for X, rules in self.rules.items():\n",
" for (Y, Z), p in rules:\n",
" cnf.append((X, Y, Z, p))\n",
"\n",
" return cnf\n",
"\n",
" def generate_random(self, S='S'):\n",
" """Replace each token in S by a random entry in grammar (recursively).\n",
" Returns a tuple of (sentence, probability)."""\n",
" import random\n",
"\n",
" def rewrite(tokens, into):\n",
" for token in tokens:\n",
" if token in self.rules:\n",
" non_terminal, prob = weighted_choice(self.rules[token])\n",
" into[1] *= prob\n",
" rewrite(non_terminal, into)\n",
" elif token in self.lexicon:\n",
" terminal, prob = weighted_choice(self.lexicon[token])\n",
" into[0].append(terminal)\n",
" into[1] *= prob\n",
" else:\n",
" into[0].append(token)\n",
" return into\n",
"\n",
" rewritten_as, prob = rewrite(S.split(), [[], 1])\n",
" return (' '.join(rewritten_as), prob)\n",
"\n",
" def __repr__(self):\n",
" return '<Grammar {}>'.format(self.name)\n",
"
def HITS(query):\n",
" """The HITS algorithm for computing hubs and authorities with respect to a query."""\n",
" pages = expand_pages(relevant_pages(query))\n",
" for p in pages.values():\n",
" p.authority = 1\n",
" p.hub = 1\n",
" while not convergence():\n",
" authority = {p: pages[p].authority for p in pages}\n",
" hub = {p: pages[p].hub for p in pages}\n",
" for p in pages:\n",
" # p.authority ← ∑i Inlinki(p).Hub\n",
" pages[p].authority = sum(hub[x] for x in getInlinks(pages[p]))\n",
" # p.hub ← ∑i Outlinki(p).Authority\n",
" pages[p].hub = sum(authority[x] for x in getOutlinks(pages[p]))\n",
" normalize(pages)\n",
" return pages\n",
"
def CYK_parse(words, grammar):\n",
" """ [Figure 23.5] """\n",
" # We use 0-based indexing instead of the book's 1-based.\n",
" N = len(words)\n",
" P = defaultdict(float)\n",
"\n",
" # Insert lexical rules for each word.\n",
" for (i, word) in enumerate(words):\n",
" for (X, p) in grammar.categories[word]:\n",
" P[X, i, 1] = p\n",
"\n",
" # Combine first and second parts of right-hand sides of rules,\n",
" # from short to long.\n",
" for length in range(2, N+1):\n",
" for start in range(N-length+1):\n",
" for len1 in range(1, length): # N.B. the book incorrectly has N instead of length\n",
" len2 = length - len1\n",
" for (X, Y, Z, p) in grammar.cnf_rules():\n",
" P[X, start, length] = max(P[X, start, length],\n",
" P[Y, start, len1] * P[Z, start+len1, len2] * p)\n",
"\n",
" return P\n",
"
class Chart:\n",
"\n",
" """Class for parsing sentences using a chart data structure.\n",
" >>> chart = Chart(E0)\n",
" >>> len(chart.parses('the stench is in 2 2'))\n",
" 1\n",
" """\n",
"\n",
" def __init__(self, grammar, trace=False):\n",
" """A datastructure for parsing a string; and methods to do the parse.\n",
" self.chart[i] holds the edges that end just before the i'th word.\n",
" Edges are 5-element lists of [start, end, lhs, [found], [expects]]."""\n",
" self.grammar = grammar\n",
" self.trace = trace\n",
"\n",
" def parses(self, words, S='S'):\n",
" """Return a list of parses; words can be a list or string."""\n",
" if isinstance(words, str):\n",
" words = words.split()\n",
" self.parse(words, S)\n",
" # Return all the parses that span the whole input\n",
" # 'span the whole input' => begin at 0, end at len(words)\n",
" return [[i, j, S, found, []]\n",
" for (i, j, lhs, found, expects) in self.chart[len(words)]\n",
" # assert j == len(words)\n",
" if i == 0 and lhs == S and expects == []]\n",
"\n",
" def parse(self, words, S='S'):\n",
" """Parse a list of words; according to the grammar.\n",
" Leave results in the chart."""\n",
" self.chart = [[] for i in range(len(words)+1)]\n",
" self.add_edge([0, 0, 'S_', [], [S]])\n",
" for i in range(len(words)):\n",
" self.scanner(i, words[i])\n",
" return self.chart\n",
"\n",
" def add_edge(self, edge):\n",
" """Add edge to chart, and see if it extends or predicts another edge."""\n",
" start, end, lhs, found, expects = edge\n",
" if edge not in self.chart[end]:\n",
" self.chart[end].append(edge)\n",
" if self.trace:\n",
" print('Chart: added {}'.format(edge))\n",
" if not expects:\n",
" self.extender(edge)\n",
" else:\n",
" self.predictor(edge)\n",
"\n",
" def scanner(self, j, word):\n",
" """For each edge expecting a word of this category here, extend the edge."""\n",
" for (i, j, A, alpha, Bb) in self.chart[j]:\n",
" if Bb and self.grammar.isa(word, Bb[0]):\n",
" self.add_edge([i, j+1, A, alpha + [(Bb[0], word)], Bb[1:]])\n",
"\n",
" def predictor(self, edge):\n",
" """Add to chart any rules for B that could help extend this edge."""\n",
" (i, j, A, alpha, Bb) = edge\n",
" B = Bb[0]\n",
" if B in self.grammar.rules:\n",
" for rhs in self.grammar.rewrites_for(B):\n",
" self.add_edge([j, j, B, [], rhs])\n",
"\n",
" def extender(self, edge):\n",
" """See what edges can be extended by this edge."""\n",
" (j, k, B, _, _) = edge\n",
" for (i, j, A, alpha, B1b) in self.chart[j]:\n",
" if B1b and B == B1b[0]:\n",
" self.add_edge([i, k, A, alpha + [edge], B1b[1:]])\n",
"