Assignment 2: Extracting Wikilinks

Objectives

Introduction

A common technique for automatically determining the relationship between words and phrases in natural language is to do some sort of processing of a large collection of text. Wikipedia is one such useful collection. Wikipedia articles are written in wikitext, a text markup language that allows authors to indicate, among other things, what words or phrases should be displayed as links to other articles. Additionally, wikitext can use templates to present information in a standardized format. The sidebars that give basic biographical information about people are an example of a template.

For this assignment, we want to know what articles are linked to what other articles by extracting links from the wikitext, ignoring links inside templates and other special-purpose links described below.

Assignment

Write a program called Wikilinks that extracts links from wikitext read from standard input.

Wikitext has many features, but for this assignment we will pretend that wikitext markup is limited to templates and links. We want to extract the links in the input that do not appear inside templates (with some other exclusions described below).

Links are delimited by double opening or closing square brackets: [[ and ]]. The text inside the doubled brackets gives the name of the article to link to (the "link target"). So [[Joan Feigenbaum]] produces the link Joan Feigenbaum (the wiki server translates the link target to a complete URL that it sends to the browser). A link can also have an alternate label to allow the text displayed in the link to be different from the title of the article to link to. The link label, if present, is given after a pipe symbol (|) inside the doubled brackets. So [[Joan Feigenbaum|Prof. Feigenbaum]] produces the link Prof. Feigenbaum. Link labels may be empty, which indicates that the link label should be computed from the link target (the exact computation is not relevant to us here).

Templates are delimited by pairs of opening and closing curly braces: {{ and }}. Templates can be nested. Your program should not produce any output for links inside templates.

For each link not in a template, write to standard output the link target. Whitespace in the link target should be handled by removing leading and trailing whitespace and underscore characters and replacing any number of consecutive embedded (and not necessarily identical) whitespace and underscore characters replaced by a single space.

There are two exceptions to outputting links: do not output any text for a link if the first non-whitespace/underscore character is

These are special link targets that we do not want in our output. Each link target that your program outputs should be followed by a single newline character. There should be no other output written to standard output.

Templates and links are the only features of wikitext that we are concerned with. The only characters we consider to have special meaning are:

Everything else, including single braces and brackets, should be treated as a "normal" character (so legal in a link target) even if it has some special meaning in the complete formal wikitext specification.

Your program's output will be tested on inputs that obey these rules:

Your program will also be tested on inputs that do not obey those rules, and in such cases the criteria for passing a test is simply whether your program ran to completion without crashing or going into an infinite loop; the output can be anything or nothing in these cases (although ideally your program would detect violations of the rules and ouput an appropriate error message -- it is often better to detect malformed input and abort processing rather than continuing execution with unexpected and possibly dangerous consequences).

Additional requirements

Helpful functions

In addition to the functions used in the state machine example, you may find the following functions helpful:

Example

If the program is read with the following text as standard input (from the Wikipedia article Joan Feigenbaum, used under the Creative Commons Share-Alike license)
==Awards and honors==
In 2001, Feigenbaum became a fellow of the
[[Association for Computing Machinery]] for her "foundational and
highly influential contributions to cryptographic complexity theory,
authorization and trust management, massive-data-stream computation,
and algorithmic mechanism design."
<ref>{{citation|url=http://fellows.acm.org/fellow_citation.cfm?id=N967745&srt=all|title=ACM Fellows: Joan Feigenbaum|publisher=[[Association for Computing Machinery]]|accessdate=2012-12-29}}.>/ref<
In 2012 she was named a fellow of the [[American Association for the
Advancement of Science]]<ref> {{citation|title=AAAS Members Elected as
Fellows|journal=[[Science (journal)|Science]]|volume=338|date=November
30, 2012|pages=1168–1171|doi=10.1126/science.338.6111.1166
}}>/ref< and, in 2013, a member of the Connecticut Academy of
Science and Engineering. The Connecticut Technology Council chose her
as a Woman of Innovation in 2012. She acts as one of the three
award-committee members on ACM [[Association for Computing
Machinery#Special_Interest_Groups|SIGecom]] test of time
award.
<ref>{{cite web|url=http://www.sigecom.org/awardt.html|title=ACM SIGecom: Test of Time Award|website=www.sigecom.org}}>/ref<
then the output should be
Association for Computing Machinery
American Association for the Advancement of Science
Association for Computing Machinery#Special Interest Groups

Submissions

Submit your makefile, log file, and source code necessary to build the executable Wikilinks using your makefile.