Assignment 2: Extracting Wikilinks
Objectives
- to use C control flow constructs
- to use C I/O functions
- to implement a state machine
Introduction
A common technique for automatically determining the relationship between words and phrases in natural language is to do some sort of processing of a large collection of text. Wikipedia is one such useful collection. Wikipedia articles are written in wikitext, a text markup language that allows authors to indicate, among other things, what words or phrases should be displayed as links to other articles. Additionally, wikitext can use templates to present information in a standardized format. The sidebars that give basic biographical information about people are an example of a template.For this assignment, we want to know what articles are linked to what other articles by extracting links from the wikitext, ignoring links inside templates and other special-purpose links described below.
Assignment
Write a program called Wikilinks
that extracts links
from wikitext read from standard input.
Wikitext has many features, but for this assignment we will pretend that wikitext markup is limited to templates and links. We want to extract the links in the input that do not appear inside templates (with some other exclusions described below).
Links are delimited by double opening or closing square brackets:
[[
and ]]
. The text inside the doubled brackets
gives the name of the article to link to (the "link target").
So [[Joan Feigenbaum]]
produces the link
Joan Feigenbaum
(the wiki server translates the link target to a complete URL that it sends
to the browser).
A link can also have an alternate label to allow the text displayed
in the link to be different from the title of the article to link to.
The link label, if present, is given after a pipe symbol (|
)
inside the doubled brackets. So [[Joan Feigenbaum|Prof. Feigenbaum]]
produces the link
Prof. Feigenbaum.
Link labels may be empty, which indicates that the link label should
be computed from the link target (the exact computation is not relevant to
us here).
Templates are
delimited by pairs of opening and closing curly braces: {{
and }}
. Templates can be nested. Your program should
not produce any output for links inside templates.
For each link not in a template, write to standard output the link target. Whitespace in the link target should be handled by removing leading and trailing whitespace and underscore characters and replacing any number of consecutive embedded (and not necessarily identical) whitespace and underscore characters replaced by a single space.
There are two exceptions to outputting links: do not output any text for a link if the first non-whitespace/underscore character is
- a colon (
:
) (but note that links with colons in other positions should still be output); or - a double opening brace (in a link, templates can be used to simply include the values of a variable
like
[[{{CURRENT_DATE}}]]
).
Templates and links are the only features of wikitext that we are concerned with. The only characters we consider to have special meaning are:
- doubled braces and brackets (
{{ }} [[ ]]
), where sequences of more than two identical braces and brackets are processed greedily so the first two open an outer template/link, the next two open a nested template/link inside that one, and so on, with a single odd character at the end treated as text (so that{{{{{stuff}}}}}
is{stuff
inside two nested templates, which are followed by}
); - whitespace inside link targets;
- underscores inside links targets, which are treated as whitespace; and
- colons that appear as the first non-whitespace/underscore character inside a link target.
Your program's output will be tested on inputs that obey these rules:
- doubled braces and brackets will be balanced;
- doubled brackets will not be nested inside other doubled brackets;
- doubled braces, when nested inside doubled brackets, will be the first non whitespace/underscore character inside the doubled brackets and will not nest more than one level deep inside doubled brackets;
- there will be at most one pipe inside each pair of doubled brackets;
- link targets will not consist of only whitespace and underscores.
Your program will also be tested on inputs that do not obey those rules, and in such cases the criteria for passing a test is simply whether your program ran to completion without crashing or going into an infinite loop; the output can be anything or nothing in these cases (although ideally your program would detect violations of the rules and ouput an appropriate error message -- it is often better to detect malformed input and abort processing rather than continuing execution with unexpected and possibly dangerous consequences).
Additional requirements
- Your program may use no more than a constant amount of memory -- the total number of variables, the maximum size of arrays, and the depth of any recursion must not depend on the size of the input.
- You may not use dynamically allocated memory.
Helpful functions
In addition to the functions used in the state machine example, you may find the following functions helpful:-
int ungetc(int c, FILE *stream)
from thestdio
library to undo reading a character – this can be used to take a peek at the next character of input without effectively reading it// check for the beginning of a multi-line comment int ch = getchar(); if (ch == '/') { // check whether next character is an asterisk ch = getchar(); if (ch != '*') { // not an asterisk -- put it back for later processing ungetc(ch, stdin); } else { printf("begin comment\n"); } }
-
int isspace(int c)
from thectype
library to determine if a character is a whitespace character (space, tab, newline, ...)
Example
If the program is read with the following text as standard input (from the Wikipedia article Joan Feigenbaum, used under the Creative Commons Share-Alike license)==Awards and honors== In 2001, Feigenbaum became a fellow of the [[Association for Computing Machinery]] for her "foundational and highly influential contributions to cryptographic complexity theory, authorization and trust management, massive-data-stream computation, and algorithmic mechanism design." <ref>{{citation|url=http://fellows.acm.org/fellow_citation.cfm?id=N967745&srt=all|title=ACM Fellows: Joan Feigenbaum|publisher=[[Association for Computing Machinery]]|accessdate=2012-12-29}}.>/ref< In 2012 she was named a fellow of the [[American Association for the Advancement of Science]]<ref> {{citation|title=AAAS Members Elected as Fellows|journal=[[Science (journal)|Science]]|volume=338|date=November 30, 2012|pages=1168–1171|doi=10.1126/science.338.6111.1166 }}>/ref< and, in 2013, a member of the Connecticut Academy of Science and Engineering. The Connecticut Technology Council chose her as a Woman of Innovation in 2012. She acts as one of the three award-committee members on ACM [[Association for Computing Machinery#Special_Interest_Groups|SIGecom]] test of time award. <ref>{{cite web|url=http://www.sigecom.org/awardt.html|title=ACM SIGecom: Test of Time Award|website=www.sigecom.org}}>/ref<then the output should be
Association for Computing Machinery American Association for the Advancement of Science Association for Computing Machinery#Special Interest Groups
Submissions
Submit your makefile, log file, and source code necessary to build the executableWikilinks
using your makefile.