Assignment 1: Parsing GPX Files
Objectives
- to use C control flow constructs
- to use C I/O functions
- to implement a state machine
Introduction
Global Positioning System devices are used in a wide variety of applications, from ride hailing services to fitness tracking to exergaming to migratory animal tracking to illegal fishing monitoring. Many GPS devices record a GPS track, which is a record of the position of the device over time. GPS tracks are often stores in GPX files, which are a specific kind of XML file. For this assignment, we will process GPX files to extract the location information for easier use in other applications.Assignment
Write a program called ParseGPX
that extracts GPS track
information from GPX files read from standard input.
GPX, and XML in general, has many features, but for this assignment we will assume that the files we're processing are somewhat more restricted than general GPX/XML files.
In general, an XML file contains a prolog followed by XML elements.
XML elements are delimited by start and end tags (we will not consider
the special syntax for empty elements), where each tag starts with a
<
character and ends with a >
character.
Between those delimiters, a start tag contains a sequence of non-whitespace
characters giving the element type, and optionally a sequence of unique
attributes, separated from the element type by whitespace and
given as a whitespace-separated list of attribute names followed by
an equals sign (=
) and a quoted attribute value.
There may be whitespace around the equals sign, and
which characters count as whitespace at this point and elsewhere
in our XML files is determined in the same way C's
isspace
function determines what is whitespace.
The end tags have only the element type, preceded by a
forward slash (/
).
XML elements may contain text (characters not in a tag)
or other XML elements. Elements must
nest properly, so any child elements must have their start tag and end
tag both inside their parent element. The prolog of an XML file contains
tag-like items that start and end with <?
and ?>
or <!
and >
and these tag-like
items may contain a sequence of things that look like attribute names
and quoted values (see the example below).
A GPX file is a specific kind of XML file with specific kinds of
elements structured in a particular way. In particular, a GPX file
contains trkpt
elements with lat
and lon
attributes in the opening tag
whose values give the latitude and
longitude of the tracked object at some point in time. The trkpt
elements contain ele
and time
elements
whose text gives the elevation of the object and the time of the measurements
respectively.
Our task is to extract the values of the lat
and lon
attributes of the trkpt
opening
tags along
with the text of the ele
and time
elements
contained in the trkpt
elements: for each trkpt
,
output a comma-separated list of the lat
and lon
values and ele
and time
text in that order.
The contents of each piece of data should be copied verbatim to the output,
except that the quotes at the beginning and end of the value must be
removed, and we must escape any commas in any attribute values and
element text written to the output
by replacing them with ,
.
The data for each trkpt
should be written to
standard output, one trkpt
per line, with a newline at the
end of each, and no other output.
Your program's output will be tested on inputs that obey these rules (note that our inputs will follow the rules below and not the official GPX standard, so our specification is much more permissive than the official standard and you should not assume that rules from the official GPX standard carry over to our specification unless specifically listed below):
- the input will be well-formed XML with one exception (see below),
which means
- start and end tags will nest properly
- there is no whitespace between the tag start
<
and the element type or end-type character/
- there is no whitespace between the tag end-type character
/
and the element type - there may be whitespace surrounding the equals sign between an attribute and its value
- there may be whitespace before the tag end character
>
- attribute values are always quoted with
"
or'
, and the beginning and ending quote will always match - the tag end character
>
does not mean "tag end" inside a quoted value - the tag start character
<
appears nowhere except as the start of a tag - quotes only appear surrounding attribute values or inside values quoted by the other type of quote
- end tag
>
, element end/
, and attribute name/value pairing=
symbols appear only in their special contexts or inside quoted attribute values
- there will be no newline characters in the attribute values or the
text of the
ele
ortime
elements insidetrkpt
elements; -
trkpt
elements will always havelat
andlon
attributes that appear in that order (along with possibly other attributes mixed in before or after or between them); -
trkpt
elements will always contain a singleele
element and a singletime
element in that order as children and never as deeper descendants (along with possibly other descendant elements and text before, after, or between theele
andtime
child elements); -
trkpt
elements will not contain othertrkpt
elements; -
ele
andtime
elements insidetrkpt
elements will not contain other elements; -
ele
andtime
elements may appear outsidetrkpt
elements, and we do not want to extract the contents of such elements; - there is no limit on the length of an element type, attribute name or value, or text; and
- there are no
CDATA
, processing, comment elements, markup declarations, or self-closing tags.
We do relax the XML specification in two ways:
we use isspace
to determine what is whitespace,
which includes two characters (vertical tab and form feed) that the
XML specification does not allow; and we consider element types
to be not case-sensitive (but attribute names remain case-sensitive).
So, for example,
we want to extract trkpt
,
TRKPT
, tRkPt
elements, and the start and end
tags don't have to match case, so, for example, a <ELE>
start tag could be paired with a </ele>
end tag.
Your program will also be tested on inputs that do not obey those rules, and in such cases the criteria for passing a test is simply whether your program ran to completion without crashing or going into an infinite loop; the output can be anything or nothing in these cases. So your program needn't check the validity of the input file, as long as it won't crash or hang when the input is invalid. (Although well-designed programs should generally detect problems with input and alert the user with an appropriate error message rather than continuing execution with meaningless and confusing output, or, in the worst-case, dangerous consequences.)
Additional requirements
- Your program may use no more than a constant amount of memory -- the total number of variables, the maximum size of arrays, and the depth of any recursion must not depend on the size of the input.
- You may not use dynamically allocated memory.
Example
If the input is<?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE GPX PUBLIC "http://upcycle.com/format" "version 1.0"> <gpx creator="StravaGPX" version="1.1" xmlns="http://www.topografix.com/GPX/1/1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd"> <metadata> <time>2018-08-24T13:49:45Z</time> </metadata> <trk> <name>Morning Ride</name> <type>1</type> <trkseg> <trkpt lat="41.3078680" lon="-72.9342120"> <ele>20.0</ele> <time>2018-08-24T13:49:45Z</time> </trkpt> <trkpt lat="41.3078680" lon="-72.9342120"> <ele>20.0</ele> <time>2018-08-24T13:49:46Z</time> </trkpt> <trkpt lat="41.3078810" lon="72.9342590W"> <ele>20.0</ele> <time>2018-08-24T13:49:49Z</time> </trkpt> </trkseg> </trk> </gpx>then the output must be
41.3078680,-72.9342120,20.0,2018-08-24T13:49:45Z 41.3078680,-72.9342120,20.0,2018-08-24T13:49:46Z 41.3078810,72.9342590W,20.0,2018-08-24T13:49:49Z
Submissions
Submit your source code, a makefile that produces an executable calledParseGPX
as its default target, and your
log.