Assignment 1: Parsing GPX Files

Objectives

to use C control flow constructs
to use C I/O functions
to implement a state machine

Introduction

Global Positioning System devices are used in a wide variety of applications, from ride hailing services to fitness tracking to exergaming to migratory animal tracking to illegal fishing monitoring. Many GPS devices record a GPS track, which is a record of the position of the device over time. GPS tracks are often stores in GPX files, which are a specific kind of XML file. For this assignment, we will process GPX files to extract the location information for easier use in other applications.

Assignment

Write a program called ParseGPX that extracts GPS track information from GPX files read from standard input.

GPX, and XML in general, has many features, but for this assignment we will assume that the files we're processing are somewhat more restricted than general GPX/XML files.

In general, an XML file contains a prolog followed by XML elements. XML elements are delimited by start and end tags (we will not consider the special syntax for empty elements), where each tag starts with a < character and ends with a > character. Between those delimiters, a start tag contains a sequence of non-whitespace characters giving the element type, and optionally a sequence of unique attributes, separated from the element type by whitespace and given as a whitespace-separated list of attribute names followed by an equals sign (=) and a quoted attribute value. There may be whitespace around the equals sign, and which characters count as whitespace at this point and elsewhere in our XML files is determined in the same way C's isspace function determines what is whitespace. The end tags have only the element type, preceded by a forward slash (/).

XML elements may contain text (characters not in a tag) or other XML elements. Elements must nest properly, so any child elements must have their start tag and end tag both inside their parent element. The prolog of an XML file contains tag-like items that start and end with <? and ?> or <! and > and these tag-like items may contain a sequence of things that look like attribute names and quoted values (see the example below).

A GPX file is a specific kind of XML file with specific kinds of elements structured in a particular way. In particular, a GPX file contains trkpt elements with lat and lon attributes in the opening tag whose values give the latitude and longitude of the tracked object at some point in time. The trkpt elements contain ele and time elements whose text gives the elevation of the object and the time of the measurements respectively.

Our task is to extract the values of the lat and lon attributes of the trkpt opening tags along with the text of the ele and time elements contained in the trkpt elements: for each trkpt, output a comma-separated list of the lat and lon values and ele and time text in that order. The contents of each piece of data should be copied verbatim to the output, except that the quotes at the beginning and end of the value must be removed, and we must escape any commas in any attribute values and element text written to the output by replacing them with ,. The data for each trkpt should be written to standard output, one trkpt per line, with a newline at the end of each, and no other output.

Your program's output will be tested on inputs that obey these rules (note that our inputs will follow the rules below and not the official GPX standard, so our specification is much more permissive than the official standard and you should not assume that rules from the official GPX standard carry over to our specification unless specifically listed below):

the input will be well-formed XML with one exception (see below), which means
- start and end tags will nest properly
- there is no whitespace between the tag start < and the element type or end-type character /
- there is no whitespace between the tag end-type character / and the element type
- there may be whitespace surrounding the equals sign between an attribute and its value
- there may be whitespace before the tag end character >
- attribute values are always quoted with " or ', and the beginning and ending quote will always match
- the tag end character > does not mean "tag end" inside a quoted value
- the tag start character < appears nowhere except as the start of a tag
- quotes only appear surrounding attribute values or inside values quoted by the other type of quote
- end tag >, element end /, and attribute name/value pairing = symbols appear only in their special contexts or inside quoted attribute values
there will be no newline characters in the attribute values or the text of the ele or time elements inside trkpt elements;
trkpt elements will always have lat and lon attributes that appear in that order (along with possibly other attributes mixed in before or after or between them);
trkpt elements will always contain a single ele element and a single time element in that order as children and never as deeper descendants (along with possibly other descendant elements and text before, after, or between the ele and time child elements);
trkpt elements will not contain other trkpt elements;
ele and time elements inside trkpt elements will not contain other elements;
ele and time elements may appear outside trkpt elements, and we do not want to extract the contents of such elements;
there is no limit on the length of an element type, attribute name or value, or text; and
there are no CDATA, processing, comment elements, markup declarations, or self-closing tags.

We do relax the XML specification in two ways: we use isspace to determine what is whitespace, which includes two characters (vertical tab and form feed) that the XML specification does not allow; and we consider element types to be not case-sensitive (but attribute names remain case-sensitive). So, for example, we want to extract trkpt, TRKPT, tRkPt elements, and the start and end tags don't have to match case, so, for example, a <ELE> start tag could be paired with a </ele> end tag.

Your program will also be tested on inputs that do not obey those rules, and in such cases the criteria for passing a test is simply whether your program ran to completion without crashing or going into an infinite loop; the output can be anything or nothing in these cases. So your program needn't check the validity of the input file, as long as it won't crash or hang when the input is invalid. (Although well-designed programs should generally detect problems with input and alert the user with an appropriate error message rather than continuing execution with meaningless and confusing output, or, in the worst-case, dangerous consequences.)

Additional requirements

Your program may use no more than a constant amount of memory -- the total number of variables, the maximum size of arrays, and the depth of any recursion must not depend on the size of the input.
You may not use dynamically allocated memory.

Example

If the input is

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE GPX PUBLIC "http://upcycle.com/format" "version 1.0">
<gpx creator="StravaGPX" version="1.1" xmlns="http://www.topografix.com/GPX/1/1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd">
<metadata>
  <time>2018-08-24T13:49:45Z</time>
 </metadata>
 <trk>
  <name>Morning Ride</name>
  <type>1</type>
  <trkseg>
   <trkpt lat="41.3078680" lon="-72.9342120">
    <ele>20.0</ele>
    <time>2018-08-24T13:49:45Z</time>
   </trkpt>
   <trkpt lat="41.3078680" lon="-72.9342120">
    <ele>20.0</ele>
    <time>2018-08-24T13:49:46Z</time>
   </trkpt>
   <trkpt lat="41.3078810" lon="72.9342590W">
    <ele>20.0</ele>
    <time>2018-08-24T13:49:49Z</time>
   </trkpt>
  </trkseg>
 </trk>
</gpx>

then the output must be

41.3078680,-72.9342120,20.0,2018-08-24T13:49:45Z
41.3078680,-72.9342120,20.0,2018-08-24T13:49:46Z
41.3078810,72.9342590W,20.0,2018-08-24T13:49:49Z

Submissions

Submit your source code, a makefile that produces an executable called ParseGPX as its default target, and your log.