Assignment 1: Parsing GPX Files

Objectives

Introduction

Global Positioning System devices are used in a wide variety of applications, from ride-hailing services to fitness tracking to exergaming to migratory animal tracking to illegal fishing monitoring. Many GPS devices record a GPS track, which is a record of the position of the device over time. GPS tracks are often stored in GPX files, which are a specific kind of XML file. For this assignment, we will process GPX files to extract the location information for easier use in other applications.

Assignment

Write a program called ParseGPX that extracts GPS track information from GPX files read from standard input.

GPX, and XML in general, has many features, but for this assignment we will assume that the files we're processing are somewhat more restricted than general GPX/XML files.

In general, an XML file contains a prolog followed by XML elements. XML elements are delimited by start and end tags (we will not consider the special syntax for empty elements), where each tag starts with a < character and ends with a > character. Between those delimiters, a start tag contains a sequence of non-whitespace characters giving the element type, optionally followed by whitespace and a whitespace-separated list of attributes with unique names. Each attribute is given as an attribute name followed by an equals sign (=) and a quoted attribute value. There may be whitespace around the equals sign. Here and elsewhere in our XML files, a character counts as whitespace exactly when C's isspace function says it does. An end tag has only the element type, preceded by a forward slash (/).
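To make the start-tag layout concrete, here is a minimal sketch of looking up one attribute in the text between < and >. The helper names find_attr and skip_space are hypothetical, and the sketch assumes that a value may be delimited by either single or double quotes and that the closing quote matches the opening one. It also copies into a fixed-size buffer and truncates, which a full solution should probably not rely on, since nothing here bounds the length of a value.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Skip characters that isspace() classifies as whitespace. */
static const char *skip_space(const char *p)
{
    while (*p != '\0' && isspace((unsigned char)*p))
        p++;
    return p;
}

/*
 * Hypothetical helper: look up attribute `name` in `tag`, the text between
 * '<' and '>' of a start tag, e.g.  trkpt lat = "1" lon='2'.
 * On success, copies the unquoted value into out (truncated to cap - 1
 * characters) and returns 1. Returns 0 if the attribute is missing or the
 * tag is malformed. Attribute names are compared case-sensitively.
 */
int find_attr(const char *tag, const char *name, char *out, size_t cap)
{
    const char *p = skip_space(tag);

    /* Skip the element type: a run of non-whitespace characters. */
    while (*p != '\0' && !isspace((unsigned char)*p))
        p++;

    for (;;) {
        p = skip_space(p);
        if (*p == '\0')
            return 0;                     /* no more attributes */

        /* Attribute name runs up to whitespace or '='. */
        const char *start = p;
        while (*p != '\0' && *p != '=' && !isspace((unsigned char)*p))
            p++;
        size_t len = (size_t)(p - start);

        /* Optional whitespace on either side of '='. */
        p = skip_space(p);
        if (*p != '=')
            return 0;
        p = skip_space(p + 1);

        /* Assumption: either quote character may delimit the value. */
        char quote = *p;
        if (quote != '"' && quote != '\'')
            return 0;
        const char *val = ++p;
        while (*p != '\0' && *p != quote)
            p++;
        if (*p == '\0')
            return 0;                     /* unterminated value */

        if (len == strlen(name) && strncmp(start, name, len) == 0) {
            size_t n = (size_t)(p - val);
            if (cap == 0)
                return 0;
            if (n >= cap)
                n = cap - 1;
            memcpy(out, val, n);
            out[n] = '\0';
            return 1;
        }
        p++;                              /* step past the closing quote */
    }
}
```

Note the casts to unsigned char before calling isspace: passing a negative char value to the <ctype.h> functions is undefined behavior.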

XML elements may contain text (characters not in a tag) or other XML elements. Elements must nest properly, so any child elements must have their start tag and end tag both inside their parent element. The prolog of an XML file contains tag-like items that start and end with <? and ?> or <! and > and these tag-like items may contain a sequence of things that look like attribute names and quoted values (see the example below).

A GPX file is a specific kind of XML file with specific kinds of elements structured in a particular way. In particular, a GPX file contains trkpt elements with lat and lon attributes in the opening tag whose values give the latitude and longitude of the tracked object at some point in time. The trkpt elements contain ele and time elements whose text gives the elevation of the object and the time of the measurements respectively.

Our task is to extract the values of the lat and lon attributes of the trkpt opening tags along with the text of the ele and time elements contained in the trkpt elements: for each trkpt, output a comma-separated list of the lat and lon values and ele and time text in that order. Each piece of data should be copied verbatim to the output, with two exceptions: the quotes at the beginning and end of each attribute value must be removed, and any commas in attribute values and element text must be escaped by replacing them with &comma;. The data for each trkpt should be written to standard output, one trkpt per line, with a newline at the end of each, and no other output.
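The comma-escaping rule can be sketched as follows. The function name escape_commas is hypothetical, and returning a newly allocated copy is just one possible design; writing characters straight to standard output as they are read would work equally well.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return a newly allocated copy of s in which every
 * ',' is replaced by "&comma;". The caller frees the result. Returns NULL
 * if memory cannot be allocated.
 */
char *escape_commas(const char *s)
{
    static const char esc[] = "&comma;";
    const size_t esc_len = sizeof esc - 1;

    /* First pass: measure the input and count the commas. */
    size_t len = 0, commas = 0;
    for (const char *p = s; *p != '\0'; p++) {
        len++;
        if (*p == ',')
            commas++;
    }

    /* Each comma grows the string by esc_len - 1 characters. */
    char *out = malloc(len + commas * (esc_len - 1) + 1);
    if (out == NULL)
        return NULL;

    /* Second pass: copy, substituting the escape sequence. */
    char *q = out;
    for (const char *p = s; *p != '\0'; p++) {
        if (*p == ',') {
            memcpy(q, esc, esc_len);
            q += esc_len;
        } else {
            *q++ = *p;
        }
    }
    *q = '\0';
    return out;
}
```

With the four fields escaped this way, one output line could then be written with something like printf("%s,%s,%s,%s\n", lat, lon, ele, time), where those four variable names are placeholders for the escaped strings.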

Your program's output will be tested on inputs that obey these rules (note that our inputs will follow the rules below and not the official GPX standard, so our specification is much more permissive than the official standard and you should not assume that rules from the official GPX standard carry over to our specification unless specifically listed below):

We do relax the XML specification in two ways: we use isspace to determine what is whitespace, which includes two characters (vertical tab and form feed) that the XML specification does not allow; and we consider element types to be not case-sensitive (but attribute names remain case-sensitive). So, for example, we want to extract trkpt, TRKPT, and tRkPt elements. The start and end tags don't have to match case either: a <ELE> start tag could be paired with a </ele> end tag.
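A case-insensitive comparison of element types might look like the sketch below. The name same_element_type is hypothetical. ISO C has no case-insensitive string comparison in its standard library (strcasecmp comes from POSIX), so this version folds case one character at a time with tolower.

```c
#include <ctype.h>

/*
 * Hypothetical helper: return 1 if element types a and b are equal when
 * case is ignored, and 0 otherwise. Attribute names should NOT be compared
 * this way, since they remain case-sensitive.
 */
int same_element_type(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    /* Equal only if both strings ended at the same position. */
    return *a == *b;
}
```

Under this sketch, same_element_type("ELE", "ele") is true, so an <ELE> start tag pairs with a </ele> end tag, while "trk" and "trkpt" are treated as different element types.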

Your program will also be tested on inputs that do not obey those rules, and in such cases the criterion for passing a test is simply whether your program ran to completion without crashing or going into an infinite loop; the output can be anything or nothing in these cases. So your program needn't check the validity of the input file, as long as it won't crash or hang when the input is invalid. (Although well-designed programs should generally detect problems with input and alert the user with an appropriate error message rather than continuing execution with meaningless and confusing output, or, in the worst case, dangerous consequences.)
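Two common ways to crash on unexpected input are assuming a maximum length and mishandling end of input. The sketch below shows one defensive way to read a whole stream into memory; the name read_all is hypothetical, and processing the input one character at a time with a state machine is an equally reasonable design. The points it illustrates are that the result of getc is stored in an int so that EOF can be told apart from every valid character, and that every allocation is checked.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: read everything from `in` into a growing,
 * NUL-terminated buffer. Stores the number of bytes read in *len_out and
 * returns the buffer (the caller frees it), or returns NULL if memory
 * runs out. The input may itself contain NUL bytes, so use *len_out
 * rather than strlen to find the end of the data.
 */
char *read_all(FILE *in, size_t *len_out)
{
    size_t cap = 4096, len = 0;
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    int c;  /* int, not char: EOF must be distinguishable from any byte */
    while ((c = getc(in)) != EOF) {
        if (len + 1 == cap) {             /* keep one slot for the NUL */
            if (cap > (size_t)-1 / 2) {   /* doubling would overflow */
                free(buf);
                return NULL;
            }
            char *bigger = realloc(buf, cap * 2);
            if (bigger == NULL) {
                free(buf);
                return NULL;
            }
            buf = bigger;
            cap *= 2;
        }
        buf[len++] = (char)c;
    }
    buf[len] = '\0';
    *len_out = len;
    return buf;
}
```

A program built on this sketch would call read_all(stdin, &n) once and then scan the buffer, taking care never to read past index n when a tag or quoted value is cut off by the end of the input.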

Additional requirements

Example

If the input is
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE GPX PUBLIC "http://upcycle.com/format" "version 1.0">
<gpx creator="StravaGPX" version="1.1" xmlns="http://www.topografix.com/GPX/1/1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd">
<metadata>
  <time>2018-08-24T13:49:45Z</time>
 </metadata>
 <trk>
  <name>Morning Ride</name>
  <type>1</type>
  <trkseg>
   <trkpt lat="41.3078680" lon="-72.9342120">
    <ele>20.0</ele>
    <time>2018-08-24T13:49:45Z</time>
   </trkpt>
   <trkpt lat="41.3078680" lon="-72.9342120">
    <ele>20.0</ele>
    <time>2018-08-24T13:49:46Z</time>
   </trkpt>
   <trkpt lat="41.3078810" lon="72.9342590W">
    <ele>20.0</ele>
    <time>2018-08-24T13:49:49Z</time>
   </trkpt>
  </trkseg>
 </trk>
</gpx>
  
then the output must be
41.3078680,-72.9342120,20.0,2018-08-24T13:49:45Z
41.3078680,-72.9342120,20.0,2018-08-24T13:49:46Z
41.3078810,72.9342590W,20.0,2018-08-24T13:49:49Z
  

Submissions

Submit your source code, a makefile that produces an executable called ParseGPX as its default target, and your log.