## CS 200: Utilities in Python

This notebook mirrors the <a target=ee href="https://developers.google.com/edu/python/utilities">Google Python Course: Utilities</a>


### File System -- os, os.path, shutil

The *os* and *os.path* modules include many functions to interact with the file system. The *shutil* module can copy files.

<ul>
    <li> <a target=yy href="https://docs.python.org/3/library/os.html">os module docs</a>

<li> filenames = os.listdir(dir) -- list of filenames in that directory path (not including . and ..). The filenames are just the names in the directory, not their absolute paths.

<li>  os.path.join(dir, filename) -- given a filename from the above list, use this to put the dir and filename together to make a path

<li> os.path.abspath(path) -- given a path, return an absolute form, e.g. /home/nick/foo/bar.html

<li> os.path.dirname(path), os.path.basename(path) -- given dir/foo/bar.html, return the dirname "dir/foo" and basename "bar.html"

<li> os.path.exists(path) -- True if the path exists (file or directory)

<li> os.mkdir(dir_path) -- makes one dir, os.makedirs(dir_path) makes all the needed dirs in this path

<li> shutil.copy(source-path, dest-path) -- copy a file (dest path directories should exist)
</ul>
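The functions above are often combined. Here is a minimal sketch, assuming a hypothetical helper named `list_paths` (it is not part of the os module), that turns the bare names returned by os.listdir() into absolute paths. It works in a temporary directory so it does not depend on any local files.

```python
import os
import tempfile

def list_paths(dir_path):
    """Return the absolute path of every entry in dir_path, sorted by name."""
    # os.listdir() returns bare names, so join each one back onto the directory.
    return [os.path.abspath(os.path.join(dir_path, name))
            for name in sorted(os.listdir(dir_path))]

# Build a throwaway directory with two files, then list it.
with tempfile.TemporaryDirectory() as tmp:
    for name in ('b.txt', 'a.txt'):
        with open(os.path.join(tmp, name), 'w') as fh:
            fh.write('hello\n')
    paths = list_paths(tmp)
    names = [os.path.basename(p) for p in paths]

print(names)   # ['a.txt', 'b.txt']
```

Sorting is a choice made for this sketch; os.listdir() itself returns names in arbitrary order.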


In [1]:
import os

In [2]:
os.listdir(".")

['.swirl',
 '__pycache__',
 'nn',
 '.old2018',
 'sh',
 'sklearn',
 '.old2017',
 'f0201.py',
 'f0831.py',
 'cs200.ipynb',
 'notebook.py',
 '0201.html',
 '0203.html',
 'psource.py',
 'collatz.py',
 'python.html',
 '0201nb.html',
 'google-python-exercises',
 '0201.script',
 '0201nb.ipynb',
 '0208.html',
 '0203.script',
 '0210.html',
 '0215.html',
 'f0909.py',
 'Introduction.ipynb',
 'Strings.ipynb',
 '0208.script',
 'Introduction.html',
 'cs200.html',
 'Strings.html',
 'testfile',
 'Lists.html',
 '0210.script',
 'Lists.ipynb',
 'Sorting.html',
 'Sorting.ipynb',
 'newdir',
 'recursion.py',
 'fib.py',
 'DictFiles.ipynb',
 'RegExp.ipynb',
 'Utilities.ipynb',
 '0217.html',
 '0224.html',
 'linux.words',
 'DictFiles.html',
 'puzzle.py',
 'Recursion.html',
 'Recursion.ipynb',
 'listcomp.py',
 '0217.script',
 '0301.html',
 'gullible.png',
 'Listcomp.html',
 'knapsack.py',
 'Listcomp.ipynb',
 '0224.script',
 'retest.py',
 'RegExp.html',
 'mt.py',
 'is-this-going.jpg',
 'Utilities.html',
 'hw3a.pyc

In [3]:
p = os.path.abspath('.')

In [4]:
p

'/home/httpd/html/zoo/classes/cs200/lectures'

In [5]:
f = os.path.join(p,'retest.py')

In [6]:
f

'/home/httpd/html/zoo/classes/cs200/lectures/retest.py'

In [7]:
os.path.dirname(f)

'/home/httpd/html/zoo/classes/cs200/lectures'

In [8]:
os.path.basename(f)

'retest.py'

In [9]:
os.path.exists(f)

True

In [10]:
newdir = os.path.join(p,"newdir")

In [11]:
os.path.exists(newdir)

True

In [12]:
os.mkdir(newdir)

FileExistsError: [Errno 17] File exists: '/home/httpd/html/zoo/classes/cs200/lectures/newdir'

In [13]:
os.path.exists(newdir)

True

In [14]:
newpath = os.path.join(newdir, 'a/b/c')

In [15]:
newpath

'/home/httpd/html/zoo/classes/cs200/lectures/newdir/a/b/c'

In [16]:
os.path.exists(newpath)

True

In [17]:
os.makedirs(newpath)  ## create all needed directories

FileExistsError: [Errno 17] File exists: '/home/httpd/html/zoo/classes/cs200/lectures/newdir/a/b/c'

In [18]:
os.path.exists(newpath)

True
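The FileExistsError above appears because the directories were already there from an earlier run. One way to make the cell safe to re-run, sketched here in a temporary directory, is the `exist_ok=True` argument of os.makedirs(). Note that os.mkdir() has no such argument and still raises.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    nested = os.path.join(tmp, 'a', 'b', 'c')
    os.makedirs(nested, exist_ok=True)   # creates a, a/b, and a/b/c
    os.makedirs(nested, exist_ok=True)   # second call does nothing, no exception
    created = os.path.isdir(nested)
    try:
        os.mkdir(nested)                 # mkdir on an existing directory raises
        mkdir_raised = False
    except FileExistsError:
        mkdir_raised = True

print(created, mkdir_raised)   # True True
```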

In [19]:
import shutil

In [20]:
shutil.copy('./retest.py', newdir)

'/home/httpd/html/zoo/classes/cs200/lectures/newdir/retest.py'

In [22]:
os.listdir(newdir)

['a', 'retest.py']
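The destination given to shutil.copy() may be either a directory, as above, or a full file path. A small sketch of both cases follows; the file name `backup.py` is made up for illustration, and everything happens in a temporary directory.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'retest.py')
    with open(src, 'w') as fh:
        fh.write('print("hi")\n')
    dest_dir = os.path.join(tmp, 'newdir')
    os.mkdir(dest_dir)

    # Destination is a directory: the copy keeps the source file name.
    copied = shutil.copy(src, dest_dir)
    # Destination is a file path: the copy gets the new name.
    renamed = shutil.copy(src, os.path.join(dest_dir, 'backup.py'))
    listing = sorted(os.listdir(dest_dir))

print(listing)   # ['backup.py', 'retest.py']
```

In both cases shutil.copy() returns the path of the newly created file.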

<h3 id="running"> Running External Processes -- <strike>commands</strike> subprocess</h3>

In Python 2, the *commands* module was a simple way to run an external command and capture its output.  <b>The commands module is no longer available in Python 3. Use the subprocess module instead.</b>

<ul>
    <li> <a target=qq href="https://docs.python.org/3/library/subprocess.html">subprocess  module docs</a>

<li> (status, output) = subprocess.getstatusoutput(cmd) -- runs the command, waits for it to exit, and returns its status int and output text as a tuple. The command is run with its standard output and standard error combined into the one output text. The status will be non-zero if the command failed. Since the standard-err of the command is captured, if it fails, we need to print some indication of what happened.

<li> output = subprocess.getoutput(cmd) -- as above, but without the status int.

<li> There is no subprocess.getstatus() but we can define a similar function.

<li> If you want more control over the running of the sub-process, see subprocess.run() and subprocess.Popen in the subprocess module docs linked above. (The "popen2" module recommended for Python 2 was removed in Python 3.)

<li> There is also a simple os.system(cmd) which runs the command, dumps its output onto your output, and returns its exit status. This works if you want to run the command but do not need to capture its output into your Python data structures.
</ul>

In [23]:
import subprocess

In [24]:
(status, output) = subprocess.getstatusoutput("date")

In [26]:
status

0

In [27]:
output

'Mon 01 Mar 2021 02:46:52 PM EST'

In [28]:
subprocess.getoutput("date")

'Mon 01 Mar 2021 02:48:24 PM EST'

In [6]:
subprocess.getstatus("date")

AttributeError: module 'subprocess' has no attribute 'getstatus'

In [29]:
def mygetstatus(cmd):
    (status, output) = subprocess.getstatusoutput(cmd)
    return status

In [30]:
mygetstatus('date')

0

In [31]:
mygetstatus('xsxsxs')

127

In [33]:
mygetstatus('ls /djdjd')

2
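For more control than getstatusoutput() offers, subprocess.run() keeps the exit status, standard output, and standard error separate. The sketch below assumes Python 3.7 or later (for the `capture_output` argument) and runs the current Python interpreter as the child process, so it does not depend on commands such as date or ls being installed.

```python
import subprocess
import sys

# sys.executable is the path of the running Python interpreter.
cmd = [sys.executable, '-c', 'print("hello from a child process")']

# Passing a list (not a string) avoids going through the shell.
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.returncode)        # 0
print(result.stdout.strip())    # hello from a child process

# A failing command: the child exits with status 3.
failed = subprocess.run([sys.executable, '-c', 'import sys; sys.exit(3)'],
                        capture_output=True, text=True)
print(failed.returncode)        # 3
```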

### Exceptions

An exception represents a run-time error that halts the normal execution at a particular line and transfers control to error handling code. This section just introduces the most basic uses of exceptions. For example, a run-time error might be that a variable used in the program does not have a value (NameError; you've probably seen that one a few times), or that a file open operation fails because the file does not exist (FileNotFoundError, a subclass of OSError, which Python 3 also makes available under the older name IOError). (See the <a href="http://docs.python.org/tut/node10.html">exception docs</a>.)

In [34]:
joe

NameError: name 'joe' is not defined

In [35]:
fd = open('somefile','r')

FileNotFoundError: [Errno 2] No such file or directory: 'somefile'

In [37]:
fd

NameError: name 'fd' is not defined

Without any error handling code (as we have done thus far), a run-time exception just halts the program with an error message. That's a good default behavior, and you've seen it many times. You can add a "try/except" structure to your code to handle exceptions, like this:

In [38]:
try:
    file = open("somefile",'r')
    text = file.read()
    file.close()
except FileNotFoundError:
    print ("somefile not found.  Sorry.")
print ("We keep on going.")

somefile not found.  Sorry.
We keep on going.


The try: section includes the code which might raise an exception. The except: section holds the code to run if there is an exception. If there is no exception, the except: section is skipped (that is, that code is for error handling only, not the "normal" case for the code). You can get a reference to the exception object itself with the syntax "except FileNotFoundError as e:" (e refers to the exception object). The Python 2 form "except IOError, e:" is a syntax error in Python 3.
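As a small illustration of binding the exception object to a name, the sketch below tries to open a file that is assumed not to exist (the file name is made up) and inspects the attributes that OSError subclasses carry.

```python
import errno

try:
    with open('no_such_file_for_this_demo.txt', 'r') as fh:
        text = fh.read()
except FileNotFoundError as e:
    # Python 3 deletes the name e when the except block ends,
    # so keep another reference if the object is needed later.
    caught = e
    print('errno matches ENOENT:', e.errno == errno.ENOENT)
    print('filename:', e.filename)
    print('message:', e.strerror)

print('We keep on going.')
```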

A later notebook will explore exceptions in greater detail.

### HTTP -- urllib and urlparse

#### Video:

See <a target=qq href="https://www.youtube.com/watch?v=LosIGgon_KM">urllib video</a> from Socratica.

The package *urllib* provides url fetching -- making a url look like a file from which you can read. It contains four modules: request, error, parse, and robotparser.  Here are some useful functions and methods from the request and parse modules.

<ul>
    <li> <a target=ww href="https://docs.python.org/3/library/urllib.html">urllib module docs</a>

<li> ufile = urllib.request.urlopen(url) -- returns a file like object for that url

<li> text = ufile.read() -- can read from it, like a file (readlines(), for loops,  etc. also work)
<li> info = ufile.info() -- the meta info for that request. info.get_content_type() is the MIME type, e.g. 'text/html'

<li> baseurl = ufile.geturl() -- gets the "base" url for the request, which may be different from the original because of redirects

<li> urllib.request.urlretrieve(url, filename) -- downloads the url data to the given file path

<li> urllib.parse.urljoin(baseurl, url) -- given a url that may or may not be full, and the baseurl of the page it comes from, return a full url. Use geturl() above to provide the base url.
</ul>
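The calls in the list above can be tried without network access. The sketch below uses a `data:` URL, which carries its content inline, purely as a stand-in for a real http(s) URL; the URL itself is made up for illustration.

```python
import urllib.request

# A data: URL embeds the content, so urlopen() needs no network connection.
url = 'data:text/plain,hello%20world'

with urllib.request.urlopen(url) as ufile:
    body = ufile.read()                              # bytes, not str
    content_type = ufile.info().get_content_type()   # MIME type of the response
    final_url = ufile.geturl()                       # url after any redirects

print(body.decode('utf-8'))   # hello world
print(content_type)           # text/plain
```

With a real web page the same three calls apply; read() still returns bytes, which must be decoded before being treated as text.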

In [39]:
import urllib.request   # "import urllib" alone does not load the submodules
import urllib.parse

In [40]:
cs200url = 'https://zoo.cs.yale.edu/classes/cs200/index.html'

In [41]:
ufile = urllib.request.urlopen(cs200url)

In [42]:
type(ufile)

http.client.HTTPResponse

In [43]:
text = ufile.read()

In [44]:
type(text)

bytes

In [46]:
text[0:40]

b'<HTML>\n<HEAD>\n<TITLE>CPSC 200 - Introduc'

In [47]:
len(text)

2749

In [48]:
ufile.info()

<http.client.HTTPMessage at 0x7f4244267250>

In [49]:
type(ufile.info())

http.client.HTTPMessage

In [50]:
baseurl = ufile.geturl()

In [51]:
baseurl

'https://zoo.cs.yale.edu/classes/cs200/index.html'

The for loop can iterate through the url page, just as it can iterate through a file, line by line.

In [52]:
count = 0
for line in urllib.request.urlopen(cs200url):
    count += 1
    if count < 10:
        print (line.decode('utf-8'), end='')

<HTML>
<HEAD>
<TITLE>CPSC 200 - Introduction to Information Systems</TITLE>
<BASE TARGET="_top">  <!-- Prevent Classes*V2 from branding links -->
</HEAD>
<BODY>
<dl>
  <dt><h1>CPSC 200 - Introduction to Information Systems</h1>
  <dt><h2>SPRING 2021</h2>


Now we try the urlretrieve() method.

In [53]:
urllib.request.urlretrieve(cs200url, "copyofcs200url")

('copyofcs200url', <http.client.HTTPMessage at 0x7f42442061c0>)

In [54]:
with open('copyofcs200url','r') as f:
    count = 0
    for line in f:
        count += 1
        if count > 10:
            pass
        else:
            print (line, end='')

<HTML>
<HEAD>
<TITLE>CPSC 200 - Introduction to Information Systems</TITLE>
<BASE TARGET="_top">  <!-- Prevent Classes*V2 from branding links -->
</HEAD>
<BODY>
<dl>
  <dt><h1>CPSC 200 - Introduction to Information Systems</h1>
  <dt><h2>SPRING 2021</h2>
</dl>


In [55]:
urllib.parse.urljoin(baseurl, cs200url)

'https://zoo.cs.yale.edu/classes/cs200/index.html'
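In the cell above the second argument is already a full url, so urljoin() returns it unchanged. The function is more useful with relative links, as in this sketch; the link targets are invented for illustration and need not exist on the server.

```python
from urllib.parse import urljoin

base = 'https://zoo.cs.yale.edu/classes/cs200/index.html'

# Hypothetical links as they might appear in the page at base.
same_dir = urljoin(base, 'lectures/Utilities.html')      # relative to the page's directory
parent = urljoin(base, '../cs201/index.html')            # up one directory
rooted = urljoin(base, '/about.html')                    # relative to the site root
absolute = urljoin(base, 'https://docs.python.org/3/')   # already full: unchanged

for u in (same_dir, parent, rooted, absolute):
    print(u)
# https://zoo.cs.yale.edu/classes/cs200/lectures/Utilities.html
# https://zoo.cs.yale.edu/classes/cs201/index.html
# https://zoo.cs.yale.edu/about.html
# https://docs.python.org/3/
```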

We can explore these objects using the dir() function and related functions.  These techniques are generally useful, beyond the urllib module.

In [56]:
dir(urllib.request)

['AbstractBasicAuthHandler',
 'AbstractDigestAuthHandler',
 'AbstractHTTPHandler',
 'BaseHandler',
 'CacheFTPHandler',
 'ContentTooShortError',
 'DataHandler',
 'FTPHandler',
 'FancyURLopener',
 'FileHandler',
 'HTTPBasicAuthHandler',
 'HTTPCookieProcessor',
 'HTTPDefaultErrorHandler',
 'HTTPDigestAuthHandler',
 'HTTPError',
 'HTTPErrorProcessor',
 'HTTPHandler',
 'HTTPPasswordMgr',
 'HTTPPasswordMgrWithDefaultRealm',
 'HTTPPasswordMgrWithPriorAuth',
 'HTTPRedirectHandler',
 'HTTPSHandler',
 'MAXFTPCACHE',
 'OpenerDirector',
 'ProxyBasicAuthHandler',
 'ProxyDigestAuthHandler',
 'ProxyHandler',
 'Request',
 'URLError',
 'URLopener',
 'UnknownHandler',
 '__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '__version__',
 '_cut_port_re',
 '_ftperrors',
 '_have_ssl',
 '_localhost',
 '_noheaders',
 '_opener',
 '_parse_proxy',
 '_proxy_bypass_macosx_sysconf',
 '_randombytes',
 '_safe_gethostbyname',
 '_splitattr',
 '_sp

In [57]:
ufile.length

0

In [10]:
ufile.status

200

In [14]:
dir(ufile)

['__abstractmethods__',
 '__class__',
 '__del__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__next__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '_abc_impl',
 '_checkClosed',
 '_checkReadable',
 '_checkSeekable',
 '_checkWritable',
 '_check_close',
 '_close_conn',
 '_get_chunk_left',
 '_method',
 '_peek_chunked',
 '_read1_chunked',
 '_read_and_discard_trailer',
 '_read_next_chunk_size',
 '_read_status',
 '_readall_chunked',
 '_readinto_chunked',
 '_safe_read',
 '_safe_readinto',
 'begin',
 'chunk_left',
 'chunked',
 'close',
 'closed',
 'code',
 'debuglevel',
 'detach',
 'fileno',
 'flush',
 'fp',
 'getcode',
 'getheader',
 'getheaders',
 'geturl',
 'headers',
 'info',
 'isatty',
 'isclosed',

In [58]:
ufile.url

'https://zoo.cs.yale.edu/classes/cs200/index.html'

### Neat trick for analyzing an object's properties and methods

In [33]:
for x in dir(ufile):
    if x.startswith('_'):
        pass
    else:
        print (x, '\t\t', getattr(ufile, x), '\n')

begin 		 <bound method HTTPResponse.begin of <http.client.HTTPResponse object at 0x7f9764df25b0>> 

chunk_left 		 UNKNOWN 

chunked 		 False 

close 		 <bound method HTTPResponse.close of <http.client.HTTPResponse object at 0x7f9764df25b0>> 

closed 		 False 

code 		 200 

debuglevel 		 0 

detach 		 <built-in method detach of HTTPResponse object at 0x7f9764df25b0> 

fileno 		 <bound method HTTPResponse.fileno of <http.client.HTTPResponse object at 0x7f9764df25b0>> 

flush 		 <bound method HTTPResponse.flush of <http.client.HTTPResponse object at 0x7f9764df25b0>> 

fp 		 None 

getcode 		 <bound method HTTPResponse.getcode of <http.client.HTTPResponse object at 0x7f9764df25b0>> 

getheader 		 <bound method HTTPResponse.getheader of <http.client.HTTPResponse object at 0x7f9764df25b0>> 

getheaders 		 <bound method HTTPResponse.getheaders of <http.client.HTTPResponse object at 0x7f9764df25b0>> 

geturl 		 <bound method HTTPResponse.geturl of <http.client.HTTPResponse object at 0x7f9764d
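The same loop can be packaged as a function that separates methods from plain data attributes. This is a sketch; the helper name `describe` is hypothetical, and it is applied to a string here so it runs without a network connection.

```python
def describe(obj):
    """Split the public names of obj into methods and data attributes."""
    methods, data = [], []
    for name in dir(obj):
        if name.startswith('_'):
            continue                      # skip private and dunder names
        value = getattr(obj, name)
        # callable() is True for methods and functions, False for plain values.
        (methods if callable(value) else data).append(name)
    return methods, data

methods, data = describe('CS 200')
print(len(methods), 'methods;', len(data), 'data attributes')
print(methods[:5])
```

Calling `describe(ufile)` on an HTTP response would put names such as `read` and `geturl` in the first list and values such as `status` and `closed` in the second.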

In [59]:
dir(os)

['CLD_CONTINUED',
 'CLD_DUMPED',
 'CLD_EXITED',
 'CLD_TRAPPED',
 'DirEntry',
 'EX_CANTCREAT',
 'EX_CONFIG',
 'EX_DATAERR',
 'EX_IOERR',
 'EX_NOHOST',
 'EX_NOINPUT',
 'EX_NOPERM',
 'EX_NOUSER',
 'EX_OK',
 'EX_OSERR',
 'EX_OSFILE',
 'EX_PROTOCOL',
 'EX_SOFTWARE',
 'EX_TEMPFAIL',
 'EX_UNAVAILABLE',
 'EX_USAGE',
 'F_LOCK',
 'F_OK',
 'F_TEST',
 'F_TLOCK',
 'F_ULOCK',
 'GRND_NONBLOCK',
 'GRND_RANDOM',
 'MFD_ALLOW_SEALING',
 'MFD_CLOEXEC',
 'MFD_HUGETLB',
 'MFD_HUGE_16GB',
 'MFD_HUGE_16MB',
 'MFD_HUGE_1GB',
 'MFD_HUGE_1MB',
 'MFD_HUGE_256MB',
 'MFD_HUGE_2GB',
 'MFD_HUGE_2MB',
 'MFD_HUGE_32MB',
 'MFD_HUGE_512KB',
 'MFD_HUGE_512MB',
 'MFD_HUGE_64KB',
 'MFD_HUGE_8MB',
 'MFD_HUGE_MASK',
 'MFD_HUGE_SHIFT',
 'MutableMapping',
 'NGROUPS_MAX',
 'O_ACCMODE',
 'O_APPEND',
 'O_ASYNC',
 'O_CLOEXEC',
 'O_CREAT',
 'O_DIRECT',
 'O_DIRECTORY',
 'O_DSYNC',
 'O_EXCL',
 'O_LARGEFILE',
 'O_NDELAY',
 'O_NOATIME',
 'O_NOCTTY',
 'O_NOFOLLOW',
 'O_NONBLOCK',
 'O_PATH',
 'O_RDONLY',
 'O_RDWR',
 'O_RSYNC',
 'O_SYNC',


In [60]:
os.__doc__

"OS routines for NT or Posix depending on what system we're on.\n\nThis exports:\n  - all functions from posix or nt, e.g. unlink, stat, etc.\n  - os.path is either posixpath or ntpath\n  - os.name is either 'posix' or 'nt'\n  - os.curdir is a string representing the current directory (always '.')\n  - os.pardir is a string representing the parent directory (always '..')\n  - os.sep is the (or a most common) pathname separator ('/' or '\\\\')\n  - os.extsep is the extension separator (always '.')\n  - os.altsep is the alternate pathname separator (None or '/')\n  - os.pathsep is the component separator used in $PATH etc\n  - os.linesep is the line separator in text files ('\\r' or '\\n' or '\\r\\n')\n  - os.defpath is the default search path for executables\n  - os.devnull is the file path of the null device ('/dev/null', etc.)\n\nPrograms that import and use 'os' stand a better chance of being\nportable between different platforms.  Of course, they must then\nonly use functions that a

End of Utilities notebook.

In [61]:
os.__name__

'os'

In [62]:
def f():
    return 9

In [63]:
f()

9

In [64]:
dir(f)

['__annotations__',
 '__call__',
 '__class__',
 '__closure__',
 '__code__',
 '__defaults__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__get__',
 '__getattribute__',
 '__globals__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__kwdefaults__',
 '__le__',
 '__lt__',
 '__module__',
 '__name__',
 '__ne__',
 '__new__',
 '__qualname__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__']