# CS 200: Strings in Python

<p>
<script language="JavaScript">
    document.write("Last modified: " + document.lastModified)
</script>
<p>
This notebook mirrors the <a target=ee href="https://developer.google.com/edu/python/strings">Google Python Course: Strings</a>

### Video:

See <a target=ss href="https://www.socratica.com/lesson/strings">Strings</a> from Socratica.

### str class

Python strings are instances of the <code>str</code> class.
<code>str(object)</code> is a constructor which creates a new string from the given object.

In [66]:
str(123)

'123'

In [67]:
str(3 + 7)

'10'
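As a small illustration of the constructor, the sketch below applies <code>str()</code> to a few other built-in objects. The variable names are made up for this example.

```python
# str() accepts any object and returns its string form.
values = [123, 3.5, True, None, [1, 2]]
as_text = [str(v) for v in values]
print(as_text)  # ['123', '3.5', 'True', 'None', '[1, 2]']

# A number must be converted before it can be concatenated to a string.
label = 'total: ' + str(3 + 7)
print(label)  # total: 10
```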

Strings can be written with single quotes ('), double quotes ("), or triple quotes (''' or """) for strings that span multiple lines.

In [68]:
'hello world'

'hello world'

In [69]:
"hello world"

'hello world'

In [70]:
'''
wait for it:
hello world!
'''

'\nwait for it:\nhello world!\n'

The '\n' character is "newline".

In [71]:
print('\nwait for it:\nhello world!\n')


wait for it:
hello world!



### escape sequences

If you want to include a single or double quote inside a string, you can escape it with a backslash (\\).  If you want to include a backslash in a string, use two backslashes.  See <a target=tt href="http://www.python-ds.com/python-3-escape-sequences">Python 3 escape sequences</a> for more information.

In [72]:
'I can\'t wait!'

"I can't wait!"

In [73]:
"I can't wait"

"I can't wait"

In [74]:
'"Who are you?"'

'"Who are you?"'

In [75]:
print("heaven \\ hell")

heaven \ hell
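One point worth checking for yourself: an escape sequence takes two characters to type but stores a single character. A minimal sketch, using a made-up file path purely for illustration:

```python
# Each escape sequence is two characters in source code
# but a single character in the resulting string.
newline = '\n'
backslash = '\\'
quote = '\''
print(len(newline), len(backslash), len(quote))  # 1 1 1

path = 'C:\\temp\\notes.txt'   # hypothetical path, for illustration only
print(path)        # C:\temp\notes.txt
print(len(path))   # 17
```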


### string methods

The str class has a boatload of methods.

In [76]:
dir(str)

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isascii',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'removeprefix',
 'removesuffix',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']


### Character Case

lower() and upper() convert a string to all lowercase or all uppercase, respectively.  

Methods are invoked on instances of the object using the syntax

<code>instance.method()</code>

In [77]:
s = "   Hello World!  "

In [78]:
s

'   Hello World!  '

In [79]:
s.lower()

'   hello world!  '

In [80]:
s

'   Hello World!  '

In [81]:
s.upper()

'   HELLO WORLD!  '

In [82]:
s

'   Hello World!  '

Note that lower() and upper() are not destructive.  Strings are immutable, so each method returns a new string and leaves the original unchanged.

strip() removes leading and trailing whitespace from a string.

In [83]:
s.strip()

'Hello World!'

In [84]:
s

'   Hello World!  '
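Because each of these methods returns a new string, calls can be chained, and a result is kept only if you assign it to a name. A short sketch (the names <code>raw</code> and <code>cleaned</code> are invented for this example):

```python
raw = '   Hello World!  '
cleaned = raw.strip().lower()   # each method returns a new string
print(repr(cleaned))  # 'hello world!'
print(repr(raw))      # '   Hello World!  '  (unchanged)

# To keep a result, rebind the name to the returned string.
raw = raw.strip()
print(repr(raw))      # 'Hello World!'
```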

### Character types: alpha, digit, space

Three common character categories are alpha, digit, and space.  The methods isalpha(), isdigit(), and isspace() check whether every character in a string belongs to the category; each returns False for the empty string.

In [85]:
s.isalpha()

False

In [86]:
'abcde'.isalpha()

True

In [87]:
'123'.isdigit()

True

In [88]:
'123.456'.isdigit()

False

In [89]:
s.isspace()

False

In [90]:
' \n\t\r\f'.isspace()

True

Note that whitespace includes space, newline, tab, return, and formfeed.  These characters do not put ink (or pixels) on the page.
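The three checks can be combined to sort individual characters into categories. The helper <code>classify</code> below is a hypothetical function written for this example, not part of Python.

```python
def classify(ch):
    """Return a rough category name for a single character (hypothetical helper)."""
    if ch.isalpha():
        return 'alpha'
    if ch.isdigit():
        return 'digit'
    if ch.isspace():
        return 'space'
    return 'other'

categories = [classify(ch) for ch in 'a1 !']
print(categories)  # ['alpha', 'digit', 'space', 'other']

# The checks apply to every character, and the empty string fails all of them.
print('abc1'.isalpha())  # False
print(''.isalpha())      # False
```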

A common string operation is to match the start, end, or middle of a string.

In [91]:
s = s.strip()

In [92]:
s

'Hello World!'

In [93]:
s.startswith('Hello')

True

In [94]:
s.endswith('World!')

True

In [95]:
s.find(' ')

5

In [96]:
s

'Hello World!'

In [97]:
s.find('Q')

-1

In [98]:
x = '  xxx  '

In [99]:
y = x

In [100]:
x = x.strip()

In [101]:
x

'xxx'

In [102]:
y

'  xxx  '

In [103]:
s

'Hello World!'

In the find() results above, 5 is the index of the space, which is the sixth character of the string because indexes start at 0.  -1 indicates that the given character is not in the string.
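Since find() returns an index rather than True or False, a match at position 0 needs care. The sketch below compares find() with the <code>in</code> operator and with the related index() method, which raises an error instead of returning -1.

```python
s = 'Hello World!'

# find() returns an index, so compare against -1 rather than testing truthiness:
# a match at index 0 would otherwise look like "not found".
print(s.find('Hello'))        # 0
print(s.find('Hello') != -1)  # True
print('World' in s)           # True  (simplest membership test)

# index() works like find() but raises ValueError when there is no match.
try:
    s.index('Q')
except ValueError as err:
    print('index() raised:', err)  # index() raised: substring not found
```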

### String indexing

Strings are indexed using zero-based indexing.  That means that the first character of the string is string[0].

In [104]:
s[0]

'H'

Positive integers index the string from left to right, starting with 0.  Negative integers index the string from right to left, starting with -1.


<table style="font-size:20px">
<tr>    
<td> H </td><td> e </td><td> l </td><td> l </td><td> o </td><td>   </td><td> W </td><td> o </td><td> r </td><td> l </td><td> d </td><td> ! </td>
</tr><tr>
<td> 0 </td><td> 1 </td><td> 2 </td><td> 3 </td><td> 4 </td><td> 5 </td><td> 6 </td><td> 7 </td><td> 8 </td><td> 9 </td><td> 10 </td><td> 11 </td>
</tr><tr>
<td> -12 </td><td> -11 </td><td> -10 </td><td> -9 </td><td> -8 </td><td> -7 </td><td> -6 </td><td> -5 </td><td> -4 </td><td> -3 </td><td> -2 </td><td> -1 </td>
</tr>
</table>


In [105]:
s[11]

'!'

In [106]:
s[-1]

'!'

In [107]:
s[-6]

'W'

In [108]:
s[12]

IndexError: string index out of range

In [109]:
len(s)

12

In [110]:
len('')

0

In [111]:
'x'[0]

'x'

In [112]:
''[0]

IndexError: string index out of range

len() tells you how many characters are in a string or the length of the string. The empty string is of zero length. 

If you try to index a string out of bounds, Python raises an IndexError.
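One way to guard against an out-of-range index is to catch the IndexError. The helper <code>char_at</code> below is a hypothetical function written for this sketch.

```python
def char_at(text, i, default=''):
    """Return text[i], or default when i is out of range (hypothetical helper)."""
    try:
        return text[i]
    except IndexError:
        return default

s = 'Hello World!'
print(char_at(s, 11))               # !
print(repr(char_at(s, 12)))         # ''
print(char_at(s, -12))              # H
print(char_at('', 0, default='?'))  # ?

# Valid indexes run from -len(s) through len(s) - 1.
print(len(s))  # 12
```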

### String slices

You can specify a subset of a string using the <code>[start:end]</code> notation, where start is inclusive, but end is not.

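If you leave off the start, the slice begins at the beginning of the string; if you leave off the end, it runs to the end of the string.  Thus, [:] gives you the entire string.  (For mutable sequences such as lists, [:] makes a copy; for strings, CPython can return the original object, as the id() section below shows.)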

In [113]:
s[1:4]

'ell'

In [114]:
s[4:]

'o World!'

In [115]:
s[:4]

'Hell'

In [116]:
s[:]  ## the entire string (for lists, this idiom makes a copy)

'Hello World!'

In [117]:
s[-6:-2]

'Worl'

In [118]:
s[:-1]

'Hello World'
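Unlike indexing, slicing is forgiving about bounds. The sketch below shows out-of-range slices and the property that the two halves of a string cut at any position rebuild the original.

```python
s = 'Hello World!'

# Slices never raise IndexError; out-of-range bounds are clipped.
print(s[6:100])      # World!
print(repr(s[50:]))  # ''

# For any i, the two halves always rebuild the original string.
i = 5
head, tail = s[:i], s[i:]
print(repr(head), repr(tail))  # 'Hello' ' World!'
print(head + tail == s)        # True

# len(s[a:b]) is b - a when both bounds are within range.
print(len(s[1:4]))  # 3
```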

### id() gives the memory address

The <code>[:]</code> slice from the previous section returns the entire string.  The <code>id()</code> function gives you an integer that identifies an object; in CPython it is the object's memory address.  Two names whose objects have the same id refer to the identical object.


In [123]:
s = "a string"

In [124]:
id(s)

139683819578672

In [125]:
s2 = s

In [126]:
id(s2)

139683819578672

s and s2 have the same address in memory.  They are identical. 

<code>==</code> compares values.

<code>is</code> compares memory addresses.

In [127]:
s == s2

True

In [128]:
s is s2

True

s and s2 have the same value and the same address in memory.

In [129]:
s3 = s[:]

In [130]:
s3

'a string'

In [131]:
id(s3)

139683819578672

In [132]:
s == s3

True

In [133]:
s is s3

True

In [134]:
id(s)

139683819578672

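Note: s3 has the same address as s because strings are immutable, so CPython returns the original object for a full slice instead of making a copy.  Separately, Python <b>interns</b> some strings (for example, identifier-like literals) to save storage: when such a string is created, Python reuses an existing string with that value.  Both behaviors are implementation details, so compare string values with <code>==</code>, not <code>is</code>.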



In [135]:
s = s +''

In [136]:
id(s)

139683819578672

In [137]:
s is s3

True

In [138]:
s = s + 'x'

In [139]:
id(s)

139684083129072

In [140]:
s is s3

False
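Whether two equal strings share an address can vary between interpreters and versions, so the sketch below sticks to what is dependable: <code>==</code> for comparing values, and the standard library function <code>sys.intern()</code> when one shared object per value is actually wanted. The variable names are made up for this example.

```python
import sys

a = 'a string'
b = ''.join(['a', ' ', 'string'])  # same value, built at run time

print(a == b)   # True: the values match
# Whether `a is b` holds depends on the interpreter, so it is not printed here.

# sys.intern() returns one canonical object per distinct value.
a_interned = sys.intern(a)
b_interned = sys.intern(b)
print(a_interned is b_interned)  # True
```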

### extended string slices

You may use a third parameter to specify the step for the slice, <code>string[start:end:step]</code>.

In [142]:
s

'a stringx'

In [143]:
s[::2]  ## every other letter

'asrnx'

In [144]:
s[::3] ## every third letter

'atn'

In [145]:
s[::1]  ## every letter

'a stringx'

In [146]:
s[::-1]   ### Reverses the string!

'xgnirts a'

In [147]:
s == s[::-1]

False

In [148]:
x = 'radar'

In [149]:
x == x[::-1]

True

This is an easy way to check for palindromes.
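The same reversal trick extends to phrases if case, spaces, and punctuation are dropped first. The function <code>is_palindrome</code> below is a hypothetical helper, offered as one possible sketch.

```python
def is_palindrome(text):
    """Return True if text reads the same both ways, ignoring case,
    spaces, and punctuation (hypothetical helper)."""
    letters = ''
    for ch in text.lower():
        if ch.isalnum():          # keep only letters and digits
            letters = letters + ch
    return letters == letters[::-1]

print(is_palindrome('radar'))                           # True
print(is_palindrome('A man, a plan, a canal: Panama'))  # True
print(is_palindrome('a stringx'))                       # False
```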

<h3 id="replace">string replace, split, and join</h3>

<code>s.replace('old','new')</code> - replace every occurrence of "old" in <code>s</code> with "new"

<code>s.split(delimiter)</code> - return a list of the elements of string <code>s</code> using the given delimiter to partition the string.

<code>s.join(list)</code> - splice together the sequential elements of list using the string <code>s</code> as the glue.

In [150]:
s

'a stringx'

In [151]:
s.replace('l','*')

'a stringx'

In [152]:
s

'a stringx'

Note: replace returns a new string and does not change the original.  Here <code>s</code> contains no 'l', so the result has the same value as <code>s</code>.

In [153]:
'good boy'.replace('good','bad')

'bad boy'

In [154]:
'good boy'.replace('good','bad').replace('boy','girl')

'bad girl'

Note: the second replace is called on the string returned by the first replace.

In [155]:
'Hello World'.replace('l','d').replace('d','l')

'Hello Worll'

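You need to be careful when doing sequential replaces: the second replace cannot tell the 'd' characters produced by the first replace from the 'd' that was already in the string, so all of them become 'l'.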
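If the goal was to swap the two letters, one alternative is a single-pass translation with the built-in <code>str.maketrans</code> and <code>translate</code> methods (both appear in the dir(str) listing). This is a sketch of that alternative, not something the original notebook uses.

```python
text = 'Hello World'

# Chained replaces: the second call also rewrites the 'd' that was there originally.
chained = text.replace('l', 'd').replace('d', 'l')
print(chained)  # Hello Worll

# str.maketrans builds a one-pass mapping, so 'l' and 'd' are truly swapped.
swap = str.maketrans('ld', 'dl')
swapped = text.translate(swap)
print(swapped)  # Heddo Wordl
```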

### split()

In [156]:
r = 'Romeo and Juliet'.split()

In [157]:
r

['Romeo', 'and', 'Juliet']

In [158]:
len(r)

3

<code>split</code> returns a list, which we will cover later.

In [159]:
r2 = 'Romeo and Juliet'.split(' ')

In [160]:
r2

['Romeo', 'and', 'Juliet']

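With no argument, split() splits on runs of whitespace (spaces, tabs, newlines).  For this string, passing ' ' explicitly gives the same result.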

In [161]:
r3 = '203-555-1212'.split()

In [162]:
r3

['203-555-1212']

In [163]:
r4 = '203-555-1212'.split('-')

In [164]:
r4

['203', '555', '1212']

In [165]:
r3[0].replace('-','')


'2035551212'

The r3 phone number had no spaces, so it did not get split.  Using a hyphen as a delimiter, we split r4 in three parts.
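Calling split() with no argument and calling split(' ') agree on simple input but differ when there are repeated or mixed whitespace characters. A brief sketch of the difference, plus the optional second argument that limits the number of splits:

```python
line = '  Romeo   and\tJuliet \n'

# No argument: split on runs of any whitespace and drop empty strings.
words = line.split()
print(words)  # ['Romeo', 'and', 'Juliet']

# Explicit ' ': split at every single space, keeping empty strings.
pieces = 'a  b'.split(' ')
print(pieces)  # ['a', '', 'b']

# An optional second argument limits the number of splits.
area_and_rest = '203-555-1212'.split('-', 1)
print(area_and_rest)  # ['203', '555-1212']
```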

### join()

We now can glue the lists together with join.

In [166]:
' '.join(r2)

'Romeo and Juliet'

In [167]:
'***'.join(r2)

'Romeo***and***Juliet'

In [168]:
'----'.join(r4)

'203----555----1212'
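Two details about join are easy to miss: joining with the original delimiter undoes a split, and every element must already be a string. A minimal sketch with made-up variable names:

```python
parts = '203-555-1212'.split('-')
rebuilt = '-'.join(parts)
print(rebuilt == '203-555-1212')  # True: join undoes split

# join() accepts only strings, so convert other values first.
numbers = [203, 555, 1212]
dotted = '.'.join(str(n) for n in numbers)
print(dotted)  # 203.555.1212

# Joining on the empty string concatenates the pieces.
digits = ''.join(parts)
print(digits)  # 2035551212
```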

### ASCII (and Unicode) characters

Computer characters from the Roman alphabet are represented as numbers using the American Standard Code for Information Interchange (<a target=aa href="http://en.wikipedia.org/wiki/ASCII">ASCII</a>).

However, there are thousands of other characters.  Those are represented using <a target=qq href="http://en.wikipedia.org/wiki/Unicode">Unicode</a>.

Python has two functions, <code>ord(character)</code>, and <code>chr(number)</code> which convert between characters and numeric codes.

In [169]:
ord('A')

65

In [170]:
ord('B')

66

In [171]:
ord('a')

97

In [172]:
ord('b')

98

Notice that ASCII is designed so that sorting words by their numerical values results in sorting alphabetically.  However, upper case letters sort before lower case letters. In UNIX, the <code>ls</code> command for listing directories often reflects this property.
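The effect on sorting can be seen directly with <code>sorted()</code>. The word list below is invented for this sketch; passing <code>key=str.lower</code> is one common way to get a case-insensitive order.

```python
words = ['banana', 'Zebra', 'apple']

# Default order compares character codes, so 'Z' (90) sorts before 'a' (97).
by_code = sorted(words)
print(by_code)  # ['Zebra', 'apple', 'banana']

# Compare lowercased copies to get a case-insensitive order.
ignoring_case = sorted(words, key=str.lower)
print(ignoring_case)  # ['apple', 'banana', 'Zebra']

print(ord('Z'), ord('a'))  # 90 97
```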

In [173]:
chr(65)

'A'

In [174]:
chr(97)

'a'

We can specify Unicode characters using hexadecimal (base 16) notation.

In [175]:
0xA

10

In [176]:
0x10

16

In [177]:
0x3b4

948

In [178]:
chr(0x3b4)  ## delta

'δ'

In [179]:
chr(0x3b5)  ## epsilon

'ε'

In [180]:
chr(0x3bb)  ## lambda

'λ'

In [181]:
chr(0x394)  ## DELTA

'Δ'

In [182]:
chr(0x395)  ## EPSILON

'Ε'

In [183]:
chr(0x39b) ## LAMBDA

'Λ'
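The same code points can be written inside a string literal with a <code>\u</code> escape followed by four hex digits. The sketch below checks that against <code>chr()</code> and shows that <code>ord()</code> and <code>chr()</code> are inverses.

```python
delta = '\u03b4'            # same character as chr(0x3b4)
print(delta == chr(0x3b4))  # True
print(hex(ord(delta)))      # 0x3b4

# For these Greek letters, the lowercase code is 0x20 above the capital.
print(ord('δ') - ord('Δ'))  # 32
print('δ'.upper() == 'Δ')   # True

# ord() and chr() are inverses of each other.
print(chr(ord('λ')))        # λ
```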

### encode and decode

We can convert a string into a sequence of bytes, using a specified encoding.  The Python string method <b>encode</b> has lots of options.  
See <a target=dkdk href="https://docs.python.org/3/library/stdtypes.html#str.encode">encode()</a> and the
<a target=ekek href="https://docs.python.org/3/library/codecs.html#standard-encodings">list of standard encodings</a>.

In [6]:
s = 'café'

In [7]:
len(s)

4

In [8]:
b = s.encode('utf8')

In [10]:
b

b'caf\xc3\xa9'

In [9]:
s.encode('UTF-8')

b'caf\xc3\xa9'

The above example is from Chapter 4 of Fluent Python.

The str 'café' has four Unicode characters.


Encode str to bytes using UTF-8 encoding.


bytes literals have a b prefix.


bytes b has five bytes (the code point for “é” is encoded as two bytes in UTF-8).



In [13]:
b.decode('utf8')

'café'

Decode bytes to str using UTF-8 encoding.
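The choice of encoding matters once a string contains non-ASCII characters. As a sketch, the example below repeats the round trip and then tries two other standard encodings; the <code>errors</code> argument is one of the encode options mentioned above.

```python
s = 'café'
b = s.encode('utf-8')
print(len(s), len(b))          # 4 5
print(b.decode('utf-8') == s)  # True

# ASCII has no code for 'é', so a strict encode raises an error.
try:
    s.encode('ascii')
except UnicodeEncodeError:
    print('cannot encode as ASCII')

# The errors argument selects a fallback instead of raising.
replaced = s.encode('ascii', errors='replace')
print(replaced)  # b'caf?'

# Latin-1 represents 'é' in a single byte.
latin = s.encode('latin-1')
print(latin)     # b'caf\xe9'
```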

See Fluent Python for more details.

End of strings notebook.
