CS 200: Strings in Python¶

This notebook mirrors the Google Python Course: Strings.

Video:¶

See Strings from Socratica.

str class¶

Python strings are instances of the str class. str(object) is a constructor which creates a new string from the given object.

In [66]:
str(123)
Out[66]:
'123'
In [67]:
str(3 + 7)
Out[67]:
'10'

Strings can be written with single quotes ('), double quotes ("), or triple quotes (''' or """) for strings that span multiple lines.

In [68]:
'hello world'
Out[68]:
'hello world'
In [69]:
"hello world"
Out[69]:
'hello world'
In [70]:
'''
wait for it:
hello world!
'''
Out[70]:
'\nwait for it:\nhello world!\n'

The '\n' character is "newline".

In [71]:
print('\nwait for it:\nhello world!\n')
wait for it:
hello world!

escape sequences¶

If you want to include a single or double quote inside a string, you can escape it with a backslash (\). If you want to include a backslash in a string, use two backslashes. See Python 3 escape sequences for more information.

In [72]:
'I can\'t wait!'
Out[72]:
"I can't wait!"
In [73]:
"I can't wait"
Out[73]:
"I can't wait"
In [74]:
'"Who are you?"'
Out[74]:
'"Who are you?"'
In [75]:
print("heaven \\ hell")
heaven \ hell
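
As a small sketch of the point above: an escape sequence takes two characters to type, but it stores a single character in the string. The variable names are illustrative only.

```python
# Each escape sequence below produces exactly one character.
newline = '\n'
backslash = '\\'
quote = '\''

print(len(newline), len(backslash), len(quote))

# Escaping is only needed when the quote matches the delimiter,
# so these two spellings produce the same string.
same = 'I can\'t wait!' == "I can't wait!"
print(same)
```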

string methods¶

The str class has a boatload of methods.

In [76]:
dir(str)
Out[76]:
['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isascii',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'removeprefix',
 'removesuffix',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']

Character Case¶

lower() and upper() convert a string to all lowercase or all uppercase, respectively.

Methods are invoked on instances of the object using the syntax

instance.method()

In [77]:
s = "   Hello World!  "
In [78]:
s
Out[78]:
'   Hello World!  '
In [79]:
s.lower()
Out[79]:
'   hello world!  '
In [80]:
s
Out[80]:
'   Hello World!  '
In [81]:
s.upper()
Out[81]:
'   HELLO WORLD!  '
In [82]:
s
Out[82]:
'   Hello World!  '

Note that the lower and upper methods are not destructive. Strings are immutable, so each method returns a new string and leaves the original unchanged. strip() returns a new string with leading and trailing whitespace removed.

In [83]:
s.strip()
Out[83]:
'Hello World!'
In [84]:
s
Out[84]:
'   Hello World!  '
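
The following sketch, using an illustrative variable named padded, suggests that strip() handles tabs and newlines as well as spaces, and that lstrip() and rstrip() trim one side only. None of them modify the original string.

```python
padded = '\t  Hello World!  \n'

trimmed = padded.strip()      # both ends trimmed
left_only = padded.lstrip()   # only leading whitespace removed
right_only = padded.rstrip()  # only trailing whitespace removed

# repr() makes the remaining whitespace visible.
print(repr(trimmed))
print(repr(left_only))
print(repr(right_only))
print(repr(padded))           # the original is unchanged
```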

Character types: alpha, digit, space¶

Three common character categories are alpha, digit, and space. There are methods to check whether every character in a given string belongs to one of these categories.

In [85]:
s.isalpha()
Out[85]:
False
In [86]:
'abcde'.isalpha()
Out[86]:
True
In [87]:
'123'.isdigit()
Out[87]:
True
In [88]:
'123.456'.isdigit()
Out[88]:
False
In [89]:
s.isspace()
Out[89]:
False
In [90]:
' \n\t\r\f'.isspace()
Out[90]:
True

Note that whitespace includes space, newline, tab, return, and formfeed. These characters do not put ink (or pixels) on the page.
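
A few edge cases are worth trying. This sketch, with made-up variable names, shows that these methods test every character in the string, and that they return False for the empty string.

```python
all_letters = 'abcde'.isalpha()            # every character is a letter
with_space = 'hello world'.isalpha()       # the space is not a letter
letters_and_digits = 'abc123'.isalnum()    # letters or digits are both fine
negative_number = '-5'.isdigit()           # the minus sign is not a digit
empty = ''.isalpha()                       # no characters at all

print(all_letters, with_space, letters_and_digits, negative_number, empty)
```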

A common string operation is to match the start, end, or middle of a string.

In [91]:
s = s.strip()
In [92]:
s
Out[92]:
'Hello World!'
In [93]:
s.startswith('Hello')
Out[93]:
True
In [94]:
s.endswith('World!')
Out[94]:
True
In [95]:
s.find(' ')
Out[95]:
5
In [96]:
s
Out[96]:
'Hello World!'
In [97]:
s.find('Q')
Out[97]:
-1
In [98]:
x = '  xxx  '
In [99]:
y = x
In [100]:
x = x.strip()
In [101]:
x
Out[101]:
'xxx'
In [102]:
y
Out[102]:
'  xxx  '
In [103]:
s
Out[103]:
'Hello World!'

find returns the index of the first match: 5 is the index of the sixth character of the string (the space). -1 indicates that the given substring is not in the string.
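
Besides find, there are two related tools: the in operator, which just answers yes or no, and the index method, which raises ValueError instead of returning -1. A minimal sketch, using an illustrative variable named text:

```python
text = 'Hello World!'

has_world = 'World' in text      # membership test, no position needed
position = text.find('World')    # index where the match starts
missing = text.find('Q')         # -1 when there is no match

# index() behaves like find(), but a miss raises ValueError.
try:
    text.index('Q')
    raised = False
except ValueError:
    raised = True

print(has_world, position, missing, raised)
```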

String indexing¶

Strings are indexed using zero-based indexing. That means that the first character of the string is string[0].

In [104]:
s[0]
Out[104]:
'H'

Positive integers index the string from left to right, starting with 0. Negative integers index the string from right to left, starting with -1.

   H    e    l    l    o  ' '    W    o    r    l    d    !
   0    1    2    3    4    5    6    7    8    9   10   11
 -12  -11  -10   -9   -8   -7   -6   -5   -4   -3   -2   -1
In [105]:
s[11]
Out[105]:
'!'
In [106]:
s[-1]
Out[106]:
'!'
In [107]:
s[-6]
Out[107]:
'W'
In [108]:
s[12]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Input In [108], in <cell line: 1>()
----> 1 s[12]

IndexError: string index out of range
In [109]:
len(s)
Out[109]:
12
In [110]:
len('')
Out[110]:
0
In [111]:
'x'[0]
Out[111]:
'x'
In [112]:
''[0]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Input In [112], in <cell line: 1>()
----> 1 ''[0]

IndexError: string index out of range

len() tells you how many characters are in a string or the length of the string. The empty string is of zero length.

If you try to index the string out of bounds, Python raises an IndexError.
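
One way to guard against the IndexError is to catch it. The helper below, char_at, is a hypothetical name invented for this sketch; it is not part of the str class.

```python
def char_at(text, index, default=''):
    """Return text[index], or default when index is out of range."""
    try:
        return text[index]
    except IndexError:
        return default

text = 'Hello World!'

print(char_at(text, 11))          # last valid index is len(text) - 1
print(repr(char_at(text, 12)))    # out of range, so the default is returned
print(char_at('', 0, None))       # the empty string has no index 0
```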

String slices¶

You can specify a subset of a string using the [start:end] notation, where start is inclusive, but end is not.

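If you leave off the start, the slice begins at the beginning of the string; if you leave off the end, it runs to the end of the string. Thus, [:] selects the entire string. For mutable sequences such as lists this produces a copy; for strings the result is simply equal to the original.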

In [113]:
s[1:4]
Out[113]:
'ell'
In [114]:
s[4:]
Out[114]:
'o World!'
In [115]:
s[:4]
Out[115]:
'Hell'
In [116]:
s[:]  ## full slice: the whole string (the usual copy idiom for lists)
Out[116]:
'Hello World!'
In [117]:
s[-6:-2]
Out[117]:
'Worl'
In [118]:
s[:-1]
Out[118]:
'Hello World'
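
Unlike indexing, slicing does not raise an error when the bounds fall outside the string; the slice is simply cut off at the ends. The sketch below, with illustrative names, also shows that text[:n] and text[n:] always add back up to the original.

```python
text = 'Hello World!'

beyond = text[6:100]     # end is clipped to the length of the string
nothing = text[20:30]    # entirely out of range gives the empty string

n = 5
rejoined = text[:n] + text[n:]   # the two halves cover the string exactly once

print(repr(beyond), repr(nothing), rejoined == text)
```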

id() gives the memory address¶

As noted above, the [:] slice selects the whole string. The id() function returns an integer that identifies an object; in CPython it is the object's memory address. Two names that report the same id refer to the same object.

In [123]:
s = "a string"
In [124]:
id(s)
Out[124]:
139683819578672
In [125]:
s2 = s
In [126]:
id(s2)
Out[126]:
139683819578672

s and s2 have the same address in memory. They are identical.

== compares values.

is compares identity (in CPython, memory addresses).

In [127]:
s == s2
Out[127]:
True
In [128]:
s is s2
Out[128]:
True

s and s2 have the same value and the same address in memory.

In [129]:
s3 = s[:]
In [130]:
s3
Out[130]:
'a string'
In [131]:
id(s3)
Out[131]:
139683819578672
In [132]:
s == s3
Out[132]:
True
In [133]:
s is s3
Out[133]:
True
In [134]:
id(s)
Out[134]:
139683819578672

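Note: s3 was made with a full slice, yet it has the same address as s. Because strings are immutable, CPython does not need a separate copy: a full slice s[:] returns the original object, and so does s + '' below. Separately, Python interns some strings (for example, short identifier-like literals) so that equal values share one object, but this is not guaranteed for every string. Use == to compare string values, not is.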

In [135]:
s = s +''
In [136]:
id(s)
Out[136]:
139683819578672
In [137]:
s is s3
Out[137]:
True
In [138]:
s = s + 'x'
In [139]:
id(s)
Out[139]:
139684083129072
In [140]:
s is s3
Out[140]:
False
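
Since the identity of equal strings depends on the interpreter, it is safer not to rely on it. The sketch below builds an equal string at run time and then uses sys.intern, a standard library function, to request a shared object explicitly. The variable names are illustrative.

```python
import sys

a = 'a string'
b = ''.join(['a', ' ', 'string'])   # same value, assembled at run time

# Value comparison is always safe.
print(a == b)

# Whether `a is b` holds at this point is an implementation detail,
# so this sketch does not depend on it.

# sys.intern returns the one shared object for a given string value.
a_interned = sys.intern(a)
b_interned = sys.intern(b)
print(a_interned is b_interned)
```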

extended string slices¶

You may use a third parameter to specify the step for the slice: string[start:end:step].

In [142]:
s
Out[142]:
'a stringx'
In [143]:
s[::2]  ## every other letter
Out[143]:
'asrnx'
In [144]:
s[::3] ## every third letter
Out[144]:
'atn'
In [145]:
s[::1]  ## every letter
Out[145]:
'a stringx'
In [146]:
s[::-1]   ### Reverses the string!
Out[146]:
'xgnirts a'
In [147]:
s == s[::-1]
Out[147]:
False
In [148]:
x = 'radar'
In [149]:
x == x[::-1]
Out[149]:
True

This is an easy way to check for palindromes.
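
The reversal check can be wrapped in a function. The version below is a sketch, not the only way to do it: is_palindrome is a name chosen here, and ignoring case and punctuation is an added assumption so that phrases can be tested too.

```python
def is_palindrome(text):
    """Return True if text reads the same forwards and backwards.

    Assumption for this sketch: case, spaces, and punctuation are ignored.
    """
    cleaned = ''.join(ch.lower() for ch in text if ch.isalnum())
    return cleaned == cleaned[::-1]

print(is_palindrome('radar'))
print(is_palindrome('A man, a plan, a canal: Panama'))
print(is_palindrome('a stringx'))
```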

string replace, split, and join

s.replace('old','new') - return a string in which every occurrence of "old" in s is replaced with "new".

s.split(delimiter) - return a list of the pieces of string s, using the given delimiter to partition the string.

s.join(list) - splice together the sequential elements of list, which must all be strings, using the string s as the glue.

In [150]:
s
Out[150]:
'a stringx'
In [151]:
s.replace('l','*')
Out[151]:
'a stringx'
In [152]:
s
Out[152]:
'a stringx'

Note: replace returns a new string. It does not change the original string.

In [153]:
'good boy'.replace('good','bad')
Out[153]:
'bad boy'
In [154]:
'good boy'.replace('good','bad').replace('boy','girl')
Out[154]:
'bad girl'

Note: the second replace is called on the string returned by the first replace.

In [155]:
'Hello World'.replace('l','d').replace('d','l')
Out[155]:
'Hello Worll'

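You need to be careful when doing sequential replaces: the second replace cannot tell the original d characters from the ones the first replace just produced.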
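
If the goal was to swap the two letters, one option is the translate method with a table built by str.maketrans, which maps every character in a single pass. This is a sketch of that alternative; the variable names are illustrative.

```python
# Map 'l' -> 'd' and 'd' -> 'l' at the same time.
swap = str.maketrans('ld', 'dl')
swapped = 'Hello World'.translate(swap)

# For comparison, the chained version from above.
chained = 'Hello World'.replace('l', 'd').replace('d', 'l')

print(swapped)
print(chained)
```

Because each character is looked up once in the table, a letter that has already been changed is not changed again.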

split()¶

In [156]:
r = 'Romeo and Juliet'.split()
In [157]:
r
Out[157]:
['Romeo', 'and', 'Juliet']
In [158]:
len(r)
Out[158]:
3

split returns a list, which we will cover later.

In [159]:
r2 = 'Romeo and Juliet'.split(' ')
In [160]:
r2
Out[160]:
['Romeo', 'and', 'Juliet']

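With no argument, split() splits on any run of whitespace (spaces, tabs, newlines) and ignores leading and trailing whitespace. For single-spaced text, that gives the same result as split(' ').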
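
The two forms differ once the text has extra or mixed whitespace. A small sketch, with an illustrative input string:

```python
text = ' Romeo  and\tJuliet '

# No argument: any run of whitespace separates the pieces.
by_whitespace = text.split()

# Explicit ' ': every single space is a separator, so empty strings
# appear, and the tab is not treated as a separator at all.
by_space = text.split(' ')

print(by_whitespace)
print(by_space)
```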

In [161]:
r3 = '203-555-1212'.split()
In [162]:
r3
Out[162]:
['203-555-1212']
In [163]:
r4 = '203-555-1212'.split('-')
In [164]:
r4
Out[164]:
['203', '555', '1212']
In [165]:
r3[0].replace('-','')
Out[165]:
'2035551212'

The phone number had no whitespace, so split() returned it as a single element (r3). Using a hyphen as the delimiter, we split it into three parts (r4).

join()¶

We can now glue the list elements back together with join.

In [166]:
' '.join(r2)
Out[166]:
'Romeo and Juliet'
In [167]:
'***'.join(r2)
Out[167]:
'Romeo***and***Juliet'
In [168]:
'----'.join(r4)
Out[168]:
'203----555----1212'
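
join only accepts strings, so a list of numbers has to be converted first. The sketch below uses an illustrative list named parts and also shows that split and join with the same delimiter undo each other.

```python
parts = [203, 555, 1212]

# Passing the integers directly raises TypeError.
try:
    '-'.join(parts)
    raised = False
except TypeError:
    raised = True

# Convert each element to a string first.
phone = '-'.join(str(p) for p in parts)

# Splitting and joining with the same delimiter gives back the original.
round_trip = '-'.join(phone.split('-'))

print(raised, phone, round_trip == phone)
```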

ASCII (and Unicode) characters¶

Characters of the basic Latin (Roman) alphabet are represented as numbers using the American Standard Code for Information Interchange (ASCII).

However, there are many thousands of other characters. Those are represented using Unicode, which includes ASCII as its first 128 code points.

Python has two functions, ord(character), and chr(number) which convert between characters and numeric codes.

In [169]:
ord('A')
Out[169]:
65
In [170]:
ord('B')
Out[170]:
66
In [171]:
ord('a')
Out[171]:
97
In [172]:
ord('b')
Out[172]:
98

Notice that ASCII is designed so that sorting words by their numerical values results in sorting alphabetically. However, upper case letters sort before lower case letters. In UNIX, the ls command for listing directories often reflects this property.
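
The same ordering shows up in Python's sorted function, which compares strings by code point. A sketch with an illustrative word list; passing key=str.lower is one common way to sort without regard to case.

```python
words = ['apple', 'Banana', 'cherry']

# Default order: 'B' (66) sorts before 'a' (97).
by_code = sorted(words)

# Compare lowercase versions instead, leaving the words themselves unchanged.
ignoring_case = sorted(words, key=str.lower)

print(by_code)
print(ignoring_case)
```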

In [173]:
chr(65)
Out[173]:
'A'
In [174]:
chr(97)
Out[174]:
'a'

We can specify Unicode characters using hexadecimal (base 16) notation.

In [175]:
0xA
Out[175]:
10
In [176]:
0x10
Out[176]:
16
In [177]:
0x3b4
Out[177]:
948
In [178]:
chr(0x3b4)  ## delta
Out[178]:
'δ'
In [179]:
chr(0x3b5)  ## epsilon
Out[179]:
'ε'
In [180]:
chr(0x3bb)  ## lambda
Out[180]:
'λ'
In [181]:
chr(0x394)  ## DELTA
Out[181]:
'Δ'
In [182]:
chr(0x395)  ## EPSILON
Out[182]:
'Ε'
In [183]:
chr(0x39b) ## LAMBDA
Out[183]:
'Λ'
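
The same code points can also be written inside a string literal with a \u escape followed by four hexadecimal digits. A brief sketch, with illustrative variable names:

```python
delta = '\u03b4'                 # same character as chr(0x3b4)

same = delta == chr(0x3b4)
code_hex = hex(ord(delta))       # back from character to hexadecimal text
capital = delta.upper()          # case methods work beyond ASCII too

print(delta, same, code_hex, capital)
```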

encode and decode¶

We can convert a string into a sequence of bytes (a bytes object), with a specified encoding. The Python string method encode has lots of options.
See encode() and the list of standard encodings (https://docs.python.org/3/library/codecs.html#standard-encodings).

In [6]:
s = 'café'
In [7]:
len(s)
Out[7]:
4
In [8]:
b = s.encode('utf8')
In [10]:
b
Out[10]:
b'caf\xc3\xa9'
In [9]:
s.encode('UTF-8')
Out[9]:
b'caf\xc3\xa9'

The above example is from Chapter 4 of Fluent Python.

The str 'café' has four Unicode characters.

Encode str to bytes using UTF-8 encoding.

bytes literals have a b prefix.

bytes b has five bytes (the code point for “é” is encoded as two bytes in UTF-8).

In [13]:
b.decode('utf8')
Out[13]:
'café'

Decode bytes to str using UTF-8 encoding.
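
The choice of encoding matters. The sketch below is an addition to the Fluent Python example: it encodes the same string with three standard encodings and shows the errors argument of encode. The variable names are illustrative.

```python
text = 'caf\u00e9'   # 'café', with the accented letter written as an escape

utf8 = text.encode('utf-8')       # the accented letter takes two bytes
latin1 = text.encode('latin-1')   # here it takes one byte

# ASCII has no code for the accented letter, so encoding fails by default.
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# errors='replace' substitutes '?' instead of raising.
replaced = text.encode('ascii', errors='replace')

print(len(utf8), len(latin1), ascii_failed, replaced)
```

Decoding must use the same encoding that produced the bytes; otherwise the result may be garbled or raise UnicodeDecodeError.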

See Fluent Python for more details.

End of strings notebook.
