This notebook mirrors the Google Python Course: Strings.
See Strings from Socratica.
Python strings are instances of the str class.
str(object) is a constructor that creates a new string from the given object.
str(123)
'123'
str(3 + 7)
'10'
Strings can be written with single quotes ('), double quotes ("), or triple quotes (''' or """) for strings that span multiple lines.
'hello world'
'hello world'
"hello world"
'hello world'
'''
wait for it:
hello world!
'''
'\nwait for it:\nhello world!\n'
The '\n' character is "newline".
print('\nwait for it:\nhello world!\n')
wait for it:
hello world!
If you want to include a single or double quote inside a string, you can escape it with a backslash (\). If you want to include a backslash in a string, use two backslashes. See Python 3 escape sequences for more information.
'I can\'t wait!'
"I can't wait!"
"I can't wait"
"I can't wait"
'"Who are you?"'
'"Who are you?"'
print("heaven \\ hell")
heaven \ hell
The str class has a boatload of methods.
dir(str)
['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'capitalize', 'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum', 'isalpha', 'isascii', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'removeprefix', 'removesuffix', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']
lower() and upper() convert a string to all lowercase or all uppercase, respectively.
Methods are invoked on an instance of the class using the syntax instance.method()
s = " Hello World! "
s
' Hello World! '
s.lower()
' hello world! '
s
' Hello World! '
s.upper()
' HELLO WORLD! '
s
' Hello World! '
Note that the lower and upper methods are not destructive. They return a new string and do not change the original string.
strip() removes leading and trailing whitespace from a string. It also returns a new string.
s.strip()
'Hello World!'
s
' Hello World! '
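strip() has two one-sided relatives, lstrip() and rstrip(), and all three accept an optional argument listing the characters to remove. A minimal sketch (the variable names here are illustrative, not from the original notebook):

```python
padded = "  Hello World!  "

left_trimmed = padded.lstrip()    # removes leading whitespace only
right_trimmed = padded.rstrip()   # removes trailing whitespace only

# With an argument, strip() removes any of the listed characters from both ends.
unwrapped = "***note***".strip("*")

print(repr(left_trimmed), repr(right_trimmed), repr(unwrapped))
```

As with strip(), none of these calls modify the original string.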
Three common character categories are alpha, digit, and space. The isalpha(), isdigit(), and isspace() methods check whether every character of a string belongs to the given category.
s.isalpha()
False
'abcde'.isalpha()
True
'123'.isdigit()
True
'123.456'.isdigit()
False
s.isspace()
False
' \n\t\r\f'.isspace()
True
Note that whitespace includes space, newline, tab, return, and formfeed. These characters do not put ink (or pixels) on the page.
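Because these methods test the whole string, a mixed string such as ' Hello World! ' fails all three checks. One way to see what a string contains is to apply the checks character by character. The helper below, classify, is a hypothetical name used only for this sketch:

```python
def classify(text):
    """Count alphabetic, digit, whitespace, and other characters in text."""
    counts = {"alpha": 0, "digit": 0, "space": 0, "other": 0}
    for ch in text:
        if ch.isalpha():
            counts["alpha"] += 1
        elif ch.isdigit():
            counts["digit"] += 1
        elif ch.isspace():
            counts["space"] += 1
        else:
            counts["other"] += 1
    return counts

result = classify(" Hello World! ")
print(result)
```

For the sample string this reports ten letters, three whitespace characters, and one other character (the exclamation mark).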
A common string operation is to match the start, end, or middle of a string.
s = s.strip()
s
'Hello World!'
s.startswith('Hello')
True
s.endswith('World!')
True
s.find(' ')
5
s
'Hello World!'
s.find('Q')
-1
x = ' xxx '
y = x
x = x.strip()
x
'xxx'
y
' xxx '
s
'Hello World!'
find returns the index of the first match. Here 5 is the position of the sixth character, because indexing starts at 0. A result of -1 indicates that the given substring is not in the string.
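find() is not the only way to search. The sketch below compares it with the in operator, rfind(), and index(), all of which are standard str features. The variable names are illustrative only.

```python
s = "Hello World!"

has_space = " " in s       # membership test, returns a bool
first_o = s.find("o")      # index of the first 'o'
last_o = s.rfind("o")      # index of the last 'o'

# index() behaves like find(), but raises ValueError instead of returning -1.
try:
    s.index("Q")
    q_found = True
except ValueError:
    q_found = False

print(has_space, first_o, last_o, q_found)
```

Use in when you only need a yes/no answer, and find() or index() when you need the position.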
Strings are indexed using zero-based indexing. That means that the first character of the string is string[0].
s[0]
'H'
Positive integers index the string from left to right, starting with 0. Negative integers index the string from right to left, starting with -1.
H | e | l | l | o | (space) | W | o | r | l | d | ! |
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
-12 | -11 | -10 | -9 | -8 | -7 | -6 | -5 | -4 | -3 | -2 | -1 |
s[11]
'!'
s[-1]
'!'
s[-6]
'W'
s[12]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Input In [108], in <cell line: 1>()
----> 1 s[12]

IndexError: string index out of range
len(s)
12
len('')
0
'x'[0]
'x'
''[0]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Input In [112], in <cell line: 1>()
----> 1 ''[0]

IndexError: string index out of range
len() returns the number of characters in a string, that is, its length. The empty string has length zero.
If you try to index a string out of bounds, Python raises an IndexError.
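When an index might be out of range, you can either check it against len() first or catch the exception. The following is a minimal sketch of the second approach. char_at and its default parameter are hypothetical names introduced for illustration.

```python
def char_at(text, i, default=""):
    """Return text[i], or default when i is out of range."""
    try:
        return text[i]
    except IndexError:
        return default

s = "Hello World!"
print(repr(char_at(s, 11)))   # last valid index
print(repr(char_at(s, 12)))   # out of range, falls back to the default
print(repr(char_at("", 0)))   # the empty string has no valid index
```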
You can specify a subset of a string using the [start:end] notation, where start is inclusive, but end is not.
If you leave off the start or the end, the slice extends to the beginning or the end of the string. Thus, [:] gives you the entire string (conceptually, a copy of it).
s[1:4]
'ell'
s[4:]
'o World!'
s[:4]
'Hell'
s[:] ## useful python idiom for copying a string
'Hello World!'
s[-6:-2]
'Worl'
s[:-1]
'Hello World'
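Two properties of slices follow from the inclusive-start, exclusive-end rule and are worth seeing in code: s[:n] + s[n:] always reassembles the original string, and, unlike indexing, slicing never raises IndexError because out-of-range bounds are clamped. A short sketch with illustrative variable names:

```python
s = "Hello World!"
n = 5

head, tail = s[:n], s[n:]   # split at position n with no overlap and no gap
rejoined = head + tail

clamped = s[6:100]          # end is beyond the string, so it is clamped to len(s)
empty = s[20:]              # start is beyond the string, so the result is empty

print(repr(head), repr(tail), repr(clamped), repr(empty))
```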
As noted above, the [:] slice is the usual idiom for copying a sequence. The id() function returns an object's identity, which in CPython is its memory address. Two names with the same id refer to the same object.
s = "a string"
id(s)
139683819578672
s2 = s
id(s2)
139683819578672
s and s2 have the same address in memory. They are identical.
== compares values.
is compares identity (memory addresses).
s == s2
True
s is s2
True
s and s2 have the same value and the same address in memory.
s3 = s[:]
s3
'a string'
id(s3)
139683819578672
s == s3
True
s is s3
True
id(s)
139683819578672
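Note: even though s3 was produced by a slice, it has the same address as s. Because strings are immutable, CPython does not need to copy them: a full slice s[:] simply returns the original object. This is an implementation detail, not a language guarantee. Separately, Python may intern some strings (for example, short identifier-like literals) so that equal strings share one object. For these reasons, compare string values with == and do not rely on is.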
s = s +''
id(s)
139683819578672
s is s3
True
s = s + 'x'
id(s)
139684083129072
s is s3
False
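The difference between == and is shows up most clearly when a string is built at run time. The sketch below assumes CPython, and the identity result is deliberately not asserted, since it depends on the implementation.

```python
a = "a string"
b = "".join(["a", " ", "string"])   # same characters, assembled at run time

same_value = (a == b)    # compares contents
same_object = (a is b)   # compares identity; implementation-dependent

print(same_value, same_object)
```

The value comparison is always True here. The identity comparison is typically False in CPython, which is why is should not be used to test whether two strings are equal.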
You may use a third parameter to specify the step for the slice: string[start:end:step].
s
'a stringx'
s[::2] ## every other letter
'asrnx'
s[::3] ## every third letter
'atn'
s[::1] ## every letter
'a stringx'
s[::-1] ### Reverses the string!
'xgnirts a'
s == s[::-1]
False
x = 'radar'
x == x[::-1]
True
This is an easy way to check for palindromes.
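Real palindromes often contain capitals, spaces, and punctuation, which the plain comparison treats as mismatches. A minimal sketch that normalizes the text first, using only methods introduced earlier (is_palindrome is a hypothetical helper name):

```python
def is_palindrome(text):
    """Return True if text reads the same in both directions,
    ignoring case and any non-alphanumeric characters."""
    cleaned = "".join(ch.lower() for ch in text if ch.isalnum())
    return cleaned == cleaned[::-1]

print(is_palindrome("radar"))
print(is_palindrome("A man, a plan, a canal: Panama"))
print(is_palindrome("a stringx"))
```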
s.replace('old','new') - return a copy of s with every occurrence of "old" replaced with "new".
s.split(delimiter) - return a list of the pieces of string s, using the given delimiter to partition the string.
s.join(list) - splice together the sequential elements of list using the string s as the glue.
s
'a stringx'
s.replace('l','*')
'a stringx'
s
'a stringx'
Note: replace works on a copy. It does not change the original string.
'good boy'.replace('good','bad')
'bad boy'
'good boy'.replace('good','bad').replace('boy','girl')
'bad girl'
Note: the second replace is called on the string returned by the first replace. This is known as method chaining.
'Hello World'.replace('l','d').replace('d','l')
'Hello Worll'
You need to be careful when doing sequential replaces.
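The chained replace above fails to swap 'l' and 'd' because the second call cannot tell original characters from ones the first call produced. One way around this, sketched here with the standard str.maketrans() and translate() methods, is to map all characters in a single pass:

```python
# Build a translation table that maps 'l' -> 'd' and 'd' -> 'l' simultaneously.
swap_table = str.maketrans("ld", "dl")

swapped = "Hello World".translate(swap_table)
chained = "Hello World".replace("l", "d").replace("d", "l")

print(swapped)   # each character is translated exactly once
print(chained)   # the second replace undoes the first
```

translate() only handles single-character substitutions. For multi-character swaps a temporary placeholder or a regular expression would be needed.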
r = 'Romeo and Juliet'.split()
r
['Romeo', 'and', 'Juliet']
len(r)
3
split returns a list, which we will cover later.
r2 = 'Romeo and Juliet'.split(' ')
r2
['Romeo', 'and', 'Juliet']
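With no argument, split() splits on runs of whitespace (spaces, tabs, and newlines). Passing ' ' explicitly splits on each single space, which gives the same result for this string.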
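The two forms of split only agree when words are separated by exactly one space. A small sketch with a messier, made-up input shows the difference:

```python
text = "Romeo  and\tJuliet\n"    # two spaces, a tab, and a trailing newline

by_whitespace = text.split()     # any run of whitespace separates words
by_space = text.split(" ")       # every single space is a delimiter

print(by_whitespace)
print(by_space)
```

The explicit ' ' delimiter produces an empty string between the two consecutive spaces and leaves the tab and newline inside the last piece.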
r3 = '203-555-1212'.split()
r3
['203-555-1212']
r4 = '203-555-1212'.split('-')
r4
['203', '555', '1212']
r3[0].replace('-','')
'2035551212'
The phone number contains no whitespace, so the default split left it in one piece (r3). Using a hyphen as the delimiter splits the same number into three parts (r4).
We can now glue the list elements back together with join.
' '.join(r2)
'Romeo and Juliet'
'***'.join(r2)
'Romeo***and***Juliet'
'----'.join(r4)
'203----555----1212'
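split and join are often paired to change a separator. join also requires every element to be a string, so numbers have to be converted with str() first. A brief sketch (variable names are illustrative):

```python
parts = "203-555-1212".split("-")
dotted = ".".join(parts)                     # swap hyphens for dots

numbers = [203, 555, 1212]
rejoined = "-".join(str(n) for n in numbers) # convert each int before joining

# Joining non-string elements directly raises TypeError.
try:
    "-".join(numbers)
    join_failed = False
except TypeError:
    join_failed = True

print(dotted, rejoined, join_failed)
```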
Characters from the Roman alphabet are represented as numbers using the American Standard Code for Information Interchange (ASCII).
However, there are thousands of other characters. Those are represented using Unicode.
Python has two functions, ord(character) and chr(number), which convert between characters and numeric codes.
ord('A')
65
ord('B')
66
ord('a')
97
ord('b')
98
Notice that ASCII is designed so that sorting words by their numeric codes sorts them alphabetically. However, all uppercase letters sort before all lowercase letters. In UNIX, the ls command for listing directories often reflects this property.
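Python's sorted() compares strings by code point, so it shows the same uppercase-first behavior. Passing key=str.lower is one common way to get a case-insensitive order. The word list is made up for this sketch.

```python
words = ["apple", "Banana", "cherry"]

by_code = sorted(words)                   # 'B' (66) sorts before 'a' (97)
by_letter = sorted(words, key=str.lower)  # compare lowercase versions instead

print(by_code)
print(by_letter)
```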
chr(65)
'A'
chr(97)
'a'
We can specify Unicode characters using hexadecimal (base 16) notation.
0xA
10
0x10
16
0x3b4
948
chr(0x3b4) ## delta
'δ'
chr(0x3b5) ## epsilon
'ε'
chr(0x3bb) ## lambda
'λ'
chr(0x394) ## DELTA
'Δ'
chr(0x395) ## EPSILON
'Ε'
chr(0x39b) ## LAMBDA
'Λ'
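The same code points can be written directly inside a string literal with a \u escape, and the standard unicodedata module can report a character's official name. A short sketch:

```python
import unicodedata

delta = "\u03b4"                     # \u takes exactly four hex digits
same_as_chr = (delta == chr(0x3b4))  # both spell the same character

name = unicodedata.name(delta)       # official Unicode character name
code = hex(ord(delta))               # back from character to hex code

print(delta, same_as_chr, name, code)
```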
We can convert a string into a sequence of bytes with a specified encoding. The Python string method encode has several options.
See encode() and the list of standard encodings: https://docs.python.org/3/library/codecs.html#standard-encodings
s = 'café'
len(s)
4
b = s.encode('utf8')
b
b'caf\xc3\xa9'
s.encode('UTF-8')
b'caf\xc3\xa9'
The above example is from Chapter 4 of Fluent Python.
The str 'café' has four Unicode characters.
Encode str to bytes using UTF-8 encoding.
bytes literals have a b prefix.
bytes b has five bytes (the code point for “é” is encoded as two bytes in UTF-8).
b.decode('utf8')
'café'
Decode bytes to str using UTF-8 encoding.
See Fluent Python for more details.
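Encoding and decoding can fail or silently go wrong when the codec does not match the data. The sketch below uses the standard errors parameter of encode() and shows what happens when UTF-8 bytes are decoded with the wrong codec. The variable names are illustrative.

```python
s = "café"

# ASCII has no code for 'é', so a strict encode raises an exception.
try:
    s.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

replaced = s.encode("ascii", errors="replace")  # unencodable characters become '?'
ignored = s.encode("ascii", errors="ignore")    # unencodable characters are dropped

# Decoding UTF-8 bytes as Latin-1 does not fail, but yields the wrong text.
garbled = s.encode("utf-8").decode("latin-1")

print(ascii_failed, replaced, ignored, garbled)
```

The last case is the classic source of garbled text: each of the two UTF-8 bytes for 'é' is read as a separate Latin-1 character.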
End of strings notebook.