In this lesson we’ll see how packages are useful in the context of tacking problems programmatically. The problem we will use here as an example is: download a data file that we know is hosted on a web site and then do some analysis of those data.
In this example, we are going to access and analyse a data set that we created: Cities from Wikipedia Data. Have a look at those data in your browser now by pasting the URL into a new tab. We’ll use the urlopen package to help us access the data.
What’s a Package?
We saw in a previous Code Camp Lesson how functions are a useful programming tool to enable us to re-use chunks of code. Basically, a function is a way to do something to something in a portable, easy-to-use little bundle of code.
However, some steps in a program are done so many times by so many people that, eventually, someone writes a package that bundles up those operations into something easy to use that saves you having to figure out the gory details.
Q: How is a package different from a function?
A: A package is just a bundle of useful functions.
There’s a little more to it than this, but simplest way to think of this is that when we type import foo then we are including functions from the foo package in our code. You’ll see how this works below.
The point of the bundle-of-functions is that they can help us to achieve quite a lot very quickly since we don’t need to reinvent the wheel and can just make use of someone else’s code. In the same way that we won’t mark you down for Googling the answer to a coding question, we won’t mark you down for using someone else’s package to help you get going with your programming. That’s the whole point!
Often, if you’re not sure where to start Google (or StackOverflow) is the place to go:
Q: But why have packages at all, why not just copy and paste all the functions that we need into our script directly?
A: Because it keeps our code ‘clean’ and allows us to have functions with the same name that do different things.
Imagine the following:
You want to write a program to read data from two different web sites (e.g.NOMIS and the London Data Store)
It turns out that both already have handy Python packages that help you to read the data if you give them the URL of the page to read
Even better, both packages have a simple function called read_data(...) that takes a URL and extracts the data that you want from that page as a CSV file, except…
Because the web sites are completely different the read_data(...) functions are specific to that one web site!
So, if both packages have functions callend read_data(), then after you have done:
import nomis_readerimport ldn_data_store_reader
How does Python know which read_data you want to use?
The answer is the namespace: you can only access the NOMIS read_data function by typing nomis.read_data(...) and you can only access the Data Store read_data function by typing ldn_data_store_reader.read_data(...).
Import… As…
The length and awkwardness of that last line is why you can write this:
import ldn_data_store_reader as ldn
Because the alias ("... as ldn") lets you type ldn.read_data(...) and Python will know exactly which function you mean!
Packages in Context
The first step to writing a program is thinking about your goal and the steps required to achieve that. We don’t write programs like we write essays: all at once by writing a whole lot of code and then hoping for the best when we hit ‘submit’.
When you’re tackling a programming problem you break it down into separate, simpler steps, and then tick them off one by one. Doing this gets easier as you become more familiar with programming, but it remains crucial and, in many cases, good programmers in large companies spend more time on design than they do on actual coding.
To a computer, reading data from a remote location (e.g. a web site halfway around the world) is not really any different from reading one that’s sitting on your your local hard drive (e.g. on your desktop). To simplify things a great deal: the computer really just needs to know the location of the file and an appropriate protocol for accessing that file (e.g. http, https, ftp, local…) and then a clever programming language like Python will typically have packages that can kind of take of the rest.
In all cases – local and remote – you use the package to handle the hard bit of knowing how to actually ‘read’ data (because all files are just 1s and 0s of data) at the device level and then Python gives you back a ‘file handle’ that helps you to achieve things like ‘read a line’ or ‘close an open file’. You can think of a filehandle as something that gives you a ‘grip’ on a file-like object no matter where or what it is, and the package is the way that this magic is achieved.
Let’s take a common problem:
download a data file that we know is hosted on a web site and then do some analysis of those data…
We need to break this seemingly hard problem down into something simpler and can do this by thinking about it as three separate steps:
First we want to read a remote file (e.g. a file somewhere the planet),
Then we want to turn it into a local data structure (e.g. a list or a dictionary),
Finally we want to perform some calculations on the data (e.g. calculate the mean).
We can tackle each of those problems in turn, getting the first bit working, then adding the second bit, etc. It’s just like using lego to build something: you take the same pieces and assemble them in different ways to produce different things.
Step 1: Reading a Remote File
So, as I said, in Step #1 we are going to download a file hosted on a remote web site at Cities from Wikipedia Data (this can also be stored as a bit-link to make it easier to copy+paste and avoid really long lines: http://bit.ly/2vrUFKi)
We aren’t going to to try to turn it into data or otherwise make ‘sense’ of it yet, we just want to get it. We are then going to build from this first step towards more substantial exercises and, eventually, you could easily request Megabytes of data in real-time according to flexibly-specified parameters!
Because we’re accessing data from a URL we will use the urlopenfunction from the urllib.requestpackage.
If you’re wondering how we know to use this function and package, you might google something like: read remote csv file python 3 which in turn might get you to a StackOverflow question and anwer like this.
Getting help with packages
Of course, just knowing that you need urlopen doesn’t necessarily help you to actually use it. In addition to finding example code on StackOverflow, you can also ask the package itself for help with dir and help.
dir
The ‘Dive into Python’ web site will tell you that “dir returns a list of the attributes and methods of any object”. That introduces yet another term (‘methods’) that we don’t want to get into right now, but everything in Python is an object and so dir will give you help with packages, variables, functions… you name it.
What dir gives you is information about things you can potentially do: it’s like navigating the menu of a web site – you aren’t yet looking at the information you need, you’re trying to figure out if the site even has what you need. So dir on a package will give you a list of the functions (and any variables) that the person who created the package has provided.
Typically, the information given by dir is highly abbreviated and is really just a prelude to using help.
help
The help function gives you the actual detail you need about how to use a particular function: what are the inputs, what are the outputs, and what will the function actually do ?
Both of these can be used on a package, a function, or a variable that you’ve created; for example:
import mathhelp(math.acos)
Help on built-in function acos in module math:
acos(x, /)
Return the arc cosine (measured in radians) of x.
The result is between 0 and pi.
Let’s see this in action!
from urllib.request import urlopenprint("dir(urlopen) returns:\n")print(dir(urlopen))
help(urlopen) returns:
Help on function urlopen in module urllib.request:
urlopen(url, data=None, timeout=<object object at 0x1047207d0>, *, cafile=None, capath=None, cadefault=False, context=None)
Open the URL url, which can be either a string or a Request object.
*data* must be an object specifying additional data to be sent to
the server, or None if no such data is needed. See Request for
details.
urllib.request module uses HTTP/1.1 and includes a "Connection:close"
header in its HTTP requests.
The optional *timeout* parameter specifies a timeout in seconds for
blocking operations like the connection attempt (if not specified, the
global default timeout setting will be used). This only works for HTTP,
HTTPS and FTP connections.
If *context* is specified, it must be a ssl.SSLContext instance describing
the various SSL options. See HTTPSConnection for more details.
The optional *cafile* and *capath* parameters specify a set of trusted CA
certificates for HTTPS requests. cafile should point to a single file
containing a bundle of CA certificates, whereas capath should point to a
directory of hashed certificate files. More information can be found in
ssl.SSLContext.load_verify_locations().
The *cadefault* parameter is ignored.
This function always returns an object which can work as a
context manager and has the properties url, headers, and status.
See urllib.response.addinfourl for more detail on these properties.
For HTTP and HTTPS URLs, this function returns a http.client.HTTPResponse
object slightly modified. In addition to the three new methods above, the
msg attribute contains the same information as the reason attribute ---
the reason phrase returned by the server --- instead of the response
headers as it is specified in the documentation for HTTPResponse.
For FTP, file, and data URLs and requests explicitly handled by legacy
URLopener and FancyURLopener classes, this function returns a
urllib.response.addinfourl object.
Note that None may be returned if no handler handles the request (though
the default installed global OpenerDirector uses UnknownHandler to ensure
this never happens).
In addition, if proxy settings are detected (for example, when a *_proxy
environment variable like http_proxy is set), ProxyHandler is default
installed and makes sure the requests are handled through the proxy.
None
A Challenge for you!
Fix the code to set the url variable, then use it with the urlopen function to open it.
from urllib.request import urlopenurl ="https://raw.githubusercontent.com/kingsgeocomp/geocomputation/master/CitiesWithWikipediaData-simple.csv"response = urlopen(url)datafile = response.read().decode('utf-8')print("datafile variable is of type: '"+ datafile.__class__.__name__+"'.\n")
datafile variable is of type: 'str'.
Note that the datafile variable is of type string (because we decoded it as such). If we hadn’t decoded it, the result would have been of type bytes which wouldn’t be as easy for us (humans) to work with.
Now that we’ve read our data as text, we can print it to check.
So this is definitely text, but even though it has been nicely formatted visually in the script, right now it’s actually not in a very convenient format to work with in our code. The datafilestr object is actually a single string of text that Python is interpreting as having line breaks. We can see this by printing the ‘raw’ object using the repr function:
Note the \n character that Python is using to determine when to print a new line for us to read nicely.
To split the text into individual lines ourselves (ready to work with in our code), we can use the handily named .splitlines() method (more on methods below):
url ="https://raw.githubusercontent.com/kingsgeocomp/geocomputation/master/CitiesWithWikipediaData-simple.csv"response = urlopen(url)datafile = response.read().decode('utf-8').splitlines()print("datafile variable is of type: '"+ datafile.__class__.__name__+"'.\n")
datafile variable is of type: 'list'.
Note now, that the data variable has type list. When we print this datafilelist object (without repr) the \n characters have gone and the list elements are split where those new line characters were previously (look carefully at where the ' are):
The last row should be 10,Sheffield,10,-163545.3257,7055177.403,685368.
If you’ve managed to get the code above to run and have received 11 rows of text in response to your urlopen query then, congratulations, you’ve now read a text file sitting on a server in, I think, Alberta, Canada and Python didn’t care.
Step 2: Turning Text into Data
We now need to work on turning the response we got to our urlopen request into useful data. You’ll notice that we are dealing with a CSV (Comma-Separated Value) file and that the format is quite simple since none of the rows have fields that themselves contain commas. So to turn this into data we just need to split the row into separate fields using the commas.
In the code below, dir('string') lists the available function for strings (because 'string' is itself a String; we could just as easily written dir('foo') or dir('supercalifragilisticexpialidocious') because ‘foo’ and ‘supercalifragilisticexpialidocious’ are also strings and so have the same functions available.
In the output below, the functions that start and end with __ are generally considered private, so you can skip over these and focus on the ones further down that are designed to be useful to programmers. Can you spot the method that is most likely to be useful?
A Brief Musical Interlude
Just in case you need help pronouncing supercalifragilisticexpialidocious:
Remember that you can find out what methods are supported by a string using dir(<string>):
I’m going to save you some time (this time!) and tell you that we’re interested in the split method. Why not use the help function to figure out how to make use of it?
help('supercalifragilisticexpialidocious'.split)
Help on built-in function split:
split(sep=None, maxsplit=-1) method of builtins.str instance
Return a list of the substrings in the string, using sep as the separator string.
sep
The separator used to split the string.
When set to None (the default value), will split on any whitespace
character (including \\n \\r \\t \\f and spaces) and will discard
empty strings from the result.
maxsplit
Maximum number of splits (starting from the left).
-1 (the default value) means no limit.
Note, str.split() is mainly useful for data that has been intentionally
delimited. With natural text that includes punctuation, consider using
the regular expression module.
Now, using the output of the help command, how would you use split to turn that word into a list like this: ['sup','rcalifragilisticexpialidocious']? If you replace the ??? with the right bits of code then running the block below will print out “You got it!”. You only need to change the ??? and nothing else!
OK, so you’ve tracked down the way to split a string using a delimiter and even how to limit the number of ‘words’ that come out of the split operation. And you already saw another of these methods above (i.e. splitlines). We work a lot with strings, so it’s handy to get to know the readily-available methods well.
Let’s test string splitting using our sample data (the last line of the ‘simple’ CSV file) to make sure it works the way we think it does… We want to turn the string below into a list like this:
Hopefully you can see that a is a list and b is a str (string). Because a is a list we can easily access each element. For example:
test = datafile[-1].split(',')print("The population of "+ test[1] +" is "+ test[-1])
The population of Sheffield is 685368
It is much more difficult to access the individual pieces of information from the string…
You can hopefully see how we’re breaking a complex problem down into a set of increments , each of which is a bit easier to write and understand.
Step 3: Example Analysis
Now that we understand why the list is a (basic) kind of data structure, we can do things with the list-of-lists:
total =0count =0for idx,row inenumerate(datafile): r = row.split(',')if(idx >0):print(f"Adding {r[-1]} to total, count is now {count}") total = total +int(r[-1]) count +=1mean = total / count print(f"The mean population of sampled UK cities is: {mean:,}")
Adding 9787426 to total, count is now 0
Adding 2553379 to total, count is now 1
Adding 2440986 to total, count is now 2
Adding 1777934 to total, count is now 3
Adding 1209143 to total, count is now 4
Adding 864122 to total, count is now 5
Adding 855569 to total, count is now 6
Adding 774891 to total, count is now 7
Adding 729977 to total, count is now 8
Adding 685368 to total, count is now 9
The mean population of sampled UK cities is: 2,167,879.5
Recap
This session about packages (just urllib in this case) also took in several other concepts: how to access Python’s built-in help; how to use built-in functions (aka. methods); and how to piece together what we have covered to create a simple data processing script that reads a remote file of UK cities and calculates the mean population of those cities. There’s lots more ground to cover, but we’re nearly at the end of the foundational concepts and from there it will be about adding complexity and sophistication, not fundamentally new ideas.