# Key Concepts

An Introduction to Lists and Dictionaries

In this notebook we are going to (briefly) look at two key concepts in
Python (lists and dictionaries) as well as the basics of something
called a package. This will hopefully help a lot with the content over
the next three days!

> **Connections**
>
> You will find links here to the Code Camp sessions on
> [Functions](https://jreades.github.io/code-camp/lessons/Functions.html)
> and
> [Packages](https://jreades.github.io/code-camp/lessons/Packages.html),
> as well as to this week’s lectures on
> [Functions](https://jreades.github.io/fsds/sessions/week3.html#lectures)
> and
> [Packages](https://jreades.github.io/fsds/sessions/week3.html#lectures).

## Lists

Like a list on your phone, a Python list is just an ordered collection
of ‘things’. They could be pretty much *any* thing. Groceries. Largest
Cities in the World. Most famous Indian actors. It doesn’t matter.

In [None]:
# A list is *created* using square brackets with items separated by commas
my_list = ['Apples','Bananas','Lentils','Cleaning supplies',4,'A new hoover']

# Basic info
print(f"Type of my_list is {type(my_list)}")
print(f"Length of my_list is {len(my_list)}")
print()

# The first item in the list
print(my_list[0])
print()

# The last item in the list
print(my_list[-1])
print() 

# Loop over the list
for i in my_list:
    print(i)
print()

# Slightly different loop
for i in range(0,len(my_list)):
    print(f"Item {i} is {my_list[i]}")

## Dictionaries

A dictionary (or ‘dict’ for short) in Python is *kind* of like a real
dictionary: you look up values using a word or other ‘key’!

In [None]:
# A dictionary is *created* using curly braces with key:value pairs separated by commas
my_dict = {'Apples':'Tasty','Bananas':'Tasty','Lentils':'Tasty','Cleaning supplies':'Not Tasty',4:'Not Tasty','A new hoover':'Not Tasty'}

# Basic info
print(f"Type of my_dict is {type(my_dict)}")
print(f"Size of my_dict is {len(my_dict)}")
print() 

# Look up the value stored under the key 'Apples'
print(my_dict['Apples'])
print()

# Look up the value stored under the key 'A new hoover'
print(my_dict['A new hoover'])
print() 

# Loop over the dictionary's keys
for k in my_dict.keys():
    print(f"{k} has value {my_dict[k]}")
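Some other common dictionary operations, sketched with a made-up `ratings` dictionary (the name and its contents are illustrative only):

```python
# Hypothetical example dictionary; keys and values are made up for illustration
ratings = {'Apples': 'Tasty', 'Lentils': 'Tasty', 'Cleaning supplies': 'Not Tasty'}

# Add a new key (or overwrite an existing one) by assignment
ratings['Bananas'] = 'Tasty'

# .items() yields (key, value) pairs, so there is no need to look each value up again
for item, rating in ratings.items():
    print(f"{item} has value {rating}")

# Looking up a missing key with [] raises a KeyError;
# .get() returns a default value instead
missing = ratings.get('A new hoover', 'Unknown')
print(missing)

# 'in' checks the keys, not the values
print('Apples' in ratings, 'Tasty' in ratings)
```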

## Packages

A package is just some code that someone has written and shared with the
rest of the world. Some of these are built into Python, some are ones
that we can install ourselves after installing Python. Here we make use
of *four* packages using variations of the `import` statement…

In [None]:
import urllib.request
from PIL import Image
from io import BytesIO
import matplotlib.pyplot as plt

url = "https://jreades.github.io/jaipur/lectures/img/Octocat.png"

# Get the data
response = urllib.request.urlopen(url)
image_data = response.read()

# Load the image
img = Image.open(BytesIO(image_data))

# Display the image
plt.imshow(img)
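The `import` variations used above can be tried out in isolation. This sketch sticks to modules from Python's standard library (chosen here purely for illustration) so that it runs without installing anything:

```python
# 1. Import a whole module and use its name as a prefix
import math
print(math.sqrt(16))

# 2. Import one name *from* a module so that no prefix is needed
from math import pi
print(round(pi, 2))

# 3. Import a module under a shorter alias (as with 'plt' above)
import statistics as st
print(st.mean([1, 2, 3, 4]))

# 4. Import a submodule using dotted notation (as with 'urllib.request')
import os.path
print(os.path.basename('/some/folder/file.csv'))
```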

## The Task

Our basic task is to read a CSV file from a server and turn it into
‘data’ that we can use. This might sound hard. It *is* hard when you’re
just starting out in programming. But it is *not* hard for a computer…
*iff* we can figure out what to tell it to do *and* make use of work
that other people have done for us!

### Break Down the Problem

#### Step 1. Analyse the Problem

We ***don’t*** write programs like we write essays: writing a whole lot
of code and then hoping for the best when we hit ‘run’. You want to
break it down into simple steps, and then tick them off one by one.
Doing this gets easier as you become more familiar with programming.

So for this problem we might *start* with:

-   [ ] Find the data
-   [ ] Download the data
-   [ ] Read the data
-   [ ] Load the data

We might or might not need all of these steps. Or some steps might be
easy, while others are hard! But now we can tackle each of those in
turn: get the first bit working, then add the second bit, etc. It’s just
like using Lego: you take the same pieces and assemble them in different
ways to produce different things.

#### Step 2. Functions & Packages

Some steps in a program are done so many times by so many people that,
eventually, someone writes a *package* that bundles up those operations
into something easy to use. Packages can help us to achieve quite a lot
very quickly since we just use someone else’s code. Often, if you’re not
sure where to start, Google (or StackOverflow) is the place to go:

[`how to read text file on web server python`](https://www.google.co.uk/search?q=how+to+read+text+file+on+web+server+python&oq=how+to+read+text+file+on+web+server+python&aqs=chrome..69i57.629j0j7&sourceid=chrome&ie=UTF-8)

Boom!

## Reading a Remote File

So, we are going to [download a
file](https://orca.casa.ucl.ac.uk/~jreades/jaipur/Wikipedia-Cities-simple.csv),
but we **aren’t going to do anything else**. This is step #1, then we
tackle the rest of the steps!

Because we’re accessing data from a ‘URL’ we need to use the `urlopen`
[function](https://docs.python.org/3.0/library/urllib.request.html?highlight=urlopen#urllib.request.urlopen)
from the `urllib.request`
[package](https://docs.python.org/3.0/library/urllib.request.html). If
you’re wondering how we know to use this function and package, you might
google something like: [read remote csv file
python3](https://www.google.com/search?q=read+remote+csv+file+python3)
which in turn might get you to a StackOverflow question and answer like
[this](https://stackoverflow.com/questions/36965864/opening-a-url-with-urllib-in-python-3).

``` python
from urllib.request import urlopen
help(urlopen)
```

    Help on function urlopen in module urllib.request:

    urlopen(url, data=None, timeout=<object object at 0x104c2c8c0>, *, cafile=None, capath=None, cadefault=False, context=None)
        Open the URL url, which can be either a string or a Request object.

        *data* must be an object specifying additional data to be sent to
        the server, or None if no such data is needed.  See Request for
        details.

        urllib.request module uses HTTP/1.1 and includes a "Connection:close"
        header in its HTTP requests.

        The optional *timeout* parameter specifies a timeout in seconds for
        blocking operations like the connection attempt (if not specified, the
        global default timeout setting will be used). This only works for HTTP,
        HTTPS and FTP connections.

        If *context* is specified, it must be a ssl.SSLContext instance describing
        the various SSL options. See HTTPSConnection for more details.

        The optional *cafile* and *capath* parameters specify a set of trusted CA
        certificates for HTTPS requests. cafile should point to a single file
        containing a bundle of CA certificates, whereas capath should point to a
        directory of hashed certificate files. More information can be found in
        ssl.SSLContext.load_verify_locations().

        The *cadefault* parameter is ignored.


        This function always returns an object which can work as a
        context manager and has the properties url, headers, and status.
        See urllib.response.addinfourl for more detail on these properties.

        For HTTP and HTTPS URLs, this function returns a http.client.HTTPResponse
        object slightly modified. In addition to the three new methods above, the
        msg attribute contains the same information as the reason attribute ---
        the reason phrase returned by the server --- instead of the response
        headers as it is specified in the documentation for HTTPResponse.

        For FTP, file, and data URLs and requests explicitly handled by legacy
        URLopener and FancyURLopener classes, this function returns a
        urllib.response.addinfourl object.

        Note that None may be returned if no handler handles the request (though
        the default installed global OpenerDirector uses UnknownHandler to ensure
        this never happens).

        In addition, if proxy settings are detected (for example, when a *_proxy
        environment variable like http_proxy is set), ProxyHandler is default
        installed and makes sure the requests are handled through the proxy.

As you can see, there is a *lot* of information here about how things
work. A *lot* of it won’t make much sense at the moment. That’s ok.
*Some* of this doesn’t make much sense to me, but that’s because this is
the *full* documentation trying to cover *all* the bases. You don’t need
to read every line of this; what you are looking for is information about
things like the ‘signature’ (what parameters the function accepts) and
its output. Of course, you can also *just Google it*!
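If the full `help(...)` output is too much, one option is to ask only for the signature. This sketch uses `inspect.signature` and `inspect.getdoc` from the standard library. The exact list of parameters you see depends on your Python version, so treat the printed output as indicative only:

```python
import inspect
from urllib.request import urlopen

# Just the signature: the parameters the function accepts and their defaults
sig = inspect.signature(urlopen)
print(sig)

# The parameter names on their own
param_names = list(sig.parameters)
print(param_names)

# The first line of the docstring is usually a one-sentence summary
summary = inspect.getdoc(urlopen).splitlines()[0]
print(summary)
```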

> **Tip**
>
> Remember that you can use `dir(...)` and `help(...)` to investigate
> what a package offers.

Before you start working on the code, why not open the data file
[directly in your
browser](https://orca.casa.ucl.ac.uk/~jreades/jaipur/Wikipedia-Cities-simple.csv)?
It‚Äôs pretty small, and it will give you a sense of what is going on.

In [None]:
from urllib.error import URLError
from urllib.request import urlopen

url = 'https://orca.casa.ucl.ac.uk/~jreades/jaipur/Wikipedia-Cities-simple.csv'

# Read the URL into variable called 'response'
# using the function that we imported above
try:
    response = urlopen(url)
except URLError as e:
    print("Unable to connect to URL!")
    print(e)
    raise # Stop here: without a response there is nothing to read

# Now read the raw bytes from the stream (we decode them into text below)
raw = response.read()

# You might want to explore what `__class__` and `__name__`
# are doing, but basically they give us a way of finding out what
# is 'behind' more complex variables

print(f"'raw' variable is of type: '{raw.__class__.__name__}'.")
print(f"Raw content is:\n{raw[:75]}...\n")

data = raw.decode('utf-8')

print(f"'data' variable is of type: '{data.__class__.__name__}'.")
print(f"Decoded content is:\n{data[:75]}...")

> **Note**
>
> Notice that the `raw` data has the format `b'...'` with all of the
> data seemingly on one line, while the *decoded* version in `data` is
> ‘correctly’ structured with lines! The ‘raw’ data is a sequence of
> *bytes* (Python’s `bytes` type) which is not, strictly, a string. It
> only becomes a string when we ‘decode’ it *from* `utf-8` (an
> ‘encoding’ of text that supports most human languages). While the
> computer doesn’t particularly care, we do!

Remember that you can treat strings *as lists*, so when we `print` below
we cut off the output using the `list[:<Some Number>]` syntax.
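The difference between bytes and text, and the slicing syntax, can be tried offline. The `raw` value in this sketch is a made-up stand-in for what `response.read()` returns, so the real file's contents will differ:

```python
# Made-up sample standing in for what response.read() returns;
# the real file's contents will differ
raw = b'City,Population\nExampleville,12345\n'

print(type(raw).__name__)   # bytes
print(raw[:10])             # slicing bytes gives bytes, shown as b'...'

# Decoding turns bytes into a string (text)
data = raw.decode('utf-8')
print(type(data).__name__)  # str
print(data[:10])            # slicing a string gives a string

# Encoding goes the other way: text back to bytes
print(data.encode('utf-8') == raw)
```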

In [None]:
print(f"There are {len(data)} characters in the data variable.")
print(f"The first 125 characters are: '{data[:125]}'") # Notice that each '\n' counts as a character here!

So this is definitely text, but it doesn’t (yet) look entirely like the
data *we* see because it’s still just one long string, and not *data*
which has individual records on each line. To split the text into
individual lines, we can use the handily named `.splitlines()` method
(more on methods below):

In [None]:
rows = data.splitlines()
print(f"'rows' variable is of type: '{rows.__class__.__name__}'.")

Note how the `rows` variable has type `list`. So to view the data as we
see them in the original online file, we could now use a `for` loop to
print out each element of the `list` (each element being a row of the
original online file) or, as here, *join* the first few elements back
together:

In [None]:
print(f"There are {len(rows)} rows of data.")
print("\n".join(rows[0:2])) # New syntax alert! notice we can *join* list elements

That’s a little hard to read, though something has clearly changed.
Let’s try printing the last row:

In [None]:
print(rows[-1])

**Congratulations!** You’ve now read a text file sitting on a server in,
I think, Canada and Python *didn’t care*. You’ve also converted a
plain-text file to a row-formatted list.

## Text into Data

We now need to work on turning the list into useful data. We got partway
there by splitting on line-breaks (`splitlines()`), but now we need to
get columns for each line. You’ll notice that we are dealing with a
*CSV* (Comma-Separated Value) file and that the format *looks* quite
simple… So, in theory, to turn this into data we ‘just’ need to *split*
each row into separate fields using the commas.

There’s a handy function associated with strings called `split`:

In [None]:
test = rows[-1].split(',')
print(test)
print(f"The population of {test[0]} is {int(test[1]):,}")

I’d say that we’re now getting quite close to something that looks like
‘real data’: I know how to convert a raw response from a web server into
a string, to split that string into rows, and can even access individual
elements from a row!

## The Advantages of a Package

There are two problems with the `data.splitlines()` and `row.split(',')`
approach! One of them is visible (though not obvious) in the examples
above, the other is not.

1.  `10` and `'10'` are *not* the same thing. To comma-format the
    population of Sheffield you’ll see that I had to do `int(...)` in
    order to turn `'685368'` into a number. So our approach so far
    doesn’t know anything about the *type* of data we’re working with.
2.  We are also implicitly *assuming* that commas can only appear at
    field boundaries (i.e. that they can only appear to separate one
    column of data from the next). In other words, just using
    `split(',')` doesn’t work if *any* of the fields can themselves
    contain a comma!
3.  There’s actually a *third* potential issue, but it’s so rare that we
    would need to take a completely different approach to deal with it:
    we are also assuming that newlines (`\n`) can only appear at record
    boundaries (i.e. that they can only appear to separate one row of
    data from the next). In those cases, using `splitlines()` also
    doesn’t work, but this situation is (thankfully) very rare indeed.

This is where using code that someone *else* who is much more interested
(and knowledgeable) has written and contributed is helpful: we don’t
need to think through how to deal with this sort of thing ourselves, we
can just find a library that does what we need and make use of *its*
functionality. I’ve given you the skeleton of the answer below, but
you’ll need to do a little Googling to find out how to
`"read csv python"`.

**Note:** For now just focus on problem #2.
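To make problem #2 concrete, here is a small sketch using a single made-up row (it is not taken from the data file) in which the first field is quoted because it contains a comma:

```python
import csv

# A made-up row: the first field is quoted because it contains a comma
line = '"Cardiff, Caerdydd",Wales,12345'

# Naive splitting breaks the quoted field in two
naive = line.split(',')
print(len(naive), naive)

# csv.reader accepts any iterable of lines and respects the quotes
parsed = next(csv.reader([line]))
print(len(parsed), parsed)
```

The naive split reports four fields where there should be three, while `csv.reader` keeps the quoted field in one piece.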

In [None]:
from urllib.request import urlopen
import csv

response = urlopen(url)
raw = response.read()

# Now take the raw data, decode it, and then
# pass it over to the CSV reader function
csvfile  = csv.reader(raw.decode('utf-8').splitlines()) 

urlData = [] # Somewhere to store the data
for row in csvfile:              
    urlData.append( row )

print("urlData has " + str(len(urlData)) + " rows and " + str(len(urlData[0])) + " columns.")
print(urlData[-1]) # Check it worked!

If it worked, then you should have this output:

To you that might look a lot *worse* than the data that you originally
had, but to a computer that list-of-lists is something it can work with;
check it out:

In [None]:
for u in urlData[1:6]: # For each row in the first 5 items in list
    print(f"The city of '{u[0]}' has a population of {int(u[1]):,}") # Print out the name and pop

> **Note**
>
> Why did I use `urlData[1:6]` rather than starting the slice at `0`?
>
> If you print `urlData[0]` you’ll see that this is the ‘header’ row
> that tells us what each column contains! So if we try to convert the
> column name to an integer (`int(u[1])`) we will get an error!
>
> The advantage of using the `csv` library over plain old `string.split`
> is that the csv library knows how to deal with fields that contain
> commas (*e.g.* `"Cardiff, Caerdydd"` or
> `"An Amazing 4 Bedroom Home, Central London, Sleeps 12"`) and so is
> much more flexible and consistent than our naive `split` approach.

Let’s try this with a ‘bigger’ data set… In an ideal world, the ‘power’
of code is that once we’ve solved the problem *once*, we’ve solved it
more generally as well. So let’s try with the ‘scaled-up’ data set and
see what happens!

In [None]:
from urllib.request import urlopen
import csv

url = "https://orca.casa.ucl.ac.uk/~jreades/jaipur/Wikipedia-Cities.csv"
response = urlopen(url)
raw = response.read()

csvfile = csv.reader(raw.decode('utf-8').splitlines())

urlData = [] # Somewhere to store the data

for row in csvfile:              
    urlData.append( row )

print(f"urlData has {len(urlData)} rows and {len(urlData[0])} columns.")

for u in urlData[70:]:  # For each row in the list
    print(f"The city of '{u[0]}' has a population of {u[1]}") # Print out the name and pop

> **What mistake have I made here?**
>
> I have assumed that, just because the files have similar names, they
> must also have similar layouts!
>
> ``` python
> print(f"The URL's data labels are: {', '.join(urlData[0])}")
> ```
>
>     The URL's data labels are: City, Region, Founded, Population, URL, Longitude, Latitude

## Insight!

So, although the code was basically the same for both of these files
(good), we would need to change quite a bit in order to print out the
*same* information from different versions of the *same data*. So our
code is rather **brittle**.

One of the issues is that our *instincts* about how to manage data
don’t align with how the computer can most *efficiently* manage it. We
make the mistake of thinking that the computer needs to do things the
same way that we do when reading text and so assume that we need to:

1.  Represent the rows as a list.
2.  Represent the columns as a list for each row.

This thinking suggests that the ‘right’ data structure would clearly be
a list-of-lists (LoLs!), but if you understand what happened here then
the next section will make a *lot* more sense!

## Why ‘Obvious’ is Not Always ‘Right’

> **🔗 Connections**
>
> This section builds on the material covered by the [DOLs to
> Data](https://jreades.github.io/fsds/sessions/week3.html#lectures)
> lecture.

> **Difficulty: Hard.**

But you need to be careful assuming that, just because something is hard
for you to read, it’s also hard for a computer to read! The way a
computer ‘thinks’ and the way that we think don’t always line up
naturally. Experienced programmers can think their way *around* a
problem by working *with* the computer, rather than against it.

Some issues to consider:

-   Is the first row of data *actually* data, or is it *about* data?
-   Do we really care about column *order*, or do we just care about
    being able to pick the *correct* column?

Let’s apply this approach to the parsing of our data…

### Understanding What’s an ‘Appropriate’ Data Structure

If you stop to think about it, then our list-of-lists approach to the
data isn’t very easy to navigate. Notice that if the position or name of
a column changes then we need to change our program *every* time we
re-run it! It’s not very easy to read *either* since we don’t really
know what `u[5]` is supposed to be. That way lies all kinds of potential
errors!
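One small step away from hard-coded positions, while still keeping a list-of-lists, is to look up each column's position by name in the header row. The `rows` data in this sketch is made up for illustration and does not come from the files above:

```python
# Made-up list-of-lists with a header row; values are illustrative only
rows = [
    ['City', 'Region', 'Population'],
    ['Exampleville', 'North', '12345'],
    ['Sampleton', 'South', '67890'],
]

header, records = rows[0], rows[1:]

# Find the column positions once, by name, instead of hard-coding numbers
name_col = header.index('City')
pop_col = header.index('Population')

for r in records:
    print(f"The city of '{r[name_col]}' has a population of {int(r[pop_col]):,}")
```

This survives a change in column *order*, but every access still goes through a row first.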

Also consider that, in order to calculate even a simple aggregate
such as the `sum` of a field for all rows we need to step through a lot
of irrelevant data as well: we have to write a `for` loop and then step
through each row with an ‘accumulator’ (somewhere to store the total).
That’s slow.

That doesn‚Äôt make much sense since this should all be *easier* and
*faster* in Python than in Excel, but right now it‚Äôs *harder*, and quite
possibly *slower* as well! So how does the experienced programmer get
around this? ‘Simple’ (i.e. neither simple, nor obvious, until you know
the answer): she realises that the data is organised the wrong way! We
humans tend to think in rows of data: this apartment has the following
*attributes* (price, location, etc.), or that city has the following
*attributes* (population, location). We read across the row because
that’s the easiest way for *us* to think about it. But, in short, a
list-of-lists does *not* seem to be the right way to store this data!

Crucially, a computer doesn’t have to work that way. For a computer,
it’s as easy to read *down* a column as it is to read *across* a row.
**In fact, it’s easier**, because each column has the same *type* of
data: one column contains names (strings), another column contains
prices (integers), and other columns contain other types of data
(floats, etc.). Better still, the order of the columns often doesn’t
matter as long as we know what the columns are called: it’s easier to
ask for the ‘description column’ than it is to ask for the 6th column
since, for all we know, the description column might be in a different
place for different files but they are all (relatively) likely to use
the ‘description’ label for the column itself.

### A Dictionary of Lists to the Rescue

So, if we don’t care about column order, only row order, then a
dictionary of lists would be a nice way to handle things. And why should
we care about column order? With our CSV files above we already saw what
a pain it was to fix things when the layout of the columns changed from
one data set to the next. If, instead, we can just reference the
‘description’ column then it doesn’t matter where that column actually
is. Why is that?

Well, here are the first four rows of data from a list-of-lists for city
sizes:

``` python
myData = [
  ['id', 'Name', 'Rank', 'Longitude', 'Latitude', 'Population'], 
  ['1', 'Greater London', '1', '-18162.92767', '6711153.709', '9787426'], 
  ['2', 'Greater Manchester', '2', '-251761.802', '7073067.458', '2553379'], 
  ['3', 'West Midlands', '3', '-210635.2396', '6878950.083', '2440986']
]
```

Now, here’s how it would look as a dictionary of lists organised by
*column*, and *not* by row:

In [None]:
myData = {
    'id'         : [0, 1, 2, 3, 4, 5],
    'Name'       : ['London', 'Manchester', 'Birmingham','Edinburgh','Inverness','Lerwick'],
    'Rank'       : [1, 2, 3, 4, 5, 6],
    'Longitude'  : [-0.128, -2.245, -1.903, -3.189, -4.223, -1.145],
    'Latitude'   : [51.507, 53.479, 52.480, 55.953, 57.478, 60.155],
    'Population' : [9787426, 2705000, 1141816, 901455, 70000, 6958],
}

print(myData['Name'])
print(myData['Population'])

What does this do better? Well, for starters, we know that everything in
the ‘Name’ column will be a string, and that everything in the
‘Longitude’ column is a float, while the ‘Population’ column contains
integers. So that’s made life easier already, but the real benefit is
coming up…
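Getting from a list-of-lists to a dictionary-of-lists takes only a couple of loops. This is a minimal sketch using made-up data. The names `rows`, `columns` and `converters` are illustrative and not part of the notebook:

```python
# Made-up list-of-lists with a header row; values are illustrative only
rows = [
    ['Name', 'Longitude', 'Population'],
    ['Exampleville', '-1.5', '12345'],
    ['Sampleton', '0.25', '67890'],
]

header, records = rows[0], rows[1:]

# Start with one empty list per column name...
columns = {}
for name in header:
    columns[name] = []

# ...then append each field to the list for its column
for record in records:
    for name, value in zip(header, record):
        columns[name].append(value)

# Everything read from text starts as a string, so convert column by column
converters = {'Longitude': float, 'Population': int}
for name, convert in converters.items():
    converted = []
    for value in columns[name]:
        converted.append(convert(value))
    columns[name] = converted

print(columns)
```

Because each column holds one type of data, the type conversion is done once per column rather than once per field access.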

### Behold the Power of the DoL

Now let’s look at what you can do with this… but first we need to import
one *more* package that you’re going to see a *lot* over the rest of
term: `numpy` (Numerical Python), which is used *so* much that most
people simply refer to it as `np`. This is a *huge* package in terms of
features, but right now we’re interested only in the basic arithmetic
functions: `mean`, `max`, and `min`.

> **We’ll step through most of these in detail below.**

Find the latitude of Manchester:

In [None]:
city = "Manchester"
lat = myData['Latitude'][ myData['Name'].index(city) ]
print(f"{city}'s latitude is {lat}")

Print the location of Lerwick:

In [None]:
city = "Lerwick"
print(f"The town of {city} can be found at " +
      f"{abs(myData['Longitude'][myData['Name'].index(city)])}ºW, {myData['Latitude'][myData['Name'].index(city)]}ºN")

Find the easternmost city:

In [None]:
city = myData['Name'][ myData['Longitude'].index( max(myData['Longitude']) ) ]
print(f"The easternmost city is: {city}")

Find the `mean` population of the cities using a handy package called
numpy:

In [None]:
import numpy as np
mean = np.mean(myData['Population'])
print(f"The mean population is: {mean}")

> **Warning**
>
> **Stop!** Look closely at what is going on. There’s a *lot* of content
> to process in the code above, so do *not* rush blindly on if this is
> confusing. Try pulling it apart into pieces and then reassemble it.
> Start with the bits that you understand and then *add* complexity.

We’ll go through each one in turn, but they nearly all work in the same
way and the really key thing is that you’ll notice that we no longer
have to write any loops ourselves (which are slow in Python), just
`index` or `np.<function>` (which are much faster).

### The Population of Manchester

The code can look pretty daunting, so let’s break it down into two
parts. What would you get if you ran just this code?

In [None]:
myData['Population'][1]

Remember that this is a dictionary-of-lists (DoL). So, Python first
looks for a key named `Population` in the myData dictionary. It finds
out that the value associated with this key is a *list* and in this
example, it just pulls out the second value (index `1`). Does **that
part** make sense?

Now, to the second part:

In [None]:
myData['Name'].index('Manchester')

Here we look in the dictionary for the key `Name` and find that that’s
*also* a list. All we’re doing here is asking Python to find the index
of ‘Manchester’ for us in that list. And
`myData['Name'].index('Manchester')` gives us back a `1`, so *instead*
of just writing `myData['Population'][1]` we can replace the `1` with
`myData['Name'].index('Manchester')`! Crucially, notice the complete
*absence* of a for loop?

Does that make sense? If it does then you should be having a kind of
🤯 moment because what we’ve done by taking a column view, rather than a
row view, is to make Python’s `index()` method do the work for us.
Instead of having to look through each row for a field that matches
‘Name’ and then check to see if it’s ‘Manchester’, we’ve pointed Python
at the right column immediately and asked it to find the match (which it
can do very quickly). Once we have a match then we *also* have the row
number to go and do the lookup in the ‘Population’ column because the
index *is* the row number!
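Since the same find-the-row-then-read-the-column pattern keeps coming up, it can be wrapped in a small function. This is only a sketch: `lookup` is a hypothetical helper name and `cities` is made-up data, neither of which appears in the notebook:

```python
# Made-up dictionary-of-lists; values are illustrative only
cities = {
    'Name':       ['Exampleville', 'Sampleton', 'Testford'],
    'Population': [12345, 67890, 555],
}

def lookup(data, name, column, key='Name'):
    """Return the value in `column` for the row whose `key` column equals `name`."""
    row = data[key].index(name)   # raises ValueError if name is not present
    return data[column][row]

print(lookup(cities, 'Sampleton', 'Population'))
```

Note that `.index()` returns the *first* match, so this assumes that the values in the key column are unique.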

### The Easternmost City

Where this approach really comes into its own is on problems that
involve maths. To figure out the easternmost city in this list we need
to find the *maximum* Longitude and then use *that* value to look up the
city name. So let’s do the same process of pulling this apart into two
steps. Let’s start with the easier bit:

In [None]:
myData['Name'][0]

That would give us the name of a city, but we don‚Äôt just want the first
city in the list, we want the one with the maximum longitude. To achieve
*that* we need to somehow replace the `0` with the ***index of the
maximum longitude***. Let’s break this down further:

1.  We first need to *find* the maximum longitude.
2.  We then need to *find* the **index** of that maximum longitude.

So Step 1 would be:

In [None]:
max_lon = max(myData['Longitude'])

Because the `max(...)` helps us to find the maximum longitude in the
Longitude list. Now that we have that we can proceed to Step 2:

In [None]:
myData['Longitude'].index(max_lon)

So now we ask Python to find the position of `max_lon` in the list. But
rather than doing this in two steps we can combine into one if we write
it down to make it easier to read:

In [None]:
myData['Longitude'].index(
    max(myData['Longitude'])
)

There’s the same `.index` which tells us that Python is going to look
for something in the list associated with the `Longitude` key. All we’ve
done is change what’s *inside* that index function to
`max(myData['Longitude'])`. This is telling Python to find the *maximum*
value in the `myData['Longitude']` list. So to explain this in three
steps, what we’re doing is:

-   Finding the maximum value in the Longitude column (we know there
    must be one, but we don’t know what it is!),
-   Finding the index (position) of that maximum value in the Longitude
    column (now that we know what the value is!),
-   Using that index to read a value out of the Name column.

I *am* a geek, but that’s pretty cool, right? In one line of code we
managed to quickly find out where the data we needed was even though it
involved three discrete steps. Think about how much work you’d have to
do if you were still thinking in *rows*, not *columns*!
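As an aside, the max-then-index approach walks the list twice: once to find the value and once to find its position. A variation, sketched here with made-up data, asks `max()` for the *position* directly. It is not necessarily better, just another way of assembling the same pieces:

```python
# Made-up columns; values are illustrative only
names = ['Exampleville', 'Sampleton', 'Testford']
lons = [-2.5, 0.75, -0.1]

# The approach from the text: find the value, then find where it is
idx_two_step = lons.index(max(lons))

# A variation: ask max() for the *position* whose longitude is largest
idx_one_step = max(range(len(lons)), key=lambda i: lons[i])

print(names[idx_two_step], names[idx_one_step])
```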

### The Location of Lerwick

Lerwick is a small town in [the Shetlands](https://www.shetland.org/),
way up to the North of mainland U.K. and somewhere I’ve wanted to go
ever since I got back from [Orkney](https://www.orkney.com/) – but then I
spent my honeymoon in the far North of
[Iceland](https://www.westfjords.is/), so perhaps I just don’t like
being around lots of people… 🙃

Anyway, this one *might* be a tiny bit easier conceptually than the
other problems, except that I’ve deliberately used a slightly different
way of showing the output that might be confusing:

Print the location of Lerwick:

In [None]:
city = "Lerwick"
print(f"The town of {city} can be found at " +
      f"{abs(myData['Longitude'][myData['Name'].index(city)])}ºW, {myData['Latitude'][myData['Name'].index(city)]}ºN")

The first thing to do is to pull apart the `print` statement: you can
see that this is actually just two ‘f-strings’ joined by a `+`. Because
both sit inside the parentheses of `print(...)`, Python knows that the
statement carries on to the next line. That’s a handy way to make your
code a little easier to read. If you’re creating a list and it’s getting
a little long, then you can break it across lines after any `,` inside
the square brackets in the same way!
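Continuing a statement across lines inside brackets can be sketched as follows (the variable names and values here are made up for illustration):

```python
city = "Exampleville"   # made-up values for illustration
lon, lat = -1.5, 52.25

# Inside the parentheses of print(...) a statement can span several lines
print(f"The town of {city} can be found at " +
      f"{abs(lon)}ºW, {lat}ºN")

# Adjacent string literals are joined automatically, so the + is optional here
message = (f"The town of {city} can be found at "
           f"{abs(lon)}ºW, {lat}ºN")

# The same works inside the square brackets of a list
towns = [
    'Exampleville',
    'Sampleton',
]
print(message, len(towns))
```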

#### The first f-string

The first string will help you to make sense of the second: f-strings
allow you to ‘interpolate’ a variable into a string directly rather than
having to have lots of `str(x) + " some text " + str(y)`. You can write
`f"{x} some text {y}"` and Python will automatically convert the
variables `x` and `y` to strings and replace `{x}` with the *value of
`x`* and `{y}` with the *value of `y`*.

So here `f"The town of {city} can be found at "` becomes
`"The town of Lerwick can be found at "` because `{city}` is replaced
by the value of the variable `city`. This makes for code that is easier
for humans to read and so I’d consider that a good thing.
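A side-by-side comparison of the two styles, using made-up values:

```python
city = "Exampleville"   # made-up values for illustration
population = 12345

# Concatenation: every non-string must be converted by hand
old_style = "The town of " + city + " has " + str(population) + " people."

# f-string: values are converted and dropped into place for you
new_style = f"The town of {city} has {population} people."

print(old_style == new_style)

# Any expression can go inside the braces, not just a variable name
print(f"Twice as many would be {population * 2}.")
```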

#### The second f-string

This one is hard because there’s just a *lot* of code there. But, again,
if we start with what we recognise then it gets just a little bit more
manageable… Also, it stands to reason that the only difference between
the two outputs is that one asks for the ‘Longitude’ and the other for
the ‘Latitude’. So if you can make sense of one you have *automatically*
made sense of the other and don’t need to work it all out.

Let’s start with a part that you might recognise:

In [None]:
myData['Name'].index(city)

You’ve *got* this. This is just asking Python to work out the index of
Lerwick (because `city = 'Lerwick'`). So it’s a number. 5 in this case.
And we can then think, ‘OK, so what does this return?’

In [None]:
myData['Longitude'][5]

And the answer is `-1.145`. That’s the Longitude of Lerwick! There’s
just *one* last thing: notice that we’re talking about degrees West
here. So the answer isn’t a negative (because negative West degrees
would be *East*!), it’s the *absolute* value. And that is the final
piece of the puzzle: `abs(...)` gives us the absolute value of a number!

In [None]:
help(abs)
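The examples above hard-code ‘W’ because every city in the data set has a negative longitude. A more general version would pick the letter from the sign. This is a minimal sketch in which `format_lon` is a hypothetical helper, not something defined in the notebook:

```python
def format_lon(lon):
    """Format a signed longitude as degrees East or West (hypothetical helper)."""
    hemisphere = 'W' if lon < 0 else 'E'
    return f"{abs(lon)}º{hemisphere}"

print(format_lon(-1.145))
print(format_lon(2.5))
```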

### The Average City Size

Here we’re going to ‘cheat’ a little bit: rather than writing our own
function, we’re going to import a package and use someone *else’s*
function. The `numpy` package contains a *lot* of useful functions that
we can call on (if you don’t believe me, add `dir(np)` on a new line
after the `import` statement), and one of them calculates the average of
a list or array of data.

In [None]:
print(f"The mean population is {np.mean(myData['Population'])}")

This is where our new approach really comes into its own: because all of
the population data is in one place (a.k.a. a *series* or column), we
can just throw the whole list into the `np.mean` function rather than
having to use all of those convoluted loops and counters. Simples,
right?

No, not *simple* at all, but we’ve come up with a way to *make* it
simple.
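
To make the contrast concrete, here is a small sketch of the 'long' way
and the 'short' way side by side. The `populations` list is made up for
illustration and is not taken from the data set.

```python
import numpy as np

# Invented population figures, used only for illustration
populations = [100, 250, 400, 1250]

# The 'long' way: a loop with a running total
total = 0
for p in populations:
    total += p
manual_mean = total / len(populations)

# The 'short' way: hand the whole list to numpy
numpy_mean = np.mean(populations)

print(manual_mean)  # 500.0
print(numpy_mean)   # 500.0
```

Both give the same answer, but the second version has no counter to
initialise and no loop to get wrong.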

### Recap!

So the *really* clever bit in all of this isn't switching from a
list-of-lists to a dictionary-of-lists, it's recognising that the
dictionary-of-lists is a *better* way to work *with* the data that we're
trying to analyse and that there are useful functions that we can
exploit to do the heavy lifting for us. Simply by changing the way that
we stored the data in a 'data structure' (i.e. a complex arrangement of
lists, dictionaries, and variables) we were able to do away with lots of
for loops and counters and conditions, and reduce many difficult
operations to something that could be done on one line!
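
As a rough illustration of that point, the sketch below stores the same
(invented) data both ways and finds the largest population with each.
The names `rows` and `cols` are just labels chosen for this example.

```python
# Invented data, for illustration only

# List-of-lists: one inner list per city (a 'row')
rows = [
    ['Northport', 100],
    ['Eastham',   250],
    ['Westby',    400],
]

# Dictionary-of-lists: one list per attribute (a 'column')
cols = {
    'Name':       ['Northport', 'Eastham', 'Westby'],
    'Population': [100, 250, 400],
}

# Largest population from the rows: loop and track the running maximum
largest = rows[0][1]
for row in rows:
    if row[1] > largest:
        largest = row[1]

# Largest population from the columns: one function call on one list
largest_col = max(cols['Population'])

print(largest, largest_col)  # 400 400
```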

## Appending a Column

Here's an example of where this approach comes into its own: let's calculate a z-score column based on the population of the cities in the data set.

### Calculate Mean

Let's start by calculating the sample mean (use Google:
`Python numpy mean...`):

In [None]:
import numpy as np
# Use a numpy function to calculate the mean
mean = np.mean(myData['Population'])
print(f"City distribution has a mean of {mean:,.0f}.")

### Calculate Standard Deviation

Now let's do the standard deviation:

In [None]:
# Use a numpy function to calculate the standard deviation
std  = np.std(myData['Population'])
print(f"City distribution has a standard deviation of {std:,.2f}.")

So the `numpy` package gives us a way to calculate the mean and standard
deviation *quickly* and without having to reinvent the wheel. The other
potentially new thing here is `{std:,.2f}`. This is about [string
formatting](https://www.w3schools.com/python/ref_string_format.asp) and
the main thing to recognise is that this means 'format this float with
commas separating the thousands/millions and 2 digits to the right of
the decimal point'. The link I've provided uses the slightly older
approach of `<str>.format()` but the formatting approach is the same.
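
Here is a quick sketch of a few format specifications applied to the
same number, so that you can see what each part does. The `value` is
arbitrary.

```python
value = 1234567.891  # an arbitrary number, for illustration

print(f"{value:,.2f}")          # 1,234,567.89  (commas, 2 decimal places)
print(f"{value:,.0f}")          # 1,234,568     (commas, no decimal places)
print(f"{value:.3f}")           # 1234567.891   (no commas, 3 decimal places)
print("{:,.2f}".format(value))  # 1,234,567.89  (same spec, older syntax)
```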

### For Loops Without For Loops

> **Difficulty level: Medium.**

Now we're going to see something called a **List Comprehension**.

In Python you will see code like this a lot: `[x for x in list]`. This
syntax is known as a 'list comprehension' and is basically a `for` loop
on one line with the output being assigned to a list. So we can apply an
operation (converting to a string, subtracting a value, etc.) to every
item in a list without writing out a full for loop.

Here's a quick example just to show you what's going on:

In [None]:
demo = range(0,10) # <- a *range* of numbers between 0 and 9 (stop at 10)
print([x**2 for x in demo]) # square every element of demo
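
If it helps, here is the same calculation written both ways. The names
`squares_loop` and `squares_comp` are just labels chosen for this
sketch.

```python
demo = range(0, 10)

# The for-loop version: create an empty list, then append to it
squares_loop = []
for x in demo:
    squares_loop.append(x**2)

# The list-comprehension version: the same steps on one line
squares_comp = [x**2 for x in demo]

print(squares_loop == squares_comp)  # True
print(squares_comp)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```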

Now let's apply this to our problem. We calculated the mean and
standard deviation above, so now we want to apply the z-score formula to
every element of the Population list... Remember that the formula for
the z-score (when dealing with a sample) is:

$$
z = \frac{x - \bar{x}}{s}
$$

And the population version (by which I mean the formula to use if you
are dealing with *all* the data, and not a subsample as we are here) is:

$$
z = \frac{x - \mu}{\sigma}
$$

In [None]:
rs = [(x - mean)/std for x in myData['Population']] # rs == result set
print([f"{x:.3f}" for x in rs])
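
One thing worth knowing here: by default `np.std` divides by *n*
(`ddof=0`), which is the *population* form of the standard deviation.
To get the *sample* form, which divides by *n - 1*, you can pass
`ddof=1`. The sketch below uses made-up numbers to show the difference.
Which one you want depends on whether you are treating the data as a
sample or as the whole population.

```python
import numpy as np

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up numbers, for illustration only

pop_std    = np.std(values)          # ddof=0 by default: divide by n
sample_std = np.std(values, ddof=1)  # ddof=1: divide by n - 1

print(pop_std)               # 2.0
print(sample_std > pop_std)  # True
```

With only a handful of values the gap between the two is noticeable, and
it shrinks as the number of observations grows.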

### Appending

> **Difficulty level: trivial**

And now let's add it to the data set:

In [None]:
myData['Std. Population'] = rs
print(myData['Std. Population'])

And just to show how everything is in a single data structure:

In [None]:
for c in myData['Name']:
    idx = myData['Name'].index(c)
    print(f"{c} has a population of {myData['Population'][idx]:,} and standardised score of {myData['Std. Population'][idx]:.3f}")
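
The loop above looks up the position of each name with `.index()`. That
works, but it searches the list from the start every time and it only
ever finds the *first* match if a name appears twice. As an alternative
sketch, Python's built-in `zip()` and `enumerate()` let you walk through
the lists in step. The `toy` dictionary is a hypothetical stand-in for
`myData` with invented values.

```python
# Hypothetical stand-in for myData, with invented values
toy = {
    'Name':       ['Northport', 'Eastham', 'Westby'],
    'Population': [100, 250, 400],
}

# zip() pairs up the items at the same position in each list
for name, pop in zip(toy['Name'], toy['Population']):
    print(f"{name} has a population of {pop:,}")

# enumerate() hands back the position alongside each item
for idx, name in enumerate(toy['Name']):
    print(f"{idx}: {name} -> {toy['Population'][idx]:,}")
```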

## What's It All Mean

Why have we done all of this today? Where are we going?

Well, here's another way:

In [None]:
import pandas as pd

df = pd.read_csv('https://orca.casa.ucl.ac.uk/~jreades/jaipur/Wikipedia-Cities.csv')

In [None]:
df.head()

In [None]:
print(f"The mean city population is {df['Population'].mean():0.2f}")

print(f"The westernmost city is {df[df['Longitude']==df['Longitude'].min()].City.iloc[0]}")
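
And here is the link back to everything above: a dictionary-of-lists can
be handed straight to `pandas`. This is a minimal sketch using a
hypothetical `toy` dictionary with invented values, not the real data
set. Note that, unlike `np.std`, the pandas `.std()` method defaults to
`ddof=1`, so `ddof=0` is passed explicitly here to match the numpy
default.

```python
import pandas as pd

# Hypothetical stand-in for myData, with invented values
toy = {
    'Name':       ['Northport', 'Eastham', 'Westby'],
    'Population': [100, 250, 400],
    'Longitude':  [0.75, 2.5, -1.25],
}

df_toy = pd.DataFrame(toy)  # each key becomes a column, each list its values

pop = df_toy['Population']
print(pop.mean())  # 250.0

# z-scores without a loop or a list comprehension
df_toy['Std. Population'] = (pop - pop.mean()) / pop.std(ddof=0)

# idxmin() returns the row label of the smallest value
print(df_toy.loc[df_toy['Longitude'].idxmin(), 'Name'])  # Westby
```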