Classification

Jon Reades

Spot the Difference

On Maps

  • Group observations by ‘class’.
  • Typically based on 1-D distribution.
  • Classes are assigned by user choice.

On Labels

  • Label observations by ‘class’.
  • Typically based on model outputs.
  • Labels are assigned by user feedback.

Map Classification Choices

  • Assign classes manually.
  • Split range evenly.
  • Split data evenly
  • Split data according to distribution
  • Split data according to their similarity to each other.

In Practice…

Mapclassify

Mapclassify (part of PySAL) provides a wide range of classifiers:

No Parameters k Parameter
BoxPlot UserDefined
StdMean Percentiles
MaxP Quantiles
HeadTailBreaks Natural Breaks
EqualInterval Maximum Breaks
JenksCaspall/Sampled/Forced
FisherJenks/Sampled

k will a user-specified number of classes or binning criterion.

Raw

User Defined

Interval Count
( -inf, 125000.00] 0
( 125000.00, 250000.00] 4
( 250000.00, 925000.00] 865
( 925000.00, 1500000.00] 85
(1500000.00, 4500000.00] 29

Box Plot

Interval Count
( -inf, -31429.25] 0
( -31429.25, 391267.00] 246
( 391267.00, 495010.00] 246
( 495010.00, 673064.50] 245
( 673064.50, 1095760.75] 175
(1095760.75, 4416659.00] 70

Standard Deviations

Interval Count
( -inf, -171366.63] 0
(-171366.63, 216174.43] 0
( 216174.43, 991256.55] 892
( 991256.55, 1378797.61] 53
(1378797.61, 4416659.00] 38

Max P

Interval Count
[ 226536.00, 346594.00] 142
( 346594.00, 461577.00] 279
( 461577.00, 529197.00] 140
( 529197.00, 530662.00] 3
( 530662.00, 613465.00] 115
( 613465.00, 842387.00] 167
( 842387.00, 4416659.00] 137

Head Tail Breaks

Interval Count
[ 226536.00, 603715.49] 670
( 603715.49, 976290.79] 218
( 976290.79, 1508985.73] 66
(1508985.73, 2257581.55] 16
(2257581.55, 2826007.08] 9
(2826007.08, 3553496.25] 3
(3553496.25, 4416659.00] 1

Equal Interval

Interval Count
[ 226536.00, 825125.00] 842
( 825125.00, 1423714.00] 108
(1423714.00, 2022303.00] 17
(2022303.00, 2620892.00] 10
(2620892.00, 3219481.00] 4
(3219481.00, 3818070.00] 1
(3818070.00, 4416659.00] 1

Quantiles

Interval Count
[ 226536.00, 346009.00] 140
( 346009.00, 405677.86] 140
( 405677.86, 461959.29] 140
( 461959.29, 529612.86] 141
( 529612.86, 639488.86] 140
( 639488.86, 827691.43] 140
( 827691.43, 4416659.00] 141

Natural Breaks

Interval Count
[ 226536.00, 433543.00] 356
( 433543.00, 605879.00] 316
( 605879.00, 842387.00] 174
( 842387.00, 1179615.00] 80
(1179615.00, 1866335.00] 39
(1866335.00, 2762387.00] 14
(2762387.00, 4416659.00] 4

Maximum Breaks

Interval Count
[ 226536.00, 1688895.00] 961
(1688895.00, 1926265.50] 4
(1926265.50, 2278155.50] 5
(2278155.50, 2929865.50] 9
(2929865.50, 3349991.00] 2
(3349991.00, 3959682.50] 1
(3959682.50, 4416659.00] 1

Fisher Jenks

Interval Count
[ 226536.00, 435961.00] 363
( 435961.00, 607480.00] 310
( 607480.00, 842387.00] 173
( 842387.00, 1179615.00] 80
(1179615.00, 1866335.00] 39
(1866335.00, 2762387.00] 14
(2762387.00, 4416659.00] 4

Jenks Caspall

Interval Count
[ 226536.00, 365741.00] 188
( 365741.00, 441979.00] 187
( 441979.00, 520791.00] 167
( 520791.00, 638474.00] 160
( 638474.00, 890055.00] 156
( 890055.00, 1626454.00] 103
(1626454.00, 4416659.00] 22

Summary

The choice of classification scheme should be data- and distribution-led. This is simply a demonstration of how different schemes can shape your understanding of the data.

Code (Useful Tips)

Setting up the classes:

kl = 7
cls = [mapclassify.BoxPlot, ...,  mapclassify.JenksCaspall]

Setting up the loop:

for cl in cls:
    try: 
        m = cl(ppd.Value, k=kl)
    except TypeError:
        m = cl(ppd.Value)
    
    f = plt.figure()
    gs = f.add_gridspec(nrows=2, ncols=1, height_ratios=[1,4])

    ax1 = f.add_subplot(gs[0,0])
    ...

    ax2 = f.add_subplot(gs[1,0])
    ...

Code (Useful Tips)

Setting up the distribution:

    ax1 = f.add_subplot(gs[0,0])
    sns.kdeplot(ppd.Value, ax=ax1, color='r')
    ax1.ticklabel_format(style='plain', axis='x') 

    y = ax1.get_ylim()[1]
    for b in m.bins:
        ax1.vlines(b, 0, y, linestyles='dotted')

Code (Useful Tips)

Adjusting the legend text:

def replace_legend_items(legend, mapping):
    for txt in legend.texts:
        for k,v in mapping.items():
            if txt.get_text() == str(k):
                txt.set_text(v)

Setting up the map:

    ax2 = f.add_subplot(gs[1,0])
    ppd.assign(cl=m.yb).plot(column='cl', k=len(m.bins), categorical=True, legend=True, ax=ax2)
    
    mapping = dict([(i,s) for i,s in enumerate(m.get_legend_classes())])
    ax2.set_axis_off()
    replace_legend_items(ax2.get_legend(), mapping)