We now have two more types of clustering to consider. Does the choice of algorithm matter? Consider the trade-offs:
Algorithm | Pros | Cons | Geographically Aware? |
---|---|---|---|
k-Means | Fast. Deterministic. | Every observation is forced into a cluster (no outliers). | N.
DBSCAN | Allows for clusters *and* outliers. | Slower. Choice of \(\epsilon\) is critical. Can end up with all outliers. | N, but implicit in \(\epsilon\).
OPTICS | Fewer parameters than DBSCAN. | Even slower. | N, but implicit in \(\epsilon\).
Hierarchical / HDBSCAN | Can cut at any number of clusters. | No ‘ideal’ solution. | Y, with a connectivity parameter.
ADBSCAN | Scales to large data. Provides confidence levels. | May need a large data set to be useful. Choice of \(\epsilon\) is critical. | Y.
Max-p | Returns coherent regions. | Very slow if the model is poorly specified. | Y.
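To make the first two rows concrete, here is a minimal sketch (assuming scikit-learn is available; the synthetic `coords` array is a stand-in for real point coordinates): k-Means assigns every observation to a cluster, while DBSCAN can also label low-density points as outliers.

```python
# Minimal sketch: k-Means vs. DBSCAN on the same synthetic 2-D points.
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

rng = np.random.default_rng(42)
# Two well-separated blobs stand in for real coordinates here.
coords = rng.normal(loc=[[0, 0]] * 100 + [[5, 5]] * 100, scale=0.5)

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(coords)
db = DBSCAN(eps=0.5, min_samples=5).fit(coords)

# k-Means assigns *every* observation to a cluster...
print("k-Means labels:", np.unique(km.labels_))
# ...whereas DBSCAN can leave low-density points unassigned (label -1).
print("DBSCAN labels:", np.unique(db.labels_))
print("DBSCAN outliers:", int(np.sum(db.labels_ == -1)))
```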
Many clustering algorithms rely on a distance specification (usually \(\epsilon\)). So how do we set this threshold?
n Dimensions | How to Set | Examples |
---|---|---|
2 or 3 | Theory / empirical data | Walking speed; commute distance.
2 or 3 | K/L measures | Plot with simulation envelopes (CIs) to identify significant ‘knees’.
3 | Marked point pattern? |
> 3 | kNN | Calculate the average kNN distance based on some expectation of connectivity.
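The kNN row is often operationalised with a ‘k-distance plot’: sort every point’s distance to its k-th nearest neighbour and look for the knee as a candidate \(\epsilon\). A minimal sketch, assuming scikit-learn and matplotlib, with the same kind of placeholder `coords` array as above:

```python
# Sketch of a k-distance plot for choosing a candidate epsilon for DBSCAN.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
coords = rng.normal(loc=[[0, 0]] * 100 + [[5, 5]] * 100, scale=0.5)  # placeholder points

k = 5  # tie this to min_samples / your expectation of connectivity
# k + 1 neighbours because each point's nearest neighbour is itself.
nn = NearestNeighbors(n_neighbors=k + 1).fit(coords)
distances, _ = nn.kneighbors(coords)

# Distance from each point to its k-th nearest (other) neighbour, sorted.
k_dist = np.sort(distances[:, -1])

plt.plot(k_dist)
plt.xlabel("Points sorted by k-distance")
plt.ylabel(f"Distance to {k}th nearest neighbour")
plt.title("Look for the 'knee' as a candidate epsilon")
plt.show()
```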
A specialist in consumer segmentation and geodemographics (bit.ly/2jMRhAW). Most retail companies will have their own segmentation scheme; competitors include CACI, Nielsen, etc. The OAC (Output Area Classification) was set up as an ‘open source’ alternative to Mosaic: