Randomness

Jon Reades

Many things are surprisingly non-random…

One example is Benford’s Law, which has applications in data science and fraud detection: in many real-world data sets the leading digit is 1 about 30% of the time.
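Benford’s Law predicts that a leading digit d appears with probability log10(1 + 1/d). The sketch below compares that prediction with the leading digits of the first 1,000 powers of 2; that data set is an illustrative assumption and not part of the original slides.

```python
import math
from collections import Counter

# Benford's Law: expected share of values whose leading digit is d
expected = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Illustrative data set (an assumption): the first 1,000 powers of 2
leading = Counter(int(str(2 ** n)[0]) for n in range(1, 1001))
total = sum(leading.values())

for d in range(1, 10):
    print(f"{d}: expected {expected[d]:.3f}, observed {leading[d] / total:.3f}")
```

Data that has been invented or tampered with often fails to follow this pattern, which is why the law turns up in fraud detection.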

Reproducibility: Good or Bad?

Depends on the problem:

  • Banking and encryption?
  • Sampling and testing?
  • Reproducing research/documentation?

Not Very Good Encryption

Cypher  Output
ROT0    To be or not to be, That is the question
ROT1    Up cf ps opu up cf, Uibu jt uif rvftujpo
ROT2    Vq dg qt pqv vq dg, Vjcv ku vjg swguvkqp
ROT9    Cx kn xa wxc cx kn, Cqjc rb cqn zdnbcrxw

ROT is also known as the Caesar Cypher. Because the transformation is a simple fixed shift (A..Z+=x), there are only 25 possible keys to try, so decryption is easy. How can we make this harder?
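A minimal sketch of the ROT transformation, and of why it is easy to break. The function name rot is hypothetical and chosen for illustration only.

```python
import string

def rot(text: str, shift: int) -> str:
    """Shift each letter by `shift` places, wrapping Z back round to A."""
    shift %= 26
    lower = string.ascii_lowercase
    upper = string.ascii_uppercase
    table = str.maketrans(
        lower + upper,
        lower[shift:] + lower[:shift] + upper[shift:] + upper[:shift],
    )
    return text.translate(table)

message = "To be or not to be, That is the question"
secret = rot(message, 9)
print(secret)           # Cx kn xa wxc cx kn, Cqjc rb cqn zdnbcrxw
print(rot(secret, -9))  # To be or not to be, That is the question

# Brute force: there are so few shifts that we can simply try them all
for shift in range(1, 26):
    print(f"ROT{shift:<2} {rot(secret, -shift)}")
```

One of the 25 lines printed by the loop is readable English, so an attacker does not need to know the key at all.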

Python is Random

import random
random.randint(0,10)
random.randint(0,10)
random.randint(0,10)
random.randint(0,10)

See also: random.randrange, random.choice, random.sample, random.random, random.gauss, etc.
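A quick tour of the functions listed above. The colours list is made-up data for illustration, and the output will differ on every run.

```python
import random

colours = ["red", "green", "blue", "yellow"]  # illustrative data

print(random.randrange(0, 10, 2))  # an even integer from 0 to 8 (stop is excluded)
print(random.choice(colours))      # one item, picked uniformly
print(random.sample(colours, 2))   # two distinct items, without replacement
print(random.random())             # a float in [0.0, 1.0)
print(random.gauss(0, 1))          # a draw from a normal distribution (mean 0, sd 1)
```

Note that random.randint(0,10) can return 10, whereas random.randrange(0,10) cannot.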

And Repeat…

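import random

size = 10
results = [0] * size  # One counter for each possible value

# Draw 100,000 random integers and count how often each value turns up
tests = 100000
while tests > 0:
    results[random.randint(0,len(results)-1)] += 1
    tests -= 1

# Each count should be close to 10,000 (tests divided by size)
for i in range(0,len(results)):
    print(f"{i} -> {results[i]}")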

Aaaaaaaaaaand Repeat

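import random
from matplotlib import pyplot as plt
import numpy as np

size = 1000
data = [0] * size  # One counter for each possible value

# Ten million draws: this may take several seconds to run
tests = 10000000
while tests > 0:
    data[random.randint(0,len(data)-1)] += 1
    tests -= 1

# Plot the count for each value as a bar chart and save it
fig = plt.figure()
plt.bar(np.arange(0,len(data)), data)
fig.savefig('Random.png', dpi=150, transparent=True)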

Seeding Keys

Computers actually use pseudo-random number generators. If they are initialised with the same seed they will generate the same sequence.
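To see why a seed fixes the sequence, here is a toy linear congruential generator. It is a sketch for illustration only: the class name TinyLCG and its constants are assumptions, and it is not how Python's random module works internally (that uses the Mersenne Twister algorithm).

```python
class TinyLCG:
    """Toy linear congruential generator: state = (a * state + c) mod m."""

    def __init__(self, seed: int):
        self.state = seed

    def next_int(self, low: int, high: int) -> int:
        # Advance the state, then map it into low..high inclusive
        self.state = (1664525 * self.state + 1013904223) % 2**32
        return low + (self.state >> 16) % (high - low + 1)

a = TinyLCG(seed=42)
b = TinyLCG(seed=42)
c = TinyLCG(seed=43)

print([a.next_int(0, 10) for _ in range(5)])
print([b.next_int(0, 10) for _ in range(5)])  # identical to the line above
print([c.next_int(0, 10) for _ in range(5)])  # different seed, different sequence
```

Each number is computed from the previous state by plain arithmetic, so nothing here is random in the everyday sense: the seed determines everything that follows.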

Setting a Seed

Two main libraries where seeds are set:

import random
random.seed(42)

import numpy as np
np.random.seed(42)

Seeds and State

import random
random.seed(42)
st = random.getstate()
for r in range(0,3):
    random.setstate(st)
    print(f"Repetition {r}:")
    ints = []
    for i in range(0,10):
        ints.append(random.randint(0,10))
    print(f"\t{ints}")

Question!

Where would you use a mix of randomness and reproducibility as part of a data analysis process?
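One possible answer, sketched under assumptions: splitting data at random into training and testing sets, but with a seed so that the split can be reproduced later. The function split_sample and the stand-in data are hypothetical.

```python
import random

def split_sample(records, test_share=0.2, seed=42):
    """Shuffle a copy of `records` with a private, seeded generator and split it."""
    rng = random.Random(seed)  # separate from the global random state
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * test_share)
    return shuffled[cut:], shuffled[:cut]  # (training, testing)

records = list(range(100))  # stand-in for real observations
train_set, test_set = split_sample(records)
train_again, test_again = split_sample(records)

print(len(train_set), len(test_set))  # 80 20
print(test_set == test_again)         # True: same seed, same split
```

Using random.Random(seed) in place of random.seed() keeps the split reproducible without resetting the random state used by the rest of the program.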

Other Applications

Hashing

Checking for changes (usually in a security context).

import hashlib # Some hashlib functions accept a 'salt' (loosely similar to a 'seed')

r1 = hashlib.md5('CASA Intro to Programming'.encode())
print(f"The hashed equivalent of r1 is: {r1.hexdigest()}")

r2 = hashlib.md5('CASA Intro to Programming '.encode())
print(f"The hashed equivalent of r2 is: {r2.hexdigest()}")

r3 = hashlib.md5('CASA Intro to Programming'.encode())
print(f"The hashed equivalent of r3 is: {r3.hexdigest()}")

Outputs:

"The hashed equivalent of r1 is: acd601db5552408851070043947683ef"
"The hashed equivalent of r2 is: 4458e89e9eb806f1ac60acfdf45d85b6"
"The hashed equivalent of r3 is: acd601db5552408851070043947683ef"

And Note…

import hashlib
import requests

night = requests.get("http://www.gutenberg.org/ebooks/1514.txt.utf-8")
print(f"The text is {night.text[30:70]}")
print(f"The text is {len(night.text):,} characters long")
digest = hashlib.md5(night.text.encode())  # 'digest' avoids shadowing the built-in hash()
print(f"This can be hashed into: {digest.hexdigest()}")

Outputs:

"The text is A Midsummer Night's Dream by Shakespeare"
"The text is 112,127 characters long"
"This can be hashed into: cce0d35b8b2c4dafcbde3deb983fec0a"

JupyterLab Password

You may have noticed this in Docker:

'sha1:5b1c205a53e14e:0ce169b9834984347d62b20b9a82f6513355f72d'

How this was generated:

import uuid, hashlib
salt = uuid.uuid4().hex[:16] # Truncate salt
password = 'casa2021'        # Set password

# Here we combine the password and salt to 
# 'add complexity' to the hash
hashed_password = hashlib.sha1(password.encode() + 
                  salt.encode()).hexdigest()
print(':'.join(['sha1',salt,hashed_password]))
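Checking a password later reverses these steps: split the stored string, re-hash the attempt with the same salt, and compare. The sketch below assumes the 'algorithm:salt:hash' layout shown above; check_password is a hypothetical helper and the fixed salt is for illustration only.

```python
import hashlib
import hmac

def check_password(stored: str, attempt: str) -> bool:
    """Check `attempt` against a stored 'algorithm:salt:hash' string."""
    algorithm, salt, expected = stored.split(":")
    digest = hashlib.new(algorithm, attempt.encode() + salt.encode()).hexdigest()
    # compare_digest avoids leaking information through comparison timing
    return hmac.compare_digest(digest, expected)

# Build a stored value the same way as above, with a fixed illustrative salt
salt = "0123456789abcdef"
hashed = hashlib.sha1("casa2021".encode() + salt.encode()).hexdigest()
stored = ":".join(["sha1", salt, hashed])

print(check_password(stored, "casa2021"))  # True
print(check_password(stored, "casa2022"))  # False
```

The password itself is never stored, only the salt and the hash.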

Encryption & Security

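Hashing is not encryption: a hash is one-way, so the original input cannot be recovered from it. Simple hashing algorithms such as MD5 and SHA-1 are also no longer considered secure enough on their own for protecting passwords. Genuine security training takes a whole degree plus years of experience.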

Areas to look at if you get involved in applications:

  • Public and Private Key Encryption (esp. OpenSSL)
  • Privileges used by Applications (esp. Docker)
  • Revocable Tokens (e.g. for APIs)
  • Injection Attacks (esp. for SQL using NULL-byte and similar)
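As a taste of the last item, here is a minimal sketch of a SQL injection and the usual defence, using Python's built-in sqlite3 module. The table, its rows and the malicious input are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("alice", "admin"), ("bob", "student")])

user_input = "nobody' OR '1'='1"  # a classic injection attempt

# Unsafe: the input is pasted into the SQL and changes the query's meaning
unsafe = conn.execute(
    f"SELECT name FROM users WHERE name = '{user_input}'").fetchall()

# Safer: a '?' placeholder sends the input as data, never as SQL
safe = conn.execute(
    "SELECT name FROM users WHERE name = ?", (user_input,)).fetchall()

print(unsafe)  # both rows come back
print(safe)    # []
conn.close()
```

The rule of thumb is never to build SQL by joining strings that contain user input.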