Mathematical Statistics in Common Lisp

kqr

Tags:

I have a book on statistics¹¹ Matematisk Statistik; Kerstin Vännman; Studentlitteratur; 1990. which I have inherited from my father²² Or my mother. Based on the handwriting, they have both made notes in the margin.. It is one of those old things that are very obviously written on a typewriter with blank spaces where they drew all weird symbols and graphs by hand.

This book is a really, really good intro to statistics. It is incredibly clear, systematic and well-illustrated. I have still not found a statistics text of comparable useability. At least not to someone like me, whose dictionary does not contain an entry for continuous.

Statistics is the only way you can make definite statements about anything which involves probability. Which is … everything, really. I want to learn it.

Tell me and I’ll forget; show me and I may remember; involve me and I’ll understand.

To involve myself, I’ll be implementing calculations in Common Lisp – a great language for fiddling around in, and doing exploratory programming.

In[1]:

(in-package :cl-user)
(defpackage :org.xkqr.dev.cl-stats
  (:use :cl))
(in-package :org.xkqr.dev.cl-stats)

Chapter 1: Random Variations

The book opens with a short description of what statistics is, and then mentions that

The question is not if a person in the industry will use statistics; the question is how well they will do it when they need to.

The basic format for dealing with statistical data is as a flat list of measurements, or observations. For now, we’ll just represent this as a list also in Lisp, but in order to retain flexibility for later, we’ll define a generic method for getting a list of measurements out of some object.

In[2]:

(defgeneric get-values (data)
  (:documentation "Get a list of values from a data object."))

The implementation of this method for lists is trivial; we just return the list we get.

In[3]:

(defmethod get-values ((data list))
  data)

Now that we can get the values of observations, let’s borrow some examples from the book!

Example 1.1: Circuit Boards

If more than 1 % of the circuit boards shipped by a supplier are defective, the supplier has to pay a fine. The circuit boards come in boxes of 10,000 each, and you don’t have the resources to test all of them. A friend suggests you set up a station where you unpack 200 circuit boards randomly from each box and test them. The number of broken circuit boards among the 200 tested from the first 80 boxes delivered are recorded in the following list.

In[4]:

(defparameter *ch1-circuit-boards*
  '(1 2 1 0 3 3 4 2 4 7 4 1 1 0 0 1 1 0 0 4
    1 2 2 2 2 2 2 5 2 2 3 5 1 2 2 4 0 1 4 1
    5 1 3 3 1 1 3 2 1 4 2 1 3 2 1 1 4 3 1 3
    5 2 2 4 1 3 3 0 0 1 2 4 3 2 0 3 1 1 1 1))

Example 1.3: Height of a Building

Ten people on the street are asked to estimate the height of an old building that is about to be demolished. Their responses are recorded into the following list.

In[5]:

(defparameter *ch1-building-height*
  '(13 13 12 11 14 12 15 13 11 13))

Descriptive Statistics

There are some important ways in which we can statistically describe a data set. We often need to talk about the sample size, which is simply how many values we have recorded. We would expect the sample size to equal the length of a list of recorded values.

Note that we don’t need this to be a generic method, from a programming perspective; it’s implemented only in terms of the already generic get-values. However, we make it generic for performance reasons: some types of data objects may have much more efficient ways of computing the sample size than converting them to a list and counting off data points. However, this default method will work for anything we can plug into get-values.

In[6]:

(defmethod sample-size (data)
  (length (get-values data)))

In order to see more clearly what we are doing, we’ll define a function to display the results of our operations.

In[7]:

(defmacro test-operation (operation &rest dataset)  
  (labels ((describe-operation (data)
             (list data operation
                   (funcall operation (symbol-value data)))))
    `(format t "~:{~a ~a: ~a~%~}"
             (quote ,(loop for data in dataset collect
                        (describe-operation data))))))

Using this, we can find out the sample sizes of the examples we’ve seen so far.

In[8]:

(test-operation sample-size
                *ch1-circuit-boards*
                *ch1-building-height*)

Out[1]:

*CH1-CIRCUIT-BOARDS* SAMPLE-SIZE: 80
*CH1-BUILDING-HEIGHT* SAMPLE-SIZE: 10

These results make sense, given that we checked 80 boxes of circuit boards, and asked 10 people about the height of the building.

Another common description of a data set is its arithmetic mean, which should be familiar to most. It is also what a lot of people mean when they say “average”, although it is just one way to measure the average.

In the general case, the arithmetic mean is computed as the total sum of the observed measurements divided by the number of measurements observed.

In[9]:

(defmethod arithmetic-mean (data)
  (/ (apply #'+ (get-values data))
     (length (get-values data))))

So what are the arithmetic means of our examples?

In[10]:

(test-operation arithmetic-mean
                *ch1-circuit-boards*
                *ch1-building-height*)

Out[2]:

*CH1-CIRCUIT-BOARDS* ARITHMETIC-MEAN: 21/10
*CH1-BUILDING-HEIGHT* ARITHMETIC-MEAN: 127/10

The reason this printout looks kind of funny is that Common Lisp has a type for fractional numbers, so we lose no precision even if the arithmetic mean would be, e.g., ⅓, which is hard to represent in both binary and decimal. If we wanted a more user-friendly printout with decimals, we could just format it that way when it’s time to display it:

In[11]:

(format t "~f" (arithmetic-mean *ch1-building-height*))

Out[3]:

12.7

In order to talk about the median, which is another type of average, I’d first like to modify the observations class to make some things more convenient to compute. I’d like the observations to be stored in sorted order.

The generic lessp method to compare objects of arbitrary type does not exist in the Common Lisp standard, so we’ll have to implement it for the types we deal with. To begin with, that’s only numbers.

In[12]:

(defmethod lessp ((a number) (b number))
  (< a b))

Then we’ll create a class representing a sorted list.

In[13]:

(defclass observations ()
  ((values
    :reader get-values
    :type list)))

(defun observe (data)
  "Constructor for observations class."
  (let ((observed (make-instance 'observations)))
    (setf (slot-value observed 'values)
          (sort data #'lessp))
    observed))

In[14]:

(defclass avl-tree ()
  ((key
    :initarg :node-key
    :reader node-key)
   (left
    :initarg :left-child
    :reader left-child
    :type (or null llrb-tree))
   (right
    :initarg :right-child
    :reader right-child
    :type (or null llrb-tree)))
  (:documentation
   "A node in an AVL tree."))

(defun insert (new-value tree)
  "Return a balanced TREE with NEW-VALUE inserted."

  (labels      
      ((tree-height (node)
         "Get height of tree rooted at NODE."
         (if (not node)
             0
             (1+ (max (tree-height (left-child node))
                      (tree-height (right-child node))))))

       (balance-factor (node)
         "Get balance factor of subtree rooted at NODE."
         (ecase (- (tree-height (right-child node))
                   (tree-height (left-child node)))
           (-2 :imbalanced-left)
           (-1 :left-heavy)
           ( 0 :balanced)
           (+1 :right-heavy)
           (+2 :imbalanced-right)))

       (rotate-left (tree)
         "Return TREE rotated left."
         (with-slots (key height left right) tree 
           (avl-node
            (node-key right) 
            (avl-node key left (left-child right))
            (right-child right))))

       (rotate-right (tree)
         "Return TREE rotated right."
         (with-slots (key height left right) tree 
           (avl-node
            (node-key left)
            (left-child left)
            (avl-node key (right-child left) right))))

       (make-node (key left right)
         "Binary tree node with KEY and children LEFT, RIGHT."
         (make-instance
          'avl-tree :node-key key
          :left-child left :right-child right))

       (avl-node (key &optional left right)
         "AVL tree node with KEY and children LEFT, RIGHT."
         (let ((node (make-node key left right))) 
           (ecase (balance-factor node)
             ((:left-heavy :balanced :right-heavy)
              node)

             (:imbalanced-left
              (ecase (balance-factor left)
                (:left-heavy
                 (rotate-right node))
                (:right-heavy
                 (avl-node key (rotate-left left) right))))

             (:imbalanced-right
              (ecase (balance-factor right)
                (:left-heavy
                 (avl-node key left (rotate-right right)))
                (:right-heavy
                 (rotate-left node))))))))

    (if (not tree)
        (avl-node new-value nil nil)
        (let* ((new-left (if (lessp new-value (node-key tree))
                             (insert new-value (left-child tree))
                             (left-child tree)))
               (new-right (if (lessp new-value (node-key tree))
                              (right-child tree)
                              (insert new-value (right-child tree)))))
          (avl-node (node-key tree)
                    new-left
                    new-right)))))

(defun fold (tree)
  (when tree
    (concatenate
     'list
     (fold (left-child tree))
     (list (node-key tree))
     (fold (right-child tree)))))

Out[4]:

OBSERVE

In[15]:

(loop
   for example
   in '(*ch1-circuit-boards*
        *ch1-building-height*)
   do 
     (setf example
           (observe (get-values (symbol-value example)))))

Out[5]:

NIL

In[16]:

(get-values *ch1-building-height*)

Out[6]:

(13 13 13 13 14 15)

descriptions of data sets
- sample size
- arithmetic mean
- median
- variationsbredd R (största - minsta)
- standardavvikelse s
- variansen s^2 (s^2 = 1/(n-1) sum((xi - xtak)^2))
frequency table for grouped material

Example 1.2: Isolated Chamber

A company has put Geiger-Múller apparatii³³ I know the plural of apparatus is still apparatus. on their employees that record the amount of radiation they receive when working in an isolated chamber. It measures the total number of pulses during five seconds, and the last 200 such counts are recorded in the following frequency table.

In[17]:

(defparameter *ch1-radiation-counts*
  '(3 19 32 44 35 21 23 11 8 3 1))

Out[7]:

*CH1-RADIATION-COUNTS*

That this is a frequency table means that the 0th item in the list shows you that three of the five-second intervals had no radiation detected at all! The 4th item in the list represents how many five-second intervals had four pulses detected, and so on.

relative frequency
cum. rel. freq.

klassindela ej grupperat material
typvärde (högst histogrampinne)
kvartilavstånd (variatoinsbredd på mittersta hälften)