There are a couple reasons I keep coming back to this problem. One is that it’s a great example of how to build a machine learning model using an optimization solver. Unless you have an optimization background, it’s probably not obvious you can do this. Building a regression or classification model with a solver directly is a great way to understand the model better. And you can customize it in interesting ways, like adding epsilon insensitivity.
Another is that least squares, while the most commonly used form of regression, has a fatal flaw: it isn’t robust to outliers in the input data. This is because least squares minimizes the sum of squared residuals, as shown in the formulation below. Here, $A$ is an $m \times n$ matrix of feature data, $b$ is a vector of observations to fit, and $x$ is a vector of coefficients the optimizer must find.
$$ \min f(x) = \Vert Ax-b \Vert^2 $$
Since the objective function minimizes squared residuals, outliers have a much bigger impact than other data. LAD (least absolute deviations) regression solves this by summing the absolute values of the residuals instead.
$$ \min f(x) = \Vert Ax-b \Vert_1 $$
So why isn’t this used more? Simple – least squares has a convenient analytical solution, while LAD requires an algorithm to solve. For instance, you can formulate LAD regression as a linear program, but now you need a solver.
$$ \begin{align*} \min \quad & \mathbf{1}^\top z \\ \text{s.t.}\ \quad & z \ge Ax - b \\ & z \ge b - Ax \end{align*} $$
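As a concrete sketch of that LP (my own illustration, not part of the original post), here it is with SciPy's `linprog`. The four-point data set with one wild outlier is made up.

```python
import numpy as np
from scipy.optimize import linprog

def lad_fit(A, b):
    """Fit least absolute deviations coefficients by linear programming."""
    m, n = A.shape
    # Variables are [x (n coefficients), z (m residual bounds)].
    c = np.concatenate([np.zeros(n), np.ones(m)])  # minimize 1'z
    # z >= Ax - b  becomes  Ax - z <= b
    # z >= b - Ax  becomes -Ax - z <= -b
    A_ub = np.block([[A, -np.eye(m)], [-A, -np.eye(m)]])
    b_ub = np.concatenate([b, -b])
    bounds = [(None, None)] * n + [(0, None)] * m  # x free, z >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:n]

# Three points on the line b = 2a, plus one wild outlier.
A = np.array([[1.0], [2.0], [3.0], [4.0]])
b = np.array([2.0, 4.0, 6.0, 100.0])
print(lad_fit(A, b))  # the slope stays at 2 despite the outlier
```

Run least squares on the same data and the outlier drags the slope far from 2; the LAD fit ignores it.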
While I like using this example, it paints a rather negative picture of squaring. If it does funny things to solvers, is there any good reason to square? Thus I’ve been on the lookout for a practical example where squaring a variable or expression makes a model more useful.
Luckily for me, Erwin Kalvelagen recently posted about using optimization to schedule team meetings. This is an application where minimizing squared values of overbooking can be beneficial – it may be worse to be triple booked than double booked.
I won’t recreate the reasoning behind Erwin’s post here. You can read his blog for that. What we’ll do is look at both the formulations in his post, along with a couple extras using Julia for code, JuMP for modeling, SCIP for optimization, and Gadfly for visualization. All model code and data are linked in the resources section at the end.
To start off, I built a new data set, which you can find in the resources section. This differentiates team membership between two types of employees: individual contributors (starting with `ic` in the data), who attend meetings for 1 or 2 teams, and managers (prefixed with `mgr`), who attend meetings to coordinate across multiple teams. We schedule meetings for 10 teams (prefix `t`) into 3 time slots (`s`).
The first model in Erwin’s post maximizes attendance. This means it tries to schedule team members for as many unique time slots as possible. It doesn’t consider overbooking.
$$ \begin{align*} \max\quad & \sum_{i,s} y_{i,s} \\ \text{s.t.}\quad& \sum_{s} x_{t,s} = 1 &\quad\forall&\ t & \text{schedule each team meeting once}\\ & y_{i,s} \le \sum_{t} m_{i,t}\ x_{t,s} &\quad\forall&\ i,s & \text{individuals attend team meetings}\\ & x_{t,s} \in \{0,1\} &\quad\forall&\ t,s\\ & y_{i,s} \in \{0,1\} &\quad\forall&\ i,s \end{align*} $$
This yields the following team schedule, with red representing a scheduled team meeting.
If we look at the manager schedules, we’ll see that every manager is completely booked. This makes sense. That’s what managers do, right? Go to meetings?
The model gets more interesting once we account for overbooking. Erwin’s post has a model that minimizes overbooking, where overbooking is the number of additional meetings in a time slot. If a team member is double booked, that’s 1 overbooking. If they are triple booked, that’s 2 overbookings.
The second model in Erwin’s post minimizes the sum of all overbookings. He does this by adding a continuous $c$ vector that only incurs value once a team member has more than one meeting in a given time slot.
$$ \begin{align*} \min\quad & \sum_{i,s} c_{i,s} \\ \text{s.t.}\quad& \sum_{s} x_{t,s} = 1 &\quad\forall&\ t & \text{schedule each team meeting once}\\ & c_{i,s} \ge \sum_{t} m_{i,t}\ x_{t,s} - 1 &\quad\forall&\ i,s & \text{measure overbooking}\\ & x_{t,s} \in \{0,1\} &\quad\forall&\ t,s\\ & c_{i,s} \ge 0 &\quad\forall&\ i,s \end{align*} $$
Given our data this results in the following team schedule, which is probably not all that interesting. I’ll leave this visualization out from now on.
Where it gets interesting is plotting overbookings for the managers. Here we see that 3 manager time slots are triple booked (red), while 8 are double booked (gray).
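Here's a sketch of that minimize-overbooking MILP using `scipy.optimize.milp` (the post itself uses Julia and JuMP; this Python version and its tiny two-person, three-team instance are my own illustration, not the post's data):

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def min_overbooking(m, n_slots):
    """Minimize total overbookings; m[i, t] = 1 if person i is on team t."""
    n_people, n_teams = m.shape
    nx = n_teams * n_slots   # x[t,s] binary, flattened as t * n_slots + s
    nc = n_people * n_slots  # c[i,s] >= 0, flattened after the x block

    obj = np.concatenate([np.zeros(nx), np.ones(nc)])  # min sum of c

    # Schedule each team meeting exactly once: sum_s x[t,s] = 1.
    A_eq = np.zeros((n_teams, nx + nc))
    for t in range(n_teams):
        A_eq[t, t * n_slots:(t + 1) * n_slots] = 1

    # Measure overbooking: sum_t m[i,t] x[t,s] - c[i,s] <= 1.
    A_ub = np.zeros((nc, nx + nc))
    for i in range(n_people):
        for s in range(n_slots):
            r = i * n_slots + s
            for t in range(n_teams):
                A_ub[r, t * n_slots + s] = m[i, t]
            A_ub[r, nx + r] = -1

    return milp(
        c=obj,
        constraints=[LinearConstraint(A_eq, 1, 1),
                     LinearConstraint(A_ub, -np.inf, 1)],
        integrality=np.concatenate([np.ones(nx), np.zeros(nc)]),
        bounds=Bounds(0, np.concatenate([np.ones(nx), np.full(nc, np.inf)])),
    )

# One person on all three teams, another on just one team, two slots.
res = min_overbooking(np.array([[1, 1, 1], [1, 0, 0]]), 2)
print(res.fun)  # optimal total overbooking; 1.0 here, since three meetings
                # must share two slots
```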
Let’s say it’s worse to triple book (or, gasp, quadruple book) than to double book. How can the model account for this? One answer, if you have a MIQP-enabled solver, is to simply square the $c$ values.
$$ \begin{align*} \min\quad & \sum_{i,s} c_{i,s}^2 \\ \text{s.t.}\quad& \sum_{s} x_{t,s} = 1 &\quad\forall&\ t & \text{schedule each team meeting once}\\ & c_{i,s} \ge \sum_{t} m_{i,t}\ x_{t,s} - 1 &\quad\forall&\ i,s & \text{measure overbooking}\\ & x_{t,s} \in \{0,1\} &\quad\forall&\ t,s\\ & c_{i,s} \ge 0 &\quad\forall&\ i,s \end{align*} $$
This completely eliminates triple booking, as shown below. No manager is worse off than being double booked, which seems normal given my experiences.
The problem with this is that the solver now takes a lot longer. It’s not bad for the data in this example, but if you try it with something larger you’ll see what I mean. You can find the data generator code in the resources section.
So how can we do something similar without the computational cost? One option is to continue using MILP formulations, but in the context of hierarchical optimization. This means splitting the model into two. First, we try to minimize the maximum overbookings for any team member (the bottleneck, if you will). This involves adding a variable $b$ representing that maximum.
$$ b = \max\Bigl\{\sum_{t} m_{i,t}\ x_{t,s} - 1 : i \in I, s \in S \Bigr\} $$
Now we can simply minimize $b$ using a MILP instead of a MIQP.
$$ \begin{align*} \min\quad & b \\ \text{s.t.}\quad& \sum_{s} x_{t,s} = 1 &\quad\forall&\ t & \text{schedule each team meeting once}\\ & b \ge \sum_{t} m_{i,t}\ x_{t,s} - 1 &\quad\forall&\ i,s & \text{maximum overbooking}\\ & x_{t,s} \in \{0,1\} &\quad\forall&\ t,s \end{align*} $$
Once we solve the first model, we get the minimal value of $b$, which we call $b^*$. We can then simply use $b^*$ as an upper bound on overbookings in the original model.
$$ \begin{align*} \min\quad & \sum_{i,s} c_{i,s} \\ \text{s.t.}\quad& \sum_{s} x_{t,s} = 1 &\quad\forall&\ t & \text{schedule each team meeting once}\\ & c_{i,s} \ge \sum_{t} m_{i,t}\ x_{t,s} - 1 &\quad\forall&\ i,s & \text{measure overbooking}\\ & x_{t,s} \in \{0,1\} &\quad\forall&\ t,s\\ & 0 \le c_{i,s} \le b^* &\quad\forall&\ i,s \end{align*} $$
As we see below, this model also eliminates triple bookings, and it’s quite a bit faster to solve than the MIQP.
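To make the two-stage idea concrete, here's a sketch with `scipy.optimize.milp` on a tiny made-up instance (the data, the `team_once`/`booking` helpers, and the variable layout are mine, not from the post):

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

# Membership: person 0 sits on all four teams, person 1 on the first two.
m = np.array([[1, 1, 1, 1],
              [1, 1, 0, 0]])
n_people, n_teams = m.shape
n_slots = 2
nx = n_teams * n_slots  # x[t,s] binary, flattened as t * n_slots + s

def team_once(n_extra):
    # Schedule each team meeting exactly once: sum_s x[t,s] = 1.
    A = np.zeros((n_teams, nx + n_extra))
    for t in range(n_teams):
        A[t, t * n_slots:(t + 1) * n_slots] = 1
    return LinearConstraint(A, 1, 1)

def booking(n_extra, col_of_row):
    # Overbooking rows: sum_t m[i,t] x[t,s] - v <= 1, with v at col_of_row(r).
    A = np.zeros((n_people * n_slots, nx + n_extra))
    for i in range(n_people):
        for s in range(n_slots):
            r = i * n_slots + s
            for t in range(n_teams):
                A[r, t * n_slots + s] = m[i, t]
            A[r, col_of_row(r)] = -1
    return LinearConstraint(A, -np.inf, 1)

# Stage 1: minimize the bottleneck b, a single extra variable at index nx.
res1 = milp(
    c=np.concatenate([np.zeros(nx), [1.0]]),
    constraints=[team_once(1), booking(1, lambda r: nx)],
    integrality=np.concatenate([np.ones(nx), [0.0]]),
    bounds=Bounds(0, np.concatenate([np.ones(nx), [np.inf]])),
)
b_star = round(res1.fun)  # overbookings are integral; round off solver tolerance

# Stage 2: minimize total overbooking with every c[i,s] capped at b*.
nc = n_people * n_slots
res2 = milp(
    c=np.concatenate([np.zeros(nx), np.ones(nc)]),
    constraints=[team_once(nc), booking(nc, lambda r: nx + r)],
    integrality=np.concatenate([np.ones(nx), np.zeros(nc)]),
    bounds=Bounds(0, np.concatenate([np.ones(nx), np.full(nc, float(b_star))])),
)
print(b_star, res2.fun)  # bottleneck and total overbooking
```

On this instance the bottleneck is 1 (four meetings must split 2+2 across two slots), and the stage-2 total is 2.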
- `main.go` generates input data
- `membership.csv` contains input data
- `maximize-attendance.jl` MILP model
- `minimize-overbooking.jl` MILP model
- `minimize-overbooking-squared.jl` MIQP model
- `minimize-bottleneck.jl` hierarchical MILP models

A feature I’d love for Hop is the ability to visualize DDs and monitor the search. That could work interactively, like Gecode’s GIST, or passively during the search process. This requires automatic generation of images representing potentially large diagrams. So I spent a few hours looking at graph rendering options for DDs.
We’ll start with examples of visualizations built by hand. These form a good standard for how we want DDs to look if we automate rendering. First we’ll look at some examples from academic literature, then at some we’ve used in Nextmv presentations, and finally at an interesting example that embeds in Hugo, the popular static site generator I use for this blog.
All the literature on using Decision Diagrams (DD) for optimization that I’m aware of depicts DDs as top-down, layered, directed graphs (digraphs). Some of the diagrams we come across appear to be coded and rendered, while some are fussily created by hand with a diagramming tool.
I believe most of the examples we find in academic literature are coded by hand and rendered using the LaTeX TikZ package. Below is one of the first diagrams that newcomers to DDs encounter. It’s from Decision Diagrams for Optimization by Bergman et al., 2016.
It doesn’t matter here what model this represents. It’s a Binary Decision Diagram (BDD), which means that each variable can be $0$ or $1$. The BDD on the left is exact, while the BDD on the right is a relaxed version of the same.
There’s quite a bit going on, so it’s worth an explanation. Let’s look at the “exact” BDD on the left first.
The “relaxed” BDD on the right overapproximates both the objective value and the set of feasible solutions of the exact BDD on the left.
Here’s another example of an exact BDD from the same book.
In this diagram, each node has a state. For example, the state of $r$ is $\{1,2,3,4,5\}$. If we start at the root node $r$ and assign $x_1 = 0$, we end up at node $u_1$ with state $\{2,3,4,5\}$.
Most other academic literature about DDs uses images similar to these.
We’ve rendered a number of DDs over the years at Nextmv. Most of these images demonstrate a concept instead of a particular model. We usually create them by hand in a diagramming tool like Whimsical, Lucidchart, or Excalidraw. I built the diagrams below by hand in Whimsical. I think the result is nice, if time consuming and fussy.
This is a representation of an exact DD. It doesn’t indicate whether this is a BDD or a Multivalued Decision Diagram (MDD). It doesn’t have any labels or variable names. It just shows what a DD search might look like in the abstract.
The restricted DD below is more involved. In addition to horizontal layers, it divides nodes into explored and deferred groups. Most of the examples I’ve seen mix different types of nodes, like exact and relaxed. I really like differentiating node types like this.
In this representation, deferred nodes are in Hop’s queue for later exploration. Thus they don’t connect to any child nodes yet. This is the kind of thing I’d like to generate with real diagrams during search so I can examine the state of the solver.
My favorite of my DD renderings so far is the next one. This shows a single-vehicle pickup-and-delivery problem. The arc labels are stops (e.g. 🐶, 🐱). The path the 🚗 follows to the terminal node is the route. The gray boxes group together nodes to merge based on state, which removes isomorphic subgraphs from the diagram.
We also have some images made by hand, like those in our post on expanders. As you can see, creating these by hand gets tedious.
TikZ is a program that renders manually coded graphics, while Whimsical is a WYSIWYG diagram editor. I like the Whimsical images a lot better – they feel cleaner and easier to understand.
Hugo supports GoAT diagrams by default, so I tried that out too. Here is an arbitrary MDD with two layers. The $[[1,2],4]$ node is a relaxed node; it doesn’t really matter here what the label means.
I like the way GoAT renders this diagram. It’s very readable. Unfortunately, it isn’t easy to automate. Creating a GoAT diagram is like using ASCII as a WYSIWYG diagramming tool, as you can see from the code for that image.
.-.
.-----------+ o +-----------.
| '+' |
| | |
v v v
.-. .---------. .-.
x1 | 0 | | [[1,2],4] | | 3 |
'-' '----+----' '+'
| |
.------------+ |
| | |
v v v
.--. .--. .---.
x2 | 10 | | 20 | | 100 |
'-+' '-+' '-+-'
| | |
| v |
| .-. |
'--------->| * |<----------'
'-'
Now we’ll look at a couple options for automatically generating visualizations of DDs. These convert descriptions of graphs into images.
Graphviz is the tried and true graph visualizer. It’s used in the Go `pprof` library for examining CPU and memory profiles, and in lots of other places.
Graphviz accepts a language called DOT. It uses different layout engines to convert input into a visual representation. The user doesn’t have control over node position. That’s the job of the layout engine.
Here’s the same MDD as written in DOT. The `start -> end` lines specify arcs in the digraph. The subgraphs organize nodes into layers. We add a dotted border around each layer and a label to say which variable it assigns. There isn’t any way of vertically centering and horizontally aligning the layer labels, so I thought it made more sense this way.
digraph G {
    s1 [label = 0]
    s2 [label = "[[1,2],4]"]
    s3 [label = 3]
    s4 [label = 10]
    s5 [label = 20]
    s6 [label = 100]

    r -> s1 [label = 2]
    r -> s2 [label = 4]
    r -> s3 [label = 1]
    s2 -> s4 [label = 10]
    s2 -> s5 [label = 4]
    s3 -> s6 [label = 2]

    subgraph cluster_0 {
        label = "x1"
        labeljust = "l"
        style = "dotted"
        s1
        s2
        s3
    }

    subgraph cluster_1 {
        label = "x2"
        labeljust = "l"
        style = "dotted"
        s4
        s5
        s6
    }

    s4 -> t
    s5 -> t
    s6 -> t
}
The result is comprehensible if not very attractive. With some fiddling, it’s possible to improve things like the spacing around arc labels. I couldn’t figure out how to align the layer labels and boxes. It doesn’t seem possible to move the relaxed nodes into their own column either, but that limitation isn’t unique to Graphviz.
Mermaid is a JavaScript library for diagramming and charting. One can use it on the web or, presumably, embed it in an application.
Mermaid is similar to Graphviz in many ways, but it supports more diagram types. The input for that MDD in Mermaid is a bit simpler. Labels go inside arcs (e.g. `-- 2 -->`), and there are more sensible rendering defaults.
graph TD
    start((( )))
    stop((( )))
    A(0)
    B("[[1,2],4]")
    C(3)
    D(10)
    E(20)
    F(100)
    start -- 2 --> A
    start -- 4 --> B
    start -- 1 --> C
    B -- 10 --> D
    B -- 4 --> E
    C -- 2 --> F
    D --> stop
    E --> stop
    F --> stop
    subgraph "x1 "
        A; B; C
    end
    subgraph "x2"
        D; E; F
    end
The result has a lot of the same limitations as the Graphviz version, but it looks more like the GoAT version. The biggest problem, as we see below, is that it’s not possible to left-align the layer labels. They can be obscured by arcs.
This got me thinking that there isn’t a strong reason DDs have to progress downward layer by layer. They could just as easily go from left to right. If we change the opening line from `graph TD` to `graph LR`, then we get the following image.
I think that’s pretty nice for a generated image.
But, ultimately, for several years it just felt like blogging was dead. Its space was usurped by Tweets, LinkedIn hustle posts, long form Medium content aimed at attracting talent, and other content trends. RSS feeds dried up bit by bit. That beautiful structure somewhere between a college essay and an academic preprint mostly ceased to be. Sad times, indeed.
That trend seems to be reversing. I don’t know whether it’s the result of nudges from Substack, or that all the introverts in tech finally gave up pandemic baking, but there is a lot of good content out there again! Fire up your RSS aggregators and get reading.
This post is my own reboot of “adventures in optimization,” a blog I’ve written intermittently since 2009. Unfortunately, it will take some time to move over all my old posts to Hugo. I’ll do that slowly as I create new ones. For now, I’ve ported over a couple early posts and put together a list of the active blogs I’m gleefully catching up on.
See you soon!
September 20, 2024 - Gurobi Summit 2024 - 📄 abstract, 🎟 registration
Accelerating Optimization AI Teams with DecisionOps
October 21, 2024 - INFORMS Annual Meeting - 📄 abstract, 🎟 registration
Solving the Weapon Target Assignment Problem with Decision Diagrams
July 30, 2024 - Nextmv Videos - 🎥 video
Operationalizing HiGHS-based MIP models and Q&A with project developers
June 27, 2024 - HiGHS Workshop 2024
Symphonic HiGHS: Operationalizing next moves with DecisionOps
June 7, 2024 - EURO Practitioners’ Forum - 📄 abstract, 🎥 video
Three model problem: Combining machine learning (ML) and operations research (OR) through horizontal computing
April 14, 2024 - INFORMS Analytics Conference - 📄 abstract
The sushi is ready. How do I deliver it? Forecast, schedule, route with DecisionOps
April 10, 2024 - Nextmv Videos - 🎥 video
Getting started with DecisionOps for decision science models using Gurobi
December 6, 2023 - PyData Global 2023 - 📄 abstract, 🧑💻️ code, 💻 slides, 🎥 video
Order up! How do I deliver it? Build on-demand logistics apps with Python, OR-Tools, and DecisionOps
November 16, 2023 - Nextmv Videos - 🎥 video
Forecast, schedule, route: 3 starter models for on-demand logistics
October 17, 2023 - INFORMS Annual Meeting - 📄 abstract
Adapting to Change in On-Demand Delivery: Unpacking a Suite of Testing Methodologies
September 20, 2023 - DecisionCAMP 2023 - 📄 abstract, 💻 slides, 🎥 video
Decision model, meet the real world: Testing optimization models for use in production environments
August 27, 2023 - DPSOLVE 2023 - 💻 slides
Implementing Decision Diagrams in Production Systems
May 11, 2023 - Nextmv Videos - 🎥 video
Several people are optimizing: Collaborative workflows for decision model operations
April 17, 2023 - INFORMS Analytics Conference - 📄 abstract
Decision Model, Meet Production: A Collaborative Workflow for Optimizing More Operations
February 16, 2023 - Nextmv Videos - 🎥 video
Decision diagrams in operations research, optimization, vehicle routing, and beyond
January 18, 2023 - Nextmv Videos - 🎥 video
In conversation with Karla Hoffman
November 16, 2022 - Nextmv Videos - 🎥 video
Decision model, meet production
October 5, 2020 - INFORMS Philadelphia Chapter - 🎥 video
Real-Time Routing for On-Demand Delivery
October 22, 2019 - INFORMS Annual Meeting - 💻 slides
Decision Diagrams for Real-Time Routing
July 6, 2017 - PyData Seattle 2017 - 📄 abstract, 🎥 video
Practical Optimization for Stats Nerds
March 5, 2017 - Data Science DC - 💻 slides
Practical Optimization for Stats Nerds
December 4, 2015 - PyData NYC 2015 - 💻 slides, 🎥 video
Optimize your Docker Infrastructure with Python
July 17, 2014 - IFORS 2014 - 📄 abstract, 💻 slides
A MIP-Based Dual Bounding Technique for the Irregular Nesting Problem
February 19, 2010 - PyCon 2010 - 🎥 video
Optimal Resource Allocation using Python
March 2024 - USPTO - 📄 patent
Fast computational generation of digital pickup and delivery plans describes algorithms for fast on-demand routing in pickup and delivery problems.
December 2023 - USPTO - 📄 patent
Prediction of travel time and determination of prediction interval describes technology for predicting travel times for on-demand delivery platforms.
June 2023 - USPTO - 📄 patent
Runners for optimization solvers and simulators describes technology for creating and executing Decision Diagram-based optimization solvers and state-based simulators in cloud environments.
September 2020 - Operations Research Forum - 📄 preprint
MIPLIBing: Seamless Benchmarking of Mathematical Optimization Problems and Metadata Extensions presents a Python library that automatically downloads queried subsets from the current versions of MIPLIB, MINLPLib, and QPLIB, provides a centralized local cache across projects, and tracks the best solution values and bounds on record for each problem.
May 2019 - Operations Research Letters - 📄 preprint
Decision diagrams for solving traveling salesman problems with pickup and delivery in real time explores the use of Multivalued Decision Diagrams and Assignment Problem inference duals for real-time optimization of TSPPDs.
October 2018 - Optimization Online - 📄 preprint
Integer Models for the Asymmetric Traveling Salesman Problem with Pickup and Delivery proposes a new ATSPPD model, new valid inequalities for the Sarin-Sherali-Bhootra ATSPPD, and studies the impact of relaxing complicating constraints in these.
September 2018 - Optimization Online - 📄 preprint
Exact Methods for Solving Traveling Salesman Problems with Pickup and Delivery in Real Time examines exact methods for solving TSPPDs with consolidation in real-time applications. It considers enumerative, Mixed Integer Programming, Constraint Programming, and hybrid optimization approaches under various time budgets.
March 2018 - Optimization Online - 📄 preprint
The Meal Delivery Routing Problem introduces the MDRP to formalize and study an important emerging class of dynamic delivery operations. It also develops optimization-based algorithms tailored to solve the courier assignment (dynamic vehicle routing) and capacity management (offline shift scheduling) problems encountered in meal delivery operations.
March 7, 2024 - Nextmv Blog
Nextmv Gurobi integration: Build, test, deploy decision models using Gurobi and DecisionOps
February 13, 2024 - Nextmv Blog
CI/CD for decision science: What is it, how does it work, and why does it matter?
February 1, 2024 - Nextmv Blog
New decision apps, an open source decision model hub, and an individual plan
December 19, 2023 - Nextmv Blog
Shift scheduling optimization: Generating shift types, planning for demand, and assigning workers
April 20, 2022 - Nextmv Blog
You need a solver. What is a solver?
March 2, 2021 - Nextmv Blog
Binaries are beautiful
March 2, 2020 - Nextmv Blog
How Hop Hops
September 13, 2018 - Grubhub Bytes
Decisions are first class citizens: an introduction to Decision Engineering
January 5, 2015 - The Yhat Blog
Currency Portfolio Optimization Using ScienceOps
November 10, 2014 - The Yhat Blog
How Yhat Does Cloud Balancing: A Case Study
I build decision automation and optimization tools.
By day, I am an optimization engineer, coder, and co-founder of Nextmv. I’m interested in hybrid optimization, decision diagrams, and mixed integer programming. My applications skew toward logistics for delivery platforms, with detours into cutting and packing.
For the past several years, I’ve worked in real-time optimization for on-demand delivery, scheduling, forecasting, and simulation. I did an MS in Operations Research by night at George Mason University, then a PhD in the same department under the advisement of Karla Hoffman.
By night, I’m an amateur cellist, and a cat and early music enthusiast. I’m also…
The Ruby Algebraic Modeling System is a simple modeling tool for formulating and solving MILPs in Ruby.
ap.cpp is an incremental primal-dual assignment problem solver written in C++. It can vastly improve propagation in hybrid optimization models that use AP relaxations. I use it within custom propagators in Gecode and in Decision Diagrams for solving the Traveling Salesman Problem with side constraints.
ap is a Go version of ap.cpp.
TSPPD Hybrid Optimization Code and TSPPD Decision Diagram Code are both used in my dissertation. The former contains C++14 code for hybrid CP and MIP models for solving TSPPDs. The latter uses a hybridized Decision Diagram implementation with an Assignment Problem inference dual inside a branch-and-bound.
TSPPDlib is a standard test set for TSPPDs. The instances are based on observed meal delivery data at Grubhub.
python-zibopt was a Python interface to the SCIP Optimization Suite. This was no longer necessary once PySCIPOpt emerged.
Chute was a simple, lightweight tool for running discrete event simulations in Python.
PyGEP was a simple library suitable for academic study of GEP (Gene Expression Programming) in Python 2.
Back in October of 2011, I started toying with a model for finding magic squares using SCIP. This is a fun modeling exercise and a challenging problem. First one constructs a square matrix of integer-valued variables.
from pyscipopt import Model

# [...snip...]

m = Model()
matrix = []
for i in range(size):
    row = [m.addVar(vtype="I", lb=1) for _ in range(size)]
    for x in row:
        m.addCons(x <= M)
    matrix.append(row)
Then one adds the following constraints:
The first two constraints are trivial to implement, and relatively easy for the solver. What I do is add a single extra variable and then set it equal to the sum of each row, each column, and the diagonal.
sum_val = m.addVar(vtype="M")
for i in range(size):
    m.addCons(sum(matrix[i]) == sum_val)
    m.addCons(sum(matrix[j][i] for j in range(size)) == sum_val)
m.addCons(sum(matrix[i][i] for i in range(size)) == sum_val)
It’s the third that messes things up. You can think of this as saying, for every possible pair of integer-valued variables $x$ and $y$:
$$ x \ge y + 1 \quad \text{or} \quad x \le y - 1 $$
Why is this hard? Because we can’t add both constraints to the model. That would make it infeasible. Instead, we write them in such a way that exactly one will be active for any given solution. This requires, for each pair of variables, an additional binary variable $z$ and a (possibly big) constant $M$. Thus we reformulate the above as:
$$ \begin{align*} x &\ge (y + 1) - M z \\ x &\le (y - 1) + M (1-z) \\ z &\in \{0,1\} \end{align*} $$
In code this looks like:
from itertools import chain

all_vars = list(chain(*matrix))
for i, x in enumerate(all_vars):
    for y in all_vars[i+1:]:
        z = m.addVar(vtype="B")
        m.addCons(x >= y + 1 - M*z)
        m.addCons(x <= y - 1 + M*(1-z))
However, here be dragons. We may not know how big (or small) to make $M$. Generally we want it as small as possible to make the LP relaxation of our integer programming model tighter. Different values of $M$ have unpredictable effects on solution time.
Which brings us to an interesting idea:
SCIP now supports bilinear constraints. This means that I can make $M$ a variable in the above model.
import sys

try:
    M = int(sys.argv[2])
except IndexError:
    M = m.addVar(vtype="M", lb=size * size)
else:
    assert M >= size * size
The magic square model linked to in this post provides both options. The first command line argument it requires is the matrix size. The second one, $M$, is optional. If not given, it leaves $M$ up to the solver.
An interesting exercise for the reader: Change the code to search for a minimal magic square, which minimizes either the value of $M$ or the sums of the columns, rows, and diagonal.
I understand the concept of using a companion set to remove duplicates from a list while preserving the order of its elements. But what should I do if these elements are composed of smaller pieces? For instance, say I am generating combinations of numbers in which order is unimportant. How do I make a set recognize that `[1,2,3]` is the same as `[3,2,1]` in this case?
There are a couple points that should help here.
While lists are unhashable and therefore cannot be put into sets, tuples are perfectly capable of this. That means I cannot do this with a list:
s = set()
s.add([1,2,3])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
But this works just fine (extra space added for emphasis of tuple parentheses).
s.add( (1,2,3) )
`(3,2,1)` and `(1,2,3)` may not hash to the same thing, but tuples are easily sortable. If I sort them before adding them to a set, they look the same.
tuple(sorted( (3,2,1) ))
(1, 2, 3)
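Putting the two points together, here's a small sketch (my own, not from the original question) that deduplicates order-insensitive lists by keying a set on sorted tuples:

```python
def dedupe_unordered(lists):
    seen = set()
    out = []
    for item in lists:
        key = tuple(sorted(item))  # canonical, hashable form
        if key not in seen:
            seen.add(key)
            out.append(item)
    return out

print(dedupe_unordered([[1, 2, 3], [3, 2, 1], [2, 4]]))  # → [[1, 2, 3], [2, 4]]
```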
If I want to be a little fancier, I can use `itertools.combinations`. The following generates all unique 3-digit combinations of integers from 1 to 4:
from itertools import combinations
list(combinations(range(1,5), 3))
[(1, 2, 3), (1, 2, 4), (1, 3, 4), (2, 3, 4)]
Now say I want to only find those that match some condition. I can add a filter to return, say, only those 3-digit combinations of integers from 1 to 6 that multiply to a number divisible by 10:
list(filter(
    lambda x: not (x[0]*x[1]*x[2]) % 10,
    combinations(range(1, 7), 3)
))
[(1, 2, 5),
(1, 4, 5),
(1, 5, 6),
(2, 3, 5),
(2, 4, 5),
(2, 5, 6),
(3, 4, 5),
(3, 5, 6),
(4, 5, 6)]
I’m actually not going to go into anything much resembling algorithmic complexity here. What I’d like to do is present a common performance anti-pattern that I see from novice programmers about once every year or so. If I can prevent one person from committing this error, this post will have achieved its goal. I’d also like to show how an intuitive understanding of time required by operations in relation to the size of data they operate on can be helpful.
Say you have a Big List of Things. It doesn’t particularly matter what these things are. Often they might be objects or dictionaries of denormalized data. In this example we’ll use numbers. Let’s generate a list of 1 million integers, each randomly chosen from the first 100 thousand natural numbers:
import random
choices = range(100000)
x = [random.choice(choices) for i in range(1000000)]
Now say you want to remove (or aggregate, or structure) duplicate data while keeping them in order of appearance. Intuitively, this seems simple enough. A first solution might involve creating a new empty list, iterating over x, and only appending those items that are not already in the new list.
order = []
for i in x:
    if i not in order:
        order.append(i)
Try running this. What’s wrong with it?
The issue is the conditional on line 3. In the worst case, it could look at every item in the order list for each item in x. If the list is big, as it is in our example, that wastes a lot of cycles. We can reason that we can improve the performance of our code by replacing this conditional with something faster.
Given that sets have near constant time for membership tests, one solution is to create a companion data structure, which we’ll call seen. Being a set, it doesn’t care about the order of the items, but it will allow us to test for membership quickly.
order = []
seen = set()
for i in x:
    if i not in seen:
        seen.add(i)
        order.append(i)
Now try running this. Better?
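If you want to see the difference concretely, here's a rough timing harness wrapping the two loops above in functions (the function names are mine, and the list is smaller than the one above so it finishes quickly). Exact numbers depend on your machine, but the gap is large:

```python
import random
import timeit

def dedupe_with_list(data):
    order = []
    for i in data:
        if i not in order:  # O(len(order)) scan per item
            order.append(i)
    return order

def dedupe_with_set(data):
    order, seen = [], set()
    for i in data:
        if i not in seen:  # near constant time membership test
            seen.add(i)
            order.append(i)
    return order

data = [random.choice(range(1000)) for _ in range(20000)]
t_list = timeit.timeit(lambda: dedupe_with_list(data), number=1)
t_set = timeit.timeit(lambda: dedupe_with_set(data), number=1)
print(t_list, t_set)  # the set version should be dramatically faster
```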
Not that this is the best way to perform this particular action. If you aren’t familiar with it, take a look at the `groupby` function from `itertools`, which is what I will sometimes reach for in a case like this.
Anyone familiar at all with simulation will recognize the last item as the motivating force of the entire field. Simulation models tend to take over when systems become so complex that understanding them is prohibitive in cost and time or entirely infeasible. In a simulation, the modeler can focus on individual interactions between entities while still hoping for useful output in the form of descriptive statistics.
As such, simulations are nearly always stochastic. The output of a simulation, whether it be the mean time to service upon entering a queue or the number of fish alive in a pond, is determined by a number of random inputs. It is estimated by looking at a sample of the entire, often infinite, problem space and therefore must be described in terms of mean and variance.
For me, simulation building usually follows a process roughly like this:
The reason for creating a simulation without randomness first is that it can be difficult or impossible to verify its correctness otherwise. Thus one may focus on the simulation logic first before analyzing and adding sources of randomness.
Where the procedure breaks down is after the third step. Domain experts are often happy to share their knowledge about systems to aid in designing simulations, and typically can understand the resulting abstractions. They are also invaluable in verifying simulation output. However, they are unlikely to understand why it is necessary to add randomness to a system that they already perceive as functional. Further, doing so can be just as difficult and time consuming as the initial model development and therefore requires justification.
This can be a quandary for the model builder. How does one communicate the need to incorporate randomness to decision makers who lack understanding of probability? It is trivially easy to construct simulations that use the same input parameters but yield drastically different outputs. Consider the code below, which simulates two events occurring and counts the number of times event b happens before event a.
import random

def sim_stochastic(event_a_lambda, event_b_lambda):
    # Returns 0 if event A arrives first, 1 if event B arrives first.
    # Calculate next arrival time for each event randomly.
    event_a_arrival = random.expovariate(event_a_lambda)
    event_b_arrival = random.expovariate(event_b_lambda)
    return 0.0 if event_a_arrival <= event_b_arrival else 1.0

def sim_deterministic(event_a_lambda, event_b_lambda):
    # Returns 0 if event A arrives first, 1 if event B arrives first.
    # Calculate next arrival time for each event deterministically.
    event_a_arrival = 1.0 / event_a_lambda
    event_b_arrival = 1.0 / event_b_lambda
    return 0.0 if event_a_arrival <= event_b_arrival else 1.0

if __name__ == '__main__':
    event_a_lambda = 0.3
    event_b_lambda = 0.5
    repetitions = 10000

    for sim in (sim_stochastic, sim_deterministic):
        output = [
            sim(event_a_lambda, event_b_lambda)
            for _ in range(repetitions)
        ]
        event_b_first = 100.0 * (sum(output) / len(output))
        print('event b is first %0.1f%% of the time' % event_b_first)
Both simulations use the same input parameters, but the second one is essentially wrong: event b will always happen first. In the stochastic version, we use exponential distributions for the inputs and obtain an output that matches our basic understanding of these distributions.
event b is first 63.0% of the time
event b is first 100.0% of the time
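The stochastic estimate can also be checked analytically: when two independent exponential arrivals race, the probability that B comes first has the closed form lambda_b / (lambda_a + lambda_b). A quick check with the same parameters (this snippet is mine, not part of the original example):

```python
# Closed-form probability that an Exponential(lambda_b) arrival beats an
# independent Exponential(lambda_a) arrival: lambda_b / (lambda_a + lambda_b).
event_a_lambda = 0.3
event_b_lambda = 0.5
p_b_first = event_b_lambda / (event_a_lambda + event_b_lambda)
print('event b is first %0.1f%% of the time' % (100 * p_b_first))  # 62.5%
```

The simulated 63.0% agrees with this to within sampling error.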
How about you? How do you discuss the need to model a random world with decision makers?
It’s possible this will turn out like the day when Python 2.5 introduced coroutines. At the time I was very excited. I spent several hours trying to convince my coworkers we should immediately abandon all our existing Java infrastructure and port it to finite state machines implemented using Python coroutines. After a day of hand waving over a proof of concept, we put that idea aside and went about our lives.
Soon after, I left for a Python shop, but in the next half decade I still never found a good place to use this interesting feature.
But it doesn’t feel like that.
As I come to terms more with switching to Python 3.2, the futures module seems similarly exciting. I wish I’d had it years ago, and it’s almost reason in itself to upgrade from Python 2.7. Who cares if none of your libraries have been ported yet?
This library lets you take any function and distribute it over a process pool. To test that out, we’ll generate a bunch of random graphs and iterate over all their cliques.
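As a minimal sketch of the interface before we get to the graph example (the `square` function here is a toy of my own, not from the post): `executor.map` works like the builtin `map`, but fans the calls out across worker processes.

```python
from concurrent import futures

def square(x):
    # A trivially parallelizable pure function.
    return x * x

if __name__ == '__main__':
    with futures.ProcessPoolExecutor(max_workers=2) as executor:
        # Results come back in input order, like the builtin map.
        print(list(executor.map(square, range(5))))  # [0, 1, 4, 9, 16]
```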
First, let’s generate some test data using the dense_gnm_random_graph function. Our data includes 1000 random graphs, each with 100 nodes and 100 * 100 edges.
import networkx as nx
n = 100
graphs = [nx.dense_gnm_random_graph(n, n*n) for _ in range(1000)]
Now we write a function to iterate over all cliques in a given graph. NetworkX provides a find_cliques function which returns a generator. Iterating over it ensures we run through the entire process of finding all cliques for a graph.
def iterate_cliques(g):
    for _ in nx.find_cliques(g):
        pass
Now we just define two functions, one for running in serial and one for running in parallel using futures.
from concurrent import futures

def serial_test(graphs):
    for g in graphs:
        iterate_cliques(g)

def parallel_test(graphs, max_workers):
    with futures.ProcessPoolExecutor(max_workers=max_workers) as executor:
        executor.map(iterate_cliques, graphs)
Our __main__ simply generates the random graphs, samples from them, times both functions, and writes CSV data to standard output.
from csv import writer
import random
import sys
import time

if __name__ == '__main__':
    out = writer(sys.stdout)
    out.writerow(['num graphs', 'serial time', 'parallel time'])

    n = 100
    graphs = [nx.dense_gnm_random_graph(n, n*n) for _ in range(1000)]

    # Run with a number of different randomly generated graphs
    for num_graphs in range(50, 1001, 50):
        sample = random.choices(graphs, k=num_graphs)

        start = time.time()
        serial_test(sample)
        serial_time = time.time() - start

        start = time.time()
        parallel_test(sample, 16)
        parallel_time = time.time() - start

        out.writerow([num_graphs, serial_time, parallel_time])
The output of this script shows that we get a fairly linear speedup with little effort.
I ran this on a machine with 8 cores and hyperthreading. Eyeballing the chart, it looks like the speedup is roughly 5x. My system monitor shows spikes in CPU usage across cores whenever the parallel test runs.
solve.affine <- function(A, rc, x, tolerance=10^-7, R=0.999) {
    # Affine scaling method
    while (TRUE) {
        X_diag <- diag(x)

        # Compute (A * X_diag^2 * A^t)^-1 using Cholesky factorization.
        # This is responsible for scaling the original problem matrix.
        q <- A %*% X_diag^2 %*% t(A)
        q_inv <- chol2inv(chol(q))

        # lambda = q^-1 * A * X_diag^2 * rc
        lambda <- q_inv %*% A %*% X_diag^2 %*% rc

        # rc - A^t * lambda is used repeatedly
        foo <- rc - t(A) %*% lambda

        # We converge as s goes to zero
        s <- sqrt(sum((X_diag %*% foo)^2))

        # Compute new x
        x <- (x + R * X_diag^2 %*% foo / s)[,]

        # If s is within our tolerance, stop.
        if (abs(s) < tolerance) break
    }
    x
}
This function accepts a matrix A which contains all technological coefficients for an LP, a vector rc containing its reduced costs, and an initial point x interior to the LP’s feasible region. Optional arguments to the function include a tolerance, for detecting when the method is within an acceptable distance from the optimal point, and a value for R, which must be strictly between 0 and 1 and controls scaling.
The method works by rescaling the matrix A around the current solution x. It then computes a new x such that it remains feasible and interior, which is why R cannot be 0 or 1. It requires a feasible interior point to start and only projects to other feasible interior points, so the right hand side of the LP is not required (it is implicit from the starting point). The shadow prices for each iteration are captured in the vector lambda, so the gap between primal and dual solutions is easy to compute.
We run this function against a 3x3 LP with a known solution:
max z = 5x1 + 4x2 + 3x3
st 2x1 + 3x2 + x3 <= 5
4x1 + x2 + 2x3 <= 11
3x1 + 4x2 + 2x3 <= 8
x1, x2, x3 >= 0
The optimal solution to this LP is:
z = 13
x1 = 2
x2 = 0
x3 = 1
This problem can be run against the affine scaling function by defining A with all necessary slack variables, and using an arbitrary feasible interior point:
A <- matrix(c(
    2, 3, 1, 1, 0, 0,
    4, 1, 2, 0, 1, 0,
    3, 4, 2, 0, 0, 1
), nrow=3, byrow=TRUE)
rc <- c(5, 4, 3, 0, 0, 0)
x <- c(0.5, 0.5, 0.5, 2, 7.5, 3.5)

solution <- solve.affine(A, rc, x)
print(solution)
print(sum(solution * rc))
This provides an output vector that is very close to the optimal primal solution shown above. Since interior point methods converge asymptotically to optimal solutions, it is important to note that we can only ever get (extremely) close to our final optimal objective and decision variable values.
> print(solution)
[1] 1.999998e+00 4.268595e-07 1.000002e+00 1.280579e-06 1.000005e+00
[6] 1.280579e-06
> print(sum(solution * rc))
[1] 13.00000
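For readers who prefer Python, here is a rough NumPy transcription of the same iteration (my own sketch, not part of the original post; `np.linalg.solve` stands in for the explicit Cholesky inverse, and an iteration cap is added as a safety net):

```python
import numpy as np

def solve_affine(A, rc, x, tolerance=1e-7, R=0.999, max_iter=10000):
    # Primal affine scaling, mirroring the R function above: rescale the
    # problem around the current interior point, project the costs, step.
    for _ in range(max_iter):
        X2 = np.diag(x ** 2)                              # X_diag^2
        lam = np.linalg.solve(A @ X2 @ A.T, A @ X2 @ rc)  # dual estimates
        foo = rc - A.T @ lam                              # projected costs
        s = np.linalg.norm(x * foo)                       # convergence measure
        x = x + R * (X2 @ foo) / s
        if abs(s) < tolerance:
            break
    return x

A = np.array([[2, 3, 1, 1, 0, 0],
              [4, 1, 2, 0, 1, 0],
              [3, 4, 2, 0, 0, 1]], dtype=float)
rc = np.array([5, 4, 3, 0, 0, 0], dtype=float)
x0 = np.array([0.5, 0.5, 0.5, 2.0, 7.5, 3.5])

solution = solve_affine(A, rc, x0)
print(solution @ rc)  # ~13
```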
For the final JAPH in this series, I implemented a simple transpiler that converts a small subset of Scheme programs to equivalent Python programs. It starts with a Scheme program that prints 'just another scheme hacker'.
(define (output x)
  (if (null? x)
      ""
      (begin (display (car x))
             (if (null? (cdr x))
                 (display "\n")
                 (begin (display " ")
                        (output (cdr x)))))))

(output (list "just" "another" "scheme" "hacker"))
The program then tokenizes that Scheme source, parses the token stream, and converts that into Python 3.
def output(x):
    if not x:
        ""
    else:
        print(x[0], end='')
        if not x[1:]:
            print("\n", end='')
        else:
            print(" ", end='')
            output(x[1:])

output(["just", "another", "python", "hacker"])
Finally it executes the resulting Python string using exec. Obfuscation is left as an exercise for the reader.
import re

def tokenize(input):
    '''Tokenizes an input stream into a list of recognizable tokens'''
    token_res = (
        r'\(',       # open paren -> starts expression
        r'\)',       # close paren -> ends expression
        r'"[^"]*"',  # quoted string (don't support \" yet)
        r'[\w?]+'    # atom
    )
    return re.findall(r'(' + '|'.join(token_res) + ')', input)

def parse(stream):
    '''Parses a token stream into a syntax tree'''
    if not stream:
        return []
    else:
        # Build a list of arguments (possibly expressions) at this level
        args = []
        while True:
            # Get the next token
            try:
                x = stream.pop(0)
            except IndexError:
                return args

            # ( and ) control the level of the tree we're at
            if x == '(':
                args.append(parse(stream))
            elif x == ')':
                return args
            else:
                args.append(x)

def compile(tree):
    '''Compiles a Scheme Abstract Syntax Tree into near-Python'''
    def compile_expr(indent, expr):
        indent += 1
        lines = []  # these will have [(indent, statement), ...] structure
        while expr:
            # Two options: expr is a string like "'" or it is a list
            if isinstance(expr, str):
                return [(
                    indent,
                    expr.replace('scheme', 'python').replace('\n', '\\n')
                )]
            else:
                start = expr.pop(0)
                if start == 'define':
                    signature = expr.pop(0)
                    lines.append((indent,
                        'def %s(%s):' % (
                            signature[0],
                            ', '.join(signature[1:])
                        )
                    ))
                    while expr:
                        lines.extend(compile_expr(indent, expr.pop(0)))
                elif start == 'if':
                    # We don't support multi-clause conditionals yet
                    clause = compile_expr(indent, expr.pop(0))[0][1]
                    lines.append((indent, 'if %s:' % clause))
                    if_true_lines = compile_expr(indent, expr.pop(0))
                    if_false_lines = compile_expr(indent, expr.pop(0))
                    lines.extend(if_true_lines)
                    lines.append((indent, 'else:'))
                    lines.extend(if_false_lines)
                elif start == 'null?':
                    # Only supports conditionals of the form (null? foo)
                    if isinstance(expr[0], str):
                        condition = expr.pop(0)
                    else:
                        condition = compile_expr(indent, expr.pop(0))[0][1]
                    return [(indent, 'not %s' % condition)]
                elif start == 'begin':
                    # This is just a series of statements, so don't indent
                    while expr:
                        lines.extend(compile_expr(indent-1, expr.pop(0)))
                elif start == 'display':
                    arguments = []
                    while expr:
                        arguments.append(
                            compile_expr(indent, expr.pop(0))[0][1]
                        )
                    lines.append((
                        indent,
                        "print(%s, end='')" % (', '.join(arguments))
                    ))
                elif start == 'car':
                    lines.append((indent, '%s[0]' % expr.pop(0)))
                elif start == 'cdr':
                    lines.append((indent, '%s[1:]' % expr.pop(0)))
                elif start == 'list':
                    arguments = []
                    while expr:
                        arguments.append(
                            compile_expr(indent, expr.pop(0))[0][1]
                        )
                    lines.append((indent, '[%s]' % ', '.join(arguments)))
                else:
                    # Assume this is a function call
                    arguments = []
                    while expr:
                        arguments.append(
                            compile_expr(indent, expr.pop(0))[0][1]
                        )
                    lines.append((
                        indent,
                        "%s(%s)" % (start, ', '.join(arguments))
                    ))
        return lines

    return [compile_expr(-1, expr) for expr in tree]

if __name__ == '__main__':
    scheme = '''
    (define (output x)
      (if (null? x)
          ""
          (begin (display (car x))
                 (if (null? (cdr x))
                     (display "\n")
                     (begin (display " ")
                            (output (cdr x)))))))

    (output (list "just" "another" "scheme" "hacker"))
    '''

    python = ''
    for expr in compile(parse(tokenize(scheme))):
        python += '\n'.join([(' ' * 4 * x[0]) + x[1] for x in expr]) + '\n\n'

    exec(python)
This JAPH uses a Turing machine. The machine accepts any string that ends in '\n' and allows side effects. This lets us print the value of the tape as it encounters each character. While the idea of using lambda functions as side effects in a Turing machine is a little bizarre on many levels, we work with what we have. And Python is multi-paradigmatic, so what the heck.
import re

def turing(tape, transitions):
    # The tape input comes in as a string. We approximate an infinite
    # length tape via a hash, so we need to convert this to {index: value}
    tape_hash = {i: x for i, x in enumerate(tape)}

    # Start at 0 using our transition matrix
    index = 0
    state = 0
    while True:
        value = tape_hash.get(index, '')

        # This is a modified Turing machine: it uses regexen
        # and has side effects. Oh well, I needed IO.
        for rule in transitions[state]:
            regex, next, direction, new_value, side_effect = rule
            if re.match(regex, value):
                # Terminal states
                if new_value in ('YES', 'NO'):
                    return new_value

                tape_hash[index] = new_value
                side_effect(value)
                index += direction
                state = next
                break

assert 'YES' == turing('just another python hacker\n', [
    # This Turing machine recognizes the language of strings that end in \n.
    # Regex rule, next state, left/right = -1/+1, new value, side effect.
    [  # State 0:
        [r'^[a-z ]$', 0, +1, '', lambda x: print(x, end='')],
        [r'^\n$', 1, +1, '', lambda x: print(x, end='')],
        [r'^.*$', 0, +1, 'NO', None],
    ],
    [  # State 1:
        [r'^$', 1, -1, 'YES', None]
    ]
])
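As a standalone cross-check of the language the machine recognizes (the regex below is my own restatement of the transition table, valid for inputs on which the machine halts):

```python
import re

def accepts(tape):
    # State 0 consumes lowercase letters and spaces; a trailing newline
    # moves to state 1, which accepts at the end of the tape.
    return re.fullmatch(r'[a-z ]*\n', tape) is not None

print(accepts('just another python hacker\n'))  # True
print(accepts('just another python hacker'))    # False
```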
Obfuscation again consists of converting the above code into lambda functions using Y combinators. This is a nice programming exercise, so I’ve left it out of this post in case anyone wants to try. The Turing machine has to return 'YES' to indicate that it accepts the string, thus the assertion. Our final obfuscated JAPH is a single expression.
assert'''YES'''==(lambda g:(lambda f:g(lambda arg:f(f)(arg)))(lambda f:g(
lambda arg: f(f)(arg))))(lambda f: lambda q:[(lambda g:(lambda f:g(lambda
arg:f(f)(arg)))(lambda f: g(lambda arg:f(f)(arg))))(lambda f: lambda x:(x
[0][0]if x[0] and __import__('re').match(x[0][0][0],x[1])else f([x[0][1:]
,x[1]]))) ([q[3][q[1]],q[2].get(q[0],'')])[4](q[2].get(q[0],'')), (lambda
g:(lambda f:g(lambda arg:f(f)(arg))) (lambda f:g(lambda arg:f(f)(arg))))(
lambda f:lambda x:(x[0][0]if x[0] and __import__('re').match(x[0][0][0],x
[1])else f([x[0][1:],x[1]])))([q[3][q[1]],q[2].get(q[0],'')])[3]if(lambda
g:(lambda f:g(lambda arg:f(f)(arg))) (lambda f:g(lambda arg:f(f)(arg))))(
lambda f:lambda x:(x[0][0]if x[0]and __import__('re').match(x[0][0][0],x[
1]) else f([x[0][1:],x[1]])))([q[3][q[1]],q[2].get(q[0],'')])[3]in('YES',
'NO')else f([q[0]+(lambda g:(lambda f:g(lambda arg:f(f)(arg)))(lambda f:g
(lambda arg:f(f)(arg))))(lambda f:lambda x:(x[0][0]if x[0]and __import__(
're').match(x[0][0][0],x[1])else f([x[0][1:], x[1]])))([q[3][q[1]], q[2].
get(q[0],'')])[2],(lambda g:(lambda f:g(lambda arg: f(f)(arg)))(lambda f:
g(lambda arg:f(f)(arg))))(lambda f:lambda x:(x[0][0]if x[0]and __import__
('re').match(x[0][0][0],x[1])else f([x[0][1:], x[1]])))([q[3][q[1]],q[2].
get(q[0],'')])[1],q[2],q[3]])][1])([0,0,{i:x for i,x in enumerate('just '
'another python hacker\n')}, [[[r'^[a-z ]$',0,+1,'',lambda x:print(x,end=
'')], [r'^\n$',1,+1,'',lambda x:print(x, end='')],[r'^.*$',0,+1,'''NO''',
lambda x:None]], [[r'''^$''',+1,-1,'''YES''', lambda x: None or None]]]])
At this point, tricking Python into printing strings via indirect means got a little boring, so I switched to obfuscating fundamental computer science algorithms. Here’s a JAPH that takes in a Huffman coded version of 'just another python hacker', decodes it, and prints it.
# Build coding tree
def build_tree(scheme):
    if scheme.startswith('*'):
        left, scheme = build_tree(scheme[1:])
        right, scheme = build_tree(scheme)
        return (left, right), scheme
    else:
        return scheme[0], scheme[1:]

def decode(tree, encoded):
    ret = ''
    node = tree
    for direction in encoded:
        if direction == '0':
            node = node[0]
        else:
            node = node[1]
        if isinstance(node, str):
            ret += node
            node = tree
    return ret

tree = build_tree('*****ju*sp*er***yct* h**ka*no')[0]
print(
    decode(tree, bin(10627344201836243859174935587).lstrip('0b').zfill(103))
)
The decoding tree is like a LISP-style sequence of pairs. '*' represents a branch in the tree while other characters are leaf nodes. This looks like the following.
(
    (
        (
            (
                ('j', 'u'),
                ('s', 'p')
            ),
            ('e', 'r')
        ),
        (
            (
                ('y', 'c'),
                't'
            ),
            (' ', 'h')
        )
    ),
    (
        ('k', 'a'),
        ('n', 'o')
    )
)
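To see how the leading '*' drives the recursion, here is build_tree again (repeated so the snippet stands alone) on a much shorter scheme string:

```python
def build_tree(scheme):
    # '*' introduces an internal node: recursively build the left subtree,
    # then the right subtree from whatever input remains.
    # Anything else is a leaf; return it plus the unconsumed input.
    if scheme.startswith('*'):
        left, scheme = build_tree(scheme[1:])
        right, scheme = build_tree(scheme)
        return (left, right), scheme
    else:
        return scheme[0], scheme[1:]

print(build_tree('*a*bc')[0])  # ('a', ('b', 'c'))
```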
The actual Huffman coded version of our favorite string gets about 50% smaller represented in base-2.
0000000001000100101011010111011101010111001000110110000110100001010111111110011001111010100110000100011
There’s a catch here, which is that this is hard to obfuscate unless we turn it into a single expression. This means that we have to convert build_tree and decode into lambda functions. Unfortunately, they are recursive, and a lambda function has no name to recurse on. Fortunately, we can use Y combinators to get around the problem. These are worth some study since they will pop up again in future JAPHs.
Y = lambda g: (
    lambda f: g(lambda arg: f(f)(arg)))(lambda f: g(lambda arg: f(f)(arg))
)

build_tree = Y(
    lambda f: lambda scheme: (
        (f(scheme[1:])[0], f(f(scheme[1:])[1])[0]),
        f(f(scheme[1:])[1])[1]
    ) if scheme.startswith('*') else (scheme[0], scheme[1:])
)

decode = Y(lambda f: lambda x: x[3]+x[1] if not x[2] else (
    f([x[0], x[0], x[2], x[3]+x[1]]) if isinstance(x[1], str) else (
        f([x[0], x[1][0], x[2][1:], x[3]]) if x[2][0] == '0' else (
            f([x[0], x[1][1], x[2][1:], x[3]])
        )
    )
))

tree = build_tree('*****ju*sp*er***yct* h**ka*no')[0]
print(
    decode([
        tree,
        tree,
        bin(10627344201836243859174935587).lstrip('0b').zfill(103), ''
    ])
)
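The Y construction above is completely general. As a quick standalone illustration, it lets an anonymous function compute factorials without ever naming itself:

```python
Y = lambda g: (
    lambda f: g(lambda arg: f(f)(arg)))(lambda f: g(lambda arg: f(f)(arg))
)

# The inner lambda receives a handle to "itself" as f.
fact = Y(lambda f: lambda n: 1 if n == 0 else n * f(n - 1))
print(fact(5))  # 120
```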
The final version is a condensed (and expanded, oddly) version of the above.
print((lambda t,e,s:(lambda g:(lambda f:g(lambda arg:f(f)(arg)))(lambda f:
g(lambda arg: f(f)(arg))))(lambda f:lambda x: x[3]+x[1]if not x[2]else f([
x[0],x[0],x[2],x[3]+x[1]])if isinstance(x[1],str)else f([x[0],x[1][0],x[2]
[1:],x[3]])if x[2][0]=='0'else f([x[0],x[1][1],x[2][1:],x[3]]))([t,t,e,s])
)((lambda g:(lambda f:g(lambda arg:f(f)(arg)))(lambda f:g(lambda arg:f(f)(
arg))))(lambda f:lambda p:((f(p[1:])[0],f(f(p[1:])[1])[0]),f(f(p[1:])[1])[
1])if p.startswith('*')else(p[0],p[1:]))('*****ju*sp*er***yct* h**ka*no')[
0],bin(10627344201836243859179756385-4820798).lstrip('0b').zfill(103),''))
Here’s a JAPH composed solely for effect. For each letter in 'just another python hacker' it loops over each of the characters ' abcdefghijklmnopqrstuvwxyz', printing each. Between characters it pauses for 0.05 seconds, backing up and moving on to the next if it hasn’t reached the desired one yet. This achieves a sort of rolling effect by which the final string appears on our screen over time.
import string
import sys
import time

letters = ' ' + string.ascii_lowercase
for l in 'just another python hacker':
    for x in letters:
        print(x, end='')
        sys.stdout.flush()
        time.sleep(0.05)
        if x == l:
            break
        else:
            print('\b', end='')
print()
We locate and print each letter in the string with a list comprehension. At the end we have an extra line of code (the eval statement) that gives us our newline.
[[(lambda x,l:str(print(x,end=''))+str(__import__(print.
__doc__[print.__doc__.index('stdout') - 4:print.__doc__.
index('stdout')-1]).stdout.flush()) + str(__import__(''.
join(reversed('emit'))).sleep(0o5*1.01/0x64))+str(print(
'\b',end='\x09'.strip())if x!=l else'*&#'))(x1,l1)for x1
in('\x20'+getattr(__import__(type('phear').__name__+'in'
'g'),dir(__import__(type('snarf').__name__+'ing'))[15]))
[:('\x20'+getattr(__import__(type('smear').__name__+'in'
'g'),dir(__import__(type('slurp').__name__+'ing'))[15]))
.index(l1)+1]]for l1 in'''just another python hacker''']
eval('''\x20\x09eval("\x20\x09eval('\x20 print()')")''')
No series of JAPHs would be complete without ROT13. This is the example through which aspiring Perl programmers learn to use tr and its synonym y. In Perl the basic ROT13 JAPH starts as:
$foo = 'whfg nabgure crey unpxre';
$foo =~ y/a-z/n-za-m/;
print $foo;
Python has nothing quite so elegant in its default namespace. However, this does give us the opportunity to explore a little-used aspect of strings: the translate method. If we construct a dictionary of ordinals we can accomplish the same thing with a touch more effort.
import string

table = {
    ord(x): ord(y) for x, y in zip(
        string.ascii_lowercase,
        string.ascii_lowercase[13:] + string.ascii_lowercase
    )
}
print('whfg nabgure clguba unpxre'.translate(table))
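For completeness, the standard library can also do this directly through the codecs module, so the translate table above is strictly for sport:

```python
import codecs

# The 'rot_13' codec is a text-to-text transform in Python 3.
print(codecs.encode('whfg nabgure clguba unpxre', 'rot_13'))
# just another python hacker
```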
We obfuscate the construction of this translation dictionary and, for good measure, use getattr to find the print function off of __builtins__. This will likely only work in Python 3.2, since the order of attributes on __builtins__ matters.
getattr(vars()[list(filter(lambda _:'\x5f\x62'in _,dir
()))[0]], dir(vars()[list(filter(lambda _:'\x5f\x62'in
_, dir()))[0]])[list(filter(lambda _:_ [1].startswith(
'\x70\x72'),enumerate(dir(vars()[list(filter(lambda _:
'\x5f\x62'in _,dir()))[0]]))))[0][0]])(getattr('whfg '
+'''nabgure clguba unpxre''', dir('0o52')[0o116])({ _:
(_-0o124) %0o32 +0o141 for _ in range(0o141, 0o173)}))
This JAPH starts with an anagram of 'just another python hacker' and converts it prior to printing. It sorts the anagram by the indices of another string, in order of their associated characters. This is sort of like a pre-digested Schwartzian transform.
x = 'upjohn tehran hectors katy'
y = '1D0HG6JFO9P5ICKAM87B24NL3E'
print(''.join(x[i] for i in sorted(range(len(x)), key=lambda p: y[p])))
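The trick generalizes: sorting range(len(x)) with key y[p] produces the permutation that orders positions by their tags. A tiny made-up example:

```python
x = 'ogd'  # anagram of 'dog'
y = '231'  # a tag per position; sorting the tags gives the unscramble order
perm = sorted(range(len(x)), key=lambda p: y[p])
print(''.join(x[i] for i in perm))  # dog
```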
Obfuscation consists mostly of using silly machinations to construct the string we use to sort the anagram.
print(''.join('''upjohn tehran hectors katy'''[_]for _ in sorted(range
(26),key=lambda p:(hex(29)[2:].upper()+str(3*3*3*3-3**4)+'HG'+str(sum(
range(4)))+'JFO'+str((1+2)**(1+1))+'P'+str(35/7)[:1]+'i.c.k.'.replace(
'.','').upper()+'AM'+str(3**2*sum(range(5))-3)+hex(0o5444)[2:].replace
(*'\x62|\x42'.split('|'))+'NL'+hex(0o076).split('x')[1].upper())[p])))
Many years ago, I was a Perl programmer. Then one day I became disillusioned at the progress of Perl 6 and decided to import this.
This seems to be a fairly common story for Perl-to-Python converts. While I haven’t looked back much, there are a number of things I really miss about perl (lower case intentional). I miss having value types in a dynamic language, magical and ill-advised use of cryptocontext, and sometimes even pseudohashes, because they were inexcusably weird. A language that supports so many ideas out of the box enables an extended learning curve that lasts for many years. “Perl itself is the game.”
Most of all I think I miss writing Perl poetry and JAPHs. Sadly, I didn’t keep any of those I wrote, and I’m not competent enough with the language anymore to write interesting ones. At the time I was intentionally distancing myself from a model that was largely implicit and based on archaic systems internals and moving to one that was (supposedly) explicit and simple.
After switching to Python as my primary language, I used the following email signature in a nod to this change in orientation (intended for Python 2):
print 'just another python hacker'
Recently I’ve been experimenting with writing JAPHs in Python. I think of these as “reformed JAPHs.” They accomplish the same purpose as programming exercises but in a more restricted context. In some ways they are more challenging. Creativity can be difficult in a narrowly defined landscape.
I have written a small series of reformed JAPHs which increase monotonically in complexity. Here is the first one, written in plain understandable Python 3.
import string
letters = string.ascii_lowercase + ' '
indices = [
9, 20, 18, 19, 26, 0, 13, 14, 19, 7, 4, 17, 26,
15, 24, 19, 7, 14, 13, 26, 7, 0, 2, 10, 4, 17
]
print(''.join(letters[i] for i in indices))
This is fairly simple. Instead of explicitly embedding the string 'just another python hacker' in the program, we assemble it using the index of its letters in the string 'abcdefghijklmnopqrstuvwxyz '. We then obfuscate through a series of minor measures:
- Instead of calling print, import sys and make a call to sys.stdout.write.
- Build string.lowercase + ' ' by joining together the character versions of its respective ordinal values (97 to 123 and 32).
- Join the indices with 'l' and split that back into a list.
- Use ''' liberally and rely on the fact that python concatenates adjacent strings.

Here’s the obfuscated version:
eval("__import__('''\x73''''''\x79''''''\x73''').sTdOuT".lower()
).write(''.join(map(lambda _:(list(map(chr,range(97,123)))+[chr(
32)])[int(_)],('''9l20l18l19''''''l26l0l13l14l19l7l4l17l26l15'''
'''l24l19l7l14l1''''''3l26l7l0l2l10l4l17''').split('l')))+'\n',)
We could certainly do more, but that’s where I left this one. Stay tuned for the next JAPH.
So say you’re an economist and you actually do need to produce a realistic estimate of when China’s GDP surpasses that of the USA. Can you use such an approach? Not really. There are several simplifying assumptions the Post made that are perfectly reasonable. However, if the goal is an analytical output from a highly random system such as GDP growth, one should not assume the inputs are fixed. (I’m not saying I have any gripe with their interactive. This post has a different purpose.)
Why is this? The short answer is that randomness in any system can change its output drastically from one run to the next. Even if the mean from a deterministic analysis is correct, it tells us nothing about the variance of our output. We really need a confidence interval of years when China is likely to overtake the USA.
We’ll move in the great tradition of all simulation studies. First we prepare our input. A CSV of GDP in current US dollars for both countries from 1960 to 2009 is available from the World Bank data files. We read this into a data frame and calculate their growth rates year over year. Note that the first value for growth has to be NA.
gdp <- read.csv('gdp.csv')
gdp$USA.growth <- rep(NA, length(gdp$USA))
gdp$China.growth <- rep(NA, length(gdp$China))

for (i in 2:length(gdp$USA)) {
    gdp$USA.growth[i] <- 100 * (gdp$USA[i] - gdp$USA[i-1]) / gdp$USA[i-1]
    gdp$China.growth[i] <- 100 * (gdp$China[i] - gdp$China[i-1]) / gdp$China[i-1]
}
We now analyze our inputs and assign probability distributions to the annual growth rates. In a full study this would involve comparing a number of different distributions and choosing the one that fits the input data best, but that’s well beyond the scope of this post. Instead, we’ll use the poor man’s way out: plot histograms and visually verify what we hope to be true, that the distributions are normal.
And they pretty much are. That’s good enough for our purposes. Now all we need are the distribution parameters, which are mean and standard deviation for normal distributions.
> mean(gdp$USA.growth[!is.na(gdp$USA.growth)])
[1] 7.00594
> sd(gdp$USA.growth[!is.na(gdp$USA.growth)])
[1] 2.889808
> mean(gdp$China.growth[!is.na(gdp$China.growth)])
[1] 9.90896
> sd(gdp$China.growth[!is.na(gdp$China.growth)])
[1] 10.5712
Now our input analysis is done. These are the inputs:
$$ \begin{align*} \text{USA Growth} &\sim \mathcal{N}(7.00594, 2.889808^2)\\ \text{China Growth} &\sim \mathcal{N}(9.90896, 10.5712^2) \end{align*} $$
This should make the advantage of such an approach much more obvious. Compare the standard deviations for the two countries. China is a lot more likely to have negative GDP growth in any given year. They’re also more likely to have astronomical growth.
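To put rough numbers on that comparison (a sketch of my own, using the fitted parameters above and the normal CDF via math.erf):

```python
import math

def p_negative(mean, sd):
    # P(X < 0) for X ~ N(mean, sd^2), via the standard normal CDF.
    return 0.5 * (1 + math.erf((0 - mean) / (sd * math.sqrt(2))))

print('USA:   %.1f%%' % (100 * p_negative(7.00594, 2.889808)))  # ~0.8%
print('China: %.1f%%' % (100 * p_negative(9.90896, 10.5712)))   # ~17.4%
```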
We now build and run our simulation study. The more times we run the simulation the tighter we can make our confidence interval (to a point), so we’ll pick a pretty big number somewhat arbitrarily. If we want to, we can be fairly scientific about determining how many iterations are necessary after we’ve done some runs, but we have to start somewhere.
repetitions <- 10000
This is the code for our simulation. For each iteration, it starts both countries at their 2009 GDPs. It then iterates, changing GDP randomly until China’s GDP is at least the same value as the USA’s. When that happens, it records the current year.
results <- rep(NA, repetitions)
for (i in 1:repetitions) {
    usa <- gdp$USA[length(gdp$USA)]
    china <- gdp$China[length(gdp$China)]
    year <- gdp$Year[length(gdp$Year)]

    while (TRUE) {
        year <- year + 1
        usa.growth <- rnorm(1, 7.00594, 2.889808)
        china.growth <- rnorm(1, 9.90896, 10.5712)
        usa <- usa * (1 + (usa.growth / 100))
        china <- china * (1 + (china.growth / 100))

        if (china >= usa) {
            results[i] <- year
            break
        }
    }
}
From the results vector we see that, given the data and assumptions for this model, China should surpass the USA in 2058. We also see that we can be 95% confident that the mean year this will happen is between 2057 and 2059. This is not quite the same as saying we are confident this will actually happen between those years. The result of our simulation is a probability distribution and we are discovering information about it.
> mean(results)
[1] 2058.494
> mean(results) + (sd(results) / sqrt(length(results)) * qnorm(0.025))
[1] 2057.873
> mean(results) + (sd(results) / sqrt(length(results)) * qnorm(0.975))
[1] 2059.114
So what’s wrong with this model? Well, we had to make a number of assumptions: that annual growth for each country is normally distributed, that each year’s growth is independent of the last, and that the 1960 to 2009 history is representative of the decades to come.
Here are some good simulation exercises if you’re looking to do more:
I thought it might be useful to follow up the last post with another one showing the same examples in R.
R provides a function called lm, which is similar in spirit to NumPy’s linalg.lstsq. As you’ll see, lm’s interface is a bit more tuned to the concepts of modeling.
We begin by reading in the example CSV into a data frame:
responses <- read.csv('example_data.csv')
responses
respondent vanilla.love strawberry.love chocolate.love dog.love cat.love
1 Alyssa 9 4 9 9 9
2 Ben 8 6 4 10 4
3 Cy 9 4 8 2 6
4 Eva 3 7 9 4 6
5 Lem 6 8 5 2 5
6 Louis 4 5 3 10 3
A data frame is sort of like a matrix, but with named columns. That is, we can refer to entire columns using the dollar sign. We are now ready to run least squares. We’ll create the model for predicting “dog love.” To create the “cat love” model, simply use that column name instead:
fit1 <- lm(
responses$dog.love ~ responses$vanilla.love +
responses$strawberry.love +
responses$chocolate.love
)
The syntax for lm is a little off-putting at first. This call tells it to create a model for “dog love” with respect to (the ~) a function of the form offset + x1 * vanilla love + x2 * strawberry love + x3 * chocolate love. Note that the offset is conveniently implied when using lm, so this is the same as the second model we created in Python. Now that we’ve computed the coefficients for our “dog love” model, we can ask R about it:
summary(fit1)
Call:
lm(formula = responses$dog.love ~ responses$vanilla.love + responses$strawberry.love +
responses$chocolate.love)
Residuals:
1 2 3 4 5 6
3.1827 2.9436 -4.5820 0.8069 -1.9856 -0.3657
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.9298 15.0654 1.389 0.299
responses$vanilla.love -0.2783 0.9934 -0.280 0.806
responses$strawberry.love -1.4314 1.5905 -0.900 0.463
responses$chocolate.love -0.7647 0.8214 -0.931 0.450
Residual standard error: 4.718 on 2 degrees of freedom
Multiple R-squared: 0.4206, Adjusted R-squared: -0.4485
F-statistic: 0.484 on 3 and 2 DF, p-value: 0.7272
This gives us quite a bit of information, including the coefficients for our “dog love” model and various error metrics. You can find the offset and coefficients under the Estimate column above. We quickly verify this using R’s vectorized arithmetic:
20.9298 -
0.2783 * responses$vanilla.love -
1.4314 * responses$strawberry.love -
0.7647 * responses$chocolate.love
[1] 5.8172 7.0562 6.5819 3.1928 3.9853 10.3655
You’ll notice the model is essentially the same as the one we got from NumPy. Our next step is to add in the squared inputs. We do this by adding extra terms to the modeling formula. The I() function allows us to easily apply additional operators to columns; that’s how we accomplish the squaring. We could alternatively add squared input values to the data frame, but using I() is more convenient and natural.
fit2 <- lm(responses$dog.love ~ responses$vanilla.love +
I(responses$vanilla.love^2) + responses$strawberry.love +
I(responses$strawberry.love^2) + responses$chocolate.love +
I(responses$chocolate.love^2))
summary(fit2)
Call:
lm(formula = responses$dog.love ~ responses$vanilla.love + I(responses$vanilla.love^2) +
responses$strawberry.love + I(responses$strawberry.love^2) +
responses$chocolate.love + I(responses$chocolate.love^2))
Residuals:
ALL 6 residuals are 0: no residual degrees of freedom!
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -357.444 NaN NaN NaN
responses$vanilla.love 72.444 NaN NaN NaN
I(responses$vanilla.love^2) -6.111 NaN NaN NaN
responses$strawberry.love 59.500 NaN NaN NaN
I(responses$strawberry.love^2) -5.722 NaN NaN NaN
responses$chocolate.love 7.000 NaN NaN NaN
I(responses$chocolate.love^2) NA NA NA NA
Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: NaN
F-statistic: NaN on 5 and 0 DF, p-value: NA
We can see that we get the same “dog love” model as produced by the third Python version of the last post. Again, we quickly verify that the output is the same (minus some rounding errors):
-357.444 +
72.444 * responses$vanilla.love -
6.111 * responses$vanilla.love^2 +
59.5 * responses$strawberry.love -
5.722 * responses$strawberry.love^2 +
7 * responses$chocolate.love
[1] 9.009 10.012 2.009 4.011 2.016 10.006
For purposes of a simple working example, we have collected six records of input data over three dimensions with the goal of predicting two outputs. The input data are:
$$ \begin{align*} x_1 &= \text{How much a respondent likes vanilla [0-10]}\\ x_2 &= \text{How much a respondent likes strawberry [0-10]}\\ x_3 &= \text{How much a respondent likes chocolate [0-10]} \end{align*} $$
Output data consist of:
$$ \begin{align*} b_1 &= \text{How much a respondent likes dogs [0-10]}\\ b_2 &= \text{How much a respondent likes cats [0-10]} \end{align*} $$
Below are anonymous data collected from a random sample of people.
respondent | vanilla ❤️ | strawberry ❤️ | chocolate ❤️ | dog ❤️ | cat ❤️ |
---|---|---|---|---|---|
Alyssa P Hacker | 9 | 4 | 9 | 9 | 8 |
Ben Bitdiddle | 8 | 6 | 4 | 10 | 4 |
Cy D. Fect | 9 | 4 | 8 | 2 | 6 |
Eva Lu Ator | 3 | 7 | 9 | 4 | 6 |
Lem E. Tweakit | 6 | 8 | 5 | 2 | 5 |
Louis Reasoner | 4 | 5 | 3 | 10 | 3 |
Our input is in three dimensions. Each output requires its own model, so we’ll have one for dogs and one for cats. We’re looking for functions, dog(x)
and cat(x)
, that can predict $b_1$ and $b_2$ based on given values of $x_1$, $x_2$, and $x_3$.
For both models we want to find parameters that minimize their squared residuals (read: errors). There are a number of names for this. Optimization folks like to think of it as unconstrained quadratic optimization, but it’s more common to call it least squares or linear regression. It’s not necessary to entirely understand why for our purposes, but the coefficient vector that minimizes these errors is:
$$\beta = ({A^t}A)^{-1}{A^t}b$$
This is implemented for you in the numpy.linalg
Python package, which we’ll use for examples. Much more information than you probably want can be found here.
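To make the formula concrete, here is a small sketch (with toy data of my own, not the survey data below) showing that solving the normal equations directly agrees with numpy.linalg.lstsq:

```python
import numpy

# Toy data: fit an offset plus slope to three points.
A = numpy.array([[1.0, 1.0],
                 [1.0, 2.0],
                 [1.0, 3.0]])
b = numpy.array([1.0, 2.0, 2.0])

# beta = (A^T A)^(-1) A^T b, computed via a linear solve rather
# than an explicit matrix inverse for numerical stability.
beta_direct = numpy.linalg.solve(A.T @ A, A.T @ b)
beta_lstsq = numpy.linalg.lstsq(A, b, rcond=None)[0]
```

Both approaches return the same coefficients; lstsq is preferable in practice because it also handles rank-deficient systems.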
Below is a first stab at a Python version. It runs least squares against our input and output data exactly as they are. You can see the matrix $A$ and outputs $b_1$ and $b_2$ (dog and cat love, respectively) are represented just as they are in the table.
# Version 1: No offset, no squared inputs
import numpy
A = numpy.vstack([
[9, 4, 9],
[8, 6, 4],
[9, 4, 8],
[3, 7, 9],
[6, 8, 5],
[4, 5, 3]
])
b1 = numpy.array([9, 10, 2, 4, 2, 10])
b2 = numpy.array([9, 4, 6, 6, 5, 3])
print('dog ❤️:', numpy.linalg.lstsq(A, b1, rcond=None)[0])
print('cat ❤️:', numpy.linalg.lstsq(A, b2, rcond=None)[0])
# Output:
# dog ❤️: [0.72548294 0.53045642 -0.29952361]
# cat ❤️: [2.36110929e-01 2.61934385e-05 6.26892476e-01]
The resulting model is:
dog(x) = 0.72548294 * x1 + 0.53045642 * x2 - 0.29952361 * x3
cat(x) = 2.36110929e-01 * x1 + 2.61934385e-05 * x2 + 6.26892476e-01 * x3
The coefficients before our variables correspond to beta in the formula above. Errors between observed and predicted data, shown below, are calculated and summed. For these six records, dog(x)
has a total error of 20.76 and cat(x)
has 3.74. Not great.
respondent | predicted b1 | b1 error | predicted b2 | b2 error |
---|---|---|---|---|
Alyssa P Hacker | 5.96 | 3.04 | 7.77 | 1.23 |
Ben Bitdiddle | 7.79 | 2.21 | 4.40 | 0.40 |
Cy D. Fect | 6.25 | 4.25 | 7.14 | 1.14 |
Eva Lu Ator | 3.19 | 0.81 | 6.35 | 0.35 |
Lem E. Tweakit | 7.10 | 5.10 | 4.55 | 0.45 |
Louis Reasoner | 4.66 | 5.34 | 2.83 | 0.17 |
Total error: | 20.76 | 3.74 |
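The total error above is easy to reproduce. As a quick check (a sketch using the same data as version 1), summing the absolute residuals of the dog model gives roughly 20.76:

```python
import numpy

# Feature matrix and dog-love observations from the table above
A = numpy.array([[9, 4, 9],
                 [8, 6, 4],
                 [9, 4, 8],
                 [3, 7, 9],
                 [6, 8, 5],
                 [4, 5, 3]], dtype=float)
b1 = numpy.array([9, 10, 2, 4, 2, 10], dtype=float)

x, *_ = numpy.linalg.lstsq(A, b1, rcond=None)
errors = numpy.abs(A @ x - b1)   # per-respondent absolute errors
total_error = errors.sum()       # roughly 20.76
```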
One problem with this model is that dog(x)
and cat(x)
are forced to pass through the origin. (Why is that?) We can improve it somewhat if we add an offset. This amounts to prepending 1 to every row in $A$ and adding a constant to the resulting functions. You can see the very slight difference between the code for this model and that of the previous:
# Version 2: Offset, no squared inputs
import numpy
A = numpy.vstack([
[1, 9, 4, 9],
[1, 8, 6, 4],
[1, 9, 4, 8],
[1, 3, 7, 9],
[1, 6, 8, 5],
[1, 4, 5, 3]
])
print('dog ❤️:', numpy.linalg.lstsq(A, b1, rcond=None)[0])
print('cat ❤️:', numpy.linalg.lstsq(A, b2, rcond=None)[0])
# Output:
# dog ❤️: [20.92975427 -0.27831197 -1.43135684 -0.76469017]
# cat ❤️: [-0.31744124 0.25133547 0.02978098 0.63394765]
This yields the second version of our models:
dog(x) = 20.92975427 - 0.27831197 * x1 - 1.43135684 * x2 - 0.76469017 * x3
cat(x) = -0.31744124 + 0.25133547 * x1 + 0.02978098 * x2 + 0.63394765 * x3
These models provide errors of 13.87 and 3.79. A little better on the dog side, but still not quite usable.
respondent | predicted b1 | b1 error | predicted b2 | b2 error |
---|---|---|---|---|
Alyssa P Hacker | 5.82 | 3.18 | 7.77 | 1.23 |
Ben Bitdiddle | 7.06 | 2.94 | 4.41 | 0.41 |
Cy D. Fect | 6.58 | 4.58 | 7.14 | 1.14 |
Eva Lu Ator | 3.19 | 0.81 | 6.35 | 0.35 |
Lem E. Tweakit | 3.99 | 1.99 | 4.60 | 0.40 |
Louis Reasoner | 10.37 | 0.37 | 2.74 | 0.26 |
Total error: | 13.87 | 3.79 |
The problem is that dog(x)
and cat(x)
are linear functions. Most observed data don’t conform to straight lines. Take a moment and draw the line $f(x) = x$ and the curve $f(x) = x^2$. The former makes a poor approximation of the latter.
Most of the time, people add curvature to their models using squares of the input data. We do this in our next version of the code by adding squares of the input row values to our $A$ matrix. Everything else stays the same. (In reality, you can add any function of the input data that you believe best models the underlying process, if you understand it well enough.)
# Version 3: Offset with squared inputs
import numpy
A = numpy.vstack([
[1, 9, 9**2, 4, 4**2, 9, 9**2],
[1, 8, 8**2, 6, 6**2, 4, 4**2],
[1, 9, 9**2, 4, 4**2, 8, 8**2],
[1, 3, 3**2, 7, 7**2, 9, 9**2],
[1, 6, 6**2, 8, 8**2, 5, 5**2],
[1, 4, 4**2, 5, 5**2, 3, 3**2]
])
b1 = numpy.array([9, 10, 2, 4, 2, 10])
b2 = numpy.array([9, 4, 6, 6, 5, 3])
print('dog ❤️:', numpy.linalg.lstsq(A, b1, rcond=None)[0])
print('cat ❤️:', numpy.linalg.lstsq(A, b2, rcond=None)[0])
# dog ❤️: [1.29368307 7.03633306 -0.44795498 9.98093332
# -0.75689575 -19.00757486 1.52985734]
# cat ❤️: [0.47945896 5.30866067 -0.39644128 -1.28704188
# 0.12634295 -4.32392606 0.43081918]
This gives us our final version of the model:
dog(x) = 1.29368307 + 7.03633306 * x1 - 0.44795498 * x1**2 + 9.98093332 * x2 - 0.75689575 * x2**2 - 19.00757486 * x3 + 1.52985734 * x3**2
cat(x) = 0.47945896 + 5.30866067 * x1 - 0.39644128 * x1**2 - 1.28704188 * x2 + 0.12634295 * x2**2 - 4.32392606 * x3 + 0.43081918 * x3**2
Adding curvature to our model eliminates all perceived error, at least to within 1e-16. This may seem unbelievable, but consider that we are now fitting seven coefficients to only six records, so an exact fit is expected.
respondent | predicted b1 | b1 error | predicted b2 | b2 error |
---|---|---|---|---|
Alyssa P Hacker | 9 | 0 | 9 | 0 |
Ben Bitdiddle | 10 | 0 | 4 | 0 |
Cy D. Fect | 2 | 0 | 6 | 0 |
Eva Lu Ator | 4 | 0 | 6 | 0 |
Lem E. Tweakit | 2 | 0 | 5 | 0 |
Louis Reasoner | 10 | 0 | 3 | 0 |
Total error: | 0 | 0 |
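A quick numerical check (a sketch with the same data as version 3) confirms the exact fit: the residuals for both models vanish to machine precision.

```python
import numpy

# Version 3 feature matrix: offset, each input, and its square
A = numpy.array([[1, 9, 81, 4, 16, 9, 81],
                 [1, 8, 64, 6, 36, 4, 16],
                 [1, 9, 81, 4, 16, 8, 64],
                 [1, 3, 9, 7, 49, 9, 81],
                 [1, 6, 36, 8, 64, 5, 25],
                 [1, 4, 16, 5, 25, 3, 9]], dtype=float)
b1 = numpy.array([9, 10, 2, 4, 2, 10], dtype=float)
b2 = numpy.array([9, 4, 6, 6, 5, 3], dtype=float)

max_residuals = []
for b in (b1, b2):
    x, *_ = numpy.linalg.lstsq(A, b, rcond=None)
    max_residuals.append(numpy.abs(A @ x - b).max())
```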
It should be fairly obvious how one can take this and extrapolate to much larger models. I hope this is useful and that least squares becomes an important part of your lives.
Unfortunately, some of the competitors are wily and attached to the idea of winning. They go so far as programming or hiring bots to cast thousands of votes for them. Your manager wants to know which votes are real and which ones are fake Right Now. Given very limited time, and ignoring actions that you could have taken to avoid the problem, how can you tell apart sets of good votes from those that shouldn’t be counted?
One quick-and-dirty option involves comparing histograms of interarrival times for sets of votes. Say you’re concerned that all the votes during a particular period of time or from a given IP address might be fraudulent. Put all the vote times you’re concerned about into a list, sort them, and compute their differences:
# times is a list of datetime instances from vote records
times.sort()  # ascending, so consecutive differences are nonnegative
interarrivals = [y - x for x, y in zip(times, times[1:])]
Now use matplotlib to display a histogram of these. Votes that occur naturally are likely to resemble an exponential distribution in their interarrival times. For instance, here are interarrival times for all votes received in a contest:
This subset of votes is clearly fraudulent, due to the near determinism of their interarrival times. This is most likely caused by the voting bot not taking random sleep intervals during voting. It casts a vote, receives a response, clears its cookies, and repeats:
These votes, on the other hand, are most likely legitimate. They exhibit a nice Erlang shape and appear to have natural interarrival times that one would expect:
Of course this method is woefully inadequate for rigorous detection of voting fraud. Ideally one would find a method to compute the probability that a set of votes is generated by a bot. This is enough to inform quick, ad hoc decisions though.
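As a rough numeric complement to eyeballing histograms (my own sketch, not part of the original method), the coefficient of variation of the interarrival times separates these cases well: an exponential sample has a CV near 1, while a near-clockwork bot has a CV near 0.

```python
import random
import statistics

def coefficient_of_variation(interarrivals):
    """Standard deviation over mean; ~1 for exponential (natural)
    interarrivals, ~0 for near-deterministic (bot) interarrivals."""
    return statistics.pstdev(interarrivals) / statistics.mean(interarrivals)

rng = random.Random(42)
# Natural voters: exponential interarrival times
natural = [rng.expovariate(1.0) for _ in range(5000)]
# A bot sleeping 2 seconds between votes, give or take a little jitter
bot = [2.0 + rng.uniform(-0.05, 0.05) for _ in range(5000)]
```

This is still a heuristic, not a test of significance, but it gives a single number you can threshold on in a hurry.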
]]>Data fitting is one of those tasks that everyone should have at least some exposure to. Certainly developers and analysts will benefit from a working knowledge of its fundamentals and their implementations. However, in my own reading I’ve found it difficult to locate good examples that are simple enough to pick up quickly and come with accompanying source code.
This article commences an ongoing series introducing basic data fitting techniques. With any luck they won’t be overly complex, while still being useful enough to get the point across with a real example and real data. We’ll start with a binary classification problem: presented with a series of records, each containing a set number of input values describing it, determine whether or not each record exhibits some property.
We’ll use the cancer1.dt
data from the proben1
set of test cases, which you can download here. Each record starts with 9 data points containing physical characteristics of a tumor. The second to last data point contains 1 if a tumor is benign and 0 if it is malignant. We seek to find a linear function we can run on an arbitrary record that will return a value greater than zero if that record’s tumor is predicted to be benign and less than zero if it is predicted to be malignant. We will train our linear model on the first 350 records, and test it for accuracy on the remaining rows.
This is similar to the data fitting problem found in Chvatal. Our inputs consist of a matrix of observed data, $A$, and a vector of classifications, $b$. In order to classify a record, we require another vector $x$ such that the dot product of $x$ and that record will be either greater or less than zero depending on its predicted classification.
A couple points to note before we start:
Most observed data are noisy. This means it may be impossible to locate a hyperplane that cleanly separates given records of one type from another. In this case, we must resort to finding a function that minimizes our predictive error. For the purposes of this example, we’ll minimize the sum of the absolute differences between observed and predicted values. That is, we seek $x$ attaining $\min \sum_i{\vert a_i^\intercal x-b_i \vert}$.
The slope-intercept form of a line, $f(x)=m^T x+b$, contains an offset. It should be obvious that this is necessary in our model so that our function isn’t required to pass through the origin. Thus, we’ll be adding an extra variable with the coefficient of 1 to represent our offset value.
In order to model this, we use two linear constraints for each absolute value. We minimize the sum of these. Our Linear Programming model thus looks like:
$$ \begin{align*} \min\quad & z = \sum_i{v_i}\\ \text{s.t.}\quad& v_i \geq x_0 + a_i^\intercal x - 1 &\quad\forall&\quad\text{benign tumors}\\ & v_i \geq 1 - x_0 - a_i^\intercal x &\quad\forall&\quad\text{benign tumors}\\ & v_i \geq x_0 + a_i^\intercal x - (-1) &\quad\forall&\quad\text{malignant tumors}\\ & v_i \geq -1 - x_0 - a_i^\intercal x &\quad\forall&\quad\text{malignant tumors} \end{align*} $$
In order to do this in Python, we use SCIP and SoPlex. We start by setting constants for benign and malignant outputs and providing a function to read in the training and testing data sets.
# Preferred output values for tumor categories
BENIGN = 1
MALIGNANT = -1
def read_proben1_cancer_data(filename, train_size):
    '''Loads a proben1 cancer file into train & test sets'''
    # Number of input data points per record
    DATA_POINTS = 9
    train_data = []
    test_data = []
    with open(filename) as infile:
        # Read in the first train_size lines to a training data list, and the
        # others to testing data. This allows us to test how general our model
        # is on something other than the input data.
        for line in infile.readlines()[7:]:  # skip header
            line = line.split()
            # Records = offset (x0) + remaining data points
            input = [float(x) for x in line[:DATA_POINTS]]
            output = BENIGN if line[-2] == '1' else MALIGNANT
            record = {'input': input, 'output': output}
            # Determine what data set to put this in
            if len(train_data) >= train_size:
                test_data.append(record)
            else:
                train_data.append(record)
    return train_data, test_data
The next function implements the LP model described above using SoPlex and SCIP. It minimizes the sum of residuals for each training record. This amounts to summing the absolute value of the difference between predicted and observed output data. The following function takes in input and observed output data and returns a list of coefficients. Our resulting model consists of taking the dot product of an input record and these coefficients. If the result is greater than or equal to zero, that record is predicted to be a benign tumor, otherwise it is predicted to be malignant.
from pyscipopt import Model
def train_linear_model(train_data):
    '''
    Accepts a set of input training data with known output
    values. Returns a list of coefficients to apply to
    arbitrary records for purposes of binary categorization.
    '''
    # Make sure we have at least one training record.
    assert len(train_data) > 0
    num_variables = len(train_data[0]['input'])
    # Variables are coefficients in front of the data points. It is important
    # that these be unrestricted in sign so they can take negative values.
    m = Model()
    x = [m.addVar(f'x{i}', lb=None) for i in range(num_variables)]
    # Residual for each data row
    residuals = [m.addVar(lb=None, ub=None) for _ in train_data]
    for r, d in zip(residuals, train_data):
        # r will be the absolute value of the difference between observed
        # and predicted values. We can model absolute values such as
        # r >= |foo| as:
        #
        #     r >= foo
        #     r >= -foo
        m.addCons(sum(xi * a for xi, a in zip(x, d['input'])) + r >= d['output'])
        m.addCons(sum(xi * a for xi, a in zip(x, d['input'])) - r <= d['output'])
    # Find and return coefficients that minimize the sum of residuals.
    m.setObjective(sum(residuals))
    m.setMinimize()
    m.optimize()
    solution = m.getBestSol()
    return [solution[xi] for xi in x]
We also provide a convenience function for counting the number of correct predictions by our resulting model against either the test or training data sets.
def count_correct(data_set, coefficients):
    '''Returns the number of correct predictions.'''
    correct = 0
    for d in data_set:
        result = sum(x * y for x, y in zip(coefficients, d['input']))
        # Do we predict the same as the output?
        if (result >= 0) == (d['output'] >= 0):
            correct += 1
    return correct
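To see count_correct in action on made-up data (the coefficients and records below are hypothetical, not from cancer1.dt; the function is repeated so the snippet stands alone):

```python
def count_correct(data_set, coefficients):
    '''Returns the number of correct predictions.'''
    correct = 0
    for d in data_set:
        result = sum(x * y for x, y in zip(coefficients, d['input']))
        if (result >= 0) == (d['output'] >= 0):
            correct += 1
    return correct

# Hypothetical model: offset 0.5, then weights 1 and -1 on two features
coefficients = [0.5, 1.0, -1.0]
records = [
    {'input': [1, 3, 1], 'output': 1},   # 0.5 + 3 - 1 = 2.5  -> benign, correct
    {'input': [1, 0, 2], 'output': -1},  # 0.5 + 0 - 2 = -1.5 -> malignant, correct
    {'input': [1, 1, 2], 'output': 1},   # 0.5 + 1 - 2 = -0.5 -> malignant, wrong
]
num_correct = count_correct(records, coefficients)  # 2 of 3
```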
Finally we write a main method to read in the data, build our linear model, and test its efficacy.
from pprint import pprint
if __name__ == '__main__':
    # Specs for this input file
    INPUT_FILE_NAME = 'cancer1.dt'
    TRAIN_SIZE = 350
    train_data, test_data = read_proben1_cancer_data(
        INPUT_FILE_NAME,
        TRAIN_SIZE
    )
    # Add the offset variable to each of our data records
    for data_set in [train_data, test_data]:
        for row in data_set:
            row['input'] = [1] + row['input']
    coefficients = train_linear_model(train_data)
    print('coefficients:')
    pprint(coefficients)
    # Print % of correct predictions for each data set
    correct = count_correct(train_data, coefficients)
    print(
        '%s / %s = %.02f%% correct on training set' % (
            correct, len(train_data),
            100 * float(correct) / len(train_data)
        )
    )
    correct = count_correct(test_data, coefficients)
    print(
        '%s / %s = %.02f%% correct on testing set' % (
            correct, len(test_data),
            100 * float(correct) / len(test_data)
        )
    )
The result of running this model against the cancer1.dt
data set is:
coefficients:
[1.4072882449702786,
-0.14014055927954652,
-0.6239513714263405,
-0.26727681774258882,
0.067107753841131157,
-0.28300216102808429,
-1.0355594670918404,
-0.22774451038152174,
-0.69871243677663608,
-0.072575089848659444]
328 / 350 = 93.71% correct on training set
336 / 349 = 96.28% correct on testing set
The accuracy is pretty good against both the training and testing sets, so this particular model generalizes well. This is about the simplest model we can implement for data fitting, and we’ll get to more complicated ones later, but it’s nice to see we can do so well so quickly. The coefficients correspond to using a function of this form, rounded to three decimal places:
$$ \begin{align*} f(x) =\ &1.407 - 0.140 x_1 - 0.624 x_2 - 0.267 x_3 + 0.067 x_4 - \\ &0.283 x_5 - 1.036 x_6 - 0.228 x_7 - 0.699 x_8 - 0.073 x_9 \end{align*} $$
cancer1.dt
data file from proben1
One of the most useful tools one learns in an Operations Research curriculum is Monte Carlo Simulation. Its utility lies in its simplicity: one can learn vital information about nearly any process, be it deterministic or stochastic, without wading through the grunt work of finding an analytical solution. It can be used for off-the-cuff estimates or as a proper scientific tool. All one needs to know is how to simulate a given process and its appropriate probability distributions and parameters if that process is stochastic.
Here’s how it works:
1. Simulate the process of interest a large number of times, recording each run’s outcome. The outcome may be a quantity (time spent waiting for a bus) or a boolean (whether a given event occurred).
2. Compute the sample mean and variance of the recorded outcomes.
In the case of time spent waiting for a bus, the sample mean and variance are estimators of mean and variance for one’s wait time. In the boolean case, these represent the probability that the given event will occur.
One can think of Monte Carlo Simulation as throwing darts. Say you want to find the area under a curve without integrating. All you must do is draw the curve on a wall and throw darts at it randomly. After you’ve thrown enough darts, the area under the curve can be approximated using the percentage of darts that end up under the curve times the total area.
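The dart-throwing idea fits in a few lines of Python (a sketch; the function name and the curve are my own choices). Here we approximate the area under $f(x)=x^2$ on $[0,1]$, which is exactly $1/3$:

```python
import random

def area_under_curve(f, x_max, y_max, darts=100_000, seed=42):
    """Throw darts uniformly at the bounding box [0, x_max] x [0, y_max]
    and scale the fraction landing under the curve by the box area."""
    rng = random.Random(seed)
    under = 0
    for _ in range(darts):
        x = rng.uniform(0, x_max)
        y = rng.uniform(0, y_max)
        if y <= f(x):
            under += 1
    return (under / darts) * (x_max * y_max)

estimate = area_under_curve(lambda x: x * x, 1.0, 1.0)  # close to 1/3
```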
This technique is often performed using a spreadsheet, but that can be a bit clunky and may make more complex simulations difficult. I’d like to spend a minute showing how it can be done in Python. Consider the following scenario:
Passengers for a train arrive according to a Poisson process with a mean of 100 per hour. The next train arrives after an exponentially distributed time with a rate of 5 per hour. How many passengers will be aboard the train?
We can simulate this using the fact that a Poisson process can be represented as a string of events occurring with exponential inter-arrival times. We use the sim()
function below to generate the number of passengers for random instances of the problem. We then compute sample mean and variance for these values.
import random
PASSENGERS = 100.0
TRAINS = 5.0
ITERATIONS = 10000
def sim():
    passengers = 0
    # Determine when the train arrives
    train = random.expovariate(TRAINS)
    # Count the number of passenger arrivals before the train
    now = 0.0
    while True:
        now += random.expovariate(PASSENGERS)
        if now >= train:
            break
        passengers += 1
    return passengers

if __name__ == '__main__':
    output = [sim() for _ in range(ITERATIONS)]
    total = sum(output)
    mean = total / len(output)
    sum_sqrs = sum(x * x for x in output)
    variance = (sum_sqrs - total * mean) / (len(output) - 1)
    print('E[X] = %.02f' % mean)
    print('Var(X) = %.02f' % variance)
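This particular question also has a closed form we can sanity-check against (my addition, not part of the original example): the number of Poisson arrivals during an exponential train interarrival is geometrically distributed with mean $\lambda/\mu = 100/5 = 20$. A condensed rerun of the simulation confirms it:

```python
import random
import statistics

random.seed(7)
PASSENGERS, TRAINS, ITERATIONS = 100.0, 5.0, 10_000

def sim():
    train = random.expovariate(TRAINS)  # when the train arrives
    now, passengers = 0.0, 0
    while True:
        now += random.expovariate(PASSENGERS)
        if now >= train:
            return passengers
        passengers += 1

mean = statistics.mean(sim() for _ in range(ITERATIONS))  # near 20
```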
$$ \small \begin{align*} \min\quad & z = \sum_i \sum_{j\ne i} d_{ij} x_{ij}\\ \text{s.t.}\quad& \sum_{j\ne i} x_{ij} = 1 &\quad\forall&\ i & \text{leave each city once}\\ & \sum_{i\ne j} x_{ij} = 1 &\quad\forall&\ j & \text{enter each city once}\\ & x_{ij} \in \{0,1\} &\quad\forall&\ i,j \end{align*} $$
This appears like a reasonable formulation until we solve it and see that our solution contains disconnected subtours. Suppose we have four cities, labeled $A$ through $D$. Connecting $A$ to $B$, $B$ to $A$, $C$ to $D$ and $D$ to $C$ provides a feasible solution to our formulation, but does not constitute a cycle. Here is a more concrete example of two disconnected subtours $\{(1,5),(5,1)\}$ and $\{(2,3),(3,4),(4,2)\}$ over five cities:
ampl: display x;
x [*,*]
: 1 2 3 4 5 :=
1 0 0 0 0 1
2 0 0 1 0 0
3 0 0 0 1 0
4 0 1 0 0 0
5 1 0 0 0 0
;
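To see mechanically that this solution is not a tour, follow the successor of each city and split the assignment into cycles (a sketch in Python, since the rest of this post uses AMPL):

```python
def extract_subtours(successor):
    """Split an assignment (city -> next city) into disjoint cycles."""
    remaining = set(successor)
    tours = []
    while remaining:
        start = min(remaining)
        tour, city = [], start
        while city in remaining:
            tour.append(city)
            remaining.remove(city)
            city = successor[city]
        tours.append(tour)
    return tours

# Successors read off the solution displayed above
successor = {1: 5, 2: 3, 3: 4, 4: 2, 5: 1}
tours = extract_subtours(successor)  # [[1, 5], [2, 3, 4]]
```

A proper tour would come back as a single cycle covering all five cities; here we get two.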
Realizing we just solved the Assignment Problem, we now add subtour elimination constraints. These require that any proper, non-empty subset $S$ of our $n$ cities is connected by at most $|S|-1$ active edges:
$$ \sum_{i \in S} \sum_{j \in S,\, j \ne i} x_{ij} \leq |S|-1 \quad\forall\ S \subset \{1, \dots, n\},\ S \ne \emptyset $$
Indexing subtour elimination constraints over a power set of the cities completes the formulation. However, this requires an additional $\sum_{k=2}^{n-1} \binom{n}{k}$ rows tacked onto the end of our matrix, which is clearly intractable for large $n$. The most current computers can handle using this approach is around 19 cities. It remains an instructive tool for understanding the combinatorial explosion that occurs in problems like the TSP and is worth translating into a modeling language. So how does one model it on a computer?
Unfortunately, AMPL, the gold standard in mathematical modeling languages, has no built-in power set operator. Creating a power set in AMPL requires going through a few contortions. The following code demonstrates power and index sets over four cities:
set cities := 1 .. 4 ordered;
param n := card(cities);
set indices := 0 .. (2^n - 1);
set power {i in indices} := {c in cities: (i div 2^(ord(c) - 1)) mod 2 = 1};
display cities;
display n;
display indices;
display power;
This yields the following output:
set cities := 1 2 3 4;
n = 4
set indices := 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15;
set power[0] := ; # empty
set power[1] := 1;
set power[2] := 2;
set power[3] := 1 2;
set power[4] := 3;
set power[5] := 1 3;
set power[6] := 2 3;
set power[7] := 1 2 3;
set power[8] := 4;
set power[9] := 1 4;
set power[10] := 2 4;
set power[11] := 1 2 4;
set power[12] := 3 4;
set power[13] := 1 3 4;
set power[14] := 2 3 4;
set power[15] := 1 2 3 4;
Note how the index set contains an index for each row in our power set. We can now generate the subtour elimination constraints:
var x {cities cross cities} binary;
s.t. subtours {i in indices: card(power[i]) > 1 and card(power[i]) < card(cities)}:
sum {(c,k) in power[i] cross power[i]: k != c} x[c,k] <= card(power[i]) - 1;
expand subtours;
subject to subtours[3]: x[1,2] + x[2,1] <= 1;
subject to subtours[5]: x[1,3] + x[3,1] <= 1;
subject to subtours[6]: x[2,3] + x[3,2] <= 1;
subject to subtours[7]: x[1,2] + x[1,3] + x[2,1] + x[2,3] + x[3,1] + x[3,2] <= 2;
subject to subtours[9]: x[1,4] + x[4,1] <= 1;
subject to subtours[10]: x[2,4] + x[4,2] <= 1;
subject to subtours[11]: x[1,2] + x[1,4] + x[2,1] + x[2,4] + x[4,1] + x[4,2] <= 2;
subject to subtours[12]: x[3,4] + x[4,3] <= 1;
subject to subtours[13]: x[1,3] + x[1,4] + x[3,1] + x[3,4] + x[4,1] + x[4,3] <= 2;
subject to subtours[14]: x[2,3] + x[2,4] + x[3,2] + x[3,4] + x[4,2] + x[4,3] <= 2;
While this does work, the code for generating the power set looks like voodoo. Understanding it required piece-by-piece decomposition, an exercise I suggest you go through yourself if you have a copy of AMPL and 15 minutes to spare:
set foo {c in cities} := {ord(c)};
set bar {c in cities} := {2^(ord(c) - 1)};
set baz {i in indices} := {c in cities: i div 2^(ord(c) - 1)};
set qux {i in indices} := {c in cities: (i div 2^(ord(c) - 1)) mod 2 = 1};
display foo;
display bar;
display baz;
display qux;
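The bit trick is easier to see outside of AMPL. Here is the same construction in Python (a sketch of the idea, not part of the original post): subset $i$ contains the city at position $p$ exactly when bit $p$ of $i$ is set.

```python
def power_set(cities):
    """Subset i contains the city at position p iff (i div 2^p) mod 2 = 1,
    mirroring the AMPL definition of the 'power' indexed set."""
    n = len(cities)
    return [
        [c for p, c in enumerate(cities) if (i // 2 ** p) % 2 == 1]
        for i in range(2 ** n)
    ]

subsets = power_set([1, 2, 3, 4])  # subsets[5] is [1, 3], matching power[5]
```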
This may be an instance where open source leads commercial software. The good folks who produce the SCIP Optimization Suite provide an AMPL-like language called ZIMPL with a few additional useful features. One of these is power sets. Compared to the code above, doesn’t this look refreshing?
set cities := {1 to 4};
set power[] := powerset(cities);
set indices := indexset(power);
For $n$ periods with per-period fixed set-up cost $f_t$, unit production cost $p_t$, unit storage cost $h_t$, and demand $d_t$, we define decision variables related to production and storage quantities:
$$ \small \begin{align*} x_t &= \text{units produced in period}\ t\\ s_t &= \text{stock at the end of period}\ t\\ y_t &= 1\ \text{if production occurs in period}\ t, 0\ \text{otherwise} \end{align*} $$
One can minimize overall cost for satisfying all demand on time using the following model per Wolsey (1998), defined slightly differently here:
$$ \small \begin{align*} \min\quad & z = \sum_t{p_t x_t} + \sum_t{h_t s_t} + \sum_t{f_t y_t}\\ \text{s.t.}\quad& x_1 = d_1 + s_1\\ & s_{t-1} + x_t = d_t + s_t &\quad\forall&\ t > 1\\ & x_t \leq M y_t &\quad\forall&\ t\\ & s_t, x_t \geq 0 &\quad\forall&\ t\\ & y_t \in \{0,1\} &\quad\forall&\ t \end{align*} $$
According to Wolsey, page 11, given that $s_t = \sum_{i=1}^t (x_i - d_i)$ and defining new constants $K = \sum_{t=1}^n h_t(\sum_{i=1}^t d_i)$ and $c_t = p_t + \sum_{i=t}^n h_i$, the objective function can be rewritten as $z = \sum_t c_t x_t + \sum _t f_t y_t - K$. The book lacks a proof of this and it seems a bit non-obvious, so I attempt an explanation in somewhat painstaking detail here.
$$ \small \begin{align*} &\text{Proof}:\\ & & \sum_t p_t x_t + \sum_t h_t s_t + \sum_t f_t y_t &= \sum_t c_t x_t + \sum _t f_t y_t - K\\ &\text{1. Remove} \sum_t f_t y_t:\\ & & \sum_t p_t x_t + \sum_t h_t s_t &= \sum_t c_t x_t - K\\ &\text{2. Expand}\ K\ \text{and}\ c_t:\\ & & \sum_t p_t x_t + \sum_t h_t s_t &= \sum_t (p_t + \sum_{i=t}^n h_i) x_t - \sum_t h_t (\sum_{i=1}^t d_i)\\ &\text{3. Remove}\ \sum_t p_t x_t:\\ & & \sum_t h_t s_t &= \sum_t x_t (\sum_{i=t}^n h_i) - \sum_t h_t (\sum_{i=1}^t d_i)\\ &\text{4. Expand}\ s_t:\\ & & \sum_t h_t (\sum_{i=1}^t x_i) - \sum_t h_t (\sum_{i=1}^t d_i) &= \sum_t x_t (\sum_{i=t}^n h_i) - \sum_t h_t (\sum_{i=1}^t d_i)\\ &\text{5. Remove}\ \sum_t h_t (\sum_{i=1}^t d_i):\\ & & \sum_t h_t (\sum_{i=1}^t x_i) &= \sum_t x_t (\sum_{i=t}^n h_i) \end{align*} $$
The result from step 5 becomes obvious upon expanding its left and right-hand terms:
$$ h_1 x_1 + h_2 (x_1 + x_2) + \cdots + h_n (x_1 + \cdots + x_n) =\\ x_1 (h_1 + \cdots + h_n) + x_2 (h_2 + \cdots + h_n) + \cdots + x_n h_n $$
In matrix notation, with $h$ and $x$ as column vectors in $\mathbf{R}^n$ and $L$ and $U$ the $n \times n$ lower and upper triangular matrices of ones, respectively, this can be written as:
$$ \small \begin{pmatrix} h_1 \cdots h_n \end{pmatrix} \begin{pmatrix} 1 \cdots 0 \\ \vdots \ddots \vdots \\ 1 \cdots 1 \end{pmatrix} \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} x_1 \cdots x_n \end{pmatrix} \begin{pmatrix} 1 \cdots 1 \\ \vdots \ddots \vdots \\ 0 \cdots 1 \end{pmatrix} \begin{pmatrix} h_1 \\ \vdots \\ h_n \end{pmatrix} $$
or $h^T L x = x^T U h$.
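A quick numerical check of this identity (my own sketch, not from Wolsey), evaluating both sides as the nested sums they expand to:

```python
import random

rng = random.Random(0)
n = 6
h = [rng.uniform(0, 10) for _ in range(n)]
x = [rng.uniform(0, 10) for _ in range(n)]

# h^T L x: sum over t of h_t * (x_1 + ... + x_t)
lhs = sum(h[t] * sum(x[: t + 1]) for t in range(n))
# x^T U h: sum over t of x_t * (h_t + ... + h_n)
rhs = sum(x[t] * sum(h[t:]) for t in range(n))
```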