Creating a Markov chain Monte Carlo simulation in Python

Project completed in week 8 (16.11.-20.11.20) of the Data Science Bootcamp at Spiced Academy in Berlin.

This week we learned to make a Markov Chain Monte Carlo (MCMC) simulation of new customers in a supermarket, based on data about customer paths from entrance to checkout through four aisles (fruit, drinks, dairy, and spices) recorded on five days from 7 am to 10 pm. This project was particularly challenging for two reasons: it involved object-oriented programming (OOP) and team work. In this post I’ll give you an overview of our workflow.

Planning the project

I teamed up with two colleagues and right from the start we planned and divided our tasks. There are many things to track when working on a team project, but they can be classified into three main categories:

Idea	Design	Implementation
user story	state diagram	code-debug cycle
requirements.txt	ML model architecture	skeleton code
workflow tracker	class diagram	prototype
	entity-relationship diagram	baseline models
	flowchart
	data flow diagram
	pseudocode

Creating the simulation

Markov-Chain Monte Carlo (MCMC) is a method used to approximate the posterior distribution of a parameter of interest by random sampling in a probabilistic space. As the name says, it combines two properties:

Markov-Chain uses each random sample (generated by a sequential process) as a stepping stone to generate the next random sample. While each new sample depends on the one before it, new samples do not depend on any samples before the previous one.
Monte Carlo estimates the properties of a distribution by examining random samples from the distribution.

The first step in our project was to calculate the transition probability matrix, i.e. the probability that a customer moves from one aisle to another. For example, in our data there was a 28% probability that a customer goes from drinks to dairy.

Next, we had to generate new customers and simulate their paths based on the calculated transition probabilities. For this, we had to use OOP. We created three classes:

a Customer class that simulates a new customers and their path from entrance to checkout
a Supermarket class that manages multiple Customer instances that are currently in the market
a SupermarketMap class which visualizes the supermarket background. We used numpy and OpenCV to make the supermarket visualization, by basically creating and slicing arrays, and inserting icons in specific tiles.

To calculate the customers’ path from entrance to checkout, we used the A* (star) algorithm. This is a search algorithm used for finding the shortest path in environments where there are obstacles along the way (like walls or aisles in the supermarket). It works as a weighted graph, which in this case is our supermarket with 8 nodes (entrance, fruit, spices, dairy, drinks, three checkouts) and edges (paths) between them. A* starts from a specific starting node of a graph (entrance) and aims to find a path to the given goal node (checkout) having the smallest cost (least distance traveled, shortest time, etc.). Specifically, A* selects the path that minimizes

f (n) = g(n) + h(n)

n is the next node on the path
g(n) is the cost of the path from the start node to n
h(n) is a heuristic function that estimates the cost of the cheapest path from n to the goal

A* terminates when the path it chooses to extend is a path from start to goal or if there are no paths eligible to be extended.

Friday Lightning Talk

At the end of the week, we gave a team presentation of our project. I presented the exploratory data analysis and visualization with Plotly, while my colleagues talked about the implementation of the A* algorithm and our experience and lessons learned from pair programming.

cover