Reconstructing and Analyzing Family Trees (my PhD project)

Usually, when I think I’ve come up with a great idea, I wait until the next day to see if it still seems as good. Most of the time it doesn’t. However, when Arno Solin first told me about the HisKi database, which contains digitized church records (births, deaths, marriages, migration) from Finland spanning from the 1600s to the late 1800s, and about the analysis possibilities this data could offer, I immediately felt compelled to start working on it. The next day I was even more excited.

Eventually, I told my professor about the idea and we decided that I would start my PhD research around the questions arising from the HisKi dataset. Other people also liked the idea, and so I was chosen to present my research in a pitching competition called Falling Walls Lab in Berlin a few months ago (Aalto news wrote about my trip here). Here’s a video of my two-and-a-half-minute presentation, where I explain what kind of research questions I’m aiming to address.

In summary, my two main research problems are the following:

  1. Develop algorithms for automatically reconstructing the whole Finnish family tree (genealogical tree).
  2. Analyze the structure of the reconstructed tree.

A typical birth record in the HisKi data lists at least the names of the parents in addition to the given name and the date of birth of the child. The main challenge is that there are many people with the same name, so identifying the parents of a child is not straightforward. To make the problem even more challenging (but also more interesting!), there are often many alternative spellings for a single name, and some records are missing altogether.
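To make the matching problem concrete, here is a minimal sketch of how one might list candidate birth records for a child's father. It is not the method I will end up using, just an illustration under simple assumptions: the `BirthRecord` fields, the names in the example, the similarity threshold and the age limits are all made up, and the spelling comparison uses the generic `difflib.SequenceMatcher` from the Python standard library rather than anything tailored to Finnish names.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class BirthRecord:
    # Hypothetical, simplified record layout (not the actual HisKi schema).
    child: str
    father: str
    mother: str
    year: int


def name_similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_parent_records(child_record, records, parent="father",
                             min_similarity=0.8, min_age=15, max_age=60):
    """List earlier birth records whose child could be the named parent.

    A record is a candidate if the name is similar enough and the implied
    age of the parent at the child's birth is plausible. The thresholds
    are illustrative assumptions, not tuned values.
    """
    parent_name = getattr(child_record, parent)
    candidates = []
    for rec in records:
        age_at_birth = child_record.year - rec.year
        if not (min_age <= age_at_birth <= max_age):
            continue
        score = name_similarity(parent_name, rec.child)
        if score >= min_similarity:
            candidates.append((score, rec))
    # Best-scoring candidates first.
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return candidates


# Made-up example: two plausible fathers and one who would be too old.
records = [
    BirthRecord("Matti Virtanen", "Juho Virtanen", "Maria", 1750),
    BirthRecord("Matts Wirtanen", "Erik Wirtanen", "Anna", 1752),
    BirthRecord("Matti Virtanen", "Antti Virtanen", "Kaisa", 1700),
]
child = BirthRecord("Anna", "Matti Virtanen", "Liisa", 1780)

for score, rec in candidate_parent_records(child, records):
    print(f"{score:.2f}  {rec.child} (born {rec.year})")
```

Even in this toy case two candidates survive, which is exactly the ambiguity described above: the spelling and the dates alone do not tell which of them is the father, so additional evidence (parish, the mother's name, later records) is needed.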

After solving the first problem (even to some extent), there is a huge number of interesting questions one can start looking at. For example: Are there some branches of the tree that don’t mix even though the people have lived nearby, suggesting some sort of a class division? How have migration patterns between cities evolved over the centuries? What are the effects of events such as disease outbreaks or wars?

I have already reconstructed the first trees, but a lot of preprocessing and cleaning of the data still needs to take place before I get to tackle the questions mentioned above. In the meantime, here’s a simple plot of the number of birth records per year. It shows clear drops around three wars and the Finnish famine of 1866–1868. The number of records also drops after 1850, since most of the documents newer than this have not been digitized yet.
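The counting behind such a plot is simple. The sketch below tallies birth records per year and flags years where the count falls well below the previous year's. The function names, the 0.7 ratio and the input numbers are made up for illustration and are not taken from the HisKi data.

```python
from collections import Counter


def births_per_year(birth_years):
    """Count birth records per year, returned as a dict sorted by year."""
    return dict(sorted(Counter(birth_years).items()))


def years_with_drop(counts, ratio=0.7):
    """Return years whose count is below `ratio` times the previous listed year's count."""
    years = sorted(counts)
    return [
        year
        for prev, year in zip(years, years[1:])
        if counts[prev] > 0 and counts[year] < ratio * counts[prev]
    ]


# Toy input: one entry per birth record, holding only the birth year.
toy_years = [1800] * 10 + [1801] * 9 + [1802] * 5 + [1803] * 4
counts = births_per_year(toy_years)
print(counts)
print(years_with_drop(counts))
```

A year-over-year ratio is a crude indicator: it only flags the first year of a longer dip, and it cannot tell a real drop in births from a gap in the surviving or digitized records.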

