Data Science is Hard: History, or It Seemed Like a Good Idea At the Time

I’m mentoring a Summer of Code project this summer about redesigning the “about:telemetry” interface
that ships with each and every version of Firefox.

The minute the first student (:flyingrub) asked me “What is a parent payload and child payload?” I knew I was going to be asked a lot of questions.

To least-effectively answer these questions, I’ll blog the answers as narratives. And to start with this question, here’s how the history of a project makes it difficult to collect data from it.

In the Beginning — or, rather, in the middle of October 2015 when I was hired at Mozilla (so, at my
Beginning) — there was single-process Firefox, and all was good. Users had many tabs, but one process. Users had many bookmarks, but one process. Users had many windows, but one process. All this and the web contents themselves were all sharing time within a single construct of electrons and bits and code and pixels: vying with each other for control of the filesystem, the addressable space of RAM, the network resources, and CPU scheduling.

Not satisfied with things being just “good”, we took a page from the book penned by Google Chrome and decided the time was ripe to split the browser into many processes so that a critical failure in one would not trouble the others. To begin with, because our code is venerable, we decided that we would try two processes. One of these twins would be in charge of the browser and the other would be in charge of the web contents.

This project was called Electrolysis
after the mechanism by which one might split water into Hydrogen and Oxygen using electricity.

Suddenly the browser became responsive, even in the face of the worst JavaScript written by the least experienced dev at the most privileged startup in Silicon Valley. And out-of-memory errors decreased in frequency because the browser’s memory and the web contents’ memory were able to grow without interfering with each other.

Remember, our code is venerable. Remember, our code hearkens from its single-process past.

Our data-collection code was written in that single-process past. But we had two processes with input events that need to be timed to find problems. We had two processes with memory allocations that need to be examined for regressions.

So the data collection code was made aware that there could be two types of process: parent and child.

Alas, not just one child. There could be many child processes in a row if some webpage were naughty and brought down the child in its anger. So the data collection code was made aware there could be many batches of data from child processes, and one batch of data from parent processes.

The parent data was left looking like single-process data, out in the root of the data collection payload. Child processes’ data were placed in an array of childPayloads where each payload echoed the structure of the parent.

Then, not content with “good”, I had to come along in bug 1218576
, a bug whose number I still have locked in my memory, for good or ill.

Firefox needs to have multiple child processes of different types, simultaneously. And many of some of those several types, also simultaneously. What was going to be a quick way to ensure that childPayloads was always of length 1 turned into a months-long exercise to put data exactly where we wanted it to be.

And so now we have childPayloads where the “weird” content child data that resists aggregation remains, and we also have payload.processes..* where the cool and hip data lives: histograms, scalars, and keyed variants of both.

Already this approach is showing dividends as some proportions of Nightly users are getting a gpu process, and others are getting some number of content processes. The data files neatly into place with minimal intervention required.

But it means about:telemetry needs to know whether you want the parent’s “weird” data or the child’s. And which child was that, again?

And about:telemetry also needs to know whether you want the parent’s “cool” data, or the content child’s, or the gpu child’s.

So this means that within about:telemetry there are now five places where you can select what process you want. One for “weird” data, and one for each of the four kinds of “cool” data.

Sadly, that brings my storytelling to a close, having reached the present day. Hopefully after this Summer’s Code is done, this will have a happier, more-maintainable, and responsively-designed ending.

But until now, remember that “accident of history” is the answer to most questions. As such it behooves you to learn history well.


稿源:chuttenblog (源链) | 关于 | 阅读提示

本站遵循[CC BY-NC-SA 4.0]。如您有版权、意见投诉等问题,请通过eMail联系我们处理。
酷辣虫 » 移动互联 » Data Science is Hard: History, or It Seemed Like a Good Idea At the Time

喜欢 (0)or分享给?

专业 x 专注 x 聚合 x 分享 CC BY-NC-SA 4.0

使用声明 | 英豪名录


忘记密码 ?