Tech debt is insidious, a kind of socio-infrastructural subprime crisis that's unfolding around us in slow motion. Our digital infrastructure is built atop layers and layers and layers of code that's insecure due to a combination of bad practices and bad frameworks.

Even people who write secure code import insecure libraries, or plug their code into insecure authorization systems or databases. Like asbestos in the walls, this cruft has been fragmenting, drifting into our air a crumb at a time.


We ignored these little breaches, treating them as containable, and now the walls are rupturing and choking clouds of toxic waste are everywhere.

The infosec apocalypse was decades in the making. The machine learning apocalypse, on the other hand...

ML has serious, institutional problems, the kind of thing you'd expect in a nascent discipline, which you'd hope would be worked out before it went into wide deployment.


ML is rife with all forms of statistical malpractice - AND it's being used for high-speed, high-stakes automated classification and decision-making, as if it was a proven science whose professional ethos had the sober gravitas you'd expect from, say, civil engineering.

Civil engineers spend a lot of time making sure the buildings and bridges they design don't kill the people who use them. Machine learning?



Hundreds of ML teams built models to automate covid detection, and every single one was useless or worse.

The models failed because their builders didn't observe basic statistical rigor. One common failure mode?

Treating data that was known to be of poor quality as if it was reliable because good data was not available.

Obtaining good data and/or cleaning up bad data is tedious, repetitive grunt-work.



It's unglamorous, time-consuming, and low-waged. Cleaning data is the equivalent of sterilizing surgical implements - vital, high-skilled, and invisible unless someone fails to do it.

It's work performed by anonymous, low-waged adjuncts to the surgeon, who is the star of the show and who gets credit for the success of the operation.


The title of a Google Research team (Nithya Sambasivan et al) paper published in ACM CHI beautifully summarizes how this is playing out in ML: "Everyone wants to do the model work, not the data work: Data Cascades in High-Stakes AI."

The paper analyzes ML failures from a cross-section of high-stakes projects (health diagnostics, anti-poaching, etc) in East Africa, West Africa and India.


They trace the failures of these projects to poor data quality, and drill into the factors that caused the data problems.

The failures stem from a variety of causes. First, data-gathering and cleaning are low-waged, invisible, and thankless work. Front-line workers who produce the data - like medical professionals who have to do extra data-entry - are not compensated for extra work.


Often, no one even bothers to explain what the work is for. Some of the data-cleaning workers are atomized pieceworkers, such as those who work for Amazon's Mechanical Turk, who lack both the context in which the data was gathered and the context for how it will be used.

This data is passed to model-builders, who lack related domain expertise.


The hastily labeled X-ray of a broken bone, annotated by an unregarded and overworked radiologist, is passed on to a data-scientist who knows nothing about broken bones and can't assess the labels.

This is an age-old problem in automation, pre-dating computer science and even computers. The "scientific management" craze that started in the 1880s saw technicians observing skilled workers with stopwatches and clipboards, then restructuring the workers' jobs by fiat.


Rather than engaging in the anthropological work that Clifford Geertz called "thick description," the management "scientists" discarded workers' qualitative experience, then treated their own assessments as quantitative and thus empirical.

How long a task takes is empirical, but what you call a "task" is subjective. Computer scientists take quantitative measurements, but decide what to measure on the basis of subjective judgment.


This empiricism-washing sleight of hand is endemic to ML's claims of neutrality.

In the 2000s, there was a movement to produce tools and training that would let domain experts build their own software - rather than delivering "requirements" to a programmer, a bookstore clerk or nurse or librarian could just make their own tools using Visual Basic.

This was the radical humanist version of "learn to code" - a call to seize the means of computation and program, rather than being programmed.


Over time, it was watered down, and today it lives on as a weak call for domain experts to be included in production.

The disdain for the qualitative expertise of domain experts who produce data is a well-understood guilty secret within ML circles, embodied in Frederick Jelinek's ironic quip: "Every time I fire a linguist, the performance of the speech recognizer goes up."

But a thick understanding of context is vital to improving data-quality.


Take the American "voting wars," where GOP-affiliated vendors are brought in to purge voting rolls of duplicate entries - people who are registered to vote in more than one place.

These tools have a 99% false-positive rate.

Ninety. Nine. Percent.

To understand how they go so terribly wrong, you need a thick understanding of the context in which the data they analyze is produced.


The core assumption of these tools is that two people with the same name and date of birth are probably the same person.

But guess what month people named "June" are likely to be born in? Guess what birthday is shared by many people named "Noel" or "Carol"?


Many states represent unknown birthdays as "January 1," or "January 1, 1901." If you find someone on a voter roll whose birthday is represented as 1/1, you have no idea what their birthday is, and they almost certainly don't share a birthday with other 1/1s.
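A minimal sketch of the failure, using invented names and records: a naive matcher treats a shared placeholder birthdate as evidence of a duplicate registration, while a more careful one treats sentinel dates as "unknown" and refuses to match on them.

```python
from datetime import date

# Hypothetical voter-roll records: (name, date_of_birth).
# All names and dates here are invented for illustration.
roll_a = [("Juan Gomez", date(1901, 1, 1)), ("June Park", date(1980, 6, 15))]
roll_b = [("Juan Gomez", date(1901, 1, 1)), ("June Park", date(1955, 6, 2))]

# Placeholder values some states use to represent an unknown birthday.
SENTINEL_DOBS = {date(1901, 1, 1), date(1900, 1, 1)}

def naive_matches(a, b):
    """Flag any pair sharing name and DOB -- the flawed core assumption."""
    return [(ra, rb) for ra in a for rb in b
            if ra[0] == rb[0] and ra[1] == rb[1]]

def cautious_matches(a, b):
    """Treat sentinel DOBs as unknown: a shared name plus a shared
    placeholder date is NOT evidence of a duplicate registration."""
    return [(ra, rb) for ra in a for rb in b
            if ra[0] == rb[0] and ra[1] == rb[1]
            and ra[1] not in SENTINEL_DOBS]

print(len(naive_matches(roll_a, roll_b)))     # 1: the sentinel-date pair
print(len(cautious_matches(roll_a, roll_b)))  # 0
```

The naive version flags the two "Juan Gomez" records because both carry the 1/1/1901 placeholder; the cautious version recognizes that a sentinel date carries no information at all.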


But false positives aren't evenly distributed. Ethnic groups whose surnames were assigned in recent history for tax-collection purposes (Ashkenazi Jews, Han Chinese, Koreans, etc) have a relatively small pool of surnames and a slightly larger pool of first names.


This is likewise true of the descendants of colonized and enslaved people, whose surnames were assigned to them for administrative purposes and see a high degree of overlap. When you see two voter rolls with a Juan Gomez born on Jan 1, you need to apply thick analysis.
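A back-of-envelope sketch of why small name pools matter. All the numbers below are illustrative assumptions, not census figures, and the uniform-distribution assumption badly *understates* the problem: real name frequencies are heavily skewed, so coincidental collisions concentrate on common names like "Juan Gomez."

```python
# Hypothetical pool sizes for a community whose surnames were assigned
# administratively (illustrative assumptions, not real data):
surnames = 200
first_names = 2_000
birthdates = 365 * 80  # roughly 80 plausible birth years

# Probability that two random, distinct people share full name and
# birthdate, assuming uniform, independent draws:
p_collision = 1 / (surnames * first_names * birthdates)

# Comparing two rolls of a million voters each gives 10^12 pairs:
pairs = 1_000_000 * 1_000_000
expected = pairs * p_collision
print(round(expected))  # 86 coincidental "duplicates", even under
                        # uniform assumptions; Zipf-skewed real-world
                        # name distributions push this far higher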


Unless, of course, you don't care about purging the people who are most likely to face structural impediments to voter registration (such as no local DMV office) and who are also likely to be racialized (for example, migrants whose names were changed at Ellis Island).


ML practitioners don't merely use poor quality data when good quality data isn't available - they also use the poor quality data to assess the resulting models. When you train an ML model, you hold back some of the training data for assessment purposes.

So maybe you start with 10,000 eye scans labeled for the presence of eye disease. You train your model with 9,000 scans and then ask the model to assess the remaining 1,000 scans to see whether it can make accurate classifications.
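A minimal simulation of why this holdout check fails when the labels themselves are bad. Everything here is synthetic and hypothetical: each "scan" is a single number, the true disease rule is one threshold, and the overworked labeler systematically applies a slightly different one. A model fit to the noisy labels looks perfect against the noisy holdout, while being measurably wrong against ground truth.

```python
import random
random.seed(0)

# Hypothetical setup: each "scan" is a feature x in [0, 1).
# True disease rule: diseased iff x > 0.5.
# The (hypothetical) labeler systematically marks borderline cases
# (0.5 < x <= 0.6) as healthy.
def true_label(x):  return x > 0.5
def noisy_label(x): return x > 0.6   # the labeler's actual behaviour

data = [random.random() for _ in range(10_000)]
train, holdout = data[:9_000], data[1_000:]

# A model fit to the noisy training labels learns the 0.6 threshold.
def model(x): return x > 0.6

holdout = data[9_000:]
acc_vs_noisy = sum(model(x) == noisy_label(x) for x in holdout) / len(holdout)
acc_vs_true  = sum(model(x) == true_label(x) for x in holdout) / len(holdout)
print(f"accuracy against noisy holdout labels: {acc_vs_noisy:.0%}")  # 100%
print(f"accuracy against ground truth:         {acc_vs_true:.0%}")
```

The holdout score is a flawless 100% because the test labels contain the very same systematic error the model learned; against the true labels, roughly a tenth of cases are misclassified.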


But if the data is no good, the assessment is also no good. As the paper's authors put it, it's important to "catch[] data errors using mechanisms specific to data validation, instead of using model performance as a proxy for data quality."
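What "mechanisms specific to data validation" might look like in practice: checks that run against the data itself, before any model sees it. This is a minimal sketch; the field names, sentinel dates, and thresholds are all illustrative assumptions.

```python
from datetime import date

def validate_records(records):
    """Return a list of human-readable data-quality warnings."""
    warnings = []
    # Placeholder birthdates some systems use for "unknown".
    sentinels = {date(1901, 1, 1), date(1900, 1, 1)}
    for i, r in enumerate(records):
        if r["dob"] in sentinels:
            warnings.append(f"record {i}: placeholder birthdate {r['dob']}")
        if not (0 <= r["age"] <= 120):
            warnings.append(f"record {i}: implausible age {r['age']}")
    # Distribution-level check: a label that almost never (or almost
    # always) occurs often signals a labeling or sampling problem.
    positives = sum(r["label"] for r in records) / len(records)
    if positives < 0.01 or positives > 0.99:
        warnings.append(f"label balance suspicious: {positives:.1%} positive")
    return warnings

records = [
    {"dob": date(1901, 1, 1), "age": 124, "label": 1},
    {"dob": date(1985, 3, 9), "age": 40, "label": 1},
]
for w in validate_records(records):
    print(w)
```

None of these checks need a trained model, which is the point: they surface bad data directly instead of letting it hide inside a plausible-looking accuracy number.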

The ML practitioners studied for the paper - all engaged in "high-stakes" model building - reported that they had to gather their own data for their models through field partners, "a task which many admitted to being unprepared for."


High-stakes ML work has inherited a host of sloppy practices from ad-tech, where ML saw its first boom. Ad-tech aims for "70-75% accuracy."

That may be fine if you're deciding whether to show someone an ad, but it's a very different matter if you're deciding whether someone needs treatment for an eye-disease that, untreated, will result in irreversible total blindness.


Even when models are useful at classifying input produced under present-day lab conditions, those conditions are subject to several kinds of "drift."

For example, "hardware drift," where models trained on images from pristine new cameras are asked to assess images produced by cameras from field clinics, where lenses are impossible to keep clean (see also "environmental drifts" and "human drifts").
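One cheap guard against hardware drift is to monitor a simple image statistic in production and compare it to the training distribution. A minimal sketch, using synthetic stand-in numbers for mean pixel brightness (real deployments would track richer statistics per channel and per device):

```python
import statistics

# Synthetic stand-ins for mean pixel brightness per image:
lab_brightness   = [200, 205, 198, 202, 201, 199, 203]  # pristine cameras
field_brightness = [150, 145, 160, 140, 155, 148, 152]  # dirty field lenses

def drift_score(reference, incoming):
    """Gap between means, measured in reference standard deviations."""
    mu = statistics.mean(reference)
    sigma = statistics.stdev(reference)
    return abs(statistics.mean(incoming) - mu) / sigma

score = drift_score(lab_brightness, field_brightness)
print(f"drift score: {score:.1f} sigma")
if score > 3:
    print("distribution shift detected -- model predictions are suspect")
```

A model can keep emitting confident classifications long after its inputs have drifted out of distribution; a monitor like this at least raises a flag when the field data stops resembling the lab data.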


Bad data makes bad models. Bad models instruct people to make ineffective or harmful interventions. Those bad interventions produce more bad data, which is fed into more bad models - it's a "data-cascade."

GIGO - Garbage In, Garbage Out - was already a bedrock of statistical practice before the term was coined in 1957. Statistical analysis and inference cannot proceed from bad data.

