"Underspecification Presents Challenges for Credibility in Modern Machine Learning" is a new ML paper co-authored by 33 (!) Google researchers. It's been called a "wrecking ball" for our understanding of problems in machine learning.


There's been a lot of work on the problems of inadequate, low-quality, biased, or poorly labeled training data in machine learning classifiers ("garbage in, garbage out"), but that's not what these researchers are documenting.


They're focused on "underspecification," a well-known statistical phenomenon that has not been at the center of machine learning analysis (until now).

It's a gnarly concept, and I quickly found myself lost while reading the original paper; thankfully, Will Douglas Heaven did a great breakdown for MIT Tech Review.




"Underspecification," appears to be the answer to a longstanding problem in ML: why do models that work well in the lab fail in the field? Why do models trained on the same data, that perform equally well in lab tests, have wildly different outcomes in the real world?

The answer appears to be minor, random variations: the starting values for nodes in the neural net; the order in which training data is considered; the number of training runs.
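To make that concrete, here is a minimal sketch (a toy perceptron, not anything from the paper) showing how two models trained on identical data, differing only in their random seed, end up with the same training performance but different weights:

```python
import random

def train_toy_model(seed, data, epochs=100, lr=0.1):
    """Train a toy 2-input perceptron; only the seed differs between runs."""
    rng = random.Random(seed)
    # Random starting values for the weights -- one of the "minor" variations
    # the paper identifies as a source of underspecification.
    w = [rng.uniform(-1, 1) for _ in range(2)]
    for _ in range(epochs):
        rows = data[:]
        rng.shuffle(rows)  # the order training data is considered also varies
        for x, y in rows:
            pred = 1 if w[0] * x[0] + w[1] * x[1] > 0 else 0
            err = y - pred
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
    return w

# Linearly separable toy data: label is 1 when x0 > x1.
data = [((1, 0), 1), ((0, 1), 0), ((2, 1), 1), ((1, 2), 0)]
model_a = train_toy_model(seed=0, data=data)
model_b = train_toy_model(seed=1, data=data)
# Both fit the training data perfectly, but their weights -- and thus their
# behaviour on unseen inputs -- differ. That gap is underspecification.
```

Both models are indistinguishable on the data they were trained on; the training setup simply doesn't pin down which one you get.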



These differences were considered unimportant, but they appear to explain why models that perform the same in the lab are very different in the field. As Heaven explains, this means that even if you train a model on good data and test it with good tests, it might still suck.



The paper describes the researchers' experiment to validate this hypothesis: they created 50 variations of a visual classifier, each trained on the standard ImageNet dataset, differing only in the random starting values of the nodes in the neural net.



They selected models that performed near-equivalently on data held out from the training set for testing, and then they stress-tested these equally ranked models with ImageNet-C (a distorted subset of ImageNet) and ObjectNet (a set of common objects in unusual poses).

The models' stress-test outcomes varied hugely. The same thing happened when they evaluated models trained to spot eye disease, cancerous skin lesions, and kidney failure.
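The pattern can be sketched with Python's statistics module. The accuracy numbers below are purely illustrative (not figures from the paper): the seed-varied models cluster tightly on the held-out test split but spread widely on a stress-test set:

```python
from statistics import mean, stdev

# Hypothetical accuracies for five seed-varied models (illustrative numbers,
# not from the paper): near-identical on the held-out split, widely spread
# on a stress-test set like ImageNet-C.
held_out   = [0.756, 0.755, 0.757, 0.756, 0.755]
stress_set = [0.42, 0.55, 0.31, 0.48, 0.37]

print(f"held-out: mean={mean(held_out):.3f} spread={stdev(held_out):.4f}")
print(f"stress:   mean={mean(stress_set):.3f} spread={stdev(stress_set):.4f}")
```

Ranking the models by held-out accuracy tells you almost nothing about which one will hold up under distribution shift; that's why the standard train/test pipeline fails to catch the problem.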



Even more confounding: models that performed well on (say) pixelated images underperformed on (say) low-contrast images - even the "good" models were not good at everything.

Heaven says that addressing this will involve huge expense: producing many variant models and testing them against many real-world conditions. It's the kind of thing Google can afford to do, but which may be out of reach for smaller firms.

