Prediction, Understanding, Accuracy, and … Orthodoxy – Part 1


This blog post is an abbreviation of lecture 1 in a data science short course I taught at TEC de Monterrey in June 2016.

In the April 1970 issue of Management Science (B-465-485) John D. C. Little opened his article by saying:

“The big problem with management science models is that managers practically never use them. There have been a few applications, of course, but the practice is a pallid picture of the promise.”

Little has changed in the intervening 46.25 years, and the purpose of this blog post is to explore some reasons why.

Today, of course, we speak not of “management science” but of the hot new thing, “data science.” So certainly the names of the management fads have changed. But the effect of the management fads has remained much as Little described in 1970. Gartner in 2015 retired its technology hype cycle graph with both Data Science and Big Data in mid-rise/fall:

Something like this:

The shape of the hype cycle is instructive. Something happens after the trough of disillusionment. Something VERY good happens to redeem the faith in technologies of yore. Spoiler: I think that what happens is that, on the slope of enlightenment, enough understanding develops in a word-of-mouth community to learn how to monetize new technologies.

Something like this:

BillM’s Theory: New technologies need new monetization models, that is, unique new built-from-the-ground-up business models. The nerd-posse that stays with a technology long enough will eventually find Monetizing Use 1, then Monetizing Use 2, etc., etc., etc.

So why, after these monetizing uses are found, are they NOT spreading across business like wildfire? Why is practice still a pallid picture of the promise?

Potential causes:

  1. Margin retreat.
  2. Fear, Uncertainty, and Doubt around so-called “intellectual property.”
  3. The Biblical curse of the sins of statistical fathers being visited on the children … to the third and fourth generation (Exodus 34:7).
  4. Others …

As this is a blog post, and not a book, I’ll describe (briefly) my theories about causes 1 through 3, and look to comments on this blog (or email to bill@nealanalytics.com) for 4. In Part 2 of this blog post, I’ll drill into concrete steps that can be taken to correct the “pallid” state of current modeling practice.

Cause #1: Margin Retreat

In the 25 years of my career, every job I’ve had has required me to switch from one omnibus statistical platform to another. I’ve done SAS, SPSS, SYSTAT/SYGRAPH, MicroTSP, Azure Machine Learning, and 7 other stats environments along this journey. Along the way, omnibus stats environments have grown more expensive by the hour, with the exception of what I think of as the “democratized cloud-based” stats environments, which fall in price by the hour. But I digress.

One omnibus statistical package I recently looked at had a price list that was nearly 700 pages long. And from what I’ve heard from clients getting price quotes, the entry price for this environment appears to be US$1,000 per page in the price list!

What is going on here? The omnibus stats environments have, until now, contented themselves with raising prices until their growth rate slows from “too fast” to “comfortable.” The “big” packages end up with 20,000 to 40,000 enterprise customers, and a nice, relatively easy living.

The “big” suppliers concentrate the benefits of the tools they provide on the supplier’s upper management (stats companies are often privately held), while the costs of dealing with partially developed and legacy-encumbered tools are externalized to customers.

Now 40,000 enterprise customers are nothing to sneeze at, but according to the World Bank, in 2007 there were 19,878,084 businesses registered worldwide. Take 40,000 enterprises and divide by 19,878,084 total enterprises and you get, VERY ROUGHLY, that 0.2% of businesses in the world have access to stats tools at scale.

A 0.2% penetration of stats tools in businesses is a big part of why practice is a pallid picture of the promise.
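For readers who want to check the arithmetic, here is a minimal sketch in Python. The two input figures are the ones quoted above; the function name `penetration_pct` and the variable names are my own illustrative choices, not anything from the World Bank data or a stats vendor.

```python
# Rough penetration estimate using the figures quoted in this post.
# Names here (penetration_pct, etc.) are hypothetical, for illustration only.

def penetration_pct(customers: int, businesses: int) -> float:
    """Return customers as a percentage of all registered businesses."""
    return 100.0 * customers / businesses

REGISTERED_BUSINESSES_2007 = 19_878_084  # World Bank figure cited above

# The "big" packages end up with 20,000 to 40,000 enterprise customers.
low = penetration_pct(20_000, REGISTERED_BUSINESSES_2007)
high = penetration_pct(40_000, REGISTERED_BUSINESSES_2007)

print(f"low end:  {low:.2f}%")
print(f"high end: {high:.2f}%")
```

Even at the high end of the customer range, the result rounds to about 0.2%, and the low end is about half of that.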

Cause #2: Fear, Uncertainty, and Doubt around so-called “intellectual property.”

I would not have included this as a cause until I saw it with my own eyes recently. I came across a large consulting organization that had developed data visualizations with an off-the-shelf tool, and then threatened a Fortune 500 company with a claim of intellectual property misuse if the client used the visualizations it had paid for outside the project’s scope document.

Now, I’m not an intellectual property attorney. Read that sentence again.

But … I have managed the patent portfolio for HP’s LaserJet group, I’ve supported the business side of million-dollar-a-month patent litigation, and I sat with attorneys as we reviewed over 4,000 invention disclosures from my group at HP. So I know the first rule of intellectual property is “If it isn’t written down, it is not intellectual property.”

So, I personally doubt whether the large consulting group could claim anything except copyrights on the visuals. And because the visuals were developed by the company, even that is hypothetical.

What I think is clear, though, is that fear-uncertainty-doubt threats can be a big reason why a known monetizing use of an analytical technology does not spread to new companies.

Cause #3: The Biblical curse of the sins of statistical fathers being visited on the children … to the third and fourth generation (Exodus 34:7).

What has impressed me as “the curse of statistical inheritance” can perhaps be indicated with a few passages from David Salsburg’s excellent The Lady Tasting Tea:

  1. Biometrika’s editor Pearson “… published Fisher’s work, but as a footnote to a larger paper … The result was that, to the casual reader, Fisher’s [works] were a mere appendix to the more important … work done by Pearson and his coworkers.” p. 34
  2. “… anomalous results were only used as a springboard from which Fisher utterly demolished another of Pearson’s most proud achievements.” p. 50
  3. “… a young scientist may be warned and advised that when he has a jewel to offer for the enrichment of mankind some certainly will wish to turn and rend him.” p. 52 Ronald Fisher in a 1947 BBC broadcast

Data science may or may not be, arguably, statistics. But what is clear is that statisticians don’t get along.

  • The 250-year blood feud between the Hatfields (frequentists) and McCoys (Bayesians).
  • But in fact, not getting along extends way beyond the boundary of statistics proper, easily encompassing data science as practiced today.

To business outsiders, the data science playing field may look like this:

And the data science side of the picture is filled with many disciplines working on “models,” some of which are at war.

Practically, though, the reality of modelers working together is more like this:

So the third reason we don’t see monetizing applications spread like wildfire is a set of sins:

  1. Persecution mentality when receiving any kind of feedback (Fisher was famous for this).
  2. Violent communication styles (in the sense of Rosenberg’s Nonviolent Communication)
  3. Disciplinary inertia to keep doing what we’ve always done
  4. Obliviousness to the massive ecology of different species of modeling
  5. Vacuum of philosophy to support a larger perspective

I’ve outlined three big reasons that modeling practice is a pallid picture of the promise, as called out by John D. C. Little in 1970. In part 2 of this blog post I’ll lay out some hypotheses about what we can do to make the future better.

Bill Meade