I frequently harp upon the difference between natural and intentional randomness. Natural randomness models observation as random, as I described last week in modeling noise in electronics or the distribution of attributes in some field study. Intentional randomness is generated by people, either for games of chance or as a convenient way to battle indecision.
我經常強調自然隨機性和有意隨機性之間的差異。自然隨機性模擬觀察為隨機的,就像我上週在模擬電子噪音或某些領域研究中描述的那樣。有意隨機性是由人們生成的,無論是用於賭博遊戲還是作為應對猶豫不決的方便方式。
The Roaring Twenties were a heyday for intentional randomness. Neyman and Fisher’s early work on randomized experiments called for randomly assigning crops to agricultural plots. For Neyman, this allocation could be done by selecting balls from an urn with an appropriate color distribution. For Fisher, the random assignments could be accomplished by shuffling cards. The idea that these sorts of selection rules returned random numbers dates back at least to Bernoulli. But were they really random?
翻譯結果:
狂野的二十年代是有意的隨機性的全盛時期。Neyman 和 Fisher 在隨機實驗方面的早期工作要求將作物隨機分配到農田地塊中。對於 Neyman 來說,這種分配可以通過從具有適當顏色分佈的瓮中選擇球來完成。對於 Fisher 來說,隨機分配可以通過洗牌來實現。這種選擇規則返回隨機數的想法至少可以追溯到伯努利。但它們真的是隨機的嗎?
Karl Pearson argued that they were decidedly not.1
卡爾·皮爾森認為他們絕對不是。1
“Practical experiment has demonstrated that it is impossible to mix the balls or shuffle the tickets between each draw adequately. Even if marbles be replaced by the more manageable beads of commerce their very differences in size and weight are found to tell on the sampling. The dice of commerce are always loaded, however imperceptibly. The records of whist, even those of long experienced players, show how the shuffling is far from perfect, and to get theoretically correct whist returns we must deal the cards without playing them. In short, tickets and cards, balls and beads fail in large scale random sampling tests; it is as difficult to get artificially true random samples as it is to sample effectively a cargo of coal or of barley.”
實際實驗證明,在每次抽獎之間,無法充分混合球或洗牌票券。即使用商業上更易處理的珠子取代彈珠,它們的大小和重量差異也會對抽樣產生影響。商業上的骰子總是被偷偷地加重。紙牌遊戲的記錄,即使是經驗豐富的玩家,也顯示出洗牌遠非完美,要獲得理論上正確的紙牌結果,我們必須在不打牌的情況下發牌。總之,票券和紙牌,彈珠和珠子在大規模的隨機抽樣測試中失敗;人工獲得真正隨機樣本與有效地抽樣一艘煤炭或大麥的貨物一樣困難。
If statistics were to be made rigorous, statisticians would need to be able to reliably generate random samples. But how? One of the first systematic approaches was compiling large tables of random numbers.
如果要使統計數據變得嚴謹,統計學家需要能夠可靠地生成隨機樣本。但是如何做到呢?其中一種最早的系統方法是編制大型的隨機數字表。
In 1927, Cambridge University Press published a book of Random Sampling Numbers by Leonard Tippett, the 15th volume of their series “Tracts for Computers.” The book consists of 26 pages of numbers, each with a 50x32 table of digits.
1927 年,劍橋大學出版社出版了萊納德·提普特的《隨機抽樣數字》一書,該書是他們的系列著作《電腦文獻》的第 15 卷。該書包含了 26 頁數字,每頁都有一個 50x32 的數字表格。
Altogether, the book contains 41,600 “random digits.” The book provides no details of how Tippett collected them, simply asserting “40,000 digits were taken at random from census reports.” I don’t know what that means, and no one else seems to know either. The only other information is that Ida McLearn handwrote the tables.
總共,這本書包含了 41,600 個「隨機數字」。該書並未提供 Tippett 如何收集這些數字的細節,僅聲稱「40,000 個數字是從人口普查報告中隨機選取的」。我不知道這是什麼意思,似乎也沒有其他人知道。唯一的其他資訊是 Ida McLearn 親手書寫了這些表格。
Karl Pearson wrote the foreword to the book, which features the quote I highlighted above. Pearson argues that there are many ways the applied statistician can use these numbers to generate samples. He suggests taking four digit runs as numbers between 0 and 9,999. These runs can be produced by scanning through the book horizontally, vertically, or diagonally, forwards or backwards. Any systematic scan would do.
卡爾·皮爾森為這本書寫了前言,其中包含了我上面標記的引文。皮爾森認為應用統計學家可以用這些數字生成樣本的方法有很多種。他建議將四位數的連續數字視為介於 0 和 9,999 之間的數字。這些連續數字可以通過水平、垂直或對角線方向的書中掃描來生成,可以從前往後或從後往前。任何有系統的掃描方式都可以。
You can see why statisticians might find such computational methods more appealing than mechanical randomness generators. But would such a sequence be random? How could you tell? Pearson more or less says not to worry about it:
你可以理解為什麼統計學家可能會覺得這種計算方法比機械隨機生成器更具吸引力。但這樣的序列是否是隨機的呢?你怎麼判斷呢?皮爾遜或多或少地表示不用擔心這個問題:
“Any such series may be a random sample, and yet a very improbable sample. We have no reason whatever to believe that Tippett's numbers are such, they have conformed to the mathematical expectation in a variety of cases, and we would suggest to the user who finds a discordance between the results provided by the table and the theory of sampling he has adopted, first to investigate whether his theory is really sound, and if he be certain that it is, only then to question the randomness of the numbers.
任何這樣的數列可能是一個隨機樣本,但卻是非常不太可能的樣本。我們完全沒有理由相信 Tippett 的數字是這樣的,它們在各種情況下都符合數學期望值,我們建議使用者如果在表格提供的結果和他所採用的抽樣理論之間發現不一致,首先應該調查他的理論是否真的正確,如果確定是正確的,那才能質疑數字的隨機性。“But the reader must remember that in taking hundreds of samples he must expect to find improbable samples occurring or even runs of such samples; they will be in their place in the general distribution of samples, and he must not conclude, from their isolated occurrence, that Tippett’s numbers are not random.”
It’s a bit odd from our modern perspective. You could run all sorts of tests on your samples to see if they behave like random samples. If they pass the tests, then I guess you would be assured that you are generating randomness. But of course, if you spend too much time testing randomness, you run out of numbers in your table. Pearson was almost suggesting that by starting at “random” places in the table and moving in “random directions,” even more randomness could be generated from the 41,600 digits. Claims to this effect were difficult to verify. For Pearson, randomness was an “I know it when I see it” phenomenon.
To be honest, I’m not convinced we have moved far beyond Pearson’s perspective. We’re all guilty of not being particularly diligent about random number generation. Just set the random seed in python to 1234 or 1337 and go with it.
This sort of intentional randomness is good enough until it isn’t. Philip Stark and Kellie Ottoboni showed that many of the random number generators in popular statistical packages were woefully lacking. Up until 2015, STATA used a random number generator that couldn’t randomly sample a permutation of length 13. That seems to be less than ideal. The common MT pseudorandom number generator (used in R and Matlab) can’t generate a random permutation of 2084 items. That’s better than STATA, but is it good enough?
I’ve run into many statisticians who think Stark and Ottoboni’s worries about pseudorandomness are pedantic. But I always ask them which part of statistics they think is not pedantic. Are those 50 page proofs of the asymptotic normality of standard errors not pedantic? What about all of the detailed calculations of the efficiency of maximum likelihood estimators of parameters from distributions we know are hypothetical? Every statistician argues their part of statistics is the one that needs excessive attention. The one thing I’ll give Stark and Ottoboni: they describe several ways to write better pseudorandom number generators if needed.
Regardless of whether the numbers are actually random, Pearson’s idea to simulate randomness with code has been strikingly effective. There are countless randomized algorithms whose predictions match our experience. Perhaps that should be good enough.
At some, I feel like I’ll need to dive into the politics of Fisher and Pearson, who remain two inescapable pillars of statistical thinking and yet held racist views that were extreme even for their time. That we rely so much on artificial concepts from these two remains concerning. So much of what bothers me about statistics and the general flattening of experience into pat answers and chance comes from their work in eugenics.
Pearson's second paragraph seems to be alluding to the fact that humans are often bad at evaluating randomness? E.g. song shuffling algorithms have been adapted from "traditional" random to perceptually random for better user experience (this one is old but I'm curious what people are using now https://engineering.atspotify.com/2014/02/how-to-shuffle-songs/).
So does evaluating randomness just come down to designing the right tests for your purposes? This seems hard, and subjective, but maybe this is what you're getting at.
"So much of what bothers me about statistics and the general flattening of experience into pat answers and chance comes from their work in eugenics." As a researcher in statistical genetics, "hear! hear!". Sure they made useful mathematical and statistical contributions, but they held repugnant beliefs about individuals who were not "of their stature" so to speak.