Maximum Fisherhood
Ronald Fisher's shapeshifting conceptions of probability.
Everyone was confused by randomness in the 1920s, and no one was more confused than Ronald Fisher. Fisher wrote a series of papers establishing much of modern statistics. But his internal philosophy about what probability means shifts with every paper, revealing his deep confusion about epistemology and inference. For someone infamous for his staunch dogmatism, Fisher was philosophically all over the map in the 1920s. He contradicts himself in each subsequent paper (though, of course, he never admits it).
My favorite and least favorite Fisher paper is his 1922 magnum opus, “On the Mathematical Foundations of Theoretical Statistics.” It is heralded by statisticians as one of the most important papers in the field. It’s my least favorite because it defines the maximum likelihood method, of which I’ve never been a fan and which has been a mathematical mess for a century. For statistics, this paper has done more harm than good. It’s my favorite because I love the free-wheeling way Fisher writes. It’s clear he’s making things up as he goes, trying to conjure rigor in a field that cannot be rigorous.
Fisher argues the role of statistics is data summarization. This had been its primary use: a way of tabulating bulk facts about the properties of the state so that those who ruled could make informed decisions. Fisher sought to make this tabulation of counts into rigorous mathematics. Let’s find out what Fisher thought, closely reading the first three paragraphs of Section 2.
“...the object of statistical methods is the reduction of data. A quantity of data, which usually by its mere bulk is incapable of entering the mind, is to be replaced by relatively few quantities which shall adequately represent the whole, or which, in other words, shall contain as much as possible, ideally the whole, of the relevant information contained in the original data.”
So far, so good. Now, how should you summarize data? Here’s where things get wild:
“This object is accomplished by constructing a hypothetical infinite population, of which the actual data are regarded as constituting a random sample. The law of distribution of this hypothetical population is specified by relatively few parameters, which are sufficient to describe it exhaustively in respect of all qualities under discussion. Any information given by the sample, which is of use in estimating the values of these parameters, is relevant information.”
All data must be assumed to be random. Not only are they random, but they are randomly sampled (whatever that may mean) from a “population.” This population is hypothetical (i.e., it does not exist) and is a relatively simple mathematical object. Sampling from the population is the same as sampling from a certain simple probability distribution with only a few parameters. The important differences between populations can be summarized by a few numbers.
This set of assumptions about data is patently absurd and never true. However, for Fisher, it doesn’t need to be true. The purpose of this hypothetical population is data summarization. It need only encapsulate the important features of the data before the analyst. Fisher, with his frustrating grandiloquence, is just saying “All models are wrong, but some are useful.”
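Fisher’s “reduction of data” is easy to sketch in code. The snippet below is a hypothetical illustration, not anything from the paper: we pretend, with Fisher, that 10,000 observations are a random sample from a two-parameter normal population, and the whole dataset collapses to two summary quantities.

```python
import random
import statistics

# Hypothetical example: 10,000 "observations" that we pretend, with
# Fisher, are a random sample from a two-parameter normal population.
random.seed(0)
data = [random.gauss(170.0, 8.0) for _ in range(10_000)]

# The "reduction of data": 10,000 numbers replaced by two parameters.
summary = (statistics.mean(data), statistics.stdev(data))
print(summary)
```

Everything about the data not captured by these two numbers is, on Fisher’s account, “irrelevant information.”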
“Since the number of independent facts supplied in the data is usually far greater than the number of facts sought, much of the information supplied by any actual sample is irrelevant.”
Indeed, most of the information, whatever that is, is irrelevant to the facts we seek. Of course, what’s relevant and irrelevant is in the eye of the beholder. Does that make Fisher a Bayesian subjective probabilist?
“It is the object of the statistical processes employed in the reduction of data to exclude this irrelevant information, and to isolate the whole of the relevant information contained in the data.”
The goal of a statistical algorithm is to remove all irrelevant information and find only the relevant information. What is relevant is clarified by creating a hypothetical, simple random model of the world and assuming all data is generated by it. The randomness flattens all uncertainty into stochastic variation around a small number of statistics. The statistician must model the world as a few simple facts corrupted by aberrations due solely to chance.
Now I offhandedly quipped that Fisher might be considered a Bayesian for his subjectivity in this section. But he’s clearly not being a frequentist in this paper. How would the modern statistician characterize his proposed procedure?
1. I have a bunch of observations in front of me.
2. I hypothesize a model for this data.
3. I use some math to estimate the parameters of this model.
4. These parameters serve as my summary of the data.
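The four steps can be made concrete with a small sketch. Everything below is an assumed, illustrative setup (simulated waiting times and an exponential model), not anything taken from Fisher’s paper:

```python
import math
import random

# 1. Some observations in front of me (here: simulated waiting times).
random.seed(1)
observations = [random.expovariate(0.5) for _ in range(5_000)]

# 2. Hypothesize a model: an exponential distribution with rate `lam`.
def neg_log_likelihood(lam, data):
    return -sum(math.log(lam) - lam * x for x in data)

# 3. Use some math to estimate the parameter. For the exponential,
#    calculus gives the maximum-likelihood estimate in closed form:
lam_hat = len(observations) / sum(observations)

# Sanity check: a nearby rate has strictly higher negative log-likelihood.
assert neg_log_likelihood(lam_hat, observations) < neg_log_likelihood(0.8 * lam_hat, observations)

# 4. This single number is the "summary" of 5,000 observations.
print(lam_hat)
```

The model is untestable in exactly the sense discussed above: nothing in the data certifies that an exponential population was the right story to tell.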
This sounds like exploratory data analysis to me! We make some untestable assumptions about the world in order to tell a story about data. The Fisher of 1922 is much closer to John Tukey than the Fisher of 1935 who wrote The Design of Experiments.
Fisher expands further on his beliefs about probability in the next paragraph.
“It should be noted that there is no falsehood in interpreting any set of independent measurements as a random sample from an infinite population; for any such set of numbers are a random sample from the totality of numbers produced by the same matrix of causal conditions: the hypothetical population which we are studying is an aspect of the totality of the effects of these conditions, of whatever nature they may be. The postulate of randomness thus resolves itself into the question, ‘Of what population is this a random sample?’ which must frequently be asked by every practical statistician.”
That first sentence is a doozy. So many clauses! What does he mean by independent here? Regardless, he’s laying his cards on the table and telling us that all data are a random sampling of something. This means that all of our experience is nothing more than the manifestation of random fluctuations of the universe. You might defend this position, but realize that you are making some very strong philosophical assertions. Natural randomness is a postulate for Fisher. All observations are random. Some, I suppose, are useful.
In the remaining fifty-odd pages, Fisher proceeds to write a bunch of formulae to derive the method of maximum likelihood. Let me include one more paragraph that still haunts statistics.
“Readers of the ensuing pages are invited to form their own opinion as to the possibility of the method of the maximum likelihood leading in any case to an insufficient statistic. For my own part I should gladly have withheld publication until a rigorously complete proof could have been formulated; but the number and variety of the new results which the method discloses press for publication, and at the same time I am not insensible of the advantage which accrues to Applied Mathematics from the co-operation of the Pure Mathematician, and this co-operation is not infrequently called forth by the very imperfections of writers on Applied Mathematics.”
Hilarious. Is maximum likelihood rigorous today? No! 100 years later, we still use the technique with little justification. It’s mostly harmless as it’s often just computing means or solving least-squares problems. And it’s often as good as anything else because data summarization is exploratory.
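A minimal sketch of the “mostly harmless” point, with hypothetical numbers: under a Gaussian noise model with known variance, the negative log-likelihood of a location parameter is, up to additive constants, a scaled sum of squared errors, so maximizing likelihood and minimizing squared error pick out the same estimate, which is just the sample mean.

```python
# Hypothetical data throughout.
data = [2.0, 3.0, 5.0, 10.0]
mean = sum(data) / len(data)

def sse(m):
    # Least-squares criterion: sum of squared errors around m.
    return sum((x - m) ** 2 for x in data)

def neg_log_lik(m):
    # Gaussian negative log-likelihood (unit variance), dropping constants:
    # it is exactly half the sum of squared errors.
    return 0.5 * sse(m)

# Brute-force check over a grid of candidates: both criteria agree,
# and both land on the sample mean.
candidates = [mean + d / 100 for d in range(-200, 201)]
best_ls = min(candidates, key=sse)
best_mle = min(candidates, key=neg_log_lik)
print(mean, best_ls, best_mle)
```

The estimate is fine; what remains unjustified is the premise that the data were drawn from the assumed model in the first place.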
We’d certainly add some forms of mathematical rigor. For example, Doob would show the method could be considered empirical risk minimization. While this gives a rigorous justification for the method in special contexts, it does not rigorously justify the assumptions. Doob’s theory is true only if the data are actually generated from one of Fisher’s hypothetical probability distributions. But this is almost never true. The assumptions of statistics are metaphysical and can never be made rigorous. You can never prove that all observations are generated by having god randomly generate an iid sample from a probability distribution governed by a few parameters. The mathematical foundations of statistics have their issues. The philosophical foundations are untenable.
"all observations are generated by having god randomly generate an iid sample from a probability distribution governed by a few parameters. " This world view is so confusing. "Random variables" in statistical world view seem to be super zombie which make everything rv. Any constant/object + random variable is a random variable. Random variable infects everything ! This worldview is good for mathematical analysis or exploration in some context. Generalizing this idea is so weird.
Interesting read. Would you mind going a little deeper on your last paragraph?
I’m interested in what the notion of MLE being on unstable ground implies about the philosophical implications of say the standard model in physics. Is that, too, non-rigorous from this perspective for the following reasons 1) the assumptions of the standard model itself represent an over simplification of the world and 2) it is experimentally verified using the methods of maximum likelihood inference (which as you say is unreliable).
I guess the question is how much of science becomes non-rigorous when these standards are held to different fields than statistics