For probabilities (proportions or shares) $p_i$ summing to 1, the family $\sum_i p_i^a [\ln(1/p_i)]^b$ includes several proposals for measures (indices, coefficients, whatever) in this territory. Thus

$a = 0, b = 0$ returns the number of distinct words observed, which is the simplest to think about, regardless of its ignoring differences between the probabilities. This is always useful, if only as context. In other fields, this could be the number of firms in a sector, the number of species observed at a site, and so forth. In general, let's call this the number of distinct items.
$a = 2, b = 0$ returns the Gini-Turing-Simpson-Herfindahl-Hirschman-Greenberg sum of squared probabilities, otherwise known as the repeat rate or purity or match probability or homozygosity. It is often reported as its complement or its reciprocal, sometimes then under other names, such as impurity or heterozygosity. In this context, it is the probability that two words selected randomly are the same, and its complement $1 - \sum_i p_i^2$ the probability that two words are different. The reciprocal $1/\sum_i p_i^2$ has an interpretation as the equivalent number of equally common categories; this is sometimes called the numbers equivalent. Such an interpretation can be seen by noting that $k$ equally common categories (each probability thus $1/k$) imply $\sum_i p_i^2 = k(1/k)^2 = 1/k$, so that the reciprocal of the probability is just $k$. Picking a name is most likely to betray the field in which you work. Each field honours their own forebears, but I commend match probability as simple and most nearly self-defining.
$a = 1, b = 1$ returns Shannon entropy, often denoted $H$ and already signalled directly or indirectly in previous answers. The name entropy has stuck here, for a mix of excellent and not so good reasons, even occasionally physics envy. Note that $\exp(H)$ is the numbers equivalent for this measure, as seen by noting in similar style that $k$ equally common categories yield $H = \sum_{i=1}^{k} (1/k)\ln[1/(1/k)] = \ln k$, and hence $\exp(H) = \exp(\ln k)$ gives you back $k$. Entropy has many splendid properties; "information theory" is a good search term.
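To make the family concrete, here is a minimal Python sketch of the three special cases above; the helper names and the toy word list are my own illustrative choices, not anything taken from the examples analysed below.

```python
# Sketch of the family sum_i p_i^a * [ln(1/p_i)]^b and its special cases.
from collections import Counter
from math import log, exp

def proportions(items):
    """Turn a list of observed items (e.g. words) into proportions p_i."""
    counts = Counter(items)
    total = sum(counts.values())
    return [c / total for c in counts.values()]

def family(p, a, b):
    """General member of the family: sum_i p_i^a * [ln(1/p_i)]^b."""
    return sum(pi ** a * log(1 / pi) ** b for pi in p)

words = "the cat sat on the mat the cat".split()  # toy data, not examples A or B
p = proportions(words)

distinct = family(p, 0, 0)   # a=0, b=0: number of distinct items (here 5)
match    = family(p, 2, 0)   # a=2, b=0: repeat rate / match probability
H        = family(p, 1, 1)   # a=1, b=1: Shannon entropy, natural logarithms

print(distinct)
print(match, 1 - match, 1 / match)  # match probability, its complement, numbers equivalent
print(H, exp(H))                    # H and its numbers equivalent exp(H)
```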
The formulation is found in I.J. Good. 1953. The population frequencies of species and the estimation of population parameters. Biometrika 40: 237-264. www.jstor.org/stable/2333344
Other bases for the logarithm (e.g. 10 or 2) are equally possible according to taste, precedent, or convenience, with just simple variations implied for some of the formulas above.
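As a quick numerical check of that point (with made-up probabilities), changing the base only rescales the entropy by a constant, and the numbers equivalent is unchanged:

```python
# Sketch: a base change rescales Shannon entropy (bits = nats / ln 2);
# the numbers equivalent comes out the same either way.
from math import log, exp

p = [0.5, 0.25, 0.25]                          # illustrative probabilities
H_nats = sum(pi * log(1 / pi) for pi in p)     # natural logarithm
H_bits = sum(pi * log(1 / pi, 2) for pi in p)  # base 2
assert abs(H_bits - H_nats / log(2)) < 1e-9
assert abs(2 ** H_bits - exp(H_nats)) < 1e-9   # same numbers equivalent
```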
Independent rediscoveries (or reinventions) of the second measure are manifold across several disciplines and the names above are far from a complete list.
Tying together common measures in a family is not just mildly appealing mathematically. It underlines that there is a choice of measure depending on the relative weights applied to scarce and common items, and so reduces any impression of adhockery created by a small profusion of apparently arbitrary proposals. The literature in some fields is weakened by papers and even books based on tenuous claims that some measure favoured by the author(s) is the best measure that everyone should be using.
My calculations indicate that examples A and B are not so different except
on the first measure:
----------------------------------------------------------------------
          |  Shannon H      exp(H)     Simpson   1/Simpson      #items
----------+-----------------------------------------------------------
        A |      0.656       1.927       0.643       1.556          14
        B |      0.684       1.981       0.630       1.588           9
----------------------------------------------------------------------
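For anyone wanting to reproduce this kind of table from their own data, a small sketch follows; the counts used are hypothetical, since the word counts for examples A and B belong to the question and are not repeated here.

```python
# Sketch: compute the five reported columns from raw category counts.
# The Counter below is made up; it is not examples A or B.
from collections import Counter
from math import log, exp

def table_row(counts):
    total = sum(counts.values())
    p = [c / total for c in counts.values()]
    H = sum(pi * log(1 / pi) for pi in p)  # Shannon H, natural logarithms
    S = sum(pi ** 2 for pi in p)           # Simpson / match probability
    return H, exp(H), S, 1 / S, len(p)     # Shannon H, exp(H), Simpson, 1/Simpson, #items

print(table_row(Counter({"alpha": 5, "beta": 3, "gamma": 2})))
```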
(Some may be interested to note that the Simpson named here (Edward Hugh Simpson, 1922- ) is the same as that honoured by the name Simpson's paradox. He did excellent work, but he wasn't the first to discover either thing for which he is named, which in turn is Stigler's paradox, which in turn....)