How do you measure dispersion in word-frequency data?


10

How can I measure the amount of dispersion in a vector of word counts? I'm looking for a statistic that will be high for document A, because it contains many different words that each occur rarely, and low for document B, because it contains one word (or a few words) that occurs frequently.

More generally, how does one measure dispersion or "spread" in nominal data?

Is there a standard way of doing this in the text-analysis community?

[The original question included two images showing the word-count distributions for documents A and B.]

Answers:


10

For probabilities (proportions or shares) $p_i$ summing to $1$, the family $\sum_i p_i^a\,[\ln(1/p_i)]^b$ includes several proposals for measures (indexes, coefficients, whatever) in this territory. Thus:

  1. $a=0, b=0$ returns the number of distinct words observed, which is the simplest to think about, regardless of its ignoring differences between the probabilities. This is always useful, if only as context. In other fields, this could be the number of firms in a sector, the number of species observed at a site, and so forth. In general, let's call this the number of distinct items.

  2. $a=2, b=0$ returns the Gini-Turing-Simpson-Herfindahl-Hirschman-Greenberg sum of squared probabilities, otherwise known as the repeat rate or purity or match probability or homozygosity. It is often reported as its complement or its reciprocal, sometimes then under other names, such as impurity or heterozygosity. In this context, it is the probability that two words selected randomly are the same, and its complement $1 - \sum_i p_i^2$ the probability that two words are different. The reciprocal $1/\sum_i p_i^2$ has an interpretation as the equivalent number of equally common categories; this is sometimes called the numbers equivalent. Such an interpretation can be seen by noting that $k$ equally common categories (each probability thus $1/k$) imply $\sum_i p_i^2 = k(1/k)^2 = 1/k$, so that the reciprocal of the probability is just $k$. Picking a name is most likely to betray the field in which you work. Each field honours their own forebears, but I commend match probability as simple and most nearly self-defining.

  3. $a=1, b=1$ returns Shannon entropy, often denoted $H$ and already signalled directly or indirectly in previous answers. The name entropy has stuck here, for a mix of excellent and not so good reasons, even occasionally physics envy. Note that $\exp(H)$ is the numbers equivalent for this measure, as seen by noting in similar style that $k$ equally common categories yield $H = k(1/k)\ln[1/(1/k)] = \ln k$, and hence $\exp(H) = \exp(\ln k)$ gives you back $k$. Entropy has many splendid properties; "information theory" is a good search term. (All three measures are illustrated in the short code sketch after this list.)
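As a concrete illustration, here is a minimal R sketch of the three members of the family named above, computed from a vector of word counts (the counts are illustrative, borrowed from a later answer's reading of the question, not necessarily the questioner's actual data):

counts <- c(3, 2, 2, rep(1, 11))   # illustrative word counts for one document
p <- counts / sum(counts)          # convert counts to probabilities

n_items  <- length(p)              # a = 0, b = 0: number of distinct items
simpson  <- sum(p^2)               # a = 2, b = 0: match probability (repeat rate)
shannonH <- -sum(p * log(p))       # a = 1, b = 1: Shannon entropy in nats

# Report each measure together with its numbers equivalent, i.e. the number
# of equally common categories that would reproduce the same value.
c(items = n_items, match_prob = simpson, equiv_simpson = 1 / simpson,
  entropy = shannonH, equiv_entropy = exp(shannonH))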

The formulation is found in I.J. Good. 1953. The population frequencies of species and the estimation of population parameters. Biometrika 40: 237-264. www.jstor.org/stable/2333344.

Other bases for logarithm (e.g. 10 or 2) are equally possible according to taste or precedent or convenience, with just simple variations implied for some formulas above.

Independent rediscoveries (or reinventions) of the second measure are manifold across several disciplines and the names above are far from a complete list.

Tying together common measures in a family is not just mildly appealing mathematically. It underlines that there is a choice of measure depending on the relative weights applied to scarce and common items, and so reduces any impression of adhockery created by a small profusion of apparently arbitrary proposals. The literature in some fields is weakened by papers and even books based on tenuous claims that some measure favoured by the author(s) is the best measure that everyone should be using.

My calculations indicate that examples A and B are not so different except on the first measure:

----------------------------------------------------------------------
          |  Shannon H      exp(H)     Simpson   1/Simpson      #items
----------+-----------------------------------------------------------
        A |      0.656       1.927       0.643       1.556          14
        B |      0.684       1.981       0.630       1.588           9 
----------------------------------------------------------------------

(Some may be interested to note that the Simpson named here (Edward Hugh Simpson, 1922- ) is the same as that honoured by the name Simpson's paradox. He did excellent work, but he wasn't the first to discover either thing for which he is named, which in turn is Stigler's paradox, which in turn....)


This is a brilliant answer (and far easier to follow than the 1953 Good paper ;) ). Thank you!
dB'

7

I don't know if there's a common way of doing it, but this looks to me analogous to inequality questions in economics. If you treat each word as an individual and its count as comparable to income, you're interested in comparing where the bag of words sits between the extremes of every word having the same count (complete equality) and one word having all the counts with every other word at zero. The complication is that the "zeros" don't show up: you can't have a count of less than 1 in a bag of words as usually defined ...

The Gini coefficient of A is 0.18, and of B is 0.43, which shows that A is more "equal" than B.

library(ineq)   # provides Gini()

A <- c(3, 2, 2, rep(1, 11))   # word counts for document A
B <- c(9, 2, rep(1, 7))       # word counts for document B
Gini(A)   # 0.18
Gini(B)   # 0.43

I'm interested in any other answers too. Obviously the old fashioned variance in counts would be a starting point too, but you'd have to scale it somehow to make it comparable for bags of different sizes and hence different mean counts per word.


Good call - the Gini coefficient was my first thought, too! Searching on google scholar, though, I couldn't find much precedent for using it with text data. I wonder if the NLP / text retrieval community has a more standard measure for this sort of thing...
dB'

Watch out: by my count Gini has been given as a name to at least three different measures. The history is defensible in each case, but people need to see the formula used.
Nick Cox

1
Good point @NickCox - I was thinking of this one, as used for inequality, which I think is the most common use: ellisp.github.io/blog/2017/08/05/weighted-gini I've seen different methods of estimating/calculating it but all with the same basic definition, in this context. I know machine learning folks use it for something different but haven't seen their excuse...
Peter Ellis

1
@dB' I found this paper of using Gini in a text application: proceedings.mlr.press/v10/sanasam10a/sanasam10a.pdf (I prefer this answer to the accepted one, simply as it does the best job of distinguishing your A and B !)
Darren Cook

5

This article has a review of standard dispersion measures used by linguists. They are listed as single-word dispersion measures (they measure the dispersion of a single word across sections, pages, etc.) but could conceivably be used as word-frequency dispersion measures. The standard statistical ones seem to be:

  1. max-min
  2. standard deviation
  3. coefficient of variation $CV$
  4. chi-squared $\chi^2$

The classics are:

  1. Juilland's $D = 1 - \frac{CV}{\sqrt{n-1}}$
  2. Rosengren's $S = \frac{\left(\sum_{i=1}^{n} \sqrt{n_i}\right)^2}{N\,n}$
  3. Carroll's $D_2 = \left(\log_2 N - \frac{\sum_{i=1}^{n} n_i \log_2 n_i}{N}\right) / \log_2(n)$
  4. Lyne's $D_3 = 1 - \frac{\chi^2}{4N}$

where $N$ is the total number of words in the text, $n$ is the number of distinct words, and $n_i$ is the number of occurrences of the $i$-th word in the text.
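A rough R sketch of the four classic measures under this adaptation, with the word counts n_i standing in for the per-section frequencies of the original article; the count vector is illustrative, and the chi-squared term is computed against a uniform expected split, which is a common convention rather than necessarily the article's exact definition:

counts <- c(3, 2, 2, rep(1, 11))          # illustrative word counts
N <- sum(counts)                          # total number of words
n <- length(counts)                       # number of distinct words

cv <- sd(counts) / mean(counts)           # coefficient of variation (sample SD)
juilland_D  <- 1 - cv / sqrt(n - 1)
rosengren_S <- sum(sqrt(counts))^2 / (N * n)
carroll_D2  <- (log2(N) - sum(counts * log2(counts)) / N) / log2(n)
chi2        <- sum((counts - N / n)^2 / (N / n))   # chi-squared against a uniform split
lyne_D3     <- 1 - chi2 / (4 * N)

c(juilland_D, rosengren_S, carroll_D2, lyne_D3)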

The text also mentions two more measures of dispersion, but they rely on the spatial positioning of the words, so this is inapplicable to the bag of words model.

  • Note: I changed the original notation from the article, to make the formulas more consistent with the standard notation.

Could you please define f and xi? I suspect they are, or are definable in terms of, symbols you've defined already.
Nick Cox

Interesting and very extensive, but these are measures of dispersion for single words. They relate to the variation of the frequencies, $v_i$, of a single word in different pieces of text (instead of the frequencies of different words in a single piece of text). This difference should be clarified.
Sextus Empiricus

1
Why are the equations from the source not copied exactly (it is not just a change of labels in the expressions but also a change of the expression, or at least not a consistent change of the labels/variables)?
Sextus Empiricus

@NickCox Thank you for catching that, I corrected the formulas to include only defined quantities.
Chris Novak

@MartijnWeterings You are right that originally the article dealt with single-word dispersion metrics, although they seem to generalize to word frequency trivially. Just in case, I included that information in the answer. I changed the original notation to make these applicable to the bag-of-words model (replacing f with N and v_i with n_i). I added a note to signify this, but if you think it is still misleading I can provide a longer justification in the answer.
Chris Novak

4

The first thing I would do is calculate Shannon's entropy. You can use the R package infotheo, function entropy(X, method="emp"). If you wrap natstobits(H) around it, you will get the entropy of this source in bits.
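A minimal sketch of that suggestion (the count vector is borrowed from another answer for illustration; entropy() expects one observation per token, so the counts are expanded first):

library(infotheo)

A <- c(3, 2, 2, rep(1, 11))              # illustrative word counts
obs <- rep(seq_along(A), times = A)      # one observation per word token
H_nats <- entropy(obs, method = "emp")   # empirical Shannon entropy in nats
natstobits(H_nats)                       # the same entropy expressed in bits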


3

One possible measure of equality you could use is the scaled Shannon entropy. If you have a vector of proportions $\mathbf{p} \equiv (p_1, ..., p_n)$ then this measure is given by:

$$\bar{H}(\mathbf{p}) \equiv \frac{-\sum_{i=1}^{n} p_i \ln p_i}{\ln n}.$$

This is a scaled measure with range $0 \leqslant \bar{H}(\mathbf{p}) \leqslant 1$, with extreme values occurring at the extremes of equality and inequality. Shannon entropy is a measure of information, and the scaled version allows comparison between cases with different numbers of categories. (A short R sketch after the two extreme cases below illustrates the calculation.)

  • Extreme Inequality: All the count is in some category $k$. In this case we have $p_i = \mathbb{I}(i=k)$ and this gives us $\bar{H}(\mathbf{p}) = 0$.

  • Extreme Equality: All the counts are equal over all categories. In this case we have $p_i = 1/n$ and this gives us $\bar{H}(\mathbf{p}) = 1$.
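A minimal R sketch of the scaled entropy, using the count vectors guessed for A and B in an earlier answer (an assumption; the questioner's actual data were in the missing images):

# Scaled Shannon entropy: -sum(p * log(p)) / log(n), ranging from 0
# (all mass in one category) to 1 (all categories equally common).
scaled_entropy <- function(counts) {
  p <- counts / sum(counts)
  -sum(p * log(p)) / log(length(p))
}

A <- c(3, 2, 2, rep(1, 11))
B <- c(9, 2, rep(1, 7))
scaled_entropy(A)   # closer to 1: counts spread fairly evenly
scaled_entropy(B)   # lower: one word dominates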

Licensed under cc by-sa 3.0 with attribution required.