Incremental IDF (Inverse Document Frequency)

In a text mining application, one simple approach is to use the $tf-idf$ heuristic to create vectors as compact sparse representations of the documents. This is fine for the batch setting, where the whole corpus is known a priori, as the $idf$ requires the whole corpus:

$\mathrm{idf}(t) = \log \frac{|D|}{|\{d: t \in d\}|}$

where $t$ is a term, $d$ is a document, $D$ is the document corpus, and $T$ (not shown) is the dictionary.

However, new documents typically arrive over time. One option is to keep using the existing $idf$ until a certain number of new documents have been received, and then recompute it. However, this seems rather inefficient. Does anyone know of an incremental update scheme that converges (possibly approximately) to the value that would be obtained if all the data had been seen in advance? Or alternatively, is there another measure that captures the same notion but can be computed incrementally?

There is also a related question of whether the $idf$ remains a good measure over time. Since the $idf$ captures the notion of the corpus word frequency, it is conceivable that older documents in the corpus become less representative (say, for example, that my corpus includes over 100 years of journal articles), as the frequencies of different words change over time. In this case it might actually be sensible to throw out older documents when new ones come in, in effect using a sliding window $idf$. Conceivably, one could also store all previous $idf$ vectors as new ones are calculated, and then if we wanted to retrieve documents from, say, 1920-1930, we could use the $idf$ calculated from documents in that date range. Does this approach make sense?

Edit: There is a separate but related issue about the dictionary $T$ . As time evolves, there will be new dictionary terms that didn't appear before, so $|T|$ will need to grow, and hence the length of the $idf$ vector. It seems like this wouldn't be a problem, as zeros could be appended to old $idf$ vectors.

time-series text-mining

— tdc

Stupid question: is it a problem to store the denominator for each $t$? What does the ratio of |t| to |d| look like (in general)?

— steffen

Sorry, maybe the equation isn't clear: $idf(t)$ is the inverse document frequency of term $t$, rather than at time $t$. So at time $t$ you would have a vector of length $|T|$, i.e. the size of the dictionary (which also may change). I'll make edits to that effect.

— tdc

I understood the equation. My question was: If storing the dictionary is no problem then: Instead of storing |T| idfs one stores |T| denominators (of the equation) + number of documents. Incremental update is no problem then and idf is calculated on the fly. I have the feeling that I have overlooked something.

— steffen

So you mean something like: given a new document $d^*$, if we have the value $\{d : t \in d\}$, we simply add one to the denominator for $\{t : t \in d^*\}$?

— tdc

Precisely. Is this feasible?

— steffen

OK, thanks to Steffen for the useful comments. I guess the answer is quite simple in the end. As he says, all we need to do is store the current denominator (call it $z$):

$z(t) = |\{d:t\in d\}|$

Now given a new document $d^*$ , we update the denominator simply by:

$z^*(t) = z(t) + \left\{ \begin{array}{ll} 1 & \mbox{if}\; {t\in d^*} \\ 0 & \mbox{otherwise} \end{array} \right.$

We would then have to recalculate the $tf-idf$ based on the new $idf$ vector.

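Similarly, to remove an old document, we decrement the denominator $z(t)$ for each term it contains. In both cases the document count $|D|$ in the numerator also changes by one, so it has to be stored alongside $z$.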
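The add and remove updates can be sketched in a few lines of Python. This is only a minimal illustration, not a tuned implementation: the class name `IncrementalIdf` and its methods are made up for this example, documents are assumed to arrive already tokenised as lists of strings, and `remove` assumes the same document was added earlier.

```python
import math
from collections import Counter


class IncrementalIdf:
    """Keeps the denominators z(t) and the corpus size |D| (hypothetical helper)."""

    def __init__(self):
        self.z = Counter()   # z[t] = number of documents containing term t
        self.n_docs = 0      # |D|

    def add(self, tokens):
        # Each distinct term counts once per document, whatever its tf.
        self.z.update(set(tokens))
        self.n_docs += 1

    def remove(self, tokens):
        # Assumes this exact document was added earlier.
        for t in set(tokens):
            self.z[t] -= 1
            if self.z[t] == 0:
                del self.z[t]  # term no longer occurs in the corpus
        self.n_docs -= 1

    def idf(self, term):
        # Computed on demand; undefined for terms not present in any document.
        if term not in self.z:
            raise KeyError(term)
        return math.log(self.n_docs / self.z[term])


index = IncrementalIdf()
index.add(["cat", "sat", "mat"])
index.add(["cat", "ran"])
rare = index.idf("ran")    # log(2 / 1)
common = index.idf("cat")  # log(2 / 2)
```

Storing $z$ in a `Counter` also sidesteps the growing-dictionary issue from the question, since a new term simply becomes a new key.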

This does mean that we either have to store the entire $tf$ matrix as well as the $tf-idf$ matrix (doubling the memory requirements), or we have to compute the $tf-idf$ scores when needed (increasing computational costs). I can't see any way round that.
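The second option (keep only $tf$ and compute $tf-idf$ when needed) might look roughly like the sketch below. The function name `tfidf_on_demand` and the toy corpus are invented for illustration, and raw counts are assumed for $tf$; other weightings would slot in the same way.

```python
import math
from collections import Counter


def tfidf_on_demand(tf, z, n_docs):
    """Return the sparse tf-idf vector of one document as a dict.

    tf     : Counter of raw term counts for the document (stored)
    z      : mapping term -> number of documents containing it (stored)
    n_docs : current corpus size |D|
    """
    return {t: count * math.log(n_docs / z[t]) for t, count in tf.items()}


# Toy corpus of two documents (hypothetical data).
docs = [["cat", "sat", "cat"], ["cat", "ran"]]
tfs = [Counter(d) for d in docs]
z = Counter(t for d in docs for t in set(d))

# Scores for the first document, using whatever z and |D| are current.
vec = tfidf_on_demand(tfs[0], z, len(docs))
```

The cost per lookup is proportional to the number of distinct terms in the document, so the trade-off is one pass over a sparse vector per query against a second stored matrix.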

For the second part of the question, about the evolution of $idf$ vectors over time, it seems that we can use the above method, and store a set of "landmark" $z$ vectors (denominators) for different date ranges (or perhaps content subsets). Of course $z$ is a dense vector of the length of the dictionary so storing a lot of these will be memory intensive; however this is probably preferable to recomputing $idf$ vectors when needed (which would again require storing the $tf$ matrix as well or instead).
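One possible way to organise the landmark vectors is sketched below. It departs slightly from the description above, and these are my assumptions rather than anything fixed by the method: documents are bucketed by fixed-width year ranges (decades by default), each bucket keeps a sparse `Counter` instead of a dense vector, and the $idf$ for a date range is obtained by summing the buckets it covers. The class name `LandmarkIdf` is hypothetical.

```python
import math
from collections import Counter, defaultdict


class LandmarkIdf:
    """One z vector and one document count per date bucket (hypothetical helper)."""

    def __init__(self, bucket_years=10):
        self.bucket_years = bucket_years
        self.z = defaultdict(Counter)  # bucket start year -> z vector
        self.n_docs = Counter()        # bucket start year -> document count

    def _bucket(self, year):
        return year - year % self.bucket_years

    def add(self, tokens, year):
        b = self._bucket(year)
        self.z[b].update(set(tokens))
        self.n_docs[b] += 1

    def idf(self, term, start_year, end_year):
        """idf over all buckets whose start year lies in [start_year, end_year)."""
        buckets = [b for b in self.n_docs if start_year <= b < end_year]
        n = sum(self.n_docs[b] for b in buckets)
        df = sum(self.z[b][term] for b in buckets)
        if df == 0:
            raise KeyError(term)  # term absent from the requested range
        return math.log(n / df)


landmarks = LandmarkIdf()
landmarks.add(["ether", "wave"], 1921)
landmarks.add(["wave"], 1925)
landmarks.add(["quantum", "wave"], 1955)
idf_1920s = landmarks.idf("ether", 1920, 1930)  # uses only the 1920s bucket
```

Because the buckets are additive, a sliding window amounts to dropping the oldest bucket, at the price of date ranges being restricted to bucket boundaries.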

— tdc