Incremental IDF (Inverse Document Frequency)



In a text mining application, one simple approach is to use the tfidf heuristic to create vectors as compact sparse representations of the documents. This is fine for the batch setting, where the whole corpus is known a priori, since computing the idf requires the whole corpus:

$$\mathrm{idf}(t) = \log \frac{|D|}{|\{d : t \in d\}|}$$

where t is a term, d is a document, D is the document collection, and T (not shown) is the dictionary.
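For reference, the batch computation can be sketched as below. This is only a minimal illustration of the formula above, assuming each document is already tokenized into a list of terms; the function name `batch_idf` and the toy corpus are made up for the example.

```python
import math
from collections import Counter

def batch_idf(corpus):
    """Compute idf(t) = log(|D| / |{d : t in d}|) for every term in the corpus.

    `corpus` is a list of documents, each given as a list of tokens.
    """
    n_docs = len(corpus)                 # |D|
    doc_freq = Counter()                 # |{d : t in d}| per term
    for doc in corpus:
        doc_freq.update(set(doc))        # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

# Hypothetical toy corpus, purely for illustration.
corpus = [
    ["incremental", "idf", "update"],
    ["batch", "idf"],
    ["sliding", "window", "idf"],
]
idf = batch_idf(corpus)
```

A term that occurs in every document (here "idf") gets an idf of zero, while a term seen in a single document gets log |D|.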

However, new documents typically arrive over time. One option is to keep using the existing idf until a certain number of new documents have been received, and then recompute it. However, this seems rather inefficient. Does anyone know of an incremental update scheme that (possibly approximately) converges to the value that would be obtained if all the data were seen in advance? Or alternatively, is there another measure that captures the same notion but can be computed incrementally?

There is also a related question of whether the idf remains a good measure over time. Since the idf captures the notion of the corpus word frequency, it is conceivable that older documents in the corpus (say, for example, that my corpus includes over 100 years of journal articles) become less representative, as the frequencies of different words change over time. In this case it might actually be sensible to throw out older documents when new ones come in, in effect using a sliding window idf. Conceivably, one could also store all previous idf vectors as new ones are calculated, and then if we wanted to retrieve documents from, say, 1920-1930, we could use the idf calculated from documents in that date range. Does this approach make sense?
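To make the sliding window idea concrete, here is a rough sketch. It assumes a window defined by a fixed number of most recent documents (a time-based window would work the same way); the class `SlidingWindowIdf` and its methods are hypothetical names chosen for the example.

```python
import math
from collections import Counter, deque

class SlidingWindowIdf:
    """idf over the most recent `window_size` documents (illustrative sketch)."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.window = deque()      # term sets of the documents currently in the window
        self.doc_freq = Counter()  # number of windowed documents containing each term

    def add(self, tokens):
        terms = set(tokens)
        self.window.append(terms)
        self.doc_freq.update(terms)
        if len(self.window) > self.window_size:
            self._evict_oldest()

    def _evict_oldest(self):
        # Undo the contribution of the oldest document.
        for term in self.window.popleft():
            self.doc_freq[term] -= 1
            if self.doc_freq[term] == 0:
                del self.doc_freq[term]

    def idf(self, term):
        df = self.doc_freq.get(term, 0)
        if df == 0:
            return None  # term does not occur in the window, so idf is undefined
        return math.log(len(self.window) / df)

w = SlidingWindowIdf(window_size=2)
w.add(["a", "b"])
w.add(["a"])
w.add(["c"])  # evicts the first document
```

After the third call the window holds only the last two documents, so "b" no longer has a defined idf.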

Edit: There is a separate but related issue about the dictionary T. As time evolves, there will be new dictionary terms that didn't appear before, so |T| will need to grow, and hence the length of the idf vector. It seems like this wouldn't be a problem, as zeros could be appended to old idf vectors.


Stupid question: is it a problem to store the denominator for each t? What does the ratio of |t| to |d| look like (in general)?
steffen

Sorry maybe the equation isn't clear - idf(t) is the inverse document frequency of term t, rather than at time t. So at time t you would have a vector of length |T|, i.e. the size of the dictionary (which also may change). I'll make edits to that effect.
tdc

I understood the equation. My question was: If storing the dictionary is no problem then: Instead of storing |T| idfs one stores |T| denominators (of the equation) + number of documents. Incremental update is no problem then and idf is calculated on the fly. I have the feeling that I have overlooked something.
steffen

So you mean something like this: given a new document d, if we have stored the value |{d : t ∈ d}|, we simply add one to the denominator for each t with t ∈ d?
tdc

Precisely. Is this feasible?
steffen

Answers:



OK, thanks to Steffen for the useful comments. I guess the answer is quite simple in the end. As he says, all we need to do is store the current denominator (call it z):

$$z(t) = |\{d : t \in d\}|$$

Now given a new document d, we update the denominator simply by:

$$z(t) = z(t) + \begin{cases} 1 & \text{if } t \in d \\ 0 & \text{otherwise} \end{cases}$$

We would then have to recalculate the tfidf based on the new idf vector.

Similarly, to remove an old document, we decrement the denominator z(t) for each term t it contains, and the document count |D| drops by one.
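A minimal sketch of this bookkeeping might look like the following. The class name `IncrementalIdf` and its methods are hypothetical, and the sketch assumes documents arrive as token lists and that a removed document is one that was previously added. Note that the document count |D| has to be tracked alongside z, and that keeping z in a dictionary lets new terms appear without resizing anything.

```python
import math
from collections import Counter

class IncrementalIdf:
    """Stores |D| and z(t) = |{d : t in d}|; idf is derived on demand."""

    def __init__(self):
        self.n_docs = 0     # |D|
        self.z = Counter()  # z(t), kept sparse so the dictionary can grow

    def add_document(self, tokens):
        self.n_docs += 1
        for term in set(tokens):      # each term counts once per document
            self.z[term] += 1

    def remove_document(self, tokens):
        # Assumes this document was added earlier; no check is performed.
        self.n_docs -= 1
        for term in set(tokens):
            self.z[term] -= 1
            if self.z[term] == 0:
                del self.z[term]

    def idf(self, term):
        df = self.z.get(term, 0)
        if df == 0:
            return None  # unseen term, idf undefined
        return math.log(self.n_docs / df)

model = IncrementalIdf()
model.add_document(["a", "b"])
model.add_document(["a", "c"])
idf_b_before = model.idf("b")
model.remove_document(["a", "b"])
```

Each update touches only the terms of the document being added or removed, so the cost is proportional to the document length rather than to the corpus size.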

This does mean that we either have to store the entire tf matrix as well as the tfidf matrix (doubling the memory requirements), or we have to compute the tfidf scores when needed (increasing computational costs). I can't see any way round that.
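The second option, computing tfidf scores only when needed, could be sketched as follows. It assumes the tf matrix is stored as one sparse row per document holding raw term counts (other tf weightings would slot in the same way); the function name `tfidf_on_the_fly` and the sample data are illustrative only.

```python
import math
from collections import Counter

def tfidf_on_the_fly(tf_row, z, n_docs):
    """Score one document from its stored tf row and the current z vector.

    tf_row: mapping term -> raw count in the document
    z:      mapping term -> number of documents containing the term
    n_docs: current number of documents |D|
    """
    return {term: tf * math.log(n_docs / z[term]) for term, tf in tf_row.items()}

# Hypothetical stored tf rows for two documents.
tf_rows = [Counter(["a", "a", "b"]), Counter(["a", "c"])]

# Denominators z(t), built once and then maintained incrementally.
z = Counter()
for row in tf_rows:
    z.update(row.keys())
n_docs = len(tf_rows)

scores = tfidf_on_the_fly(tf_rows[0], z, n_docs)
```

Only the tf rows and the z vector are kept in memory; the tfidf vector for a document is rebuilt whenever it is requested, so it always reflects the latest idf.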

For the second part of the question, about the evolution of idf vectors over time, it seems that we can use the above method, and store a set of "landmark" z vectors (denominators) for different date ranges (or perhaps content subsets). Of course z is a dense vector of the length of the dictionary so storing a lot of these will be memory intensive; however this is probably preferable to recomputing idf vectors when needed (which would again require storing the tf matrix as well or instead).
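One possible way to organise such landmark vectors is sketched below. It relies on the fact that document frequencies are additive over disjoint sets of documents, so per-period z vectors can be summed to cover any union of periods. The class `LandmarkIdf`, the use of a decade start year as the period key, and the sparse `Counter` storage (rather than a dense vector) are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

class LandmarkIdf:
    """Keeps one z vector and one document count per period (illustrative)."""

    def __init__(self):
        self.z_by_period = defaultdict(Counter)  # period -> z vector
        self.n_by_period = Counter()             # period -> number of documents

    def add_document(self, period, tokens):
        self.n_by_period[period] += 1
        self.z_by_period[period].update(set(tokens))

    def idf(self, term, periods):
        # Sum counts over the requested periods, skipping unknown ones.
        n_docs = sum(self.n_by_period[p] for p in periods)
        df = sum(self.z_by_period[p][term] for p in periods if p in self.z_by_period)
        if df == 0:
            return None  # term absent from the selected periods
        return math.log(n_docs / df)

lm = LandmarkIdf()
lm.add_document(1920, ["ether", "wave"])
lm.add_document(1920, ["wave"])
lm.add_document(2000, ["wave", "quantum"])

idf_ether_1920s = lm.idf("ether", [1920])
idf_wave_all = lm.idf("wave", [1920, 2000])
```

Storing the raw counts rather than the idf values themselves is what makes the periods combinable; idf values from different date ranges cannot simply be averaged.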

Licensed under cc by-sa 3.0 with attribution required.