In a text mining application, one simple approach is to use the tf-idf heuristic to create vectors as compact sparse representations of the documents. This works fine in the batch setting, where the whole corpus is known a priori, since computing the idf requires the whole corpus:
    idf(t, D) = log( |D| / |{d in D : t in d}| )
    tf-idf(t, d, D) = tf(t, d) * idf(t, D)

where t is a term, d is a document, D is the document corpus, and T (not shown) is the dictionary.
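As a concrete baseline for the batch setting, here is a minimal sketch of the computation above (function and variable names are my own; tf is taken as raw count divided by document length, one common convention):

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Batch tf-idf: corpus is a list of token lists, one per document.

    Returns one sparse vector (dict term -> weight) per document.
    """
    n_docs = len(corpus)
    # df[t] = number of documents that contain term t at least once
    df = Counter(t for doc in corpus for t in set(doc))
    idf = {t: math.log(n_docs / df_t) for t, df_t in df.items()}
    vectors = []
    for doc in corpus:
        tf = Counter(doc)
        # sparse representation: only terms actually present in the document
        vectors.append({t: (n / len(doc)) * idf[t] for t, n in tf.items()})
    return vectors

docs = [["cat", "sat", "mat"], ["cat", "hat"], ["dog", "mat"]]
vecs = tf_idf(docs)
```

Note that the idf dictionary is built from the entire corpus before any vector can be produced, which is exactly the dependency that makes the streaming case awkward.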
However, new documents arrive over time. One option is to keep using the existing idf values until some number of new documents have arrived, and then recompute. But this seems rather inefficient. Does anyone know of an incremental update scheme that (perhaps approximately) converges to the value that would be obtained if all the data had been seen up front? Or, alternatively, is there another measure that captures the same notion but can be computed incrementally?
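One observation worth making: the sufficient statistics for idf (the document count N and the per-term document frequencies df[t]) can themselves be updated incrementally in O(unique terms per document), so an exact, always-current idf is cheap; what goes stale is only the stored tf-idf weights, which can be avoided by storing raw tf vectors and applying idf at lookup time. A minimal sketch of that idea (class name and the +1 smoothing are my own choices, not from the question):

```python
import math
from collections import Counter

class IncrementalIdf:
    """Maintain document frequencies incrementally; idf is exact at any time.

    add_document() is O(number of unique terms in the document); idf(t) is
    computed on demand, so the corpus never needs to be re-scanned.
    """
    def __init__(self):
        self.n_docs = 0
        self.df = Counter()

    def add_document(self, tokens):
        self.n_docs += 1
        self.df.update(set(tokens))  # count each term once per document

    def idf(self, term):
        # +1 smoothing (an assumption) avoids division by zero for unseen terms
        return math.log((1 + self.n_docs) / (1 + self.df[term]))

inc = IncrementalIdf()
for doc in [["cat", "sat"], ["cat", "hat"], ["dog"]]:
    inc.add_document(doc)
```

After the three documents above, idf("cat") is log(4/3), and any term never seen gets the maximal value log(4), matching what a full recomputation over the same corpus would give.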
There is also the related question of whether tf-idf remains a good measure over time. Since the idf captures the notion of corpus-wide word frequency, it is conceivable that older documents in the corpus skew the statistics (say, for example, if my corpus includes over 100 years of journal articles), as the frequencies of different words change over time. In this case it might actually be sensible to throw out older documents as new ones come in, in effect using a sliding-window idf. Conceivably, one could also store all previous idf vectors as new ones are calculated, and then if we wanted to retrieve documents from, say, 1920-1930, we could use the idf calculated from documents in that date range. Does this approach make sense?
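The sliding-window variant can be kept incremental as well: retain the term set of each document currently in the window, and decrement the document frequencies when a document falls out. A sketch under those assumptions (class and parameter names are hypothetical):

```python
import math
from collections import Counter, deque

class SlidingWindowIdf:
    """idf computed over only the most recent `window` documents."""
    def __init__(self, window):
        self.window = window
        self.docs = deque()   # term sets of the documents in the window
        self.df = Counter()

    def add_document(self, tokens):
        terms = set(tokens)
        self.docs.append(terms)
        self.df.update(terms)
        if len(self.docs) > self.window:
            # evict the oldest document and remove its contribution to df
            old = self.docs.popleft()
            self.df.subtract(old)

    def idf(self, term):
        # +1 smoothing (an assumption) guards against unseen terms
        return math.log((1 + len(self.docs)) / (1 + self.df[term]))

win = SlidingWindowIdf(window=2)
for doc in [["cat"], ["dog"], ["cat"]]:
    win.add_document(doc)
```

The memory cost is one term set per document in the window, which is what allows exact eviction; the date-range idea in the question is the same mechanism with the window defined by timestamps instead of a fixed count.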
Edit: There is a separate but related issue about the dictionary T. As time evolves, new dictionary terms will appear that didn't occur before, so T will need to grow, and hence so will the length of the tf-idf vectors. It seems like this wouldn't be a problem, as zeros could simply be appended to the old vectors.
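Indeed, if the vectors are stored sparsely (keyed by term rather than by position), the growing dictionary requires no rewriting at all: terms added after a vector was built implicitly have weight zero, and a dense view can be materialized against whatever the current vocabulary is. A small illustration (helper name is my own):

```python
def to_dense(sparse_vec, vocabulary):
    """Align a sparse {term: weight} vector to a (possibly grown) vocabulary.

    Terms that entered the dictionary after the vector was built simply
    map to 0.0, so old vectors never need to be updated in place.
    """
    return [sparse_vec.get(t, 0.0) for t in vocabulary]

old = {"cat": 0.4, "mat": 0.1}   # built when the vocabulary was ["cat", "mat"]
vocab = ["cat", "mat", "blog"]   # "blog" appeared later
dense = to_dense(old, vocab)
```

This is the "append zeros" idea from the edit, just deferred to read time instead of applied eagerly to every stored vector.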