I'm assuming your question comes from the observation that I/O causes significant overhead in your overall analysis. In that case, you can try to overlap I/O with computation.
How well this works depends on how you access the data and on the computation you perform on it. If you can identify a pattern, or if the accesses to different regions of the data are known beforehand, you can prefetch the "next chunks" of data in the background while processing the "current chunks".
As a simple example, if you only traverse your file once and process each line or set of lines, you can divide the stream into chunks of lines (or MBs). Then, at each iteration over the chunks, you can load chunk i+1 while processing chunk i, as in the sketch below.
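A minimal Python sketch of this single-pass pattern, assuming a plain binary file; "data.bin", CHUNK_SIZE, and process() are placeholders for your own file and computation:

    from concurrent.futures import ThreadPoolExecutor

    CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB per chunk; tune to your I/O subsystem

    def process(chunk):
        ...  # placeholder: your per-chunk computation goes here

    with open("data.bin", "rb") as f, ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(f.read, CHUNK_SIZE)      # start loading the first chunk
        while True:
            chunk = future.result()                   # wait for the current chunk
            if not chunk:                             # empty read means end of file
                break
            future = pool.submit(f.read, CHUNK_SIZE)  # load the next chunk ...
            process(chunk)                            # ... while processing this one

Using a single worker thread keeps the reads ordered, so the file offset always advances to the next chunk.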
Your situation may be more complex and need more involved solutions. In any case, the idea is to perform the I/O in the background while the processor has some data to work on. If you give more details on your specific problem, we may be able to take a deeper look into it ;)
---- Extended version, after more details were given ----
I'm not sure I understand the notation but, well, as you said, the idea is an all-to-all interaction. You also mention that the data may fit in RAM. I would then start by measuring the time it takes to load all the data and the time it takes to perform the computation. Now:
- If the fraction of time spent on I/O is low (low as in you don't care about the overhead, whatever that threshold is for you: 0.5%, 2%, 5%, ...), then just use the simple approach: load the data all at once and compute. You will save time for more interesting aspects of your research.
- If you cannot afford the overhead, you may want to look into what Pedro suggested. Keep in mind what Aron Ahmadia mentioned, and test it before going for a full implementation.
- If the previous options are not satisfactory, I would go for an out-of-core implementation [1]. Since it seems that you are performing n^2 computations on n chunks of data, there is hope :) Some pseudocode (assuming the results of your analysis fit in RAM):
    load chunks 1 and 2                         (* preload the first pair *)
    for i = 1 to n-1
        ensure chunk i is in memory
        for j = i+1 to n
            if j < n
                asynchronously load chunk j+1   (* prefetch while computing *)
            wait until chunk j has arrived      (* for i = 1, j = 2, it is already preloaded *)
            compute with chunks i, j

Note: this is still schematic pseudocode; it glosses over how the buffers are reused, which is exactly what double-buffering takes care of.
To implement this, it is common to use so-called double-buffering. Roughly speaking: divide the memory into two workspaces; while data is being loaded in the background into workspace 1, the processor computes with the data in workspace 2. At each iteration, the two workspaces swap roles.
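As a concrete illustration, here is a minimal Python sketch of the double-buffered all-to-all loop above, assuming the data sits in one binary file as n fixed-size chunks; load_chunk(), compute(), and the file path are placeholders for your own code:

    from concurrent.futures import ThreadPoolExecutor

    CHUNK_SIZE = 64 * 1024 * 1024  # bytes per chunk; tune to your system

    def load_chunk(path, k):
        # Read the k-th fixed-size chunk from disk (0-based).
        with open(path, "rb") as f:
            f.seek(k * CHUNK_SIZE)
            return f.read(CHUNK_SIZE)

    def compute(a, b):
        ...  # placeholder: your pairwise computation goes here

    def all_pairs(path, n):
        with ThreadPoolExecutor(max_workers=1) as pool:
            for i in range(n - 1):
                chunk_i = load_chunk(path, i)                  # keep chunk i resident
                future = pool.submit(load_chunk, path, i + 1)  # prefetch its first partner
                for j in range(i + 1, n):
                    chunk_j = future.result()                  # wait for chunk j
                    if j + 1 < n:
                        future = pool.submit(load_chunk, path, j + 1)  # prefetch chunk j+1 ...
                    compute(chunk_i, chunk_j)                  # ... while computing on (i, j)

At any moment only three buffers are alive: chunk i, the current chunk j, and the one being prefetched, which is the two-workspace idea with one extra buffer pinned for i.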
I am sorry I cannot come up with a good reference right now.
[1] An out-of-core algorithm incorporates some mechanism to (efficiently) deal with data residing on disk. They are called out-of-core as opposed to in-core ("in-RAM").
You could also mmap the file into your main code. Many modern operating systems give similar performance between a regular read and mmap, with less complication. (Also, yes, mmap in Python provides a portable interface to the Windows and UNIX memory maps.)
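For reference, a minimal sketch of memory-mapping a file with Python's standard mmap module ("data.bin" is a placeholder):

    import mmap

    with open("data.bin", "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            header = mm[:1024]           # slicing pulls pages in on demand
            first_line = mm.readline()   # the file-like interface works too

The OS then pages data in and out for you, which can replace hand-rolled chunking entirely when access is read-only.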