I'm assuming your question comes from the observation that I/O causes significant overhead in your overall analysis. In that case, you can try to overlap I/O with computation.
How well this works depends on how you access the data and on the computation you perform on it. If you can identify a pattern, or if the accesses to different regions of the data are known beforehand, you can prefetch the "next chunks" of data in the background while processing the "current chunks".
As a simple example, if you only traverse your file once and process each line or set of lines, you can divide the stream into chunks of lines (or MBs). Then, at each iteration over the chunks, you can load chunk i+1 while processing chunk i, as in the sketch below.
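A minimal Python sketch of this single-pass pattern, assuming a plain binary file; "data.bin", CHUNK_SIZE, and process() are placeholders for your own file and computation:

    from concurrent.futures import ThreadPoolExecutor

    CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB per chunk; tune to your I/O subsystem

    def process(chunk):
        ...  # placeholder: your per-chunk computation goes here

    with open("data.bin", "rb") as f, ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(f.read, CHUNK_SIZE)      # start loading the first chunk
        while True:
            chunk = future.result()                   # wait for the current chunk
            if not chunk:                             # empty read means end of file
                break
            future = pool.submit(f.read, CHUNK_SIZE)  # load the next chunk ...
            process(chunk)                            # ... while processing this one

Using a single worker thread keeps the reads ordered, so the file offset always advances to the next chunk.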
Your situation may be more complex and need more involved solutions. In any case, the idea is to perform the I/O in the background while the processor has some data to work on. If you give more details on your specific problem, we may be able to take a deeper look into it ;)
---- Extended version, after more details were given ----
I'm not sure I understand the notation but, well, as you said, the idea is an all-to-all interaction. You also mention that the data may fit in RAM. I would then start by measuring the time it takes to load all the data and the time it takes to perform the computation. Now:
- If the fraction of time spent on I/O is low (low as in you don't care about the overhead, whatever that threshold is for you: 0.5%, 2%, 5%, ...), then just use the simple approach: load the data all at once and compute. You will save time for more interesting aspects of your research.
- If you cannot afford the overhead, you may want to look into what Pedro suggested. Keep in mind what Aron Ahmadia mentioned, and test it before going for a full implementation.
- If the previous options are not satisfactory, I would go for an out-of-core implementation [1]. Since it seems that you are performing n^2 computations on n chunks of data, there is hope :) Some pseudocode (assuming the results of your analysis fit in RAM):
    load chunks 1 and 2                         (* preload the first pair *)
    for i = 1 to n-1
        ensure chunk i is in memory
        for j = i+1 to n
            if j < n
                asynchronously load chunk j+1   (* prefetch while computing *)
            wait until chunk j has arrived      (* for i = 1, j = 2, it is already preloaded *)
            compute with chunks i, j

Note: this is still schematic pseudocode; it glosses over how the buffers are reused, which is exactly what double-buffering takes care of.
To implement this, it is common to use so-called double-buffering. Roughly speaking: divide the memory into two workspaces; while data is being loaded in the background into workspace 1, the processor computes with the data in workspace 2. At each iteration, the two workspaces swap roles.
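As a concrete illustration, here is a minimal Python sketch of the double-buffered all-to-all loop above, assuming the data sits in one binary file as n fixed-size chunks; load_chunk(), compute(), and the file path are placeholders for your own code:

    from concurrent.futures import ThreadPoolExecutor

    CHUNK_SIZE = 64 * 1024 * 1024  # bytes per chunk; tune to your system

    def load_chunk(path, k):
        # Read the k-th fixed-size chunk from disk (0-based).
        with open(path, "rb") as f:
            f.seek(k * CHUNK_SIZE)
            return f.read(CHUNK_SIZE)

    def compute(a, b):
        ...  # placeholder: your pairwise computation goes here

    def all_pairs(path, n):
        with ThreadPoolExecutor(max_workers=1) as pool:
            for i in range(n - 1):
                chunk_i = load_chunk(path, i)                  # keep chunk i resident
                future = pool.submit(load_chunk, path, i + 1)  # prefetch its first partner
                for j in range(i + 1, n):
                    chunk_j = future.result()                  # wait for chunk j
                    if j + 1 < n:
                        future = pool.submit(load_chunk, path, j + 1)  # prefetch chunk j+1 ...
                    compute(chunk_i, chunk_j)                  # ... while computing on (i, j)

At any moment only three buffers are alive: chunk i, the current chunk j, and the one being prefetched, which is the two-workspace idea with one extra buffer pinned for i.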
I am sorry I cannot come up with a good reference right now.
[1] An out-of-core algorithm incorporates some mechanism to (efficiently) deal with data residing on disk. They are called out-of-core as opposed to in-core ("in-RAM").
You could also mmap the file into your main code. Many modern operating systems give similar performance between a regular read and mmap, with less complication. (Also, yes, mmap in Python provides a portable interface to the Windows and UNIX memory maps.)
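For reference, a minimal sketch of memory-mapping a file with Python's standard mmap module ("data.bin" is a placeholder):

    import mmap

    with open("data.bin", "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            header = mm[:1024]           # slicing pulls pages in on demand
            first_line = mm.readline()   # the file-like interface works too

The OS then pages data in and out for you, which can replace hand-rolled chunking entirely when access is read-only.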