Are there any famous problems/algorithms in scientific computing that cannot be sped up by parallelization?


27

Are there any famous problems/algorithms in scientific computing that cannot be sped up by parallelization? Reading books about CUDA gives the impression that most things can be.


Binary search cannot be sped up (by any significant factor), even when the memory hierarchy is taken into account.


3
@Anycorn No, left-looking classical Gram-Schmidt and right-looking modified Gram-Schmidt work well in parallel. There are many parallel QR algorithms, including the recently popular TSQR.
Jed Brown

@Raphael: I think it is possible to speed up binary search by a factor of log(n), with n = # processors. Instead of splitting the search interval into two parts and checking where to continue, split the interval into n parts. Maybe there are more efficient ways, I don't know.
miracle173
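A minimal sequential sketch of the idea in that comment (the p probes in each round are independent, so p processors could evaluate them simultaneously; the function and numbers are only illustrative):

```python
def p_ary_search(a, x, p):
    """Rightmost index lo with a[lo] <= x (a sorted, x >= a[0]), probing p
    pivots per round. The probes are independent, so p processors could do
    them at once; rounds ~ log(N)/log(p+1), a ~log(p+1)x speedup over
    plain binary search."""
    lo, hi = 0, len(a)
    rounds = 0
    while hi - lo > 1:
        rounds += 1
        # p roughly evenly spaced pivots inside (lo, hi)
        pivots = [lo + (hi - lo) * (i + 1) // (p + 1) for i in range(p)]
        for m in pivots:
            if m <= lo or m >= hi:
                continue
            if a[m] <= x:
                lo = m   # x lies at or to the right of this pivot
            else:
                hi = m   # x lies to the left of this pivot
                break
    return lo, rounds

print(p_ary_search(list(range(1024)), 700, p=1))  # (700, 10): ordinary binary search
print(p_ary_search(list(range(1024)), 700, p=7))  # (700, 4): roughly log2(8) = 3x fewer rounds
```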

Answers:


32

The fundamental issue is the length of the critical path, C, relative to the total amount of computation, T. If C is proportional to T, then parallelism offers at best a constant speedup. If C is asymptotically smaller than T, there is room for more parallelism as the problem size increases. For algorithms in which T is polynomial in the input size N, the best case is C ~ log T, because very few useful quantities can be computed in less than logarithmic time.

Examples

  • C = T for a tridiagonal solve using the standard algorithm. Every operation depends on the previous operation being completed, so there is no opportunity for parallelism. Tridiagonal problems can be solved in logarithmic time on a parallel computer using a nested-dissection direct solve, multilevel domain decomposition, or multigrid with basis functions constructed using harmonic extension (these three algorithms are distinct in multiple dimensions, but can coincide exactly in 1D).
  • A dense lower-triangular solve with an m×m matrix has T = N = O(m^2), but the critical path is only C = m, so some parallelism can be beneficial.
  • Multigrid and FMM both have T = N, with a critical path of length C = log T.
  • Explicit wave propagation for a time τ on a regular mesh of the domain (0,1)^d requires k = τ/Δt ~ τ N^(1/d) time steps (for stability), so the critical path is at least C = k. The total amount of work is T = k N = τ N^((d+1)/d). The maximum useful number of processors is P = T/C = N; the remaining factor N^(1/d) cannot be recovered with increased parallelism.
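To connect these quantities, a rough sketch (with made-up sizes) of the work/critical-path bound: with p processors the runtime is at least max(C, T/p), so the speedup is at most T / max(C, T/p).

```python
def max_speedup(T, C, p):
    """Upper bound on parallel speedup for total work T, critical path C,
    and p processors (the runtime is at least max(C, T/p))."""
    return T / max(C, T / p)

m = 1000
print(max_speedup(T=m, C=m, p=64))         # standard tridiagonal solve: 1.0
print(max_speedup(T=m * m, C=m, p=64))     # dense triangular solve: 64.0
print(max_speedup(T=m * m, C=m, p=10**6))  # ...but never more than T/C = m = 1000
```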

Formal complexity

The complexity class NC characterizes those problems that can be solved efficiently in parallel (e.g., in polylogarithmic time). It is not known whether NC = P, but it is widely conjectured to be false. If that is indeed the case, then P-complete characterizes those problems that are "inherently sequential" and cannot be sped up significantly by parallelism.


13

To give this a theoretical angle, NC is defined as the complexity class of problems solvable in O(log^c n) time on O(n^k) parallel processors. It is still unknown whether P = NC (although most people suspect it's not), where P is the set of problems solvable in polynomial time. The "hardest" problems to parallelize are known as P-complete problems, in the sense that every problem in P can be reduced to a P-complete problem via NC reductions. If you show that a single P-complete problem is in NC, you prove that P = NC (although that's probably false, as mentioned above).

So any problem that is P-complete would intuitively be hard to parallelize (although big speedups are still possible). A P-complete problem for which we don't have even very good constant factor speedups is Linear Programming (see this comment on OR-exchange).


9

Start by grokking Amdahl's Law. Basically, anything with a large number of serial steps will benefit insignificantly from parallelism. A few examples include parsing, regex matching, and most high-ratio compression.
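As a quick illustration of the law (a minimal sketch; the numbers are made up), the achievable speedup with p processors is capped by the serial fraction of the work:

```python
def amdahl_speedup(serial_fraction, p):
    """Amdahl's law: speedup with p processors when a fraction
    `serial_fraction` of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

# A 10% serial fraction caps the speedup at 10x, no matter how many processors:
print(amdahl_speedup(0.10, 16))      # ~6.4
print(amdahl_speedup(0.10, 10**6))   # ~10.0
```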

Aside from that, the key issue is often a bottleneck in memory bandwidth. In particular, on most GPUs the theoretical flop rate vastly outstrips the rate at which floating-point numbers can be delivered to the ALUs, so algorithms with low arithmetic intensity (flops per memory access) will spend the vast majority of their time waiting on RAM.
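A rough roofline-style estimate of that effect (a sketch with hypothetical hardware numbers):

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline model: performance is capped either by peak compute or by
    how fast operands can be streamed from memory."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)

# Hypothetical GPU: 10 TFLOP/s peak, 500 GB/s memory bandwidth.
# A streaming kernel doing 0.25 flops per byte is firmly memory-bound:
print(attainable_gflops(10_000, 500, 0.25))  # 125 GFLOP/s, about 1% of peak
```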

Lastly, any time a piece of code requires heavy branching, it is unlikely to get good performance, as ALUs typically far outnumber the control logic.

In conclusion, a really simple example of something that would be hard to speed up on a GPU is counting the number of zeros in an array of integers: you may have to branch often, perform at most one operation (an increment) when you do find a zero, and make at least one memory fetch per operation.

An example free of the branching problem is to compute a vector which is the cumulative sum of another vector. ( [1,2,1] -> [1,3,4] )

I don't know if these count as "famous" but there is certainly a large number of problems that parallel computing will not help you with.


3
The "branching free example" you gave is the prefix-sum, which actually has a good parallel algorithm: http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html Calculating the number of zeros should be efficient for similar reasons. There is no way around arithmetic intensity, though...
Max Hutchinson
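For reference, a minimal sketch of the logarithmic-depth idea behind the scan in the linked chapter (the naive Hillis–Steele form; the chapter's work-efficient version differs, and a real GPU kernel looks nothing like this):

```python
def inclusive_scan(x):
    """Hillis–Steele inclusive prefix sum: about log2(n) rounds, and within
    each round every element update is independent, so it parallelizes well."""
    x = list(x)
    n = len(x)
    step = 1
    while step < n:
        # all of these additions could be performed simultaneously
        x = [x[i] + (x[i - step] if i >= step else 0) for i in range(n)]
        step *= 2
    return x

print(inclusive_scan([1, 2, 1]))  # [1, 3, 4]
```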

Cool. I stand corrected on that one.
meawoppl

8

The (famous) fast marching method for solving the Eikonal equation cannot be sped up by parallelization. There are other methods (for example fast sweeping methods) for solving the Eikonal equation that are more amenable to parallelization, but even here the potential for (parallel) speedup is limited.

The problem with the Eikonal equation is that the flow of information depends on the solution itself. Loosely speaking, the information flows along the characteristics (i.e. light rays in optics), but the characteristics depend on the solution itself. And the flow of information for the discretized Eikonal equation is even worse, requiring additional approximations (such as those implicitly present in fast sweeping methods) if any parallel speedup is desired.

To see the difficulties for parallelization, imagine a nice labyrinth like in some of the examples on Sethian's webpage. The number of cells on the shortest path through the labyrinth (probably) is a lower bound for the minimal number of steps/iterations of any (parallel) algorithm solving the corresponding problem.

(I write "(probably) is", because lower bounds are notoriously difficult to prove, and often require some reasonable assumptions on the operations used by an algorithm.)


Nice example, but I do not believe your claimed lower bound. In particular, multigrid methods can be used to solve the eikonal equation. As with multigrid for high frequency Helmholtz, the challenges are mainly in designing suitable coarse spaces. In the case of a labyrinth, a graph aggregation strategy should be effective, with the coarse representation determined by solving local (thus independent) problems for segments of the labyrinth.
Jed Brown

In general, when multigrid methods do well, it means that the granularity of the problem is lower than the discretization, and a disproportionate "amount of correct answer" is coming from the coarse solve step. Just an observation, but the lower bound on that sort of thing is tricky!
meawoppl

@JedBrown From a practical perspective, multigrid for high frequency Helmholtz is quite challenging, contrary to what your comment seems to imply. And using multigrid for the eikonal equation is "uncommon", to say the least. But I see your "theoretical" objection against the suggested lower bound: Time offsets from various points inside the labyrinth can be computed before the time to reach these points is known, and added in parallel after the missing information becomes available. But in practice, general purpose parallel eikonal solvers are happy if they actually come close to the bound.
Thomas Klimpel

I didn't mean to imply that it was easy, the wave-ray coarse spaces are indeed very technical. But, I think we agree that there is already opportunity for parallelism in open regions, while in narrow "labyrinths" (which expose very little parallelism with standard methods), the upscaling problem is more tractable.
Jed Brown

@JedBrown Slide 39 of www2.ts.ctw.utwente.nl/venner/PRESENTATIONS/MSc_Verburg.pdf (from 2010) says things like "Extend the solver from 2D to 3D" and "Adapt method to problems with strongly varying wavenumbers". So wave-ray multigrid may be promising, but "not yet mature" seems more appropriate than "very technical" for describing its current issues. And it is not really a high frequency Helmholtz solver (because it is a "full wave" solver). There are other "sufficiently mature" multigrid Helmholtz solvers ("full wave" solvers), but even these are still "active research".
Thomas Klimpel

1

Another class of problems that is hard to parallelize in practice consists of problems sensitive to rounding errors, where numerical stability is achieved by serialization.

Consider, for example, the Gram–Schmidt process and its serial modification. The algorithm works with vectors, so you might use parallel vector operations, but that does not scale well. If the number of vectors is large and the vector size is small, using parallel classical Gram–Schmidt with reorthogonalization might be stable and faster than a single modified Gram–Schmidt pass, although it involves doing several times more work.
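For illustration, a minimal NumPy sketch of the two variants (names and setup are just illustrative): in classical GS the inner dot products for a column all reference the original column and are therefore independent, while in modified GS each inner update depends on the previous one.

```python
import numpy as np

def classical_gram_schmidt(A):
    """Classical GS: the projections for column j all use the original
    column A[:, j], so the inner dot products are independent and could be
    computed in parallel (at some cost in numerical stability)."""
    Q = A.astype(float)
    for j in range(A.shape[1]):
        for k in range(j):
            Q[:, j] -= (Q[:, k] @ A[:, j]) * Q[:, k]
        Q[:, j] /= np.linalg.norm(Q[:, j])
    return Q

def modified_gram_schmidt(A):
    """Modified GS: the projections use the partially orthogonalized column,
    which improves stability but serializes the inner loop."""
    Q = A.astype(float)
    for j in range(A.shape[1]):
        for k in range(j):
            Q[:, j] -= (Q[:, k] @ Q[:, j]) * Q[:, k]
        Q[:, j] /= np.linalg.norm(Q[:, j])
    return Q

A = np.random.rand(6, 4)
for Q in (classical_gram_schmidt(A), modified_gram_schmidt(A)):
    print(np.linalg.norm(Q.T @ Q - np.eye(4)))  # loss of orthogonality
```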

Licensed under cc by-sa 3.0 with attribution required.