
Well, other than crypto, what is there really to do that is more efficient on the GPU?


I don't think it needs to be more efficient than the CPU to merit moving to the GPU. If the horsing around to get the data in and out is less work than doing the job, then you may as well put the GPU to work and improve your total throughput.

Perhaps the memory page deduplication candidate detection could run out there. It would be memory bound, but maybe by not ruining the CPU cache it would be a win. (This is important for systems running a bunch of virtual machines.)
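A minimal CPU-side sketch of that candidate-detection pass (the GPU offload and the actual page merging are omitted; PAGE_SIZE, page_hash, find_candidates and the FNV hash choice are just illustrative): hash every page cheaply, and only byte-compare pages whose hashes collide.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    /* Cheap FNV-1a hash of one page. A matching hash only nominates a
       candidate pair; a full memcmp() decides whether the pages really
       are identical and could be merged. */
    static uint64_t page_hash(const uint8_t *page)
    {
        uint64_t h = 1469598103934665603ULL;
        for (size_t i = 0; i < PAGE_SIZE; i++) {
            h ^= page[i];
            h *= 1099511628211ULL;
        }
        return h;
    }

    /* Scan n pages and report pairs that are candidates for deduplication. */
    static void find_candidates(const uint8_t *pages, size_t n)
    {
        uint64_t *hashes = malloc(n * sizeof *hashes);
        if (!hashes)
            return;
        for (size_t i = 0; i < n; i++)
            hashes[i] = page_hash(pages + i * PAGE_SIZE);
        for (size_t i = 0; i < n; i++)
            for (size_t j = i + 1; j < n; j++)
                if (hashes[i] == hashes[j] &&
                    memcmp(pages + i * PAGE_SIZE,
                           pages + j * PAGE_SIZE, PAGE_SIZE) == 0)
                    printf("pages %zu and %zu can be merged\n", i, j);
        free(hashes);
    }

    int main(void)
    {
        static uint8_t pages[4 * PAGE_SIZE];  /* four zero-filled pages */
        pages[2 * PAGE_SIZE] = 1;             /* make page 2 unique     */
        find_candidates(pages, 4);
        return 0;
    }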


"the horsing around to get the data in and out" seems to be the key factor. An analysis of BLAS libraries' performance across several architectures [1] showed that GPU-based calculation only approached implementations like Goto BLAS with matrix dimensions well up into the thousands. That's just one example, but there seems to be a fair bit of overhead in getting the data to and from the GPU.

[1] http://dirk.eddelbuettel.com/blog/code/gcbd/


Calculating error correction codes, though efficiency depends on the memory architecture.

I heard Tsubame, a supercomputer built with NVIDIA GPUs, calculated ECC on its GPU-side memory with GPU code because those GPUs were consumer grade and didn't have hardware ECC.
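Real DRAM ECC is a SEC-DED code (it corrects single-bit flips and detects double-bit flips); the sketch below is a much weaker stand-in that only detects a flipped bit, using one parity bit per 64-bit word, just to show the shape of the encode-then-scrub pass such software ECC performs. All names here are illustrative.

    #include <stddef.h>
    #include <stdint.h>

    /* Parity (xor of all bits) of one 64-bit word. */
    static uint8_t word_parity(uint64_t w)
    {
        w ^= w >> 32; w ^= w >> 16; w ^= w >> 8;
        w ^= w >> 4;  w ^= w >> 2;  w ^= w >> 1;
        return (uint8_t)(w & 1);
    }

    /* Record one parity bit per word when the buffer is written out. */
    static void ecc_encode(const uint64_t *buf, uint8_t *parity, size_t nwords)
    {
        for (size_t i = 0; i < nwords; i++)
            parity[i] = word_parity(buf[i]);
    }

    /* Scrub pass: return the index of the first word whose parity no longer
       matches, or nwords if the buffer still checks out. */
    static size_t ecc_scrub(const uint64_t *buf, const uint8_t *parity,
                            size_t nwords)
    {
        for (size_t i = 0; i < nwords; i++)
            if (word_parity(buf[i]) != parity[i])
                return i;
        return nwords;
    }

    int main(void)
    {
        uint64_t buf[4] = {1, 2, 3, 4};
        uint8_t parity[4];
        ecc_encode(buf, parity, 4);
        buf[2] ^= 1ULL << 17;                 /* simulate a single-bit flip */
        return ecc_scrub(buf, parity, 4) == 2 ? 0 : 1;
    }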


Routing: http://shader.kaist.edu/packetshader/

This was really non-obvious to me.


It's non-obvious because it's a bad idea. This paper comes from a wacky world where latency and power consumption don't matter. The CPU-vs.-GPU comparisons aren't that compelling even on the surface of it. The latency and power-consumption numbers (compared to dedicated ASICs for this sort of thing) are just laughable.

Being the most compelling 'software router' is sort of like being the 'tallest midget' but even in this domain, I think their alleged advantages over CPU-only are mainly due to carefully massaging the presentation of the data.


OCR (optical character recognition), image recognition and face detection, speech recognition, speech synthesis, video recognition.

Multi touch gestures and handwriting recognition.


Phiber Optik (Mark Abene) had a pretty interesting talk yesterday at NY Hacker about using CUDA for intrusion detection calculations.


RAID checksum computation looks like an obvious possibility. We'd need a battery backup for the VRAM, too :)


A single core can hash (checksum) at 5 GB/s using MurmurHash. The data you checksum is probably already in the L1/L2 cache (on a write to RAID) or about to be consumed by userland, so reading it on the CPU just means the userland process gets its data from cache instead (on a read from RAID). You can get maybe 2-6 GB/s to the GPU. Add the latency (synchronization, etc.) and the GPU time to calculate the hash, and you've probably slowed the whole process down radically. Additionally, assuming a DMA transfer, your memory subsystem is more stressed because both the CPU and the GPU read the same data.

Oh, and simple xor? Well, assuming the data is already in L2, an Intel i7 can xor at 10+ GB/s on a single core across 3 buffers, i.e. a minimal RAID 5. The fastest RAID adapters achieve only a fraction of that speed.
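A sketch of the xor being benchmarked here (buffer count and names are illustrative): RAID-5 parity is just the byte-wise xor of the data buffers, and the same loop regenerates a lost buffer from the survivors plus parity; compilers vectorize it readily, which is why a single core can push it at near memory speed.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* P = D0 ^ D1: two data buffers and one parity buffer, the minimal RAID 5.
       The identical loop recovers a missing buffer, e.g. D0 = P ^ D1. */
    static void xor_parity(const uint8_t *d0, const uint8_t *d1,
                           uint8_t *out, size_t len)
    {
        for (size_t i = 0; i < len; i++)
            out[i] = d0[i] ^ d1[i];
    }

    int main(void)
    {
        uint8_t d0[512], d1[512], p[512], recovered[512];
        memset(d0, 0xAB, sizeof d0);
        memset(d1, 0x5C, sizeof d1);
        xor_parity(d0, d1, p, sizeof p);          /* compute parity        */
        xor_parity(p, d1, recovered, sizeof p);   /* "lose" d0, rebuild it */
        return memcmp(d0, recovered, sizeof d0) == 0 ? 0 : 1;
    }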


I think this is very memory intensive. Remember that the GPU would have to calculate block checksums, preferably straight from the main memory where the buffer resides.

Maybe block deduplication could be done this way. If the block is a dupe, skipping its allocation on disk (saving at least one block write) could offset the cost of a lot of block hash calculations.
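A rough sketch of that write-path decision, assuming an in-memory hash-to-block index (the table size, probing scheme, and the dedup_lookup_or_insert name are made up for illustration): on a hash hit the caller would still read the existing block and memcmp() it against the new data before skipping the write, since hashes can collide.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define TABLE_SLOTS 65536   /* toy size; a real index scales with the pool */

    struct dedup_entry { uint64_t hash; uint64_t blockno; bool used; };
    static struct dedup_entry table[TABLE_SLOTS];

    /* Returns true and the block number of a previously stored block with the
       same hash (a dedup candidate); otherwise records the new block's hash
       and returns false, meaning the block must actually be written. */
    static bool dedup_lookup_or_insert(uint64_t hash, uint64_t new_blockno,
                                       uint64_t *existing)
    {
        for (size_t probe = 0; probe < TABLE_SLOTS; probe++) {
            struct dedup_entry *e = &table[(hash + probe) % TABLE_SLOTS];
            if (!e->used) {
                e->used = true;
                e->hash = hash;
                e->blockno = new_blockno;
                return false;
            }
            if (e->hash == hash) {
                *existing = e->blockno;
                return true;
            }
        }
        return false;   /* table full: give up on dedup and just write it */
    }

    int main(void)
    {
        uint64_t existing;
        dedup_lookup_or_insert(42, 100, &existing);        /* first write     */
        return dedup_lookup_or_insert(42, 200, &existing)  /* duplicate found */
               && existing == 100 ? 0 : 1;
    }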


Working with polynomials. That's important in computational geometry:

https://domino.mpi-sb.mpg.de/intranet/ag1/ag1publ.nsf/0/ca00...


Calculating Viterbi paths for hidden Markov models on the GPU is an order of magnitude or two faster than doing it on the CPU. I worked on porting NVIDIA's OpenCL implementation to a more 'platform-neutral' version for the research project I'm involved in.
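This isn't the NVIDIA OpenCL kernel mentioned above, just a plain-C reference of the recurrence that gets parallelized (the model sizes and probabilities are made up): at each time step, each state's maximum over its predecessors is independent of the other states, which is what maps well onto GPU threads.

    #include <stdio.h>

    #define NS 3   /* hidden states       */
    #define NO 2   /* observation symbols */
    #define T  5   /* sequence length     */

    /* Toy model parameters (rows sum to 1); a real implementation would work
       in log space to avoid underflow on long sequences. */
    static const double init[NS]      = {0.6, 0.3, 0.1};
    static const double trans[NS][NS] = {{0.7, 0.2, 0.1},
                                         {0.3, 0.5, 0.2},
                                         {0.2, 0.3, 0.5}};
    static const double emit[NS][NO]  = {{0.9, 0.1},
                                         {0.4, 0.6},
                                         {0.1, 0.9}};

    int main(void)
    {
        int obs[T] = {0, 1, 1, 0, 1};
        double v[T][NS];   /* best path probability ending in state j at time t */
        int back[T][NS];   /* predecessor that achieved it                      */

        for (int j = 0; j < NS; j++)
            v[0][j] = init[j] * emit[j][obs[0]];

        for (int t = 1; t < T; t++)
            for (int j = 0; j < NS; j++) {    /* independent per state: GPU-friendly */
                double best = -1.0; int arg = 0;
                for (int i = 0; i < NS; i++) {
                    double p = v[t - 1][i] * trans[i][j];
                    if (p > best) { best = p; arg = i; }
                }
                v[t][j] = best * emit[j][obs[t]];
                back[t][j] = arg;
            }

        /* Trace back the most likely state sequence. */
        int path[T], last = 0;
        for (int j = 1; j < NS; j++)
            if (v[T - 1][j] > v[T - 1][last]) last = j;
        path[T - 1] = last;
        for (int t = T - 1; t > 0; t--)
            path[t - 1] = back[t][path[t]];

        for (int t = 0; t < T; t++)
            printf("%d ", path[t]);
        printf("\n");
        return 0;
    }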

Here are some more examples:

http://developer.download.nvidia.com/compute/opencl/sdk/webs...

There are many, many applications beyond crypto.


I think the question is about how the kernel can use the GPU. Linux probably doesn't need to train hidden Markov models. It might, however, need to do crypto (e.g., for an encrypted filesystem).


Oh gosh you're right. I wasn't thinking about the context in which the question was posed. Anyway, hopefully someone will find those examples interesting. NVIDIA's CUDA developer zone is chock full of great resources for GPGPU (like video lectures and tools and code examples).




