Creating 2 threads per core for the purpose of receiving while sending is
plain stupid. First, it requires 2 threads to synchronize with each other,
which has a cost. Second, since only one thread can run at a time on a
core, the threads slow each other down (using BatchQueue where the sender
is on the same core as the receiver yields bad performance). This patch
removes all this complexity so that one thread receives, computes and then
resends data, which improves performance dramatically.
* Provide, for BatchQueue, CSQ, FastForward, MCRingBuffer and GOMP stream,
a version using 64 cache lines in total for all buffers.
* Rename common versions from _common_comm.h to _common.h to avoid
treating them as communication techniques in their own right
Use 2 mappings of the same structure to avoid prefetching of the
producer semi-buffer by the consumer. The idea is to access everything
through mapping 1 except semi-buffer 2, which is accessed through mapping
2.
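A minimal sketch of the double-mapping idea, assuming a POSIX system with
mmap(). The names `struct double_map` and `map_twice` are hypothetical and
are not taken from this repository; the backing object here is an unlinked
temporary file, which is only one possible choice.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical holder for two views of the same memory. */
struct double_map {
    void *map1;   /* intended for everything except semi-buffer 2 */
    void *map2;   /* intended for semi-buffer 2 only */
    size_t size;
};

/* Map the same `size` bytes at two different virtual addresses.
 * Returns 0 on success, -1 on failure. */
static int map_twice(struct double_map *dm, size_t size)
{
    char path[] = "/tmp/double_map_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                        /* the file lives on through fd */
    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return -1;
    }
    /* Two MAP_SHARED mappings of the same file offset alias each other. */
    dm->map1 = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    dm->map2 = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                           /* mappings stay valid after close */
    if (dm->map1 == MAP_FAILED || dm->map2 == MAP_FAILED) {
        if (dm->map1 != MAP_FAILED)
            munmap(dm->map1, size);
        if (dm->map2 != MAP_FAILED)
            munmap(dm->map2, size);
        return -1;
    }
    dm->size = size;
    return 0;
}
```

A write through `map1` is visible through `map2` at the same offset, so
code can pick the view per semi-buffer as described above.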
Add the native algorithm from the OpenMP stream extension. This requires
adding one function to commtech.h: end_producer(). This function does
nothing for all communication algorithms except gomp_stream (the algorithm
added by this commit).
* Refactor the source to be able to chain more than 2 nodes together
* Compile all binaries by default (binList must be set manually in
lancement.sh to run only a subset of the binaries)
Add a calculation method which adds the value of the first integer of
n consecutive cache lines and writes the result in one of the integers of
these cache lines. The next calculation uses the next n consecutive cache
lines and writes the result in the next integer.
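The calculation above can be sketched as follows. This is an illustration
under assumptions, not the code from this commit: the 64-byte cache line,
the names `struct cache_line` and `sum_groups`, and the choice to store the
result in the first line of each group are all hypothetical.

```c
#include <stddef.h>

#define CACHE_LINE_SIZE 64                        /* assumed line size */
#define INTS_PER_LINE (CACHE_LINE_SIZE / sizeof(int))

struct cache_line {
    int v[INTS_PER_LINE];
};

/* For each group of n consecutive cache lines, sum the first integer of
 * every line in the group. The result of group g is written to integer
 * (g % INTS_PER_LINE) of the group's first line (assumption), so each
 * calculation writes to the "next" integer slot. */
static void sum_groups(struct cache_line *lines, size_t nb_lines, size_t n)
{
    if (n == 0)
        return;                                   /* nothing to group */
    size_t group = 0;
    for (size_t start = 0; start + n <= nb_lines; start += n, group++) {
        int sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += lines[start + i].v[0];
        lines[start].v[group % INTS_PER_LINE] = sum;
    }
}
```

Each calculation touches n cache lines for a single addition chain, so the
amount of data read per computation scales with n.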
* Split CSQ into 2 communication techniques: one with 2 slots (as in
BatchQueue aka c_cache) and one with 64 slots (as in the article)
* Rename the fake communication technique to the none communication
technique and disable all activity (send no longer does anything)
The paper about CSQ uses memcpy in enqueue and dequeue. Although it is not
possible to use memcpy in enqueue because of the current API, it is
possible to use memcpy in dequeue, hence this commit.
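As an illustration of a memcpy-based dequeue, here is a generic
slot-based, single-consumer sketch. It is not the CSQ layout from the paper
or from this repository: `struct slot`, `struct queue`, `dequeue_slot`, the
slot size and the per-slot `full` flag are assumptions made for the example.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SLOT_BYTES 64   /* assumed payload size of one slot */
#define NB_SLOTS 2      /* assumed number of slots */

struct slot {
    atomic_bool full;                 /* set by producer, cleared by consumer */
    unsigned char data[SLOT_BYTES];
};

struct queue {
    struct slot slots[NB_SLOTS];
    size_t head;                      /* next slot to read, consumer-private */
};

/* Copy one full slot into `out` (at least SLOT_BYTES bytes) with a single
 * memcpy instead of an element-by-element loop. Returns false when the
 * slot at head has not been filled yet. */
static bool dequeue_slot(struct queue *q, void *out)
{
    struct slot *s = &q->slots[q->head];
    if (!atomic_load_explicit(&s->full, memory_order_acquire))
        return false;
    memcpy(out, s->data, SLOT_BYTES);
    /* Release the slot only after the copy is complete. */
    atomic_store_explicit(&s->full, false, memory_order_release);
    q->head = (q->head + 1) % NB_SLOTS;
    return true;
}
```

The matching enqueue is left out on purpose, since the text above says the
current API does not allow memcpy on that side.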
This respects what we claim to send to the send() function and allows
reducing FAKE_NURSERY_START. Thus we are sure gcc won't optimize away the
second part of the if in include/jikes_barrier_comm.h
Pages cannot be freed as fast as they are allocated, so this whole
mechanism can only delay the kernel panic. It's wiser to exit badly if
too much memory is consumed