Creating 2 threads per core for the purpose of receiving while sending is
plain stupid. First, it needs 2 threads synchronizing with each other,
which has a cost. Second, since only one thread can run on a core at a
time, the threads slow each other down (using BatchQueue where the sender
is on the same core as the receiver yields bad performance). This patch
removes all this complexity and has one thread receive, compute and then
resend data, which improves performance dramatically.
Use 2 mappings of the same structure to avoid prefetching of the
producer semi-buffer by the consumer. The idea is to access everything
through mapping 1 except semi-buffer 2, which is accessed through mapping
2.
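
A minimal sketch of the double-mapping idea, assuming a POSIX system. The
struct layout, the buffer sizes and the names map_twice() and
visible_through_both() are hypothetical and are not taken from this
repository; the sketch only shows that one backing object can be mapped at
two virtual addresses that alias the same memory.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical layout: two semi-buffers inside one shared structure. */
struct channel {
    char semi_buffer1[4096];
    char semi_buffer2[4096];
};

/* Map the same backing file twice. Returns 0 on success, -1 on error. */
int map_twice(struct channel **m1, struct channel **m2)
{
    assert(m1 != NULL && m2 != NULL);

    FILE *f = tmpfile();                 /* anonymous temporary file */
    if (f == NULL)
        return -1;
    int fd = fileno(f);
    if (ftruncate(fd, sizeof(struct channel)) != 0) {
        fclose(f);
        return -1;
    }

    void *a = mmap(NULL, sizeof(struct channel), PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    void *b = mmap(NULL, sizeof(struct channel), PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    fclose(f);                           /* mappings outlive the descriptor */

    if (a == MAP_FAILED || b == MAP_FAILED) {
        if (a != MAP_FAILED) munmap(a, sizeof(struct channel));
        if (b != MAP_FAILED) munmap(b, sizeof(struct channel));
        return -1;
    }
    *m1 = a;   /* mapping 1: used for everything except semi-buffer 2 */
    *m2 = b;   /* mapping 2: used only for semi-buffer 2 */
    return 0;
}

/* Write semi-buffer 2 through mapping 1, read it back through mapping 2.
 * Returns 1 when both mappings see the same bytes. */
int visible_through_both(struct channel *m1, struct channel *m2)
{
    memcpy(m1->semi_buffer2, "ping", 5);
    return memcmp(m2->semi_buffer2, "ping", 5) == 0;
}
```

In real use, all code would go through m1 except accesses to semi-buffer 2,
which would go through m2->semi_buffer2.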
Add native algorithm from the OpenMP stream extension. This requires
adding one function to commtech.h: end_producer(). This function does
nothing for any communication algorithm except gomp_stream (the algorithm
added by this commit).
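
A rough illustration of why one algorithm may need an end-of-production
hook while the others can leave it empty. Everything here is an assumption:
the struct, the batch size and the function names are hypothetical, and the
actual end_producer() signature in commtech.h and what gomp_stream does in
it are not shown in this commit message. The sketch assumes a buffered
algorithm that must publish a partially filled batch when the producer
stops.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH 4       /* hypothetical batch size */
#define CAPACITY 16   /* hypothetical size of the published area */

/* Hypothetical buffered channel: values are staged, then published. */
struct buffered_channel {
    int staged[BATCH];
    size_t n_staged;
    int published[CAPACITY];
    size_t n_published;
};

/* Move every staged value to the published area. */
static void publish_staged(struct buffered_channel *c)
{
    for (size_t i = 0; i < c->n_staged; i++) {
        assert(c->n_published < CAPACITY);
        c->published[c->n_published++] = c->staged[i];
    }
    c->n_staged = 0;
}

/* Stage one value; publish only when a full batch is available. */
void buffered_send(struct buffered_channel *c, int value)
{
    c->staged[c->n_staged++] = value;
    if (c->n_staged == BATCH)
        publish_staged(c);
}

/* Buffered algorithm: the hook publishes whatever is still staged. */
void buffered_end_producer(struct buffered_channel *c)
{
    publish_staged(c);
}

/* Unbuffered algorithm: nothing is pending, so the hook is a no-op. */
void unbuffered_end_producer(void *channel)
{
    (void)channel;
}
```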
* Refactor the source to be able to chain more than 2 nodes together
* Compile all binaries by default (binList must be set manually in
lancement.sh to run only a subset of the binaries)
This respects what we claim to send to the send() function and allows
reducing FAKE_NURSERY_START. Thus we are sure gcc won't optimize away the
second part of the if in include/jikes_barrier_comm.h.
Main changes:
* Better separation of what is common from what is specific to a
communication technique
* Consumer waits for initialization of all producer threads
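
One possible way to make the consumer wait for all producers, shown as a
sketch with C11 atomics. The counter and the names producer_init_done() and
consumer_wait_producers() are hypothetical; the commit may well use a
different mechanism.

```c
#include <assert.h>
#include <stdatomic.h>

/* Number of producer threads that have finished their initialization. */
static atomic_int producers_ready;

/* Called once by each producer thread when its initialization is done. */
void producer_init_done(void)
{
    atomic_fetch_add_explicit(&producers_ready, 1, memory_order_release);
}

/* Called by the consumer: spin until n_producers producers are ready.
 * Returns the number of ready producers that was observed. */
int consumer_wait_producers(int n_producers)
{
    assert(n_producers >= 0);
    int ready;
    while ((ready = atomic_load_explicit(&producers_ready,
                                         memory_order_acquire)) < n_producers)
        ;   /* busy-wait; acceptable when each thread owns a core */
    return ready;
}
```

The release/acquire pair ensures that whatever a producer wrote during its
initialization is visible to the consumer once the wait returns.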
Main changes:
* change library initialization: init_library is called once, then
init_thread_comm is called by each thread
* cont is now directly accessed by main
* list of struct communication_assoc -> array of struct thread_comm
* struct communication_channel -> struct comm_channel
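
A sketch of the two-step initialization described above. Only the names
init_library, init_thread_comm, struct thread_comm and struct comm_channel
come from this commit; their fields, parameters and return values below are
assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical contents; the real structures are not shown here. */
struct comm_channel {
    int id;
};

struct thread_comm {
    struct comm_channel *channel;
    int initialized;
};

/* Array of per-thread state (replaces the former linked list). */
static struct thread_comm *tcomms;
static size_t n_tcomms;

/* Called once, before any thread uses the library. */
int init_library(size_t n_threads)
{
    tcomms = calloc(n_threads, sizeof *tcomms);
    if (tcomms == NULL)
        return -1;
    n_tcomms = n_threads;
    return 0;
}

/* Called by each thread to set up its own slot in the array. */
int init_thread_comm(size_t idx, struct comm_channel *channel)
{
    assert(tcomms != NULL);   /* init_library must have run first */
    if (idx >= n_tcomms)
        return -1;
    tcomms[idx].channel = channel;
    tcomms[idx].initialized = 1;
    return 0;
}
```

With an array, each thread reaches its state by index instead of walking a
list.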