ToDo: 
 - complete hth_1
 - revise hth_4 to use streaming (reduce top edge actuals by 1).
 - hybrid (hth4 is returning a strange answer).


Codes:
 basic_mw: simple master/worker in nws
 dna_1: simple mw of dna
 dna_2: like dna_1, but with watermarking
 dna_2.5: like dna_2, but workers process results and only send the best to master.
 dna_3: result parallel version of dna outer loop.

 hth_1: (not finished): result parallel version of single hth comparison 
 hth_2: single head-to-head using agenda parallelism
 hth_3: target vs database, all workers applied to each hth
 hth_4: hybrid: target vs database, fixed blocking

2/22/07 Lecture
On board, Introduce DNA algorithm
  Talk about real-life relevance
  board algorithm
  Overview of individual comparison

Things to think about as we go along:
  1) possible points of parallelism
  2) load balance
  3) computation vs. communication
  4) watermarking techniques

What questions would you ask?
  - how expensive is a comparison?
  - how much do they vary?
  - cost of I/O
  - how big is the database?
  - where can we parallelize, and what are the tradeoffs?
  - tradeoffs: 
    - amount of parallelism
    - complexity of code change
    - task dependencies
    - granularity

How can we imagine applying our 3 different styles of parallelism?
  - result (each seq<->target is a result)
  - specialist (each worker holds a part of the db, or a whole db)
  - agenda (many workers, each turning seqs into results)

SWITCH TO LAPTOP
Seqential algorithm in python:
  introduce:
    input files, Database object, similarity function

Ask for thoughts on ways forward
  
Basic MW program (not DNA) (agenda)
  Discuss:
     Sleigh and eachWorker
     poison task

We'll choose parallelizing the outer loop.  Why?  What advantages?

First parallel version, on board, as pipeline like page 106.
while(s=getseq())
  generatetask(s)

while(t=gettask())
  generateresult(compare(target, t))

while(r=getresult())
  processresult()

show code for MW program for DNA
  Introduce bulldogc cluster
  log into bulldogc via putty
  run seq and dna_1 versions on test.70 vs db.r.small, talk about speedup.

  Problems with this program?
     
  What if there are many seq in DB?
  
  Discuss watermarking:

    Book's way:

cnt=0
while (t=gettask())
  ws.store('task', t)
  if (++cnt>upper)
  while (cnt-- > lower)
    r=fetch('result')
    processresult(r)
  }
}
while (cnt--) {
    r=fetch('result')
    processresult(r)  
}

An somewhat simpler alternative way:
cnt=0
while (t=gettask() and ++cnt<watermark)
  ws.store('task', t)

while (t=gettask())
  r=fetch('result')
  processresult(r)
  store('task', t)

while (cnt--)
  r=fetch('result')
  processresult(r)

talk about Jan's code, and how it brought down the cluster.
      
Show dna_2, with watermarking.

Discuss problem of large tasks, and ordering.

1000s of work, 20 processors
one task is 10s, picked up at the end.

total runtime is 980/20 + 10 = 49+10
1000/59=17x
17/20=85% eff, or we wasted 3 procs.

explain trailoff.

To fix it, order tasks by size, largest first

Can you think of a downside to this? (lots of small tasks at the end might
overwhelm master or communication fabric)

In book, added indices to task tuples and a index tuple to let workers get tasks
in order.
In NWS, we get ordering automatically with fifo variables (default).  Just
have to sort the database file.

RUN on Bulldogc, record time

Last thing: we can offload some of the result processing onto workers (trivial in this
case, but not always).

Show DNA_2.5

Don't run, since not much difference.

For Fun, lets look at the "result parallel" version of the prog.  
Note that it's not really a RP, since eachElem is actually a MW impl. of agenda.
 
Run DNA_3 on bulldogc.

Go back to our picture of trailoff issues
1000s of work, 20 processors
What if biggest task is 100s?
No ordering will help, even if a worker gets it at the beginning:

900/19 ~47
total runtime is max(900/19, 100) = 100
1000/100=10x
10/20=50% eff, or we wasted 10 procs.

After HTH stuff:

Think about performance modeling:

W=#workers
T=#tasks
max of:
1) IO time on master, including nws.
2) seqwork/W+task overhead(T/W)
3) non-paral. seq costs