ToDo: - complete hth_1 - revise hth_4 to use streaming (reduce top edge actuals by 1). - hybrid (hth4 is returning a strange answer). Codes: basic_mw: simple master/worker in nws dna_1: simple mw of dna dna_2: like dna_1, but with watermarking dna_2.5: like dna_2, but workers process results and only send the best to master. dna_3: result parallel version of dna outer loop. hth_1: (not finished): result parallel version of single hth comparison hth_2: single head-to-head using agenda parallelism hth_3: target vs database, all workers applied to each hth hth_4: hybrid: target vs database, fixed blocking 2/22/07 Lecture On board, Introduce DNA algorithm Talk about real-life relevance board algorithm Overview of individual comparison Things to think about as we go along: 1) possible points of parallelism 2) load balance 3) computation vs. communication 4) watermarking techniques What questions would you ask? - how expensive is a comparison? - how much do they vary? - cost of I/O - how big is the database? - where can we parallelize, and what are the tradeoffs? - tradeoffs: - amount of parallelism - complexity of code change - task dependencies - granularity How can we imagine applying our 3 different styles of parallelism? - result (each seq<->target is a result) - specialist (each worker holds a part of the db, or a whole db) - agenda (many workers, each turning seqs into results) SWITCH TO LAPTOP Seqential algorithm in python: introduce: input files, Database object, similarity function Ask for thoughts on ways forward Basic MW program (not DNA) (agenda) Discuss: Sleigh and eachWorker poison task We'll choose parallelizing the outer loop. Why? What advantages? First parallel version, on board, as pipeline like page 106. while(s=getseq()) generatetask(s) while(t=gettask()) generateresult(compare(target, t)) while(r=getresult()) processresult() show code for MW program for DNA Introduce bulldogc cluster log into bulldogc via putty run seq and dna_1 versions on test.70 vs db.r.small, talk about speedup. Problems with this program? What if there are many seq in DB? Discuss watermarking: Book's way: cnt=0 while (t=gettask()) ws.store('task', t) if (++cnt>upper) while (cnt-- > lower) r=fetch('result') processresult(r) } } while (cnt--) { r=fetch('result') processresult(r) } An somewhat simpler alternative way: cnt=0 while (t=gettask() and ++cnt