Collaborative Map-Reduce in the Browser

After immersing yourself into the field of distributed computing and large data sets you inevitably come to appreciate the elegance of Google's Map-Reduce framework. Both the generality and the simplicity of its map, emit, and reduce phases is what makes it such a powerful tool. However, while Google has made the theory public, the underlying software implementation remains closed source and is arguably one of their biggest competitive advantages (GFS, BigTable, etc). Of course, there is a multitude of the open source variants (Apache Hadoop, Disco, Skynet, amongst many others), but one can't help but to notice the disconnect between the elegance and simplicity of the theory and the painful implementation: custom protocols, custom servers, file systems, redundancy, and the list goes on! Which begs the question, how do we lower the barrier?

Massively Collaborative Computation

After several iterations, false starts, and great conversations with Michael Nielsen, a flash of the obvious came: HTTP + Javascript! What if you could contribute to a computational (Map-Reduce) job by simply pointing your browser to a URL? Surely your social network wouldn't mind opening a background tab to help you crunch a dataset or two!

Instead of focusing on high-throughput proprietary protocols and high-efficiency data planes to distribute and deliver the data, we could use battle tested solutions: HTTP and your favorite browser. It just so happens that there are more Javascript processors around the world (every browser can run it) than for any other language out there - a perfect data processing platform.

Google's server farm is rumored to be over six digits (and growing fast), which is an astounding number of machines, but how hard would it be to assemble a million people to contribute a fraction of their compute time? I don't think it's far-fetched at all as long as the barrier to entry is low. Granted, the efficiency of the computation would be much lower, but we would have a much larger potential cluster, and this could enable us to solve a whole class of problems previously unachievable.

Client-Side Computation in the Browser

Aside from storing and distributing the data the most expensive part of any job is the CPU time. However, by splitting the data into small and manageable chunks, we could easily construct an HTTP-based workflow to let the user's browser handle this for us:

The entire process consists of four easy steps. First, the client requests to join the cluster by making a request to the job-server which tracks the progress of the computation. Next, the job-server allocates a unit of work and redirects (301 HTTP Redirect, for example) the client to a URL which contains the data and the Javascript map/reduce functions. Here is a sample for a simple distributed word-count:

<html>
  <head>
    <script type="text/javascript">
 
      function map() {
        /* count the number of words in the body of document */
        var words = document.body.innerHTML.split(/\\n|\\s/).length;
        emit('reduce', {'count': words});
      }
 
      function reduce() {
        /* sum up all the word counts */
        var sum = 0;
        var docs = document.body.innerHTML.split(/\\n/);
        for each (num in docs) { sum+= parseInt(num) > 0 ? parseInt(num) : 0 }
        emit('finalize', {'sum': sum});
      }
 
      function emit(phase, data) { ... }
    </script>
  </head>
 
  <body onload="map();">
    ... DATA ...
  </body>
</html>
 

Once the page is loaded and the Javascript is executed (which is getting faster and faster with the Javascript VM wars), the results are sent back (POST) to the job-server, and the cycle repeats until all jobs (map and reduce) are done. Hence joining the cluster is as simple as opening a URL and distribution is handled by our battle-tested HTTP protocol.

Simple Job-Server in Ruby

The last missing piece of the puzzle is the job server to coordinate the distributed workflow. Which, as it turns out, takes just thirty lines of Ruby with the help of the Sinatra web framework:

require "rubygems"
require "sinatra"
 
configure do
  set :map_jobs, Dir.glob("data/*.txt")
  set :reduce_jobs, []
  set :result, nil
end
 
get "/" do
  redirect "/map/#{options.map_jobs.pop}" unless options.map_jobs.empty?
  redirect "/reduce"                      unless options.reduce_jobs.empty?
  redirect "/done"
end
 
get "/map/*"  do erb :map,    :file => params[:splat].first; end
get "/reduce" do erb :reduce, :data => options.reduce_jobs;  end
get "/done"   do erb :done,   :answer => options.result;     end
 
post "/emit/:phase" do
  case params[:phase]
  when "reduce" then
    options.reduce_jobs.push params['count']
    redirect "/"
 
  when "finalize" then
    options.result = params['sum']
    redirect "/done"
  end
end
 
# To run the job server: 
# > ruby job-server.rb -p 80 
 

That's it. Start up the server and type in the URL in your browser. The rest is both completely automated and easily parallelizable - just point more browsers at it! Add some load balancing, a database, and it may be just crazy enough that it might actually work.

브라우저에 Map-Reduce를 적용한 분산 시스템 만들기

스토리지나 데이터의 분산도 좋지만 사실 가장 비용이 드든건 CPU Time입니다.
이 CPU 자원을 잘게 쪼개서 여러 컴퓨터에 분산시킬 수 있다면 좋겠지요..
그런데 브라우저 사용자의 HTTP 통신을 이용해서 간단하게 분산하는 방법이 있습니다.

최근 어떤 사이트에 간단하게 브라우저와 Map-Reduce를 이용하여 분산 시스템을 만드는 법에 관한 글이 올라와 소개합니다.
일단 Flow를 그림으로 보면 아래와 같습니다. 




설명하자면 이렇습니다.

먼저 클러스터에 가입하려는 클라이언트(브라우저)가 Job Server에 접속합니다.
그러면 Job Server는 계산꺼리 데이터를 텍스트 형태로 브라우저에게 보내주는데(HTTP 301같은걸 이용해서),
이때 자바스크립트 map()함수와 reduce() 함수를 같이 보내줍니다.
그러면 클라이언트는 자바스크립트로 Map-Reduce를 실행하고 결과를 Job Server에게 돌려줍니다.


이렇게해서 Job Server는 작업의 조각들을 무수히 많은 대량의 클라이언트들(브라우저)에게 분산시키게 됩니다.


물론 여기에도 단점이 있을 수 있습니다.
Job Server가 받을 수 있는 Request의 수는 이론상의 최대 HTTP Request수를 넘을 수 없다는 것,
HTTP의 특성상 예상치 못한 오류가 생길 수 있다는 것,
보안에 약할 수 있다는 것 등의 해결해야 할 문제들이 있습니다만..,
이런 단점들을 조금씩 해결해 나가면 무지하게 큰 클라우드를 손쉽게 만들 수 있을지도 모르겠습니다.
어쨌든 참 재밌는 아이디어입니다.

간략하게 설명했습니다만 원본 글에 방문하시면 예제 소스와 함께 좀 더 자세히 보실 수 있습니다.


출처 : http://kimho.pe.kr/wordpress/?p=14 
원본 : http://www.igvita.com/2009/03/03/collaborative-map-reduce-in-the-browser/

AND