Documentation
Overview
Bigpi is an example bigmachine program that estimates digits of Pi using the Monte Carlo method. It distributes work by instantiating multiple machines and calling them to draw samples, returning the total number of samples that fell inside the unit circle.
We can run it locally with a small number of samples to test:
% bigpi -n 1000000
2018/03/16 15:21:05 waiting for machines to come online
2018/03/16 15:21:08 machine http://localhost:63880/ RUNNING
2018/03/16 15:21:08 machine http://localhost:63878/ RUNNING
2018/03/16 15:21:08 machine http://localhost:63879/ RUNNING
2018/03/16 15:21:08 machine http://localhost:63881/ RUNNING
2018/03/16 15:21:08 machine http://localhost:63877/ RUNNING
2018/03/16 15:21:08 all machines are ready
2018/03/16 15:21:08 distributing work among 5 cores
http://localhost:63878/: 2018/03/16 15:21:08 0/200000
http://localhost:63880/: 2018/03/16 15:21:08 0/200000
http://localhost:63879/: 2018/03/16 15:21:08 0/200000
http://localhost:63881/: 2018/03/16 15:21:08 0/200000
2018/03/16 15:21:08 total=784425 nsamples=1000000
π = 3.1377
By using large EC2 instances, we can trivially distribute the work over hundreds of cores:
% bigpi -bigsystem ec2 -bigec2type c5.18xlarge -n 1000000000000
2018/03/20 21:00:05 waiting for machines to come online
2018/03/20 21:01:09 machine https://ec2-54-213-185-145.us-west-2.compute.amazonaws.com:2000/ RUNNING
2018/03/20 21:01:09 machine https://ec2-35-164-137-2.us-west-2.compute.amazonaws.com:2000/ RUNNING
2018/03/20 21:01:09 machine https://ec2-34-208-105-231.us-west-2.compute.amazonaws.com:2000/ RUNNING
2018/03/20 21:01:09 machine https://ec2-34-211-149-59.us-west-2.compute.amazonaws.com:2000/ RUNNING
2018/03/20 21:01:09 machine https://ec2-34-223-251-92.us-west-2.compute.amazonaws.com:2000/ RUNNING
2018/03/20 21:01:09 all machines are ready
2018/03/20 21:01:09 distributing work among 360 cores
https://ec2-34-208-105-231.us-west-2.compute.amazonaws.com:2000/: 2018/03/20 21:01:09 0/2777777777
https://ec2-34-223-251-92.us-west-2.compute.amazonaws.com:2000/: 2018/03/20 21:01:09 0/2777777777
...
2018/03/20 21:13:27 total=785397678380 nsamples=1000000000000
π = 3.141590713520
Once a bigmachine program is running, we can profile it using the standard Go pprof tooling. The returned profile is sampled from the whole cluster and merged. In the first iteration of this program, this helped find a bug: we were using the global rand.Float64, which requires a lock. The CPU profile made the lock contention easy to spot:
% go tool pprof localhost:3333/debug/bigmachine/pprof/profile
Fetching profile over HTTP from http://localhost:3333/debug/bigmachine/pprof/profile
Saved profile in /Users/marius/pprof/pprof.045821636.samples.cpu.001.pb.gz
File: 045821636
Type: cpu
Time: Mar 16, 2018 at 3:17pm (PDT)
Duration: 2.51mins, Total samples = 16.80mins (669.32%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 779.47s, 77.31% of 1008.18s total
Dropped 51 nodes (cum <= 5.04s)
Showing top 10 nodes out of 58
      flat  flat%   sum%        cum   cum%
   333.11s 33.04% 33.04%    333.11s 33.04%  runtime.procyield
   116.71s 11.58% 44.62%    469.55s 46.57%  runtime.lock
    76.35s  7.57% 52.19%    347.21s 34.44%  sync.(*Mutex).Lock
    65.79s  6.53% 58.72%     65.79s  6.53%  runtime.futex
    41.48s  4.11% 62.83%    202.05s 20.04%  sync.(*Mutex).Unlock
    34.10s  3.38% 66.21%    364.36s 36.14%  runtime.findrunnable
       33s  3.27% 69.49%        33s  3.27%  runtime.cansemacquire
    32.72s  3.25% 72.73%     51.01s  5.06%  runtime.runqgrab
    24.88s  2.47% 75.20%     57.72s  5.73%  runtime.unlock
    21.33s  2.12% 77.31%     21.33s  2.12%  math/rand.(*rngSource).Uint64
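One common way to avoid the kind of contention this profile shows is to give each goroutine its own generator instead of sharing the global one. The sketch below illustrates that idea under stated assumptions; it is not necessarily the fix applied in bigpi. The names countInside and parallelInside, the fixed seeds, and the choice to drop remainder samples are all hypothetical.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
)

// countInside (hypothetical name) samples n points with a generator that is
// private to the calling goroutine, so no lock is shared between workers.
// A *rand.Rand from rand.New is not safe for concurrent use, which is why
// each worker creates its own.
func countInside(seed int64, n uint64) uint64 {
	r := rand.New(rand.NewSource(seed))
	var inside uint64
	for i := uint64(0); i < n; i++ {
		x, y := r.Float64(), r.Float64()
		if x*x+y*y <= 1 {
			inside++
		}
	}
	return inside
}

// parallelInside (hypothetical name) splits n samples evenly across the
// given number of goroutines and sums their hit counts. Any remainder from
// the integer division is dropped to keep the sketch short.
func parallelInside(workers int, n uint64) uint64 {
	per := n / uint64(workers)
	counts := make([]uint64, workers) // one slot per worker, no sharing
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			// Distinct fixed seeds, chosen only for illustration.
			counts[w] = countInside(int64(w)+1, per)
		}(w)
	}
	wg.Wait()
	var total uint64
	for _, c := range counts {
		total += c
	}
	return total
}

func main() {
	const workers, n = 4, 4000000
	total := parallelInside(workers, n)
	fmt.Printf("total=%d nsamples=%d\n", total, n)
	fmt.Printf("π = %.4f\n", 4*float64(total)/float64(n))
}
```

Each worker writes only to its own slice element, so the only synchronization is the final wait.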