We're ramping up to run some 40,000 tasks on EC2, using the GG 2.1.0 AMI to run 19 instances, along with an OpenMQ instance to take all the tasks.
This works fine as long as I only run 10 or so tasks, but as soon as I ramp up to hundreds, I start getting lots of peer class loading errors like the one attached at the bottom. After some time it does start running my tasks, but the ones that failed due to class loading stay failed (my task is a subclass of GridTaskSplitAdapter and does not override result()). For example, I ran 280 tasks, and only 29 executed successfully.
For the moment, I've overridden the GridTaskSplitAdapter.result() method so that for ANY GridException we indicate that the job should be failed over. This doesn't seem like a long-term solution, though, so I'm hoping someone can give me an idea of what my options are for getting around this. I'd like to stick with peer class loading, as the code we're running will keep changing and I'd rather not have to create a new image, deploy our code to it, etc. I imagine one option is to make our own GG image with a much higher peer class loading timeout. Is that the best solution?
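To make the result() workaround concrete, here is a minimal sketch of the decision it makes. It is deliberately self-contained: `ResultPolicy`, `JobResult`, and `FailoverSketch` are simplified stand-ins invented for this post, not GridGain's own classes, and the real override would live in the GridTaskSplitAdapter subclass and use the library's own result and policy types.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the grid's result policy; only the two values used here.
enum ResultPolicy { WAIT, FAILOVER }

// Hypothetical stand-in for a job result: holds the remote exception, or null on success.
final class JobResult {
    private final Exception exception;

    JobResult(Exception exception) {
        this.exception = exception;
    }

    Exception getException() {
        return exception;
    }
}

class FailoverSketch {
    // Any job that came back with an exception is failed over to another node;
    // a successful job just waits for the remaining results.
    static ResultPolicy result(JobResult res, List<JobResult> received) {
        if (res.getException() != null) {
            return ResultPolicy.FAILOVER;
        }
        return ResultPolicy.WAIT;
    }

    public static void main(String[] args) {
        List<JobResult> received = new ArrayList<>();
        JobResult failed = new JobResult(new Exception("Task was not deployed"));
        JobResult ok = new JobResult(null);

        System.out.println(result(failed, received));
        System.out.println(result(ok, received));
    }
}
```

The obvious drawback, and the reason this feels like a stopgap, is that it does not distinguish deployment failures from genuine errors in the job code, so a job that always throws would presumably keep being failed over.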
Exception:
----------
Type: org.gridgain.grid.GridException
Message: Remote job threw user exception (override or implement GridTask.result(..) method if you would like to have automatic failover for this exception).
Documentation: http://wiki.gridgain.org
Stack trace:
at org.gridgain.grid.GridTaskAdapter.result(GridTaskAdapter.java:109)
at org.gridgain.grid.kernal.processors.task.GridTaskWorker.result(GridTaskWorker.java:617)
at org.gridgain.grid.kernal.processors.task.GridTaskWorker.onResponse(GridTaskWorker.java:546)
at org.gridgain.grid.kernal.processors.task.GridTaskProcessor$JobMessageListener.processJobExecuteResponse(GridTaskProcessor.java:886)
at org.gridgain.grid.kernal.processors.task.GridTaskProcessor$JobMessageListener.onMessage(GridTaskProcessor.java:851)
at org.gridgain.grid.kernal.managers.communication.GridCommunicationManager.unwindMessageSet(GridCommunicationManager.java:767)
at org.gridgain.grid.kernal.managers.communication.GridCommunicationManager.access$3300(GridCommunicationManager.java:45)
at org.gridgain.grid.kernal.managers.communication.GridCommunicationManager$6.body(GridCommunicationManager.java:706)
at org.gridgain.grid.util.runnable.GridRunnable$1.run(GridRunnable.java:142)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at org.gridgain.grid.util.runnable.GridRunnable.run(GridRunnable.java:194)
at org.gridgain.grid.util.runnable.GridRunnablePool$1.run(GridRunnablePool.java:80)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
Caused By:
----------
Type: org.gridgain.grid.GridException
Message: Task was not deployed or was redeployed since task execution (either received a stale message in which case you should increase GridConfiguration.getPeerClassLoadingTimeout() configuration parameter, or encountered some invalid condition, like internal or user code version mismatch) [taskName=com.rtrms.instrument.model.convergence.RunConvergenceTask, taskClsName=com.rtrms.instrument.model.convergence.RunConvergenceTask, codeVer=0, clsLdrId=db63985c-5b30-4e7c-a805-73d2741431ff, seqNum=1]
Documentation: http://wiki.gridgain.org
Stack trace:
at org.gridgain.grid.kernal.processors.job.GridJobProcessor$JobExecutionListener.onMessage(GridJobProcessor.java:1063)
at org.gridgain.grid.kernal.managers.communication.GridCommunicationManager$4.body(GridCommunicationManager.java:589)
at org.gridgain.grid.util.runnable.GridRunnable$1.run(GridRunnable.java:142)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at org.gridgain.grid.util.runnable.GridRunnable.run(GridRunnable.java:194)
at org.gridgain.grid.util.runnable.GridRunnablePool$1.run(GridRunnablePool.java:80)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)