Working with Large Data Sets
Added by architect, last edited by morpheus on Apr 10, 2008


When working with large data sets, you must be aware of the amount of data you pass over the network between nodes. GridGain comes with the following features to optimize working with large data sets:

Data Partitioning & Affinity Load Balancing

Co-locate your computations with your data by partitioning the data across grid nodes and sending each grid job to exactly the node where its data is located. For more information, see the Data Partitioning And Data Grid Integration documentation.
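
The idea can be illustrated with a minimal plain-Java sketch. It does not use the GridGain API: the class AffinitySketch and its methods are hypothetical names, and hashing the key modulo the node count is only one simple way to assign keys to nodes (an assumption, not necessarily what GridGain does).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of key-to-node affinity; not a GridGain class. */
final class AffinitySketch {
    /** Maps a data key to a node index; the same key always maps to the same index. */
    static int nodeIndexFor(Object key, int nodeCount) {
        if (nodeCount <= 0) {
            throw new IllegalArgumentException("nodeCount must be positive");
        }
        // floorMod keeps the result in [0, nodeCount) even for negative hash codes.
        return Math.floorMod(key.hashCode(), nodeCount);
    }

    /** Groups keys by owning node so that one job per node can process its local keys. */
    static Map<Integer, List<String>> groupByNode(List<String> keys, int nodeCount) {
        Map<Integer, List<String>> jobs = new LinkedHashMap<>();
        for (String key : keys) {
            jobs.computeIfAbsent(nodeIndexFor(key, nodeCount), n -> new ArrayList<>()).add(key);
        }
        return jobs;
    }
}
```

Each entry of the map returned by groupByNode corresponds to one job that would be sent to the node holding those keys, so only the keys travel over the network, not the data.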

Segmenting Nodes

You can segment your grid into separate groups and have each group work on its own designated data set. For more information about segmenting grid nodes, see the Segmenting Grid Nodes documentation.
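
As a rough sketch of static segmentation, assume each node carries a label naming the data set it serves, and jobs for a data set are only sent to nodes with the matching label. The Node class and nodesInSegment method below are hypothetical and written in plain Java; how GridGain actually exposes node attributes is covered in the documentation referenced above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of static node segmentation; not a GridGain class. */
final class SegmentSketch {
    /** Assumed node descriptor: an id plus the name of the data set the node serves. */
    static final class Node {
        final String id;
        final String segment;

        Node(String id, String segment) {
            this.id = id;
            this.segment = segment;
        }
    }

    /** Returns only the nodes that belong to the given segment. */
    static List<Node> nodesInSegment(List<Node> allNodes, String segment) {
        List<Node> result = new ArrayList<>();
        for (Node node : allNodes) {
            if (node.segment.equals(segment)) {
                result.add(node);
            }
        }
        return result;
    }
}
```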

Intermediate Checkpoints

When dealing with long-running jobs, it is often useful to periodically save intermediate job state. This way you won't have to start from scratch if your job fails over to another node. For more information about saving intermediate job state, see the Checkpoint SPI documentation.
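
The following plain-Java sketch shows the general pattern under stated assumptions: an in-memory Map stands in for the checkpoint storage, and the names CheckpointSketch, State, and sumWithCheckpoints are hypothetical. The failAt parameter exists only to simulate a node failure; it is not part of any real API.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of intermediate checkpoints; not the Checkpoint SPI. */
final class CheckpointSketch {
    /** Intermediate job state: the next index to process and the running sum. */
    static final class State {
        final int nextIndex;
        final long sum;

        State(int nextIndex, long sum) {
            this.nextIndex = nextIndex;
            this.sum = sum;
        }
    }

    /**
     * Sums the data, saving state to the store after every 'interval' items.
     * If a checkpoint already exists for jobKey, processing resumes from it.
     * Pass a negative failAt to run without a simulated failure.
     */
    static long sumWithCheckpoints(String jobKey, int[] data, int interval,
                                   Map<String, State> store, int failAt) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        State saved = store.getOrDefault(jobKey, new State(0, 0L));
        int i = saved.nextIndex;
        long sum = saved.sum;
        while (i < data.length) {
            if (i == failAt) {
                throw new IllegalStateException("simulated node failure at index " + i);
            }
            sum += data[i];
            i++;
            if (i % interval == 0) {
                store.put(jobKey, new State(i, sum)); // save intermediate state
            }
        }
        store.remove(jobKey); // job finished, checkpoint no longer needed
        return sum;
    }

    public static void main(String[] args) {
        Map<String, State> store = new HashMap<>();
        int[] data = {1, 2, 3, 4, 5, 6, 7};
        try {
            sumWithCheckpoints("job-1", data, 3, store, 5);
        } catch (IllegalStateException e) {
            System.out.println("Failed; resuming from index " + store.get("job-1").nextIndex);
        }
        System.out.println("Result: " + sumWithCheckpoints("job-1", data, 3, store, -1));
    }
}
```

In this sketch the second call repeats only the items processed after the last saved checkpoint, rather than the whole data set.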

More Tips

  • Don't oversplit - make sure that the arguments passed between the task and its jobs are smaller than the data processed. If the size of the arguments becomes comparable to the size of the data, the network overhead becomes significant.
  • Segment your data - this way you avoid reloading the same data into memory on different nodes. Use either the affinity or the static segmentation method described above.
  • Maintain job granularity - try to have every job work on a specific logical data slice. Granular jobs allow more flexible failover logic, because each job can fail over to a different grid node without depending on or affecting the others.
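
One way to keep job arguments small while giving each job its own logical slice is to pass only a slice descriptor (offset and length) and let the job load that slice where it runs. The sketch below is plain Java with hypothetical names (SliceSketch, Slice, split); it assumes the data is addressable by a numeric offset, which may not hold for every data set.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of splitting work into small slice descriptors. */
final class SliceSketch {
    /** A job argument that describes a slice of the data instead of carrying it. */
    static final class Slice {
        final long offset;
        final long length;

        Slice(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /** Splits 'total' items into at most 'jobs' contiguous, non-empty slices. */
    static List<Slice> split(long total, int jobs) {
        if (total < 0 || jobs <= 0) {
            throw new IllegalArgumentException("total must be >= 0 and jobs > 0");
        }
        List<Slice> slices = new ArrayList<>();
        long base = total / jobs;
        long remainder = total % jobs;
        long offset = 0;
        for (int i = 0; i < jobs; i++) {
            // The first 'remainder' slices take one extra item each.
            long length = base + (i < remainder ? 1 : 0);
            if (length > 0) {
                slices.add(new Slice(offset, length));
                offset += length;
            }
        }
        return slices;
    }
}
```

Each Slice is two numbers regardless of how much data it describes, so the argument size stays far below the size of the data processed.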
