|
|||
|
Table Of Contents Prev << [Ch. 5: Compute Grid] Next >> [Ch. 7: Cloud Auto-Scaling] |
|||
DATA GRID
In this chapter:
- 6.1 Cache Collocation of Computations and Data
- 6.2 Cache Zero Deployment
- 6.3 Cache Deployment Modes
- 6.4 Cache Rich Post-Functional API
- 6.5 Cache Projections
- 6.6 Cache Transactions
- 6.7 Cache Queries
- 6.8 Cache Eviction Policies
- 6.9 Cache Preloading
- 6.10 Cache Store
Data Grid, or In-Memory Data Grid, is a fancy word for distributed data caching. In a nutshell it provides applications with ability to keep data in memory for high availability rather than constantly fetching it from slower storage elsewhere, like persistent data store or shared file systems.
On top of high availability, data grids generally allow to scale large amounts of data as well, which is also called data partitioning. When data is partitioned, every key-value pair stored on data grid will be assigned to a designated primary node, and optionally a configurable amount of designated back up nodes (which can optionally be active or inactive). Data should never be lost as long as at least one back up node for it still remains. Such approach allows to use memory available on all nodes within data grid as one whole shared memory, with each node responsible for caching a portion of data allocated to it.
GridGain data grid is one of the most comprehensive data grid solutions available on the market today. Some of its main features include:
- Local, Replicated, and Partitioned deployment modes
- Collocation of computations and data
- Extremely rich post functional APIs
- Zero Provisioning and Deployment for cached data
- Support for batch reads and writes
- Synchronous and asynchronous modes
- Concurrent atomic operations, such as putIfAbsent, replace, or put-with-predicate
- Advanced data querying, including support for SQL, TEXT, and FULL SCAN queries
- Pluggable persistent storage with support for read-through and write-through semantics
- Pluggable data affinity for data partitioning and collocation of computations with data
- Synchronous and Asynchronous data preloading
- Pluggable Data Overflow or Swap to disk for effective memory management
- Optimistic and Pessimistic transactions with Read-Committed, Repeatable-Read, and Serializable isolation levels
- Extremely scalable, feather-weight Eventually-Consistent transactions
- Pluggable Eviction Policies, including out of the box support for LIRS, LRU, LFU, FIFO, and RANDOM eviction modes
- Flexible Cache Projections for fine grained control over cache behavior and custom cache views
- Pluggable JEE/JTA integration
In this chapter we will go over some of the key features of GridGain 3.0 Data Grid in detail. However, this documentation is not meant to replace main Javadoc API Documentation
, and you should still refer to Javadoc for detailed explanation of main abstractions.
6.1 Cache Collocation of Computations and Data
One of the major scalability problems of utilizing data grids is unnecessary data noise which may consume significant amount of bandwidth and often can bring a server to its knees. Imagine a scenario when you are using a partitioned cache and have to constantly retrieve various key-value pairs from cache and perform some computation on them. However, in partitioned mode, every key-value pair may or may not be cached on the local node, so it needs to be fetched from remote nodes. Once the data is fetched and brought to a local node, you perform the computation on it and, once you are done, the data you just requested is most likely discarded. It may be cached in Near cache on the local mode (which is default GridGain behavior), but Near caches are generally much smaller than partitioned caches (size limitation) and have more aggressive eviction policies than partitioned caches. So to summarize, most of the data access from caches is either immediately discarded or will be discarded shortly after - thus creating unnecessary noise traffic.
It is much more effective to bring the computation exactly to the node where data resides as opposed to bring the data to computation. In GridGain this is easily achieved via compute and data grid integration. Here is how this can be achieved in GridGain 3.0 by using @GridCacheAffinityMapped ![]()
Grid g = G.grid(); final GridCache<Integer, String> cache = g.cache(); // Get default cache. final Integer key = 1; String result = g.call(BALANCE, new Callable<String>() { // Affinity key for the job. The job will travel // to the node where the key is cached. @GridCacheAffinityMapped public Integer affinityKey() { return key; } // The logic below will be executed on remote node which // is responsible for caching the specified affinity key. @Override public String call() throws Exception { // Get locally cached value. String val = cache.get(key); // Perform some computation on retrieved value. ... return "OK"; } });
6.2 Cache Zero Deployment
Zero Deployment is a GridGain feature which automatically monitors deployed resources on the grid and redeploys it whenever they change. With GridGain you can basically startup several grid nodes, and just leave them running. You will never have to deploy or redeploy anything on them. All you do is keep writing and changing your code, and whenever you need to execute it, just hit the Run button in your IDE and your new code will be automatically deployed on all grid nodes. This feature works with both, compute and data grid. However, unlike compute grid, data grid keeps auto-deployed objects cached on remote nodes, and hence behaves a little differently. GridGain cache can be deployed in two different GridDeploymentModes ![]()
In SHARED mode, objects will be auto-deployed on remote nodes, but they will be automatically undeployed (hence, removed from cache) whenever either code changes or last node from which a resource has been deployed leaves. The class loader on remote nodes is shared only for the nodes that have common classes, thus if nodes don't share the same code base, their class loaders remotely will not be shared. This mode is ideal for development, when code changes quite frequently, and after every change it is generally best to start off fresh.
In CONTINUOUS mode objects are also automatically deployed on remote nodes, but they all share the same class-loader remotely and never get automatically undeployed / removed unless specifically specified this way by a user by changing userVersion in META-INF/gridgain.xml file. This mode is ideal for production when it is generally undesirable to undeploy (and hence remove) any object from cache.
6.3 Cache Deployment Modes
GridGain data grid can be deployed in any of the following 3 modes defined by GridCacheMode ![]()
- LOCAL mode is the most light weight mode of cache operation, as no data is distributed to other cache nodes. It is ideal for scenarios where data is either read-only, or can be periodically refreshed at some expiration frequency. It also works very well with read-through behavior where data is loaded from persistent storage on misses. Other than distribution, local caches still have all the features of distributed cache, such as automatic data eviction, expiration, disk swapping, data querying, transactions, and more.
- REPLICATED mode provides the utmost availability as data is available on every grid node. However, in this mode every data update must be propagated to all other nodes which can have an impact on performance and scalability. As the same data is stored on all grid nodes, the size of replicated cache is limited by the amount of memory available on a node. This mode is ideal for scenarios where updates are infrequent and data availability is most important.
- PARTITIONED cache is the most scalable distributed cache mode. In this mode the overall data set is divided equally into partitions and all partitions are split equally between participating nodes, essentially creating one huge distributed memory for caching data. This approach allows for storing as much data as can be fit in the total memory available across all nodes, hence allowing for loading gigabytes and terabytes of data into cache memory. Partitioned cache is always fronted by a smaller local, or near, cache which stores most recently or most frequently accessed data. Such combination provides for high availability of data that is accessed often together with high scalability of partitioned cache. This mode is ideal for scenarios where data volumes are large and updates are relatively frequent.
6.4 Cache Rich Post-Functional API
Majority of data grid products provide a simple java.util.concurrent.ConcurrentMap API for working with data grids. However plain ConcurrentMap API is quite limiting and does not often provide the desired convenience or usability. For example, imagine that you need to store objects of different types in cache, say Person and Organization, keyed by an Integer. Using plain ConcurrentMap<K, V> generics you would have to lose strong typing provided by generics and declare the map as ConcurrentMap<Integer, Object>... ouch! Or take a look at methods like Map.put(..), ConcurrentMap.putIfAbsent(..), or Map.remove(..). If you follow standard Map API then both of these methods have to return a previous value. However, when working with caches, returning previous value is expensive as it may involve a trip to persistent data store or to a neighboring node - why make that extra network trip for cases when previous value is not needed? To address these issues, and many others, GridCacheProjection ![]()
Here is some functionality available on GridCacheProjection API:
- Various get(..) methods to synchronously or asynchronously get values from cache.
- Various put(..), putIfAbsent(..), and replace(..) methods to synchronously or asynchronously put single or multiple entries into cache.
- Various remove(..) methods to synchronously or asynchronously remove single or multiple keys from cache.
- Various contains(..) method to check if cache contains certain keys or values.
- Various forEach(..), forAny(..), and reduce(..) methods to visit every cache entry within this projection.
- Various flagsOn(..)', flagsOff(..), and projection(..) methods to set specific flags and filters on a cache projection.
- Methods like keySet(..), values(..), and entrySet(..) to provide views on cache keys, values, and entries.
- Various peek(..) methods to peek at values in global or transactional memory, swap storage, or persistent storage.
- Various reload(..) methods to reload latest values from persistent storage.
- Various unswap(..) methods to load specified keys from swap storage into global cache memory.
- Various invalidate(..) methods to set cached values to null.
- Various lock(..), unlock(..), and isLocked(..) methods to acquire, release, and check on distributed locks on a single or multiple keys in cache.
- Various clear(..) methods to clear elements from cache, and optionally from swap storage.
- Various evict(..) methods to evict elements from cache, and optionally store them in underlying swap storage for later access.
- Various txStart(..) and inTx(..) methods to perform various cache operations within a transaction.
- Various createXxxQuery(..) methods to query cache using either SQL, LUCENE, H2TEXT text search, or SCAN for filter-based full scan.
- Various mapKeysToNodes(..) methods which provide node affinity mapping for given keys.
- Various gridProjection(..) methods which provide GridProjection
Javadoc only for nodes on which given keys reside.
Extended Put And Remove Methods
All methods that end with 'x', such as putx(..) or removex(..), provide the same functionality as their sibling methods that don't end with 'x', however, instead of returning a previous value, they return a boolean flag indicating whether operation succeeded or not. Returning a previous value may involve a network trip or a persistent store lookup and should be avoided whenever not needed.
6.5 Cache Projections
Cache projections, defined by GridCacheProjection ![]()
// Only objects of type Person. GridCacheProjection<Integer, Person> people = grid.cache().projection(Integer.class, Person.class); // Only objects of type Company. GridCacheProjection<Integer, Company> companies = grid.cache().projection(Integer.class, Company.class);
Or here is how you would programatically enable synchronousCommit mode for the view on object of type Person defined above:
GridCacheProjection<Integer, Person> syncCommitPeople = people.flagsOn(GridCacheFlag.SYNC_COMMMIT);
6.6 Cache Transactions
Most of the data grid products support transactions. However in many cases they will only provide automatic enlisting into an ongoing JEE/JTA cache transaction which is quite limiting, especially when not running in JEE container. In many cases it is a lot more convenient to use cache transactions directly. GridGain supports both, automatic enlistment into ongoing JEE transaction and explicit cache transactions. Explicit transactions are supported via GridCacheTx ![]()
GridGain supports OPTIMISTIC, and PESSIMISTIC concurrency levels, and READ_COMMITTED, REPEATABLE_READ, and SERIALIZABLE isolation levels. Such a wide support for concurrency and isolation levels allows to model any kind of concurrent access pattern on any set of data. Note that GridGain also supports EVENTUALLY_CONSISTENT transaction concurrency level for REPLICATED cache mode and support for PARTITIONED mode is coming up.
Here are examples of how transactions can be used:
GridCache<String, Integer> cache = G.grid().cache(); ... GridCacheTx tx = cache.txStart(); try { // Perform transactional operations. Integer v1 = cache.get("k1"); Integer old1 = cache.put("k2", 2); cache.removex("k3"); // Commit the transaction. tx.commit(); } finally { tx.end(); // Rollback, if was not committed. }
Or, the same logic as above can be executed by passing one or more closures to any of the GridCache.inTx(..) methods as follows:
GridCache<String, Integer> cache = G.grid().cache(); ... cache.inTx(new CI1<GridCacheProjection<String, Integer>>() { @Override public void apply(GridCacheProjection<String, Integer> cache) { // Perform transactional operations. Integer v1 = cache.get("k1"); Integer old1 = cache.put("k2", 2); cache.removex("k3"); } }
6.7 Cache Queries
Why would you ever query cached data if you can query your persistent store, such as database? Well, the answer is the same as for accessing data by key from cache vs. getting it from database - for performance and scalability. However, querying cache is not exactly the same as querying your database - the main difference is that if cache only has a subset of data stored in database, then you will be only querying that subset, so query result will be reflecting only in-memory state. Does this matter? Depends on your application requirements and also depends on the amount of data you are able to store in cache.
With introduction of cloud computing and virtual instances, the amount of memory available to your grid on the cloud becomes virtually limitless. Adding nodes to your grid has become as simple as calling AWS API on EC2 whenever your application demands it. On top of it, if GridGain swap space is configured, all the data that cannot fit in memory on a single node will be overflown to disk. Also, your application may not even have that much data, or often querying cached data, which usually contains data that has been accessed relatively recently, is good enough. Thus in many cases querying cache is becoming to look more and more like querying your database.
Now that you made a decision in your project that you want to query cached data, the next question becomes how to cache query results. Most of us are familiar with Hibernate and it's support for 2nd Level Caching which also comes with Query Cache. The way query cache works in Hibernate is generally the way we are used to think of caching queried data. In a nutshell, a query is issued against the database and the results of the query are then stored in cache in a single collection. If you have multiple queries, then multiple collections containing query results are stored. Now if you ever update a single bean in Hibernate which can potentially affect the query result (pretty much any change to the queried tables), Hibernate is forced to invalidate (remove) the cached query results from cache and reload them on-demand next time. This significantly increases memory consumption, and frequent cache invalidations of query results perform horribly and do not scale at all. Even Hibernate itself discourages its users from using it. Here is the quote from Hibernate documenation
:
... most queries do not benefit from caching or their results. So by default, individual queries are not cached even after enabling query caching ...
So, how does querying of cached data help? It helps by entirely removing the need for 'query result cache' altogether. SQL queries on your indexed cached data are executed in memory and perform very fast, so there is no more need to cache query results. Just run your SQL query on your cached data and get the results whenever you need them. However, it is important to note that without rich SQL support for cache queries, they will not be able to replace database queries within your project. In the example below, where Person relates to Company, if your cache does not support SQL joins, then you would not be able to find all people working for the same company, which may be quite limiting. Hense, it is extremely important to evaluate how rich the SQL support on a certain cache product before making a decision to query cached data.
In GridGain 3.0 the support for cache queries is virtually without any limitations. If you know SQL, you can run queries against cached data without any limitations, including support for any type of joins, any where clause keywords, order by, group by, etc... In addition to SQL queries, GridGain also supports text queries using Lucene or H2 TEXT underlying indexing. You can also run predicate-based FULL SCAN queries, which will iterate over all cache elements on remote nodes and will include only the ones that passed the predicate filter.
There are 3 main query APIs in GridGain: GridCacheQuery ![]()
![]()
![]()
SQL Queries
GridCacheQueryType.SQL query type allows to execute distributed cache queries using standard SQL syntax. All values participating in where clauses or joins must be annotated with GridCacheQuerySqlField ![]()
documentation directly. For full set of supported SQL syntax refer to H2 SQL Select Grammar
.
Note that whenever using 'group by' queries, only individual page results will be sorted and not the full result sets. However, if a single node is queried, then the result set will be quite accurate.
Text Queries
GridGain supports two type of text queries: GridCacheQueryType.LUCENE and GridCacheQueryType.H2TEXT. All fields that are expected to show up in text query results must be annotated with GridCacheQueryLuceneField ![]()
![]()
Scan Queries
Sometimes when it is known in advance that SQL query will cause a full data scan, or whenever data set is relatively small, the GridCacheQueryType.SCAN query type may be used. With this query type GridGain will iterate over all cache entries, skipping over the entries that don't pass the optionally provided key or value filters. In this mode the query clause should not be provided.
Execute vs. Visit
If there is no need to return result to the caller node, you can save on a potentially significant network overhead by visiting all query results directly on remote nodes by calling GridCacheQuery.visit(GridPredicate, GridProjection...) method. With this method, all the logic is performed inside of query predicate directly on the queried nodes. If the predicate will return false while visiting, then visiting will finish immediately.
Optional Key and Value filters
Note that all query results may be additionally filtered by specifying predicates for key and value filtering via GridCacheQuery.remoteKeyFilter(GridOutClosure) and GridCacheQuery.remoteValueFilter(GridOutClosure) methods. These additional filters are useful whenever filtering is based on logic or methods not available in SQL or TEXT queries. For 'SCAN' queries this filters should be usually provided as they are used directly to filter the query results during full scan.
Query Future Iterators
Note that GridCacheQueryFuture ![]()
Query Usage Examples
As an example, suppose we have data model consisting of 'Employee' and 'Organization' classes defined as follows:
public class Organization { @GridCacheQuerySqlField(unique = true) private long id; @GridCacheQuerySqlField private String name; ... } public class Person { // Unique index. @GridCacheQuerySqlField(unique=true) private long id; @GridCacheQuerySqlField private long orgId; // Organization ID. // Not indexed. private String name; // Non-unique index. @GridCacheQuerySqlField private double salary; // Index for text search. @GridCacheQueryLuceneField private String resume; ... }
Then you can create and execute queries that check various salary ranges like so:
GridCache<Long, Person> cache = G.grid().cache(); ... // Create query which selects salaries based on range for all employees // that work for a certain company. GridCacheQuery<Long, Person> qry = cache.createQuery(SQL, Person.class, "from Person, Organization where Person.orgId = Organization.id " + "and Organization.name = ? and Person.salary > ? and Person.salary <= ?"); // Query all nodes to find all cached GridGain employees // with salaries less than 1000. qry.queryArguments("GridGain", 0, 1000).execute(grid); // Query only remote nodes to find all remotely cached GridGain employees // with salaries greater than 1000 and less than 2000. qry.queryArguments("GridGain", 1000, 2000).execute(grid.remoteProjection()); // Query local node only to find all locally cached GridGain employees // with salaries greater than 2000. qry.queryArguments(2000, Integer.MAX_VALUE).execute(grid.localNode());
Here is a possible query that will use Lucene text search to scan all resumes to check if employees have Master degree:
GridCacheQuery<Long, Person> mastersQry = cache.createQuery(LUCENE, Person.class, "Master"); // Query all cache nodes. mastersQry.execute(grid.localNode()));
6.8 Eviction Policies
Selecting proper cache eviction strategy is one of the main parts of cache configuration. Generally, eviction controls the maximum number of elements that can be stored in cache, just so cache does not grow indefinitely. However, every eviction strategy will evict elements in different order and selecting a wrong strategy can have a significant impact on cache hit ratio and performance.
It is also important to note that eviction policy is pluggable in GridGain, and users can plug their own eviction policy whenever none of the ones provided out of the box is adequate. The following eviction policies are available out of the box:
- GridCacheLruEvictionPolicy
Javadoc 
- GridCacheLirsEvictionPolicy
Javadoc 
- GridCacheFifoEvictionPolicy
Javadoc 
- GridCacheRandomEvictionPolicy
Javadoc 
- GridCacheAlwaysEvictionPolicy
Javadoc 
- GridCacheNeverEvictionPolicy
Javadoc 
Note that if GridCacheConfiguration.isSwapEnabled() set to true, then evicted entries will be overflown to a swap storage, which is, by default, a file-based disk storage defined by GridFileSwapSpaceSpi ![]()
6.9 Cache Preloading
Preloading newly started cache nodes is important whenever it is necessary to have common data set in memory on all nodes (you may not need it if cache can always read-through a missing value from a persistent store). When preloading is enabled (i.e. has value other than GridCachePreloadMode.NONE), distributed caches will attempt to preload all necessary values from other grid nodes. GridGain supports the following preloading modes defined in GridCachePreloadMode ![]()
- GridCachePreloadMode.SYNC mode is a synchronous preload mode. Distributed caches will not start until all necessary data is loaded from other available grid nodes.
- GridCachePreloadMode.ASYNC mode is asynchronous preload mode (this mode is configured by default). Distributed caches will start immediately and will load all necessary data from other available grid nodes in the background.
- GridCachePreloadMode.NONE mode is there to disable preloading. In this mode no preloading will take place which means that caches will be either loaded on demand from persistent store whenever data is accessed, or will be populated explicitly.
Note that REPLICATED caches will try to load the full set of cache entries from other nodes (or as defined by pluggable GridCacheAffinity ![]()
Also note that preload mode does not makes sense for LOCAL caches as they are local by definition and, therefore, cannot preload any values from neighboring nodes.
6.10 Cache Store
Persistent storage in GridGain is defined by GridCacheStore ![]()
Note that there is also refresh-ahead mode specified by GridCacheConfiguration.getRefreshAheadRatio() configuration parameter. If value is other than zero, then entry will be preloaded in the background whenever it is accessed and refresh ratio of it's total time-to-live has passed. This feature ensures that entries are always automatically re-cached whenever they are nearing expiration.
For example, if refresh ratio is set to 0.75 and entry's time-to-live is 1 minute, then if this entry is accessed any time after 45 seconds since last update (which is 0.75 of a minute), the cached value will be immediately returned, but entry will be asynchronously reloaded from persistent store in the background.
Provided Cache Store Implementations
GridGain comes with two cache store implementations provided:
|
Do not use these implementations in production environment since it may slow down performance. These implementations are for local tests and POC. |
Here is an example on how to configure GridCacheHibernateBlobStore ![]()
org/gridgain/grid/cache/store/hibernate/GridCacheHibernateBlobStoreEntry.hbm.xml or annotated class GridCacheHibernateBlobStoreEntry ![]()
<bean id="cache.hibernate.store" class="org.gridgain.grid.cache.store.hibernate.GridCacheHibernateBlobStore"> <property name="sessionFactory"> <bean class="org.springframework.orm.hibernate3.LocalSessionFactoryBean"> <property name="hibernateProperties"> <value> hibernate.connection.url=jdbc:h2:mem:db;DB_CLOSE_DELAY=-1 hibernate.show_sql=true hibernate.hbm2ddl.auto=true </value> </property> <property name="mappingResources"> <list> <value> org/gridgain/grid/cache/store/hibernate/GridCacheHibernateBlobStoreEntry.hbm.xml </value> </list> </property> </bean> </property> </bean>
Here is an example on how to configure GridCacheHibernateBlobStore ![]()
<bean id="cache.hibernate.store2" class="org.gridgain.grid.cache.store.hibernate.GridCacheHibernateBlobStore"> <property name="hibernateProperties"> <props> <prop key="hibernate.connection.url">jdbc:h2:mem:db;DB_CLOSE_DELAY=-1</prop> <prop key="hibernate.hbm2ddl.auto">update</prop> <prop key="hibernate.show_sql">true</prop> </props> </property> </bean>
Here is an example on how to configure GridCacheJdbcBlobStore ![]()
<bean id="cache.jdbc.store1" class="org.gridgain.grid.cache.store.jdbc.GridCacheJdbcBlobStore"> <property name="connectionUrl" value="jdbc:h2:mem:db;DB_CLOSE_DELAY=-1"/> <property name="createTableQuery" value="create table if not exists ENTRIES (key other, val other)"/> </bean>
Cache Store Examples
Example implementations of cache stores (one is backed by JDBC and another one by Hibernate) can be found under GRIDGAIN_HOME/examples/java/org/gridgain/examples/cache/store folder. These stores are more suitable for production environments than stores provided since they are strongly typed.
See Also
GridCacheStoreAdapter ![]()
|
|||
|
Table Of Contents Prev << [Ch. 5: Compute Grid] Next >> [Ch. 7: Cloud Auto-Scaling] |
|||
