- Cannot Start gridgain.sh Or gridgain.bat Scripts
- Starting GridGain In Debug Mode
- I am getting core dumps... What to do?
- Windows command line shows "The input line is too long."
- Communication Exceptions On Linux/Unix
- java.net.BindException On Windows
- IP multicast Is Not Working
- Windows Vista Deep Sleep
- Using DHCP
- Using JConsole with Windows Domain Logon
- Using JConsole with Windows Vista
- Using BEA JRockit
- Clocks Synchronization
- Server vs. Client VM
- JGroups and IPv6
- JGroups and /etc/hosts
- Linux Compiz Fusion and GridGain Installer
- GridGain cannot bind to any port on Linux
- Linux installer shows "java.awt.AWTError Assistive Technology not found" error
- I need to add my libraries/jars to the classpath
- JConsole/VisualVM failed to connect with IOException
- I am getting "Address already in use" error on Windows
Cannot Start gridgain.sh Or gridgain.bat Scripts
If you cannot start bin/gridgain.sh or bin/gridgain.bat scripts, first make sure that the scripts have executable permission (especially on Linux/Unix).
If you get the following error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/gridgain/grid/loaders/cmdline/GridCommandLineLoader
then your GRIDGAIN_HOME environment variable is not set or is set incorrectly. Please set GRIDGAIN_HOME to your GridGain installation folder.
Starting GridGain In Debug Mode
To start GridGain in debug mode, please uncomment the following section in GRIDGAIN_HOME/config/default-log4j.xml file:
<category name="org.gridgain"> <level value="DEBUG"/> </category>
I am getting core dumps... What to do?
If you are getting core dumps (JVM crashes), your OS is most likely missing recent patches. Ask your system administrator to install the latest available patches for your OS, and upgrade to the latest minor version of the JDK you are using. Note that this can happen on any OS, whether Windows or Unix.
GridGain does not have any OS-specific native code, but it uses Java NIO for inter-node communication, which may not work well if the latest patches are not installed. Sometimes it also helps to disable direct buffers on GridTcpCommunicationSpi. To disable direct buffers, pass false to setDirectBuffer(boolean).
Windows command line shows "The input line is too long."
When running gridgain.bat on Windows, you receive the error message: "The input line is too long."
The gridgain.bat script runs several commands. The script passes command line arguments to some of these commands (notably ones running Java™) as variables.
These commands can become too long for the Windows® command line environment when variables used on the command line become too large. This is usually caused by a lengthy CLASSPATH (%CP%) environment variable.
Solution.
Redefine the CLASSPATH variable in your command (DOS) environment to reduce its length. Alternatively, use drive mapping or the SUBST command to shorten the path to the GRIDGAIN_HOME folder that contains the libraries.
Communication Exceptions On Linux/Unix
There are a number of system settings that can affect Linux/Unix network performance.
Increase available number of ports on Linux
The /proc/sys/net/ipv4/ip_local_port_range file defines the local port range that is used by TCP and UDP connections. It contains two numbers:
- the first number is the first local port allowed for TCP and UDP traffic on the server
- the second is the last local port number
Change it to 1024-65535 by editing the /etc/sysctl.conf file and adding the following line:
# Port range for TCP/UDP traffic.
net.ipv4.ip_local_port_range = 1024 65535
You need to restart your network for the change to take effect.
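To confirm that the new range is in effect, the two numbers can be read back and turned into a port count. The following is a minimal sketch, not part of GridGain; the class name `LocalPortRange` and its `portCount` helper are hypothetical names chosen for this illustration, and the `main` method only works on Linux, where the file exists.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not a GridGain class.
public class LocalPortRange {
    /**
     * Parses the two whitespace-separated numbers found in
     * ip_local_port_range and returns how many ports the range contains.
     */
    static int portCount(String content) {
        String[] parts = content.trim().split("\\s+");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected two numbers, got: " + content);
        }
        int first = Integer.parseInt(parts[0]);
        int last = Integer.parseInt(parts[1]);
        return last - first + 1; // both ends of the range are inclusive
    }

    public static void main(String[] args) throws IOException {
        // Linux only: this file does not exist on other systems.
        byte[] raw = Files.readAllBytes(Paths.get("/proc/sys/net/ipv4/ip_local_port_range"));
        String content = new String(raw, StandardCharsets.UTF_8);
        System.out.println("Range: " + content.trim());
        System.out.println("Local ports available: " + portCount(content));
    }
}
```

With the setting shown above (1024 65535), the range holds 64512 local ports.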
Increase number of file descriptors
Use the following sh or ksh command to increase the number of file descriptors:
$ ulimit -n 1024
In csh, use this variation:
% limit descriptors 1024
However, this only lets you raise the number of file descriptors to a maximum of 1024. To remove this limitation, follow these instructions:
Solaris
Modify the /etc/system file and change these settings:
# Hard limit on file descriptors.
set rlim_fd_max = 4096
# Soft limit on file descriptors.
set rlim_fd_cur = 1024
Note that the system must be rebooted for the changes to take effect.
Linux
In /etc/security/limits.conf, add the lines:
* soft nofile 1024
* hard nofile 4096
In /etc/pam.d/login, add:
session required /lib/security/pam_limits.so
Add the following three lines to the /etc/rc.d/rc.local startup script:
# Increase system-wide file descriptor limit.
echo 4096 > /proc/sys/fs/file-max
echo 16384 > /proc/sys/fs/inode-max
Red Hat allows you to put the configuration changes into the /etc/sysctl.conf file with:
# Increase system-wide file descriptor limit.
fs.file-max = 4096
fs.inode-max = 16384
Restart
Note that most changes to Linux/Unix system tables require a reboot or a restart of the network service. Consult your operating system manuals for more details.
java.net.BindException On Windows
When a TCP/IP socket is closed, it goes into the TIME_WAIT state before closing, for a period of time determined by the Windows operating system. A socket in the TIME_WAIT state cannot be reused, and this usually limits the maximum rate at which network connections can be created and disconnected.
The symptoms of this limitation are usually:
- All of the TCP/IP resources of the operating system are in use, and requests for new connections fail, causing a java.net.BindException to be thrown.
- Running the netstat -a command on the application machine shows a large number of sockets in TIME_WAIT state.
To improve the ability of the Windows operating system to deal with a high rate of network connections, add the following registry entries in:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\TCPIP\Parameters
| Entry | Description |
|---|---|
| TcpTimedWaitDelay | A DWORD value, in the range 30-300, that determines the time in seconds that elapses before TCP can release a closed connection and reuse its resources. Set this to a low value to reduce the amount of time that sockets stay in TIME_WAIT. |
| MaxUserPort | A DWORD value that determines the highest port number that TCP can assign when an application requests an available user port. Set this to a high value to increase the total number of sockets that can be connected to the port. |

Using Windows Registry: Using Windows Registry Editor incorrectly may cause serious problems that may require reinstallation of your operating system.
For example, setting TcpTimedWaitDelay to 30 seconds and MaxUserPort to 32678 can improve the overall system performance under network load. See the operating system documentation for more details.
IP multicast Is Not Working
When nodes cannot see each other, it usually means that IP Multicast is not working properly on your network. Getting IP Multicast to work properly can sometimes be a challenge. In its default configuration, GridGain uses IP Multicast for node discovery through GridMulticastDiscoverySpi. IP Multicast may also be used by GridJgroupsCommunicationSpi and GridJgroupsDiscoverySpi, depending on how JGroups is configured (see JGroups TCP configuration example).
The following tips may be useful when enabling IP Multicast:
Address
- IP-multicast uses addresses in the range between 224.0.0.0 and 239.255.255.255
- Addresses [0-9].0.0.1 should not be used
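A configured group address can be checked against the range above before starting a node. This is a minimal sketch that relies on the standard `java.net.InetAddress.isMulticastAddress()` method; the class name `MulticastAddressCheck` is a hypothetical name used for illustration and is not part of GridGain.

```java
import java.net.Inet4Address;
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper for illustration only; not a GridGain class.
public class MulticastAddressCheck {
    /**
     * Returns true if the given IPv4 literal lies between
     * 224.0.0.0 and 239.255.255.255. Pass a numeric address,
     * not a host name, so that no DNS lookup is performed.
     */
    static boolean isIpv4Multicast(String literal) {
        try {
            InetAddress addr = InetAddress.getByName(literal);
            return addr instanceof Inet4Address && addr.isMulticastAddress();
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("Not a valid address: " + literal, e);
        }
    }

    public static void main(String[] args) {
        for (String candidate : args) {
            System.out.println(candidate + " multicast: " + isIpv4Multicast(candidate));
        }
    }
}
```

Run it with one or more candidate addresses as arguments to see which of them fall inside the multicast range.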
Firewall
Most operating systems come with a software firewall enabled by default. If you already have a hardware firewall, you most likely don't need the one that comes with the operating system. Try disabling it and see if IP Multicast starts working.
If disabling the operating system firewall is not an option, configure the firewall to accept and properly route IP Multicast packets.
SELinux and iptables.
We noticed that, at least in Fedora Core 9, multicast does not work with the default SELinux configuration. To get it working you need either to configure SELinux properly or, more simply, to disable it.
Also, if your computer is not exposed directly to the Internet (it is behind a router or firewall), we recommend disabling the iptables and ip6tables services on Linux.
Add IP Multicast Route
The system should have at least one route for IP multicast traffic. You can add a route so that all multicast traffic uses the correct interface (Linux example):
route add -net 224.0.0.0 netmask 240.0.0.0 dev eth0
Note that if there are multiple IP multicast applications you will need to configure the route for a specific IP multicast address.
IPv4 and IPv6
If you have an OS with IPv6 enabled, Java applications may try to route IP multicast traffic over IPv6. Use the java.net.preferIPv4Stack=true system property to prevent this.
Local Network
If you don't have a network installed, then most likely your IP Multicast packets will be routed to your ISP router, which is probably blocking them. Try enabling loopback IP Multicast on your local box, so that multiple grid nodes started on your local box can see each other.
Multiple Interfaces
This problem is mostly noticed with JGroups. Try adding your IP address to the hosts file on your operating system (e.g. /etc/hosts on Unix/Linux) and restart your network. Also, specify your local IP address as the local bind configuration parameter in the JGroups configuration file or in GridMulticastDiscoverySpi, depending on which SPI you are using to start the grid.
JGroups TCP configuration example.
If you have any issues when using multicast, you can set up a JGroups configuration based on TCP. Here is an example:
For GridGain 1.6.1 and earlier, use this configuration:
<config>
    <!--
        Specifying TCP in your protocol stack tells JGroups to
        use TCP to send messages between group members. Instead of
        using a multicast bus, the group members create a mesh of
        TCP connections.

        start_port: The port to create the server socket on. If the
        specified port is not available, the TCP protocol increments
        it in a loop until it finds an available port.

        loopback: Loops back messages to self if true. False by default.
    -->
    <TCP
        start_port="12345"
        loopback="true"
    />
    <!--
        The TCPPING protocol requires a static configuration, which
        assumes that you know in advance where to find other members
        of your group.

        timeout: The time interval, in milliseconds, to wait for
        initial membership replies.

        initial_hosts: The comma-separated list of host names and
        ports to connect to in order to get the initial membership.

        port_range: The number of consecutive ports to be probed when
        getting the initial membership, starting with the port
        specified in the initial_hosts parameter.

        num_initial_members: Wait for at most this many initial
        membership replies, but not longer than "timeout" milliseconds.
    -->
    <TCPPING
        timeout="3000"
        initial_hosts="192.168.0.157[12345]"
        port_range="3"
        num_initial_members="3"
    />
    <!--
        Failure detection based on heartbeat messages.

        timeout: Max number of ms to wait for a response.

        max_tries: Max number of missed responses until a member
        is declared suspected.
    -->
    <FD
        timeout="2000"
        max_tries="4"
    />
    <!--
        Verifies that a suspected member is really dead
        by pinging that member once again.

        timeout: How long to wait for a response from the suspected
        member before passing the SUSPECT event up the stack.
    -->
    <VERIFY_SUSPECT
        timeout="1500"
    />
    <!--
        Lossless and FIFO delivery of multicast messages,
        using negative acks.

        gc_lag: Always leaves 100 msgs in the retransmit buffer.

        retransmit_timeout: Asks for retransmission of the same msg
        after 600ms, then 1200 etc.
    -->
    <pbcast.NAKACK
        gc_lag="100"
        retransmit_timeout="600,1200,2400,4800"
    />
    <!--
        Garbage collects messages that have been seen by
        all members of a cluster.

        stability_delay: The upper bound, in milliseconds, of the
        random delay a member waits before sending a STABILITY message.

        desired_avg_gossip: Gossip randomly every 20 secs
        (time-based gossiping).

        max_bytes: Gossip when any number of bytes have been received
        (size-based gossiping).
    -->
    <pbcast.STABLE
        stability_delay="1000"
        desired_avg_gossip="20000"
        max_bytes="0"
    />
    <!--
        Group Membership Service. Responsible for
        joining/leaving members.

        print_local_addr: Print the member's local address to stdout.

        join_timeout: Wait for 5 secs for a valid response until we
        retry the JOIN (sent to the coordinator).

        join_retry_timeout: If we have to retry the JOIN, wait 2
        secs before retrying.

        shun: Shun a member that was declared dead, but
        came back nevertheless.
    -->
    <pbcast.GMS
        print_local_addr="true"
        join_timeout="5000"
        join_retry_timeout="2000"
        shun="true"
    />
</config>
Starting from version 2.0, we support the JGroups multiplexer functionality, and the configuration looks a bit different:
<protocol_stacks>
    <stack name="grid.jgroups.stack" description="Grid configuration stack">
        <config>
            <!--
                Specifying TCP in your protocol stack tells JGroups to
                use TCP to send messages between group members. Instead of
                using a multicast bus, the group members create a mesh of
                TCP connections.

                start_port: The port to create the server socket on. If the
                specified port is not available, the TCP protocol increments
                it in a loop until it finds an available port.

                loopback: Loops back messages to self if true. False by default.
            -->
            <TCP
                start_port="12345"
                loopback="true"
            />
            <!--
                The TCPPING protocol requires a static configuration, which
                assumes that you know in advance where to find other members
                of your group.

                timeout: The time interval, in milliseconds, to wait for
                initial membership replies.

                initial_hosts: The comma-separated list of host names and
                ports to connect to in order to get the initial membership.

                port_range: The number of consecutive ports to be probed when
                getting the initial membership, starting with the port
                specified in the initial_hosts parameter.

                num_initial_members: Wait for at most this many initial
                membership replies, but not longer than "timeout" milliseconds.
            -->
            <TCPPING
                timeout="3000"
                initial_hosts="192.168.0.157[12345]"
                port_range="3"
                num_initial_members="3"
            />
            <!--
                Failure detection based on heartbeat messages.

                timeout: Max number of ms to wait for a response.

                max_tries: Max number of missed responses until a member
                is declared suspected.
            -->
            <FD
                timeout="2000"
                max_tries="4"
            />
            <!--
                Verifies that a suspected member is really dead
                by pinging that member once again.

                timeout: How long to wait for a response from the suspected
                member before passing the SUSPECT event up the stack.
            -->
            <VERIFY_SUSPECT
                timeout="1500"
            />
            <!--
                Lossless and FIFO delivery of multicast messages,
                using negative acks.

                gc_lag: Always leaves 100 msgs in the retransmit buffer.

                retransmit_timeout: Asks for retransmission of the same msg
                after 600ms, then 1200 etc.
            -->
            <pbcast.NAKACK
                gc_lag="100"
                retransmit_timeout="600,1200,2400,4800"
            />
            <!--
                Garbage collects messages that have been seen by
                all members of a cluster.

                stability_delay: The upper bound, in milliseconds, of the
                random delay a member waits before sending a STABILITY message.

                desired_avg_gossip: Gossip randomly every 20 secs
                (time-based gossiping).

                max_bytes: Gossip when any number of bytes have been received
                (size-based gossiping).
            -->
            <pbcast.STABLE
                stability_delay="1000"
                desired_avg_gossip="20000"
                max_bytes="0"
            />
            <!--
                Group Membership Service. Responsible for
                joining/leaving members.

                print_local_addr: Print the member's local address to stdout.

                join_timeout: Wait for 5 secs for a valid response until we
                retry the JOIN (sent to the coordinator).

                join_retry_timeout: If we have to retry the JOIN, wait 2
                secs before retrying.

                shun: Shun a member that was declared dead, but
                came back nevertheless.
            -->
            <pbcast.GMS
                print_local_addr="true"
                join_timeout="5000"
                join_retry_timeout="2000"
                shun="true"
            />
        </config>
    </stack>
</protocol_stacks>
See http://wiki.jboss.org/wiki/Wiki.jsp?page=JGroups for additional details.
Using JConsole with Windows Domain Logon
If you are using the standard JDK JConsole on Windows, you may encounter undocumented behavior in which JConsole cannot connect locally when you are logged in to a Windows domain. In this case you will need to connect as 'remote'. For more information on troubleshooting, see http://java.sun.com/j2se/1.5.0/docs/guide/management/faq.html
Using JConsole with Windows Vista
Windows Vista has a known problem with running Java applications and JConsole in elevated mode. For more information and the status of the Sun bug, see http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6529265
Using DHCP
When using DHCP, the local IP address can be released and renewed by the local host during a network outage. GridGain cannot recover from this error automatically, and you will need to restart the affected grid node manually. For this reason we do not recommend running grid nodes on DHCP (at least not in a production environment). For more information on DHCP, see http://en.wikipedia.org/wiki/Dynamic_Host_Configuration_Protocol
Using BEA JRockit
BEA JRockit VM ver. 1.5.x has shown problems with NIO functionality in our tests that were not reproducible in other VMs. We recommend either using the Sun VM or verifying that GridGain works in your environment when using the JRockit VM.
Windows Vista Deep Sleep
Windows Vista has a known problem with recovering network connectivity after waking up from deep sleep or hibernate mode. GridGain cannot recover from this error automatically, and you will need to restart the affected grid node manually. For more information see, for example, http://support.microsoft.com/kb/930311/en-us
Clocks Synchronization
Although strict clock synchronization is not required for GridGain to operate properly, task executions on a grid whose nodes have non-synchronized clocks can experience hard-to-predict timeout exceptions if timeouts are set aggressively.
Furthermore, with non-synchronized clocks, determining the optimal timeout value for a task execution becomes error-prone.
Server vs. Client VM
GridGain comes with a script, bin/gridgain.{bat|sh}, that doesn't explicitly specify the -server or -client option. This is because many users run a JRE (not a JDK) during evaluation, and the -server option is not supported by the JRE.
Note that you can always change the bin/gridgain.{bat|sh} script to add your custom options to the VM.
JGroups and IPv6
If you are running into trouble with JGroups on the Linux platform with java.net.BindException, or JGroups is not working at all on the Windows platform, try adding the following VM option:
-Djava.net.preferIPv4Stack=true
For more information on JGroups troubleshooting, see http://weblogs.java.net/blog/dcengija/archive/2006/04/jgroups_demos_o.html
JGroups and /etc/hosts
JGroups may fail to start with a SERVICE_DOWN indication if the /etc/hosts file doesn't contain the "real" IP address for the host. Note that by default, many Linux distributions map the host name to the 127.0.0.1 loopback address.
Linux Compiz Fusion and GridGain Installer
In our testing we found that our installer may not work properly with early versions of Compiz Fusion. Specifically, we tested on Ubuntu 7.04 with GridGain 1.6.0, and the installer did not work when the Compiz window manager was used. The problem was apparently fixed in Ubuntu 7.10, which we tested with the GridGain 1.6.1 release.
If you are experiencing problems with the GridGain installer on Linux, try switching back to the default window manager:
$ metacity --replace
and see if that resolves the problem.
GridGain cannot bind to any port on Linux
If you get an exception like the one below, you need to update your operating system.
Caused By:
----------
>>> Type: java.io.IOException
>>> Message: Function not implemented
>>> Stack trace:
>>> at sun.nio.ch.EPollArrayWrapper.epollCreate(EPollArrayWrapper.java:1)
>>> at sun.nio.ch.EPollArrayWrapper.<init>(EPollArrayWrapper.java:59)
>>> at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:52)
>>> at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:18)
>>> at org.gridgain.grid.util.nio.GridNioServer.createSelector(GridNioServer.java:110)
First, make sure that you are using kernel 2.6 or later. The following command prints the Linux version, including the kernel version:
uname -a
If the kernel version is lower than 2.6, we recommend upgrading it.
If your kernel version is 2.6 or later, check that you have the latest "glibc" library installed, or simply update it to the latest version.
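The same check can be made from inside the JVM that will run the node. This is a minimal sketch under the assumption that the `os.version` system property reports the kernel release on Linux (the same value that `uname -r` prints); the class name `KernelVersionCheck` is hypothetical and not part of GridGain.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not a GridGain class.
public class KernelVersionCheck {
    // Matches the leading "major.minor" of a release string such as "2.6.18-92.el5".
    private static final Pattern VERSION = Pattern.compile("^(\\d+)\\.(\\d+)");

    /** Returns true if the release string denotes kernel 2.6 or later. */
    static boolean isAtLeast26(String release) {
        Matcher m = VERSION.matcher(release);
        if (!m.find()) {
            throw new IllegalArgumentException("Unrecognized release: " + release);
        }
        int major = Integer.parseInt(m.group(1));
        int minor = Integer.parseInt(m.group(2));
        return major > 2 || (major == 2 && minor >= 6);
    }

    public static void main(String[] args) {
        // Assumption: on Linux this property holds the kernel release.
        String release = System.getProperty("os.version");
        System.out.println("Kernel release: " + release);
        System.out.println("2.6 or later: " + isAtLeast26(release));
    }
}
```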
Linux installer shows "java.awt.AWTError Assistive Technology not found" error
If you get the message below:
java.awt.AWTError: Assistive Technology not found: org.GNOME.Accessibility.JavaBridge
then either install the JRE/JDK from Sun and add it to the beginning of your PATH, or comment out the property "assistive_technologies=org.GNOME.Accessibility.JavaBridge" in the "accessibility.properties" file. On Fedora Core 9 this file is "/usr/lib/jvm/jre-1.6.0-openjdk.x86_64/lib/accessibility.properties" by default. This is a reported OpenJDK bug: http://icedtea.classpath.org/bugzilla/show_bug.cgi?id=108
I need to add my libraries/jars to the classpath
Starting from 2.1.0 GridGain supports adding user JAR files without changing classpaths or any shell scripts. Just put your JAR files into the $GRIDGAIN_HOME/libs/ext directory. This directory will be scanned when GridGain starts and all libraries will be added to the classpath automatically.
JConsole/VisualVM failed to connect with IOException
Check if the hostname correctly resolves to the host address.
Run the command:
hostname -i
If it reports 127.0.0.1, JConsole/VisualVM would not be able to connect through JMX to the JVM running on that Linux machine. To fix this issue, edit:
/etc/hosts
such that the hostname resolves to the host address.
You can also pass the following system property to the GridGain JVM:
-Djava.rmi.server.hostname=<hostname>
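The resolution check performed by `hostname -i` can also be run from Java, which shows what the JVM itself sees. This is a minimal sketch using the standard `java.net.InetAddress` API; the class name `HostnameResolutionCheck` is hypothetical and not part of GridGain.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper for illustration only; not a GridGain class.
public class HostnameResolutionCheck {
    /** Returns true if the given numeric address is a loopback address (127.x.x.x or ::1). */
    static boolean isLoopback(String literal) {
        try {
            return InetAddress.getByName(literal).isLoopbackAddress();
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("Not a valid address: " + literal, e);
        }
    }

    public static void main(String[] args) throws UnknownHostException {
        // Resolves the local host name the same way the JVM does for RMI/JMX.
        InetAddress local = InetAddress.getLocalHost();
        System.out.println(local.getHostName() + " resolves to " + local.getHostAddress());
        if (local.isLoopbackAddress()) {
            System.out.println("Host name resolves to loopback; remote JMX clients may fail to connect.");
        }
    }
}
```

If the program reports a loopback address, apply one of the two fixes described above.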
I am getting "Address already in use" error on Windows
The "Address already in use: connect" error is caused by client socket starvation on the machine(s). By default Windows does not allow you to set up client connections on ports above 5000. After a socket has been closed, the connection stays in a TIME_WAIT state for another 2 minutes, after which the socket is freed and the address can be reused. If more than 4000 connections (1024-5000) have been made before those ports are freed (after 2 min. in TIME_WAIT), then attempts to open a client socket on a port above 5000 will be rejected by the operating system, which will cause Java to throw "Address already in use: connect". This can be fixed by modifying the Windows registry entry that controls this parameter:
- Start Registry Editor: Start Menu > Run > Type in "regedit"
- Locate the following key: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
- Right click on the Parameters folder and select New > DWORD Value
- Name this new key "MaxUserPort"
- Double click on the "MaxUserPort" key and change the value data to 65534 and select "Decimal" as the base.
- Restart the machine.
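The effect of raising MaxUserPort can be estimated with simple arithmetic: if every closed connection holds its port for the full TIME_WAIT period, the sustainable rate of new client connections is roughly the number of usable ports divided by the TIME_WAIT duration. The sketch below works through that estimate under this simplifying assumption; the class name `TimeWaitBudget` is hypothetical, and real limits also depend on factors the model ignores.

```java
// Hypothetical helper for illustration only; not a GridGain class.
public class TimeWaitBudget {
    /**
     * Rough upper bound on new client connections per second, assuming
     * each connection occupies one local port for the whole TIME_WAIT period.
     */
    static double maxConnectionsPerSecond(int firstPort, int lastPort, int timeWaitSeconds) {
        int ports = lastPort - firstPort + 1; // both ends inclusive
        return (double) ports / timeWaitSeconds;
    }

    public static void main(String[] args) {
        // Default range described above: ports 1024-5000, 2 minutes in TIME_WAIT.
        System.out.printf("Default: %.2f connections/sec%n",
            maxConnectionsPerSecond(1024, 5000, 120));
        // After setting MaxUserPort to 65534.
        System.out.printf("MaxUserPort=65534: %.2f connections/sec%n",
            maxConnectionsPerSecond(1024, 65534, 120));
    }
}
```

Under these assumptions the default range sustains about 33 new connections per second, while the enlarged range sustains about 537.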
