Netways OSMC was an awesome conference with lots of excellent talks. If you are interested in our talk, check out the presentation on our favorite topic: bischeck, and what dynamic and adaptive thresholds can enable in your monitoring environment.
Bisdw is a simple ETL tool that we developed for a monitoring use case that required us to retrieve data from different sources and put it into a local database. As a tool it can be used independently of Bischeck. Most of the ETL logic is provided by the Scriptella project. What we have added is functionality for scheduling, FTP integration, init scripts, etc.
After a little longer than expected, we finally have RC1 of version 1.0.0 available. This is not a production-ready version and should only be used for testing. We hope to get feedback and bug reports from all of you who take the time to test.
RC1 does not support upgrade from 0.4.3, but should run with your current configuration files. If you would like to use the existing cached data from your bischeck 0.4.3 installation, you need to migrate it to the Redis cache as explained below.
Bischeck 1.0.0 will include a number of new features and enhancements. The plan is to get a release candidate out at the end of the summer. Thanks to everyone who has used Bischeck, given feedback and provided feature requests.
A new white paper describes some of the unique capabilities of Bischeck. You can check it out in the Documentation section.
Performance testing is key to ensuring that your software can handle the load and to verifying its robustness. With server-based software running as a daemon, it is especially important to verify that the software is stable over a long period of continuous uptime, without decreased throughput and without leaking resources such as memory.
Since bischeck is designed to do advanced service checks with dynamic and adaptive thresholds, we know that CPU and memory will be important resources when applying mathematical algorithms to historical collected data.
The test setup will start with a baseline that is scaled in two dimensions: increasing the load by increasing the number of service jobs, and increasing the load by decreasing the interval between service job schedules.
We are pleased to release the second release candidate for bischeck 0.4.2. The major change in this release candidate is how null values are managed for mathematical functions that take a list of arguments, like sum and avg. Read more about this feature in the documentation.
This release includes the following features and fixes:
• Related to bug [TR-227], the naming of host, service and serviceitem names has been improved.
• Execution statements and threshold hour specifications where cache data is retrieved as a list, as in the functions avg(x-y-z[4:10]) and max(x-y-z[-5M:-15M]), can now be configured to return a value as long as at least one index in the range is not null. To support backward compatibility the new functionality will only be used if the property notFullListParse is set to true in properties.xml. The default value is false.
• There has been some discussion about what Nagios state should be sent if the returned execute statement of a service item is null. In previous releases this was hard coded to OK, but now it is possible to define it by setting the property stateOnNull. The property can be set to an integer 0, 1, 2 or 3 or to a string OK, WARNING, CRITICAL or UNKNOWN. The default is UNKNOWN.
• When a service class got an exception while creating a connection, previous versions did not save any data to the cache. If the property saveNullOnConnectionError is set to true, a null value will now be inserted into the cache when a connection exception is thrown. For backward compatibility the default value of the property is false.
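The notFullListParse behaviour described above can be pictured with a small sketch. This is not bischeck's implementation: the class and method names are hypothetical, and it assumes that the old behaviour returned null as soon as any element in the list was null.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

/** Hypothetical illustration only; not part of bischeck. */
final class NullTolerantAverage {

    /**
     * Averages cached values that may contain nulls.
     * notFullListParse == false: any null element makes the result null
     * (assumed to mirror the old behaviour).
     * notFullListParse == true: nulls are skipped, and the result is null
     * only when every element is null.
     */
    static Double avg(List<Double> values, boolean notFullListParse) {
        if (!notFullListParse && values.stream().anyMatch(Objects::isNull)) {
            return null;
        }
        double sum = 0.0;
        int count = 0;
        for (Double v : values) {
            if (v != null) {
                sum += v;
                count++;
            }
        }
        if (count == 0) {
            return null;
        }
        return sum / count;
    }

    public static void main(String[] args) {
        List<Double> cached = Arrays.asList(10.0, null, 20.0);
        System.out.println(avg(cached, false)); // prints null
        System.out.println(avg(cached, true));  // prints 15.0
    }
}
```

Keeping the strict mode as the default matches the backward compatible default of false for the property.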
Bugs fixed and important issues
• [TR-227] “Cache parser do not work for host, service or serviceitems if the name include 0 (zero)” has been resolved.
• [TR-228] “Threshold factory return wrong threshold definition if service and serviceitem name is the same for different hosts” has been resolved.
• [TR-229] “When using service ShellService the number of open files limit will be reached” has been resolved.
• [TR-230] “NRDP submissions all come in as OK” has been resolved.
• Fixed migration script from 0.4.0 to copy etc directory content correctly. Changes in the file urlservices.xml will be overwritten. Existing 0.4.0 configuration will still be available in the previous version backup directory, bischeck_0.4.0.
Bischeck uses the directory /var/tmp to store log files, the pid file and persistent cache data. For logs this is not a bad location, but for the pid file and cache data it is not a very smart one. The main reason is that if your bischeck process runs for a very long time, which it should, there is a risk that your pid file and cache data will be removed. This is because distributions like CentOS have a cron script that runs the command tmpwatch, which removes files in different “tmp” directories if they have not been updated for a long time. This can be fixed by changing the cron script, /etc/cron.daily/tmpwatch on CentOS, or by changing the directory location through the properties in the bischeck configuration file properties.xml.
The properties to change are:
- pidfile – default is /var/tmp/bischeck.pid
- lastStatusCacheDumpDir – default is /var/tmp/
We are pleased to release bischeck 0.4.2 release candidate 1. This release includes the following features and fixes:
- Related to bug [TR-227], the naming of host, service and serviceitem names has been improved. For more information, see section 8.1.
- Execution statements and threshold hour specifications where cache data is retrieved as a list, as in the functions avg(x-y-z[4:10]) and avg(4,6,8), can now be configured to not return a null value if at least the first index in the list definition has a cached value. For the example this means that if at least index 4 has a value for x-y-z, an average will be calculated. To support backward compatibility the new functionality will only be used if the property notFullListParse is set to true in properties.xml. The default value is false.
- There has been some discussion about what Nagios state should be sent if the returned execute statement of a service item is null. In previous releases this was hard coded to OK, but now it is possible to define it by setting the property stateOnNull. The property can be set to an integer 0, 1, 2 or 3 or to a string OK, WARNING, CRITICAL or UNKNOWN. The default is UNKNOWN.
- When a service class got an exception while creating a connection, previous versions did not save any data to the cache. If the property saveNullOnConnectionError is set to true, a null value will now be inserted into the cache when a connection exception is thrown. For backward compatibility the default value of the property is false.
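Since stateOnNull accepts either an integer 0 to 3 or a state name, reading the property could look roughly like the sketch below. The enum and method are hypothetical stand-ins rather than bischeck's own types, and both the case-insensitive matching and the fallback to UNKNOWN for unrecognised values are assumptions made for the example.

```java
import java.util.Locale;

/** Hypothetical illustration only; not part of bischeck. */
final class StateOnNullSketch {

    /** Nagios states in the order of their numeric codes 0..3. */
    enum NagiosState { OK, WARNING, CRITICAL, UNKNOWN }

    /** Maps a stateOnNull property value to a state; UNKNOWN is the default. */
    static NagiosState parse(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return NagiosState.UNKNOWN;
        }
        String value = raw.trim();
        try {
            int code = Integer.parseInt(value);
            if (code >= 0 && code <= 3) {
                return NagiosState.values()[code];
            }
            return NagiosState.UNKNOWN; // assumed fallback for out-of-range codes
        } catch (NumberFormatException notANumber) {
            try {
                return NagiosState.valueOf(value.toUpperCase(Locale.ROOT));
            } catch (IllegalArgumentException unknownName) {
                return NagiosState.UNKNOWN; // assumed fallback for unknown names
            }
        }
    }
}
```

With this sketch, parse("2") and parse("CRITICAL") both yield CRITICAL, while a missing property yields UNKNOWN.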
Bugs fixed and important issues
- [TR-227] “Cache parser do not work for host, service or serviceitems if the name include 0 (zero)” has been resolved.
- [TR-228] “Threshold factory return wrong threshold definition if service and serviceitem name is the same for different hosts” has been resolved.
- [TR-229] “When using service ShellService the number of open files limit will be reached” has been resolved.
- [TR-230] “NRDP submissions all come in as OK” has been resolved.
- Fixed migration script from 0.4.0 to copy etc directory content correctly. Changes in the file urlservices.xml will be overwritten. Existing 0.4.0 configuration will still be available in the previous version backup directory, bischeck_0.4.0.
Check out our new quick start for bischeck.
Currently we have a limitation in the naming of hosts, services and serviceitems. The issue is seen when using dynamic thresholds that do calculations on cached entries. When describing a cache entry in an hour tag in the 24thresholds.xml file you should use the format host-service-serviceitem, for example erphost-erpOrders-weborders. The problem with the current format is that the names must consist only of letters, upper or lower case, and the digits 1-9. Yes, the missing 0 is a major bug. Except for the 0 bug, the format has the following limitations:
- Dash (-) is used as the separator between the host, service and serviceitem name, which means that using dash in the name is a problem.
- Other characters like dot (.), plus (+), underscore (_) or any character other than those described above are not supported. This is a major weakness since many will use, for example, dot and underscore in their existing Nagios host and service names.
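The limitations listed above are easy to reproduce with a regular expression. The pattern below is a reconstruction from the description in this post, not the expression bischeck actually uses, and the class name is hypothetical.

```java
import java.util.regex.Pattern;

/** Hypothetical illustration only; not part of bischeck. */
final class CacheNameLimitationSketch {

    // Letters in either case plus the digits 1-9; note that 0 is missing.
    private static final String NAME = "[a-zA-Z1-9]+";

    // host-service-serviceitem, with dash as the only separator.
    static final Pattern ENTRY = Pattern.compile(NAME + "-" + NAME + "-" + NAME);

    /** True when the whole cache entry name fits the restricted format. */
    static boolean isParsable(String entry) {
        return ENTRY.matcher(entry).matches();
    }
}
```

Under these assumptions erphost-erpOrders-weborders is accepted, while host10-web-state (contains 0), my.host-web-state (contains a dot) and my-host-web-state (dash inside a name) are all rejected.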
We have made a quick new release of bischeck due to a bug that caused truncation of all measured and threshold values with more than 2 decimals. This caused some obvious problems, especially when measuring things like network times. So if you are monitoring this kind of data, please upgrade asap.
Since we had some new stuff in the trunk we chose to include it too, but it should be regarded as beta functionality. The new functionality is:
- Sending passive checks over NRDP as an alternative to NSCA
- New service and serviceitem classes that support execution of local check commands. With this functionality any Nagios check command that outputs performance data can now be executed through bischeck. The state is of course ignored since bischeck will do its own threshold calculation on the performance data. Thanks to Eric Loyd at Bitnetix (www.bitnetix.com) who gave me the idea during Nagios World 2012.
For more information about this new functionality please check out the 0.4.1 README. Feedback on 0.4.1 is more than welcome.
To download bischeck 0.4.1 please visit our download area.
Until now NSCA has been the only way to integrate passive service checks with Nagios. From a functional perspective it has been working great using the jsendnsca package, http://code.google.com/p/jsendnsca. With the next version we will also support NRDP, Nagios Remote Data Processor. NRDP has some nice benefits, like a pure web interface and batch sending of passive checks.
If anyone would like this immediately please send us an email and we could make a 0.4.1 version.
Today we announce bischeck 0.4.0. The release has been tested in a production environment for 2 months. Upgrading from 0.3.0 and 0.4.0_RC2 is supported. No major changes have been made since RC2. Full documentation and download are available.
- [FR-197] Support for multiple integrations with different surveillance and monitoring systems. With version 0.4.0 bischeck is no longer limited to sending data to Nagios. It can now send data to multiple Nagios servers and to other servers like OpenTSDB. This is done by moving server formatting and protocol handling into server integration classes that implement the interface com.ingby.socbox.bischeck.servers.Server. The server integration is described in the XML configuration file servers.xml. This also means that some Nagios NSCA specific properties previously configured in properties.xml have been moved to the NSCA section of the servers.xml file. The OpenTSDB server class should be regarded as beta.
- [FR-202] The implementation of running bischeck once, in non-daemon mode, has been changed so that the same code is used as when running in daemon mode. The only difference is the initialization of triggers, so that all service items are run directly and just once.
- [FR-204] The bischeck cache will be saved when the bischeck daemon is shut down and reloaded on bischeck startup. Keeping the cache persistent between restarts is important since 0.4.0 supports time based cache retrieval. The current limitation is that if the bischeck daemon is killed by a signal that cannot be caught, or the daemon crashes, the data will not be saved. This will be improved in future versions.
- [FR-218] The bischeck daemon can now reload the configuration without a process restart. This is supported through the JMX operation “reload”. The feature limits the need for operating system access and authorization.
- [FR-219] Bischeck can now retrieve state and performance data from a Nagios server supporting livestatus. With the service class LivestatusService a connection is set up over livestatus, and with the serviceitem class LivestatusServiceItem state and/or performance data can be retrieved from a Nagios service. This can be useful when creating virtual services in bischeck or in complex thresholds.
- [FR-220] Bischeck now supports one additional scheduling method, where a service can be scheduled to run after a different service has executed. This can be useful when a service depends on data from another service for its thresholds or execution statement.
- [FR-221] Cache retrieval is now supported using a time offset, finding the cache element nearest to the time offset.
- Cache data can be retrieved as a list of elements based both on index and time.
- Support for additional mathematical functions like average, min and max calculations on lists of elements.
- Bischeck can now support the usage of cached data in the execution statement of a serviceitem. This is typically useful when a serviceitem execute statement depends on other service data. For example in a SQL query string:
select value from table1 where id = host1-web-state and createdate = '%%yyyy-MM-dd%%';
- Added support for Linux distributions other than Redhat based ones. bischeck should now install on Debian 6 and Ubuntu 10/11.
- Configuration listing. The configuration listing has been moved from the ConfigurationManager class to the DocManager class. Currently HTML and text listings are supported. The generated configuration data will by default be placed in the bischeckdoc directory.
- A service can be configured not to send its data to the configured monitoring servers, like Nagios. This can be useful if the service is only used to create virtual services or only used in thresholds.
- The bischeck script now supports JMX authentication. The authentication files are located in the etc directory and named jmxremote.password and jmxremote.access. By default, authentication is disabled by the system property
“-Dcom.sun.management.jmxremote.authenticate=false”. To enable authentication set the property to true. For more info about JMX see
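The time based cache retrieval from [FR-221] in the list above, finding the cache element nearest to a time offset, can be sketched with a sorted map. This is only an illustration of the idea: the class is hypothetical, it is not how the bischeck cache is implemented, and resolving ties in favour of the older element is an assumption.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration only; not part of bischeck. */
final class NearestCacheEntrySketch {

    // Cached values keyed by the time they were collected (epoch millis).
    private final TreeMap<Long, Double> byTimestamp = new TreeMap<>();

    void add(long timestampMillis, double value) {
        byTimestamp.put(timestampMillis, value);
    }

    /**
     * Returns the value collected closest to (now - offset),
     * or null when the cache is empty.
     */
    Double nearest(long nowMillis, long offsetMillis) {
        long target = nowMillis - offsetMillis;
        Map.Entry<Long, Double> before = byTimestamp.floorEntry(target);
        Map.Entry<Long, Double> after = byTimestamp.ceilingEntry(target);
        if (before == null && after == null) {
            return null;
        }
        if (before == null) {
            return after.getValue();
        }
        if (after == null) {
            return before.getValue();
        }
        long distanceBefore = target - before.getKey();
        long distanceAfter = after.getKey() - target;
        // Assumed tie-break: prefer the older element.
        return distanceBefore <= distanceAfter ? before.getValue() : after.getValue();
    }
}
```

A lookup such as nearest(now, 5 * 60 * 1000L) then corresponds to asking for the element collected about five minutes ago.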
Bugs fixed and important issues
- In previous versions the Twenty4Thresholds class did not do a correct linear equation calculation if an expression based threshold was defined. Let's illustrate the errors with this example from the 24thresholds.xml configuration file, which has a mix of static and expression based thresholds.
....
<!-- 12:00 --> <hour>7000</hour>
<!-- 13:00 --> <hour>testhost-testservice-testitem / 3</hour>
<!-- 14:00 --> <hour>testhost-testservice-testitem / 2</hour>
<!-- 15:00 --> <hour>testhost-testservice-testitem + 1000 </hour>
<!-- 16:00 --> <hour>12000</hour>
....
In the previous version the threshold value between 12:00 and 13:00 would be null since it was a mix of static and expression based thresholds. And between 15:00 and 16:00 the threshold would have been calculated as “testhost-testservice-testitem + 1000” independent of the time between 15:00 and 16:00.
Now the linear equation will correctly be calculated with any mix of static and expression based definitions. In the above example the calculated threshold for 12:20 will now be:
20*((testhost-testservice-testitem/3) - 7000)/60 + 7000
This fix will improve the correctness and also the capability of threshold adaptivity.
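The 12:20 calculation above is a plain linear interpolation between the threshold at the start of the hour and the threshold at the start of the next hour. A minimal sketch follows, with a hypothetical class and method name and an assumed cached value of 30000 for testhost-testservice-testitem:

```java
/** Hypothetical illustration only; not part of bischeck. */
final class ThresholdInterpolationSketch {

    /**
     * Linear interpolation between two hourly threshold values.
     *
     * @param startOfHour     threshold at hh:00
     * @param startOfNextHour threshold at (hh+1):00
     * @param minute          minutes past hh:00, from 0 to 59
     */
    static double interpolate(double startOfHour, double startOfNextHour, int minute) {
        if (minute < 0 || minute >= 60) {
            throw new IllegalArgumentException("minute must be between 0 and 59");
        }
        return minute * (startOfNextHour - startOfHour) / 60.0 + startOfHour;
    }

    public static void main(String[] args) {
        double cached = 30000.0;     // assumed value of testhost-testservice-testitem
        double at1200 = 7000.0;      // static threshold at 12:00
        double at1300 = cached / 3;  // expression based threshold at 13:00
        System.out.println(interpolate(at1200, at1300, 20)); // prints 8000.0
    }
}
```

With the assumed cached value, the 13:00 threshold evaluates to 10000, so one third of the way through the hour the threshold is 7000 + 1000 = 8000.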
- The Service interface has a number of new methods that should have been there from the beginning. If you have developed any service class you need to add these, but if you just inherited ServiceAbstract it is fixed for you. The new methods are:
public NAGIOSSTAT getLevel();
public void setLevel(NAGIOSSTAT level);
public boolean isConnectionEstablished();
public void setConnectionEstablished(boolean connected);
public Boolean isSendServiceData();
public void setSendServiceData(Boolean sendServiceData);
- Property cacheclear is renamed to thresholdCacheClear.
- All the NSCA related properties have been moved from properties.xml to servers.xml when used for the NSCAServer class. The new property names have also gone through some minor changes. When upgrading, servers.xml must be manually updated with the current settings of the NSCA related properties in properties.xml. It is recommended that these are later removed from properties.xml.
- All JAXB generated configuration classes now support serialization.
- Quartz jar is upgraded from 2.0.1 to 2.1.5.
- [TR-216] “Shutdown is automatic triggered”
- [TR-217] “Configuration Manager initialization failed with java.lang.NullPointerException”
- [TR-207] “sudo in bischeckd script cause problem at boot”
Once again it's time for the Nagios World conference in St Paul. This time it is 3 days with lots of good stuff, http://www.nagios.com/events/nagiosworldconference/northamerica/2012/. If you would like to know more about dynamic and adaptive thresholds, come and join my presentation on the third day, http://www.nagios.com/events/nagiosworldconference/northamerica/2012/speakers/#ahaal.
I look forward to meeting you all in St Paul.