All posts by Anders Håål

Bisdw – a simple ETL tool

Bisdw is a simple ETL tool that we developed for a monitoring use case that required us to retrieve data from different sources and put it into a local database. As a tool it can be used independently of Bischeck. Most of the ETL logic is provided by the Scriptella project. What we have added is functionality for scheduling, FTP integration, init scripts, etc.

You can download Bisdw from gforge.ingby.com. Documentation is available here.

Bischeck 1.0.0 RC1 is available

After a little longer than expected, we finally have RC1 of version 1.0.0 available. This is not a production-ready version and should only be used for testing. We hope to get feedback and bug reports from all of you who take the time to test.

RC1 does not support upgrading from 0.4.3, but it should run with your current configuration files. If you would like to use the existing cached data from your bischeck 0.4.3 installation, you need to migrate it to the Redis cache, as explained in the full post.

Bischeck 0.4.2 performance testing

Introduction

Performance testing is key to ensuring that your software can handle the load and to verifying its robustness. With server-based software running as a daemon, it is especially important to verify that the software remains stable over a long period of continuous uptime, without decreased throughput and without leaking resources such as memory.

Since bischeck is designed to do advanced service checks with dynamic and adaptive thresholds, we know that CPU and memory will be important resources when running mathematical algorithms over historically collected data.

The test setup starts with a baseline that is scaled in two dimensions: increasing the load by increasing the number of service jobs, and increasing the load by decreasing the interval between service job schedules.

Read the full benchmark report 

 

bischeck 0.4.2 RC2 is released – test now!

We are pleased to release the second release candidate for bischeck 0.4.2. The major change in this release candidate is how null values are managed for mathematical functions that take a list of arguments, like sum and avg. Read more about this feature in the documentation.

This release includes the following features and fixes:

New features

• Related to bug [TR-227], the naming of hosts, services and serviceitems has been improved.

• For execute statements and threshold hour specifications where cache data is retrieved as a list, as in avg(x-y-z[4:10]) and max(x-y-z[-5M:-15M]), bischeck can now be configured to return a value as long as at least one index in the range is not null. To preserve backward compatibility, the new functionality is only used if the property notFullListParse is set to true in properties.xml. The default value is false.

• There has been some discussion about which Nagios state should be sent if the execute statement of a service item returns null. In previous releases this was hard-coded to OK, but it is now possible to define it by setting the property stateOnNull. The property can be set to an integer (0, 1, 2 or 3) or to a string (OK, WARNING, CRITICAL or UNKNOWN). The default is UNKNOWN.

• When a service class got an exception while creating a connection, previous versions did not save any data to the cache. If the property saveNullOnConnectionError is set to true, a null value is now inserted into the cache when a connection exception is thrown. For backward compatibility the default value of the property is false.
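
To make the effect of notFullListParse concrete, here is a minimal Python sketch of the two behaviours. It is not bischeck code: the function name avg and the flag not_full_list_parse are hypothetical, and we assume that with the property enabled the null entries are simply skipped when the average is calculated.

```python
from typing import Optional, Sequence


def avg(values: Sequence[Optional[float]],
        not_full_list_parse: bool = False) -> Optional[float]:
    """Average a list of cached values, where None marks a missing entry."""
    present = [v for v in values if v is not None]
    if not present:
        # Nothing cached in the whole range: there is nothing to average.
        return None
    if len(present) < len(values) and not not_full_list_parse:
        # Default (backward compatible) behaviour: one null makes the result null.
        return None
    # Assumption: null entries are skipped rather than counted as zero.
    return sum(present) / len(present)


cached = [10.0, None, 14.0]
print(avg(cached))                            # None
print(avg(cached, not_full_list_parse=True))  # 12.0
```

With the default setting a single gap in the cache hides the whole result, which is why the property matters for services that occasionally miss a measurement.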

Bugs fixed and important issues

• [TR-227] “Cache parser do not work for host, service or serviceitems if the name include 0 (zero)” has been resolved.

• [TR-228] “Threshold factory return wrong threshold definition if service and serviceitem name is the same for different hosts” has been resolved.

• [TR-229] “When using service ShellService the number of open files limit will be reached” has been resolved.

• [TR-230] “NRDP submissions all come in as OK” has been resolved.

• Fixed migration script from 0.4.0 to copy etc directory content correctly. Changes in the file urlservices.xml will be overwritten. Existing 0.4.0 configuration will still be available in the previous version backup directory, bischeck_0.4.0.

Bad directory location

Bischeck uses the directory /var/tmp to store log files, the pid file and persistent cache data. For logs this is not a bad location, but for the pid file and cache data it is not a very smart one. The main reason is that if your bischeck process runs for a very long time, which it should, there is a risk that your pid file and cache data will be removed. Distributions like CentOS have a cron script that runs the command tmpwatch, which removes files in the various “tmp” directories if they have not been updated for a long time. This can be fixed either by changing the cron script, /etc/cron.daily/tmpwatch on CentOS, or by changing the directory location through the properties in the bischeck configuration file properties.xml.

The properties to change are:

  • pidfile – default is /var/tmp/bischeck.pid
  • lastStatusCacheDumpDir – default is /var/tmp/
In a future release we will change the default directory location.
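
To check whether an installation is at risk, the files that a tmpwatch-style cleanup would consider old can be listed with a few lines of Python. This is only a sketch: the function name stale_files is our own, and we assume that age is judged by modification time, while the criterion and age limit your distribution actually uses are set in its cron script.

```python
import time
from pathlib import Path
from typing import List, Optional


def stale_files(directory: str, max_age_days: float,
                now: Optional[float] = None) -> List[Path]:
    """Return regular files in directory not modified for max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    return sorted(
        path for path in Path(directory).iterdir()
        if path.is_file() and path.stat().st_mtime < cutoff
    )
```

Calling stale_files("/var/tmp", 30), with 30 days as an example limit, would show whether bischeck.pid or the cache dump is among the candidates for removal.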

bischeck 0.4.2 RC1 is released – test now!

We are pleased to release bischeck 0.4.2 release candidate 1. This release includes the following features and fixes:

New features

  • Related to bug [TR-227], the naming of hosts, services and serviceitems has been improved. For more information, please see section 8.1.
  • For execute statements and threshold hour specifications where cache data is retrieved as a list, as in avg(x-y-z[4:10]) and avg(4,6,8), bischeck can now be configured to not return a null value if at least the first index in the list definition has a cached value. For the example, this means that if at least index 4 has a value for x-y-z, an average will be calculated. To preserve backward compatibility, the new functionality is only used if the property notFullListParse is set to true in properties.xml. The default value is false.
  • There has been some discussion about which Nagios state should be sent if the execute statement of a service item returns null. In previous releases this was hard-coded to OK, but it is now possible to define it by setting the property stateOnNull. The property can be set to an integer (0, 1, 2 or 3) or to a string (OK, WARNING, CRITICAL or UNKNOWN). The default is UNKNOWN.
  • When a service class got an exception while making a connection, previous versions did not save any data to the cache. If the property saveNullOnConnectionError is set to true, a null value is now inserted into the cache when a connection exception is thrown. For backward compatibility the default value of the property is false.
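
Since stateOnNull accepts both integers and names, the two forms have to be mapped onto one set of states. The Python sketch below shows one way to do that, using the standard Nagios state codes (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). The function name parse_state_on_null is hypothetical, and accepting lower-case names is our assumption, not documented bischeck behaviour.

```python
NAGIOS_STATES = ("OK", "WARNING", "CRITICAL", "UNKNOWN")


def parse_state_on_null(raw: str = "UNKNOWN") -> str:
    """Normalise a stateOnNull setting to a Nagios state name."""
    text = raw.strip().upper()
    if text in NAGIOS_STATES:
        return text
    if text in ("0", "1", "2", "3"):
        # The integer form is the index of the state name.
        return NAGIOS_STATES[int(text)]
    raise ValueError(f"invalid stateOnNull value: {raw!r}")


print(parse_state_on_null("2"))        # CRITICAL
print(parse_state_on_null("warning"))  # WARNING
print(parse_state_on_null())           # UNKNOWN
```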

Bugs fixed and important issues

  • [TR-227] “Cache parser do not work for host, service or serviceitems if the name include 0 (zero)” has been resolved.
  • [TR-228] “Threshold factory return wrong threshold definition if service and serviceitem name is the same for different hosts” has been resolved.
  • [TR-229] “When using service ShellService the number of open files limit will be reached” has been resolved.
  • [TR-230] “NRDP submissions all come in as OK” has been resolved.
  • Fixed migration script from 0.4.0 to copy etc directory content correctly. Changes in the file urlservices.xml will be overwritten. Existing 0.4.0 configuration will still be available in the previous version backup directory, bischeck_0.4.0. 
This RC1 version does not support upgrading from a previous version, but just copy your configuration files and give it a spin.
Read more about 0.4.2 
 

Limitation in host, service and serviceitem naming

Currently we have a limitation in the naming of a host, service and serviceitem. The issue is seen when using dynamic thresholds that do calculations on cached entries. When describing a cache entry in an hour tag in the 24thresholds.xml file, you should use the format host-service-serviceitem, for example erphost-erpOrders-weborders. The problem with the current format is that the names may only consist of letters, upper or lower case, and the digits 1-9. Yes, the missing 0 is a major bug. Apart from the 0 bug, the format has the following limitations:

  • Dash (-) is used as the separator between the host, service and serviceitem name, which means that using dash in the name is a problem.
  • Other characters like dot (.), plus (+), underscore (_) or any character other than those described above are not supported. This is a major weakness since many will use, for example, dot and underscore in their existing Nagios host and service names.
We will solve these limitations as soon as possible with a quick release of 0.4.2, next week at the latest. This issue is documented in bug report TR-227.
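
The limitations are easy to see with a small regular expression. The Python sketch below is reconstructed from the description above and is not the actual bischeck parser: the pattern and the function name parse_cache_entry are hypothetical.

```python
import re

# Hypothetical pattern: each name may hold letters and the digits 1-9 only.
NAME = r"[a-zA-Z1-9]+"
CACHE_ENTRY = re.compile(rf"^({NAME})-({NAME})-({NAME})$")


def parse_cache_entry(entry: str):
    """Split host-service-serviceitem into its parts, or return None."""
    match = CACHE_ENTRY.match(entry)
    return match.groups() if match else None


print(parse_cache_entry("erphost-erpOrders-weborders"))   # ('erphost', 'erpOrders', 'weborders')
print(parse_cache_entry("host10-erpOrders-weborders"))    # None (contains a 0)
print(parse_cache_entry("erp-host-erpOrders-weborders"))  # None (dash inside a name)
print(parse_cache_entry("erp.host-erp_orders-web"))       # None (dot and underscore)
```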
 

bischeck 0.4.1 released – upgrade now!

We have made a quick new release of bischeck due to a bug that caused truncation of all measured and threshold values with more than 2 decimals. This caused some obvious problems, especially when measuring things like network times. So if you are monitoring this kind of thing, please upgrade as soon as possible.
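
Why this matters for small values can be shown with a short Python sketch. It only illustrates truncation to two decimals in general and does not reproduce the bischeck code; the function name truncate is our own.

```python
from decimal import Decimal, ROUND_DOWN


def truncate(value: str, places: int = 2) -> Decimal:
    """Cut a decimal number off after the given number of decimals."""
    quantum = Decimal(10) ** -places  # 0.01 for two decimals
    return Decimal(value).quantize(quantum, rounding=ROUND_DOWN)


# A response time of 9 ms and a threshold of 5 ms, both given in seconds,
# become indistinguishable once they are truncated.
print(truncate("0.009"))    # 0.00
print(truncate("0.005"))    # 0.00
print(truncate("12.3456"))  # 12.34
```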

Since we had some new functionality in the trunk, we chose to include it too, but it should be regarded as beta functionality. The new functionality is:

  • Sending passive checks over NRDP as an alternative to NSCA
  • A new service and serviceitem that support execution of local check commands. With this functionality any Nagios check command that outputs performance data can now be executed through bischeck. The state is of course ignored, since bischeck does its own threshold calculation on the performance data. Thanks to Eric Loyd at Bitnetix (www.bitnetix.com), who gave me the idea during Nagios World 2012.
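
For readers unfamiliar with performance data: a Nagios check command prints its status text, then a pipe character, then items of the form label=value[unit];warn;crit;min;max. The Python sketch below shows how the values can be pulled out of such a line. It illustrates the format only and is not the bischeck implementation; the function name parse_perfdata is hypothetical, and only the first value of each item is kept.

```python
import re
from typing import Dict

# label=value, where the label may be quoted if it contains spaces.
PERF_ITEM = re.compile(r"(?P<label>'[^']+'|[^\s=']+)=(?P<value>-?\d+(?:\.\d+)?)")


def parse_perfdata(plugin_output: str) -> Dict[str, float]:
    """Return the performance data values found after the pipe character."""
    _, separator, perfdata = plugin_output.partition("|")
    if not separator:
        return {}  # the command printed no performance data
    return {
        match.group("label").strip("'"): float(match.group("value"))
        for match in PERF_ITEM.finditer(perfdata)
    }


line = "PING OK - rta 0.8 ms|rta=0.800ms;100.000;500.000;0 pl=0%;20;60;0"
print(parse_perfdata(line))  # {'rta': 0.8, 'pl': 0.0}
```

The warning and critical limits after the semicolons are skipped on purpose, since the thresholds are calculated by bischeck itself.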

For more information about this new functionality, please check out the 0.4.1 README. Feedback on 0.4.1 is more than welcome.

To download bischeck 0.4.1 please visit our download area.