We recently migrated our K1000 (6.2) from a hardware K1100 appliance to a virtual appliance running within VMware ESXi/vSphere on a Dell PowerEdge 2950.

All "appears" to be functioning correctly and we've been running with this setup for a few months without any real issues. However, after we started looking closer at some things we've noticed that all is not quite 100%.

At least once a week the K1000 clock suddenly loses sync. We’ll arrive in the office at 8am only to find the K1000 thinks it’s 4am, which really messes up a lot of things. From the KB I’ve seen, this is due to ESXi not being man enough for the job, but at the moment I don’t have another option.

Also, under Communication Settings we have our Inventory interval set to 2 hours (7.66 connections per minute). If I look at the Inventory list, though, there are machines that have active AMP connections but show a Last Inventory time of “3 hours, 25 minutes”. I don’t recall this being the case when we were on hardware. The machines would inventory every 2 hours as instructed, and no machine with an active AMP connection would go over 2 hours without performing an Inventory. If I check Agent Tasks under troubleshooting, I can see that as far as the K1000 is concerned these machines were due to check in after 2 hours as planned, but haven’t. They will eventually check in.
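For anyone wanting to sanity-check that connections-per-minute figure: it looks like it is simply the number of agents divided by the interval in minutes. That is an assumption about how the number is derived, not something taken from the KACE documentation, and the function names below are made up for illustration. A minimal sketch:

```python
def connections_per_minute(agent_count: int, interval_hours: float) -> float:
    """Average check-ins per minute if agents are spread evenly over the interval.

    Assumption: the appliance spreads inventory check-ins uniformly.
    """
    return agent_count / (interval_hours * 60)


def agents_for_rate(rate_per_minute: float, interval_hours: float) -> float:
    """Inverse calculation: how many agents a given rate implies."""
    return rate_per_minute * interval_hours * 60


# 7.66 connections per minute over a 2 hour interval implies roughly this many agents.
print(round(agents_for_rate(7.66, 2)))

# And the same agent count at a 2 hour interval gives the rate back.
print(round(connections_per_minute(919, 2), 2))
```

If that assumption holds, doubling the interval halves the average number of check-ins the appliance has to handle per minute.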

We have the same problem with Patching. If I set a Detect task to run at 9am, it won’t actually start at 9am. I’ll see machines that have been on all day running their Detect job at 2pm! The same happens with Deploy tasks. Save and Run Now doesn’t work at all with Patching.

As you can imagine, this makes scheduling downtime and maintenance impossible.

Under Communication Settings I can see that our Load Average bounces between 0.8-1.8. I haven’t seen it go much higher.

All clients are also using the latest KACE Agent; however, this problem existed on the older Agent too.

I plan to raise a support ticket for these issues, but I thought I’d reach out for ideas here too.

Answers

My first thought was making sure you have adequate resources assigned to the Kbox as a VM; but I am just going to assume you do.  

Note: we have 1k clients, keep this in mind as you read the bullets below.  

I would also recommend you look in the following places:
* http://kbox/munin specifically the mysql processes and apache processes (our kbox typically runs around 10 apache processes; when you approach 100, things run terribly)
* regarding your agent settings, in order to performance tune we have set ours to the following (note this may not work in your environment depending on the features in use)
** Inventory - every 4 hours
** Scripting - every 3 hours
** Metering - every 7 days (we don't use metering and have found in 5.5.x it has caused more problems than the return on value)
** Catalog Inventory - every 7 days (again we do not rely on this)
* the other area to look at is the status of your k1000 agent tasks; in 5.5 this is found under settings/ support/ troubleshooting tools.  This will give you a sense of what is going on in the background
* in the same area, also look at AMP message queue
* also check out, under settings/ logs the Konductor log; see http://www.kace.com/~/media/Files/Support/KACE-Konference/2011/US/K1000-Advanced-Topics.ashx for more information about Konductor
* finally check your scripts and see what type of scripts are running in the background.  If you have a lot of online scripts you should convert those to offline if possible; online scripts, if pointed at too many clients, will take your kbox offline.  



Answered 12/19/2014 by: Jbr32

  • Hi, thanks for the reply.
    I've taken a look at your advice and most of it looks fine.

    We have 900 clients.

    - Over the last month, our Apache processes have been no higher than 25. However, if I look at the "Number of Processes" graphs I can see that our K1 has 175 processes running on average.
    - We only have two scripts enabled, and both of those are Offline Scripts.
    - I've changed our agent settings to mirror yours just out of interest. I can't say this has made any difference.

    Here is a quick snippet of our Konductor Log. The first thing that jumps out at me is that our lv values are often well over the lt values. I don't honestly know what that means though.

    [2014-12-22 12:24:30 +0000] Konductor[1522] [main] stats [s:2352 t/s:0 t/tc:8 t:1268 tc:148 c:1226 cc:175 sl:30 sc:2271 tpl:14 apa:11 lt:1 lv:7.32]
    [2014-12-22 12:25:03 +0000] Konductor[1522] [main] stats [s:2385 t/s:0 t/tc:8 t:1268 tc:148 c:1242 cc:176 sl:30 sc:2301 tpl:14 apa:13 lt:1 lv:9.35]
    [2014-12-22 12:25:33 +0000] Konductor[1522] [main] stats [s:2415 t/s:0 t/tc:8 t:1268 tc:148 c:1256 cc:177 sl:30 sc:2331 tpl:14 apa:12 lt:1 lv:8.8]
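In case it helps anyone compare values over a longer stretch of the log, here is a rough sketch that pulls the key:value pairs out of stats lines like the ones above. The regular expression and the `parse_stats` name are only illustrative, based on the three lines shown here; it does not assume anything about what the individual fields mean.

```python
import re

# Pattern inferred from the three sample lines above; other Konductor
# log lines may look different and will simply not match.
STATS_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+Konductor\[(?P<pid>\d+)\]\s+\[main\]\s+stats\s+\[(?P<body>[^\]]*)\]"
)


def parse_stats(line: str) -> dict:
    """Return the key:value pairs of one stats line as a dict of floats."""
    match = STATS_RE.match(line.strip())
    if match is None:
        raise ValueError("not a Konductor stats line")
    stats = {}
    for pair in match.group("body").split():
        key, _, value = pair.partition(":")
        stats[key] = float(value)
    return stats


lines = [
    "[2014-12-22 12:24:30 +0000] Konductor[1522] [main] stats [s:2352 t/s:0 t/tc:8 t:1268 tc:148 c:1226 cc:175 sl:30 sc:2271 tpl:14 apa:11 lt:1 lv:7.32]",
    "[2014-12-22 12:25:03 +0000] Konductor[1522] [main] stats [s:2385 t/s:0 t/tc:8 t:1268 tc:148 c:1242 cc:176 sl:30 sc:2301 tpl:14 apa:13 lt:1 lv:9.35]",
    "[2014-12-22 12:25:33 +0000] Konductor[1522] [main] stats [s:2415 t/s:0 t/tc:8 t:1268 tc:148 c:1256 cc:177 sl:30 sc:2331 tpl:14 apa:12 lt:1 lv:8.8]",
]

for line in lines:
    stats = parse_stats(line)
    # Show how far lv sits above lt on each sample.
    print(stats["lt"], stats["lv"], stats["lv"] > stats["lt"])
```

Run against a full log, the same loop would show whether lv is above lt all the time or only in bursts.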
    • Your LV is really high. How many processors do you have allocated on your VM and how much memory?
      How do your long tasks, task throughput, and mySQL long queries graphs look?
  • Also, under scripts, I can see K1000 Scripting Updater (ID=3) in the list of enabled scripts. I'm not sure I'm supposed to be able to see/modify that.
    • Ours is running at 171 processes, so I would assume that is probably normal. Regarding script 3, indeed that is an online script used by the kbox; do not change it.

      Your LV is higher than ours normally is. We typically see a .0x load level or sometimes 5 or 6.

      Have you rebooted? Might be worth putting in a support ticket to take a closer look.
      • The appliance has had a few reboots the last few days. Sometimes it's OK, other times not.

        I've raised a support ticket but with Christmas I think it'll be the new year before any answers (totally fine with that!).
      • Now the New year madness is out of the way I've been able to sit down and look at this again.

        As a test, I set up a small batch of machines to start a patching run on.
        These machines have already been doing a Detect daily, so I created a Detect and Deploy task and clicked "Save and Run Now", expecting it would happen within the next few minutes.
        Nothing happened...this was at about 10am this morning.

        If I looked under Agent Tasks I could see that the tasks were ready, but nothing was happening. All the machines had a valid AMP connection.

        An hour later, the machines started patching!

        I managed to grab a screenshot of this from the Agent Tasks - http://imgur.com/A1v8Ptw

        The same thing happens with Scheduled Patch runs. They never start when you tell them to.