Discussion: High CPU on one core, but unable to find process responsible
David Rosenstrauch
2018-03-12 02:19:34 UTC
My server's been exhibiting some very strange behavior lately. Every
couple of days I run into a situation where one core (core #0) on the
quad core CPU starts continuously using around 34% of CPU, but I'm not
able to see (using htop) any process that's responsible for using all
that CPU. Even when I tell htop to show me kernel threads too, I still
am not able to see the offending process. Every process remains under
1% CPU usage (except for occasional, small, short-lived spikes up) yet
the CPU usage on that core remains permanently hovering at around 34%.
The problem goes away when I reboot, but then comes back within a day or
so.

I'm rather stumped as to how to fix this. The server is a bit old,
running an up-to-date installation of Arch on an Intel Core 2 Quad Q6600
CPU. Any suggestions anyone might have as to either what might be going
on here, or how to go about debugging it would be greatly appreciated.

Thanks!

DR
Carsten Mattner via arch-general
2018-03-12 02:59:10 UTC
Post by David Rosenstrauch
My server's been exhibiting some very strange behavior lately. Every
couple of days I run into a situation where one core (core #0) on the
quad core CPU starts continuously using around 34% of CPU, but I'm not
able to see (using htop) any process that's responsible for using all
that CPU. Even when I tell htop to show me kernel threads too, I still
am not able to see the offending process. Every process remains under
1% CPU usage (except for occasional, small, short-lived spikes up) yet
the CPU usage on that core remains permanently hovering at around 34%.
The problem goes away when I reboot, but then comes back within a day or
so.
My gut feeling is that one of the kernel worker threads is hung. A
spinning kthread would show up as 100% of the affected core, i.e. 25%
overall on a quad core. But you say there's no load to be found in the
kernel threads, which is odd.
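
To double-check, you could list the busiest threads directly, kernel
ones included; a rough sketch with procps ps (note pcpu is averaged
over each thread's lifetime, so a long-spinning thread still floats up):

ps -eLo pid,tid,comm,pcpu --sort=-pcpu | head -n 15   # top 15 threads by CPU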

Or, if the server is accessible from the Internet, is it possible it's
been rooted and someone's running a hidden process? To rule that out,
cut off Internet access and let it run for two days.
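
If pulling the cable is inconvenient, downing the interface should do;
the name below is a placeholder for the server's real NIC:

ip link set eth0 down   # "eth0" is a placeholder; check "ip link" for the real name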

No legitimate process hides itself from htop or top; anything that does
is behaving like a rootkit. So if the guilty process is really
invisible, that's definitely unusual.

It's scary to consider a rootkit, but if that's the case, it's best to
know as soon as possible. I hope this is not the case for you; I
wouldn't wish it on my worst enemy.
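
A crude way to check for a hidden process is to compare the PIDs ps
reports with what's actually in /proc -- tools like unhide do this far
more thoroughly. Short-lived processes can race this, so rerun it a few
times before reading anything into the output:

comm -13 <(ps -e -o pid= | awk '{print $1}' | sort) \
         <(ls /proc | grep -E '^[0-9]+$' | sort)
# prints PIDs visible in /proc but not reported by ps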

Another idea: can you limit the machine to one or maybe two cores and
see if that makes the culprit easier to pinpoint?

This might work in the booted system (cpu0 is the boot CPU and
typically cannot be taken offline):

echo 0 > /sys/devices/system/cpu/cpu1/online   # take core 1 offline
echo 0 > /sys/devices/system/cpu/cpu2/online   # take core 2 offline
echo 0 > /sys/devices/system/cpu/cpu3/online   # take core 3 offline

Otherwise, maxcpus=1 on the kernel command line should work.
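
You can confirm which cores remain up with:

cat /sys/devices/system/cpu/online   # e.g. prints just "0" once cpu1-cpu3 are off

and echo 1 back into the same files to bring the cores online again.
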
Jiachen Yang via arch-general
2018-03-12 09:13:49 UTC
Post by David Rosenstrauch
My server's been exhibiting some very strange behavior lately.  Every
couple of days I run into a situation where one core (core #0) on the
quad core CPU starts continuously using around 34% of CPU, but I'm not
able to see (using htop) any process that's responsible for using all
that CPU.  Even when I tell htop to show me kernel threads too, I
still am not able to see the offending process.  Every process remains
under 1% CPU usage (except for occasional, small, short-lived spikes
up) yet the CPU usage on that core remains permanently hovering at
around 34%.  The problem goes away when I reboot, but then comes back
within a day or so.
I'm rather stumped as to how to fix this.  The server is a bit old,
running an up-to-date installation of Arch on an Intel Core 2 Quad
Q6600 CPU.  Any suggestions anyone might have as to either what might
be going on here, or how to go about debugging it would be greatly
appreciated.
Thanks!
DR
Can you check whether you have enabled the "Detailed CPU time" option in
htop's setup (F2 -> Display options -> "Detailed CPU time")?
From my experience and understanding, htop's CPU meter accounts for
IO-wait/IRQ time by default but does not display it separately unless
you enable the "Detailed CPU time" option. This waiting time is not
attributed to any individual process or kernel thread. Enabling that
option will reveal more detailed CPU usage info.
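
If you want to watch the same counters htop aggregates without htop,
they live in /proc/stat; the columns after the label are user, nice,
system, idle, iowait, irq, softirq, steal, in clock ticks:

grep '^cpu0 ' /proc/stat
# sample this twice a few seconds apart; the fastest-growing
# column is where the time is going
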
High IO-wait or IRQ time is itself an indication of some misbehaving
hardware, but at least you can be fairly sure it is not something more
"dangerous" like malware or an attack.
David Rosenstrauch
2018-03-13 01:23:21 UTC
Post by Jiachen Yang via arch-general
My server's been exhibiting some very strange behavior lately.  Every
couple of days I run into a situation where one core (core #0) on the
quad core CPU starts continuously using around 34% of CPU, but I'm not
able to see (using htop) any process that's responsible for using all
that CPU.
Can you check whether you have enabled the "Detailed CPU time" option in
htop's setup (F2 -> Display options -> "Detailed CPU time")?
From my experience and understanding, htop's CPU meter accounts for
IO-wait/IRQ time by default but does not display it separately unless
you enable the "Detailed CPU time" option. This waiting time is not
attributed to any individual process or kernel thread. Enabling that
option will reveal more detailed CPU usage info.
High IO-wait or IRQ time is itself an indication of some misbehaving
hardware, but at least you can be fairly sure it is not something more
"dangerous" like malware or an attack.
Thanks for the suggestion. So this issue happened again tonight, and I
switched to "Detailed CPU time" to try to research it further.
According to htop, the CPU usage is from "irq" (orange color). I guess
that would also explain why I'm not seeing any process responsible.

It might also be related that I'm seeing these messages in dmesg:

[  871.317377] perf: interrupt took too long (2506 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[ 1732.773491] perf: interrupt took too long (3140 > 3132), lowering kernel.perf_event_max_sample_rate to 63000
[ 3375.392292] perf: interrupt took too long (3950 > 3925), lowering kernel.perf_event_max_sample_rate to 50000


So if this issue is irq-based, I guess that means some piece of hardware
is faulty or failing. Any idea how I might go about pinning down which
one? Would there be info in the kernel log about this? Or something
that I can look at in /proc?
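
In the meantime, I figure I can snapshot the interrupt counters twice
and diff them to see which line is climbing:

cat /proc/interrupts > /tmp/irq.1
sleep 10
cat /proc/interrupts > /tmp/irq.2
diff /tmp/irq.1 /tmp/irq.2   # the line that jumped the most is the suspect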

Thanks,

DR
Carsten Mattner via arch-general
2018-03-13 01:56:28 UTC
Post by David Rosenstrauch
So if this issue is irq-based, I guess that means some piece of hardware
is faulty or failing. Any idea how I might go about pinning down which
one? Would there be info in the kernel log about this? Or something
that I can look at in /proc?
Any BIOS updates or kernel updates recently (4.15.8)?

Try with 3.16 or 4.9 or another old LTS kernel from archive.archlinux.org
just for testing (not production).

It's more likely that the kernel regressed than that IRQ problems popped
up suddenly on their own. The latter is possible, just less likely.
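
Old packages install straight from the archive with pacman; the version
below is only illustrative -- browse the directory for what's actually
there:

pacman -U https://archive.archlinux.org/packages/l/linux-lts/linux-lts-4.9.87-1-x86_64.pkg.tar.xz
# see https://archive.archlinux.org/packages/l/linux-lts/ for the real filenames
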
David Rosenstrauch
2018-03-13 02:03:28 UTC
Post by Carsten Mattner via arch-general
Any BIOS updates or kernel updates recently (4.15.8)?
Try with 3.16 or 4.9 or another old LTS kernel from archive.archlinux.org
just for testing (not production).
It's more likely that the kernel regressed than that IRQ problems popped
up suddenly on their own. The latter is possible, just less likely.
No BIOS updates. As for kernel updates, I do those all the time, so I'm
not sure that would be the cause.


After doing some digging, though, I did "cat /proc/interrupts", and this
line stood out for having an astronomically high number:

           CPU0  CPU1  CPU2  CPU3
 ...
 16: 2424156658     0     0     0  IO-APIC  16-fasteoi  uhci_hcd:usb5, parport1


I have an old PCI card in the machine that powers an old parallel-port
printer I used to use with it. Perhaps that's failing. I don't need
the card anymore, so I might as well try taking it out and see if that
makes things better. If not, I'll be back. :-)
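
Before pulling it, I may also try detaching the driver and seeing
whether the interrupt count stops climbing -- assuming it's parport_pc
driving the card:

modprobe -r parport_pc   # unload the PC parallel-port driver (unload lp first if loaded)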

Thanks,

DR
Carsten Mattner via arch-general
2018-03-13 07:34:17 UTC
Post by David Rosenstrauch
Post by Carsten Mattner via arch-general
Any BIOS updates or kernel updates recently (4.15.8)?
Try with 3.16 or 4.9 or another old LTS kernel from archive.archlinux.org
just for testing (not production).
It's more likely that the kernel regressed than that IRQ problems popped
up suddenly on their own. The latter is possible, just less likely.
No BIOS updates. As for kernel updates, I do those all the time, so I'm
not sure that would be the cause.
After doing some digging, though, I did "cat /proc/interrupts", and this
line stood out for having an astronomically high number:
           CPU0  CPU1  CPU2  CPU3
 ...
 16: 2424156658     0     0     0  IO-APIC  16-fasteoi  uhci_hcd:usb5, parport1
I have an old PCI card in the machine that powers an old parallel-port
printer I used to use with it. Perhaps that's failing. I don't need
the card anymore, so I might as well try taking it out and see if that
makes things better. If not, I'll be back. :-)
Let us know in either case. Please tell me it's a huge plotter
you use to make construction plans for your rocket ships.
David Rosenstrauch
2018-03-13 12:58:02 UTC
Post by Carsten Mattner via arch-general
Post by David Rosenstrauch
printer I used to use with it. Perhaps that's failing. I don't need
the card anymore, so I might as well try taking it out and see if that
makes things better. If not, I'll be back. :-)
Let us know in either case. Please tell me it's a huge plotter
you use to make construction plans for your rocket ships.
Just an old printer that's been good to me and that I've been reluctant
to let go of:

https://www.cnet.com/products/hp-laserjet-6l/specs/

DR

David C. Rankin
2018-03-13 01:06:54 UTC
Post by David Rosenstrauch
My server's been exhibiting some very strange behavior lately.  Every couple
of days I run into a situation where one core (core #0) on the quad core CPU
starts continuously using around 34% of CPU, but I'm not able to see (using
htop) any process that's responsible for using all that CPU.  Even when I tell
htop to show me kernel threads too, I still am not able to see the offending
process.  Every process remains under 1% CPU usage (except for occasional,
small, short-lived spikes up) yet the CPU usage on that core remains
permanently hovering at around 34%.  The problem goes away when I reboot, but
then comes back within a day or so.
I'm rather stumped as to how to fix this.  The server is a bit old, running an
up-to-date installation of Arch on an Intel Core 2 Quad Q6600 CPU.  Any
suggestions anyone might have as to either what might be going on here, or how
to go about debugging it would be greatly appreciated.
Thanks!
DR
DR,

You are on to something. In the past 24 hours I've experienced two hardlocks
on two separate multi-CPU Arch servers, both AMD-based boxes. One is the dual
quad-core Opteron box that was the subject of the 4.15.8 update failure that
left the box unbootable. Then today a quad quad-core Opteron box hardlocked
while doing a simple rsync from another server on the LAN. Other than the
rsync process, the box was nearly idle, e.g.:

top - 19:49:45 up 8 min, 2 users, load average: 2.93, 2.23, 1.11
Tasks: 271 total, 1 running, 158 sleeping, 0 stopped, 0 zombie
%Cpu0 : 0.0/0.0 0[ ]
%Cpu1 : 0.0/0.0 0[ ]
%Cpu2 : 0.0/0.0 0[ ]
%Cpu3 : 0.0/0.0 0[ ]
%Cpu4 : 0.0/0.0 0[ ]
%Cpu5 : 0.0/0.0 0[ ]
%Cpu6 : 0.0/0.0 0[ ]
%Cpu7 : 0.0/0.0 0[ ]
%Cpu8 : 0.0/0.0 0[ ]
%Cpu9 : 0.0/0.0 0[ ]
%Cpu10 : 0.0/0.0 0[ ]
%Cpu11 : 0.0/0.7 1[| ]
%Cpu12 : 0.0/0.0 0[ ]
%Cpu13 : 0.0/0.0 0[ ]
%Cpu14 : 0.0/0.0 0[ ]
%Cpu15 : 0.0/0.0 0[ ]
GiB Mem : 1.7/62.915 [ ]
GiB Swap: 0.0/1.998 [ ]


I could not capture top output during the lockup: while I could still
connect to the server from another box via ssh, I could not log in.
E.g.:

19:34 wizard:~/dev/src-c/tmp/debug> svh
Last login: Mon Mar 12 19:34:44 2018 from 192.168.6.104
^CConnection to valhalla closed.

Both boxes have also exhibited strange behavior regarding the linux-raid
array (all disks are fine), and I receive spurious errors like "Out of
IOMMU space". Huh?

Mar 12 19:45:20 valhalla su[869]: pam_unix(su:session): session opened for user root by david(uid=1000)
Mar 12 19:45:57 valhalla kernel: sata_nv 0000:00:05.0: PCI-DMA: Out of IOMMU space for 65536 bytes
Mar 12 19:45:57 valhalla kernel: ata3: EH in SWNCQ mode,QC:qc_active 0x4 sactive 0x4
Mar 12 19:45:57 valhalla kernel: ata3: SWNCQ:qc_active 0x0 defer_bits 0x0 last_issue_tag 0x1 dhfis 0x0 dmafis 0x0 sdbfis 0x0
Mar 12 19:45:57 valhalla kernel: ata3: ATA_REG 0x40 ERR_REG 0x0
Mar 12 19:45:57 valhalla kernel: ata3: tag : dhfis dmafis sdbfis sactive
Mar 12 19:45:57 valhalla kernel: ata3.00: exception Emask 0x0 SAct 0x4 SErr 0x0 action 0x6
Mar 12 19:45:57 valhalla kernel: ata3.00: failed command: WRITE FPDMA QUEUED
Mar 12 19:45:57 valhalla kernel: ata3.00: cmd 61/00:10:00:d0:e4/0a:00:0f:00:00/40 tag 2 ncq dma 1310720 out
                                          res 40/00:20:00:ea:e3/00:00:0f:00:00/40 Emask 0x40 (internal error)
Mar 12 19:45:57 valhalla kernel: ata3.00: status: { DRDY }
Mar 12 19:45:57 valhalla kernel: ata3: hard resetting link
Mar 12 19:45:57 valhalla kernel: ata3: nv: skipping hardreset on occupied port
Mar 12 19:45:58 valhalla kernel: ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Mar 12 19:45:58 valhalla kernel: ata3.00: configured for UDMA/133
Mar 12 19:45:58 valhalla kernel: ata3: EH complete
Mar 12 19:46:09 valhalla kernel: sata_nv 0000:00:05.1: PCI-DMA: Out of IOMMU space for 65536 bytes
Mar 12 19:46:09 valhalla kernel: ata5: EH in SWNCQ mode,QC:qc_active 0x4 sactive 0x4
Mar 12 19:46:09 valhalla kernel: ata5: SWNCQ:qc_active 0x0 defer_bits 0x0 last_issue_tag 0x1 dhfis 0x0 dmafis 0x0 sdbfis 0x0
Mar 12 19:46:09 valhalla kernel: ata5: ATA_REG 0x40 ERR_REG 0x0
Mar 12 19:46:09 valhalla kernel: ata5: tag : dhfis dmafis sdbfis sactive
Mar 12 19:46:09 valhalla kernel: ata5.00: exception Emask 0x0 SAct 0x4 SErr 0x0 action 0x6
Mar 12 19:46:09 valhalla kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Mar 12 19:46:09 valhalla kernel: ata5.00: cmd 61/00:10:00:c0:f8/0a:00:0f:00:00/40 tag 2 ncq dma 1310720 out
                                          res 40/00:20:00:da:f7/00:00:0f:00:00/40 Emask 0x40 (internal error)
Mar 12 19:46:09 valhalla kernel: ata5.00: status: { DRDY }
Mar 12 19:46:09 valhalla kernel: ata5: hard resetting link
Mar 12 19:46:09 valhalla kernel: ata5: nv: skipping hardreset on occupied port
Mar 12 19:46:10 valhalla kernel: ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Mar 12 19:46:10 valhalla kernel: ata5.00: configured for UDMA/133
Mar 12 19:46:10 valhalla kernel: ata5: EH complete
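
The "Out of IOMMU space" errors look like GART aperture exhaustion on
these old Opterons. If I'm reading the kernel's x86_64 boot-options
documentation right, the aperture can be enlarged from the kernel
command line, so I may try booting with:

iommu=memaper=3    # allocate a larger aperture over RAM (32MB << 3 = 256MB)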

Today's lockup occurred after an update (29 packages), e.g.:

Packages (29)

calc-2.12.6.7-1 cryptsetup-2.0.2-1 device-mapper-2.02.177-5
e2fsprogs-1.44.0-1 efivar-34-1 filesystem-2018.1-2 gawk-4.2.1-1
git-2.16.2-2 graphite-1:1.3.11-1 hdparm-9.54-1 imagemagick-7.0.7.26-1
libevdev-1.5.9-1 libmagick-7.0.7.26-1 libmagick6-6.9.9.38-1
libsystemd-238.0-3 libzip-1.4.0-1 linux-lts-4.14.25-1
linux-lts-headers-4.14.25-1 lvm2-2.02.177-5 nano-2.9.4-1 qt5-base-5.10.1-5
systemd-238.0-3 systemd-sysvcompat-238.0-3 tcl-8.6.8-2
vulkan-icd-loader-1.1.70.0-3 wxgtk-common-3.0.4-1 wxgtk2-3.0.4-1
wxgtk3-3.0.4-1 xfsprogs-4.15.1-1

Following the update, I reloaded systemd (systemctl daemon-reload) and then
proceeded with the rsync transfer -- which locked the box. I note the list
includes vulkan-icd-loader and qt5-base, both of which were also updated
yesterday at the time of the failed update that caused so many problems.

I don't know where to begin. I've been monitoring with top since the last
lockup, but have not caught anything. Both of these boxes have run Arch since
2016 and had never experienced any problems before now. Both are SuperMicro
boards with either two or four quad-core Opterons. Both are RAID1 Linux
software raid boxes and have either 32 or 64G of RAM.

Something has gone haywire in the past week that has an adverse impact on
these boxes. I have to admit -- I have no idea where to start tracking this
down.
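
I suppose the first stop is the journal from the locked-up boot --
assuming anything made it to disk before the hard lock:

journalctl -b -1 -p err   # errors logged during the previous (locked) boot
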
--
David C. Rankin, J.D., P.E.