View topic - Fsys 4.24V hang

Fsys 4.24V hang

For the old QNX2 and QNX4 RTOS

Fsys 4.24V hang

Postby davepage » Thu Apr 05, 2007 12:10 am

Hi,
My application (simcon) is hanging for 500-1000mSec (missing many hard deadlines).

Some probing with monitor reveals that Fsys is hogging the CPU. Elevating the application priority above Fsys does not fix the problem (Fsys continues to run, regardless). Supplying Fsys with the -F option at boot does not help.

# sin info
Node CPU Machine Speed Memory Ticksize Display Flags
1 686/687 PCI 46396 50360k/66715k 1.0ms VGA Color -3P+----------8P

Heapp Heapf Heapl Heapn Hands Names Sessions Procs Timers Nodes Virtual
10564 151k 22888 0 64 100 64 500 125 1 59M/ 117M

# sin
SID PID PROGRAM PRI STATE BLK CODE DATA
-- -- Microkernel --- ----- --- 11696 0
0 1 sys/Proc32 26f READY --- 114k 1093k
0 2 sys/Slib32 10r RECV 0 53k 4096
0 4 /bin/Fsys 10r RECV 0 77k 73998k
0 5 /bin/Fsys.eide 22r RECV 0 61k 114k
0 7 //1/bin/ksh 10o WAIT -1 47k 36k
0 8 idle 0r READY --- 0 40k
0 16 //1/bin/Dev32 24f RECV 0 32k 90k
0 19 //1/bin/Dev32.ansi 20r RECV 0 40k 90k
0 22 //1/bin/Dev32.pty 20r RECV 0 12k 32k
0 23 //1/bin/Pipe 10r RECV 0 16k 49k
0 25 //1/usr/bin/int10 10r RECV 0 65k 65k
0 28 //1/bin/Net 23r RECV 0 32k 73k
0 31 //1/bin/Net.rtl 20r RECV 0 40k 118k
0 37 //1/bin/tinit 10o WAIT -1 16k 28k
0 41 //1/*/usr/ucb/Socklet 10r RECV 0 114k 172k
0 51 //1/*/usr/bin/syslogd 10o RECV 0 36k 32k
0 56 //1/*/usr/ucb/inetd 10o RECV 58 36k 32k
1 62 //1/*/4.25/usr/ucb/ds 10o RECV 0 12k 16k
2 63 //1/*/samba/bin/nmbd 10o RECV 66 335k 446k
3 64 //1/*/usr/ucb/slinger 10o REPLY 41 40k 16k
4 67 //1/*/samba/bin/smbd 10o RECV 74 444k 704k
5 77 //1/bin/login 10o REPLY 16 24k 20k
0 81 //1/usr/simcon/simcon 27f REPLY 0 133k 5615k -- main
0 82 //1/usr/simcon/simcon 10o RECV 97 133k 200k -- net
0 83 //1/usr/simcon/simcon 29f RECV 0 133k 5615k -- isr
4 98 //1/*/samba/bin/smbd 10o RECV 99 444k 1048k
0 105 //1/*/usr/ucb/telnetd 10o RECV 107 53k 53k
6 108 //1/bin/ksh 10o WAIT -1 47k 45k
6 128 //1/bin/sin 10o REPLY 1 45k 49k


Internal simcon thread trace:
[Task_name UNIX_seconds(TSC_low_dword)]
...
UserIO 1175721142.039501(F5AEF744)
UserIO 1175721142.039501(F5AF1B07) - failure occurs after
ToroIO 1175721142.733000(1911B850) - note 700mSec delay
CanIO 1175721142.733101(191255FF)
...

msgprint excerpt:
....

- Note msg from smbd

10.f5a25a62 104(smbd) sent (0x20e:0x17) [0x020e] to 4(Fsys)
10.f5a25cee active is 4(Fsys)
10.f5a49f9c 4(Fsys) interrupted(#10): @5:6f81
10.f5a4a266 interrupt handler(10): 81(simcon), @0005: 12538
10.f5a4cb08 handler done, wakeup proxy(83(simcon))
10.f5a4cf7f end of interrupt #10
10.f5a4e120 proxy(83(simcon)) triggers 83(simcon)
10.f5a4e40c 4(Fsys) is ready
10.f5a4e5c1 active is 83(simcon)
10.f5a4eb25 83(simcon) recv (none) [] from {none}
10.f5a4ee28 active is 4(Fsys)
10.f5a70c1f 4(Fsys) interrupted(#0): @5:3c2b
10.f5a70e18 interrupt handler(0): {none}, @00f0: 05000
10.f5a713f5 handler done, wakeup proxy(1(Proc32))
10.f5a71851 interrupt handler(0): 81(simcon), @0005: 12268
10.f5a71bf7 handler done, wakeup proxy(83(simcon))
10.f5a71f20 interrupt handler(0): 30(Net), @0015: 043c9
10.f5a72257 handler done
10.f5a72442 end of interrupt #0
10.f5a73166 proxy(83(simcon)) triggers 83(simcon)
10.f5a73489 proxy(1(Proc32)) triggers 1(Proc32)
10.f5a736df 4(Fsys) is ready
10.f5a738c3 active is 83(simcon)
10.f5a74edc 83(simcon) recv (none) [] from {none}
10.f5a75157 active is 1(Proc32)
10.f5a75a43 1(Proc32) reply (0x0:0x9) [No error] to 81(simcon)
10.f5a75cdd 1(Proc32) is ready
10.f5a75e83 active is 81(simcon)

10.f5af61c5 81(simcon) interrupted(#10): @5:565f9
10.f5af644a interrupt handler(10): 81(simcon), @0005: 12538
10.f5af7f72 handler done, wakeup proxy(83(simcon))
10.f5af8243 end of interrupt #10
10.f5af922a proxy(83(simcon)) triggers 83(simcon)
10.f5af94ac 81(simcon) is ready
10.f5af96ac active is 83(simcon)
10.f5af9ca3 83(simcon) recv (none) [] from {none}
10.f5af9f02 active is 81(simcon)
10.f5aff14f 81(simcon) sent (0x19:0x7) [0x0019] to 1(Proc32)
10.f5aff427 active is 1(Proc32)
10.f5affd14 1(Proc32) recv (0x19:0x7) [0x0019] from 81(simcon)
10.f5b00e53 1(Proc32) reply (0x0:0x7) [No error] to 81(simcon)
10.f5b010ef 1(Proc32) is ready
10.f5b01296 active is 81(simcon)
10.f5b01a10 81(simcon) sent (0x16:0x0) [0x0016] to 1(Proc32)
10.f5b01ca0 active is 1(Proc32)
10.f5b02285 1(Proc32) recv (0x16:0x0) [0x0016] from 81(simcon)
10.f5b02ec1 1(Proc32) reply (0x0:0x7) [No error] to 81(simcon)
10.f5b031a7 1(Proc32) is ready
10.f5b0334d active is 81(simcon)
10.f5b0380c priority 4(Fsys) = 27
10.f5b03a47 81(simcon) sent (0x101:0x7) [0x0101] to 4(Fsys)
10.f5b03ceb active is 4(Fsys)
10.f5b41d01 4(Fsys) interrupted(#0): @5:6f7a
10.f5b41f95 interrupt handler(0): {none}, @00f0: 05000
10.f5b42645 handler done
10.f5b4283f interrupt handler(0): 81(simcon), @0005: 12268
10.f5b42af7 handler done, wakeup proxy(83(simcon))
10.f5b42d6f interrupt handler(0): 30(Net), @0015: 043c9
10.f5b431d3 handler done
10.f5b433d9 end of interrupt #0
10.f5b44114 proxy(83(simcon)) triggers 83(simcon)
10.f5b443e0 4(Fsys) is ready
10.f5b44586 active is 83(simcon)
10.f5b45bcd 83(simcon) recv (none) [] from {none}
10.f5b45e64 active is 4(Fsys)
10.f5c12c41 4(Fsys) interrupted(#0): @5:6f81
10.f5c12eeb interrupt handler(0): {none}, @00f0: 05000
10.f5c1322f handler done
10.f5c1342a interrupt handler(0): 81(simcon), @0005: 12268
10.f5c13695 handler done, wakeup proxy(83(simcon))
10.f5c1397a interrupt handler(0): 30(Net), @0015: 043c9
10.f5c13c0b handler done
10.f5c13f21 interrupt handler(-1): 102(monitor), @0005: 0ad36
10.f5c143c8 handler done, wakeup proxy(102(monitor))
10.f5c14782 interrupt handler(-1): 30(Net), @0015: 0431e
10.f5c14d31 handler done
10.f5c150b0 interrupt handler(-1): 20(Dev32.ansi), @0005: 05dc0
10.f5c156cc handler done
10.f5c159a6 interrupt handler(-1): 17(Dev32), @0005: 05760
10.f5c15f92 handler done
10.f5c16181 end of interrupt #0
10.f5c16e86 proxy(102(monitor)) triggers 102(monitor)
10.f5c1729f proxy(83(simcon)) triggers 83(simcon)
10.f5c17541 4(Fsys) is ready
10.f5c17737 active is 102(monitor)
10.f5c18898 102(monitor) sent (0x8:0x2) [0x0008] to 1(Proc32)
10.f5c18b05 active is 83(simcon)
10.f5c19c64 83(simcon) recv (none) [] from {none}
10.f5c19ef5 active is 4(Fsys)
10.f5ce3a6e 4(Fsys) interrupted(#0): @5:6f81
10.f5ce3cb3 interrupt handler(0): {none}, @00f0: 05000
10.f5ce4227 handler done
10.f5ce4479 interrupt handler(0): 81(simcon), @0005: 12268
10.f5ce47c6 handler done, wakeup proxy(83(simcon))
10.f5ce4ae6 interrupt handler(0): 30(Net), @0015: 043c9
10.f5ce4df4 handler done
10.f5ce4fea end of interrupt #0
10.f5ce621f proxy(83(simcon)) triggers 83(simcon)
10.f5ce64ed 4(Fsys) is ready
10.f5ce6693 active is 83(simcon)
10.f5ce79ec 83(simcon) recv (none) [] from {none}
10.f5ce7d54 active is 4(Fsys)
10.f5db48ec 4(Fsys) interrupted(#0): @5:6f81
10.f5db4b76 interrupt handler(0): {none}, @00f0: 05000
10.f5db4ed1 handler done
10.f5db510d interrupt handler(0): 81(simcon), @0005: 12268
10.f5db552d handler done, wakeup proxy(83(simcon))
10.f5db5785 interrupt handler(0): 30(Net), @0015: 043c9
10.f5db5b66 handler done
10.f5db5da5 end of interrupt #0
10.f5db6bc8 proxy(83(simcon)) triggers 83(simcon)
10.f5db6e20 4(Fsys) is ready
10.f5db6fc6 active is 83(simcon)
10.f5db82f3 83(simcon) recv (none) [] from {none}
10.f5db857b active is 4(Fsys)
10.f5e8597f 4(Fsys) interrupted(#0): @5:6f7a
10.f5e85c79 interrupt handler(0): {none}, @00f0: 05000
10.f5e860cc handler done
10.f5e86390 interrupt handler(0): 81(simcon), @0005: 12268
10.f5e86684 handler done, wakeup proxy(83(simcon))
10.f5e86947 interrupt handler(0): 30(Net), @0015: 043c9
10.f5e86cb5 handler done
10.f5e86e95 end of interrupt #0
10.f5e87d26 proxy(83(simcon)) triggers 83(simcon)
10.f5e88028 4(Fsys) is ready
10.f5e881ce active is 83(simcon)
10.f5e893f6 83(simcon) recv (none) [] from {none}

- Fsys now stuck. PID 81 not running

10.f5e8976a active is 4(Fsys)
10.f5f56905 4(Fsys) interrupted(#0): @5:3c2f
10.f5f56b0d interrupt handler(0): {none}, @00f0: 05000
10.f5f56eca handler done
10.f5f57106 interrupt handler(0): 81(simcon), @0005: 12268
10.f5f573bc handler done, wakeup proxy(83(simcon))
10.f5f57688 interrupt handler(0): 30(Net), @0015: 043c9
10.f5f57920 handler done
10.f5f57b53 end of interrupt #0
10.f5f58918 proxy(83(simcon)) triggers 83(simcon)
10.f5f58b6d 4(Fsys) is ready
10.f5f58d13 active is 83(simcon)
10.f5f59f5f 83(simcon) recv (none) [] from {none}

... above block repeats many times, lines deleted
... in the interim, CPU used 581983668 cycles (699mSec)!

11.18a5fd13 active is 4(Fsys)
11.18b2d42f 4(Fsys) interrupted(#0): @5:b5f
11.18b2d687 interrupt handler(0): {none}, @00f0: 05000
11.18b2d9d7 handler done
11.18b2dbc2 interrupt handler(0): 81(simcon), @0005: 12268
11.18b2de7c handler done, wakeup proxy(83(simcon))
11.18b2e0ce interrupt handler(0): 30(Net), @0015: 043c9
11.18b2e393 handler done
11.18b2e573 end of interrupt #0
11.18b2f312 proxy(83(simcon)) triggers 83(simcon)
11.18b2f58d 4(Fsys) is ready
11.18b2f733 active is 83(simcon)
11.18b30983 83(simcon) recv (none) [] from {none}
11.18b30c90 active is 4(Fsys)
11.18b56e13 4(Fsys) reply (0x0:0x0) [No error] to 104(smbd)

- Hmmm. Running Samba Ver 2.2.2

11.18b58821 4(Fsys) recv (0x101:0x7) [0x0101] from 81(simcon)
11.18b61bfe 4(Fsys) sent (0x16:0x0) [0x0016] to 1(Proc32)
11.18b621db active is 1(Proc32)
11.18b638ae 1(Proc32) recv (0x8:0x2) [0x0008] from 102(monitor)
11.18b64a4e 1(Proc32) reply (0x0:0x0) [No error] to 102(monitor)
11.18b64d95 1(Proc32) is ready
11.18b64f92 active is 102(monitor)

... monitor dumps its load (lines deleted)

11.18ffb9e6 active is 102(monitor)
11.18ffc801 priority 4(Fsys) = 29
11.18ffca89 102(monitor) sent (0x104:0x3) [0x0104] to 4(Fsys)
11.18ffcd82 active is 4(Fsys)
11.19013024 {none} interrupted(#0): @f0:3ed0
11.1901322e interrupt handler(0): {none}, @00f0: 05000
11.19013677 handler done
11.19013883 interrupt handler(0): 81(simcon), @0005: 12268
11.19013afd handler done, wakeup proxy(83(simcon))
11.19013dd6 interrupt handler(0): 30(Net), @0015: 043c9
11.190141c6 handler done
11.190143b6 end of interrupt #0
11.19017897 4(Fsys) read from 102(monitor)
11.19017dd5 proxy(83(simcon)) triggers 83(simcon)
11.19018f00 4(Fsys) reply (0x0:0x1000) [No error] to 102(monitor)
11.1901962c 4(Fsys) recv (none) [] from {none}
11.190198f1 active is 83(simcon)
11.1901b414 83(simcon) recv (none) [] from {none}
11.1901b6a6 active is 102(monitor)
11.1901c478 102(monitor) sent (0x8:0x2) [0x0008] to 1(Proc32)
11.1901c79e active is 81(simcon)
11.1902012d priority 4(Fsys) = 27
11.190204b0 81(simcon) sent (0x105:0x7) [0x0105] to 4(Fsys)
11.190207bd active is 4(Fsys)
11.19021c80 4(Fsys) reply (0x0:0x0) [No error] to 81(simcon)
11.19022245 4(Fsys) recv (none) [] from {none}
11.190224d4 active is 81(simcon) - Yay!

Why would smbd hose Fsys?

Best regards - Dave
davepage
New Member
 
Posts: 8
Joined: Sat Oct 29, 2005 2:41 am

RE: Fsys 4.24V hang

Postby mario » Thu Apr 05, 2007 1:23 am

Fsys will run code that is inside the driver (Fsys.eide in your case). So it might be related to Fsys.eide. could be an hardware issue.

Would probably not hurt to update to 4.24Y and latest Fsys.eide.

Or even better try Fsys.atapi.

- Mario
mario
QNX Master
 
Posts: 4132
Joined: Sun Sep 01, 2002 1:04 am

Postby davepage » Thu Apr 05, 2007 4:47 pm

The smbd message to Fsys (0x20e) is _FSYS_DISK_SPACE. Commenting out the call to statfs in samba-2.2.2/source/lib/fsusage.c makes the problem go away. I have had issues with this disk_space thing before. Grrr.
davepage
New Member
 
Posts: 8
Joined: Sat Oct 29, 2005 2:41 am

Postby mario » Thu Apr 05, 2007 4:53 pm

You got 2.2.2 to build? interesting. Most binaries for QNX4 are 2.0.7 which does have that fix included I believe.

Good catch!
mario
QNX Master
 
Posts: 4132
Joined: Sun Sep 01, 2002 1:04 am

Postby davepage » Thu Apr 05, 2007 9:48 pm

Switching to Fsys 4.24Y and Fsys.atapi 4.25G does not help (still hangs in Fsys)

FWIW, Samba 2.2.2 seems to work better with WinXP. I looked at the analogous source for 2.0.7, and there is no substantive difference.

I tried using the native disk_space call -- same problem.
davepage
New Member
 
Posts: 8
Joined: Sat Oct 29, 2005 2:41 am

Postby noc » Thu Apr 05, 2007 10:34 pm

There was another Dave Page talking about samba and disk_space:

http://www.openqnx.com/newsgroups/viewtopic.php?t=25165
noc
Senior Member
 
Posts: 1634
Joined: Sat Jul 06, 2002 4:34 am

Postby davepage » Thu Apr 05, 2007 11:09 pm

Same davepage, same samba, entirely different problem.

However, it is interesting to consider that the total failure in the stack disk_space library function implies a lack of testing in that domain, thus the additional Fsys hang issue was never tested/discovered by QSSL.
davepage
New Member
 
Posts: 8
Joined: Sat Oct 29, 2005 2:41 am

Postby davepage » Sat Apr 07, 2007 9:01 pm

Reimplementing the disk_space lib func as an Fsys message results in same faulty behavior.

...
#include <sys/fsys_msg.h>

int sys_fsusage(const char *path, SMB_BIG_UINT *dfree, SMB_BIG_UINT *dsize)
{
int fd;
pid_t proc;
struct _fsys_disk_space ds_msg;
struct _fsys_disk_space_reply ds_reply;

if ( (fd=open(path,O_RDONLY)) == -1) return -1;

proc = qnx_name_locate ( 0, "qnx/fsys32", 0, NULL);
if (proc == -1)
{
close(fd);
return -1;
}

ds_msg.type = _FSYS_DISK_SPACE;
ds_msg.fd = fd;
ds_msg.zero = 0;

if ( Send (proc, &ds_msg, &ds_reply, sizeof(ds_msg), sizeof(ds_reply)) == -1)
{
close(fd);
return -1;
}
*dfree = ds_reply.free_blocks;
*dsize = ds_reply.total_blocks;

close(fd);

return 0;
}
davepage
New Member
 
Posts: 8
Joined: Sat Oct 29, 2005 2:41 am

Postby mario » Tue Apr 10, 2007 3:27 am

I think that if the fd points to a directory Fsys actually has to scan the whole directory tree to obtain the size.
mario
QNX Master
 
Posts: 4132
Joined: Sun Sep 01, 2002 1:04 am

Postby davepage » Tue Apr 10, 2007 5:02 am

Running

# /bin/df /

also causes Fsys to hang for >500mSec -- note that this is the shortest possible path. But the delay occurs only on the first execution. Fsys seems to cache the result for subsequent calls.

This is with a 160GB drive. When we used 80GB drives, we did not see this problem.

> scan the whole directory tree to obtain the size

Which would be fine if Fsys ran the scan at prio 10 (which is where I'm running smbd and df).

I am convinced this is a serious bug in Fsys. What we're going to do for now is to put a /bin/df / >/dev/null in our startup script. With this, the next disk_space call from smbd completes in a timely fashion. Though, this is risky as we have no idea when the Fsys cached result will expire.

Thanks for your help - Dave
davepage
New Member
 
Posts: 8
Joined: Sat Oct 29, 2005 2:41 am

Postby gubbe » Wed Apr 11, 2007 8:19 am

good catch! Amazing how these bugs can surface after so many years. (one example we found was sending IPC messages to the idle process crashes QNX)
gubbe
Active Member
 
Posts: 82
Joined: Mon Dec 13, 2004 9:18 am

Postby mario » Wed Apr 11, 2007 12:44 pm

I am not sure that is a bug. I think that is the way the QNX4 filesystem is design. There is nowhere in the filesystem a location that hold the status of free and used space. Fsys probably has to go through inodes and .bitmap to figure it out. Like the rest of the stuff it probably gets cached on first access.

Also understand that Fsys was designed when 500Meg drives were a luxury ;-)
mario
QNX Master
 
Posts: 4132
Joined: Sun Sep 01, 2002 1:04 am

Postby davepage » Wed Apr 11, 2007 3:09 pm

I think that is the way the QNX4 filesystem is design. There is nowhere in the filesystem a location that hold the status of free and used space. Fsys probably has to go through inodes and .bitmap to figure it out. Like the rest of the stuff it probably gets cached on first access.


Certainly Fsys can and should block on caller's priority (10) to determine the result of the disk_space message.

The serious bug is that while Fsys is busy with the disk_space call at prio 10, my app is calling for io at prio 27. This floats Fsys up to my priority, and I block until the disk_space call is complete.

As a result, a disk_space call from a low priority process (10) is suspending a ready high priority process. This is not acceptable performance for an RTOS.

Fsys is multithreaded -- but how? Does the implementation of disk_space block every running thread? Or just threads doing IO to the same device? Am I simply running out of Fsys threads? There are very few files open, and only these two processes are doing file IO, so the default (4 threads) seems sufficient.

Clearly, these sorts of OS implementation issues are very difficult to get right. And, while the future growth of hard disk size is easy to anticipate, it is difficult (though not impossible) to test. I am not attacking QNX -- all OSes have bugs.

I would like to try increasing the cluster size to reduce the inode table. dinit does not appear to let me do that. Any ideas?

Thanks for your help - Dave
davepage
New Member
 
Posts: 8
Joined: Sat Oct 29, 2005 2:41 am

Postby mario » Wed Apr 11, 2007 3:24 pm

davepage wrote:
As a result, a disk_space call from a low priority process (10) is suspending a ready high priority process. This is not acceptable performance for an RTOS.


Indeed, but QNX4 is a very old OS and hasn't seen any major improvement in about 10 years so don't expect these short coming to be resolved. (But it's worth contacting QSS about this)


davepage wrote:
Fsys is multithreaded -- but how? Does the implementation of disk_space block every running thread? Or just threads doing IO to the same device? Am I simply running out of Fsys threads? There are very few files open, and only these two processes are doing file IO, so the default (4 threads) seems sufficient.


It's not really multithread (in the QNX6 sense). I suspect it't bound to the device but calculating free and used space should not be an atomic operation anyway.

davepage wrote:
I would like to try increasing the cluster size to reduce the inode table. dinit does not appear to let me do that. Any ideas?


QNX4 doesn't support that. It's fixed.
mario
QNX Master
 
Posts: 4132
Joined: Sun Sep 01, 2002 1:04 am


Return to QNX2 and QNX4

Who is online

Users browsing this forum: No registered users and 2 guests