Closed
Bug 746112
Opened 12 years ago
Closed 12 years ago
RegExp hang on ppc64 in execute
Categories
(Core :: JavaScript Engine, defect)
Tracking
()
RESOLVED
FIXED
mozilla18
People
(Reporter: stransky, Assigned: terrence)
References
()
Details
Attachments
(5 files, 1 obsolete file)
654 bytes,
patch
|
Details | Diff | Splinter Review | |
1.10 MB,
application/octet-stream
|
Details | |
156.96 KB,
text/plain
|
Details | |
1.71 KB,
patch
|
Details | Diff | Splinter Review | |
4.77 KB,
patch
|
billm
:
review+
|
Details | Diff | Splinter Review |
js::gc::DecommitMemory() and madvise() seems to fall in endless loop on ppc64/Fedora 17. Downstream bug - https://bugzilla.redhat.com/show_bug.cgi?id=813095 Description of problem: Firefox freezes after a few minutes of usage. I'm not sure what triggers the issue, though a few minutes browsing random pages has been enough to reproduce the issue. Firefox seems to be stuck, there is a LOT of messages in the strace like this: madvise(0xfff574ce000, 4096, MADV_DONTNEED) = -1 EINVAL (Invalid argument) (looks like an endless loop there). bt: #0 0x00000fff96f447b0 in .madvise () from /lib64/libc.so.6 No symbol table info available. #1 0x00000fff9698f9f4 in js::gc::DecommitMemory (addr=<error reading variable: value has been optimized out>, size=<error reading variable: value has been optimized out>) at /usr/src/debug/xulrunner-10.0.1/mozilla-release/js/src/jsgcchunk.cpp:370 result = <optimized out> #2 0x00000fff96980c94 in DecommitFreePages (cx=<optimized out>) at /usr/src/debug/xulrunner-10.0.1/mozilla-release/js/src/jsgc.cpp:2406 next = <error reading variable next (value has been optimized out)> success = <optimized out> aheader = <error reading variable aheader (value has been optimized out)> chunk = 0xfff7eb00000 r = {cur = 0xfff79192a00, end = <optimized out>} rt = 0xfff89b60000 #3 SweepPhase (gckind=GC_SHRINK, gcmarker=0xfffee8f90b8, cx=0xfff93b81c30) at /usr/src/debug/xulrunner-10.0.1/mozilla-release/js/src/jsgc.cpp:2640 ap = {stats = @0xfff89b60158, phase = <optimized out>} rt = 0xfff89b60000 ap = {stats = @0xfff89b60158, phase = <optimized out>} releaseTypes = <optimized out> #4 MarkAndSweep (gckind=GC_SHRINK, cx=0xfff93b81c30) at /usr/src/debug/xulrunner-10.0.1/mozilla-release/js/src/jsgc.cpp:2677 rt = 0xfff89b60000 unlock = {rt = 0xfff89b60000} gcmarker = {<JSTracer> = {context = 0xfff93b81c30, callback = 0, debugPrinter = 0, debugPrintArg = 0x0, debugPrintIndex = 18446744073709551615, eagerlyTraceWeakMaps = 0}, color = 0, unmarkedArenaStackTop = 0x0, markLaterArenas = {<No data fields>}, objStack = {stack = 0xfff89b60290, tos = 0, limit = 32767}, ropeStack = {stack = 0xfff89ba0290, tos = 0, limit = 1023}, typeStack = {stack = 0xfff89ba2290, tos = 0, limit = 1023}, xmlStack = {stack = 0xfff89ba4290, tos = 0, limit = 1023}, largeStack = {stack = 0xfff89ba6290, tos = 0, limit = 63}} #5 GCCycle (cx=cx@entry=0xfff93b81c30, comp=comp@entry=0x0, gckind=gckind@entry=GC_SHRINK) at /usr/src/debug/xulrunner-10.0.1/mozilla-release/js/src/jsgc.cpp:2921 rt = 0xfff89b60000 gcsession = {context = 0xfff93b81c30} sc = {<js::PreserveCompartment> = {cx = 0xfff93b81c30, oldCompartment = 0xfff8a5ca000, oldInferenceEnabled = <optimized out>}, <No data fields>} #6 0x00000fff9698105c in js_GC (cx=0xfff93b81c30, comp=0x0, gckind=<optimized out>, reason=<optimized out>) at /usr/src/debug/xulrunner-10.0.1/mozilla-release/js/src/jsgc.cpp:2983 rt = 0xfff89b60000 (full bt at https://bugzilla.redhat.com/attachment.cgi?id=577846)
Reporter | ||
Updated•12 years ago
|
Hardware: x86_64 → PowerPC
Comment 1•12 years ago
|
||
Do you happen to be using JACK on that machine? http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=660960 is a similar issue triggered by JACK locking memory it doesn't own.
Reporter | ||
Comment 2•12 years ago
|
||
Fhe Fedora reporter does not run jack plugin, but he find some info about possible page size issues: <gustavold> I saw in the error messages that ff is calling madvise on a memory address that is not 64K aligned <gustavold> ppc64 kernel on f17 uses 64k pages... <gustavold> I was able to build a custom kernel with 4k pages <gustavold> and ff works fine with this kernel <gustavold> I'm wondering if the js_gc could be ignoring getpagesize() somewhere
Comment 3•12 years ago
|
||
I wouldn't be surprised if the js code assumed 4k pages. Which makes this actually be *two* different bugs: infinite loop when madvise returns EINVALID, and wrong page size being used.
Assignee | ||
Comment 4•12 years ago
|
||
I think for version 10, Mike is right here. The change that introduced DecommitFreePages also broke Solaris on Sparc v9 with 8k pages. Apparently it worked fine to just change the page size, as here: http://mxr.mozilla.org/mozilla-central/source/js/src/jsgc.h?force=1#104 This entire section of code has been under extremely heavy work: DecommitFreePages actually got rewritten in FF12 and runs in a background thread now with better control logic (like checking the error code): http://mxr.mozilla.org/mozilla-central/source/js/src/jsgc.cpp#2439 Then, for FF13, I simplified the memory subsystem and added an assert that the PageSize that is defined matches the one in sysconf: http://mxr.mozilla.org/mozilla-central/source/js/src/gc/Memory.cpp#278 So I think the important bugs are already fixed here. It may still make sense to add a JS_OPT_ASSERT_IF(rv == -1, errno != EINVAL) after the MarkPagesUnused in the Decommit loop to ensure that we don't run into a similar situation again, so I'll leave this bug open for that. Martin, if you really need to get FF*10* working on ppc64 right this instant, it is perfectly safe to just comment out the MarkPagesUnused and force the return to success. You could also theoretically adjust PageShift for ppc64, but unfortunately, we tie the arena size to the page size, so this would be disastrously memory inefficient. Good luck, and let me know if you need any more specific help.
Assignee: general → terrence
Status: NEW → ASSIGNED
Reporter | ||
Comment 5•12 years ago
|
||
Thanks for the update. The fix for FF10 makes sense because the Firefox ESR is based on it and we ship it on PPC(64).
Comment 6•12 years ago
|
||
Nice! If you guys have a fix I can test it.
Reporter | ||
Comment 7•12 years ago
|
||
Comment 8•12 years ago
|
||
Comment 9•12 years ago
|
||
Comment 10•12 years ago
|
||
I applied Martin's patch to xulrunner package (Fedora 17) and tested Firefox, but I still experience Firefox freezes. It may be a different bug though. Attached strace and gdb backtrace for Firefox with Martin's patch.
Assignee | ||
Comment 11•12 years ago
|
||
Sorry it took me so long to get to looking at this, Gustavo! The new hang you're experiencing is, as you surmised, unrelated to GC. It appears to be in the regex engine. I know nothing about the regex engine, other than that we have some ongoing work in Bug 691898 to make the situation better for "non Tier-1 platforms".
Assignee: terrence → general
Status: ASSIGNED → NEW
Summary: madvise()/js::gc::DecommitMemory() endless loop on ppc64 → RegExp hang on ppc64 in execute
Reporter | ||
Comment 12•12 years ago
|
||
It's something in Fedora 17, the RegExp crash does not affect RHEL systems which work fine with the workaround.
Comment 13•12 years ago
|
||
This issue seems to happen only if firefox is compiled using gcc 4.7 (this explains why RHEL is not affected). I compiled firefox with the latest revision from gcc 4.7 branch to make sure it is not a fixed gcc bug, but this issue still happens. Any idea on how to narrow down this issue to a minimal test case? Maybe we could reproduce it using the js testsuite. I'd appreciate any guidance on how to run it.
Assignee | ||
Comment 14•12 years ago
|
||
Gustavo, thanks for tracking this down! Looks like it's time to bring in the compiler experts.
Comment 15•12 years ago
|
||
Here is a smaller test case. Though I'm not sure if this specific test case is related to firefox freezes I'm experiencing. echo '/a(b)c/.exec("abc")' | js/src/shell/js gcc 4.7 result: ["abc", (void 0)] gcc 4.6 result: ["abc", "b"] It is, in fact, from js/src/jit-test/tests/basic/bug594108.js There are some other tests failing, but this one was the only one I tested with both gcc versions.
Assignee | ||
Comment 16•12 years ago
|
||
Hurm. That's pretty wildly wrong. If there are many other test suite failures, I would actually begin to suspect that gcc-4.7 is generating wrong code for PPC. Is there a way we could verify that gcc is being sane here? Was the rest of your system built with gcc-4.7 or only the browser? What flags are you building with and does the behavior change if you change the optimization level?
Comment 17•12 years ago
|
||
(In reply to Terrence Cole [:terrence] from comment #16) > Hurm. That's pretty wildly wrong. If there are many other test suite > failures, I would actually begin to suspect that gcc-4.7 is generating wrong > code for PPC. Following is the list of tests failing (along with its variants with different js shell options like -m -a -d etc.). js/src/jit-test/tests/basic/bug594108.js js/src/jit-test/tests/basic/bug599854.js js/src/jit-test/tests/basic/bug627609.js js/src/jit-test/tests/basic/bug691797-regexp-2.js js/src/jit-test/tests/basic/regexp-reset-input.js js/src/jit-test/tests/basic/testStackIter.js js/src/jit-test/tests/debug/Frame-this-03.js js/src/jit-test/tests/debug/Frame-this-04.js js/src/jit-test/tests/sunspider/check-date-format-tofte.js js/src/jit-test/tests/sunspider/check-string-tagcloud.js I've run the complete test suite only built with gcc-4.7. I will try to run the complete test suite built with gcc-4.6 and report the results here. > Is there a way we could verify that gcc is being sane here? Was the rest of > your system built with gcc-4.7 or only the browser? The entire system was built with gcc-4.7 (Fedora17). > What flags are you > building with and does the behavior change if you change the optimization > level? Using the configure option --disable-optimize makes the test case I mentioned on comment #15 work fine. However I was not able to generate an rpm package with --disable-optimize because the rpm generation fails in the install phase with the following error message: /home/gustavold/rpmbuild/BUILD/xulrunner-10.0.4/mozilla-esr10/xulrunner/installer/../../toolkit/mozapps/installer/precompile_cache.js:73: TypeError: Cc is null
Assignee | ||
Comment 18•12 years ago
|
||
If changing the compiler flags makes the crashes go away, then this is almost certainly a bug in gcc.
Comment 19•12 years ago
|
||
I've been in touch with the gcc maintainers for ppc, but it is hard for them to tell what is wrong without a small test case. As soon as we find the peace of code that is misbehaving, they can jump in and tell if there is something wrong with gcc (and hopefully fix it).
Comment 20•12 years ago
|
||
I tracked down what part of the RegExp code is misbehaving. The attached patch makes firefox freezes go away. Note that it is not a real solution to the problem. It is just a proof of concept to help tracking down what is the real problem (be it on the RegExp code or the gcc). Description of the misbehavior: In "YarrPatternConstructor::atomParenthesesSubpatternBegin()" after "m_alternative->m_terms.append(...);" if you check the value of "m_alternative->m_terms.last().m_capture" you will find the value 1 when built with gcc 4.6 and 0 when built with gcc 4.7 (considering the attached patch is NOT applied)
Comment 21•12 years ago
|
||
Here is the gcc bugz that is causing this issue: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53438
Assignee | ||
Comment 22•12 years ago
|
||
Excellent find!
Assignee | ||
Comment 23•12 years ago
|
||
Comment on attachment 658591 [details] [diff] [review] v0 I just have a few minor nits here. I'd like to see one block in gc/Heap.h that sets both the page shift and the arena shift based on the CPU. It would look like this: /* ...comment that explains all three cases... */ #if SPARC const size_t PageShift = 13; const size_t ArenaShift = PageShift; #elif PPC const size_t PageShift = 16; const size_t ArenaShift = 12; #else const size_t PageShift = 12; const size_t ArenaShift = PageShift; #endif const size_t PageSize = size_t(1) << PageShift; const size_t ArenaSize = PageSize; const size_t ArenaMask = ArenaSize - 1; Then we would have a new inline function somewhere: bool DecommittEnabled() { return PageSize == ArenaSize; } and we would check that before doing any decommit work.
Attachment #658591 -
Flags: review?(wmccloskey)
Assignee | ||
Comment 25•12 years ago
|
||
That's much better looking, thanks!
Attachment #658591 -
Attachment is obsolete: true
Attachment #660255 -
Flags: review?(wmccloskey)
Comment on attachment 660255 [details] [diff] [review] v1 Review of attachment 660255 [details] [diff] [review]: ----------------------------------------------------------------- ::: js/src/jsgc.cpp @@ +2730,5 @@ > Maybe<AutoUnlockGC> maybeUnlock; > if (!rt->isHeapBusy()) > maybeUnlock.construct(rt); > + if (DecommitEnabled()) > + ok = MarkPagesUnused(aheader->getArena(), ArenaSize); Rather than making the change here, how about changing DecommitArenas to return early if !DecommitEnabled()?
Attachment #660255 -
Flags: review?(wmccloskey) → review+
Assignee | ||
Comment 27•12 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/0558ede9693e
Comment 28•12 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/0558ede9693e
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla18
Comment 29•11 years ago
|
||
Can we re-open this one ? The page size at compile time cannot work. By fixing 64k you just broke 4k such as all 32-bit powerpc's. In fact, on my G5, I'm using a 64-bit kernel with a 4k page size because the nouveau driver doesn't work with 64k Would it be possible to make the code support a variable system page size ? (As long as it remains a multiple of the Arena size is ok ... even a power of two multiple). This problem will affect other architectures that support variable page sizes (I know at least of mips and ia64 but I heard ARM is moving toward a 64k option as well). I suspect that a deeper rework of that code might be needed...
You need to log in
before you can comment on or make changes to this bug.
Description
•