Closed Bug 746112 Opened 12 years ago Closed 12 years ago

RegExp hang on ppc64 in execute

Categories

(Core :: JavaScript Engine, defect)

10 Branch
PowerPC
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla18

People

(Reporter: stransky, Assigned: terrence)

References

()

Details

Attachments

(5 files, 1 obsolete file)

js::gc::DecommitMemory() and madvise() seems to fall in endless loop on ppc64/Fedora 17.

Downstream bug - https://bugzilla.redhat.com/show_bug.cgi?id=813095

Description of problem:
Firefox freezes after a few minutes of usage. I'm not sure what triggers the
issue, though a few minutes browsing random pages has been enough to reproduce
the issue.

Firefox seems to be stuck, there is a LOT of messages in the strace like this:
madvise(0xfff574ce000, 4096, MADV_DONTNEED) = -1 EINVAL (Invalid argument)
(looks like an endless loop there).

bt:
#0  0x00000fff96f447b0 in .madvise () from /lib64/libc.so.6
No symbol table info available.
#1  0x00000fff9698f9f4 in js::gc::DecommitMemory (addr=<error reading variable: value has been optimized out>, 
    size=<error reading variable: value has been optimized out>) at /usr/src/debug/xulrunner-10.0.1/mozilla-release/js/src/jsgcchunk.cpp:370
        result = <optimized out>
#2  0x00000fff96980c94 in DecommitFreePages (cx=<optimized out>) at /usr/src/debug/xulrunner-10.0.1/mozilla-release/js/src/jsgc.cpp:2406
        next = <error reading variable next (value has been optimized out)>
        success = <optimized out>
        aheader = <error reading variable aheader (value has been optimized out)>
        chunk = 0xfff7eb00000
        r = {cur = 0xfff79192a00, end = <optimized out>}
        rt = 0xfff89b60000
#3  SweepPhase (gckind=GC_SHRINK, gcmarker=0xfffee8f90b8, cx=0xfff93b81c30) at /usr/src/debug/xulrunner-10.0.1/mozilla-release/js/src/jsgc.cpp:2640
        ap = {stats = @0xfff89b60158, phase = <optimized out>}
        rt = 0xfff89b60000
        ap = {stats = @0xfff89b60158, phase = <optimized out>}
        releaseTypes = <optimized out>
#4  MarkAndSweep (gckind=GC_SHRINK, cx=0xfff93b81c30) at /usr/src/debug/xulrunner-10.0.1/mozilla-release/js/src/jsgc.cpp:2677
        rt = 0xfff89b60000
        unlock = {rt = 0xfff89b60000}
        gcmarker = {<JSTracer> = {context = 0xfff93b81c30, callback = 0, debugPrinter = 0, debugPrintArg = 0x0, debugPrintIndex = 18446744073709551615, 
            eagerlyTraceWeakMaps = 0}, color = 0, unmarkedArenaStackTop = 0x0, markLaterArenas = {<No data fields>}, objStack = {stack = 0xfff89b60290, 
            tos = 0, limit = 32767}, ropeStack = {stack = 0xfff89ba0290, tos = 0, limit = 1023}, typeStack = {stack = 0xfff89ba2290, tos = 0, 
            limit = 1023}, xmlStack = {stack = 0xfff89ba4290, tos = 0, limit = 1023}, largeStack = {stack = 0xfff89ba6290, tos = 0, limit = 63}}
#5  GCCycle (cx=cx@entry=0xfff93b81c30, comp=comp@entry=0x0, gckind=gckind@entry=GC_SHRINK)
    at /usr/src/debug/xulrunner-10.0.1/mozilla-release/js/src/jsgc.cpp:2921
        rt = 0xfff89b60000
        gcsession = {context = 0xfff93b81c30}
        sc = {<js::PreserveCompartment> = {cx = 0xfff93b81c30, oldCompartment = 0xfff8a5ca000, oldInferenceEnabled = <optimized out>}, <No data fields>}
#6  0x00000fff9698105c in js_GC (cx=0xfff93b81c30, comp=0x0, gckind=<optimized out>, reason=<optimized out>)
    at /usr/src/debug/xulrunner-10.0.1/mozilla-release/js/src/jsgc.cpp:2983
        rt = 0xfff89b60000

(full bt at https://bugzilla.redhat.com/attachment.cgi?id=577846)
Hardware: x86_64 → PowerPC
Do you happen to be using JACK on that machine?
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=660960 is a similar issue triggered by JACK locking memory it doesn't own.
Fhe Fedora reporter does not run jack plugin, but he find some info about possible page size issues:

<gustavold> I saw in the error messages that ff is calling madvise on a memory address that is not 64K aligned
<gustavold> ppc64 kernel on f17 uses 64k pages...
<gustavold> I was able to build a custom kernel with 4k pages
<gustavold> and ff works fine with this kernel
<gustavold> I'm wondering if the js_gc could be ignoring getpagesize() somewhere
I wouldn't be surprised if the js code assumed 4k pages. Which makes this actually be *two* different bugs: infinite loop when madvise returns EINVALID, and wrong page size being used.
I think for version 10, Mike is right here.  The change that introduced DecommitFreePages also broke Solaris on Sparc v9 with 8k pages.  Apparently it worked fine to just change the page size, as here:
http://mxr.mozilla.org/mozilla-central/source/js/src/jsgc.h?force=1#104

This entire section of code has been under extremely heavy work: DecommitFreePages actually got rewritten in FF12 and runs in a background thread now with better control logic (like checking the error code):
http://mxr.mozilla.org/mozilla-central/source/js/src/jsgc.cpp#2439

Then, for FF13, I simplified the memory subsystem and added an assert that the PageSize that is defined matches the one in sysconf:
http://mxr.mozilla.org/mozilla-central/source/js/src/gc/Memory.cpp#278

So I think the important bugs are already fixed here.  It may still make sense to add a JS_OPT_ASSERT_IF(rv == -1, errno != EINVAL) after the MarkPagesUnused in the Decommit loop to ensure that we don't run into a similar situation again, so I'll leave this bug open for that.

Martin, if you really need to get FF*10* working on ppc64 right this instant, it is perfectly safe to just comment out the MarkPagesUnused and force the return to success.  You could also theoretically adjust PageShift for ppc64, but unfortunately, we tie the arena size to the page size, so this would be disastrously memory inefficient.  Good luck, and let me know if you need any more specific help.
Assignee: general → terrence
Status: NEW → ASSIGNED
Thanks for the update. The fix for FF10 makes sense because the Firefox ESR is based on it and we ship it on PPC(64).
Nice! If you guys have a fix I can test it.
I applied Martin's patch to xulrunner package (Fedora 17) and tested Firefox, but I still experience Firefox freezes. It may be a different bug though.

Attached strace and gdb backtrace for Firefox with Martin's patch.
Sorry it took me so long to get to looking at this, Gustavo!

The new hang you're experiencing is, as you surmised, unrelated to GC.  It appears to be in the regex engine.  I know nothing about the regex engine, other than that we have some ongoing work in Bug 691898 to make the situation better for "non Tier-1 platforms".
Assignee: terrence → general
Status: ASSIGNED → NEW
Summary: madvise()/js::gc::DecommitMemory() endless loop on ppc64 → RegExp hang on ppc64 in execute
It's something in Fedora 17, the RegExp crash does not affect RHEL systems which work fine with the workaround.
This issue seems to happen only if firefox is compiled using gcc 4.7 (this explains why RHEL is not affected).
I compiled firefox with the latest revision from gcc 4.7 branch to make sure it is not a fixed gcc bug, but this issue still happens.

Any idea on how to narrow down this issue to a minimal test case?
Maybe we could reproduce it using the js testsuite. I'd appreciate any guidance on how to run it.
Gustavo, thanks for tracking this down!  Looks like it's time to bring in the compiler experts.
Here is a smaller test case. Though I'm not sure if this specific test case is related to firefox freezes I'm experiencing.

echo '/a(b)c/.exec("abc")' | js/src/shell/js

gcc 4.7 result:
["abc", (void 0)]

gcc 4.6 result:
["abc", "b"]

It is, in fact, from js/src/jit-test/tests/basic/bug594108.js
There are some other tests failing, but this one was the only one I tested with both gcc versions.
Hurm.  That's pretty wildly wrong.  If there are many other test suite failures, I would actually begin to suspect that gcc-4.7 is generating wrong code for PPC.

Is there a way we could verify that gcc is being sane here?  Was the rest of your system built with gcc-4.7 or only the browser?  What flags are you building with and does the behavior change if you change the optimization level?
(In reply to Terrence Cole [:terrence] from comment #16)
> Hurm.  That's pretty wildly wrong.  If there are many other test suite
> failures, I would actually begin to suspect that gcc-4.7 is generating wrong
> code for PPC.

Following is the list of tests failing (along with its variants with different js shell options like -m -a -d etc.).

js/src/jit-test/tests/basic/bug594108.js
js/src/jit-test/tests/basic/bug599854.js
js/src/jit-test/tests/basic/bug627609.js
js/src/jit-test/tests/basic/bug691797-regexp-2.js
js/src/jit-test/tests/basic/regexp-reset-input.js
js/src/jit-test/tests/basic/testStackIter.js
js/src/jit-test/tests/debug/Frame-this-03.js
js/src/jit-test/tests/debug/Frame-this-04.js
js/src/jit-test/tests/sunspider/check-date-format-tofte.js
js/src/jit-test/tests/sunspider/check-string-tagcloud.js

I've run the complete test suite only built with gcc-4.7. I will try to run the complete test suite built with gcc-4.6 and report the results here.


> Is there a way we could verify that gcc is being sane here?  Was the rest of
> your system built with gcc-4.7 or only the browser? 

The entire system was built with gcc-4.7 (Fedora17).

> What flags are you
> building with and does the behavior change if you change the optimization
> level?

Using the configure option --disable-optimize makes the test case I mentioned on comment #15 work fine. However I was not able to generate an rpm package with --disable-optimize because the rpm generation fails in the install phase with the following error message:
/home/gustavold/rpmbuild/BUILD/xulrunner-10.0.4/mozilla-esr10/xulrunner/installer/../../toolkit/mozapps/installer/precompile_cache.js:73:
TypeError: Cc is null
If changing the compiler flags makes the crashes go away, then this is almost certainly a bug in gcc.
I've been in touch with the gcc maintainers for ppc, but it is hard for them to tell what is wrong without a small test case. As soon as we find the peace of code that is misbehaving, they can jump in and tell if there is something wrong with gcc (and hopefully fix it).
I tracked down what part of the RegExp code is misbehaving. The attached patch makes firefox freezes go away.
Note that it is not a real solution to the problem. It is just a proof of concept to help tracking down what is the real problem (be it on the RegExp code or the gcc).

Description of the misbehavior:
In "YarrPatternConstructor::atomParenthesesSubpatternBegin()"
after "m_alternative->m_terms.append(...);"
if you check the value of "m_alternative->m_terms.last().m_capture"
you will find the value 1 when built with gcc 4.6 and 0 when built with gcc 4.7
(considering the attached patch is NOT applied)
Here is the gcc bugz that is causing this issue: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53438
Excellent find!
Attached patch v0 (obsolete) — Splinter Review
Assignee: general → terrence
Status: NEW → ASSIGNED
Attachment #658591 - Flags: review?(wmccloskey)
Comment on attachment 658591 [details] [diff] [review]
v0

I just have a few minor nits here. I'd like to see one block in gc/Heap.h that sets both the page shift and the arena shift based on the CPU. It would look like this:

/* ...comment that explains all three cases... */
#if SPARC
const size_t PageShift = 13;
const size_t ArenaShift = PageShift;
#elif PPC
const size_t PageShift = 16;
const size_t ArenaShift = 12;
#else
const size_t PageShift = 12;
const size_t ArenaShift = PageShift;
#endif

const size_t PageSize = size_t(1) << PageShift;
const size_t ArenaSize = PageSize;
const size_t ArenaMask = ArenaSize - 1;

Then we would have a new inline function somewhere:

bool DecommittEnabled() { return PageSize == ArenaSize; }

and we would check that before doing any decommit work.
Attachment #658591 - Flags: review?(wmccloskey)
Attached patch v1Splinter Review
That's much better looking, thanks!
Attachment #658591 - Attachment is obsolete: true
Attachment #660255 - Flags: review?(wmccloskey)
Comment on attachment 660255 [details] [diff] [review]
v1

Review of attachment 660255 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jsgc.cpp
@@ +2730,5 @@
>                  Maybe<AutoUnlockGC> maybeUnlock;
>                  if (!rt->isHeapBusy())
>                      maybeUnlock.construct(rt);
> +                if (DecommitEnabled())
> +                    ok = MarkPagesUnused(aheader->getArena(), ArenaSize);

Rather than making the change here, how about changing DecommitArenas to return early if !DecommitEnabled()?
Attachment #660255 - Flags: review?(wmccloskey) → review+
https://hg.mozilla.org/mozilla-central/rev/0558ede9693e
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla18
Blocks: 825165
Can we re-open this one ?

The page size at compile time cannot work. By fixing 64k you just broke 4k such as all 32-bit powerpc's. In fact, on my G5, I'm using a 64-bit kernel with a 4k page size because the nouveau driver doesn't work with 64k

Would it be possible to make the code support a variable system page size ? (As long as it remains a multiple of the Arena size is ok ... even a power of two multiple).

This problem will affect other architectures that support variable page sizes (I know at least of mips and ia64 but I heard ARM is moving toward a 64k option as well).

I suspect that a deeper rework of that code might be needed...
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: