Closed Bug 565489 Opened 14 years ago Closed 14 years ago

VarTracker non-null table may grow quite large

Categories

(Tamarin Graveyard :: Baseline JIT (CodegenLIR), defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED
flash10.1.x-Salt

People

(Reporter: rreitmai, Assigned: wmaddox)

References

Details

(Whiteboard: WE:2621248)

Attachments

(12 files, 5 obsolete files)

694 bytes, patch
wmaddox
: review+
wsharp
: review+
stejohns
: review+
Details | Diff | Splinter Review
2.98 KB, patch
Details | Diff | Splinter Review
14.17 KB, image/png
Details
63.40 KB, image/png
Details
20.57 KB, text/plain
Details
6.14 KB, patch
Details | Diff | Splinter Review
23.00 KB, application/x-tar
Details
2.65 KB, application/x-gzip
Details
4.77 KB, patch
rreitmai
: review+
edwsmith
: feedback-
Details | Diff | Splinter Review
23.69 KB, patch
Details | Diff | Splinter Review
5.14 KB, patch
rreitmai
: review+
edwsmith
: feedback+
Details | Diff | Splinter Review
6.48 KB, patch
edwsmith
: feedback+
Details | Diff | Splinter Review
The 'checked' hashtable of VarTracker may grow quite large under certain conditions (e.g. Alchemy-generated ABC), since LIR instructions are added to the table until a label is emitted, which for machine-generated code may happen only infrequently.
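(For scale: the table was originally sized to nvar buckets, so a machine-generated method with tens of thousands of stores and no intervening labels pushes the average chain length, and hence the cost of each lookup, into the thousands of probes; see the vprof numbers later in this bug.)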
Attached patch: short term fix
Assignee: nobody → rreitmai
Status: NEW → ASSIGNED
Attachment #445013 - Flags: review?(wmaddox)
Attachment #445013 - Flags: feedback?
Attachment #445013 - Flags: review?(wsharp)
Attachment #445013 - Flags: feedback?
This patch also fixes the issue, removing the need to create a large table.

This change might generate a lot of garbage, since we are allocating and disposing of a table with each call to trackForwardEdge.

Another approach would be to re-use the table, iterating over the list a second time and filling it with varTracker[i] based on notnull->get(i); this might be faster for small values of nvar.

Will run performance tests to see if there is any difference, but please comment on the general approach.
Attachment #445025 - Flags: feedback?(wmaddox)
Conditionally including in 10.1.  Let's understand the performance implications and risks before submitting to Argo.
Flags: flashplayer-qrb+
Priority: -- → P2
Target Milestone: --- → flash10.1
Comment on attachment 445025 [details] [diff] [review]
rebuild the nonnull table for each forward edge

Conceptually this looks reasonable, though I wonder if it will cause a lot of unnecessary reallocation in non-pathological cases.

Question: what's with "nvar*4"? It was just "nvar" before.
Performance-wise, on the micro-benchmarks I'm not seeing a difference between this patch and 445025.

So I recommend we go with this change; Bill, thoughts?
Attachment #445146 - Flags: feedback?(wmaddox)
Attachment #445013 - Flags: review?(stejohns)
Comment on attachment 445013 [details] [diff] [review]
short term fix

r+ conditional on a comment explaining how the magic value of 16700 was calculated
Attachment #445013 - Flags: review?(stejohns) → review+
(In reply to comment #4)
> (From update of attachment 445025 [details] [diff] [review])
> Conceptually looks reasonably, though I wonder if it will cause a lot of
> unnecessary reallocation in non-pathological cases.
> 
> Question: what's with "nvar*4"? It was just "nvar" before.

A bit of a safety valve.  Since the table doesn't rehash, we could still get into a situation where we have a long block (i.e. no jumps or labels) and thus fill the table beyond nvar.
Comment on attachment 445013 [details] [diff] [review]
short term fix

What was the largest observed number of entries in the HashMap?  While we should strive for a low load factor in the normal case, a dozen or so items per bucket is nothing to worry about for the tails of the distribution.  Taking a cue from the previous use of nvars, a more robust method of sizing would be to divide the size of the method by the estimated number of instruction bytes per store.  This gives an approximate upper bound on the number of entries in checked, which can be further divided by the desired bound on bucket size.

I'm sure the patch will work as-is, however, and suspect a smaller fixed size would be adequate.
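(For illustration, with hypothetical numbers: a 64 KB method at, say, 16 bytecode bytes per store gives roughly 4,000 checked entries; dividing by a desired bound of 4 entries per bucket suggests on the order of 1,000 buckets.)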
Attachment #445013 - Flags: review?(wmaddox) → review+
Comment on attachment 445025 [details] [diff] [review]
rebuild the nonnull table for each forward edge

This creates a lot of pool-allocated garbage that will hang around for a while. It would make more sense to just allocate a "new" and an "old" checked array, and alternately clear and reverse their roles.  But this is moot because your next patch is cleaner.
Attachment #445025 - Flags: review?
Attachment #445025 - Flags: feedback?(wmaddox)
Attachment #445025 - Flags: feedback-
Comment on attachment 445146 [details] [diff] [review]
loop version of the checked table update

This looks like a clean and correct fix.

For an expected hashtable load of one entry per bucket, the magic number '4' effectively represents the expected number of assignments per variable in a block.  Previously, we sync'd up once per superblock, at labels.  Now we sync up once per basic block, at labels, and at branches.
Attachment #445146 - Flags: feedback?(wmaddox) → feedback-
Attachment #445146 - Flags: feedback- → feedback+
Attachment #445013 - Flags: review?(wsharp) → review+
Whiteboard: WE:2621248
Comment on attachment 445146 [details] [diff] [review]
loop version of the checked table update

Adding Ed to review final patch for landing in redux.
Attachment #445146 - Flags: superreview?(edwsmith)
Patch #1 - "short term fix" landed in argo gmc see cl 671384

Justification for table size:

"The low risk fix is to expand the table to cover worst case and since we have a
bound on method size , we can bound the table size.  64K bytes (extreme case of 
1 opcode per byte) and ~5x expansion (to LIR) gives us around 20 entries per
bucket with ~16k table size."
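(Spelling out that arithmetic: 65,536 bytecode bytes * ~5 LIR instructions per byte ≈ 328K potential entries, and 328K / 16,700 buckets ≈ 20 entries per bucket, which is where the 16700 constant comes from.)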
Target Milestone: flash10.1 → flash10.2
Comment on attachment 445146 [details] [diff] [review]
loop version of the checked table update

I would like to see data posted on the effect of the patch -- what is the effect on memory usage for a reasonable set of methods?

Note that the HashMap class uses an Allocator, which is an arena allocator; nothing is ever freed until the arena is freed.  HashMap internally could recycle bucket nodes without much work, which would reduce the wasted memory when clear() is called.

It is perhaps odd to give HashMap this capability, since one would expect an allocator to have a free() method; however, in this situation Allocator should remain simple and fast (bump-pointer arena allocation).  Checking for a recycled free node is better done by HashMap, since the node will be the correct size and type, and no other clients of Allocator would pay for the capability.
Attachment #445146 - Flags: superreview?(edwsmith) → superreview+
Attachment #445025 - Flags: review?
vprof data on VorbisEnc.swf obtained from http://watsonexp.corp.adobe.com/#bug=2621248
shows a max probe count of 26474 (i.e. the max iterations of the loop within find()) and an average of 13205 over ~13k calls.  Time-wise, the average is 335 ticks, max 3663.

Post-patch, the numbers are max probes 3, average 3; time-wise, the average is 0.010 ticks, max 326.

Any suggestions for how to measure the memory effects?  I don't believe we have any existing mechanisms in place for gleaning this data.
VarTracker's hashtable allocates from a nanojit::Allocator, which is an arena allocator.  It's possible to define a VM-specific subclass and, in that subclass's allocation method, keep track of how much RAM is used; report it when the allocator is destroyed.
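A minimal, self-contained sketch of that idea follows; it uses a toy stand-in for the arena allocator rather than the real nanojit::Allocator, and the class and method names are illustrative only.

    #include <cstddef>
    #include <cstdio>
    #include <cstdlib>

    // Stand-in arena-style allocator base with an overridable chunk hook.
    // This is a toy analog, not the real nanojit::Allocator.
    class ArenaAllocator {
    public:
        virtual ~ArenaAllocator() {}
        void* alloc(std::size_t nbytes) { return allocChunk(nbytes); }
    protected:
        virtual void* allocChunk(std::size_t nbytes) { return std::malloc(nbytes); }
    };

    // VM-specific subclass that tracks how much RAM the arena hands out
    // and reports the total when the allocator is destroyed.
    class TrackingAllocator : public ArenaAllocator {
    public:
        TrackingAllocator() : total(0) {}
        virtual ~TrackingAllocator() {
            std::fprintf(stderr, "arena handed out %lu bytes\n", (unsigned long)total);
        }
    protected:
        virtual void* allocChunk(std::size_t nbytes) {
            total += nbytes;                         // record the request size
            return ArenaAllocator::allocChunk(nbytes);
        }
    private:
        std::size_t total;
    };

    int main() {
        TrackingAllocator a;
        a.alloc(128);
        a.alloc(4096);   // chunks are deliberately not freed here, as in an arena
        return 0;        // destructor prints: "arena handed out 4224 bytes"
    }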
Post patch I'm seeing a ~18% slowdown in jsmicro/switch-1/2/3.as that is completely attributable to the newly introduced loop.

Will post some memory numbers shortly, but unfortunately the test case that prompted this work is proving tough to work with.
Attached image Doom
A couple of charts that track the size of VarTracker's 'checked' table when running the Alchemy-generated Doom swf.

Samples were generated on calls to Allocator.postReset(), measuring the high-water mark of the Allocator's table size obtained through allocation requests via allocChunk().

Three data series are displayed: 'Prior', the original behaviour prior to the bug being identified; 'LargeAlloc', the patch identified as the short-term fix; and 'Patch', the loop-version patch (attachment #445146 [details] [diff] [review]).

The 2nd graph, labeled 'Doom - log', displays the same data as the first, except on a log scale and with a single outlier data point removed (this data point showed identical values across all three series).

In general, 'Patch' shows increased memory usage, but it does not seem excessive with this test case.
(In reply to comment #17)
> Post patch I'm seeing a ~18% slowdown in jsmicro/switch-1/2/3.as that is
> completely attributable to the newly introduced loop.

This sounds alarming, but I don't understand.  Is the JIT 18% slower (jsmicro/switch*.js are very small tests), or is the generated code?  I would not have expected the code to have changed at all, especially for the switch-*.js tests, since there aren't many long basic blocks.
The test result is showing an 18% reduction in performance post-patch (attachment #445146 [details] [diff] [review]).

Interestingly, a previously proposed patch (attachment #445025 [details] [diff] [review]), which builds a new checked table during the iteration in which 'notnull' is being set, does not have this hit.

This shows that, in this case, the cost of allocation (time-wise) is lower than that of iteration.

Possibly, then, it makes sense to use that patch as opposed to the one that clears the current checked table and does a second iteration.

So, does it make sense to place that patch up for review again?
Some tests are showing significant speed-ups while jsmicro/string-casechange-1.as appears to be the only significant loss.

Will post doom data on the patch next.
Component: Virtual Machine → JIT Compiler (NanoJIT)
QA Contact: vm → nanojit
Silly me: the Doom data of course shows no difference, as both patches effectively clear the array.
Attachment #445025 - Flags: feedback- → superreview?(edwsmith)
Comment on attachment 445025 [details] [diff] [review]
rebuild the nonnull table for each forward edge

re-posting original patch for review
Picking this up, as Rick is going on vacation.
Assignee: rreitmai → wmaddox
This implementation treats the 'checked' hashmap as a single sparse array allocated for the entire lifetime of the VarTracker instance, and does not remove elements from the hashmap or explicitly check for their presence.  Instead, expressions are mapped to a timestamp value, and an expression is considered to be a member of the checked expression set when the timestamp matches a globally-maintained current value.  In this way, the entire set may be cleared in constant time simply by incrementing the global timestamp.  Furthermore, the timestamps provide a limited versioning mechanism that allows us to perform an update-in-place while maintaining sufficient access to the original value of the set.

The patch includes a block of code guarded by an #ifdef DEBUG.  It is not my intention that this conditionalization remain, rather I wished to present two alternatives for comment:  In one case, I have preserved a pre-existing assertion at the cost of slightly more complex logic.  In the other, the logic is simpler, but the state necessary to implement the assertion is lost.
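A self-contained toy sketch of the timestamp idea (illustrative names only; this is an analog of the mechanism, not the actual patch):

    #include <cstddef>
    #include <cstdio>
    #include <cstring>

    // Toy set-of-pointers with a constant-time clear(): each bucket carries a
    // timestamp, and an entry only "exists" while its bucket's stamp matches
    // the set's current stamp.  clear() just bumps the current stamp.
    class StampedSet {
        enum { NBUCKETS = 64 };
        struct Bucket { const void* key; unsigned stamp; };
        Bucket buckets[NBUCKETS];
        unsigned now;

        std::size_t hash(const void* k) const {
            return (reinterpret_cast<std::size_t>(k) >> 3) % NBUCKETS;
        }
    public:
        StampedSet() : now(1) { std::memset(buckets, 0, sizeof(buckets)); }

        void put(const void* k) {
            Bucket& b = buckets[hash(k)];
            b.key = k;          // toy: no chaining, last writer in a bucket wins
            b.stamp = now;
        }
        bool contains(const void* k) const {
            const Bucket& b = buckets[hash(k)];
            return b.stamp == now && b.key == k;
        }
        void clear() {
            ++now;   // O(1): entries stamped with older values become invisible.
                     // On wraparound a real implementation would memset the buckets.
        }
    };

    int main() {
        StampedSet s;
        int x = 0;
        s.put(&x);
        std::printf("%d\n", (int)s.contains(&x));   // 1
        s.clear();
        std::printf("%d\n", (int)s.contains(&x));   // 0
        return 0;
    }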
Attachment #452415 - Flags: review?(edwsmith)
Revised patch sans egregious bug.  Removed conditionalization mentioned above.
Attachment #452415 - Attachment is obsolete: true
Attachment #452443 - Flags: review?(edwsmith)
Attachment #452415 - Flags: review?(edwsmith)
Attachment #445025 - Flags: superreview?(edwsmith)
Attachment #452443 - Flags: review?(edwsmith)
Retracted review request for patch above.  The patch addressed the memory consumption due to unreclaimed allocations in Rick's patch, but did not address the principal concern of this bug, which is the excessive load factor in the hashmap for very large methods due to the fixed number of buckets, effectively re-introducing the problem Rick was aiming to solve.
(In reply to comment #21)
> The test result is showing an 18% reduction in performance post-patch
> (attachment #445146 [details] [diff] [review]).
..
> This shows, in this case, that cost of allocation (time-wise) is lower than
> that of iteration.

I simply didn't believe that this trivial loop would add 18% to the runtime of the entire switch-1 benchmark.  In fact, experiments showed that runtime, not JIT-time, dominates the benchmark result.  Although the benchmark score scales roughly linearly with the iteration count in the benchmark, the percentage difference in the loop version versus the rebuild version remained similar.
This would seem to indicate that the difference is in the generated code, not in the overhead of compilation, though a diff of the -Dverbose=jit output is full of spurious differences due to differing code addresses.  Further investigation is warranted here.

Along a different tack, I noticed that the loop version could be easily modified to reallocate, so that the relevant variable is only the reallocation, not the second loop.  While working with this code, I also added a check for
non-null varTracker[i], as an assertion in the original was firing in debug
builds.

            // update our current set of known non-null LIns with those
            // being monitored by the varTracker
            #ifdef REALLOCATE
                checked = new (alloc) HashMap<LIns*,bool>(alloc, 4*nvar);
            #else
                checked->clear();
            #endif
            for (int i=0, n=nvar; i < n; i++) {
                if (varTracker[i] != NULL && notnull->get(i)) {
                    //AvmAssert(varTracker[i]);
                    markNonNull(varTracker[i]);
                }
            }

Note that the constructor for HashMap calls clear() anyhow, as the storage returned by the underlying allocator is not zeroed; therefore, the REALLOCATE case would appear to be performing more work.

In fact, the reallocating version of the loop patch gives the same result as the original rebuild patch, while the loop version continues to show the ~18% regression (in particular, the missing null check was not the cause of it).
That is, clearing the table by calling clear(), which simply zeroes out the buckets, appears more costly than allocating a fresh table, which also zeroes the buckets.  Then again, it seems clear enough that the cost of the clearing operation itself is not the cause of the regression.

Here's the kicker: This behavior is observed on my MacOS 10.6.3 system *only* when the shell is built using xcodebuild.  If I build with the cross-platform scripts, I don't see the difference.  Originally, I was using incremental builds, so I repeated the experiments with fresh builds for each configuration.
The shell binaries are as follows:

shell-loop:       clearing, built with xcodebuild
shell-realloc:    reallocating, built with xcodebuild
shell-loop-cp:    clearing, built with configure/make
shell-realloc-cp: reallocating, built with configure/make

These were the results obtained running these shells on jsmicro/switch-1.abc:

wmaddox-macpro:~ wmaddox$ tr-vt3/shell-loop tr-vt2/test/performance/jsmicro/switch-1.abc
name switch-1
metric v8 82

wmaddox-macpro:~ wmaddox$ tr-vt4/shell-realloc tr-vt2/test/performance/jsmicro/switch-1.abc
name switch-1
metric v8 100

wmaddox-macpro:~ wmaddox$ tr-vt3/shell-loop-cp tr-vt2/test/performance/jsmicro/switch-1.abc
name switch-1
metric v8 98

wmaddox-macpro:~ wmaddox$ tr-vt4/shell-realloc-cp tr-vt2/test/performance/jsmicro/switch-1.abc
name switch-1
metric v8 97

There was some variation in the scores, but the runs shown above were representative.  We observe a clear difference between shell-loop and shell-realloc, while shell-loop-cp and shell-realloc-cp show similar scores.  It is also noteworthy that the cross-platform builds both scored slightly less than shell-realloc -- a difference of a few percentage points was quite robust.

I re-ran these with a modified benchmark that cut the trip count by a factor of ten:

wmaddox-macpro:~ wmaddox$ tr-vt3/shell-loop tr-vt2/test/performance/jsmicro/switch-1.x10.abc
name switch-1
metric v8 808

wmaddox-macpro:~ wmaddox$ tr-vt4/shell-realloc tr-vt2/test/performance/jsmicro/switch-1.x10.abc
name switch-1
metric v8 992

wmaddox-macpro:~ wmaddox$ tr-vt3/shell-loop-cp tr-vt2/test/performance/jsmicro/switch-1.x10.abc
name switch-1
metric v8 958

wmaddox-macpro:~ wmaddox$ tr-vt4/shell-realloc-cp tr-vt2/test/performance/jsmicro/switch-1.x10.abc
name switch-1
metric v8 958

We see the expected order-of-magnitude increase in the score, but the difference between the clearing and reallocating cases, when built with xcodebuild only, persists.
Debug builds yield similar behavior for the clear and realloc cases, regardless of the build system used.  The anomaly appears to be specific to release builds under xcodebuild.
It turns out that it is sufficient to include a dummy allocation, which eliminates the regression.  Lines marked '###' work; the others don't.

            // update our current set of known non-null LIns with
            // those being monitored by the varTracker
            //(void)new (alloc) HashMap<LIns*,bool>(alloc, 4*nvar);  //###
            //(void)new (alloc) int[nvar];
            //(void)new (alloc) HashMap<LIns*,bool>(alloc, 1);
            (void)new (alloc) int[4*nvar];  //###
            checked->clear();
            //(void)new (alloc) HashMap<LIns*,bool>(alloc, 4*nvar);  //###
            //(void)new (alloc) int[nvar];
            //(void)new (alloc) HashMap<LIns*,bool>(alloc, 1);
            //(void)new (alloc) int[4*nvar];  //###
            for (int i=0, n=nvar; i < n; i++) {
                if (varTracker[i] != NULL && notnull->get(i)) {
                    //AvmAssert(varTracker[i]);
                    markNonNull(varTracker[i]);
                }
            }

This smacks of heap corruption, or of failure to initialize heap memory before use, as our custom allocator does not do so.  I modified it as follows (nanojit/Allocator.h):

        /** alloc memory, never return null. */
        void* alloc(size_t nbytes) {
            nbytes = (nbytes + 7) & ~7; // round up
            if (current_top + nbytes <= current_limit) {
                void *p = current_top;
                current_top += nbytes;
                VMPI_memset(p, 0, nbytes);
                return p;
            }
            void *p = allocSlow(nbytes);
            VMPI_memset(p, 0, nbytes);
            return p;
        }

This does not appear to have any effect.
I added a hack to CodegenLIR::emitMD() to capture the address of the generated code for the method of interest:

    // save pointer to generated code
    code = (GprMethodProc) frag->code();
    //### hack below for debugging
    {
        static int count = 0;
        uint8_t *codebytes = (uint8_t *)code;
        uint32_t codesize = CodeAlloc::size(assm->codeList);
        fprintf(stderr, "***method address %p bytes %d\n", codebytes, codesize);
        // Trap into debugger just after printing address of the 7th method
        // compiled.
        // This turns out to be the main loop of the switch-1.js benchmark.
        if (++count == 7) asm volatile("int3");
     }

This generates a debug break at an appropriate point to examine the generated code.
Other than incidental differences in code placement, the generated code is the same.

I am beginning to wonder if this is a pathological code placement issue.  The benchmark is long enough that I wouldn't expect to see this much difference based merely on alignment with respect to cache line boundaries, but perhaps there is a mapping conflict or a branch prediction issue.

A possible line of attack would be to modify nanojit:CodeAlloc to allocate the same block of memory in both cases.
Changing pagesPerAlloc from 16 to something big (16*256?) in CodeAlloc.cpp is a quick test.
(In reply to comment #35)
> Changing pagesPerAlloc from 16 to something big (16*256?) in CodeAlloc.cpp is a
> quick test.

With this change, the score is ~100 in all cases, i.e., there is no degradation without the dummy allocation.  The same result may also be achieved by modifying CodeAlloc::alloc() to avoid re-use of blocks and call allocCodeChunk() in all cases.  These results still smack of corruption or failure to initialize storage, so I continued to look for a way to reproduce the problem without perturbing the allocator.

I eventually found a scenario in which all of the JIT-ed methods were placed at the same locations, both with and without the performance degradation.  While adding diagnostic printfs to nanojit, I at various times perturbed the code in apparently relevant ways.  Shark showed that a large part of the execution time was due to AvmCore::stricteq, so I conjectured that its placement might be relevant.  I observed that the insertion of one diagnostic printf that was never executed nevertheless resulted in AvmCore::stricteq() landing at a very different address.  I added a dummy function in AvmCore.cpp just prior to AvmCore::stricteq(), attempting to fine-tune its position, but discovered that it was likely to land at a very different address with even a small dummy function -- apparently, the functions are not being laid out in the order in which they appear.  I then tried simply moving the function to a different source file, as this would be a more conventional way to get it to land somewhere else in the executable.  See the next comment for a patch and some benchmark results.  The upshot is that the originally-observed degradation is no longer observed, but there are various gains and losses in other benchmarks.  These are smaller, but still worrisome.  It appears that the microbenchmarks are rather sensitive to code placement, at least in this case.

If my hypothesis is correct, it should be possible to observe a significant difference in the icache miss rate between the two scenarios.  Shark supposedly allows access to the x86 performance counters, but all of the built-in configurations claimed to be incompatible with my hardware.  I have not used Shark prior to this exercise today, and claim no mastery of it.  I would appreciate the assistance of anyone who has successfully used the hardware performance counter monitoring in Shark.
The benchmarks compare the following with an unmodified TR:

1) TR with AvmCore::stricteq() moved from AvmCore.cpp to CodegenLIR.cpp
2) TR with Rick's patch 445146
3) TR with the patch and with stricteq() moved

We can observe variability in the scores due to the code placement change,
as well as mitigation of the effect of the patch on the switch-1 benchmark.
I note that in all of my experiments examining the performance of jsmicro/switch-1 while perturbing allocator behavior and code placement, the resulting score exhibited strongly bimodal characteristics.  It always toggled between 100 and 80, +/- 5 or less.  This suggests a single underlying cause.
Executables under test were built with cross-platform scripts (configure/make) on MacOS X (10.6.3).
The patches above may throw away information at branches, as there are cases where an expression is known to be non-null but is not the value of a variable.  We would like to move in the direction of preserving more information across block boundaries, e.g. bug 564580, but these patches move in the opposite direction.  Where we generally optimize at the superblock level, we are now restricting ourselves to basic blocks.

The large superblocks that are causing problems are not the usual case, so it is not clear that we should penalize the normal case on their behalf.  Rather than preemptively restricting the growth of the non-null table to avoid performance degradation with a fixed-sized table, we might simply allow the table to grow as needed to maintain a reasonable load factor.

The attached patch extends the HashMap class to grow dynamically.  When an initial maximum is exceeded, both the number of buckets and the allowable size are doubled, so as to maintain a set upper bound on the load factor.

A weakness of this strategy is that pathological cases are likely to result in storage exhaustion, whereas they would previously be more likely to manifest as an extreme slowdown, which might be seen as a more graceful degradation.  This could be fixed by setting an absolute upper limit on the table size, or by ramping up the allowable load factor as the size increases.
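A self-contained toy sketch of the grow-on-overflow idea (illustrative names; the real patch extends nanojit's HashMap and reuses its arena-allocated nodes):

    #include <cstddef>
    #include <vector>

    // Toy chained hash map from int keys to int values.  put() doubles the
    // bucket count whenever the entry count would exceed 75% of the buckets,
    // keeping the load factor bounded, as in the growth strategy described above.
    class GrowingMap {
        struct Node { int key; int val; Node* next; };
        std::vector<Node*> buckets;
        std::size_t count;

        std::size_t index(int k) const { return static_cast<std::size_t>(k) % buckets.size(); }

        void grow() {
            std::vector<Node*> old;
            old.swap(buckets);
            buckets.assign(old.size() * 2, static_cast<Node*>(0));
            // Re-insert existing nodes: the nodes themselves are reused,
            // only the bucket array is replaced.
            for (std::size_t i = 0; i < old.size(); ++i) {
                for (Node* n = old[i]; n != 0; ) {
                    Node* next = n->next;
                    std::size_t j = index(n->key);
                    n->next = buckets[j];
                    buckets[j] = n;
                    n = next;
                }
            }
        }
    public:
        explicit GrowingMap(std::size_t nbuckets)
            : buckets(nbuckets, static_cast<Node*>(0)), count(0) {}
        ~GrowingMap() {
            for (std::size_t i = 0; i < buckets.size(); ++i)
                for (Node* n = buckets[i]; n != 0; ) { Node* next = n->next; delete n; n = next; }
        }

        void put(int k, int v) {
            if (4 * (count + 1) > 3 * buckets.size())   // load factor would exceed 0.75
                grow();
            std::size_t i = index(k);
            for (Node* n = buckets[i]; n != 0; n = n->next)
                if (n->key == k) { n->val = v; return; }
            Node* n = new Node;
            n->key = k; n->val = v; n->next = buckets[i];
            buckets[i] = n;
            ++count;
        }
    };

    int main() {
        GrowingMap m(4);
        for (int i = 0; i < 100; ++i)
            m.put(i, i * i);   // triggers several doublings: 4 -> 8 -> ... -> 256 buckets
        return 0;
    }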
Attachment #455336 - Flags: review?(rreitmai)
Attachment #455336 - Flags: feedback?(edwsmith)
Comment on attachment 455231 [details] [diff] [review]
Patch to rebuild nonnull table for each forward edge ("loop version") rebased with bugfix

I've re-submitted Rick's original patch with a trivial bugfix, based on the analysis above indicating that the observed performance regressions are not due to an intrinsic fault in the patch.  I've also floated the idea of simply allowing the nonnull table to grow.
Attachment #455231 - Flags: review?(rreitmai)
Attachment #455231 - Flags: feedback?(edwsmith)
Attachment #455231 - Flags: review?(rreitmai) → review+
The rehash code looks ok, but is nrehash really supposed to default to -1?

Also, I like the suggestion in comment 42 about allowing the rehashes but providing an upper bound on how large the table grows before clearing it and repopulating it.
(In reply to comment #44)
> The rehash code looks ok, but is nrehash really supposed to default to -1?

size_t is always unsigned, thus size_t(-1) yields the largest possible size_t value.  There does not appear to be a SIZET_MAX or similar defined.
This is an idiom, but a bit obscure.  A comment or named constant is in order.
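For example (a standalone illustration of the idiom; the constant name is hypothetical):

    #include <cstddef>
    #include <cstdio>

    // size_t is unsigned, so converting -1 wraps around to the largest
    // representable value.  Naming it makes the "no limit" intent explicit.
    static const std::size_t kNoRehashLimit = static_cast<std::size_t>(-1);

    int main() {
        std::printf("%llu\n", (unsigned long long)kNoRehashLimit);  // maximum size_t value
        return 0;
    }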
Comment on attachment 455336 [details] [diff] [review]
Resize varTracker non-null table to maintain reasonable load factor

Forgot to + after reviewing.  Bill, which approach were you planning on taking?  And were you looking to address the issues highlighted in comment 42?
Attachment #455336 - Flags: review?(rreitmai) → review+
Comment on attachment 455231 [details] [diff] [review]
Patch to rebuild nonnull table for each forward edge ("loop version") rebased with bugfix

If I understand the approach, it means we will perform *less* null pointer elimination due to clearing the table along forward control edges, in the common case (reasonably-sized superblocks).

I don't know that benchmarks will reflect this result, since pointer checks are relatively fast (a well-predicted compare+branch).  Data on static code size, or on the static number of null pointer checks, would be helpful.

More below on the other patch.
Attachment #455231 - Flags: feedback?(edwsmith) → feedback-
Comment on attachment 455336 [details] [diff] [review]
Resize varTracker non-null table to maintain reasonable load factor

I like this better since it shouldn't regress the common case.

However, a hazard is that we're re-allocating from an arena-allocator, which can be pretty wasteful.  Two alternatives:

1. Use a hashmap that really does free when reallocating.  Soon, avmplus::Hashtable might be a candidate (stejohns is working on major cleanups to support non-Atom keys), but this adds a dependency.

2. Choose an initial hash table size based on a function of code length (or some linear function of code_length and nvar).  Maybe scale back the multiple for large values of code_length, e.g. 0.5*code_length for code lengths up to 1000, and 0.1*code_length plus nvar for larger values (see the sketch after this comment).

Using brightspot and -Dverifyonly, choose reasonable thresholds and multiples.  Then never resize the table.

Since we already consume transient memory that's a multiple (>>1, probably >10) of code_length, the worst-case scenario is that this table ends up being too big.  Since it will still be the minority of transient memory, that's okay.
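A sketch of that heuristic as a standalone function, using the thresholds and multipliers floated above (they are suggestions, not tuned values):

    #include <cstddef>
    #include <cstdio>

    // Illustrative only: pick an initial bucket count from code length and nvar,
    // scaling back the multiplier for large methods, per the suggestion above.
    static std::size_t estimateBuckets(std::size_t code_length, std::size_t nvar) {
        if (code_length <= 1000)
            return code_length / 2;          // 0.5 * code_length for small methods
        return code_length / 10 + nvar;      // 0.1 * code_length + nvar for large ones
    }

    int main() {
        std::printf("%lu %lu\n",
                    (unsigned long)estimateBuckets(800, 12),     // 400
                    (unsigned long)estimateBuckets(50000, 40));  // 5040
        return 0;
    }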
Attachment #455336 - Flags: feedback?(edwsmith) → feedback+
(In reply to comment #48)
> Comment on attachment 455336 [details] [diff] [review]

> 1. use a hashmap that really does free when reallocating.

The resizing already reuses the map entries themselves -- only the
buckets array is reallocated without freeing.  The intent is that resizing/rehashing would not occur often, so that the waste of storage in the average case would be much less than that due to preemptive overallocation as in alternative 2.  The loss could be mitigated by ramping up the allowable load factor aggressively rather than maintaining a constant one up to a hard limit on the buckets array size.  The 75% load factor I'm using now is way too slanted toward speed over space once we approach the limits of what we are willing to allocate in space.

Is it even feasible for us to free space in nanojit::HashMap?  This would require nanojit::Allocator to support free.  It is certainly possible for
an arena allocator to manage the arena as a heap during its lifetime, but
this is an added layer of complexity.
I don't think we should change the allocator; it's simple and fast as it is.  But it'd be easy to make it possible for HashMap to use different allocators.  For example, right now when we clear the hashmap, all the nodes are just leaked (in the arena); rehashing also leaks the old table.  To save memory, we could just use a dedicated Allocator instance for HashMap and make a whole new hashmap once in a while.

I.e., rehash without clearing (leak the bucket table), but when clearing, just free the Allocator too and make a new HashMap with a new Allocator.

Anyway, just looking for something simple and fast that doesn't leak too much.  Even if overestimating the table size and never rehashing uses more memory than growing, it's faster, and no big deal if the amount of memory we're talking about is a small % of code size (compare to LirBuffer, for example).  But either one probably works fine 99% of the time, and should degrade gracefully the other 1%.

In the pathological (large superblock) case, you're right: the nodes are probably much bigger than the leaked bucket table.  As long as rehashing preserves nodes, fine.  Estimating the table size based on code size (not just nvar) initially will probably avoid most rehashes.
In a -Dverifyonly profile of 76 random large SWFs with AS3, reducing the hashmap bucket size from 16700 to code_length reduced overall time in the JIT by 12% (8.6 sec before, 7.6 sec after).

The hottest function was _bzero (memset), called from clearState().
Blocks: 583955
Here's a dumb patch that uses code_length for the initial bucket size.  This yields a 12% jit-speed improvement on x86, and 20% on ARM (beagleboard).
Comment on attachment 462509 [details] [diff] [review]
use code_length as the bucket estimate instead of 16,700

On TestEncoder.swf from the test case, this patch reduces JIT time from 32 sec to 29.4 sec on a beagleboard.
I don't know how long the original code took (using nvar buckets), but I killed the test after 10 minutes.

Here are the nvar and code-len statistics for TestEncoder.swf

event avg     [min : max] count
nvar  6.04    [1   : 264] 2661 
  (2654 < 33)
  (6 < 65)
  (1 >= 65) 
codelen 581.4 [1 : 941708] 2661
  (952 < 10)
  (1418 < 100)
  (210 < 1000)
  (70 < 10000)
  (10 < 100000)
  (1 >= 100000) 

presumably the "bad" case is the one method whose length is 941708, but there's also several in the 10K-100K range, which is still pretty big.  

Luckily, the huge method is modStaticInit(), which apprently is alchemy's output for initializing the static data segment.  One would not expect that method to run more than once, or ever get hot.  so a slightly less dumb JIT strategoy (see bug 539094) should help.
Raising the priority on this because it's been identified as a startup-time hotspot and is desired to be backported to target Zephyr and maybe Salt; a simple bucket-size heuristic that doesn't horribly over- or under-estimate is all that's needed.  Potentially a <10-line, low-risk fix?
Priority: P2 → P1
Target Milestone: flash10.2 → flash10.1.x-Salt
"right-sizing" the table is giving me:  10% speedup on a nehalem mac pro, 13% speedup on core-2 duo, 20% speedup on beagleboard (cortex-a8), and >25% speedup on android; fairly consistently across differerent jit-speed and startup-time tests.
This uses min(code_length, 16700) as a more conservative estimate.  Thus, it will be faster than what's there now, but no bigger.  I would like to land this as an interim patch until a better heuristic/fix can be developed.
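A standalone sketch of that sizing rule (illustrative, not the literal patch):

    #include <cstdio>

    // Sketch of the v2 sizing: use the method's bytecode length as the bucket
    // estimate, but never more than the previous fixed table size of 16,700.
    static unsigned bucketEstimate(unsigned code_length) {
        return code_length < 16700 ? code_length : 16700;
    }

    int main() {
        // 581 is the average and 941708 the maximum code length seen in the
        // TestEncoder.swf statistics earlier in this bug.
        std::printf("%u %u\n", bucketEstimate(581), bucketEstimate(941708));  // 581 16700
        return 0;
    }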
Attachment #462509 - Attachment is obsolete: true
Attachment #463617 - Flags: review?(wmaddox)
Comment on attachment 463617 [details] [diff] [review]
(v2) use code_length as the bucket estimate instead of 16,700

This looks to be a win based on the test results, e.g., comment #54.
Go for it.  R+
Attachment #463617 - Flags: review?(wmaddox) → review+
Comment on attachment 463617 [details] [diff] [review]
(v2) use code_length as the bucket estimate instead of 16,700

Partial fix pushed to http://hg.mozilla.org/tamarin-redux/rev/a4778e824047
Attachment #463617 - Attachment is obsolete: true
This patch implements a constant-time hashmap clear, at the cost of one additional comparison per get/put/remove/containsKey operation.  Note that iteration remains O(nbuckets + size), where size is the number of entries in the HashMap.
Attachment #465838 - Flags: feedback?(edwsmith)
Comment on attachment 465838 [details] [diff] [review]
Clear HashMap in constant time

nice.

    // QUERY: Should we just include <stdint.h> and get UINT32_MAX ?

Since the field is declared uint32_t, its max value should be obvious; a comment is good enough and better than adding a new header dependency.

This doubles the size of the bucket array.  Do we care?  Probably not.  If we did, we could use two arrays, one with Seq<T>** pointers and the other with an 8- or 16-bit timestamp.  Bucket as currently defined will waste 32 bits per bucket: sizeof(Bucket) = 8+4+pad.

Doing a memset on wraparound seems sane.

The timestamp field, and clear(), need a "here's what's going on" comment.

Why did you inline find()?
Attachment #465838 - Flags: feedback?(edwsmith) → feedback+
(In reply to comment #63)
> Comment on attachment 465838 [details] [diff] [review]

> This doubles the size of the bucket array.  do we care?  probably not.  If we
> did, we could use two arrays, one with Seq<T>** pointers, and the other with an
> 8 or 16-bit timestamp.  Bucket as currently defined will waste 32bits per
> bucket sizeof(bucket) = 8+4+pad

I considered this, as the 32-bit timestamp is excessive.  I conjectured that it might be more important to keep the timestamp and the chain nearby, likely on the same cache line.  It's worth an experiment.  I want to get some numbers before floating this for review in any case.

> The timestamp field, and clear(), need a "here's whats going on" comment.

Will do.

> Why did you inline find()?

I initially modified find() so that it lazily zeroed out the bucket when the timestamp was out of date, which was necessary for the call from put().  This wasn't actually necessary for get(), however, which was content to rely on the timestamp.  Neither get() nor put() could use the result of the find() directly, but required a check on the nullness of its result.  In the end, it seemed cleaner to expand out two slightly-specialized versions of the handful of lines in find(), particularly when I realized that containsKey() could simply call get(), thus not losing the opportunity to share code in these nearly identical cases.
Ping...What of this needs to land in p4 for Salt?
//depot/main/FlashRuntime/Milestones/Argo_Athena_GMC/
(In reply to comment #65)
> Ping...What of this needs to land in p4 for Salt?
> //depot/main/FlashRuntime/Milestones/Argo_Athena_GMC/

This patch appears to be a pretty good fix, and should be landed:

http://hg.mozilla.org/tamarin-redux/rev/a4778e824047

It's not clear that 1x code_length is the optimum multiplier,
but it works well.  On a sampling of brightspot swfs, I found
that the subsequent constant-time clear patch was not a big win,
indicating that we're not losing much time clearing the buckets,
on average.
Echoing Bill: the above patch, which simply uses min(code_len, 16700) instead of 16700 as the table size, is the patch I previously recommended be backported to Salt.
Blocks: 593188
Bill, please reference the Perforce CL# here once you've checked it into //depot/main/FlashRuntime/Milestones/Argo_Athena_GMC/
(In reply to comment #68)
> Bill, please reference the Perforce CL# here once you've checked it into
> //depot/main/FlashRuntime/Milestones/Argo_Athena_GMC/

Perforce changelist CL# 717854
Submitted by rreitmai at 9/14/10 1:26 PM
Bill, this can be closed, yes?  It appears to have landed in both Salt and TR.
(In reply to comment #70)
> Bill, this can be closed, yes?  It appears to have landed in both Salt and TR.

I left the bug open because the present fix has not been tuned for an "optimum" compile-speed/memory-consumption tradeoff, and there are a couple of patches that could develop into a better solution.  This is fairly low priority, as the present patch looks good enough.  I could move this work item to a new bug, however, if it makes sense to close this one to track the closing of the WE bug.
Let's resolve this bug and create a follow-on, Future'd bug for the remaining work.  Marking as Resolved.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Blocks: 596858
Follow-up work items moved to bug 596858.