Improper HeuristicMixedException ?

James House

Hi,

I tested the following failure scenario (Bitronix 1.3.2):

* Connect to two different instances of PostgreSQL (using driver
8.3-604) with XA connections
* Insert records into a table in each database
* Call commit on UserTransaction
* Break-point was set in BitronixTransaction.commit() approx line 180 -
between prepare and commit
* While paused on break point (after prepare) I killed one of the databases
* Then I resumed execution
* The result was a HeuristicMixedException
* When I started the PostgreSQL instance back up, the commit recovered
(as expected), and I ended up with the record in both databases - as it
should be, since prepare succeeded

But why did my application get a HeuristicMixedException  ???

Per spec, this exception is defined as:
 >  thrown when a heuristic decision was made and some updates have been
committed and others were rolled back

But no rollback occurred.  Both resources voted commit on prepare; only
the second commit failed, due to the resource being down.  And in fact
the commit happened when the resource was started back up.


BTM threw this to my application:
===============================================
bitronix.tm.internal.BitronixHeuristicMixedException: transaction failed
during commit of a Bitronix Transaction with GTRID
[6D79544D546573740000011F4883413200000008], status=UNKNOWN, 2
resource(s) enlisted (started Thu Feb 05 15:16:15 MST 2009): resource(s)
[datasource/ds2] improperly unilaterally rolled back (or hazard happened)
    at bitronix.tm.twopc.Committer.throwException(Committer.java:104)
    at bitronix.tm.twopc.Committer.commit(Committer.java:63)
    at bitronix.tm.BitronixTransaction.commit(BitronixTransaction.java:183)
    at bitronix.tm.BitronixTransactionManager.commit(BitronixTransactionManager.java:96)
    at com.foo.TestTM.main(TestTM.java:116)


BTM logged this when the commit failed:
=====================================
SEVERE: resource datasource/ds2 failed on a Bitronix XID
[6D79544D546573740000011F4883413200000008 :
6D79544D546573740000011F488341350000000B]
bitronix.tm.internal.BitronixXAException: resource reported !invalid
error code (0)! when asked to commit transaction branch
    at bitronix.tm.twopc.Committer$CommitJob.handleXAException(Committer.java:173)
    at bitronix.tm.twopc.Committer$CommitJob.commitResource(Committer.java:156)
    at bitronix.tm.twopc.Committer$CommitJob.run(Committer.java:142)
    at bitronix.tm.twopc.executor.SyncExecutor.submit(SyncExecutor.java:12)
    at bitronix.tm.twopc.AbstractPhaseEngine.runJobsForPosition(AbstractPhaseEngine.java:108)
    at bitronix.tm.twopc.AbstractPhaseEngine.executePhase(AbstractPhaseEngine.java:70)
    at bitronix.tm.twopc.Committer.commit(Committer.java:59)
    at bitronix.tm.BitronixTransaction.commit(BitronixTransaction.java:183)
    at bitronix.tm.BitronixTransactionManager.commit(BitronixTransactionManager.java:96)
    at com.foo.TestTM.main(TestTM.java:116)
Caused by: javax.transaction.xa.XAException:
org.postgresql.util.PSQLException: An I/O error occured while sending to
the backend.
    at org.postgresql.xa.PGXAConnection.commitPrepared(PGXAConnection.java:444)
    at org.postgresql.xa.PGXAConnection.commit(PGXAConnection.java:371)
    at bitronix.tm.twopc.Committer$CommitJob.commitResource(Committer.java:153)
    ... 8 more

Is this a fault in Bitronix, PostgreSQL driver, or the marriage of the two?


james



---------------------------------------------------------------------
To unsubscribe from this list, please visit:

    http://xircles.codehaus.org/manage_email



Re: Improper HeuristicMixedException ?

Ludovic Orban
Administrator
Hi,

That's a very interesting test, and it raises a good question.

First I have to agree that the error reporting is not ideal in this situation even if integrity has been preserved.

Second, I noticed that PGSQL reports an invalid error code (bitronix.tm.internal.BitronixXAException: resource reported !invalid error code (0)! when asked to commit transaction branch) while it should have reported XAER_RMFAIL.
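
As a small illustration of how an error code of 0 can come about: in the JDK's javax.transaction.xa.XAException, the constructor that takes only a message leaves the public errorCode field at its default of 0, while the constructor that takes an int sets it. The sketch below is not taken from the PostgreSQL driver source; the class and method names are made up for the example, and whether the 8.3-604 driver actually builds its exception this way is an assumption.

```java
import java.io.IOException;
import javax.transaction.xa.XAException;

// Hypothetical demo class; not part of Bitronix or the PostgreSQL driver.
class XaErrorCodeDemo {
    // Wraps a low-level failure using only a message.
    // XAException(String) never assigns errorCode, so it stays 0.
    static XAException wrapWithMessageOnly(Exception cause) {
        return new XAException(cause.toString());
    }

    // Wraps the same failure with an explicit XA error code and keeps the cause.
    static XAException wrapWithRmFail(Exception cause) {
        XAException xaException = new XAException(XAException.XAER_RMFAIL);
        xaException.initCause(cause);
        return xaException;
    }

    public static void main(String[] args) {
        Exception io = new IOException("connection to backend lost");
        System.out.println("message-only errorCode: " + wrapWithMessageOnly(io).errorCode);
        System.out.println("XAER_RMFAIL errorCode:  " + wrapWithRmFail(io).errorCode);
    }
}
```

A transaction manager that decodes errorCode sees nothing meaningful in the first case, which would match the "invalid error code (0)" wording in the log above.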

I also checked the 2PC engine's code and I have to say that even with a proper error code reported by PGSQL the result would be the same.

Now I have to admit that you're reaching the limits of my knowledge as I don't know for sure what should be reported in such case.

Any opinion ?

Thanks,
Ludovic

Re: Improper HeuristicMixedException ?

James House

I believe that in this scenario the application should not receive any
exception whatsoever from its call to commit.

Once prepare has succeeded (all resources voting commit), the TM should
be able to guarantee that the commit worked (i.e. the fully prepared
state is recorded in the TM's journal file, and all resources have
persisted the changes on their side)  --  and thus the TM should return
cleanly to the application as if it did succeed - as it did in this
case, although delayed for the one resource, which had to be restarted
before commit could occur.

This is the behavior I've seen out of WebLogic and other TMs over the
years, and it coincides with my reading of the specs.

The problem with throwing an exception to the application is that the
application will report to the user that the operation failed, and they
will then do it again, but the data changes in fact were already made,
and thus, depending on the application, you may end up with duplicated
orders, or double addition of funds to accounts, or what have you.


james



Re: Improper HeuristicMixedException ?

James House

To be more explicit about the full behavior I believe should occur
during the second (commit) phase:

0- persist state in TM journal that all resources voted commit
1- iterate over all resources, for each
2- call commit()
3- catch and log any exceptions, 'remember' that commit failed
for this resource, and whether it was specifically a rollback exception
4- continue iteration
5- (iteration complete - all resources have had commit() called on them)
6- persist state in TM journal which resources could not be committed
(due to non rollback exceptions)
7- return cleanly - unless one of the resources gave a rollback
exception, in which case a HeuristicMixedException is in order

Then of course, background recovery will work on completing the tx for
those resources which failed during commit (excepting any that rolled-back).
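
The steps above can be sketched roughly as follows. This is only an illustration of the proposed flow, not BTM code: Branch, Outcome, fake() and the list-of-strings journal are hypothetical stand-ins, and treating XA_HEURRB and the XA_RB* range as the "rollback exceptions" of step 7 is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import javax.transaction.xa.XAException;

// Hypothetical sketch of the second (commit) phase; none of these types are BTM's.
class SecondPhaseSketch {
    /** Stand-in for an enlisted XA resource. */
    interface Branch {
        String name();
        void commit() throws XAException;
    }

    /** Test double: commits cleanly when failCode is null, else throws that XA error code. */
    static Branch fake(final String name, final Integer failCode) {
        return new Branch() {
            public String name() { return name; }
            public void commit() throws XAException {
                if (failCode != null) throw new XAException(failCode.intValue());
            }
        };
    }

    /** What the caller learns from the commit phase. */
    static final class Outcome {
        final List<String> committed = new ArrayList<String>();
        final List<String> pendingRecovery = new ArrayList<String>(); // commit failed, branch still prepared
        final List<String> rolledBack = new ArrayList<String>();      // resource reported a rollback
        // Step 7: only a reported rollback justifies a HeuristicMixedException.
        boolean heuristicMixed() { return !rolledBack.isEmpty(); }
    }

    // Assumption: these codes are what "a rollback exception" means in step 3.
    static boolean isRollbackCode(int code) {
        return code == XAException.XA_HEURRB
                || (code >= XAException.XA_RBBASE && code <= XAException.XA_RBEND);
    }

    static Outcome commitAll(List<Branch> branches, List<String> journal) {
        journal.add("COMMITTING");                 // step 0: all resources voted commit
        Outcome outcome = new Outcome();
        for (Branch branch : branches) {           // steps 1-5: every branch gets commit() called
            try {
                branch.commit();
                outcome.committed.add(branch.name());
            } catch (XAException e) {              // step 3: remember the failure and its kind
                if (isRollbackCode(e.errorCode)) {
                    outcome.rolledBack.add(branch.name());
                } else {
                    outcome.pendingRecovery.add(branch.name());
                }
            }
        }
        journal.add("UNCOMMITTED " + outcome.pendingRecovery); // step 6: left for recovery
        return outcome;
    }

    public static void main(String[] args) {
        List<Branch> branches = new ArrayList<Branch>();
        branches.add(fake("datasource/ds1", null));
        branches.add(fake("datasource/ds2", Integer.valueOf(0))); // the error code seen in the log
        List<String> journal = new ArrayList<String>();
        Outcome outcome = commitAll(branches, journal);
        System.out.println("pending recovery: " + outcome.pendingRecovery);
        System.out.println("heuristic mixed:  " + outcome.heuristicMixed());
    }
}
```

In the scenario from the first message, datasource/ds2 would land in pendingRecovery and the caller would return cleanly, leaving the branch to the background recoverer.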

james



Re: Improper HeuristicMixedException ?

Ludovic Orban
Administrator
BTM strictly follows the flow you described except for step 7.

I agree with you that in this case BTM throws a heuristic exception when it shouldn't, but I see one serious caveat with returning cleanly at step 7. While data integrity can be guaranteed by the TM in such a case (and as you noticed it actually is), I wonder about the execution of future transactions.


Let's say I modified BTM to return cleanly in such a situation. Your application would expect the data to be committed in the DB while it would actually remain in-doubt until the recoverer has run.

There is a time window between when the DB comes back after a crash and the background recoverer gets a chance to run. During this window your application would expect the data to be committed while it actually isn't.

Don't you think this could lead to serious trouble? And that's without mentioning all the other issues that could arise due to buggy/incorrect JDBC drivers, the background recoverer being disabled, and others I can't think of right now.

BTW, prepare and rollback error handling should probably be rethought too following the same logic.

Thanks for your insights.

Re: Improper HeuristicMixedException ?

James House

Regarding the time window between when the DB comes back after the crash,
and when the background recoverer runs and completes the TX:

If a high degree of integrity matters to your application then the
solution is the use of row-locking -- which you would be using anyway in
such a scenario (if you must be sure that you're not reading data that
another tx is in the middle of changing then you write the code to first
obtain appropriate locks, then read, then modify data, then commit).

If your application is working in such a way, then the row locks are
held while the tx is in-doubt (prepared but awaiting the commit).  So
when the DB starts back up, the rows are still locked, and other
application threads will block while waiting for the locks related to
the data that was affected by the tx.  Once the recoverer runs, the
commit occurs, and the locks are released.

I just double-checked this with a test and it all worked out fine -
locks were still in place as postgres fired back up, and my attempt to
alter the row blocked until the recoverer ran -- again the only problem
being that the application thought the call to commit() failed, when in
fact it succeeded.

Of course all of this means that you want the background recoverer to
run very often, so that you don't have to wait a long time for the
commit to occur, so I'd plan on using the smallest interval allowed in
the configuration - 1 minute.

It would be a bonus if there is an API call that can be made that
triggers the recovery to happen "now", rather than waiting for the
timeout.


james



Re: Improper HeuristicMixedException ?

Ludovic Orban
Administrator
I think you have a point, and you've convinced me to make the change, or at least to try it out.

I've only done a quick code scan to figure out what needs to be modified and I think that's pretty minimal so you could try to patch the code yourself:

In bitronix.tm.twopc.Committer.java:173, replace:

throw new BitronixXAException("resource reported " + Decoder.decodeXAExceptionErrorCode(xaException) + " when asked to commit transaction branch", XAException.XA_HEURHAZ, xaException);

with:

log.error("resource '" + failedResourceHolder.getUniqueName() + "' reported " + Decoder.decodeXAExceptionErrorCode(xaException) + " when asked to commit transaction branch", xaException);

and that should do the trick. Please try this out and let me know how things turned out. Please note that the same change should be applied to the Rollbacker class as well. I think the fact that Rollbacker is used both for phase 1 and 2 shouldn't make any difference.

A simple recovery trigger could also be implemented without too much hassle with a slight change to the XAPool.getConnectionHandle() method to flag the pool as dirty when a connection fails its test and run incremental recovery on it when a new connection can be created.
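
A minimal sketch of that dirty-flag idea, under stated assumptions: DirtyPoolSketch and its two callback methods are hypothetical names, not the actual XAPool API, and the recovery action is passed in as a plain Runnable.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration of "flag the pool as dirty, recover on the next good connection".
class DirtyPoolSketch {
    private final AtomicBoolean dirty = new AtomicBoolean(false);
    private final Runnable incrementalRecovery;

    DirtyPoolSketch(Runnable incrementalRecovery) {
        this.incrementalRecovery = incrementalRecovery;
    }

    // Called when a pooled connection fails its test: the resource probably went down.
    void onConnectionTestFailed() {
        dirty.set(true);
    }

    // Called when a new connection could be created: the resource is reachable again.
    // compareAndSet ensures recovery runs once per outage even with concurrent callers.
    void onConnectionCreated() {
        if (dirty.compareAndSet(true, false)) {
            incrementalRecovery.run();
        }
    }

    boolean isDirty() {
        return dirty.get();
    }

    public static void main(String[] args) {
        DirtyPoolSketch pool = new DirtyPoolSketch(new Runnable() {
            public void run() { System.out.println("running incremental recovery"); }
        });
        pool.onConnectionCreated();    // pool is clean, nothing happens
        pool.onConnectionTestFailed(); // database went away
        pool.onConnectionCreated();    // database is back, recovery is triggered
    }
}
```

This keeps recovery off the normal path: a pool that never saw a failed connection test never triggers it.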

I'll try this all out when time permits and register a JIRA issue to make sure all that gets included in the next release.

Thanks.

Re: Improper HeuristicMixedException ?

James House

Hi Ludovic,

I finally got a chance to try out your quick idea of a patch, and I
regret to report that it does not solve the problem.

It does avoid the problem of my application catching an exception, but
the end result is that the failed resource gets told to rollback the TX
(instead of commit) once recovery runs!

Here's the new message from the recovery process:


Feb 11, 2009 10:04:36 AM bitronix.tm.recovery.Recoverer run
INFO: recovery committed 0 dangling transaction(s) and rolled back 1
aborted transaction(s) on 2 resource(s) [datasource/ds1,
datasource/ds2], discarded 0 unrecoverable resource(s) []


For reference, here's the message from the recovery process without the
patch applied (commit occurs as appropriate):

Feb 11, 2009 10:08:30 AM bitronix.tm.recovery.Recoverer run
INFO: recovery committed 1 dangling transaction(s) and rolled back 0
aborted transaction(s) on 2 resource(s) [datasource/ds1,
datasource/ds2], discarded 0 unrecoverable resource(s) []



I haven't had time to look into the cause of this incorrect behavior,
but I'll let you know if I do before I hear from you.

Also, your idea of triggering recovery on the resource once a connection
can be successfully obtained sounds perfect.


james



Re: Improper HeuristicMixedException ?

Ludovic Orban
Administrator
Of course, that makes sense.

With that simplistic patch the commit phase does not report errors anymore to the calling code and outputs error logs instead. This means the calling code will just log in the journal that the transaction succeeded on all resources since no exception is thrown anymore. When recovery kicks in it finds an in-doubt transaction on the resource but no uncommitted entry in the journal so it presumes the branch has to be rolled back.
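
The decision described above can be sketched as a lookup: an in-doubt branch is committed only if the journal still lists its transaction as dangling, and is otherwise presumed aborted. This is a simplified illustration, not the Recoverer's actual code; RecoveryDecisionSketch, Action and the use of plain GTRID strings are assumptions made for the example.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the recovery decision for in-doubt branches.
class RecoveryDecisionSketch {
    enum Action { COMMIT, ROLLBACK }

    // inDoubtOnResource: GTRIDs the resource reports as prepared but unresolved.
    // danglingInJournal: GTRIDs the journal records as committing but not fully committed.
    static Map<String, Action> decide(Set<String> inDoubtOnResource, Set<String> danglingInJournal) {
        Map<String, Action> actions = new LinkedHashMap<String, Action>();
        for (String gtrid : inDoubtOnResource) {
            // No journal entry means the TM has no record of a commit decision still pending,
            // so the branch is presumed to belong to an aborted transaction.
            actions.put(gtrid, danglingInJournal.contains(gtrid) ? Action.COMMIT : Action.ROLLBACK);
        }
        return actions;
    }

    public static void main(String[] args) {
        String gtrid = "6D79544D546573740000011F4883413200000008";
        Set<String> inDoubt = Collections.singleton(gtrid);
        // Without the patch: the journal still lists the transaction as dangling.
        System.out.println(decide(inDoubt, Collections.singleton(gtrid)));
        // With the patch: the journal says everything committed, so nothing is dangling.
        System.out.println(decide(inDoubt, Collections.<String>emptySet()));
    }
}
```

The two calls in main correspond to the two recovery log messages quoted earlier in the thread: one dangling transaction committed without the patch, one aborted transaction rolled back with it.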

The proper fix will require more work: the Committer needs to be changed so it can report which resources committed and which didn't. The calling code then needs to log in the journal that only some resources committed, so that the recovery process knows which ones are dangling. I will need to review the 2nd phase implementation of the Rollbacker as well.

It will take me a bit of time to implement those changes, so please log a JIRA issue to make sure I won't forget all this.

Thanks for reporting this bug and your help in testing.