Fix for BTM-33 leads to another problem

classic Classic list List threaded Threaded
15 messages Options
Reply | Threaded
Open this post in threaded view
|

Fix for BTM-33 leads to another problem

Dennis Brakhane-2
Hi,

We are using BTM in Connection with Spring and iBATIS and PostgreSQL
8.2. Recently, our live application
ran out of connections and therefore grinded to a halt.

The exception we saw is the same as the one in BTM-33, so we replaced
BTM 1.3.2 with the snapshot version referenced therein.

However, this only seemed to have made matters worse, as we are now
getting the following exception:

 java.sql.SQLException: error requeueing a JdbcPooledConnection from
datasource personalities2 in state ACCESSIBLE wrapping org.postgresql.xa.PGXACo
nnection@1681426
        at bitronix.tm.resource.jdbc.JdbcPooledConnection.release(JdbcPooledCon
nection.java:148)
        at bitronix.tm.resource.jdbc.JdbcConnectionHandle.close(JdbcConnectionH
andle.java:77)
        at org.springframework.jdbc.datasource.DataSourceUtils.doReleaseConnect
ion(DataSourceUtils.java:313)
        at org.springframework.jdbc.datasource.TransactionAwareDataSourceProxy$
TransactionAwareInvocationHandler.invoke(TransactionAwareDataSourceProxy.java:2
00)
        at $Proxy4.close(Unknown Source)
        at com.ibatis.sqlmap.engine.transaction.external.ExternalTransaction.cl
ose(ExternalTransaction.java:82)
        at com.ibatis.sqlmap.engine.transaction.TransactionManager.end(Transact
ionManager.java:111)
        at com.ibatis.sqlmap.engine.impl.SqlMapExecutorDelegate.endTransaction(

SqlMapExecutorDelegate.java:782)
        at com.ibatis.sqlmap.engine.impl.SqlMapSessionImpl.endTransaction(SqlMa
pSessionImpl.java:176)
        at com.ibatis.sqlmap.engine.impl.SqlMapClientImpl.endTransaction(SqlMap
ClientImpl.java:154)
        at com.ibatis.sqlmap.engine.impl.SqlMapExecutorDelegate.autoEndTransact
ion(SqlMapExecutorDelegate.java:883)
        at com.ibatis.sqlmap.engine.impl.SqlMapExecutorDelegate.queryForObject(
SqlMapExecutorDelegate.java:573)
        at com.ibatis.sqlmap.engine.impl.SqlMapExecutorDelegate.queryForObject(
SqlMapExecutorDelegate.java:541)
        at com.ibatis.sqlmap.engine.impl.SqlMapSessionImpl.queryForObject(SqlMa
pSessionImpl.java:106)
        at com.ibatis.sqlmap.engine.impl.SqlMapClientImpl.queryForObject(SqlMap
ClientImpl.java:83)
        ... 131 more
Caused by: bitronix.tm.internal.BitronixSystemException: cannot close a resourc
e when its XAResource is taking part in an unfinished global transaction
        at bitronix.tm.resource.common.TransactionContextHelper.requeue(Transac
tionContextHelper.java:117)
        at bitronix.tm.resource.jdbc.JdbcPooledConnection.release(JdbcPooledCon
nection.java:146)

These exceptions appear much more often than the BTM-33 used to do,
and the connection pool is much sooner exhausted.


I don't know what information is needed to analyse this problem, so
for now, I just sketch the configuration,
let me know if you need more information.

The SQLMappers are configured using
org.springframework.orm.ibatis.SqlMapClientFactoryBean, the DataSource
is a bitronix.tm.resource.jdbc.PoolingDataSource, with
allowLocalTransactions set to true, and preparedstatementcachesize set
to 50 or so (we cannot deactive preparedstatementcaching as this
degrades performance too much)

The SqlMapConfig looks something like
 <settings
    useStatementNamespaces="true"
    lazyLoadingEnabled="false"
    cacheModelsEnabled="true"
  />
<sqlMap resource="..."/>


Transactions are used using Spring's @Transactional annotation, spring
is configured to use the JtaTransactionManager, where userTransaction
and transactionManager are both set to
"transactionManagerInterceptor", this is just a simple wrapper that
notifies listeners on begin() and rollback()/commit() calls, but it
always delegates to the BitronixTransactionManager

Any help is greatly appreciated,
  Dennis

---------------------------------------------------------------------
To unsubscribe from this list, please visit:

    http://xircles.codehaus.org/manage_email


Reply | Threaded
Open this post in threaded view
|

Re: Fix for BTM-33 leads to another problem

Ludovic Orban
Administrator
For some reason I expected more problems to come with that issue.

I need to narrow down the problem as much as possible to fix it. Can you reproduce the issue outside of your production environment ? How long does it take before you see those errors being logged ? How often do you see them ? What kind of load does it require ? Do you mix JDBC and JMS in those transactions or only JDBC ?

I doubt the statements cache has to do with this issue but I would like to close that path first. If there is any way you can run without the statements cache for some time that would help.

The allowLocalTransactions is a classic usual suspect. Why do you need it ? Can you do without it ?

Any information you can grasp that would help pinpointing what your application does (or doesn't do) to trigger this error will help. Can you figure out a single use case that can reproduce the error ?

In the end I will have to reproduce this error some way or another and dig the debug logs to figure out what is going wrong. The smallest you can make the debug logs file the faster I will be able to fix the issue. In the mean time, enable them and send me what you have. I'll have a first look which might help understanding what's going on.


Ludovic
Reply | Threaded
Open this post in threaded view
|

Re: Fix for BTM-33 leads to another problem

Dennis Brakhane-2
Hi.

Thanks for the prompt response. I will answer the questions I can
answer at this point, although I fear it's not much:

I do not have direct access to the application server and cannot
directly redeploy or reconfigure logging levels, but I will have  a
load test system up to be able to get more information soon.


On Mon, Mar 2, 2009 at 1:43 PM, Ludovic Orban <[hidden email]> wrote:
>
> I need to narrow down the problem as much as possible to fix it. Can you
> reproduce the issue outside of your production environment ? How long does
> it take before you see those errors being logged ? How often do you see them
> ? What kind of load does it require ? Do you mix JDBC and JMS in those
> transactions or only JDBC ?

We only use JDBC connections. I haven't looked at every logfile (we
have quite a few servers), but the exception happens as soon as 6
minutes after booting up.

> I doubt the statements cache has to do with this issue but I would like to
> close that path first. If there is any way you can run without the
> statements cache for some time that would help.

I can't do this on the live system, but I'll try on the load test system.

>
> The allowLocalTransactions is a classic usual suspect. Why do you need it ?
> Can you do without it ?

We "need" it because we have many SELECT statements and very few
INSERT/UPDATE-Statements. While most of the write-statements are
actually on the same database, BTM can do a 1PC. However, many
SELECT-statements are spanning several databases. Doing a global
transaction and 2PC when actually nothing changed seems to be
wasteful.

> Any information you can grasp that would help pinpointing what your
> application does (or doesn't do) to trigger this error will help. Can you
> figure out a single use case that can reproduce the error ?

I'm trying. I redid my testcase for BTM-28 (as it's the same
exception), but I wasn't able to reproduce it. So it seems unrelated.

> In the end I will have to reproduce this error some way or another and dig
> the debug logs to figure out what is going wrong. The smallest you can make
> the debug logs file the faster I will be able to fix the issue. In the mean
> time, enable them and send me what you have. I'll have a first look which
> might help understanding what's going on.

I cannot do that at the moment as the logs contain sensitive
information, besides, I cannot change the log levels on the live
system. But I will try to get some useful logs from the load test
system for you.

---------------------------------------------------------------------
To unsubscribe from this list, please visit:

    http://xircles.codehaus.org/manage_email


Reply | Threaded
Open this post in threaded view
|

Re: Fix for BTM-33 leads to another problem

Ludovic Orban
Administrator
Ok, I'm going to wait for your server logs.

About your allowLocalTransactions usage: I see your point but that's breaking design for performance. That's the kind of tradeoff you can expect in a project but I think if PostgreSQL was smarter with its handling of the Connection readOnly flag and XAResource READONLY hint then BTM would be able to optimize the transaction to avoid the disk sync.

Maybe you can hint the PostgreSQL guys in that direction.

Ludovic
Reply | Threaded
Open this post in threaded view
|

Re: Fix for BTM-33 leads to another problem

Dennis Brakhane-2
I've managed to reproduce the problem!

Skimming through debug logs taken on the live system lead me to
re-examine my testcase for BTM-28 (I already did last week). This
time, I was able to cause the problem without much problems!

As this is a mailing list, I won't attach it; it's basically the same
code as the one posted in BTM-28, only the Aspect part is missing (as
it isn't needed) and after initialising the PoolingDataSource, the
rest is run in an endless loop until an exception occurs (which
usually happens after a few seconds).

So far, I've been able to confirm that change r242 is the most likely culprit:

* With r242, the exception DOES occur
* With r241, it DOES NOT occur
* With HEAD, it DOES occur
* When executing the second query (main thread, see BTM-28) in a
global transaction, it DOES NOT occur
* Adding synchronized on every function in XAResourceManager does not
help, the exception still DOES occur

I tried the latter because XAResourceManager in r242 uses an
unsynchronized ArrayList, which is not thread-safe. So although I
believe XARM is buggy, I either wasn't able to "fix" it, or the
problem lies somewhere else.

Please let me know if you need further information, I can provide the
log files, of course, although they don't seem to be really useful,
they don't seem to differ in any interesting way between a run that
does work and one that doesn't.

In the meantime, I will try to see what modification causes the bug to
disappear.

---------------------------------------------------------------------
To unsubscribe from this list, please visit:

    http://xircles.codehaus.org/manage_email


Reply | Threaded
Open this post in threaded view
|

Re: Fix for BTM-33 leads to another problem

Dennis Brakhane-2
On Thu, Mar 12, 2009 at 11:56 AM, Dennis Brakhane
<[hidden email]> wrote:
> * Adding synchronized on every function in XAResourceManager does not
> help, the exception still DOES occur

Whoops. I meant AbstractXAResourceHolder, of course

---------------------------------------------------------------------
To unsubscribe from this list, please visit:

    http://xircles.codehaus.org/manage_email


Reply | Threaded
Open this post in threaded view
|

Re: Fix for BTM-33 leads to another problem

Ludovic Orban
Administrator
Hi,

That's great news !

Can you please post all the details you have about the issue ? How did you reproduce it ? Can you send me a sample app that I can run on my side ? Can you also post the debug logs you collected ?

Thanks,
Ludovic
Reply | Threaded
Open this post in threaded view
|

Re: Fix for BTM-33 leads to another problem

Dennis Brakhane-2
Sorry for the late reply, I was ill.

On Thu, Mar 12, 2009 at 8:55 PM, Ludovic Orban <[hidden email]> wrote:
> Can you please post all the details you have about the issue ? How did you
> reproduce it ? Can you send me a sample app that I can run on my side ? Can
> you also post the debug logs you collected ?

I will now post a new bug report with the program and logs of a sample run.

---------------------------------------------------------------------
To unsubscribe from this list, please visit:

    http://xircles.codehaus.org/manage_email


Reply | Threaded
Open this post in threaded view
|

Re: Fix for BTM-33 leads to another problem

Dennis Brakhane-2
> I will now post a new bug report with the program and logs of a sample run.

http://jira.codehaus.org/browse/BTM-43

---------------------------------------------------------------------
To unsubscribe from this list, please visit:

    http://xircles.codehaus.org/manage_email


Reply | Threaded
Open this post in threaded view
|

Re: Fix for BTM-33 leads to another problem

Ludovic Orban
Administrator
Dennis,

Thanks for your efforts on this excellent test tool even if it takes 15+ minutes to reproduce on my computer. That's probably because we're running different OS, JVM version on different hardware.

I think I spotted where the race condition triggers and I've prepared a snapshot build. The download link is in a comment in the JIRA issue. I've tested this build with your tool for 1 hour without any problem for now

Please let me know if that fixes the issue on your side as well.

Ludovic
Reply | Threaded
Open this post in threaded view
|

Re: Fix for BTM-33 leads to another problem

Dennis Brakhane-2
On Tue, Mar 17, 2009 at 10:18 PM, Ludovic Orban <[hidden email]> wrote:
> Thanks for your efforts on this excellent test tool [...]

:-)

> I think I spotted where the race condition triggers and I've prepared a
> snapshot build. The download link is in a comment in the JIRA issue. [...]

Wow. That was fast! Thanks!

> Please let me know if that fixes the issue on your side as well.

I've already noted this in the JIRA issue, but could you provide me
with the sources for the snapshot? It's because the "real" problem
only occurs on the live system, not on the test system, and getting
approval to deploy "unknown, binary only" changes might be difficult.

Greetings,
  Dennis

---------------------------------------------------------------------
To unsubscribe from this list, please visit:

    http://xircles.codehaus.org/manage_email


Reply | Threaded
Open this post in threaded view
|

Re: Fix for BTM-33 leads to another problem

Ludovic Orban
Administrator
The toughest part is to reproduce and understand the bug, the rest is easy.

Thanks to your test it took me less than an hour to exactly figure out what was wrong. Fixing the problem was a matter of minutes then.

I've committed the changes in the trunk, I was busy doing that when you commented the JIRA issue so next time please give me a few more minutes. ;-)
Reply | Threaded
Open this post in threaded view
|

Re: Fix for BTM-33 leads to another problem

Dennis Brakhane-2
On Tue, Mar 17, 2009 at 10:43 PM, Ludovic Orban <[hidden email]> wrote:
> I've committed the changes in the trunk, I was busy doing that when you
> commented the JIRA issue so next time please give me a few more minutes. ;-)

OK, I'll make an exception for you next time, but don't think I will
always be so lenient ;-)

(Seriously, I forgot that JIRA shows EST and not MET and therefore
thought the post was 6 hours ago.)

---------------------------------------------------------------------
To unsubscribe from this list, please visit:

    http://xircles.codehaus.org/manage_email


Reply | Threaded
Open this post in threaded view
|

Re: Fix for BTM-33 leads to another problem

Dennis Brakhane-2
In reply to this post by Ludovic Orban
BTW: I noticed AbstractXAResourceHolder still uses ArrayList in what
seems to me a not thread safe way, I can't see any synchronization.
Specifically, two parallel calls to setXAResourceHolderState might
cause havoc, AFAICT

---------------------------------------------------------------------
To unsubscribe from this list, please visit:

    http://xircles.codehaus.org/manage_email


Reply | Threaded
Open this post in threaded view
|

Re: Fix for BTM-33 leads to another problem

Ludovic Orban
Administrator
Good point, this was left pending because of a long overdue refactoring of the XAResourceHolder interface. I've now committed the refactoring + fix and re-uploaded the snapshot build of the BTM-43 issue.

Actually there is no need for this array to be made thread-safe when the resource has deferConnectionRelease set to true as the AbstractXAResourceHolder methods cannot be called concurrently then. But thread-safety definitely is needed when deferConnectionRelease is false so this was a bug.

Thanks for reporting it.