Details
Description
The support-core-plugin detected a deadlock:
==============Deadlock Found==============
"Executor #-1 for master : executing xxxx.xxxx@57a2067d" id=6472210 (0x62c212) state=WAITING cpu=0%
    - waiting on <0x270b04ac> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
    - locked <0x270b04ac> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
      owned by "Executor #-1 for master" id=6472207 (0x62c20f)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
    at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
    at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
    at hudson.model.Queue._withLock(Queue.java:1437)
    at hudson.model.ResourceController.execute(ResourceController.java:81)
    at hudson.model.Executor.run(Executor.java:428)
"Executor #-1 for master" id=6472207 (0x62c20f) state=BLOCKED cpu=0%
    - waiting to lock <0x195cc02c> (a hudson.model.queue.FutureImpl)
      owned by "Executor #-1 for master : executing xxxxxxxx #183" id=6218806 (0x5ee436)
    at hudson.model.queue.FutureImpl.addExecutor(0x5ee436)
    at hudson.model.queue.WorkUnit.setExecutor(WorkUnit.java:73)
    at hudson.model.Executor$1.call(Executor.java:359)
    at hudson.model.Executor$1.call(Executor.java:346)
    at hudson.model.Queue.withLock(Queue.java:1458)
    at hudson.model.Queue.withLock(Queue.java:1319)
    at hudson.model.Executor.run(Executor.java:346)
"Executor #1 for master : executing xxxxxxxxx #183" id=6218806 (0x5ee436) state=WAITING cpu=76%
    - waiting on <0x270b04ac> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
    - locked <0x270b04ac> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
      owned by "Executor #1 for master" id=6472207 (0x62c20f)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
    at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
    at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
    at hudson.model.Queue.cancel(Queue.java:732)
    at hudson.model.queue.FutureImpl.cancel(FutureImpl.java:82)
My thoughts (although turning this kind of race condition into a unit test is complicated...):
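Thread A is calling Queue._withLock and therefore acquires the lock held in the ReentrantLock instance field `lock` (https://github.com/jenkinsci/jenkins/blob/e065e79d9b19822593260f9db27d4e5b16939ef3/core/src/main/java/hudson/model/Queue.java#L1381)
Thread B is calling FutureImpl.cancel. This method has a synchronized block on the Queue instance (the same instance as above, since there is only one in Jenkins): https://github.com/jenkinsci/jenkins/blob/e065e79d9b19822593260f9db27d4e5b16939ef3/core/src/main/java/hudson/model/queue/FutureImpl.java#L74
Thread B holds the monitor of the Queue instance and calls the cancel method on Queue. That cancel method tries to acquire the lock from the instance field, but the lock is already held by thread A.
Thread A is trying to release the lock but cannot, because thread B holds the synchronized monitor on the Queue instance.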
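The interaction described above can be illustrated with a small stand-alone model. This is only a sketch under assumptions: `ModelQueue` and `threadBStalls` are hypothetical names, not Jenkins classes, and thread B uses a timed `tryLock` so that the demo terminates instead of hanging. Thread A only holds the explicit lock here; in the real deadlock it would additionally wait for a monitor held by thread B.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

final class LockOrderInversionDemo {
    // Hypothetical stand-in for hudson.model.Queue: it owns an explicit lock,
    // and other code may also synchronize on the object itself (its monitor).
    static final class ModelQueue {
        final ReentrantLock lock = new ReentrantLock();
    }

    /** Returns true if thread B could not obtain the explicit lock while holding the monitor. */
    static boolean threadBStalls() throws InterruptedException {
        ModelQueue queue = new ModelQueue();
        CountDownLatch aHoldsLock = new CountDownLatch(1);
        CountDownLatch bFinished = new CountDownLatch(1);
        boolean[] bStalled = {false};

        // Thread A: models Queue._withLock, which takes the explicit lock.
        Thread a = new Thread(() -> {
            queue.lock.lock();
            try {
                aHoldsLock.countDown();
                // Keep the lock until B has made its attempt.
                bFinished.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                queue.lock.unlock();
            }
        });

        // Thread B: models FutureImpl.cancel, which synchronizes on the queue
        // and then calls a queue method that needs the explicit lock.
        Thread b = new Thread(() -> {
            try {
                aHoldsLock.await();
                synchronized (queue) {
                    // Timed attempt so the demo cannot hang; a plain lock() would block here.
                    if (queue.lock.tryLock(200, TimeUnit.MILLISECONDS)) {
                        queue.lock.unlock();
                    } else {
                        bStalled[0] = true;
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                bFinished.countDown();
            }
        });

        a.start();
        b.start();
        a.join();
        b.join();
        return bStalled[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("thread B stalled while holding the monitor: " + threadBStalls());
    }
}
```

With an untimed `lock()` in thread B, and thread A needing anything guarded by a monitor that B holds, neither thread could proceed, which is the cycle the thread dump reports.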
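The solution seems to be to remove the synchronized block on the Queue instance here https://github.com/jenkinsci/jenkins/blob/e065e79d9b19822593260f9db27d4e5b16939ef3/core/src/main/java/hudson/model/queue/FutureImpl.java#L74 because a Lock is already used inside Queue.
This looks like a safe change (again, it is not easy to write a unit test that proves it).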
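A minimal sketch of the idea, assuming a simplified model rather than the real Jenkins classes: `ModelQueue`, `ModelFuture` and `bothThreadsFinish` are hypothetical names. The cancel path takes no monitor on the queue and relies only on the queue's own lock, so a thread in `withLock` and a thread in `cancel` contend for a single lock and cannot form a cycle through the queue.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

final class SingleLockCancelSketch {
    // Hypothetical stand-in for hudson.model.Queue, guarded only by its explicit lock.
    static final class ModelQueue {
        private final ReentrantLock lock = new ReentrantLock();
        private final Set<String> waiting = new HashSet<>();

        void withLock(Runnable action) {
            lock.lock();
            try {
                action.run();
            } finally {
                lock.unlock();
            }
        }

        void schedule(String task) {
            withLock(() -> waiting.add(task));
        }

        boolean cancel(String task) {
            lock.lock();
            try {
                return waiting.remove(task);
            } finally {
                lock.unlock();
            }
        }
    }

    // Hypothetical stand-in for FutureImpl: cancel() has no synchronized (queue) block.
    static final class ModelFuture {
        private final ModelQueue queue;
        private final String task;

        ModelFuture(ModelQueue queue, String task) {
            this.queue = queue;
            this.task = task;
        }

        boolean cancel() {
            return queue.cancel(task); // only the queue's own lock is involved
        }
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Returns true if both threads terminate and the task was cancelled. */
    static boolean bothThreadsFinish() throws InterruptedException {
        ModelQueue queue = new ModelQueue();
        queue.schedule("build-183");
        ModelFuture future = new ModelFuture(queue, "build-183");
        AtomicBoolean cancelled = new AtomicBoolean();

        Thread a = new Thread(() -> queue.withLock(() -> sleepQuietly(50)));
        Thread b = new Thread(() -> cancelled.set(future.cancel()));
        a.start();
        b.start();
        a.join(2000);
        b.join(2000);
        return !a.isAlive() && !b.isAlive() && cancelled.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("both threads finished: " + bothThreadsFinish());
    }
}
```

This model says nothing about other state in the real FutureImpl that might still need its own synchronization; it only shows why dropping the extra monitor on the queue removes the second lock from the cycle.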
Another solution would be for callers to stop using FutureImpl.cancel and call Queue.cancel instead.
PR: https://github.com/jenkinsci/jenkins/pull/5305
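This commit introduced a new strategy that uses a Lock https://github.com/jenkinsci/jenkins/commit/92147c3597308bc05e6448ccc41409fcc7c05fd7 but did not change the FutureImpl class so that it no longer synchronizes on the Queue instance.
A possible workaround is to use Queue.cancel(FutureImpl.task), so that the lock from the Queue is used.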