Skip to content

Comments

<fix>[expon]: fix vhost installPath overwrite and test cleanup#3360

Open
ZStack-Robot wants to merge 51 commits into5.5.6from
sync/yaohua.wu/ZSTAC-82153@@2
Open

<fix>[expon]: fix vhost installPath overwrite and test cleanup#3360
ZStack-Robot wants to merge 51 commits into5.5.6from
sync/yaohua.wu/ZSTAC-82153@@2

Conversation

@ZStack-Robot
Copy link
Collaborator

ZSTAC-82153

Changes

  • Fix vhost installPath being overwritten during volume migration
  • Make vhost bound USS simulator stateful for test stability
  • Move XinfiniPrimaryStorageCase from premium to community test
  • Add null host check in preReleaseVmResource
  • Fix CD-ROM null path handling

Test

  • ExponPrimaryStorageCase: 10/10 stability test passed
  • XinfiniPrimaryStorageCase: 10/10 stability test passed

Related: ZSTAC-82153

sync from gitlab !9194

…isk offering

Resolves: ZSTAC-74683

Change-Id: Id0339ed0221e92e506f60745cde972cc3ee6d9ae
When anti-split-brain check selects a disconnected MDS node, the HTTP
call now times out after 30s instead of 5+ minutes, and automatically
retries the next available MDS via tryNext mechanism.

Resolves: ZSTAC-80595

Change-Id: I1be80f1b70cad1606eb38d1f0078c8f2781e6941
When MN restarts during a destroy operation, the hypervisor may report
the VM as Stopped. Without this transition, the state machine throws
an exception and the VM stays stuck in Destroying state forever.

Resolves: ZSTAC-80620

Change-Id: I037edba70d145a44a88ce0d3573089182fedb162
…pacity

Resolves: ZSTAC-79709

Change-Id: I45a2133bbb8c51c25ae3549d59e588976192a08d
Resolves: ZSTAC-78989

Change-Id: I0fe3a56ab724978944c69afadaab7ff7353e4c0f
Resolves: ZSTAC-82153

Change-Id: Ib51c2e21553277416d1a9444be55aca2aa4b2fc4
…fterAddIpAddress to prevent NPE during rollback

Resolves: ZSTAC-81741

Change-Id: I53bcf20a10306afc7b6172da294d347b74e6c41f
Resolves: ZSTAC-81182

Change-Id: Id1bb642154dc66ae9995dcc4d9fc00cdce9bcaf8
@coderabbitai
Copy link

coderabbitai bot commented Feb 13, 2026

Walkthrough

此 PR 为两个外部主存储后端(Expon、Xinfini)添加方案配置与大量测试支持,调整若干存储、KVM、调度、核心与管理节点相关的健壮性与并发边界处理,并在多处增加空值/错误处理与日志改进(含若干接口行为细微更改)。

Changes

Cohort / File(s) Summary
Expon SDK/Controller
plugin/expon/src/main/java/org/zstack/expon/ExponStorageController.java, plugin/expon/src/main/java/org/zstack/expon/sdk/ExponClient.java, plugin/expon/src/main/java/org/zstack/expon/sdk/ExponConnectConfig.java
新增 ExponConnectConfig.scheme(默认 "https");ExponStorageController 根据 URL 填充 scheme 与默认端口;ExponClient 使用 config.scheme 构建 URL,替换硬编码 https。
Xinfini 引入与 SpringSpec 调整
testlib/src/main/java/org/zstack/testlib/SpringSpec.groovy, test/src/test/groovy/org/zstack/test/integration/storage/StorageTest.groovy, test/src/test/groovy/.../primary/.../XinfiniPrimaryStorageCase.groovy, testlib/src/main/java/org/zstack/testlib/ExternalPrimaryStorageSpec.groovy
新增 xinfini 包含点与大型 Xinfini 模拟器(XinfiniSimulators);扩充 ExternalPrimaryStorageSpec 加入 Expon/Xinfini 模拟集合;新增完整 XinfiniPrimaryStorageCase 集成测试类并在测试初始化中加入 xinfini。
Expon 测试与模拟器增强
test/src/test/groovy/org/zstack/test/integration/storage/primary/addon/expon/ExponPrimaryStorageCase.groovy, testlib/src/main/java/org/zstack/testlib/ExternalPrimaryStorageSpec.groovy
将 Expon 测试目标从 https/远程切换为 http/localhost;移除跳过 guard,改为顺序 VM 流程;增强 USS/vhost 模拟器行为与清理/断言逻辑。
KVM CBD 路径处理
plugin/cbd/src/main/java/org/zstack/cbd/kvm/KvmCbdNodeServer.java
调整路径转换:当卷协议非 CBD 或找不到节点服务时返回 null;仅在非空路径时设置字段,避免用 null 覆盖原值。
外部主存储工厂与本地解析健壮性
storage/src/main/java/org/zstack/storage/addon/primary/ExternalPrimaryStorageFactory.java, plugin/localstorage/src/main/java/org/zstack/storage/primary/local/LocalStorageAllocatorFactory.java, plugin/localstorage/src/main/java/org/zstack/storage/primary/local/LocalStorageKvmBackend.java
为 preReleaseVmResource 增加 null 目标主机保护;URI 解析器缺失时抛出清晰 OperationFailureException;修复错误处理中的 NPE 细节读取。
测试库模拟器扩展
testlib/src/main/java/org/zstack/testlib/ExternalPrimaryStorageSpec.groovy
新增 ExponSimulators 与 XinfiniSimulators 两大模拟器类,注册大量模拟路由以支持集成测试场景。
管理节点与资源目的地并发/一致性改动
core/src/main/java/org/zstack/core/cloudbus/ResourceDestinationMakerImpl.java, portal/src/main/java/org/zstack/portal/managementnode/ManagementNodeManagerImpl.java
ResourceDestinationMakerImpl 多个方法改为 synchronized 并在 iJoin 填充 NodeInfo;ManagementNodeManagerImpl 增加 lifecycleLock、两步消失确认与 suspectedMissingFromDb,以避免心跳/事件竞态。
调度、主机、VM 相关健壮性改动
compute/src/main/java/org/zstack/compute/allocator/HostAllocatorChain.java, compute/src/main/java/org/zstack/compute/allocator/HostAllocatorManagerImpl.java, compute/src/main/java/org/zstack/compute/host/HostBase.java, compute/src/main/java/org/zstack/compute/vm/VmCapabilitiesJudger.java, compute/src/main/java/org/zstack/compute/vm/VmNicManagerImpl.java
改进异常处理与日志(统一 errMsg);内存超额由抛异常改为 clamp 并警告;若干 null 守护(主机 self、psType、VmNic)以避免 NPE;保持原有功能但更稳健。
VM 同步与孤立 VM 跳过逻辑
plugin/kvm/src/main/java/org/zstack/kvm/KvmVmSyncPingTask.java
引入 orphanedSkipVms TTL 机制:在 MN 离线时将此前跳过的 VM 标记为孤立并在 TTL 内继续跳过,支持清理过期条目并在同步结果处理中保留一致性。
Ceph 与 ZBS 行为调整
plugin/ceph/src/main/java/.../CephPrimaryStorageBase.java, plugin/ceph/src/main/java/.../CephPrimaryStorageMonBase.java, plugin/zbs/src/main/java/org/zstack/storage/zbs/ZbsStorageController.java
减少直接对 null 的访问(mon UUID、防护 unknown);删除快照后使用带比率的容量释放方法;Zbs 使用 HttpCaller 支持 tryNext、并同步 MDS 状态到 DB。
安全与日志序列化、防抖配置
core/src/main/java/org/zstack/core/log/LogSafeGson.java, core/src/main/java/org/zstack/core/telemetry/SentryInitHelper.java, core/src/main/java/org/zstack/core/telemetry/TelemetryGlobalProperty.java, header/src/main/java/.../ExternalPrimaryStorageInventory.java
LogSafeGson 空值返回 JsonNull;Sentry 初始化增加 maxQueueSize、调试诊断 logger,并暴露 SENTRY_DEBUG 配置;ExternalPrimaryStorageInventory 对 config 中的 mdsUrls/mdsInfos 进行敏感信息脱敏(屏蔽 userinfo)。
API/权限/LDAP/查询错误类型调整
identity/src/main/java/org/zstack/identity/AuthorizationManager.java, plugin/ldap/src/main/java/org/zstack/ldap/LdapUtil.java, search/src/main/java/org/zstack/query/QueryFacadeImpl.java
增加 evaluated session null 检测并抛出 ApiMessageInterceptionException;LDAP 查询错误由 CloudRuntimeException 改为 OperationFailureException;API 查询异常路径改为 OperationFailureException。
存储网络与声明式变更
network/src/main/java/org/zstack/network/l3/L3BasicNetwork.java, header/src/main/java/org/zstack/header/vm/VmInstanceSpec.java, header/src/main/java/org/zstack/header/vm/VmInstanceState.java, header/src/main/java/org/zstack/header/vm/VmAbnormalLifeCycleStruct.java, conf/db/upgrade/V5.5.12__schema.sql
过滤 ReservedIpRange 的 free IP;root 磁盘大小回退逻辑改为 max(virtual, actual);Destroying 状态允许 stopped 转换;修复 null-unsafe equals;新增 GPU 规范 normalizedModelName 列与索引。
LoadBalancer 灰度升级保护
plugin/loadBalancer/src/main/java/org/zstack/network/service/lb/LoadBalancerApiInterceptor.java
添加针对 APIDeleteLoadBalancerMsg 的验证:若处于灰度升级且类型为 SLB,阻止删除并抛出拦截异常。
错误码与常量扩充
utils/src/main/java/org/zstack/utils/clouderrorcode/CloudOperationsErrorCode.java
新增若干静态错误码常量(例如 ORG_ZSTACK_PREMIUM_EXTERNALSERVICE_PROMETHEUS_10014),扩充错误码集合以支持更多服务标识。

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 我在代码树下嗅风声,
Expon 与 Xinfini 共鸣中,
Scheme 与路径更谨慎,
模拟器花开测试中,
跳过孤影亦有钟。 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 2.20% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed 标题遵循了规定的[scope]: 格式,长度为62个字符,未超过72字符限制。标题清晰地总结了主要变更:修复vhost installPath覆盖问题和测试清理。
Description check ✅ Passed 拉取请求描述详细说明了所有主要变更,包括修复vhost installPath覆盖、使USS模拟器有状态、移动XinfiniPrimaryStorageCase、添加空主机检查和修复CD-ROM路径处理等,与代码变更相关。

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch sync/yaohua.wu/ZSTAC-82153@@2

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ast-grep (0.41.0)
plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java
plugin/localstorage/src/main/java/org/zstack/storage/primary/local/LocalStorageKvmBackend.java

[]

utils/src/main/java/org/zstack/utils/clouderrorcode/CloudOperationsErrorCode.java

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
testlib/src/main/java/org/zstack/testlib/ExternalPrimaryStorageSpec.groovy (1)

23-24: ⚠️ Potential issue | 🟡 Minor

注释中包含中文文本,违反编码规范。

第 23 行的注释 下午2:28 包含中文,根据编码规范要求,代码中不应包含中文(包括注释)。

建议修复
-/**
- * `@author` Xingwei Yu
- * `@date` 2024/4/19 下午2:28
- */
+/**
+ * `@author` Xingwei Yu
+ * `@date` 2024/4/19 14:28
+ */

As per coding guidelines: **/*.*: 代码里不应当有有中文,包括报错、注释等都应当使用正确的、无拼写错误的英文来写

🧹 Nitpick comments (3)
testlib/src/main/java/org/zstack/testlib/ExternalPrimaryStorageSpec.groovy (1)

1010-1020: 建议合并重复的 null 检查逻辑。

/:flatten/:rollback 处理器中,vol == null 的检查逻辑可以提前,避免先修改状态再检查 null 的情况。

建议优化(以 /:flatten 为例)
 // 14. POST /afa/v1/bs-volumes/{id}/:flatten - FlattenVolumeRequest
 simulator("/afa/v1/bs-volumes/\\d+/:flatten") { HttpServletRequest req, HttpEntity<String> e, EnvSpec spec ->
     int volId = extractIdFromUri(req.getRequestURI())
     def vol = volumes.get(volId)
-    if (vol != null) {
-        vol.spec.flattened = true
-    }
     if (vol == null) {
         return makeNotFoundResponse()
     }
+    vol.spec.flattened = true
     return makeItemResponse(vol)
 }

Also applies to: 1023-1030

test/src/test/groovy/org/zstack/test/integration/storage/primary/addon/xinfini/XinfiniPrimaryStorageCase.groovy (1)

372-394: 考虑使用 retryInSecs 替代固定 sleep(1000)

当前使用 sleep(1000) 等待异步操作完成,这可能导致测试不稳定(在慢速环境下可能不够,在快速环境下又浪费时间)。建议参考第 396-399 行的 retryInSecs 模式。

建议优化示例
-        sleep(1000)
-        // vm in running, not deactivate volume
-        bdev = controller.apiHelper.queryBdcBdevByVolumeIdAndBdcId(volId, bdc.spec.id)
-        assert bdev != null
+        retryInSecs {
+            // vm in running, not deactivate volume
+            bdev = controller.apiHelper.queryBdcBdevByVolumeIdAndBdcId(volId, bdc.spec.id)
+            assert bdev != null
+        }
plugin/expon/src/main/java/org/zstack/expon/sdk/ExponConnectConfig.java (1)

8-46: Add validation and normalization to setScheme method.

The scheme field is passed directly to OkHttp's HttpUrl.Builder.scheme(), which only accepts "http" or "https". While OkHttp normalizes these case-insensitively, it is better to validate and normalize the input here to prevent potential issues and ensure consistency with RFC 3986. Also trim user input to handle accidental whitespace.

♻️ Recommended implementation (validation + normalization)
-import java.util.concurrent.TimeUnit;
+import java.util.Locale;
+import java.util.concurrent.TimeUnit;
@@
         public Builder setScheme(String scheme) {
-            config.scheme = scheme;
+            if (scheme == null || scheme.trim().isEmpty()) {
+                config.scheme = "https";
+                return this;
+            }
+            String normalized = scheme.trim().toLowerCase(Locale.ROOT);
+            if (!"http".equals(normalized) && !"https".equals(normalized)) {
+                throw new IllegalArgumentException("scheme must be http or https");
+            }
+            config.scheme = normalized;
             return this;
         }

gitlab and others added 20 commits February 16, 2026 00:50
<fix>[vm]: add Destroying->Stopped state transition

See merge request zstackio/zstack!9156
<fix>[i18n]: improve snapshot error message for unattached volume

See merge request zstackio/zstack!9192
<fix>[zbs]: reduce mds connect timeout and enable tryNext for volume clients

See merge request zstackio/zstack!9153
<fix>[vm]: use max of virtual and actual size for root disk allocation

See merge request zstackio/zstack!9155
…talling

In dual management node scenarios, concurrent modifications to the
consistent hash ring from heartbeat reconciliation and canonical event
callbacks can cause NodeHash/Nodes inconsistency, leading to message
routing failures and task timeouts.

Fix: (1) synchronized all ResourceDestinationMakerImpl methods to
ensure atomic nodeHash+nodes updates, (2) added lifecycleLock in
ManagementNodeManagerImpl to serialize heartbeat reconciliation with
event callbacks, (3) added two-round delayed confirmation before
removing nodes from hash ring to avoid race with NodeJoin events.

Resolves: ZSTAC-77711

Change-Id: I3d33d53595dd302784dff17417a5b25f2d0f3426
<fix>[network]: filter reserved IPs in getFreeIp

See merge request zstackio/zstack!9170
<fix>[ceph]: apply over-provisioning ratio when releasing snapshot capacity

See merge request zstackio/zstack!9162
<fix>[network]: add null check for L3 network system tags in API interceptor

See merge request zstackio/zstack!9169
The mdsUrls field in ExternalPrimaryStorage config contains
user:password@host format credentials. Add desensitization to
mask credentials as ***@host in API/CLI output.

Resolves: ZSTAC-80664

Change-Id: I94bdede5a1b52eb039de70efb5458693484405f7
Add ORG_ZSTACK_STORAGE_BACKUP_CANCEL_TIMEOUT constant to
CloudOperationsErrorCode for use in premium volumebackup module.

Resolves: ZSTAC-82195

Change-Id: Ibc405876e1171b637cf76b91a6822574fb6e7811
<fix>[core]: synchronize consistent hash ring to prevent dual-MN race condition

See merge request zstackio/zstack!9154
SyncTaskFuture constructor calls Context.current() unconditionally,
triggering ServiceLoader for ContextStorageProvider even when
telemetry is disabled. If sentry-opentelemetry-bootstrap jar is
on classpath, ServiceLoader fails with "not a subtype" due to
ClassLoader isolation in Tomcat, throwing ServiceConfigurationError
(extends Error) that escapes all catch(Exception) blocks.

1. Why is this change necessary?
MN startup crashes with ORG_ZSTACK_CORE_WORKFLOW_10001 because
Context.current() triggers ServiceLoader unconditionally in
SyncTaskFuture constructor, even when telemetry is disabled.

2. How does it address the problem?
Only call Context.current() when isTelemetryEnabled() returns
true, matching the existing guard pattern used in other
DispatchQueueImpl code paths (lines 351, 1069).

3. Are there any side effects?
None. When telemetry is disabled, parentContext was never used.

# Summary of changes (by module):
- core/thread: conditionalize Context.current() in SyncTaskFuture

Related: ZSTAC-82275
Change-Id: I5c0e1f15769c746c630028a29df8cf1815620608
<fix>[thread]: guard Context.current() with telemetry check

See merge request zstackio/zstack!9202
Resolves: ZSTAC-82195

Change-Id: I3d5e91d09d7c088d3c53e3839f8b32f4bce32dec
<fix>[loadBalancer]: block SLB deletion during grayscale upgrade

See merge request zstackio/zstack!9187
<fix>[volumebackup]: add backup cancel timeout error code

See merge request zstackio/zstack!9200
<fix>[core]: add @nologging to sensitive config fields

See merge request zstackio/zstack!9171
…plit-brain

When a management node departs, its VM skip-trace entries were
immediately removed. If VMs were still being started by kvmagent,
the next VM sync would falsely detect them as Stopped and trigger
HA, causing split-brain.

Fix: transfer departed MN skip-trace entries to an orphaned set with
10-minute TTL instead of immediate deletion. VMs in the orphaned set
remain skip-traced until the TTL expires or they are explicitly
continued, preventing false HA triggers during MN restart scenarios.

Resolves: ZSTAC-80821

Change-Id: I3222e260b2d7b33dc43aba0431ce59a788566b34
…anup

Resolves: ZSTAC-80821

Change-Id: I59284c4e69f5d2ee357b1836b7c243200e30949a
Resolves: ZSTAC-77544

Change-Id: I1f711bff9c1e87a8cbf6a2eb310ca6086f0f99ba
AlanJager and others added 16 commits February 18, 2026 20:29
Resolves: ZSTAC-80821

Change-Id: Ia9a9597feceb96b3e6e22259e2d0be7bde8ae499
…V bond check

add new error code constant for SR-IOV bond slave NIC count validation.

Resolves: ZSTAC-81163

Change-Id: Ie2a74411129a98c3c03a4a085e94a3bd45922da5
Resolves: ZSTAC-81354

Change-Id: Iff2131b3a878444fa27641f24dd727fe4fa176fb
- UpdateQueryImpl: guard val.getClass() NPE
- LogSafeGson: return JsonNull when input is null
- HostAllocatorChain: null check completion
- VmInstanceVO: use Objects.equals to avoid NPE
- SessionManagerImpl: guard null session
- VmCapabilitiesJudger: guard null PS type result
- CephPSMonBase: guard null self when evicted
- CephPSBase: guard null mon.getSelf()
- HostBase: guard null self when HostVO deleted
- ExternalPSFactory: guard null URI protocol
- LocalStorageBase: guard null errorCode.getCause()

Resolves: ZSTAC-69300, ZSTAC-69957, ZSTAC-71973,
 ZSTAC-81294, ZSTAC-70180, ZSTAC-70181,
 ZSTAC-78309, ZSTAC-78310, ZSTAC-70668,
 ZSTAC-71909, ZSTAC-80555, ZSTAC-81270,
 ZSTAC-70101, ZSTAC-72034, ZSTAC-73197,
 ZSTAC-79921, ZSTAC-81160, ZSTAC-81224,
 ZSTAC-81805, ZSTAC-72304, ZSTAC-81804,
 ZSTAC-74898, ZSTAC-69215, ZSTAC-70151,
 ZSTAC-68933

Change-Id: I910e9b542ecd254fdf7e956f943316988a56a1f9
- LdapUtil: CRE to OperationFailureException
- QueryFacadeImpl: CRE to OperationFailureException
- HostAllocatorManagerImpl: CRE to warn + clamp
- CloudOperationsErrorCode: add LDAP/PROMETHEUS codes

Resolves: ZSTAC-81334

Change-Id: Iab947b0476e9174d5a61baa095847b521b1f59fa
<fix>[pciDevice]: add error code ORG_ZSTACK_PCIDEVICE_10077

See merge request zstackio/zstack!9205
Add ORG_ZSTACK_AI_10134 for ModelCenter disconnected
check and ORG_ZSTACK_PCIDEVICE_10077 for SR-IOV bond
validation.

Resolves: ZSTAC-72783

Change-Id: I504a415a6e822513df955be600188ae88e2e1058
fix(ai): add error codes for AI and PCI modules [ZSTAC-72783]

See merge request zstackio/zstack!9215
<fix>[multi]: batch fix CRE quality issues

See merge request zstackio/zstack!9214
<fix>[volumebackup]: add backup cancel timeout error code

See merge request zstackio/zstack!9204
Companion DB migration for premium MR !13025. Adds normalizedModelName column and index to GpuDeviceSpecVO for GPU spec dedup by normalized model name.

Resolves: ZSTAC-75319

Change-Id: If15e615bcbda955cc1d6c58527bae27d4af4b497
<fix>[compute]: respect vm.migrationQuantity during host maintenance

See merge request zstackio/zstack!9209
…ttachedPSMountPath NPE

Resolves: ZSTAC-69300

Change-Id: I1b39a9e7b76751e8a4ef4cc53e9ac2028386e334
Resolves: ZSTAC-61988

Change-Id: Id3d5a48801fda21e2a51d96949c743bac254b2e6
<fix>[kvm]: configurable orphan skip timeout

See merge request zstackio/zstack!9203
Resolves: ZSTAC-61988

Change-Id: I0908fce97128904f9954198645290f4e5709252e
@ZStack-Robot
Copy link
Collaborator Author

Comment from yaohua.wu:

更新 Review — 本次 Review 替代以下之前的 Review

Review: MR !9194 — ZSTAC-82153

MR 范围: 23 文件变更,涵盖 CBD 路径覆盖修复、Expon/Xinfini scheme 可配置化、LongJob 挂起竞态修复、查询分页稳定性、配额空指针防护、VM 设备信息同步、DHCP/Userdata 空值防护、ApplianceVm 设备名修正、SQL 数据清理迁移、测试重构。

Jira 背景: ZSTAC-82153 — 重构 Expon/Xinfini Case 使其不依赖真实存储环境。MR 实际包含远超 Jira 描述的多项生产修复。

🟡 Warning

  1. [LongJobManagerImpl.java:710-721] LongJob 挂起态完成的 TOCTOU 窗口dbf.findByUuid() 读取状态后与 updateByUuid() 之间存在时间窗口,期间状态可能再次变化(如用户恰好执行 resume)。虽然窗口极小且是对 ZSTAC-82086 的务实修复,但建议考虑用 CAS 条件更新(UPDATE ... WHERE state='Suspended')替代先读后写,或在注释中明确记录此已知限制。

  2. [LongJobManagerImpl.java:~581] rerun 时移除 vo.setApiId(...) 的副作用 — 旧代码在 rerun 时将 apiId 更新为当前 API 上下文,移除后 rerun 的 LongJob 将保持原始 apiId。如果前端通过新 API 请求 ID 轮询进度,可能找不到对应记录。建议确认前端是通过 longjob UUID 还是 apiId 追踪进度,并在注释中说明移除原因。

  3. [MysqlQueryBuilderImpl3.java:738-742 & QueryFacadeImpl.java:493-497] 分页 tiebreaker 与 GROUP BY 交互 — 当同时指定 groupBylimit(无 sortBy)时,新增的默认 ORDER BY pk ASC 会被追加在 GROUP BY 之前(代码先拼 ORDER BY 再拼 GROUP BY),生成 ... ORDER BY pk ASC GROUP BY col 的 SQL。MySQL 虽能容忍此语法,但语义上 GROUP BY 应在 ORDER BY 之前。建议在 else if 分支中检查 msg.getGroupBy() == null 或将 ORDER BY 拼接移到 GROUP BY 之后。

  4. [VmInstanceBase.java:8476-8496] syncVmDevicesAddressInfo 为 fire-and-forget — 在 start/reboot/resume/newCreated 4 个关键路径中异步发送 SyncVmDeviceInfoMsg,但不参与 completion 链。如果 host 端处理失败,VM 状态已切为 running 但设备地址信息可能过时。当前仅 warn 日志,无重试机制。建议评估是否需要在失败时安排延迟重试,至少确保关键场景(如 GPU 直通 VM)的设备信息准确。

  5. [ExternalPrimaryStorageFactory.java:429-432] null host 保护范围是否充分 — 仅在 preReleaseVmResource 添加了 spec.getDestHost() == null 检查。建议确认同模块其他使用 spec.getDestHost() 的路径是否也需要类似保护。

🟢 Suggestion

  1. [ExponConnectConfig.java:8] scheme 字段缺少校验scheme 直接传递给 OkHttp 的 HttpUrl.Builder.scheme(),仅接受 "http""https"。建议在 setScheme() 中做 trim + toLowerCase + 白名单校验,防止无效输入导致运行时异常。

  2. [ExponPrimaryStorageCase.groovy] 死代码testCreateVmWhenSpecifiedSblk() 方法已从 test() 中移除调用(line 203),但方法体(约 30 行)仍保留,建议清理。

  3. [ApplianceVmFacadeImpl.java:429] defaultRouteNic 潜在 NPE — 如果 additionalNics 中无匹配 defaultL3Uuid 的 nic,defaultRouteNic 为 null,后续 defaultRouteNic.getDeviceId() 将 NPE。这是 pre-existing issue,但既然修改了此行,建议顺手加 null 检查。

  4. [V5.5.6__schema.sql] 迁移脚本放在 V5.5.6 但 target branch 是 5.5.12 — 确认此迁移文件的版本命名与实际发布版本对齐,避免升级路径中迁移被跳过。

  5. [QuotaUtil.java:157] 新增 isAdminAccount 检查的必要性 — 原有 isAdminPermission 与新增 isAdminAccount 的区别是什么?建议在注释中说明为什么需要双重检查,以免后续维护者困惑。

✅ 亮点

  • CBD 路径覆盖修复KvmCbdNodeServer.java):非 CBD 协议返回 null 并仅在非 null 时设值,逻辑清晰正确,完美解决了 vhost installPath 被意外覆盖的问题。
  • 查询分页 tiebreaker 设计合理,getPrimaryKeyFieldNameSafely 的 try-catch 降级策略稳健。
  • DhcpExtension 使用 Objects.equals() 替代 .equals() 和显式 null 检查,是防御性编程的好实践。
  • Quota NPE 防护 修复得当,先检查 pair 是否为 null 再 getValue(),避免动态注册 quota 时的边界场景。
  • 测试重构 将 Expon/Xinfini Case 从依赖真实存储转为模拟器驱动,USS simulator 的有状态设计提高了测试稳定性(10/10 通过)。

Upstream Freshness

  • Target: 5.5.12,upstream 有 20+ 新提交
  • 12 个文件同时在 upstream 和 MR 中被修改(含 KvmCbdNodeServer、ExponStorageController、QueryFacadeImpl 等核心文件)
  • GitLab 报告 merge_status: can_be_merged,MR 已于 2026-02-24 rebase
  • 🟡 建议合并前再次验证与最新 upstream 的兼容性

Verdict: APPROVED with warnings

核心修复(CBD 路径覆盖、LongJob 挂起态、分页稳定性、配额 NPE)方向正确,测试重构质量高。Warning 项建议逐条评估,尤其是 GROUP BY + ORDER BY 顺序问题和 LongJob TOCTOU 窗口。


🤖 Robot Reviewer

gitlab and others added 7 commits February 24, 2026 12:52
<fix>[multi]: batch guard NPE quality issues

See merge request zstackio/zstack!9213
<fix>[telemetry]: fix Sentry transaction loss and add debug logging

See merge request zstackio/zstack!9220
<fix>[gpu]: add normalizedModelName migration SQL

See merge request zstackio/zstack!9218
<fix>[zbs]: sync MDS node statuses to DB when reconnect fails

See merge request zstackio/zstack!9161
Add missing error codes ORG_ZSTACK_STORAGE_PRIMARY_10048
and ORG_ZSTACK_STORAGE_PRIMARY_10051 to all 10 language
files. Fix zh_CN mistranslations replaced with correct
term. Fix zh_TW garbled characters in error messages.

Resolves: ZSTAC-72656

Change-Id: I5f08109d1c415b751ec130285b9d92522f1e0a34
fix(ZSTAC-72656): improve i18n error messages for PS UUID conflicts

See merge request zstackio/zstack!9224
Fix CBD KvmCbdNodeServer overwriting vhost volume installPath
and add Expon/Xinfini storage API simulators for integration
test.

1. Why is this change necessary?
KvmCbdNodeServer.convertPathIfNeeded unconditionally returned
the original installPath for non-CBD protocols, causing
convertAndSetPathIfNeeded to overwrite the /var/run/vhost path
that KvmVhostNodeServer had already set. This made the
ExponPrimaryStorageCase test fail on installPath assertion.
Additionally, vol2 was not cleaned up in testClean, causing
a ConstraintViolationException during env.delete.

2. How does it address the problem?
- Changed convertPathIfNeeded to return null for non-CBD
  protocols so the setter is skipped
- Added vol2 cleanup in testClean to prevent FK violation
- Added Expon API simulators in ExternalPrimaryStorageSpec
- Updated ExponStorageController and SDK for test support
- Added ExternalPrimaryStorageFactory null host guard
- Added XinfiniPrimaryStorageCase integration test

3. Are there any side effects?
None. The CBD fix only affects non-CBD protocol volumes
which should not have been modified by CBD code.

# Summary of changes (by module):
- cbd: fix convertPathIfNeeded to skip non-CBD protocols
- expon: add scheme support for ExponStorageController
- expon/sdk: update ExponClient and ExponConnectConfig
- storage: add null host check in ExternalPSFactory
- test: fix vol2 cleanup and update test assertions
- test: add XinfiniPrimaryStorageCase
- testlib: add Expon/Xinfini API simulators

Related: ZSTAC-82153
Change-Id: I14f8d5c9155dccf2803566d133813ca5675feb76
@MatheMatrix MatheMatrix force-pushed the sync/yaohua.wu/ZSTAC-82153@@2 branch from da1858b to b217085 Compare February 25, 2026 09:40
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
core/src/main/java/org/zstack/core/cloudbus/ResourceDestinationMakerImpl.java (1)

79-93: ⚠️ Potential issue | 🔴 Critical

Bug: Map.put() 返回的是旧值(null),导致缓存未命中时 getNodeInfo 始终返回 null

Map.put(key, value) 的返回值是该 key 之前关联的旧值。由于进入此分支时 nodes 中不存在该 key,put 返回 null,因此 info 被赋为 null,第 92 行返回 null。调用方拿到 null 后极有可能触发 NPE。

🐛 修复:将 put 的结果与返回值分离
     public synchronized NodeInfo getNodeInfo(String nodeUuid) {
         NodeInfo info = nodes.get(nodeUuid);
         if (info == null) {
             ManagementNodeVO vo = dbf.findByUuid(nodeUuid, ManagementNodeVO.class);
             if (vo == null) {
                 throw new ManagementNodeNotFoundException(nodeUuid);
             }
 
             nodeHash.add(nodeUuid);
-            info = nodes.put(nodeUuid, new NodeInfo(vo));
+            info = new NodeInfo(vo);
+            nodes.put(nodeUuid, info);
         }
 
         return info;
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@core/src/main/java/org/zstack/core/cloudbus/ResourceDestinationMakerImpl.java`
around lines 79 - 93, The getNodeInfo method incorrectly assigns the return of
nodes.put(...) (which is the previous value, null on cache miss) to info; change
the flow in getNodeInfo so that after loading ManagementNodeVO via
dbf.findByUuid(...) and creating a new NodeInfo(vo) you store that new instance
into a local variable (e.g. newInfo), add nodeUuid to nodeHash, put newInfo into
the nodes map, and then return newInfo; keep the ManagementNodeNotFoundException
behavior and the synchronized signature intact (references: getNodeInfo,
NodeInfo, nodes, nodeHash, ManagementNodeVO, dbf.findByUuid).
compute/src/main/java/org/zstack/compute/vm/VmNicManagerImpl.java (1)

56-64: ⚠️ Potential issue | 🟡 Minor

UsedIpVO 更新发生在 null 检查之前,可能产生悬挂外键引用

afterAddIpAddress 中,第 58 行的 SQL 无条件地将 UsedIpVO.vmNicUuid 设为 vmNicUUid,而这一步在第 61 行的 null 检查之前执行。如果此时 VmNicVO 已被删除(即 nic == null),UsedIpVO 就会持有一个指向不存在的 NIC UUID 的外键,造成数据孤立。

相比之下,afterDelIpAddress 中 null 检查被正确地放在方法最开头(第 95 行),两者处理策略不一致。

建议将第 58 行的 SQL 更新移到 null 检查之后(或在 null 时同样跳过该更新):

🐛 建议修复顺序
 `@Override`
 public void afterAddIpAddress(String vmNicUUid, String usedIpUuid) {
-    /* update UsedIpVO */
-    SQL.New(UsedIpVO.class).eq(UsedIpVO_.uuid, usedIpUuid).set(UsedIpVO_.vmNicUuid, vmNicUUid).update();
-
     VmNicVO nic = Q.New(VmNicVO.class).eq(VmNicVO_.uuid, vmNicUUid).find();
     if (nic == null) {
         logger.debug(String.format("VmNic[uuid:%s] not found, skip afterAddIpAddress", vmNicUUid));
         return;
     }
+
+    /* update UsedIpVO */
+    SQL.New(UsedIpVO.class).eq(UsedIpVO_.uuid, usedIpUuid).set(UsedIpVO_.vmNicUuid, vmNicUUid).update();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/vm/VmNicManagerImpl.java` around
lines 56 - 64, The SQL update that sets UsedIpVO.vmNicUuid is executed before
verifying the VmNicVO exists, risking a dangling FK; in the afterAddIpAddress
method move the SQL.New(UsedIpVO.class)...update() call to after the VmNicVO nic
null-check (the Q.New(VmNicVO.class).eq(...).find() block) or guard it with an
early return so that when nic == null you skip updating UsedIpVO; update
references: afterAddIpAddress, UsedIpVO, VmNicVO, vmNicUUid, usedIpUuid.
header/src/main/java/org/zstack/header/vm/VmAbnormalLifeCycleStruct.java (2)

53-60: ⚠️ Potential issue | 🔴 Critical

VmRunningFromUnknownStateHostChanged 的匹配条件与其名称语义相反(疑似 copy-paste 错误)

VmRunningFromUnknownStateHostChanged(第 53–60 行)与 VmRunningFromUnknownStateHostNotChanged(第 45–52 行)的 match() 条件完全相同,均断言 currentHostUuid == originalHostUuid(宿主机未变更)。

由于 getVmAbnormalLifeCycleOperationFromStruct 使用 findFirst()VmRunningFromUnknownStateHostChanged 永远不会被返回:在宿主机未变更时 VmRunningFromUnknownStateHostNotChanged 先匹配;而在宿主机真正变更时,两者均不匹配,该枚举值成为死代码

根据命名语义,条件中的相等判断应取反:

🐛 建议修复
 VmRunningFromUnknownStateHostChanged {
     `@Override`
     boolean match(VmAbnormalLifeCycleStruct struct) {
         return struct.getOriginalState() == VmInstanceState.Unknown
                 && struct.getCurrentState() == VmInstanceState.Running
-                && struct.getCurrentHostUuid().equals(struct.getOriginalHostUuid());
+                && !Objects.equals(struct.getCurrentHostUuid(), struct.getOriginalHostUuid());
     }
 },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@header/src/main/java/org/zstack/header/vm/VmAbnormalLifeCycleStruct.java`
around lines 53 - 60, The enum value VmRunningFromUnknownStateHostChanged in
VmAbnormalLifeCycleStruct has the same match() predicate as
VmRunningFromUnknownStateHostNotChanged, causing it to never match; update
VmRunningFromUnknownStateHostChanged.match() to invert the host comparison so it
returns true when struct.getCurrentHostUuid() does NOT equal
struct.getOriginalHostUuid(), leaving other checks (original state == Unknown
and current state == Running) unchanged so
getVmAbnormalLifeCycleOperationFromStruct can correctly find the "host changed"
case.

42-42: ⚠️ Potential issue | 🟠 Major

此次 null 安全修复不完整——其余 10 处 .equals() 调用仍存在 NPE 风险

本 PR 仅修复了 VmStoppedFromPausedStateHostNotChanged(第 75 行),但以下 10 个 match() 方法仍在 currentHostUuid 可能为 null 时直接调用 .equals(),会抛出 NullPointerException

枚举常量 行号 危险调用
VmPausedFromUnknownStateHostNotChanged 42 struct.getCurrentHostUuid().equals(struct.getOriginalHostUuid())
VmRunningFromUnknownStateHostNotChanged 50 struct.getCurrentHostUuid().equals(struct.getOriginalHostUuid())
VmRunningFromUnknownStateHostChanged 58 struct.getCurrentHostUuid().equals(struct.getOriginalHostUuid())
VmStoppedFromUnknownStateHostNotChanged 67 struct.getCurrentHostUuid().equals(struct.getVmLastHostUuid())
VmPausedFromRunningStateHostNotChanged 107 struct.getCurrentHostUuid().equals(struct.getOriginalHostUuid())
VmRunningFromPausedStateHostNotChanged 115 struct.getCurrentHostUuid().equals(struct.getOriginalHostUuid())
VmPausedFromMigratingStateHostNotChanged 123 struct.getCurrentHostUuid().equals(struct.getOriginalHostUuid())
VmCrashedFromRunningStateHostNotChanged 131 struct.getCurrentHostUuid().equals(struct.getOriginalHostUuid())
VmRunningFromCrashedStateHostNotChanged 139 struct.getCurrentHostUuid().equals(struct.getOriginalHostUuid())
VmStoppedFromCrashedStateHostNotChanged 155 struct.getCurrentHostUuid().equals(struct.getOriginalHostUuid())

请将上述所有调用统一替换为 Objects.equals(),与已修复的位置保持一致。

🛡️ 建议修复(以 VmPausedFromRunningStateHostNotChanged 为例,其余同理)
-                        && struct.getCurrentHostUuid().equals(struct.getOriginalHostUuid());
+                        && Objects.equals(struct.getCurrentHostUuid(), struct.getOriginalHostUuid());

Also applies to: 50-50, 58-58, 67-67, 107-107, 115-115, 123-123, 131-131, 139-139, 155-155

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@header/src/main/java/org/zstack/header/vm/VmAbnormalLifeCycleStruct.java` at
line 42, The fix is incomplete: multiple enum constants' match() implementations
call struct.getCurrentHostUuid().equals(...) and can NPE when currentHostUuid is
null; update each listed enum constant (VmPausedFromUnknownStateHostNotChanged,
VmRunningFromUnknownStateHostNotChanged, VmRunningFromUnknownStateHostChanged,
VmStoppedFromUnknownStateHostNotChanged, VmPausedFromRunningStateHostNotChanged,
VmRunningFromPausedStateHostNotChanged,
VmPausedFromMigratingStateHostNotChanged,
VmCrashedFromRunningStateHostNotChanged,
VmRunningFromCrashedStateHostNotChanged,
VmStoppedFromCrashedStateHostNotChanged) to use
Objects.equals(struct.getCurrentHostUuid(), struct.getOriginalHostUuid()) or
Objects.equals(struct.getCurrentHostUuid(), struct.getVmLastHostUuid()) as
appropriate (same pattern used in the already-fixed
VmStoppedFromPausedStateHostNotChanged); import java.util.Objects if not
present.
identity/src/main/java/org/zstack/identity/AuthorizationManager.java (1)

112-122: ⚠️ Potential issue | 🟡 Minor

第 114 行 DebugUtils.Assert 抛出的异常类型与新增检查不一致

DebugUtils.Assert(session != null, ...)session == null 时会抛出 RuntimeException,而第 119–122 行新增的检查抛出的是 ApiMessageInterceptionException。虽然这两个代码路径都涉及 session 为空的验证,但异常类型不同,会导致上层调用者的错误处理逻辑不一致。

根据 ZStack 的设计规范,API 拦截层的参数验证应当统一使用 ApiMessageInterceptionException 以保持一致性。建议将第 114 行的 DebugUtils.Assert 替换为相同的 ApiMessageInterceptionException 抛出,或者在 Assert 失败时捕获 RuntimeException 并转换为 ApiMessageInterceptionException,确保两个路径使用统一的异常类型。

此外,第 66 行和第 120 行均复用了错误码 ORG_ZSTACK_IDENTITY_10012,但两处的语义略有不同(前者是原始 session 为 null,后者是 renewSession 返回 null),可以考虑使用不同的错误码加以区分,便于问题诊断。

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@identity/src/main/java/org/zstack/identity/AuthorizationManager.java` around
lines 112 - 122, Replace the DebugUtils.Assert in AuthorizationManager where
IdentityByPassCheck.NoSessionEvaluation is handled so that when session == null
it throws an ApiMessageInterceptionException (using the same err(...) helper)
instead of a RuntimeException; locate the block referencing
IdentityByPassCheck.NoSessionEvaluation and msg.getSession(), and change the
assertion to throw new
ApiMessageInterceptionException(err(ORG_ZSTACK_IDENTITY_10012,
IdentityErrors.INVALID_SESSION, "NoSessionEvaluation set but no session set in
the message: %s", CloudBusGson.toJson(msg))) or alternatively catch the
RuntimeException from DebugUtils.Assert and rethrow an
ApiMessageInterceptionException to ensure both the NoSessionEvaluation path and
the evaluateSession path use the same exception type (and consider using
distinct error codes if you want to differentiate original-null vs renewed-null
cases).
🧹 Nitpick comments (11)
core/src/main/java/org/zstack/core/log/LogSafeGson.java (1)

211-215: LGTM — null 防护正确

返回 JsonNull.INSTANCE 是 Gson 表示 JSON null 的惯用方式,可防止集合元素为 null 时(L191、L202)触发 NPE,修复合理。

不过,相邻的 toJson(Object o)(L218)存在相同的隐患——若 onullo.getClass() 仍会抛出 NPE。目前其唯一调用点 desensitize(L246)已在外层做了 null 检查,暂无风险,但建议保持一致性,同步加上 null 防护:

♻️ 建议同步修复 toJson
 public static String toJson(Object o) {
+    if (o == null) {
+        return logSafeGson.toJson(JsonNull.INSTANCE);
+    }
     return logSafeGson.toJson(o, getGsonType(o.getClass()));
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/src/main/java/org/zstack/core/log/LogSafeGson.java` around lines 211 -
215, The toJson(Object o) method in LogSafeGson can NPE when o is null because
it calls o.getClass(); add a null guard at the start of
LogSafeGson.toJson(Object o) (similar to the existing toJsonTree(Object o)
check) and return the appropriate JSON null representation (e.g. the JSON string
"null" or use logSafeGson.toJson(JsonNull.INSTANCE)) instead of calling
getGsonType(o.getClass()); ensure calls like desensitize continue to work
unchanged.
core/src/main/java/org/zstack/core/db/UpdateQueryImpl.java (1)

57-58: LGTM — NPE 修复正确。

valnull 时,null instanceof Collection 在 Java 中返回 false,因此异常分支确实可达;此次修改通过三元表达式安全地避免了 val.getClass() 的 NPE,逻辑正确。

可选优化:非 null 分支中 val.getClass() 会经 Class.toString() 产生 class fully.qualified.Name 这种带 class 前缀的冗余格式,使用 val.getClass().getName() 可使输出更简洁。这是已有行为而非此次引入的问题,可按需改动。

💡 可选改进(非阻塞)
-            throw new CloudRuntimeException(String.format("for operation IN or NOT IN, a Collection value is expected, but %s got",
-                    val == null ? "null" : val.getClass()));
+            throw new CloudRuntimeException(String.format("for operation IN or NOT IN, a Collection value is expected, but %s got",
+                    val == null ? "null" : val.getClass().getName()));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/src/main/java/org/zstack/core/db/UpdateQueryImpl.java` around lines 57 -
58, The exception message in UpdateQueryImpl's throw currently uses
val.getClass() which prints the redundant "class ..." prefix; change it to use
val.getClass().getName() so the message shows the simple fully-qualified class
name (and keep the existing null-safe ternary check that avoids NPE when val is
null).
search/src/main/java/org/zstack/query/QueryFacadeImpl.java (1)

381-381: 建议保留原始异常上下文,避免排障信息丢失

Line 381 当前只透传 e.getMessage(),如果上游未记录日志会丢失堆栈信息,定位线上查询失败会比较困难。建议在此处先记录带异常对象的日志,再抛出 OperationFailureException

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@search/src/main/java/org/zstack/query/QueryFacadeImpl.java` at line 381, In
QueryFacadeImpl replace the current throw of OperationFailureException that only
uses e.getMessage() by first logging the failure with the exception object
(e.g., logger.error("failed to query", e)) and then rethrowing an
OperationFailureException that preserves the original cause (attach e as the
cause or include the exception in the constructor) so the stacktrace is retained
for debugging; update the throw site where OperationFailureException is created
to include the original exception.
core/src/main/java/org/zstack/core/cloudbus/ResourceDestinationMakerImpl.java (1)

80-93: synchronized 块内执行数据库查询可能导致锁竞争。

getNodeInfo 在持有 this 监视器锁的同时调用 dbf.findByUuid,这会阻塞所有其他同步方法(包括 makeDestinationisManagedByUs 等高频路径)直到 DB 查询完成。如果未来调用频率增加,可以考虑先在锁外查询 DB,再在锁内做 double-check 后写入缓存。当前管理节点数量有限,实际影响不大。

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@core/src/main/java/org/zstack/core/cloudbus/ResourceDestinationMakerImpl.java`
around lines 80 - 93, getNodeInfo currently holds the synchronized lock while
calling dbf.findByUuid, causing unnecessary lock contention; change it to first
check nodes.get(nodeUuid) and if missing call dbf.findByUuid outside the
synchronized block, then enter synchronized(this) and double-check
nodes.get(nodeUuid) again; if still null validate vo (throw
ManagementNodeNotFoundException if vo is null), create the new NodeInfo from
ManagementNodeVO, add nodeUuid to nodeHash and put the new NodeInfo into nodes
(use the newly created NodeInfo as the value to insert and return it), otherwise
return the cached NodeInfo—refer to getNodeInfo, nodes, nodeHash,
dbf.findByUuid, ManagementNodeVO, NodeInfo, and ManagementNodeNotFoundException.
compute/src/main/java/org/zstack/compute/allocator/HostAllocatorChain.java (1)

149-149: t != null 检查是无效的死代码

catch (Throwable t) 块中,JVM 规范保证捕获的异常引用 t 永远不会为 null,因此三元表达式的 "unknown error" 分支永远不会被执行,这会误导阅读代码的人。

♻️ 建议修复
-            String errMsg = t != null ? t.toString() : "unknown error";
+            String errMsg = t.toString();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/allocator/HostAllocatorChain.java`
at line 149, In HostAllocatorChain's catch (Throwable t) block replace the
redundant ternary that checks "t != null ? t.toString() : 'unknown error'" with
a direct use of the caught throwable (e.g., t.toString() or t.getMessage()) to
form errMsg; remove the dead null branch so the assignment to errMsg uses the
guaranteed-non-null t (reference: variable errMsg and the catch (Throwable t) in
HostAllocatorChain).
core/src/main/java/org/zstack/core/telemetry/SentryInitHelper.java (1)

58-58: 建议将 1000 提取为具名常量,避免魔法值。

这有助于表达意图,并降低后续调整成本。

♻️ 建议重构
 public final class SentryInitHelper {
     private static final CLogger logger = Utils.getLogger(SentryInitHelper.class);
     private static final Object INIT_LOCK = new Object();
+    private static final int DEFAULT_SENTRY_MAX_QUEUE_SIZE = 1000;
@@
-                    options.setMaxQueueSize(1000);
+                    options.setMaxQueueSize(DEFAULT_SENTRY_MAX_QUEUE_SIZE);

As per coding guidelines: “避免使用魔法值(Magic Value):直接使用未经定义的数值或字符串应替换为常量或枚举。”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/src/main/java/org/zstack/core/telemetry/SentryInitHelper.java` at line
58, Replace the magic literal 1000 used in options.setMaxQueueSize(1000) with a
well-named constant (e.g., MAX_QUEUE_SIZE) declared near the top of
SentryInitHelper (static final int MAX_QUEUE_SIZE = 1000) and then use
options.setMaxQueueSize(MAX_QUEUE_SIZE) so the intent is clear and the value is
easy to adjust.
plugin/kvm/src/main/java/org/zstack/kvm/KvmVmSyncPingTask.java (3)

513-525: getOrphanTtlMs() 在循环内被重复调用,建议提至循环外

nodeLeft() 中同样的问题:getOrphanTtlMs()(第 520 行)在 for 循环内每次迭代均读取全局配置,应在循环前缓存一次以避免冗余调用。

♻️ 建议修改
-        long now = System.currentTimeMillis();
-        for (Map.Entry<String, Long> entry : orphanedSkipVms.entrySet()) {
-            if (now - entry.getValue() >= getOrphanTtlMs()) {
+        long now = System.currentTimeMillis();
+        long ttlMs = getOrphanTtlMs();
+        for (Map.Entry<String, Long> entry : orphanedSkipVms.entrySet()) {
+            if (now - entry.getValue() >= ttlMs) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugin/kvm/src/main/java/org/zstack/kvm/KvmVmSyncPingTask.java` around lines
513 - 525, The loop in cleanupExpiredOrphanedSkipVms() repeatedly calls
getOrphanTtlMs(), causing redundant global config reads; cache the TTL once
before the for-loop (e.g., long ttl = getOrphanTtlMs()) and use ttl inside the
loop when comparing (referencing orphanedSkipVms, getOrphanTtlMs(), and
cleanupExpiredOrphanedSkipVms()) so you avoid repeated calls as done similarly
in nodeLeft().

469-477: getOrphanTtlMs() 调用提至循环外;消除魔法值 60000

  1. getOrphanTtlMs()for 循环内(第 475 行)每次迭代均读取全局配置,应在循环开始前缓存一次。
  2. 60000(毫秒/分钟)属于魔法值,应提取为具名常量(同一问题也出现在 isVmDoNotNeedToTrace 第 504-505 行),违反编码规范中"避免使用魔法值"的要求。
♻️ 建议修改

在类中增加常量(或与 504 行共用):

+    private static final long MILLIS_PER_MINUTE = 60_000L;

循环中修改:

-        long now = System.currentTimeMillis();
-        for (String vmUuid : skippedVms) {
-            orphanedSkipVms.put(vmUuid, now);
-            logger.info(String.format("moved VM[uuid:%s] from departed MN[uuid:%s] skip list to orphaned set" +
-                    " (will expire in %d minutes)", vmUuid, inv.getUuid(), getOrphanTtlMs() / 60000));
-        }
+        long now = System.currentTimeMillis();
+        long ttlMinutes = getOrphanTtlMs() / MILLIS_PER_MINUTE;
+        for (String vmUuid : skippedVms) {
+            orphanedSkipVms.put(vmUuid, now);
+            logger.info(String.format("moved VM[uuid:%s] from departed MN[uuid:%s] skip list to orphaned set" +
+                    " (will expire in %d minutes)", vmUuid, inv.getUuid(), ttlMinutes));
+        }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugin/kvm/src/main/java/org/zstack/kvm/KvmVmSyncPingTask.java` around lines
469 - 477, Cache the result of getOrphanTtlMs() before iterating skippedVms and
compute the minutes value once to avoid repeated global config reads, and
replace the hardcoded 60000 with a named constant (e.g., MS_PER_MINUTE) defined
in KvmVmSyncPingTask (or a shared constant used by isVmDoNotNeedToTrace) so the
logger uses the cached minutes value; apply the same replacement of the magic
60000 in isVmDoNotNeedToTrace (lines ~504-505) to use the named constant.

501-506: 日志文案 "expired after %d minutes" 语义不准确

此处 getOrphanTtlMs() / 60000 得到的是配置的 TTL 时长,而非实际已过去的时间。日志文字 "expired after %d minutes" 容易被误读为"已经过去了多少分钟",建议改为描述 TTL 本身的措辞,例如 "TTL of %d minutes"。同时,此处的 60000 也应使用上方建议的具名常量替代。

♻️ 建议修改
-                logger.info(String.format("orphaned skip entry for VM[uuid:%s] expired after %d minutes, resuming trace",
-                        vmUuid, getOrphanTtlMs() / 60000));
+                logger.info(String.format("orphaned skip entry for VM[uuid:%s] expired (TTL %d minutes), resuming trace",
+                        vmUuid, getOrphanTtlMs() / MILLIS_PER_MINUTE));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugin/kvm/src/main/java/org/zstack/kvm/KvmVmSyncPingTask.java` around lines
501 - 506, The log message in KvmVmSyncPingTask uses getOrphanTtlMs() / 60000
and "expired after %d minutes" which is misleading; update the logger.info call
that references vmUuid and getOrphanTtlMs() so it reports the configured TTL
(e.g., "TTL of %d minutes") rather than elapsed time, and replace the magic
number 60000 with a named constant (e.g., MILLIS_PER_MINUTE) defined near the
class or file; keep the orphanedSkipVms.remove(vmUuid, orphanedAt) behavior
unchanged.
testlib/src/main/java/org/zstack/testlib/ExternalPrimaryStorageSpec.groovy (1)

298-301: 建议在 registerSimulators() 开头调用 clear() 重置静态状态。

当前状态容器是 static;若同一 JVM 连续跑多个用例,历史数据可能串入后续用例,影响稳定性和可复现性。

♻️ 建议修改
         `@Override`
         void registerSimulators(EnvSpec espec) {
+            clear()
             def simulator = { arg1, arg2 ->
                 espec.simulator(arg1, arg2)
             }
         `@Override`
         void registerSimulators(EnvSpec espec) {
+            clear()
             def simulator = { arg1, arg2 ->
                 espec.simulator(arg1, arg2)
             }

Also applies to: 902-905

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@testlib/src/main/java/org/zstack/testlib/ExternalPrimaryStorageSpec.groovy`
around lines 298 - 301, 在 ExternalPrimaryStorageSpec 的 registerSimulators()
方法开头调用静态状态容器的 clear() 方法以重置全局/静态状态,防止同 JVM 中多次用例运行时历史数据污染后续用例;定位到
registerSimulators(...) 并在其第一行添加对相应静态容器的 clear() 调用(并同样在另一个相同模式的位置,即文件中约 902-905
行对应的 registerSimulators 实现处也做相同修改)。
portal/src/main/java/org/zstack/portal/managementnode/ManagementNodeManagerImpl.java (1)

893-896: 异常分支建议补齐堆栈日志并设置明确回退主机名。

Line 893-896 目前只记录文本日志且 hostName 可能保持 null。建议在 catch 分支带上异常对象,并写入可识别回退值,避免下游监听器处理空值时出现隐性分支。

可参考的最小改动
-                                try {
-                                    inv.setHostName(destinationMaker.getNodeInfo(nodeUuid).getNodeIP());
-                                } catch (Exception e) {
-                                    logger.warn(String.format("cannot get node info for node[uuid:%s], use empty hostname", nodeUuid));
-                                }
+                                try {
+                                    if (destinationMaker.getNodeInfo(nodeUuid) != null) {
+                                        inv.setHostName(destinationMaker.getNodeInfo(nodeUuid).getNodeIP());
+                                    } else {
+                                        inv.setHostName("unknown");
+                                    }
+                                } catch (Exception e) {
+                                    inv.setHostName("unknown");
+                                    logger.warn(String.format("cannot get node info for node[uuid:%s], use fallback hostname", nodeUuid), e);
+                                }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@portal/src/main/java/org/zstack/portal/managementnode/ManagementNodeManagerImpl.java`
around lines 893 - 896, In ManagementNodeManagerImpl inside the catch for
destinationMaker.getNodeInfo(nodeUuid), update the exception handling so the
logger.warn call includes the caught Exception object (e) and set a clear
fallback hostname via inv.setHostName (e.g., "unknown-host" or ""), so hostName
is never left null; reference the existing inv.setHostName,
destinationMaker.getNodeInfo, and logger.warn identifiers when making this
change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@compute/src/main/java/org/zstack/compute/allocator/HostAllocatorChain.java`:
- Around line 150-156: The code in HostAllocatorChain handling errors silently
swallows Throwables when isDryRun is true and dryRunCompletion==null (or when
completion==null), causing callers to hang; change the logic in the error path
to mirror done(): remove the null guards and always call
dryRunCompletion.fail(...) or completion.fail(...) (or if you prefer to keep
guards, add an error-level log and an assertion/fail-fast when the completion is
null before returning) so that Throwable is not silently dropped; update the
error branch around isDryRun/dryRunCompletion/completion to ensure a fail(...)
is invoked (or a clear error+assert) using the same identifiers used in the
diff.

In `@compute/src/main/java/org/zstack/compute/host/HostBase.java`:
- Line 1445: The deduplicate key currently returns a fixed
"connect-host-unknown" when self is null, causing different hosts to share the
same key; change the key generation in HostBase to prefer msg.getHostUuid() as
the fallback (i.e., use self.getUuid() if self != null, else msg.getHostUuid()),
and only fall back to a literal "unknown" if both are null so each host's
connect-host-{uuid} key remains unique.

In `@core/src/main/java/org/zstack/core/telemetry/SentryInitHelper.java`:
- Around line 67-69: 在 SentryInitHelper 中不要把 Sentry 的 ERROR 级别降为 WARN;在处理 Sentry
日志级别的映射(在 switch 分支里处理 case ERROR/ WARNING/ default 的代码块,例如当前通过
logger.warn("[Sentry] " + formatted) 处理 ERROR 和 WARNING 的位置)将 ERROR 映射为
logger.error("[Sentry] " + formatted),将 WARNING 映射为 logger.warn("[Sentry] " +
formatted),保持 default(或 INFO)映射为 logger.info("[Sentry] " +
formatted);同样修正文件中另一个相同映射块(也就是你提到的 76-79 附近的 switch 分支)以保持一致。

In `@network/src/main/java/org/zstack/network/l3/L3BasicNetwork.java`:
- Around line 1081-1082: The removal logic for freeIpInventorys calls
NetworkUtils.isInRange against every reservedIpRanges entry regardless of IP
family, causing exceptions in dual-stack scenarios; update the predicate used in
freeIpInventorys.removeIf (the code that iterates freeIp ->
reservedIpRanges.stream().anyMatch(...)) to first compare IP versions (e.g.,
derive family from freeIp.getIp() and the reserved range start/end or use an
existing IP version field) and only call NetworkUtils.isInRange when the
versions match, ensuring compatibility for both IPv4 and IPv6 and preventing
APIGetFreeIp runtime errors.
- Around line 1079-1085: 当前实现是先按 limit 聚合各网段的 freeIpInventorys 再一次性过滤
reservedIpRanges(使用 NetworkUtils.isInRange),导致可能在达到 limit 后才删除保留段,返回的数量低于
msg.getLimit();修改方法:在遍历各网段并向 freeIpInventorys 添加每段结果的地方(相关代码使用 freeIpInventorys
集合和添加逻辑),在将单段 IP 加入集合前先用 reservedIpRanges.stream().anyMatch(...) 或等价逻辑对每个 freeIp
进行过滤(使用 NetworkUtils.isInRange 和 ReservedIpRangeVO 的
getStartIp/getEndIp),然后再判断是否已达到 msg.getLimit() 并决定是否 break;确保最后调用
reply.setInventories(freeIpInventorys) 的集合已经是过滤后的结果。

In
`@plugin/localstorage/src/main/java/org/zstack/storage/primary/local/LocalStorageAllocatorFactory.java`:
- Around line 388-393: The code in LocalStorageAllocatorFactory reuses error
code ORG_ZSTACK_STORAGE_PRIMARY_LOCAL_10023 for both URI parse failures and
unsupported protocol cases; update the unsupported-protocol branch (where parser
== null and the OperationFailureException is thrown using argerr) to use a new
distinct error code (e.g., ORG_ZSTACK_STORAGE_PRIMARY_LOCAL_10024) and add that
new error code constant to CloudOperationsErrorCode, keeping the existing
message text and leaving the parseUri/URISyntaxException path using
ORG_ZSTACK_STORAGE_PRIMARY_LOCAL_10023 so callers can distinguish malformed-URI
vs unsupported-protocol errors.

In `@plugin/zbs/src/main/java/org/zstack/storage/zbs/ZbsStorageController.java`:
- Around line 1365-1381: syncMdsStatuses can NPE when newAddonInfo.getMdsInfos()
or addonInfo.getMdsInfos() is null; add null/empty guards before iterating: in
ZbsStorageController.syncMdsStatuses check newAddonInfo != null &&
newAddonInfo.getMdsInfos() != null && !newAddonInfo.getMdsInfos().isEmpty() and
also ensure addonInfo != null && addonInfo.getMdsInfos() != null (or treat
missing lists as Collections.emptyList()) before the nested loops that reference
getMdsInfos(); keep the existing existMds.getAddr().equals(newMds.getAddr())
logic and still perform the SQL.New(...).set(...).update() call only after safe
iteration to avoid a secondary NPE masking the original error.

In
`@portal/src/main/java/org/zstack/portal/managementnode/ManagementNodeManagerImpl.java`:
- Around line 917-923: Inside ManagementNodeManagerImpl, the block that iterates
nodesInDb under synchronized(lifecycleLock) must not call new Runnable(){
`@AsyncThread` public void run(){ nodeLifeCycle.nodeJoin(...) }}.run() because
`@AsyncThread` on an anonymous method is ignored and .run() is synchronous;
instead, inside the synchronized(lifecycleLock) collect the
ManagementNodeInventory objects that need processing into a temporary list
(e.g., pendingJoins), then release the lock and for each item in pendingJoins
submit an asynchronous task to the proper executor (use the existing
ThreadFacade/async API the project uses or create a new Runnable and submit to a
thread pool) that calls
nodeLifeCycle.nodeJoin(ManagementNodeInventory.valueOf(n)); this ensures
nodeJoin runs asynchronously and the lifecycleLock is not held during nodeJoin.

In
`@test/src/test/groovy/org/zstack/test/integration/storage/primary/addon/expon/ExponPrimaryStorageCase.groovy`:
- Line 689: The cleanup currently calls deleteVolume(vol2.uuid) unconditionally
which can throw an NPE if vol2 was never created; update the cleanup in
ExponPrimaryStorageCase (reference variable vol2) to check that vol2 is not null
(and optionally that vol2.uuid is not null/empty) before calling deleteVolume,
and skip or log the deletion when vol2 is absent so the original failure isn't
masked.

In
`@test/src/test/groovy/org/zstack/test/integration/storage/primary/addon/xinfini/XinfiniPrimaryStorageCase.groovy`:
- Line 574: 在 testClean() 中对可能为 null 的 vol2 做保护性判断后再调用 deleteVolume;也可以改为在删除前使用
if (vol2 != null) deleteVolume(vol2.uuid) 或者封装一个安全删除方法(如
safeDeleteVolume(uuid))来避免在前序失败时因 vol2 为 null 导致空指针,定位到 testClean() 中涉及 vol2
变量并在调用 deleteVolume(...) 处添加该判空或安全删除调用。

In `@testlib/src/main/java/org/zstack/testlib/ExternalPrimaryStorageSpec.groovy`:
- Around line 455-457: The current else-if "remove" branch calls
iscsiClientSnapshots.getOrDefault(clientId, Collections.emptySet()).remove(...),
which may invoke remove() on an immutable emptySet and throw
UnsupportedOperationException; change the logic in
ExternalPrimaryStorageSpec.groovy to fetch the set via
iscsiClientSnapshots.get(clientId) (or check containsKey) and only call remove
on a non-null, mutable set (mirroring the computeIfAbsent() approach used in the
"add" branch) so removals operate on the actual stored collection for clientId
instead of on Collections.emptySet().

In
`@utils/src/main/java/org/zstack/utils/clouderrorcode/CloudOperationsErrorCode.java`:
- Line 6281: The constant ORG_ZSTACK_STORAGE_BACKUP_CANCEL_TIMEOUT in
CloudOperationsErrorCode breaks naming conventions by lacking the numeric
suffix; rename it to follow the existing pattern (e.g.
ORG_ZSTACK_STORAGE_BACKUP_CANCEL_TIMEOUT_XXXXX) and update every reference to
this symbol across the codebase (tests, serializations, config, logs) to use the
new name; ensure CloudOperationsErrorCode's string value is changed to match the
new identifier and run a global search/replace for
ORG_ZSTACK_STORAGE_BACKUP_CANCEL_TIMEOUT to avoid dangling references.

---

Outside diff comments:
In `@compute/src/main/java/org/zstack/compute/vm/VmNicManagerImpl.java`:
- Around line 56-64: The SQL update that sets UsedIpVO.vmNicUuid is executed
before verifying the VmNicVO exists, risking a dangling FK; in the
afterAddIpAddress method move the SQL.New(UsedIpVO.class)...update() call to
after the VmNicVO nic null-check (the Q.New(VmNicVO.class).eq(...).find() block)
or guard it with an early return so that when nic == null you skip updating
UsedIpVO; update references: afterAddIpAddress, UsedIpVO, VmNicVO, vmNicUUid,
usedIpUuid.

In
`@core/src/main/java/org/zstack/core/cloudbus/ResourceDestinationMakerImpl.java`:
- Around line 79-93: The getNodeInfo method incorrectly assigns the return of
nodes.put(...) (which is the previous value, null on cache miss) to info; change
the flow in getNodeInfo so that after loading ManagementNodeVO via
dbf.findByUuid(...) and creating a new NodeInfo(vo) you store that new instance
into a local variable (e.g. newInfo), add nodeUuid to nodeHash, put newInfo into
the nodes map, and then return newInfo; keep the ManagementNodeNotFoundException
behavior and the synchronized signature intact (references: getNodeInfo,
NodeInfo, nodes, nodeHash, ManagementNodeVO, dbf.findByUuid).

In `@header/src/main/java/org/zstack/header/vm/VmAbnormalLifeCycleStruct.java`:
- Around line 53-60: The enum value VmRunningFromUnknownStateHostChanged in
VmAbnormalLifeCycleStruct has the same match() predicate as
VmRunningFromUnknownStateHostNotChanged, causing it to never match; update
VmRunningFromUnknownStateHostChanged.match() to invert the host comparison so it
returns true when struct.getCurrentHostUuid() does NOT equal
struct.getOriginalHostUuid(), leaving other checks (original state == Unknown
and current state == Running) unchanged so
getVmAbnormalLifeCycleOperationFromStruct can correctly find the "host changed"
case.
- Line 42: The fix is incomplete: multiple enum constants' match()
implementations call struct.getCurrentHostUuid().equals(...) and can NPE when
currentHostUuid is null; update each listed enum constant
(VmPausedFromUnknownStateHostNotChanged,
VmRunningFromUnknownStateHostNotChanged, VmRunningFromUnknownStateHostChanged,
VmStoppedFromUnknownStateHostNotChanged, VmPausedFromRunningStateHostNotChanged,
VmRunningFromPausedStateHostNotChanged,
VmPausedFromMigratingStateHostNotChanged,
VmCrashedFromRunningStateHostNotChanged,
VmRunningFromCrashedStateHostNotChanged,
VmStoppedFromCrashedStateHostNotChanged) to use
Objects.equals(struct.getCurrentHostUuid(), struct.getOriginalHostUuid()) or
Objects.equals(struct.getCurrentHostUuid(), struct.getVmLastHostUuid()) as
appropriate (same pattern used in the already-fixed
VmStoppedFromPausedStateHostNotChanged); import java.util.Objects if not
present.

In `@identity/src/main/java/org/zstack/identity/AuthorizationManager.java`:
- Around line 112-122: Replace the DebugUtils.Assert in AuthorizationManager
where IdentityByPassCheck.NoSessionEvaluation is handled so that when session ==
null it throws an ApiMessageInterceptionException (using the same err(...)
helper) instead of a RuntimeException; locate the block referencing
IdentityByPassCheck.NoSessionEvaluation and msg.getSession(), and change the
assertion to throw new
ApiMessageInterceptionException(err(ORG_ZSTACK_IDENTITY_10012,
IdentityErrors.INVALID_SESSION, "NoSessionEvaluation set but no session set in
the message: %s", CloudBusGson.toJson(msg))) or alternatively catch the
RuntimeException from DebugUtils.Assert and rethrow an
ApiMessageInterceptionException to ensure both the NoSessionEvaluation path and
the evaluateSession path use the same exception type (and consider using
distinct error codes if you want to differentiate original-null vs renewed-null
cases).

---

Nitpick comments:
In `@compute/src/main/java/org/zstack/compute/allocator/HostAllocatorChain.java`:
- Line 149: In HostAllocatorChain's catch (Throwable t) block replace the
redundant ternary that checks "t != null ? t.toString() : 'unknown error'" with
a direct use of the caught throwable (e.g., t.toString() or t.getMessage()) to
form errMsg; remove the dead null branch so the assignment to errMsg uses the
guaranteed-non-null t (reference: variable errMsg and the catch (Throwable t) in
HostAllocatorChain).

In
`@core/src/main/java/org/zstack/core/cloudbus/ResourceDestinationMakerImpl.java`:
- Around line 80-93: getNodeInfo currently holds the synchronized lock while
calling dbf.findByUuid, causing unnecessary lock contention; change it to first
check nodes.get(nodeUuid) and if missing call dbf.findByUuid outside the
synchronized block, then enter synchronized(this) and double-check
nodes.get(nodeUuid) again; if still null validate vo (throw
ManagementNodeNotFoundException if vo is null), create the new NodeInfo from
ManagementNodeVO, add nodeUuid to nodeHash and put the new NodeInfo into nodes
(use the newly created NodeInfo as the value to insert and return it), otherwise
return the cached NodeInfo—refer to getNodeInfo, nodes, nodeHash,
dbf.findByUuid, ManagementNodeVO, NodeInfo, and ManagementNodeNotFoundException.

In `@core/src/main/java/org/zstack/core/db/UpdateQueryImpl.java`:
- Around line 57-58: The exception message in UpdateQueryImpl's throw currently
uses val.getClass() which prints the redundant "class ..." prefix; change it to
use val.getClass().getName() so the message shows the simple fully-qualified
class name (and keep the existing null-safe ternary check that avoids NPE when
val is null).

In `@core/src/main/java/org/zstack/core/log/LogSafeGson.java`:
- Around line 211-215: The toJson(Object o) method in LogSafeGson can NPE when o
is null because it calls o.getClass(); add a null guard at the start of
LogSafeGson.toJson(Object o) (similar to the existing toJsonTree(Object o)
check) and return the appropriate JSON null representation (e.g. the JSON string
"null" or use logSafeGson.toJson(JsonNull.INSTANCE)) instead of calling
getGsonType(o.getClass()); ensure calls like desensitize continue to work
unchanged.

In `@core/src/main/java/org/zstack/core/telemetry/SentryInitHelper.java`:
- Line 58: Replace the magic literal 1000 used in options.setMaxQueueSize(1000)
with a well-named constant (e.g., MAX_QUEUE_SIZE) declared near the top of
SentryInitHelper (static final int MAX_QUEUE_SIZE = 1000) and then use
options.setMaxQueueSize(MAX_QUEUE_SIZE) so the intent is clear and the value is
easy to adjust.

In `@plugin/kvm/src/main/java/org/zstack/kvm/KvmVmSyncPingTask.java`:
- Around line 513-525: The loop in cleanupExpiredOrphanedSkipVms() repeatedly
calls getOrphanTtlMs(), causing redundant global config reads; cache the TTL
once before the for-loop (e.g., long ttl = getOrphanTtlMs()) and use ttl inside
the loop when comparing (referencing orphanedSkipVms, getOrphanTtlMs(), and
cleanupExpiredOrphanedSkipVms()) so you avoid repeated calls as done similarly
in nodeLeft().
- Around line 469-477: Cache the result of getOrphanTtlMs() before iterating
skippedVms and compute the minutes value once to avoid repeated global config
reads, and replace the hardcoded 60000 with a named constant (e.g.,
MS_PER_MINUTE) defined in KvmVmSyncPingTask (or a shared constant used by
isVmDoNotNeedToTrace) so the logger uses the cached minutes value; apply the
same replacement of the magic 60000 in isVmDoNotNeedToTrace (lines ~504-505) to
use the named constant.
- Around line 501-506: The log message in KvmVmSyncPingTask uses
getOrphanTtlMs() / 60000 and "expired after %d minutes" which is misleading;
update the logger.info call that references vmUuid and getOrphanTtlMs() so it
reports the configured TTL (e.g., "TTL of %d minutes") rather than elapsed time,
and replace the magic number 60000 with a named constant (e.g.,
MILLIS_PER_MINUTE) defined near the class or file; keep the
orphanedSkipVms.remove(vmUuid, orphanedAt) behavior unchanged.

In
`@portal/src/main/java/org/zstack/portal/managementnode/ManagementNodeManagerImpl.java`:
- Around line 893-896: In ManagementNodeManagerImpl inside the catch for
destinationMaker.getNodeInfo(nodeUuid), update the exception handling so the
logger.warn call includes the caught Exception object (e) and set a clear
fallback hostname via inv.setHostName (e.g., "unknown-host" or ""), so hostName
is never left null; reference the existing inv.setHostName,
destinationMaker.getNodeInfo, and logger.warn identifiers when making this
change.

In `@search/src/main/java/org/zstack/query/QueryFacadeImpl.java`:
- Line 381: In QueryFacadeImpl replace the current throw of
OperationFailureException that only uses e.getMessage() by first logging the
failure with the exception object (e.g., logger.error("failed to query", e)) and
then rethrowing an OperationFailureException that preserves the original cause
(attach e as the cause or include the exception in the constructor) so the
stacktrace is retained for debugging; update the throw site where
OperationFailureException is created to include the original exception.

In `@testlib/src/main/java/org/zstack/testlib/ExternalPrimaryStorageSpec.groovy`:
- Around line 298-301: 在 ExternalPrimaryStorageSpec 的 registerSimulators()
方法开头调用静态状态容器的 clear() 方法以重置全局/静态状态,防止同 JVM 中多次用例运行时历史数据污染后续用例;定位到
registerSimulators(...) 并在其第一行添加对相应静态容器的 clear() 调用(并同样在另一个相同模式的位置,即文件中约 902-905
行对应的 registerSimulators 实现处也做相同修改)。

ℹ️ Review info

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between da1858b and b217085.

⛔ Files ignored due to path filters (13)
  • build/pom.xml is excluded by !**/*.xml
  • conf/i18n/globalErrorCodeMapping/global-error-de-DE.json is excluded by !**/*.json
  • conf/i18n/globalErrorCodeMapping/global-error-en_US.json is excluded by !**/*.json
  • conf/i18n/globalErrorCodeMapping/global-error-fr-FR.json is excluded by !**/*.json
  • conf/i18n/globalErrorCodeMapping/global-error-id-ID.json is excluded by !**/*.json
  • conf/i18n/globalErrorCodeMapping/global-error-ja-JP.json is excluded by !**/*.json
  • conf/i18n/globalErrorCodeMapping/global-error-ko-KR.json is excluded by !**/*.json
  • conf/i18n/globalErrorCodeMapping/global-error-ru-RU.json is excluded by !**/*.json
  • conf/i18n/globalErrorCodeMapping/global-error-th-TH.json is excluded by !**/*.json
  • conf/i18n/globalErrorCodeMapping/global-error-zh_CN.json is excluded by !**/*.json
  • conf/i18n/globalErrorCodeMapping/global-error-zh_TW.json is excluded by !**/*.json
  • core/pom.xml is excluded by !**/*.xml
  • pom.xml is excluded by !**/*.xml
📒 Files selected for processing (40)
  • compute/src/main/java/org/zstack/compute/allocator/HostAllocatorChain.java
  • compute/src/main/java/org/zstack/compute/allocator/HostAllocatorManagerImpl.java
  • compute/src/main/java/org/zstack/compute/host/HostBase.java
  • compute/src/main/java/org/zstack/compute/vm/VmCapabilitiesJudger.java
  • compute/src/main/java/org/zstack/compute/vm/VmNicManagerImpl.java
  • conf/db/upgrade/V5.5.12__schema.sql
  • core/src/main/java/org/zstack/core/cloudbus/ResourceDestinationMakerImpl.java
  • core/src/main/java/org/zstack/core/db/UpdateQueryImpl.java
  • core/src/main/java/org/zstack/core/log/LogSafeGson.java
  • core/src/main/java/org/zstack/core/telemetry/SentryInitHelper.java
  • core/src/main/java/org/zstack/core/telemetry/TelemetryGlobalProperty.java
  • header/src/main/java/org/zstack/header/storage/addon/primary/ExternalPrimaryStorageInventory.java
  • header/src/main/java/org/zstack/header/vm/VmAbnormalLifeCycleStruct.java
  • header/src/main/java/org/zstack/header/vm/VmInstanceSpec.java
  • header/src/main/java/org/zstack/header/vm/VmInstanceState.java
  • identity/src/main/java/org/zstack/identity/AuthorizationManager.java
  • network/src/main/java/org/zstack/network/l3/L3BasicNetwork.java
  • plugin/cbd/src/main/java/org/zstack/cbd/kvm/KvmCbdNodeServer.java
  • plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java
  • plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageMonBase.java
  • plugin/expon/src/main/java/org/zstack/expon/ExponStorageController.java
  • plugin/expon/src/main/java/org/zstack/expon/sdk/ExponClient.java
  • plugin/expon/src/main/java/org/zstack/expon/sdk/ExponConnectConfig.java
  • plugin/kvm/src/main/java/org/zstack/kvm/KVMGlobalConfig.java
  • plugin/kvm/src/main/java/org/zstack/kvm/KvmVmSyncPingTask.java
  • plugin/ldap/src/main/java/org/zstack/ldap/LdapUtil.java
  • plugin/loadBalancer/src/main/java/org/zstack/network/service/lb/LoadBalancerApiInterceptor.java
  • plugin/localstorage/src/main/java/org/zstack/storage/primary/local/LocalStorageAllocatorFactory.java
  • plugin/localstorage/src/main/java/org/zstack/storage/primary/local/LocalStorageKvmBackend.java
  • plugin/zbs/src/main/java/org/zstack/storage/zbs/ZbsStorageController.java
  • portal/src/main/java/org/zstack/portal/managementnode/ManagementNodeManagerImpl.java
  • search/src/main/java/org/zstack/query/QueryFacadeImpl.java
  • storage/src/main/java/org/zstack/storage/addon/primary/ExternalPrimaryStorageFactory.java
  • test/src/test/groovy/org/zstack/test/integration/storage/StorageTest.groovy
  • test/src/test/groovy/org/zstack/test/integration/storage/primary/PrimaryStorageTest.groovy
  • test/src/test/groovy/org/zstack/test/integration/storage/primary/addon/expon/ExponPrimaryStorageCase.groovy
  • test/src/test/groovy/org/zstack/test/integration/storage/primary/addon/xinfini/XinfiniPrimaryStorageCase.groovy
  • testlib/src/main/java/org/zstack/testlib/ExternalPrimaryStorageSpec.groovy
  • testlib/src/main/java/org/zstack/testlib/SpringSpec.groovy
  • utils/src/main/java/org/zstack/utils/clouderrorcode/CloudOperationsErrorCode.java
✅ Files skipped from review due to trivial changes (1)
  • conf/db/upgrade/V5.5.12__schema.sql
🚧 Files skipped from review as they are similar to previous changes (5)
  • plugin/expon/src/main/java/org/zstack/expon/ExponStorageController.java
  • plugin/cbd/src/main/java/org/zstack/cbd/kvm/KvmCbdNodeServer.java
  • plugin/expon/src/main/java/org/zstack/expon/sdk/ExponConnectConfig.java
  • test/src/test/groovy/org/zstack/test/integration/storage/StorageTest.groovy
  • plugin/expon/src/main/java/org/zstack/expon/sdk/ExponClient.java

Comment on lines +150 to +156
if (isDryRun) {
if (dryRunCompletion != null) {
dryRunCompletion.fail(inerr(ORG_ZSTACK_COMPUTE_ALLOCATOR_10019, errMsg));
}
} else if (completion != null) {
completion.fail(inerr(ORG_ZSTACK_COMPUTE_ALLOCATOR_10019, errMsg));
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

静默吞咽 Throwable 可能导致调用方永远挂起

isDryRun=truedryRunCompletion==null(或 isDryRun=falsecompletion==null)时,该 Throwable 被完全静默丢弃,调用方的 ReturnValueCompletion 回调永远不会触发,造成无限挂起

与此形成对比的是 done() 方法(第 88–119 行)直接调用 dryRunCompletion.fail(...) / completion.fail(...)没有任何 null 检查——若 completion 为 null 时 done() 会立即抛出 NPE,使问题暴露;但此处的 null 守卫反而将错误隐藏,使故障模式从"立即可见的崩溃"变成"难以排查的静默挂起"。

allocate(ReturnValueCompletion)dryRun(ReturnValueCompletion) 的实现来看,completion/dryRunCompletionstart() 调用前已经被赋值,正常路径下不应为 null。建议去掉 null 守卫,或至少在 null 时补充一条 error 级别日志并触发断言,以保持与 done() 的行为一致性。

🐛 建议修复——移除静默吞咽,保持与 done() 一致
         } catch (Throwable t) {
             logger.warn("unhandled throwable", t);
-            String errMsg = t != null ? t.toString() : "unknown error";
-            if (isDryRun) {
-                if (dryRunCompletion != null) {
-                    dryRunCompletion.fail(inerr(ORG_ZSTACK_COMPUTE_ALLOCATOR_10019, errMsg));
-                }
-            } else if (completion != null) {
-                completion.fail(inerr(ORG_ZSTACK_COMPUTE_ALLOCATOR_10019, errMsg));
+            String errMsg = t.toString();
+            if (isDryRun) {
+                dryRunCompletion.fail(inerr(ORG_ZSTACK_COMPUTE_ALLOCATOR_10019, errMsg));
+            } else {
+                completion.fail(inerr(ORG_ZSTACK_COMPUTE_ALLOCATOR_10019, errMsg));
             }
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/allocator/HostAllocatorChain.java`
around lines 150 - 156, The code in HostAllocatorChain handling errors silently
swallows Throwables when isDryRun is true and dryRunCompletion==null (or when
completion==null), causing callers to hang; change the logic in the error path
to mirror done(): remove the null guards and always call
dryRunCompletion.fail(...) or completion.fail(...) (or if you prefer to keep
guards, add an error-level log and an assertion/fail-fast when the completion is
null before returning) so that Throwable is not silently dropped; update the
error branch around isDryRun/dryRunCompletion/completion to ensure a fail(...)
is invoked (or a clear error+assert) using the same identifiers used in the
diff.

@Override
protected String getDeduplicateString() {
return String.format("connect-host-%s", self.getUuid());
return String.format("connect-host-%s", self == null ? "unknown" : self.getUuid());
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

避免把不同主机退化为同一个去重键。

Line 1445 在 self == null 时固定返回 connect-host-unknown,会让不同主机的连接任务共用同一 deduplicate key,存在误去重/串行化串扰风险。建议优先回退到 msg.getHostUuid()

建议修复
-                return String.format("connect-host-%s", self == null ? "unknown" : self.getUuid());
+                String hostUuid = self != null ? self.getUuid() : msg.getHostUuid();
+                return String.format("connect-host-%s", hostUuid == null ? "unknown" : hostUuid);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
return String.format("connect-host-%s", self == null ? "unknown" : self.getUuid());
String hostUuid = self != null ? self.getUuid() : msg.getHostUuid();
return String.format("connect-host-%s", hostUuid == null ? "unknown" : hostUuid);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/host/HostBase.java` at line 1445,
The deduplicate key currently returns a fixed "connect-host-unknown" when self
is null, causing different hosts to share the same key; change the key
generation in HostBase to prefer msg.getHostUuid() as the fallback (i.e., use
self.getUuid() if self != null, else msg.getHostUuid()), and only fall back to a
literal "unknown" if both are null so each host's connect-host-{uuid} key
remains unique.

Comment on lines +67 to +69
case ERROR: logger.warn("[Sentry] " + formatted); break;
case WARNING: logger.warn("[Sentry] " + formatted); break;
default: logger.info("[Sentry] " + formatted); break;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

请不要把 Sentry 的 ERROR 级别降为 WARN。

当前映射会弱化错误信号,可能影响基于日志级别的告警与定位。

🔧 建议修复
-                                    case ERROR: logger.warn("[Sentry] " + formatted); break;
+                                    case ERROR: logger.error("[Sentry] " + formatted); break;
                                     case WARNING: logger.warn("[Sentry] " + formatted); break;
                                     default: logger.info("[Sentry] " + formatted); break;
-                                    case ERROR: logger.warn("[Sentry] " + message, throwable); break;
+                                    case ERROR: logger.error("[Sentry] " + message, throwable); break;
                                     case WARNING: logger.warn("[Sentry] " + message, throwable); break;
                                     default: logger.info("[Sentry] " + message, throwable); break;

Also applies to: 76-79

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/src/main/java/org/zstack/core/telemetry/SentryInitHelper.java` around
lines 67 - 69, 在 SentryInitHelper 中不要把 Sentry 的 ERROR 级别降为 WARN;在处理 Sentry
日志级别的映射(在 switch 分支里处理 case ERROR/ WARNING/ default 的代码块,例如当前通过
logger.warn("[Sentry] " + formatted) 处理 ERROR 和 WARNING 的位置)将 ERROR 映射为
logger.error("[Sentry] " + formatted),将 WARNING 映射为 logger.warn("[Sentry] " +
formatted),保持 default(或 INFO)映射为 logger.info("[Sentry] " +
formatted);同样修正文件中另一个相同映射块(也就是你提到的 76-79 附近的 switch 分支)以保持一致。

Comment on lines +1079 to 1085
Set<ReservedIpRangeVO> reservedIpRanges = self.getReservedIpRanges();
if (reservedIpRanges != null && !reservedIpRanges.isEmpty()) {
freeIpInventorys.removeIf(freeIp -> reservedIpRanges.stream().anyMatch(
r -> NetworkUtils.isInRange(freeIp.getIp(), r.getStartIp(), r.getEndIp())));
}

reply.setInventories(freeIpInventorys);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

保留段过滤放在取满 limit 之后,会返回少于请求数量的结果

Line 1073 先按 limit 提前 break,Line 1079-1083 再删掉保留 IP,会让返回数小于 msg.getLimit()(即使后续网段仍有可用 IP)。建议在每个网段取回后先过滤,再决定是否达到 limit

建议修复(将过滤前移到循环内)
-        List<FreeIpInventory> freeIpInventorys = new ArrayList<FreeIpInventory>();
+        List<FreeIpInventory> freeIpInventorys = new ArrayList<FreeIpInventory>();
+        Set<ReservedIpRangeVO> reservedIpRanges = self.getReservedIpRanges();
         int limit = msg.getLimit();
         String start = msg.getStart();
         for ( IpRangeVO ipRangeVO : ipRangeVOs) {
@@
             List<FreeIpInventory> tempFreeIpInventorys = IpRangeHelper.getFreeIp(ipRangeVO, limit,start);
+            if (reservedIpRanges != null && !reservedIpRanges.isEmpty()) {
+                tempFreeIpInventorys.removeIf(freeIp -> {
+                    int ipVersion = NetworkUtils.isIpv4Address(freeIp.getIp()) ? IPv6Constants.IPv4 : IPv6Constants.IPv6;
+                    return reservedIpRanges.stream()
+                            .filter(r -> r.getIpVersion() == ipVersion)
+                            .anyMatch(r -> NetworkUtils.isInRange(freeIp.getIp(), r.getStartIp(), r.getEndIp()));
+                });
+            }
             freeIpInventorys.addAll(tempFreeIpInventorys);
             if (freeIpInventorys.size() >= msg.getLimit()) {
                 break;
             }
-            limit -= freeIpInventorys.size();
+            limit = msg.getLimit() - freeIpInventorys.size();
         }
-
-        Set<ReservedIpRangeVO> reservedIpRanges = self.getReservedIpRanges();
-        if (reservedIpRanges != null && !reservedIpRanges.isEmpty()) {
-            freeIpInventorys.removeIf(freeIp -> reservedIpRanges.stream().anyMatch(
-                    r -> NetworkUtils.isInRange(freeIp.getIp(), r.getStartIp(), r.getEndIp())));
-        }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@network/src/main/java/org/zstack/network/l3/L3BasicNetwork.java` around lines
1079 - 1085, 当前实现是先按 limit 聚合各网段的 freeIpInventorys 再一次性过滤 reservedIpRanges(使用
NetworkUtils.isInRange),导致可能在达到 limit 后才删除保留段,返回的数量低于
msg.getLimit();修改方法:在遍历各网段并向 freeIpInventorys 添加每段结果的地方(相关代码使用 freeIpInventorys
集合和添加逻辑),在将单段 IP 加入集合前先用 reservedIpRanges.stream().anyMatch(...) 或等价逻辑对每个 freeIp
进行过滤(使用 NetworkUtils.isInRange 和 ReservedIpRangeVO 的
getStartIp/getEndIp),然后再判断是否已达到 msg.getLimit() 并决定是否 break;确保最后调用
reply.setInventories(freeIpInventorys) 的集合已经是过滤后的结果。

Comment on lines +1081 to +1082
freeIpInventorys.removeIf(freeIp -> reservedIpRanges.stream().anyMatch(
r -> NetworkUtils.isInRange(freeIp.getIp(), r.getStartIp(), r.getEndIp())));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

按 IP 版本过滤保留段,否则这里在双栈场景会抛运行时异常

Line 1081-1082 当前会把任意 freeIp 与所有保留段做 isInRange 比较;当 IPv4 IP 去比较 IPv6 保留段(或反向)时,NetworkUtils.isInRange 会走错族解析并抛异常,导致 APIGetFreeIp 失败。

建议修复
-        if (reservedIpRanges != null && !reservedIpRanges.isEmpty()) {
-            freeIpInventorys.removeIf(freeIp -> reservedIpRanges.stream().anyMatch(
-                    r -> NetworkUtils.isInRange(freeIp.getIp(), r.getStartIp(), r.getEndIp())));
-        }
+        if (reservedIpRanges != null && !reservedIpRanges.isEmpty()) {
+            freeIpInventorys.removeIf(freeIp -> {
+                int ipVersion = NetworkUtils.isIpv4Address(freeIp.getIp()) ? IPv6Constants.IPv4 : IPv6Constants.IPv6;
+                return reservedIpRanges.stream()
+                        .filter(r -> r.getIpVersion() == ipVersion)
+                        .anyMatch(r -> NetworkUtils.isInRange(freeIp.getIp(), r.getStartIp(), r.getEndIp()));
+            });
+        }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@network/src/main/java/org/zstack/network/l3/L3BasicNetwork.java` around lines
1081 - 1082, The removal logic for freeIpInventorys calls NetworkUtils.isInRange
against every reservedIpRanges entry regardless of IP family, causing exceptions
in dual-stack scenarios; update the predicate used in freeIpInventorys.removeIf
(the code that iterates freeIp -> reservedIpRanges.stream().anyMatch(...)) to
first compare IP versions (e.g., derive family from freeIp.getIp() and the
reserved range start/end or use an existing IP version field) and only call
NetworkUtils.isInRange when the versions match, ensuring compatibility for both
IPv4 and IPv6 and preventing APIGetFreeIp runtime errors.

Comment on lines +917 to +923
new Runnable() {
@Override
@AsyncThread
public void run() {
nodeLifeCycle.nodeJoin(ManagementNodeInventory.valueOf(n));
}
}.run();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== 1) Find AsyncThread annotation definition ==="
fd -t f -i asyncthread.java

echo
echo "=== 2) Search for AsyncThread annotation class definition ==="
rg -t java 'public.*@interface.*AsyncThread|class.*AsyncThread' -A 10

echo
echo "=== 3) Find weaving/interceptor for AsyncThread ==="
rg -t java -i 'aspect|@around|@pointcut' | grep -i async | head -20

echo
echo "=== 4) Show ManagementNodeManagerImpl context around lines 917-923 ==="
if [ -f "portal/src/main/java/org/zstack/portal/managementnode/ManagementNodeManagerImpl.java" ]; then
  sed -n '900,940p' portal/src/main/java/org/zstack/portal/managementnode/ManagementNodeManagerImpl.java
else
  echo "File not found, searching for it..."
  fd ManagementNodeManagerImpl.java
fi

echo
echo "=== 5) Find all `@AsyncThread` usage patterns in codebase ==="
rg -t java '@AsyncThread' -B 3 -A 5 | head -100

Repository: MatheMatrix/zstack

Length of output: 15657


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== 1) Find AsyncThread annotation definition with full content ==="
cat core/src/main/java/org/zstack/core/thread/AsyncThread.java

echo
echo "=== 2) Search for AspectJ aspect or weaving interceptor for AsyncThread ==="
rg -t java -i 'aspect|@pointcut|@around' core/src/main/java/org/zstack/core | grep -A 20 -B 5 -i async | head -150

echo
echo "=== 3) Get broader context around lines 910-930 in ManagementNodeManagerImpl ==="
sed -n '890,950p' portal/src/main/java/org/zstack/portal/managementnode/ManagementNodeManagerImpl.java | cat -n

echo
echo "=== 4) Check if there is a synchronized block wrapping this code ==="
rg -t java -n 'synchronized.*lifecycleLock' portal/src/main/java/org/zstack/portal/managementnode/ManagementNodeManagerImpl.java -A 100 | head -150

echo
echo "=== 5) Search for where the AsyncThread weaving happens ==="
fd -t f -path '*/core/*' -name '*.java' | xargs rg -l 'AsyncThread' | head -10

Repository: MatheMatrix/zstack

Length of output: 14545


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== 1) Search for ThreadFacade or thread pool that processes `@AsyncThread` ==="
rg -t java 'ThreadFacade|ThreadPool|AsyncThread' core/src/main/java/org/zstack/core/thread/ -A 5 -B 2 | head -200

echo
echo "=== 2) Check test files for how `@AsyncThread` is expected to work ==="
if [ -f "test/src/test/java/org/zstack/test/core/thread/TestThreadFacadeAnnotationFuture.java" ]; then
  echo "--- TestThreadFacadeAnnotationFuture.java ---"
  cat test/src/test/java/org/zstack/test/core/thread/TestThreadFacadeAnnotationFuture.java | head -150
fi

echo
echo "=== 3) Search for method interception or CGLib usage for AsyncThread ==="
rg -t java 'AsyncThread' core/ -B 5 -A 10 | grep -i -E 'proxy|intercept|aspectj|cglib|invoke|reflection' | head -50

echo
echo "=== 4) Find where `@AsyncThread` annotation is actually processed ==="
rg -t java 'AsyncThread' core/ | grep -v test | grep -v '.class'

Repository: MatheMatrix/zstack

Length of output: 26785


确认 @AsyncThread 在匿名 Runnable#run() 上无法生效,并发修复失效。

第 917-923 行代码存在以下问题:

  1. @AsyncThread 注解仅在常规方法声明上有效(见 ThreadFacadeImplCloudBusImpl 等使用例),不支持匿名内部类方法。
  2. 即使注解被处理,直接调用 new Runnable(){...}.run() 也是同步执行,无法异步。
  3. 该代码在 synchronized(lifecycleLock) 内,若注解失效,nodeJoin 将在持有锁的状态下同步执行,锁持有时间延长。

建议改为:锁内只记录待处理节点,锁外提交异步任务。

当前代码
synchronized (lifecycleLock) {
    // ...
    nodesInDb.forEach(n -> {
        if (n.getUuid().equals(node().getUuid()) || suspects.contains(n)) {
            return;
        }
        new Runnable() {
            `@Override`
            `@AsyncThread`
            public void run() {
                nodeLifeCycle.nodeJoin(ManagementNodeInventory.valueOf(n));
            }
        }.run();
    });
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@portal/src/main/java/org/zstack/portal/managementnode/ManagementNodeManagerImpl.java`
around lines 917 - 923, Inside ManagementNodeManagerImpl, the block that
iterates nodesInDb under synchronized(lifecycleLock) must not call new
Runnable(){ `@AsyncThread` public void run(){ nodeLifeCycle.nodeJoin(...) }}.run()
because `@AsyncThread` on an anonymous method is ignored and .run() is
synchronous; instead, inside the synchronized(lifecycleLock) collect the
ManagementNodeInventory objects that need processing into a temporary list
(e.g., pendingJoins), then release the lock and for each item in pendingJoins
submit an asynchronous task to the proper executor (use the existing
ThreadFacade/async API the project uses or create a new Runnable and submit to a
thread pool) that calls
nodeLifeCycle.nodeJoin(ManagementNodeInventory.valueOf(n)); this ensures
nodeJoin runs asynchronously and the lifecycleLock is not held during nodeJoin.


deleteVm(vm.uuid)

deleteVolume(vol2.uuid)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

清理阶段建议对 vol2 先判空,避免掩盖真实失败原因。

Line 689 若前序步骤异常导致 vol2 未创建,这里会二次抛错(NPE),让排障变难。

💡 建议修改
-        deleteVolume(vol2.uuid)
+        if (vol2?.uuid != null) {
+            deleteVolume(vol2.uuid)
+        }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@test/src/test/groovy/org/zstack/test/integration/storage/primary/addon/expon/ExponPrimaryStorageCase.groovy`
at line 689, The cleanup currently calls deleteVolume(vol2.uuid) unconditionally
which can throw an NPE if vol2 was never created; update the cleanup in
ExponPrimaryStorageCase (reference variable vol2) to check that vol2 is not null
(and optionally that vol2.uuid is not null/empty) before calling deleteVolume,
and skip or log the deletion when vol2 is absent so the original failure isn't
masked.

void testClean() {
deleteVm(vm.uuid)

deleteVolume(vol2.uuid)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

testClean() 里建议对 vol2 判空后再删除。

Line 574 在前序失败场景可能触发空指针,影响失败根因定位。

💡 建议修改
-        deleteVolume(vol2.uuid)
+        if (vol2?.uuid != null) {
+            deleteVolume(vol2.uuid)
+        }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@test/src/test/groovy/org/zstack/test/integration/storage/primary/addon/xinfini/XinfiniPrimaryStorageCase.groovy`
at line 574, 在 testClean() 中对可能为 null 的 vol2 做保护性判断后再调用 deleteVolume;也可以改为在删除前使用
if (vol2 != null) deleteVolume(vol2.uuid) 或者封装一个安全删除方法(如
safeDeleteVolume(uuid))来避免在前序失败时因 vol2 为 null 导致空指针,定位到 testClean() 中涉及 vol2
变量并在调用 deleteVolume(...) 处添加该判空或安全删除调用。

Comment on lines +455 to +457
} else if ("remove".equals(action)) {
iscsiClientSnapshots.getOrDefault(clientId, Collections.emptySet()).remove(snapId.toString())
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# 检查目标文件的确切行内容
cd /tmp && find . -path "*testlib/src/main/java/org/zstack/testlib/ExternalPrimaryStorageSpec.groovy" 2>/dev/null
# 如果不存在,尝试查找仓库
git ls-files | grep -i ExternalPrimaryStorageSpec

Repository: MatheMatrix/zstack

Length of output: 130


🏁 Script executed:

# 读取目标文件的 455-457 行
git show HEAD:testlib/src/main/java/org/zstack/testlib/ExternalPrimaryStorageSpec.groovy | sed -n '450,460p'

Repository: MatheMatrix/zstack

Length of output: 725


🏁 Script executed:

# 展开检查更多上下文(440-470行)以理解 iscsiClientSnapshots 定义
git show HEAD:testlib/src/main/java/org/zstack/testlib/ExternalPrimaryStorageSpec.groovy | sed -n '440,470p'

Repository: MatheMatrix/zstack

Length of output: 2001


🏁 Script executed:

# 检查是否存在中文注释(违反代码指南 **/*.* 不应有中文)
git show HEAD:testlib/src/main/java/org/zstack/testlib/ExternalPrimaryStorageSpec.groovy | grep -n '[^\x00-\x7F]' | head -20

Repository: MatheMatrix/zstack

Length of output: 878


🏁 Script executed:

# 搜索整个文件中是否有类似的 getOrDefault + Collections.emptySet() + remove 模式
git show HEAD:testlib/src/main/java/org/zstack/testlib/ExternalPrimaryStorageSpec.groovy | grep -n "getOrDefault" | head -20

Repository: MatheMatrix/zstack

Length of output: 620


🏁 Script executed:

# 验证 Collections.emptySet() 在 Groovy/Java 中是否真的不支持 remove()
# 查看同仓库内是否有类似的错误处理模式
rg -n "getOrDefault.*Collections\.(emptySet|emptyList|emptyMap)\(\)" testlib/src/main/java/org/zstack/testlib/ -A 2

Repository: MatheMatrix/zstack

Length of output: 2082


🏁 Script executed:

# 检查 iscsiClientSnapshots 的定义类型
git show HEAD:testlib/src/main/java/org/zstack/testlib/ExternalPrimaryStorageSpec.groovy | grep -n "iscsiClientSnapshots" | head -10

Repository: MatheMatrix/zstack

Length of output: 896


避免在不可变集合上执行 remove(),否则触发运行时异常。

第 456 行在 clientId 不存在时会获得 Collections.emptySet(),该集合不可修改,调用 remove() 会抛出 UnsupportedOperationException,导致模拟器接口异常。与第 454 行"add"分支使用 computeIfAbsent() 的正确做法不一致。

建议改为:

def snaps = iscsiClientSnapshots.get(clientId)
if (snaps != null) {
    snaps.remove(snapId.toString())
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@testlib/src/main/java/org/zstack/testlib/ExternalPrimaryStorageSpec.groovy`
around lines 455 - 457, The current else-if "remove" branch calls
iscsiClientSnapshots.getOrDefault(clientId, Collections.emptySet()).remove(...),
which may invoke remove() on an immutable emptySet and throw
UnsupportedOperationException; change the logic in
ExternalPrimaryStorageSpec.groovy to fetch the set via
iscsiClientSnapshots.get(clientId) (or check containsKey) and only call remove
on a non-null, mutable set (mirroring the computeIfAbsent() approach used in the
"add" branch) so removals operate on the actual stored collection for clientId
instead of on Collections.emptySet().


public static final String ORG_ZSTACK_STORAGE_BACKUP_10133 = "ORG_ZSTACK_STORAGE_BACKUP_10133";

public static final String ORG_ZSTACK_STORAGE_BACKUP_CANCEL_TIMEOUT = "ORG_ZSTACK_STORAGE_BACKUP_CANCEL_TIMEOUT";
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="utils/src/main/java/org/zstack/utils/clouderrorcode/CloudOperationsErrorCode.java"

echo "[1] 列出 CloudOperationsErrorCode 中非数字后缀的 ORG_ZSTACK 常量:"
rg -nP 'public static final String ORG_ZSTACK_[A-Z0-9_]+ = "ORG_ZSTACK_[A-Z0-9_]+";' "$FILE" | rg -nPv '_\d+";$' || true

echo
echo "[2] 搜索仓库内可能依赖"错误码数字后缀"的匹配/解析逻辑:"
rg -nP --type=java '(matches\([^)]*ORG_ZSTACK[^)]*\\d|Pattern\.compile\([^)]*ORG_ZSTACK[^)]*\\d|\.split\("_"\)|parseInt|parseLong)' -A 2 -B 2 | head -100 || true

echo
echo "[3] 检查错误码文件中数字后缀与非数字后缀的比例:"
echo "Total error codes with numeric suffix (ending with _XXXX where X is digit):"
rg -c '_\d+";$' "$FILE" || echo "0"
echo "Total error codes with non-numeric suffix:"
rg -cP 'public static final String ORG_ZSTACK_[A-Z0-9_]+[A-Z] = "ORG_ZSTACK' "$FILE" || echo "0"

Repository: MatheMatrix/zstack

Length of output: 11140


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "[1] 搜索 ORG_ZSTACK_STORAGE_BACKUP_CANCEL_TIMEOUT 的使用:"
rg 'ORG_ZSTACK_STORAGE_BACKUP_CANCEL_TIMEOUT' --type=java || echo "No usage found"

echo
echo "[2] 搜索错误码处理相关的工具类(可能处理数字后缀):"
rg -nP --type=java '(ErrorCode|error.*code|errorCode)' -i | rg -iP '(parse|extract|suffix|numeric|digit)' | head -50 || true

echo
echo "[3] 检查CloudOperationsErrorCode附近是否有特殊的命名说明或注释:"
FILE="utils/src/main/java/org/zstack/utils/clouderrorcode/CloudOperationsErrorCode.java"
sed -n '6270,6290p' "$FILE"

echo
echo "[4] 查看错误码文件的开头以了解命名规范:"
head -50 "$FILE"

Repository: MatheMatrix/zstack

Length of output: 6615


应添加数字后缀以保持错误码命名一致性

该常量命名与仓库现有约定不符。CloudOperationsErrorCode 中 7940 个常量均采用 _NNNN 的数字后缀形式(如 ORG_ZSTACK_STORAGE_BACKUP_10129),而该常量为唯一例外。建议改为 ORG_ZSTACK_STORAGE_BACKUP_CANCEL_TIMEOUT_XXXXX 的形式,保持规范一致性。

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@utils/src/main/java/org/zstack/utils/clouderrorcode/CloudOperationsErrorCode.java`
at line 6281, The constant ORG_ZSTACK_STORAGE_BACKUP_CANCEL_TIMEOUT in
CloudOperationsErrorCode breaks naming conventions by lacking the numeric
suffix; rename it to follow the existing pattern (e.g.
ORG_ZSTACK_STORAGE_BACKUP_CANCEL_TIMEOUT_XXXXX) and update every reference to
this symbol across the codebase (tests, serializations, config, logs) to use the
new name; ensure CloudOperationsErrorCode's string value is changed to match the
new identifier and run a global search/replace for
ORG_ZSTACK_STORAGE_BACKUP_CANCEL_TIMEOUT to avoid dangling references.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants