VTN acceptance #11

Open
opened 2025-11-22 17:09:22 +00:00 by juselius · 15 comments
juselius commented 2025-11-22 17:09:22 +00:00 (Migrated from gitlab.com)

Ceph

Placement group 27.11a has become inconsistent multiple times, which could lead to data loss in the future.

During a deep scrub, Ceph reads each object from disk, calculates its checksum, and compares it to the checksum that was previously recorded. If the two checksums do not match, the mismatch is reported as an inconsistency.
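The checksum comparison described above can be illustrated with a small sketch. This is not Ceph's implementation: `zlib.crc32` is used here only as a stand-in checksum, and the function names and sample payloads are hypothetical.

```python
import zlib


def checksum(data: bytes) -> int:
    # Stand-in checksum for illustration only (assumption, not Ceph's code path).
    return zlib.crc32(data)


def is_consistent(data_on_disk: bytes, recorded_checksum: int) -> bool:
    # A mismatch between the freshly computed checksum and the recorded one
    # is what a deep scrub would flag as an inconsistency.
    return checksum(data_on_disk) == recorded_checksum


# Hypothetical object payload and the checksum stored when it was written.
original = b"object payload"
recorded = checksum(original)

# The same object after one byte has silently changed on disk.
corrupted = b"object pay1oad"

print("original consistent: ", is_consistent(original, recorded))
print("corrupted consistent:", is_consistent(corrupted, recorded))
```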

[ERR] OSD_SCRUB_ERRORS: 1 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
    pg 27.11a is active+clean+inconsistent, acting [70,135,115,121,78,120,102,65,90,85,106,116]

The underlying drives may be failing. If that is the case, the affected OSDs need to be removed from the cluster and replaced.
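To know which OSDs to inspect, the acting set can be pulled out of the health output. A minimal sketch, assuming the line format shown above; the regular expression and the `parse_pg_line` helper are hypothetical and not part of any Ceph tooling.

```python
import re

# Health detail line copied from the output above.
HEALTH_LINE = (
    "    pg 27.11a is active+clean+inconsistent, "
    "acting [70,135,115,121,78,120,102,65,90,85,106,116]"
)

# Assumed line shape: "pg <pgid> is <state+state+...>, acting [<id>,<id>,...]"
PG_LINE = re.compile(
    r"pg (?P<pgid>\S+) is (?P<state>[^,\s]+), acting \[(?P<acting>[\d,]+)\]"
)


def parse_pg_line(line: str):
    """Return the PG id, its states, and the acting OSD ids, or None if no match."""
    match = PG_LINE.search(line)
    if match is None:
        return None
    return {
        "pgid": match.group("pgid"),
        "states": match.group("state").split("+"),
        "acting": [int(osd) for osd in match.group("acting").split(",")],
    }


info = parse_pg_line(HEALTH_LINE)
print(info["pgid"], info["states"], info["acting"])
```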

TODO:

  • Start manual repair with ceph pg repair 27.11a
  • Check the above OSDs with debug_osd = 20
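The two TODO items can be turned into the concrete commands to run. The sketch below only builds and prints the command lines and executes nothing. `ceph pg repair` is taken from the list above; using `ceph tell osd.<id> config set debug_osd 20` to raise the log level at runtime is an assumption about how the second item would be carried out, and the helper names are hypothetical.

```python
PG_ID = "27.11a"
# Acting set from the health output above.
ACTING_OSDS = [70, 135, 115, 121, 78, 120, 102, 65, 90, 85, 106, 116]


def repair_command(pgid: str) -> list[str]:
    # First TODO item: start a manual repair of the placement group.
    return ["ceph", "pg", "repair", pgid]


def debug_commands(osd_ids: list[int], level: int = 20) -> list[list[str]]:
    # Second TODO item: raise debug_osd on every OSD in the acting set.
    # Assumption: the level is set at runtime via `ceph tell`.
    return [
        ["ceph", "tell", f"osd.{osd}", "config", "set", "debug_osd", str(level)]
        for osd in osd_ids
    ]


commands = [repair_command(PG_ID)] + debug_commands(ACTING_OSDS)
for cmd in commands:
    print(" ".join(cmd))
```

Level 20 logging is very verbose, so the setting would normally be reverted once the relevant logs have been captured.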
juselius commented 2025-11-22 17:09:22 +00:00 (Migrated from gitlab.com)

set status to To do

juselius commented 2025-11-22 17:09:30 +00:00 (Migrated from gitlab.com)

set status to On hold

juselius commented 2025-11-22 17:09:38 +00:00 (Migrated from gitlab.com)

set status to In progress

juselius commented 2025-11-22 17:12:48 +00:00 (Migrated from gitlab.com)

assigned to @juselius

mrtz-j commented 2025-12-03 10:23:39 +00:00 (Migrated from gitlab.com)

assigned to @mrtz-j

mrtz-j commented 2025-12-03 10:28:46 +00:00 (Migrated from gitlab.com)

changed the description

mrtz-j commented 2025-12-03 10:33:12 +00:00 (Migrated from gitlab.com)

changed the description

mrtz-j commented 2025-12-03 10:43:05 +00:00 (Migrated from gitlab.com)

changed the description

mrtz-j commented 2025-12-03 10:43:33 +00:00 (Migrated from gitlab.com)

changed the description

mrtz-j commented 2025-12-03 11:04:24 +00:00 (Migrated from gitlab.com)

marked the checklist item Start manual repair with ceph pg repair 27.11a as completed

mrtz-j commented 2026-01-12 13:43:31 +00:00 (Migrated from gitlab.com)

marked the checklist item Check the above OSDs with debug_osd = 20 as completed

mrtz-j commented 2026-01-12 13:43:56 +00:00 (Migrated from gitlab.com)

1 daemons have recently crashed
mgr.a crashed on host ceph-mon1 at 2026-01-12T08:51:29.167629Z

mrtz-j commented 2026-01-12 13:47:04 +00:00 (Migrated from gitlab.com)
{
    "assert_condition": "nref == 0",
    "assert_file": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/19.2.3/rpm/el9/BUILD/ceph-19.2.3/src/common/RefCountedObj.cc",
    "assert_func": "virtual ceph::common::RefCountedObject::~RefCountedObject()",
    "assert_line": 14,
    "assert_msg": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/19.2.3/rpm/el9/BUILD/ceph-19.2.3/src/common/RefCountedObj.cc: In function 'virtual ceph::common::RefCountedObject::~RefCountedObject()' thread 7f1158194640 time 2026-01-12T08:51:29.164582+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/19.2.3/rpm/el9/BUILD/ceph-19.2.3/src/common/RefCountedObj.cc: 14: FAILED ceph_assert(nref == 0)\n",
    "assert_thread_name": "ms_dispatch",
    "backtrace": [
        "/lib64/libc.so.6(+0x3ebf0) [0x7f127ccb6bf0]",
        "/lib64/libc.so.6(+0x8c21c) [0x7f127cd0421c]",
        "raise()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x178) [0x7f127d248e5c]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x183fc1) [0x7f127d248fc1]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x27b0f9) [0x7f127d3400f9]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x3ecdfb) [0x7f127d4b1dfb]",
        "(ceph::common::RefCountedObject::put() const+0x1a8) [0x7f127d342b88]",
        "(DispatchQueue::entry()+0x18a) [0x7f127d45af5a]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x4302a1) [0x7f127d4f52a1]",
        "/lib64/libc.so.6(+0x8a4da) [0x7f127cd024da]",
        "clone()"
    ],
    "ceph_version": "19.2.3",
    "crash_id": "2026-01-12T08:51:29.167629Z_00071423-93b5-43fa-80ff-6dfe564ab4f5",
    "entity_name": "mgr.a",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "9",
    "os_version_id": "9",
    "process_name": "ceph-mgr",
    "stack_sig": "2eca3482a59c223e5bb863efc60aadd04cd01c3b52ba030d191dfcd667c12359",
    "timestamp": "2026-01-12T08:51:29.167629Z",
    "utsname_hostname": "ceph-mon1",
    "utsname_machine": "x86_64",
    "utsname_release": "6.12.31-talos",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Tue Jun 10 11:58:23 UTC 2025"
}
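A crash report like the one above is easier to triage when reduced to a few fields. A minimal sketch, assuming the report is available as JSON; the embedded report is trimmed from the one above (only some keys and backtrace frames are kept), and `summarize` is a hypothetical helper.

```python
import json

# Trimmed copy of the crash report above; values are unchanged.
CRASH_JSON = r"""
{
    "assert_condition": "nref == 0",
    "assert_func": "virtual ceph::common::RefCountedObject::~RefCountedObject()",
    "assert_line": 14,
    "assert_thread_name": "ms_dispatch",
    "backtrace": [
        "/lib64/libc.so.6(+0x3ebf0) [0x7f127ccb6bf0]",
        "raise()",
        "abort()",
        "(ceph::common::RefCountedObject::put() const+0x1a8) [0x7f127d342b88]",
        "(DispatchQueue::entry()+0x18a) [0x7f127d45af5a]",
        "clone()"
    ],
    "ceph_version": "19.2.3",
    "entity_name": "mgr.a",
    "timestamp": "2026-01-12T08:51:29.167629Z",
    "utsname_hostname": "ceph-mon1"
}
"""


def summarize(report: dict) -> dict:
    # Keep only frames that carry a symbol name; frames that start with a
    # library path are raw offsets and say little without debug symbols.
    named_frames = [f for f in report.get("backtrace", []) if not f.startswith("/")]
    return {
        "daemon": report["entity_name"],
        "host": report["utsname_hostname"],
        "version": report["ceph_version"],
        "thread": report["assert_thread_name"],
        "assertion": f'{report["assert_func"]}: ceph_assert({report["assert_condition"]})',
        "frames": named_frames,
    }


summary = summarize(json.loads(CRASH_JSON))
for key, value in summary.items():
    print(f"{key}: {value}")
```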
mrtz-j commented 2026-01-12 13:56:55 +00:00 (Migrated from gitlab.com)
@juselius seems like a bug https://tracker.ceph.com/issues/69537
mrtz-j commented 2026-01-12 13:57:33 +00:00 (Migrated from gitlab.com)

fix published in December: https://github.com/ceph/ceph/pull/65006
