Hi, I'm encountering a problem when restoring an ES snapshot into an empty cluster: some of the indices can't be restored due to org.apache.lucene.index.CorruptIndexException: checksum failed (I'm using ES version 8.10).
What I don't fully understand is when exactly ES computes and stores a checksum. Does it happen:
when a segment is written
when segments are merged
when the snapshot is created (or does it simply copy the files from the data dir to the backup dir, checksum included)?
Btw, in my case the data dir is on an ext4 file system and the backup dir is on ext3. Is this a sub-optimal setup, and could it be the reason for the corrupt index issue?
When creating the snapshot we verify the previously-created checksum, but otherwise just copy the checksum-carrying files verbatim into the snapshot.
The rest is right though: we create the checksum footer whenever we create a new file, which includes writing a fresh segment and merging segments together.
It shouldn't; ES uses no filesystem-specific features that could possibly matter here. A checksum error really just means that the data Elasticsearch reads from the repository is different from the data it originally wrote.
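To make that concrete, here's a minimal sketch in Python (not ES code, just my understanding of the Lucene footer layout, so treat it as an illustration): the last 8 bytes of every Lucene file hold the file's CRC32 as a big-endian long, computed over everything that precedes them, so you can recompute and compare it yourself:

import struct
import zlib

def check_lucene_footer(path):
    # Compare the CRC32 stored in the file's footer with one recomputed over the body.
    with open(path, 'rb') as f:
        data = f.read()
    stored = struct.unpack('>Q', data[-8:])[0]       # last 8 bytes: checksum as a big-endian long
    computed = zlib.crc32(data[:-8]) & 0xffffffff    # CRC32 over everything before the checksum
    return stored, computed, stored == computed

print(check_lucene_footer('<path_to_tim_file>'))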
When you say 'We verify the previously-created checksum'... do you mean that before copying the files to the snapshot, ES will re-calculate the checksum, and if it doesn't match the snapshot will fail?
Btw, I'm using a shared file system repository (an NFS-mounted dir) as the snapshot repository.
In case it sheds more light on the problem: in the log I see that 'writtenLength' and 'expectedLength' are the same, but in the 'footer=' / 'verification failed (hardware problem?)' part the expected and actual checksums differ.
I finally analyzed this in a bit more detail. I started from the log error message (thrown from VerifyingIndexOutput.verify):
verification failed (hardware problem?): expected=280s5g actual=1ltduc1...resource=name [_8rs_Lucene90_0.tim]
I checked the CRC32 of the actual 'tim' file in the ES data dir. I used two methods:
xxd -s -8 -p <path_to_tim_file>
using python: zlib.crc32(open('<path_to_tim_file>', 'rb').read()[:-8]) & 0xffffffff
both of them yield the same result, which is OK.
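For comparing these numbers with the expected=/actual= strings in the log: as far as I can tell ES prints the checksum in base 36 (Long.toString(digest, Character.MAX_RADIX)), so a small helper converts between the two forms (the values below are the ones from my log):

def to_base36(n):
    # Render an integer the way the ES log prints checksums (e.g. expected=280s5g).
    digits = '0123456789abcdefghijklmnopqrstuvwxyz'
    out = ''
    while n:
        n, r = divmod(n, 36)
        out = digits[r] + out
    return out or '0'

print(int('280s5g', 36))             # the 'expected' value from the log as a plain integer
print(int('1ltduc1', 36))            # the 'actual' value from the log as a plain integer
print(to_base36(int('280s5g', 36)))  # round-trips back to '280s5g'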
I'm aware you mentioned that ES verifies the previously-created checksum before the snapshot is executed; I just wanted to get more familiar with the internals.
Next, I found how the 'tim' file is mapped into the snapshot repository (from a trace log reported by FileRestoreContext):
...recovering [_8rs_Lucene90_0.tim] from [__jWL4NO.....]
I tried the Python command on '__jWL4NO...' and got the same checksum value as reported by VerifyingIndexOutput (actual=1ltduc1).
Does this mean that the backup (snapshot data) is actually corrupted due to some HW problem (NFS issue, RAM, bad sector, etc.)?
Yes, unfortunately, sorry about that. The __jWL4NO..... blob should be byte-for-byte identical to _8rs_Lucene90_0.tim, and it seems that it isn't. Maybe you can use xxd to convert the two files to plain text and then diff them to see what's actually different. I'd be interested to see the results of that analysis; it's not often that we have access to both the original and the corrupted data.
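If xxd output on a file that size gets unwieldy, a rough alternative (just a hypothetical helper, nothing ES-specific) is to scan both files in Python and report the first offset at which they diverge, with a little hex context around it:

def first_difference(path_a, path_b, context=32, chunk=1 << 20):
    # Walk both files chunk by chunk and report where they first diverge.
    with open(path_a, 'rb') as a, open(path_b, 'rb') as b:
        offset = 0
        while True:
            ca, cb = a.read(chunk), b.read(chunk)
            if ca != cb:
                for i, (x, y) in enumerate(zip(ca, cb)):
                    if x != y:
                        pos = offset + i
                        print('first difference at byte', pos)
                        print('a:', ca[max(0, i - context):i + context].hex())
                        print('b:', cb[max(0, i - context):i + context].hex())
                        return pos
                # No differing byte in the common prefix: one file is shorter than the other.
                pos = offset + min(len(ca), len(cb))
                print('files differ in length; the shorter one ends at byte', pos)
                return pos
            if not ca:
                print('files are identical')
                return None
            offset += len(ca)

first_difference('_8rs_Lucene90_0.tim', '<path_to___jWL4NO_blob>')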
The ES cluster contains 3 nodes (running Rocky Linux 9.2); the ES version is 8.10.2.
There is another machine which acts as the 'shared file system repository'; its OS is Rocky Linux 9.0. This machine is the NFS server. I installed nfs-utils 2.5.4 and the protocol version is NFSv4.2.
One entry in /etc/exports looks like this, for example:
<mounted_path> (rw,sync,all_squash,no_subtree_check,anonuid=1000,anongid=1000)
The file system from the repository is mounted on all ES nodes, and the shared file system repository is registered on all 3 ES nodes.
Snapshots are scheduled once per day and have been running for several months now. Kibana reports that the snapshots finish successfully, and I didn't notice errors in the ES logs (I didn't investigate other logs, though).
However, as stated above, there is a problem when restoring a snapshot in a different (testing) environment. The problematic 'tim' file is ~1.5 GB. (There are also other problematic files; for example, one of the data streams contains 7 indices and 4 out of the 7 are restored successfully.)
I used 'diff' to see the actual differences (diff -u); the result shows ~300 lines. Judging by the byte offsets, I see a dozen lines with '-' and a dozen lines with '+' towards the end of the file (not sure if this information is interesting). From the ASCII representation I can recognize familiar data from documents that were indexed. Given that the documents are not that small, it looks like a few docs are missing from the snapshot blob file and a few exist there but don't exist in the tim file (I still need to investigate this better).
Just double-check the client side too, and check you really are using 4.2 consistently.
cat /proc/fs/nfsd/versions
on server side is your friend.
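On the client side (the ES nodes) the negotiated version shows up in the mount options; nfsstat -m prints it, or you can pull it out of /proc/mounts with something like:

# List NFS mounts on this node together with their negotiated protocol version (vers=...).
with open('/proc/mounts') as mounts:
    for line in mounts:
        device, mountpoint, fstype, options = line.split()[:4]
        if fstype.startswith('nfs'):
            vers = [opt for opt in options.split(',') if opt.startswith('vers=')]
            print(mountpoint, fstype, vers)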
However, I think you are still in a tough spot. Because unless someone else spots something you're doing wrong that you can fix, which is truly your best-case scenario, you still ended up with issues. And whether those issues are some subtle bug in Elasticsearch, or in the NFS server or NFS client implementation, or somewhere else, is going to be very difficult to find IMO.
One thing you might try is a different NFS server, say a *BSD box. But if I've got it right, you've worked with this setup for ages without issue and have now hit a problem for the first time. I.e. somewhere in the past, maybe multiple times, you've tested restoring your snapshots and it all worked well. And you've checked the repo regularly via a POST to /_snapshot/the_repo/_verify?pretty.
I hope someone else finds something better here. Good luck.
Clusters should only register a particular snapshot repository once. If you register the same snapshot repository with multiple clusters, only one cluster should have write access to the repository. On other clusters, register the repository as read-only.
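For reference, a minimal sketch of registering the same fs repository read-only on a cluster that must not write to it (repository name taken from this thread; the URL and location are placeholders, and plain HTTP is assumed for brevity):

import requests

# Register the existing shared file system repository with readonly: true on the secondary cluster.
requests.put(
    'http://localhost:9200/_snapshot/transaction-log',
    json={
        'type': 'fs',
        'settings': {
            'location': '/mnt/snapshots',   # placeholder for the NFS mount path
            'readonly': True,
        },
    },
)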
Just for clarification: restore never worked (I've tried restoring several different snapshots, even the first one). I also removed the NFS mounts and recreated them all over again; that didn't help either. I was aware that restore doesn't work, but this issue wasn't a top priority back then and I didn't investigate it in more detail. Now it's back on the table and it's time to find a resolution.
So, I assume a different NFS server or switching to a different repository type may be things to consider, OR continuing the investigation & troubleshooting, which I agree may be very difficult.
I ran the 'verify' API in the past and everything was/is OK. I also saw there is a repository analysis API; I may give that one a shot too.
Rocky Linux isn't a distribution I've used, but I see it is RHEL-based. 9.0 on the server side seems a bit old (released July 2022), so if there is a bug there you'll probably be asked to upgrade anyway. And of course, if you used RHEL itself you could open a support ticket.
There are also other NFS server-side implementations you could try, e.g. Ganesha.
Last, try to reduce this to the smallest possible reproducible problem. Snapshot an index with 1 document and check that the restore works. If so, double the doc count, rinse, repeat. When you hit the problem, capture the traffic on client and server and analyse it.
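A rough sketch of such a loop (hypothetical index name, the repository name from this thread, plain HTTP and a throwaway test cluster assumed; the health check is only a coarse proxy for "the restore worked"):

import requests

ES = 'http://localhost:9200'   # assumption: throwaway test cluster
REPO = 'transaction-log'       # repository name from this thread
INDEX = 'repro-test'           # hypothetical index name

def restore_ok(n_docs):
    # Index n_docs small docs, snapshot them, delete the index, restore it, check health.
    requests.delete(f'{ES}/{INDEX}')
    bulk = ''.join('{"index":{}}\n' + '{"n":%d,"msg":"document number %d"}\n' % (i, i)
                   for i in range(n_docs))
    requests.post(f'{ES}/{INDEX}/_bulk', data=bulk,
                  headers={'Content-Type': 'application/x-ndjson'})
    requests.post(f'{ES}/{INDEX}/_refresh')
    snap = f'repro-{n_docs}'
    requests.put(f'{ES}/_snapshot/{REPO}/{snap}?wait_for_completion=true',
                 json={'indices': INDEX})
    requests.delete(f'{ES}/{INDEX}')
    requests.post(f'{ES}/_snapshot/{REPO}/{snap}/_restore?wait_for_completion=true',
                  json={'indices': INDEX})
    health = requests.get(
        f'{ES}/_cluster/health/{INDEX}?wait_for_status=yellow&timeout=60s').json()
    return health.get('status') != 'red'

n = 1
while restore_ok(n):
    print(n, 'docs: restore OK')
    n *= 2
print('restore failed somewhere around', n, 'docs')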
FTR, no way could this explain it anyway. The __jWL4NO..... filename is random, generated as the file is being written, so you wouldn't get two clusters writing to the same individual file anyway. It is possible to see repository corruption with multiple clusters writing to the same repository, but not corruption like this.
It is interesting indeed, I'd love to see it. It's curious that both the old and new sections of the file have what looks to be valid data. If you'd rather not share it publicly I can set up a private upload link.
Yeah please try that, it does much more interesting checks. The _verify API is unlikely to show anything useful in this case.
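A hedged example of how a run could look (repository name from this thread; the parameter values are just a starting point, not a recommendation):

import requests

# Kick off a repository analysis; blob_count / max_blob_size / concurrency control how much
# work it does, and ?timeout bounds how long the call waits for the analysis to finish.
resp = requests.post(
    'http://localhost:9200/_snapshot/transaction-log/_analyze',
    params={
        'blob_count': 100,
        'max_blob_size': '10mb',
        'concurrency': 10,
        'timeout': '300s',
        'detailed': 'true',
    },
)
print(resp.status_code)
print(resp.json())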
Also try something else like stress-ng --verify --hdd 16 --temp-path ${NFS_MOUNT}. I believe Rocky Linux 9 will pick stress-ng-0.17.08 which is reasonably new.
I will need to double-check with the stakeholders whether I can upload the diff file; if yes, then I'll definitely need a private upload link. I'll let you know.
I started running the repository analysis API. A few runs finished fast, without any issues detected. At some point (after a few iterations and increasing various query params) the analysis stopped after 30 sec with a 'timeout' error, although I had increased the timeout query param.
Strange indeed. AFAICT the timeout in question is the one set by the ?timeout query parameter, so I've no idea why it's failing after 30s if you're setting a much longer value here.
Unfortunately, I wasn't able to reproduce it again, but if I understand correctly this means a 'real failure' in one of the actual analyses that '_analyze' performs (transaction-log is the name of the snapshot repository).
Bit hard to tell from a screenshot, sorry. It'd be better to share it in plain text. But yeah this looks like misbehaving storage.
You're using quite an old version of ES here; I'd suggest trying the analysis with 8.17 these days. It is a little stricter in some ways, but it also reports problems slightly better.