Hi, I'm encountering a problem when restoring an ES snapshot into an empty cluster: some of the indices can't be restored due to org.apache.lucene.index.CorruptIndexException: checksum failed (I'm using ES version 8.10).
What I don't fully understand is when exactly ES computes and stores a checksum. Does it happen:
when a segment is written
when segments are merged
when the snapshot is created (or does it simply copy the files from the data dir to the backup dir, checksum included)?
Btw, in my case the data dir is on an ext4 file system and the backup dir is on ext3. Is this a sub-optimal setup, and could it be the reason for the corrupt index issue?
When creating the snapshot we verify the previously-created checksum, but otherwise just copy the checksum-carrying files verbatim into the snapshot.
The rest is right though: we create the checksum footer whenever we create a new file, which includes writing a fresh segment and merging segments together.
It shouldn't; ES uses no filesystem-specific features that could possibly matter here. A checksum error really just means that the data Elasticsearch reads from the repository is different from the data it originally wrote.
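To make that concrete, here's a minimal sketch in Python (not ES code, just my understanding of the Lucene footer layout, so treat it as an illustration): the last 8 bytes of every Lucene file hold the file's CRC32 as a big-endian long, computed over everything that precedes them, so you can recompute and compare it yourself:

import struct
import zlib

def check_lucene_footer(path):
    # Compare the CRC32 stored in the file's footer with one recomputed over the body.
    with open(path, 'rb') as f:
        data = f.read()
    stored = struct.unpack('>Q', data[-8:])[0]       # last 8 bytes: checksum as a big-endian long
    computed = zlib.crc32(data[:-8]) & 0xffffffff    # CRC32 over everything before the checksum
    return stored, computed, stored == computed

print(check_lucene_footer('<path_to_tim_file>'))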
When you say 'We verify the previously-created checksum'... do you mean that before copying the files to the snapshot, ES will re-calculate the checksum, and if it doesn't match the snapshot will fail?
Btw, I'm using a shared file system repository (an NFS-mounted dir) as the snapshot repository.
In case it sheds more light on the problem: in the log I see that 'writtenLength' and 'expectedLength' are the same, but in the 'footer=' / 'verification failed (hardware problem?)' part the expected and actual checksums differ.
I finally analyzed this in a bit more detail. I started from the log error message (thrown from VerifyingIndexOutput.verify):
verification failed (hardware problem?): expected=280s5g actual=1ltduc1...resource=name [_8rs_Lucene90_0.tim]
I checked the CRC32 of the actual 'tim' file in the ES data dir. I used two methods:
xxd -s -8 -p <path_to_tim_file>
using python: zlib.crc32(open('<path_to_tim_file>', 'rb').read()[:-8]) & 0xffffffff
both of them yield the same result, which is OK.
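For comparing these numbers with the expected=/actual= strings in the log: as far as I can tell ES prints the checksum in base 36 (Long.toString(digest, Character.MAX_RADIX)), so a small helper converts between the two forms (the values below are the ones from my log):

def to_base36(n):
    # Render an integer the way the ES log prints checksums (e.g. expected=280s5g).
    digits = '0123456789abcdefghijklmnopqrstuvwxyz'
    out = ''
    while n:
        n, r = divmod(n, 36)
        out = digits[r] + out
    return out or '0'

print(int('280s5g', 36))             # the 'expected' value from the log as a plain integer
print(int('1ltduc1', 36))            # the 'actual' value from the log as a plain integer
print(to_base36(int('280s5g', 36)))  # round-trips back to '280s5g'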
I'm aware you mentioned that ES verifies the previously-created checksum before the snapshot is executed; I just wanted to get more familiar with the internals.
Next, I found how the 'tim' file is mapped into the snapshot repository (from a trace log reported by FileRestoreContext):
...recovering [_8rs_Lucene90_0.tim] from [__jWL4NO.....]
I tried the Python command on '__jWL4NO...' and got the same checksum value as reported by VerifyingIndexOutput (actual=1ltduc1).
Does this mean that the backup (snapshot data) is actually corrupted due to some HW problem (NFS issue, RAM, bad sector, etc.)?
Yes, unfortunately, sorry about that. The __jWL4NO..... blob should be byte-for-byte identical to _8rs_Lucene90_0.tim, and it seems that it isn't. Maybe you can use xxd to convert the two files to plain text and then diff them to see what's actually different. I'd be interested to see the results of that analysis; it's not often that we have access to both the original and the corrupted data.
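If xxd output on a file that size gets unwieldy, a rough alternative (just a hypothetical helper, nothing ES-specific) is to scan both files in Python and report the first offset at which they diverge, with a little hex context around it:

def first_difference(path_a, path_b, context=32, chunk=1 << 20):
    # Walk both files chunk by chunk and report where they first diverge.
    with open(path_a, 'rb') as a, open(path_b, 'rb') as b:
        offset = 0
        while True:
            ca, cb = a.read(chunk), b.read(chunk)
            if ca != cb:
                for i, (x, y) in enumerate(zip(ca, cb)):
                    if x != y:
                        pos = offset + i
                        print('first difference at byte', pos)
                        print('a:', ca[max(0, i - context):i + context].hex())
                        print('b:', cb[max(0, i - context):i + context].hex())
                        return pos
                # No differing byte in the common prefix: one file is shorter than the other.
                pos = offset + min(len(ca), len(cb))
                print('files differ in length; the shorter one ends at byte', pos)
                return pos
            if not ca:
                print('files are identical')
                return None
            offset += len(ca)

first_difference('_8rs_Lucene90_0.tim', '<path_to___jWL4NO_blob>')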
The ES cluster contains 3 nodes (running Rocky Linux 9.2); the ES version is 8.10.2.
There is another machine which acts as the 'shared file system repository'; its OS is Rocky Linux 9.0. This machine is the NFS server. I installed nfs-utils 2.5.4 and the protocol version is NFSv4.2.
One entry in /etc/exports looks like this, for example:
<mounted_path> (rw,sync,all_squash,no_subtree_check,anonuid=1000,anongid=1000)
The file system from the repository is mounted on all ES nodes, and the shared file system repository is registered on all 3 ES nodes.
Snapshots are scheduled once per day and have been running for several months now. Kibana reports that the snapshots finish successfully, and I didn't notice errors in the ES logs (I didn't investigate other logs, though).
However, as stated above, there is a problem when restoring a snapshot in a different (testing) environment. The problematic 'tim' file is ~1.5 GB. (There are also other problematic files; for example, one of the data streams contains 7 indices and 4 out of the 7 are restored successfully.)
I used 'diff' to see the actual differences (diff -u); the result shows ~300 lines. Judging by the byte offsets, I see a dozen lines with '-' and a dozen lines with '+' towards the end of the file (not sure if this information is interesting). From the ASCII representation I can recognize familiar data from documents that were indexed. Given that the documents are not that small, it looks like a few docs are missing from the snapshot blob file and a few exist there but don't exist in the tim file (I still need to investigate this better).
Just double-check the client side too, and check you really are using 4.2 consistently.
cat /proc/fs/nfsd/versions
on server side is your friend.
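On the client side (the ES nodes) the negotiated version shows up in the mount options; nfsstat -m prints it, or you can pull it out of /proc/mounts with something like:

# List NFS mounts on this node together with their negotiated protocol version (vers=...).
with open('/proc/mounts') as mounts:
    for line in mounts:
        device, mountpoint, fstype, options = line.split()[:4]
        if fstype.startswith('nfs'):
            vers = [opt for opt in options.split(',') if opt.startswith('vers=')]
            print(mountpoint, fstype, vers)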
However, I think you are still in a tough spot. Because unless someone else spots something you're doing wrong that you can fix, which is truly your best-case scenario, you still ended up with issues. And whether those issues are some subtle bug in Elasticsearch, or in the NFS server or NFS client implementation, or somewhere else, is going to be very difficult to find IMO.
One thing you might try is a different NFS server, say a *BSD box. But if I've got it right, you've worked with this setup for ages without issue and have now hit a problem for the first time. I.e. somewhere in the past, maybe multiple times, you've tested restoring your snapshots and it all worked well. And you've checked the repo regularly via a POST to /_snapshot/the_repo/_verify?pretty.
I hope someone else finds something better here. Good luck.
Clusters should only register a particular snapshot repository once. If you register the same snapshot repository with multiple clusters, only one cluster should have write access to the repository. On other clusters, register the repository as read-only.
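For reference, a minimal sketch of registering the same fs repository read-only on a cluster that must not write to it (repository name taken from this thread; the URL and location are placeholders, and plain HTTP is assumed for brevity):

import requests

# Register the existing shared file system repository with readonly: true on the secondary cluster.
requests.put(
    'http://localhost:9200/_snapshot/transaction-log',
    json={
        'type': 'fs',
        'settings': {
            'location': '/mnt/snapshots',   # placeholder for the NFS mount path
            'readonly': True,
        },
    },
)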
Just for clarification: restore never worked (I've tried restoring several different snapshots, even the first one). I also removed the NFS mounts and recreated them all over again; that didn't help either. I was aware that restore doesn't work, but this issue wasn't a top priority back then and I didn't investigate it in more detail. Now it's back on the table and it's time to find a resolution.
So, I assume a different NFS server or switching to a different repository type may be things to consider, OR continuing the investigation & troubleshooting, which I agree may be very difficult.
I ran the 'verify' API in the past and everything was/is OK. I also saw there is a repository analysis API; I may give that one a shot too.
Rocky Linux isn't a distribution I've used, but I see it is RHEL-based. 9.0 on the server side seems a bit old (released July 2022), so if there is a bug there you'll probably be asked to upgrade anyway. And of course, if you used RHEL itself you could open a support ticket.
There are also other NFS server-side implementations you could try, e.g. Ganesha.
Last, try to reduce this to the smallest possible reproducible problem. Snapshot an index with 1 document and check that the restore works. If so, double the doc count, rinse, repeat. When you hit the problem, capture the traffic on client and server and analyse it.
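A rough sketch of such a loop (hypothetical index name, the repository name from this thread, plain HTTP and a throwaway test cluster assumed; the health check is only a coarse proxy for "the restore worked"):

import requests

ES = 'http://localhost:9200'   # assumption: throwaway test cluster
REPO = 'transaction-log'       # repository name from this thread
INDEX = 'repro-test'           # hypothetical index name

def restore_ok(n_docs):
    # Index n_docs small docs, snapshot them, delete the index, restore it, check health.
    requests.delete(f'{ES}/{INDEX}')
    bulk = ''.join('{"index":{}}\n' + '{"n":%d,"msg":"document number %d"}\n' % (i, i)
                   for i in range(n_docs))
    requests.post(f'{ES}/{INDEX}/_bulk', data=bulk,
                  headers={'Content-Type': 'application/x-ndjson'})
    requests.post(f'{ES}/{INDEX}/_refresh')
    snap = f'repro-{n_docs}'
    requests.put(f'{ES}/_snapshot/{REPO}/{snap}?wait_for_completion=true',
                 json={'indices': INDEX})
    requests.delete(f'{ES}/{INDEX}')
    requests.post(f'{ES}/_snapshot/{REPO}/{snap}/_restore?wait_for_completion=true',
                  json={'indices': INDEX})
    health = requests.get(
        f'{ES}/_cluster/health/{INDEX}?wait_for_status=yellow&timeout=60s').json()
    return health.get('status') != 'red'

n = 1
while restore_ok(n):
    print(n, 'docs: restore OK')
    n *= 2
print('restore failed somewhere around', n, 'docs')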
FTR, no way could this explain it anyway. The __jWL4NO..... filename is random, generated as the file is being written, so you wouldn't get two clusters writing to the same individual file anyway. It is possible to see repository corruption with multiple clusters writing to the same repository, but not corruption like this.
It is interesting indeed, I'd love to see it. It's curious that both the old and new sections of the file have what looks to be valid data. If you'd rather not share it publicly I can set up a private upload link.
Yeah please try that, it does much more interesting checks. The _verify API is unlikely to show anything useful in this case.
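A hedged example of how a run could look (repository name from this thread; the parameter values are just a starting point, not a recommendation):

import requests

# Kick off a repository analysis; blob_count / max_blob_size / concurrency control how much
# work it does, and ?timeout bounds how long the call waits for the analysis to finish.
resp = requests.post(
    'http://localhost:9200/_snapshot/transaction-log/_analyze',
    params={
        'blob_count': 100,
        'max_blob_size': '10mb',
        'concurrency': 10,
        'timeout': '300s',
        'detailed': 'true',
    },
)
print(resp.status_code)
print(resp.json())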
Also try something else like stress-ng --verify --hdd 16 --temp-path ${NFS_MOUNT}. I believe Rocky Linux 9 will pick stress-ng-0.17.08 which is reasonably new.
I will need to double-check with the stakeholders whether I can upload the diff file; if yes, then I'll definitely need a private upload link. I'll let you know.
I started running the repository analysis API. A few runs finished fast, without any issues detected. At some point (after a few iterations and increasing various query params) the analysis stopped after 30 sec with a 'timeout' error, although I had increased the timeout query param.
Strange indeed. AFAICT the timeout in question is the one set by the ?timeout query parameter, so I've no idea why it's failing after 30s if you're setting a much longer value here.
Unfortunately, I wasn't able to reproduce it again, but if I understand correctly this means a 'real failure' in one of the actual analyses that '_analyze' performs (transaction-log is the name of the snapshot repository).
Bit hard to tell from a screenshot, sorry. It'd be better to share it in plain text. But yeah this looks like misbehaving storage.
You're using quite an old version of ES here; I'd suggest trying the analysis with 8.17 these days. It is a little stricter in some ways, but it also reports problems slightly better.