Hi all, I used the SeaChest-Software on Linux system to upgrade from

broken HDD drive after changing to 4kn, about seagate/toolbin

Comments (22)

payback007 commented on June 17, 2024 2

Two HDDs have now been replaced. I used an Ubuntu 20 live USB stick, and reformatting the sector size to 4Kn was possible without any further problems. Thanks for the fast and great support!

from toolbin.

pointerphile commented on June 17, 2024 2

I tried SeaChest_Format --setSectorSize 4096 again yesterday...

Target Disk: ST4000NM002A
OS: Ubuntu 20.04.01
SeaChest_Format Version: 2.2.1-2_1_3 X86_64
From: https://support.seagate.com/seachest/SeaChestUtilities.zip

Result: OK, No Error.

p.s. --setSectorSize 512 works fine.
p.s. Not tested on SeaChest_Lite 1.4.0-2_1_3

vonericsen commented on June 17, 2024 2

@cfelicio,
thanks for the additional information. I haven't tested this on Win11 yet myself, so maybe that affected this. I wouldn't think the earlier release would have affected this, but it is possible. The biggest change between the two was "locking" access to the drive to ensure nothing would interrupt the change in sector size.

I'll ask a few other people who are good with these kinds of user questions which solution they think would be most helpful. I also like to ask the people who find issues like this, because it often helps me understand the use case better and make sure we do what is best for the users. So if you think about it more and want to suggest something, please do!
We'll work on testing this and trying to recreate it so that we can solve it properly.

vonericsen commented on June 17, 2024 2

What we ended up doing is updating the tool to lock the disk and delete the MBR, which prevents some weird cases.
Additionally, the warnings for this option have been reworded to make the potential issues clear.

I'm closing this issue since this has been released in SeaChest and openSeaChest, but please reopen it if you think there is more we need to look into for this issue.

vonericsen commented on June 17, 2024 1

This took a little while to wrap up, but the earlier commits I tagged this issue in should further help mitigate the issue in case something were to happen in the future.
After our internal testing and debugging, we are confident that the versions I mentioned earlier include the correct code to mitigate potential problems. We added this additional code anyway to further ensure things work correctly or, at worst, that the drive is left in a good state when a fast format fails.
There are more details about what this does in Seagate/openSeaChest#54

This same change has been brought into the closed source SeaChest tools as well and those have been uploaded to ToolBin earlier today.

I will leave this open in case @pointerphile or @payback007 have additional issues or feedback related to this issue, but plan to close this issue next week.

cfelicio commented on June 17, 2024 1

I just tried to convert from 512 to 4096 with the latest SeaChest Lite on Windows, and I ended up with a corrupted disk/partition that can't be removed. Fortunately, reverting to 512 is possible without bricking the disk. Steps taken and screenshots are here:

https://carlosfelic.io/misc/how-to-switch-your-seagate-exos-x16-to-4kn-advanced-format-on-windows/

Let me know once it's fixed and I can try again. In the meantime I'll try to redo it with the regular SeaChest.

vonericsen commented on June 17, 2024 1

Well that is really weird.
Changing the sector size is not guaranteed to leave any data recoverable, which is why SeaChest requires you to confirm that the operation may erase data. I'm guessing something was left behind when the drive reformatted itself and Windows was able to read that. Whatever it read, it understood as a large GPT partition, but it probably couldn't do anything with it because the sector size had changed and the LBA values would differ from what they were before.
When you format a disk with GPT, it creates a dummy/protective MBR record to stop old systems from trying to format the drive, so it is possible that was what the system read since the real GPT partition table was no longer able to be found (it is now part of sector 0 after the reformat instead of sector 2 or 3 or wherever it ends up written).
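
To make the protective MBR point concrete, here is a minimal Python sketch (not part of SeaChest, only an illustration) that checks whether a raw copy of sector 0 looks like a GPT protective MBR. It relies on the standard MBR layout: four 16-byte partition entries starting at byte 446, the partition type at byte 4 of each entry, type 0xEE for a GPT protective entry, and the 0x55 0xAA signature at bytes 510-511. The function names are made up for this example.

```python
import struct

SECTOR0_LEN = 512              # the MBR layout occupies the first 512 bytes
PARTITION_TABLE_OFFSET = 446   # first of four 16-byte partition entries
PARTITION_ENTRY_LEN = 16
GPT_PROTECTIVE_TYPE = 0xEE     # partition type used by a GPT protective MBR
BOOT_SIGNATURE = b"\x55\xaa"   # bytes 510-511 of a valid MBR


def has_protective_mbr(sector0: bytes) -> bool:
    """Return True if the buffer looks like a GPT protective MBR."""
    if len(sector0) < SECTOR0_LEN:
        return False
    if sector0[510:512] != BOOT_SIGNATURE:
        return False
    for index in range(4):
        start = PARTITION_TABLE_OFFSET + index * PARTITION_ENTRY_LEN
        partition_type = sector0[start + 4]
        if partition_type == GPT_PROTECTIVE_TYPE:
            return True
    return False


def make_protective_mbr() -> bytes:
    """Build a synthetic protective MBR (demo data, not read from a disk)."""
    sector = bytearray(SECTOR0_LEN)
    # status, CHS start, type, CHS end, first LBA, sector count
    entry = struct.pack(
        "<B3sB3sII",
        0x00, b"\x00\x02\x00", GPT_PROTECTIVE_TYPE, b"\xff\xff\xff",
        1, 0xFFFFFFFF,
    )
    sector[PARTITION_TABLE_OFFSET:PARTITION_TABLE_OFFSET + PARTITION_ENTRY_LEN] = entry
    sector[510:512] = BOOT_SIGNATURE
    return bytes(sector)


if __name__ == "__main__":
    print(has_protective_mbr(make_protective_mbr()))  # synthetic protective MBR
    print(has_protective_mbr(bytes(SECTOR0_LEN)))     # all-zero sector
```

On a real system the buffer would come from reading the first 512 bytes of the raw disk with administrator rights, which is deliberately not shown here.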

You could also try starting an erase on the drive before or after the format to clear out the first few sectors (SeaChest_Erase -d PD? --overwrite 0 --overwriteRange 4096 --confirm....) which would likely also correct this before you reboot. Then you wouldn't need to do the weird partition changes.
I haven't ever seen this done on disks with existing partitions, only raw disks (ignoring the "Format this drive" prompt when attached to the system). SeaChest doesn't need a drive letter or partition to find the disk: it uses a different kind of handle to the drive because it is concerned with the full raw disk, not the partition information.

There are a couple of options to make this part of the operation in SeaChest (Windows only for now):

- Detect whether there is a partition on the disk and tell the user they need to erase the first portion before switching.
- Erase the MBR/GPT table for the user before performing the fast format, in the event that there is a partition.

Do you have an opinion of which you think it should be?
In SeaChest there is a lot of balancing between keeping the user informed and in control versus making things simple, so this one can go either way in my opinion.

cfelicio commented on June 17, 2024 1

Thanks for your quick reply! It's a very bizarre situation. I purchased these disks back in 2021, and I did switch some of them to 4KN back then without any issues, on Windows! Pretty sure they had GPT partitions too. What might have changed:

- I used an older version of SeaChest Lite back then (release date read 25-Feb-2021), compared to the new one (release date read 17-Jun-2021).
- I was using Windows 10 Pro; now I'm using Windows 11.

Aside from that, the disks are all the same (Exos X16 10TB).

I'm not sure what the best way to fix this in the tool is. I assume the vast majority of its audience is technical and aware of the destructive nature of some of the commands; hopefully people are careful and target the correct disks when running these :-)

pointerphile commented on June 17, 2024

Same here. I used SeaChest to set 4Kn in Windows and my ST4000NM002A got bricked.

vonericsen commented on June 17, 2024

Hello,
Sorry to hear about these issues affecting your drives. Would you please reach out to Seagate customer support and reference the Ticket Number 11187884 for further support?

You will find Seagate's contact phone numbers here:
https://www.seagate.com/contacts/

Thank you!

payback007 commented on June 17, 2024

Thank you very much for the reference number. Since it is a newly purchased product, I will send the HDD back to the retailer and give them the reference number. I have already asked for an exchange. Hopefully the replacement will not have this issue!

vonericsen commented on June 17, 2024

Thanks for the update.

As with any software or operating systems, please make sure you are running the latest version.
The SeaChest tools were recently updated and the latest version of SeaChest_Lite is 1.4.0 and the latest version of SeaChest_Format is 2.2.1. These tools report the build date as Feb 25, 2021 so you can also confirm that in addition to the version number. openSeaChest_Format will have a matching version number, but a later build date of Mar 1, 2021.
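
As a rough illustration of confirming the version before running anything destructive, the sketch below compares a reported version string against a minimum. It assumes the tool version is the dotted number before the first "-" in strings such as "2.2.1-2_1_3 X86_64" (the format reported earlier in this thread); the function names are hypothetical and this is not how SeaChest itself checks versions.

```python
import re
from typing import Tuple


def parse_tool_version(version_text: str) -> Tuple[int, ...]:
    """Extract the leading dotted version, e.g. '2.2.1-2_1_3 X86_64' -> (2, 2, 1)."""
    match = re.match(r"\s*(\d+(?:\.\d+)*)", version_text)
    if match is None:
        raise ValueError(f"no version number found in {version_text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def is_at_least(reported: str, minimum: str) -> bool:
    """True if the reported tool version is the minimum version or newer."""
    return parse_tool_version(reported) >= parse_tool_version(minimum)


if __name__ == "__main__":
    # Minimum versions taken from the comment above; the build date still
    # needs to be confirmed by eye.
    print(is_at_least("2.2.1-2_1_3 X86_64", "2.2.1"))
    print(is_at_least("1.3.0", "1.4.0"))
```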

payback007 commented on June 17, 2024

I got the information that the HDDs will be replaced from Seagate's central factory stock, so I will have to do the same with the replacements to get all 4 HDDs to 4Kn format. So far the chance was 50:50: the conversion ran without problems on 2 drives. I used the older SeaChest_Lite version for the first two HDDs and a newer/latest SeaChest_Utilities (with SeaChest_Format) for the other two, always downloaded from the Seagate homepage; with both versions the chance of getting a correct 4Kn drive was still 50:50.

Is there any recommendation about which base system to use for SeaChest? I can use Debian Linux or Windows 10 or, if really needed, a different platform. Is it better to use SeaChest_Lite or SeaChest_Format? I don't want to run into the same situation again because I need a working ZFS backup in the next few days!

vonericsen commented on June 17, 2024

The best advice I can give for configuring any new product before integration into a system is to do it from a Live OS (LiveCD or LiveUSB) to reduce the chance of an installed OS trying to interact with the drive during any part of the configuration process. Also, make sure that low-level configuration commands such as these are performed before writing any partition information to the disk. Data is not guaranteed to be accessible in the same way after changing the sector size, and other things already written to disk may use checksums based on individual sector sizes, which would no longer work properly once the size changes (even if the original data were still accessible).
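
A small arithmetic sketch of why previously written structures stop lining up: the byte offset an LBA refers to is the LBA multiplied by the logical sector size, so the same LBA points somewhere else once the logical sector size changes. The helper names are hypothetical, and this says nothing about what the drive's internal reformat actually does with the old data.

```python
def lba_to_byte_offset(lba: int, logical_sector_size: int) -> int:
    """Byte offset of the first byte of a logical block."""
    if lba < 0 or logical_sector_size <= 0:
        raise ValueError("lba must be >= 0 and sector size must be > 0")
    return lba * logical_sector_size


def lba_count_for_capacity(capacity_bytes: int, logical_sector_size: int) -> int:
    """Number of whole logical blocks that fit in a given capacity."""
    if capacity_bytes < 0 or logical_sector_size <= 0:
        raise ValueError("capacity must be >= 0 and sector size must be > 0")
    return capacity_bytes // logical_sector_size


if __name__ == "__main__":
    # GPT keeps its primary header at LBA 1, so software looking for it reads
    # a different byte offset depending on the logical sector size.
    for size in (512, 4096):
        print(f"sector size {size}: LBA 1 starts at byte {lba_to_byte_offset(1, size)}")

    # The same capacity is addressed by 8x fewer LBAs with 4096-byte sectors.
    capacity = 4096 * 1000  # arbitrary demo capacity in bytes
    print(lba_count_for_capacity(capacity, 512), lba_count_for_capacity(capacity, 4096))
```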
When possible, I would also make sure that the drive and any HBA that it may be attached to have the latest firmware versions to ensure they can understand the change in sector size after it's performed and don't have any other compatibility issues.
To check for Seagate firmware updates, you can put the drive SN into this form and it will show manuals, software, and any available firmware updates.

As for SeaChest_Lite vs SeaChest_Format, the commands work the same way so one is not any better than the other. The code that runs this process is in opensea-operations which both of these tools use so that it works the same.

payback007 commented on June 17, 2024

Ok, thanks for all the information! I will boot my server from a live CD, detach all other HDDs, and use the standard internal SATA port. At first I tried with an HBA controller, which definitely has the latest firmware; it seemed to work, but with the result I already described. We will see what happens. I will give feedback after the format change with the replacement drives.

vonericsen commented on June 17, 2024

Thanks for the update and glad that worked! I want to leave this issue open a little longer as we have been doing a review of the code and some internal testing to make sure this feature is being handled in the tool correctly before we call it completed.

cfelicio commented on June 17, 2024

Ah, I think I found a clue. The disk is in GPT format, if I delete all the partitions and convert it back to MBR before running the command, it works!

When I tried to convert it back to GPT, my computer froze, but after rebooting and trying a 2nd time, success!

zotabee commented on June 17, 2024

> Ah, I think I found a clue. The disk is in GPT format, if I delete all the partitions and convert it back to MBR before running the command, it works!
>
> When I tried to convert it back to GPT, my computer froze, but after rebooting and trying a 2nd time, success!

I have some new Exos X20 20TB drives on Win 11 22H2 x64, and I'm unable to change the sector size to 4096 with SeaChest Lite 1.9.0-4_1_1 X86_64 (latest version). I get an error too.

I didn't try to format the disks in MBR first. But I confirm either in GPT or even with any volume at all, same issue, it ends with an error.

zeroomega commented on June 17, 2024

I don't intend to reopen this issue, but I would like to report a bricked HDD after using SeaChest_Format to change the sector size from 512e to 4Kn.

My command was
./SeaChest_Format_x86_64-alpine-linux-musl_static -d /dev/sg3 --setSectorSize 4096 --confirm ....

However, the command never returned. After it had been stuck for about an hour (the command was said to take only about 5 minutes), I checked the kernel messages and noticed that the ATA link had been reset by the kernel:

Aug 22 20:06:37 misato kernel: [427724.821865] ata9.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
Aug 22 20:06:37 misato kernel: [427724.821890] ata9.00: irq_stat 0x40000001
Aug 22 20:06:37 misato kernel: [427724.821907] ata9.00: cmd a0/01:00:00:00:02/00:00:00:00:00/a0 tag 2 dma 512 in
Aug 22 20:06:37 misato kernel: [427724.821946] ata9: hard resetting link
Aug 22 20:06:38 misato kernel: [427725.137689] ata9: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Aug 22 20:06:38 misato kernel: [427725.137993] ata9.00: configured for UDMA/66
Aug 22 20:06:38 misato kernel: [427725.138097] ata9: EH complete
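
For anyone wanting to spot the same symptom, here is a minimal Python sketch that scans kernel log text for the "hard resetting link" message shown above and reports which ATA port was reset. The message format is taken only from these log lines and may differ between kernel versions; the function name is made up for illustration.

```python
import re
from typing import Iterable, List

# Matches lines such as "... ata9: hard resetting link" and captures "ata9".
RESET_PATTERN = re.compile(r"\b(ata\d+): hard resetting link")


def find_link_resets(log_lines: Iterable[str]) -> List[str]:
    """Return the ATA port name for every hard link reset found in the log."""
    ports = []
    for line in log_lines:
        match = RESET_PATTERN.search(line)
        if match:
            ports.append(match.group(1))
    return ports


if __name__ == "__main__":
    sample = [
        "Aug 22 20:06:37 misato kernel: [427724.821865] ata9.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6",
        "Aug 22 20:06:37 misato kernel: [427724.821946] ata9: hard resetting link",
        "Aug 22 20:06:38 misato kernel: [427725.138097] ata9: EH complete",
    ]
    print(find_link_resets(sample))
```

The input could be the lines of saved kernel message output; how that output is collected is left to the reader.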

I tried to run another SeaChest command, but it got stuck at scanning and would not communicate with the drive that was changing its sector size. The HDD was bricked. The machine is an HPE MicroServer Gen 10 using a Marvell 88SE9230 SATA controller. It is a NAS running Debian 11. The hard drive that failed is an Exos X14 14TB SATA HDD. I had successfully upgraded the firmware using SeaChest on this machine before, so it is very surprising to me that the setSectorSize command would fail.

Although I tried my best to shut down the background processes that might be related to disk I/O, I suspect I missed a process that periodically checks the SMART data of the HDDs, and that it communicated with the hard drive while the setSectorSize operation was ongoing. The kernel then discovered that the hard drive wasn't responding and reset the ATA link in an attempt to make it communicate. This reset probably caused the bricking of the drive.

I made a few attempts to recover the drive. Since the hard drive wasn't responding, I rebooted the machine, but the bricked drive caused the machine to get stuck at POST. So I had to remove it and put it in another machine, a Windows desktop. This time the hard drive showed up in the system; however, it could not be read or written (any attempt resulted in an I/O error). I reran SeaChest_Format --setSectorSize, but it failed and claimed the HDD doesn't have this feature. I found this issue on GitHub and tried to manually invoke the --seagateQuickFormat command (using an older openSeaChest release) mentioned by @vonericsen, but it failed even with the --force flag. Eventually I gave up and requested an RMA from Seagate.

One question and a few suggestions:
Question:

What is the best approach to recover an HDD when an interruption (due to a kernel ATA reset) happens during setSectorSize? I suspect that if I hadn't power cycled the machine to remove the drive, and had found a way to inject the quickFormat command right after the ATA reset happened, there is some chance the HDD could have been saved. The power cycle might be the actual reason why the HDD was bricked and cannot be recovered.

Suggestions:

It might be worth mentioning in the SeaChest_Format warning message that any background process that tries to communicate with the HDD is very likely to cause an ATA link reset on Linux, and that the reset will leave the HDD inoperable and unrecoverable with the SeaChest software. Right now it warns about the risk, but people (like me) might feel they are safe as long as all known background services are shut down.
It might be worth recommending, in the command's warning message, that users (on Linux) run destructive SeaChest commands like setSectorSize from the system's single-user mode or from a Linux live CD. These modes don't have I/O-related background processes and might be very helpful in avoiding the issues I encountered.
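
As a rough pre-flight check along these lines, the sketch below lists processes that currently have a given device node open by walking the Linux /proc/<pid>/fd symlinks. It is an illustration only, with a hypothetical function name: it needs root to see other users' processes, it only catches processes holding the device open at that instant (a periodic SMART poller can easily be missed), and the same disk may be reachable through more than one node (for example /dev/sdX and /dev/sgN), so an empty result does not mean the operation is safe.

```python
import os
from typing import List


def processes_holding(device: str, proc_root: str = "/proc") -> List[int]:
    """Return PIDs that have `device` open right now, per /proc/<pid>/fd."""
    holders = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited or permission denied
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor closed while we were looking
            if target == device:
                holders.append(int(entry))
                break
    return sorted(holders)


if __name__ == "__main__":
    # /dev/sg3 is the device node from the command earlier in this comment.
    print(processes_holding("/dev/sg3"))
```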

vonericsen commented on June 17, 2024

Hi @zeroomega,

Sorry to hear about your drive.

When the command is issued, it is issued with an infinite timeout to wait as long as possible for the drive to return status, and we attempt to "lock" the handle in whatever ways we know of to stop the OS and background processes from communicating with it. I've never found a way to completely block off other processes in the background...even running multiple drives on the same HBA may have this issue: if one drive is still busy and you start the operation on another one, that can trigger the reset...I think this case is happening at the driver level.
It sounds like the system never passed back a completion status, which is why it looked hung for so long. That sounds like a kernel bug to me, since a status should be returned, even a failing one from the link reset. I can look at changing the timeout so that it does not wait forever. The 5 minutes is only an approximate time from our testing, and too short a timeout is also not good since that can trigger a reset as well.

In our latest release and all future releases, the Seagate quick format command is automatically issued after the tool gets back a failing status. This would not have happened in your case since it was still waiting for a response.
In previous versions this command would attempt to run automatically in some cases, but not all, and it also had a standalone option.

> Question:
> What is the best approach to recover an HDD when an interruption (due to a kernel ATA reset) happens during setSectorSize? I suspect that if I hadn't power cycled the machine to remove the drive, and had found a way to inject the quickFormat command right after the ATA reset happened, there is some chance the HDD could have been saved. The power cycle might be the actual reason why the HDD was bricked and cannot be recovered.

In my experience power cycling may not make much difference. I've power cycled drives that were reset mid-format and have been able to recover them, but there is a chance this would have made it more difficult to recover.
The bigger problem is where the drive was in the internal sector size reformat. During this reformat the drive may be touching other important internal areas that keep it running. If the reset occurs while it is writing to a critical area, that may be what caused the problem; if the reset happens while it isn't touching a critical area, the drive may continue showing up OK, even if it still needs the quick format command to recover properly.
It can also depend on what the HBA and OS do to discover the drive. Sometimes the system requires read commands to specific areas to complete without an error before the drive handle is exposed to the system. I have a dead SSD at home that won't show in Linux, but will in Windows (oddly enough), and it only completes an identify command. So it gets a drive handle for about 2 seconds in Linux before disappearing....but this comes and goes between kernel versions in my experience.
Some HBAs also require certain commands to complete successfully before a drive shows up in their BIOS/configuration area for boot, which can sometimes mean the HBA will not see the drive. One example I know of is the PUIS feature on some SAS HBAs that do not understand that the drive requires an extra spinup command, so they will not discover the drive or show it to the OS. That is similar to a bad drive not responding as these HBAs expect, even if the scenario is a little different.

> It might be worth mentioning in the SeaChest_Format warning message that any background process that tries to communicate with the HDD is very likely to cause an ATA link reset on Linux, and that the reset will leave the HDD inoperable and unrecoverable with the SeaChest software. Right now it warns about the risk, but people (like me) might feel they are safe as long as all known background services are shut down.

I will clarify this warning message to help and include that there may be background processes from the OS that are outside of the user's control and may not be able to be stopped.

> It might be worth recommending, in the command's warning message, that users (on Linux) run destructive SeaChest commands like setSectorSize from the system's single-user mode or from a Linux live CD. These modes don't have I/O-related background processes and might be very helpful in avoiding the issues I encountered.

This is a good idea, I will put this suggestion into the help for the option as well as in the warning message dumped onto the screen ahead of issuing the command.

zeroomega commented on June 17, 2024

Thank you very much for the detailed explanation.

> It sounds like the system never passed back a completion status, which is why it looked hung for so long. That sounds like a kernel bug to me, since a status should be returned, even a failing one from the link reset. I can look at changing the timeout so that it does not wait forever. The 5 minutes is only an approximate time from our testing, and too short a timeout is also not good since that can trigger a reset as well.

I suspect the reason SeaChest didn't receive a response after the kernel ATA link reset might be that my HBA is a Marvell 88SE9230 controller instead of a regular integrated Intel or AMD SATA controller. This Marvell controller is not very common on desktop/server motherboards, so the driver implementation might be slightly different and may have bugs like this. Do you mostly test SeaChest SATA commands on Intel/AMD controllers? Does Windows have a similar issue, where the OS performs an ATA link reset when the HDD is not responding?

I probably will not do a 4Kn reformat again after I get a replacement drive. The risk seems too high compared to the limited performance gain over 512e.

vonericsen commented on June 17, 2024

@zeroomega,

> Do you mostly test SeaChest SATA commands on Intel/AMD controllers?

We have tested most systems with the built-in motherboard SATA ports, but I have also tested a couple of Marvell controllers in the past and did not notice a significant difference in that testing. I do not know whether changing the sector size was tested on a Marvell controller, though, which may be why we didn't see this kind of hang before. We also test a lot with SAS/SATA HBAs from companies like Broadcom, Microchip, and others, but we cannot test every adapter...there are too many of them, and behavior sometimes even differs between adapter firmware versions and driver versions.
We try testing what we hear our customers are using most often in their configurations when it is possible. Whenever a bug is reported for a new piece of hardware, we do our best to figure out what the issue is and find a resolution.
I've put in some new HBA specific workarounds for some really unique cases we've recently encountered, but we generally try to code the tool in a way that works without too many specific workarounds so that it "just works" without making it too complicated...when that is not possible we start adding hardware specific workarounds.

> Does Windows have a similar issue, where the OS performs an ATA link reset when the HDD is not responding?

Yes, Windows will also do this. Windows also has a few other quirks that vary between versions as well.
Prior to Windows 10, if you did a firmware update on a SATA/PATA adapter you would most likely get a BSOD (Blue Screen of Death) after the update was done...sometimes immediately, sometimes a little while later when it checked the drive's identify data again. After the automatic BSOD reboot the system would work fine.
I suspect that since a change to the sector size would change the identify data, Windows might get a BSOD on these older versions. It's one of the reasons we suggest rebooting the system after the change is done, to refresh what the OS knows about the drive, since the OS is not going to be aware that the sector size changed.
Windows 10 and later are far less likely to have BSODs from changes like this.
