
Comments (11)

mwinkel-dev commented on September 25, 2024

Hi @rui-coelho,

Just a brief post to acknowledge receipt of your bug report. It is unfortunate that the issue is not reproducible (so far).

Later today, I will post an answer to your question about the error message.

Am I correct that your cluster is running this MDSplus version: alpha-7-96-1 from 8-Jan-2020?

from mdsplus.

mwinkel-dev commented on September 25, 2024

Hi @rui-coelho -- To provide a useful answer to your question, we would need more details.

But here is the generic answer . . .

There is only one occurrence of TreeLOCK_FAILURE in all of the MDSplus source code. Of course, as the error propagates up the call stack, it can surface to the user via various APIs (Python, C/C++, Java, whatever). The error is thrown by the io_lock_local() function of the RemoteAccess.c file, and it occurs when the system fcntl() function fails to obtain a lock on a file.
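To illustrate what "fcntl() fails to obtain a lock" looks like in practice, here is a minimal Python sketch. It is not the MDSplus C code; the helper names try_lock(), child_can_lock() and demo() are hypothetical, and the sketch assumes a Linux system with a local file system that supports fcntl() locks.

```python
import errno
import fcntl
import os
import tempfile


def try_lock(fd):
    """Attempt a non-blocking exclusive fcntl() lock; return True on success."""
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except OSError as exc:
        # EACCES / EAGAIN mean that another process holds a conflicting lock.
        if exc.errno in (errno.EACCES, errno.EAGAIN):
            return False
        raise


def child_can_lock(path):
    """Fork a child process and report whether it could lock `path`."""
    pid = os.fork()
    if pid == 0:
        code = 2  # exit code used if something unexpected goes wrong
        try:
            fd = os.open(path, os.O_RDWR)
            code = 0 if try_lock(fd) else 1
        finally:
            os._exit(code)  # never fall back into the parent's code
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 0


def demo():
    """Lock a scratch file, then let a second process compete for it."""
    with tempfile.NamedTemporaryFile() as tmp:
        parent_locked = try_lock(tmp.fileno())
        child_while_held = child_can_lock(tmp.name)
        fcntl.lockf(tmp.fileno(), fcntl.LOCK_UN)
        child_after_release = child_can_lock(tmp.name)
    return parent_locked, child_while_held, child_after_release
```

In this sketch the second process fails to get the lock only while the first one holds it. Any program that locks a tree file at the wrong moment could plausibly cause the same kind of failure.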

Originally on POSIX systems, file locks were owned by the process rather than by the open file. That could lead to the situation of two different threads of the same process both appearing to hold a lock on the same file simultaneously. That is an undesirable condition and is not thread safe.

Thus, Open File Description locks (aka OFD locks) were added later to prevent that condition. According to the fcntl(2) man page, OFD locks were added in version 3.15 of the Linux kernel.

MDSplus supports both types of locks. It will use OFD locks if it can; otherwise it uses regular locks.
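The "use OFD locks if possible, otherwise regular locks" idea can be sketched in Python as follows. This is only an illustration under stated assumptions, not how MDSplus implements it: lock_whole_file() and demo() are hypothetical names, and FLOCK_FORMAT assumes the 64-bit Linux/glibc layout of struct flock.

```python
import errno
import fcntl
import os
import struct
import tempfile

# Assumption: 64-bit Linux/glibc layout of struct flock
# (short l_type, short l_whence, off_t l_start, off_t l_len, pid_t l_pid).
FLOCK_FORMAT = "hhqqi"


def lock_whole_file(fd):
    """Take a non-blocking write lock on the whole file, preferring OFD locks.

    Returns "ofd" or "posix" to show which mechanism was used.
    """
    # Python only exposes F_OFD_SETLK where the platform headers define it.
    ofd_cmd = getattr(fcntl, "F_OFD_SETLK", None)
    if ofd_cmd is not None:
        # l_pid must be 0 for OFD locks; l_len == 0 means "to end of file".
        request = struct.pack(FLOCK_FORMAT, fcntl.F_WRLCK, os.SEEK_SET, 0, 0, 0)
        try:
            fcntl.fcntl(fd, ofd_cmd, request)
            return "ofd"
        except OSError as exc:
            if exc.errno != errno.EINVAL:
                raise  # a genuine lock conflict or I/O error
            # EINVAL: the running kernel rejected the OFD request, so fall back.
    fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)  # regular per-process lock
    return "posix"


def demo():
    """Lock a scratch file and report which mechanism was chosen."""
    with tempfile.NamedTemporaryFile() as tmp:
        return lock_whole_file(tmp.fileno())
```

Calling demo() on a given machine shows which of the two lock types that system would end up with in this sketch.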

I don't know if CentOS 7 supports OFD locks or not. (If it doesn't, I'm not sure how file locks would work on a CentOS 7 cluster either.) Regardless, the type of locking provided by CentOS 7 would determine which file access scenarios cause MDSplus to throw the TreeLOCK_FAILURE error. For example, without OFD locks it might be that a mundane system administration program and MDSplus are both obtaining locks on the same MDSplus tree file. If CentOS 7 does support OFD locks, then there would be fewer scenarios that trigger the TreeLOCK_FAILURE.

If you succeed in reproducing the error (or at least in capturing additional detail each time it randomly occurs), let us know.

Also, has anything changed on the cluster that might have caused this locking error to appear now?


rui-coelho commented on September 25, 2024

About the version, I only found the name 7.96.1 in our modules and in the MDSPLUS_VERSION environment variable, which is set. I am unsure about the alpha part and the release date. How can I get them?


rui-coelho commented on September 25, 2024

On the CentOS kernel, we are running 3.10.0-693.el7.x86_64


mwinkel-dev commented on September 25, 2024

Hi @rui-coelho,

The easy way to display the MDSplus version is to run TCL's show version command. I'm pretty sure your site is using the alpha release and not the stable release. The reason I asked about the version is simply to confirm that it is alpha.

Here is an example from one of my development systems.

$ mdstcl
TCL> show version

MDSplus version: 7.139.17
----------------------
  Release:  alpha_release_7.139.17
  Browse:   https://github.com/MDSplus/mdsplus/tree/alpha_release_7.139.17
  Download: https://github.com/MDSplus/mdsplus/archive/alpha_release_7.139.17.tar.gz

TCL> 

Based on the output of the show version command, we can look up the associated release date.


rui-coelho commented on September 25, 2024

Thanks for the tip, @mwinkel-dev! Unfortunately, the output doesn't give much clear info, since I get:

<g2rcoelh@s51 ~>module load imasenv/3.38.1
IMAS environment loaded.
<g2rcoelh@s51 ~>mdstcl
TCL> show version


MDSplus version: 1.0.0
----------------------
  Release:  unknown_release-1-0-0
  Browse:   https://github.com/MDSplus/mdsplus/tree/unknown_release-1-0-0
  Download: https://github.com/MDSplus/mdsplus/archive/unknown_release-1-0-0.tar.gz
  Build date: Fri Sep  9 13:30:18 CEST 2022


rui-coelho commented on September 25, 2024

However, the people who installed it have named it 7.96.1, since when looking at the installed module I see:

setenv           MDSPLUS_DIR /gw/swimas/libs/mdsplus/7.96.1/intel/2020
setenv           MDSPLUS_VERSION 7.96.1/intel/2020

Inside the corresponding folder I don't see much useful info, though.


mwinkel-dev commented on September 25, 2024

Hi @rui-coelho,

Thanks for running TCL's show version command and posting the output.

There are two methods for installing MDSplus. Many customers just install the precompiled MDSplus packages (RPM files for CentOS / RHEL and DEB files for Debian / Ubuntu). Some customers instead download the source files from GitHub and compile MDSplus on their computers.

When using the MDSplus packages, show version will display alpha_release- or stable_release- (followed by the associated release, major and minor numbers). I had incorrectly assumed that your site had installed MDSplus from the packages.

The show version output you pasted above indicates that the MDSplus on your CentOS 7 cluster was the result of downloading the MDSplus source files, compiling them on the cluster, and then installing the result. Additional evidence is that the "Build date" is 2022 (which is two years after alpha_release-7-96-1 was issued).

For future reference . . .

Another way to identify the release is to go to the "releases" page of the MDSplus GitHub site. There is a "Find a release" search box in the upper right. Enter 7-96-1 in that box and when the search is done, scroll down the page and eventually you will see that it is alpha_release-7-96-1 from 8-Jan-2020.
https://github.com/MDSplus/mdsplus/releases

Note that because of a quirk in the MDSplus build system, it is possible (but unlikely) to have both an alpha_release-X-Y-Z and a stable_release-X-Y-Z -- however those are two different releases and would likely have different content. Thus when searching for a release, it is best to review the entire list of results. In this specific case, there is only one release with -7-96-1, and it is alpha -- which answers my original question.


mwinkel-dev commented on September 25, 2024

Hi @rui-coelho,

Although it is useful to know that your CentOS 7 cluster is running alpha_release-7-96-1, that information doesn't solve the intermittent locking problem you've encountered.

The more important clue is that the fcntl(2) man page says that OFD locks were introduced in the 3.15 Linux kernel, which is newer than the 3.10.0-693.el7.x86_64 kernel on your CentOS 7 cluster. So apparently, your CentOS 7 cluster only supports regular per-process locks.
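The version comparison above can be sketched in a few lines of Python. The names kernel_version() and mainline_has_ofd_locks() are hypothetical, and the 3.15 threshold is simply the figure quoted from the fcntl(2) man page. Distribution kernels sometimes backport features, so a version check like this is only a first hint, not proof.

```python
import re

# Threshold taken from the fcntl(2) man page statement quoted above.
OFD_MIN_KERNEL = (3, 15)


def kernel_version(release):
    """Extract (major, minor) from a release string such as `uname -r` prints."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return int(match.group(1)), int(match.group(2))


def mainline_has_ofd_locks(release):
    """True if a mainline kernel of this version would include OFD locks."""
    return kernel_version(release) >= OFD_MIN_KERNEL
```

On a live system, the release string could be taken from platform.release() or from uname -r.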

However, the cluster obviously has to support complex workloads from many concurrent users / programs. So it is probable the cluster's file system (GlusterFS? Ceph? other?) has an important role in lock management. And if the cluster's file system exports NFS volumes (that contain MDSplus trees), then NFS certainly is involved in locks.

Which is to say that there isn't enough information at present to troubleshoot the issue with the occasional locking errors.

One way to move forward is simply to collect more information every time the locking problem arises (i.e., capture crash dumps and error logs, note whether any system programs were accessing the tree files at the same time as MDSplus, note what other users were on the system and which programs they were running, and so forth), in the hope that eventually that information will reveal some clues about the root cause of the problem.

Another way forward is to begin evaluating / planning a software upgrade of the cluster. CentOS 7 reaches its "end of life" on 30-Jun-2024 (i.e., no more security patches and bug fixes). Thus eventually your site will probably have to migrate to a newer operating system anyway.
https://blog.centos.org/2023/04/end-dates-are-coming-for-centos-stream-8-and-centos-linux-7/


rui-coelho commented on September 25, 2024

Thanks for the support and detailed explanation @mwinkel-dev. As the cluster will be replaced after summer and will surely get a much more recent kernel, we can wait for that new hardware/software.


mwinkel-dev commented on September 25, 2024

Hi @rui-coelho,

You're welcome. I am glad to read that your site will be replacing the cluster after the summer. If you have any problems with MDSplus on the new cluster, let us know.

