Comments (11)
Hi @rui-coelho,
Just a brief post to acknowledge receipt of your bug report. It is unfortunate that the issue is not reproducible (so far).
Later today, I will post an answer to your question about the error message.
Am I correct that your cluster is running this MDSplus version: alpha-7-96-1 from 8-Jan-2020?
from mdsplus.
Hi @rui-coelho -- To provide a useful answer to your question, we would need more details.
But here is the generic answer . . .
There is only one occurrence of TreeLOCK_FAILURE in all of the MDSplus source code. Of course, as the error propagates up the call stack, it can surface to the user via various APIs (Python, C/C++, Java, and so on). The error is raised by the io_lock_local() function in the RemoteAccess.c file, and it occurs when the system fcntl() function fails to obtain a lock on a file.
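To make that failure mode concrete, here is a minimal Python sketch (not MDSplus code) of a non-blocking fcntl()-style lock attempt that fails because another process already holds the lock. The helper names try_lock, child_can_lock and demo are made up for illustration, and the sketch assumes a POSIX system where os.fork() is available.

```python
import fcntl
import os
import tempfile


def try_lock(fd):
    """Attempt a non-blocking exclusive lock; return True on success."""
    try:
        # lockf() is a thin wrapper around the fcntl() locking calls.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except OSError:
        # Typically EAGAIN or EACCES: another process holds a conflicting lock.
        return False


def child_can_lock(path):
    """Fork a child that tries to lock `path`; report whether it succeeded."""
    pid = os.fork()
    if pid == 0:
        fd = os.open(path, os.O_RDWR)
        os._exit(0 if try_lock(fd) else 1)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 0


def demo():
    """Return (child result while parent holds the lock, result after release)."""
    with tempfile.NamedTemporaryFile() as tmp:
        fcntl.lockf(tmp.fileno(), fcntl.LOCK_EX)
        while_held = child_can_lock(tmp.name)
        fcntl.lockf(tmp.fileno(), fcntl.LOCK_UN)
        after_release = child_can_lock(tmp.name)
    return while_held, after_release
```

Calling demo() runs both cases; the first is the situation that, inside MDSplus, would be reported as a lock failure.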
Originally on POSIX systems, file locks were owned by the process. As a result, two threads in the same process could both obtain a lock on the same file at the same time, because the kernel treats them as a single lock owner. That is an undesirable condition and is not thread safe.
Thus, Open File Descriptor locks (aka OFD locks) were added later to prevent that condition. According to the fcntl(2) man page, OFD locks were added in version 3.15 of the Linux kernel.
MDSplus supports both types of locks. It uses OFD locks if it can; otherwise it falls back to regular locks.
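The "OFD if available, otherwise regular" idea can be sketched roughly as follows. This is an illustration in Python, not the actual MDSplus implementation: lock_exclusive is a hypothetical helper, the struct flock layout is an assumption that holds for 64-bit Linux, and fcntl.F_OFD_SETLK only exists in the Python fcntl module on recent Python versions running on Linux.

```python
import errno
import fcntl
import os
import struct
import tempfile


def lock_exclusive(fd):
    """Lock the whole file without blocking; return which lock type was used."""
    ofd_cmd = getattr(fcntl, "F_OFD_SETLK", None)  # may be missing entirely
    if ofd_cmd is not None:
        # Assumed layout of struct flock on 64-bit Linux:
        # short l_type, short l_whence, off_t l_start, off_t l_len, pid_t l_pid.
        # l_len == 0 means "to end of file"; l_pid must be 0 for OFD locks.
        request = struct.pack("hhqqi", fcntl.F_WRLCK, os.SEEK_SET, 0, 0, 0)
        try:
            fcntl.fcntl(fd, ofd_cmd, request)
            return "ofd"
        except OSError as exc:
            # EINVAL suggests the kernel does not know the OFD command;
            # any other error (e.g. a real lock conflict) is passed on.
            if exc.errno != errno.EINVAL:
                raise
    # Fall back to a traditional per-process lock.
    fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    return "process"
```

On a kernel older than 3.15 one would expect this to return "process"; on newer kernels, "ofd".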
I don't know whether CentOS 7 supports OFD locks. (If it doesn't, I'm not sure how file locks would work on a CentOS 7 cluster either.) Regardless, the type of locking provided by CentOS 7 determines which file access scenarios cause MDSplus to throw the TreeLOCK_FAILURE error. For example, without OFD locks it might be that a mundane system administration program and MDSplus are both obtaining locks on the same MDSplus tree file. If CentOS 7 does support OFD locks, then fewer scenarios would trigger TreeLOCK_FAILURE.
If you succeed in reproducing the error (or at least in capturing additional detail each time it randomly occurs), let us know.
Also, has anything changed on the cluster that might have caused this locking error to appear now?
About the version, I only found the name 7.96.1 in our modules and in the MDSPLUS_VERSION environment variable that they set. I am unsure about the alpha label and the release date. How can I get them?
On the CentOS kernel, we are running 3.10.0-693.el7.x86_64
Hi @rui-coelho,
The easy way to display the MDSplus version is to run TCL's show version command. I'm pretty sure your site is using the alpha release and not the stable release. The reason I asked about the version is simply to confirm that it is alpha.
Here is an example from one of my development systems.
$ mdstcl
TCL> show version
MDSplus version: 7.139.17
----------------------
Release: alpha_release_7.139.17
Browse: https://github.com/MDSplus/mdsplus/tree/alpha_release_7.139.17
Download: https://github.com/MDSplus/mdsplus/archive/alpha_release_7.139.17.tar.gz
TCL>
Based on the output of the show version command, we can look up the associated release date.
Thanks for the tip @mwinkel-dev! Unfortunately, the output doesn't give much clear information, since I get:
<g2rcoelh@s51 ~>module load imasenv/3.38.1
IMAS environment loaded.
<g2rcoelh@s51 ~>mdstcl
TCL> show version
MDSplus version: 1.0.0
----------------------
Release: unknown_release-1-0-0
Browse: https://github.com/MDSplus/mdsplus/tree/unknown_release-1-0-0
Download: https://github.com/MDSplus/mdsplus/archive/unknown_release-1-0-0.tar.gz
Build date: Fri Sep 9 13:30:18 CEST 2022
However, the people who installed it named it 7.96.1. Looking at the installed module, I see:
setenv MDSPLUS_DIR /gw/swimas/libs/mdsplus/7.96.1/intel/2020
setenv MDSPLUS_VERSION 7.96.1/intel/2020
Inside the corresponding folder I don't see much useful information, though.
Hi @rui-coelho,
Thanks for running TCL's show version command and posting the output.
There are two methods for installing MDSplus. Many customers just install the precompiled MDSplus packages (RPM files for CentOS / RHEL and DEB files for Debian / Ubuntu). Some customers instead download the source files from GitHub and compile MDSplus on their computers.
When using the MDSplus packages, show version will display alpha_release- or stable_release- (followed by the associated release, major and minor numbers). I had incorrectly assumed that your site had installed MDSplus from the packages.
The show version output you pasted above indicates that the MDSplus on your CentOS 7 cluster was built by downloading the MDSplus source files, compiling them on the cluster, and then installing the result. Additional evidence is that the "Build date" is 2022 (which is two years after alpha_release-7-96-1 was issued).
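For anyone who wants to script this check, here is a small Python sketch that reads the "Release:" line of show version output and applies the interpretation described above. The function name classify_release and the regular expression are my own; the pattern is only inferred from the two outputs shown in this thread (alpha_release_7.139.17 and unknown_release-1-0-0), so other builds may print something it does not match.

```python
import re

# Matches e.g. "Release: alpha_release_7.139.17" or "Release: unknown_release-1-0-0".
RELEASE_RE = re.compile(
    r"^Release:\s*(alpha|stable|unknown)_release[-_](\d+)[-.](\d+)[-.](\d+)\s*$",
    re.MULTILINE,
)


def classify_release(show_version_output):
    """Return (flavor, (n1, n2, n3), origin), or None if no Release line matches."""
    match = RELEASE_RE.search(show_version_output)
    if match is None:
        return None
    flavor = match.group(1)
    numbers = tuple(int(part) for part in match.groups()[1:])
    # Assumption taken from this thread: packaged builds report alpha/stable,
    # while a local build from source reports "unknown".
    origin = "source build" if flavor == "unknown" else "package"
    return flavor, numbers, origin
```

For example, passing the text captured from mdstcl on the cluster would classify it as a source build, in which case the version numbers in the output say nothing about the actual release.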
For future reference . . .
Another way to identify the release is to go to the "releases" page of the MDSplus GitHub site. There is a "Find a release" search box in the upper right. Enter 7-96-1 in that box and, when the search is done, scroll down the page. Eventually you will see that it is alpha_release-7-96-1 from 8-Jan-2020.
https://github.com/MDSplus/mdsplus/releases
Note that because of a quirk in the MDSplus build system, it is possible (but unlikely) to have both an alpha_release-X-Y-Z and a stable_release-X-Y-Z -- however those are two different releases and would likely have different content. Thus when searching for a release, it is best to review the entire list of results. In this specific case, there is only one release with -7-96-1, and it is alpha -- which answers my original question.
Hi @rui-coelho,
Although it is useful to know that your CentOS 7 cluster is running alpha_release-7-96-1, that information doesn't solve the intermittent locking problem you've encountered.
The more important clue is that the fcntl(2) man page says that OFD locks were introduced in the 3.15 Linux kernel, which is newer than the 3.10.0-693.el7.x86_64 kernel on your CentOS 7 cluster. So apparently, your CentOS 7 cluster only supports regular per-process locks.
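The version comparison can be written out as a short Python sketch. The helper names are hypothetical, and the check is only a heuristic: vendor kernels sometimes backport features, so a version string older than 3.15 is a strong hint rather than proof that OFD locks are missing.

```python
import re

OFD_MIN_KERNEL = (3, 15)  # first kernel with OFD locks, per the fcntl(2) man page


def kernel_version(release):
    """Extract (major, minor) from a `uname -r` style string."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return int(match.group(1)), int(match.group(2))


def may_support_ofd_locks(release):
    """Heuristic: True if the kernel version is at least 3.15."""
    return kernel_version(release) >= OFD_MIN_KERNEL
```

On a live system, platform.release() from the standard library returns the same string as uname -r and can be passed straight to may_support_ofd_locks().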
However, the cluster obviously has to support complex workloads from many concurrent users / programs. So it is probable the cluster's file system (GlusterFS? Ceph? other?) has an important role in lock management. And if the cluster's file system exports NFS volumes (that contain MDSplus trees), then NFS certainly is involved in locks.
Which is to say that there isn't enough information at present to troubleshoot the issue with the occasional locking errors.
One way to move forward is simply to collect more information every time the locking problem arises (i.e., capture crash dumps and error logs, note whether any system programs were accessing the tree files at the same time as MDSplus, note which other users were on the system and which programs they were running, and so forth), in the hope that this information will eventually reveal some clues about the root cause of the problem.
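One possible way to capture "who else holds a lock on this tree file" is to snapshot /proc/locks on the node when the error occurs. The Python sketch below parses that file and filters by inode. The helper names are hypothetical, and there are limits: /proc/locks is Linux-specific and only lists locks known to the local kernel, so on NFS or a cluster file system a lock held from another node would not appear.

```python
import os


def parse_proc_locks(text):
    """Parse /proc/locks content into a list of dicts (kind, mode, pid, inode)."""
    entries = []
    for line in text.splitlines():
        # Blocked waiters are listed with an extra "->" marker; drop it.
        fields = [field for field in line.split() if field != "->"]
        if len(fields) < 8:
            continue
        # fields[5] looks like "major:minor:inode"; the inode is decimal.
        inode = int(fields[5].rsplit(":", 1)[-1])
        entries.append({
            "kind": fields[1],       # POSIX, FLOCK or OFDLCK
            "mode": fields[3],       # READ or WRITE
            "pid": int(fields[4]),   # -1 for OFD locks
            "inode": inode,
        })
    return entries


def locks_on_inode(text, inode):
    """Return only the entries that refer to the given inode number."""
    return [entry for entry in parse_proc_locks(text) if entry["inode"] == inode]


def lock_holders(path):
    """Snapshot the local locks currently held on `path` (Linux only)."""
    with open("/proc/locks") as handle:
        return locks_on_inode(handle.read(), os.stat(path).st_ino)
```

Logging the result of lock_holders() for the tree's files, together with a timestamp, each time TreeLOCK_FAILURE appears would show whether a local process was holding a conflicting lock at that moment.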
Another way forward is to begin evaluating / planning a software upgrade of the cluster. CentOS 7 reaches its "end of life" on 30-Jun-2024 (i.e., no more security patches and bug fixes). Thus eventually your site will probably have to migrate to a newer operating system anyway.
https://blog.centos.org/2023/04/end-dates-are-coming-for-centos-stream-8-and-centos-linux-7/
Thanks for the support and detailed explanation @mwinkel-dev. As the cluster will be replaced after summer and will surely get a much more recent kernel, we can wait for that new hardware/software.
Hi @rui-coelho,
You're welcome. I am glad to read that your site will be replacing the cluster after the summer. If you have any problems with MDSplus on the new cluster, let us know.