
Comments (17)

kyle-github avatar kyle-github commented on July 22, 2024 1

Thanks for the report!

I will try to replicate it. From the log it looks like it might be very timing dependent. The seg fault is within the IO thread, which is decoupled from anything done on the client side of the API, so that limits where it could be happening. I suspect that the unlock might not have anything to do with it. Calling the destroy function, though...

from libplctag.

kyle-github avatar kyle-github commented on July 22, 2024 1

No worries. The first bug report back in January was definitely a bug in the library. A royal @#% pain to repro, but very real. I was removing a buffer while another thread was still using it. That was the primary fix in 1.5.3 and it came completely from your bug report.

from libplctag.

kyle-github avatar kyle-github commented on July 22, 2024

Do you have the debug log?

from libplctag.

xinthose avatar xinthose commented on July 22, 2024

From gdb? The only command I know is bt. The seg fault happens at the end of the porter log; I don't have a separate debug log. Do you think it could be related to my multi-threaded application? The seg fault happens less often than not when simulating a lost PLC connection.

It's interesting that the first two tags are destroyed fine, but there's an issue with the third. I'm not sure if this is consistent or not.

The plc_tag vector holds tags connected to different PLCs, so I only destroy the ones related to the lost PLC, roughly as in the sketch below.
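
For illustration only (a simplified sketch with made-up names, not my actual code; the tag handle type should match whatever libplctag.h defines in the release you build against):

// Simplified sketch, not my actual code. Names here are hypothetical, and
// the tag handle type (plc_tag) is the 1.x-era opaque typedef; adjust to
// whatever your libplctag.h defines.
#include <libplctag.h>
#include <string>
#include <vector>

struct TagEntry {
    plc_tag     tag;     // handle returned by plc_tag_create()
    std::string plc_ip;  // which PLC this tag talks to
};

// Destroy only the tags that belong to the PLC we lost contact with.
void destroy_tags_for_plc(std::vector<TagEntry> &tags, const std::string &lost_plc_ip)
{
    for (auto it = tags.begin(); it != tags.end(); ) {
        if (it->plc_ip == lost_plc_ip) {
            plc_tag_destroy(it->tag);  // release the library-side resources
            it = tags.erase(it);       // drop our bookkeeping entry
        } else {
            ++it;
        }
    }
}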

from libplctag.

kyle-github avatar kyle-github commented on July 22, 2024

The debug log is what you get when you put &debug=3 at the end of the string that creates the tag.

This dumps all kinds of useful stuff internally along with thread-specific control flow.

I suspect that it is when something is resending at the time that you destroy the tag. I am still trying to figure out how to make my tests do that on demand.

from libplctag.

xinthose avatar xinthose commented on July 22, 2024

OK. I don't see that in the API at all. It outputs more details to cout.

from libplctag.

xinthose avatar xinthose commented on July 22, 2024

I cannot resimulate the error with just one PLC (which is all I'm using now). I will post the debug output once I see it happen again.

from libplctag.

kyle-github avatar kyle-github commented on July 22, 2024

It is not in the API directly. It is one of the elements you can put into the string describing a tag.

"protocol=ab-eip&gateway=...&elem_count=10&name=MyNiftyTag&debug=3"

It dumps out to stderr.

I have had a request to add this as a new API function and I think I will. I do not like adding to the API if I do not need to, but this one was probably a mistake of being too minimal.
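
For example (sketch only: the gateway address below is a placeholder, and the create/read/destroy prototypes shown are the 1.x-era ones, so check libplctag.h in the release you are using for the exact signatures):

// Sketch only. The gateway is a placeholder and the prototypes shown are
// the 1.x-era C API (opaque plc_tag handle); check libplctag.h for the
// exact signatures in your release.
#include <libplctag.h>
#include <string>

int main()
{
    std::string attribs = "protocol=ab-eip&gateway=192.168.1.10"
                          "&elem_count=10&name=MyNiftyTag"
                          "&debug=3";   // <-- turns on the debug log (goes to stderr)

    plc_tag tag = plc_tag_create(attribs.c_str());
    if (!tag) {
        return 1;                       // creation failed
    }

    int rc = plc_tag_read(tag, 5000);   // 5000 ms timeout
    (void)rc;                           // ... check rc and use the data ...

    plc_tag_destroy(tag);
    return 0;
}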

from libplctag.

kyle-github avatar kyle-github commented on July 22, 2024

I still cannot seem to replicate this. There is more going on than timing.

Can you describe the scenario in more detail? How many PLCs? Was one or more down when this happened?

There are two possibilities here that seem the most likely. If the session object is being partially destroyed before it is removed from the IO thread's list, then I can see where a NULL pointer might get dereferenced. If there is an outstanding request when the tag is destroyed, there might be a timing issue where the outstanding request gets resent at just the wrong time.

How repeatable is this?

from libplctag.

kyle-github avatar kyle-github commented on July 22, 2024

Examining the code, I think it was the second of my theories. It looks like it might have been possible to have an outstanding request when plc_tag_destroy() was called. I added some reference count acquire/release and pushed it to a new branch: issue_45_crash_fix.
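
Conceptually the idea is just reference counting on the request object, along these lines (a simplified sketch, not the actual library code):

// Simplified sketch, not the actual library code. The point is that the
// tag and the IO thread each hold their own reference while a request is
// in flight, so a resend can never touch memory that destroy already freed.
#include <atomic>

struct request {
    std::atomic<int> refcount{1};  // creator starts with one reference
    // ... packet buffer, links into the session's request queue, etc. ...
};

request *request_acquire(request *req)
{
    req->refcount.fetch_add(1, std::memory_order_relaxed);
    return req;
}

void request_release(request *req)
{
    // Free the request only when the last holder lets go.
    if (req->refcount.fetch_sub(1, std::memory_order_acq_rel) == 1) {
        delete req;
    }
}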

Can you see if you can recreate the problem with that branch? Thanks for reporting the problem!

from libplctag.

xinthose avatar xinthose commented on July 22, 2024

I see. Thank you and you're welcome for the report. I will be sure to use &debug=3 for any future issues. I need more than one ML1400 to repeat the scenario. It might be a bit before I can test the fix (job already shipped), but I will let you know when I do.

from libplctag.

xinthose avatar xinthose commented on July 22, 2024

Hello again. I'm getting a segmentation fault when a read fails after I unplug the PLC. I recompiled with "&debug=3" on all of my tags and am attaching the cout and cerr output. Everything initializes fine, the code reads 4 times, then fails at 17:09:19.794. I don't have this problem on 32-bit Debian 8.
plctag_seg_fault.txt

Release: 1.5.3
Computer: Ubuntu 16.04, 64-bit
PLC: Qty. 1 of MicroLogix 1400

Thank you.

from libplctag.

kyle-github avatar kyle-github commented on July 22, 2024

Thanks, this should help.

The repro steps are:

  1. start a connection to a MicroLogix 1400 PLC with lots of reads and writes(?).
  2. wait a few seconds for the connection to work a few times.
  3. pull the connection (does this mean turning off the PLC or pulling the Ethernet cable?)
  4. Boom

Does that look correct? I will try to repro this. I have Ubuntu 16.04 x64 as well. I do not have a MicroLogix. I hope it is not dependent on the type of PLC.

I was able to repro something similar with 1.5.2, but only with a WiFi connection and a very loaded ControlLogix. However, that problem completely disappeared with the 1.5.3 fix. I suspect this is something different.

Thanks again for the debug logs and the report!

from libplctag.

xinthose avatar xinthose commented on July 22, 2024

By unplug the PLC, I mean unplug the Ethernet cable, but yes, that looks correct. I have not tested this with any other PLC. You're welcome. We succeed together!

from libplctag.

kyle-github avatar kyle-github commented on July 22, 2024

This is a little confusing. I find no direct signs of a crash in the library itself. In fact, at the time you mentioned, everything looks just fine:

thread(0002) 2017-02-17 17:09:19.793 INFO ok_to_resend:802 Request waited 30ms, need to resend.
thread(0002) 2017-02-17 17:09:19.793 INFO session_check_outgoing_data_unsafe:1043 Requeuing unconnected request.
thread(0002) 2017-02-17 17:09:19.793 INFO session_check_outgoing_data_unsafe:1078 Readying unconnected packet to send.
thread(0002) 2017-02-17 17:09:19.793 INFO request_acquire:71 Acquiring request.
thread(0002) 2017-02-17 17:09:19.793 INFO refcount_acquire:61 Ref count is now 3
thread(0002) 2017-02-17 17:09:19.793 INFO session_check_outgoing_data_unsafe:1086 sending packet, so 1 unconnected requests in flight.
thread(0002) 2017-02-17 17:09:19.793 INFO send_eip_request_unsafe:65 Sending unconnected packet with session sequence ID 399dc77
thread(0002) 2017-02-17 17:09:19.793 INFO send_eip_request_unsafe:93 Sending packet of size 73
thread(0002) 2017-02-17 17:09:19.793 INFO send_eip_request_unsafe:94 00000 6f 00 31 00 f0 f2 49 d5 00 00
thread(0002) 2017-02-17 17:09:19.793 INFO send_eip_request_unsafe:94 00010 00 00 77 dc 99 03 00 00 00 00
thread(0002) 2017-02-17 17:09:19.793 INFO send_eip_request_unsafe:94 00020 00 00 00 00 00 00 00 00 01 00
thread(0002) 2017-02-17 17:09:19.793 INFO send_eip_request_unsafe:94 00030 02 00 00 00 00 00 b2 00 21 00
thread(0002) 2017-02-17 17:09:19.793 INFO send_eip_request_unsafe:94 00040 4b 02 20 67 24 01 07 3d f3 45
thread(0002) 2017-02-17 17:09:19.793 INFO send_eip_request_unsafe:94 00050 43 50 21 0f 00 00 00 67 00 00
thread(0002) 2017-02-17 17:09:19.793 INFO send_eip_request_unsafe:94 00060 01 00 06 08 a0 96 09 94 08 00
thread(0002) 2017-02-17 17:09:19.793 INFO send_eip_request_unsafe:94 00070 00 80 3f 00
thread(0002) 2017-02-17 17:09:19.793 INFO request_release:83 Releasing request.
thread(0002) 2017-02-17 17:09:19.793 INFO refcount_release:91 Refcount is now 2

The above shows that the IO thread noticed that a request in the queue did not get a response within 30ms (this is a dynamically calculated value based on how long it takes for responses to come back from the PLC). So, it sets it up for sending again, sends it, and clears some internal links and releases the resource (refcount--). All that looks good.

thread(0002) 2017-02-17 17:09:19.793 INFO session_check_outgoing_data_unsafe:1057 2 unconnected requests in flight.

And now we are back to the main IO thread loop. So far, all good. Note that the time is 1ms before your crash.

The next block takes place in another thread. In that thread, the write times out and we clean up the tag.

thread(0004) 2017-02-17 17:09:19.794 WARN plc_tag_write:725 Write operation timed out.
thread(0004) 2017-02-17 17:09:19.794 INFO plc_tag_abort_mapped:354 Starting.
thread(0004) 2017-02-17 17:09:19.794 INFO request_release:83 Releasing request.
thread(0004) 2017-02-17 17:09:19.794 INFO refcount_release:91 Refcount is now 1
thread(0004) 2017-02-17 17:09:19.794 INFO plc_tag_abort_mapped:367 Done.
thread(0004) 2017-02-17 17:09:19.794 INFO plc_tag_write:731 Done

So, we abort anything that was outstanding (there were two requests that the IO thread reported before, so one must have been from this tag). We release the request (refcount--), and then we return from plc_tag_write with an error of PLCTAG_ERR_TIMEOUT. All good so far.

This is where I start getting confused. Thread 4 (each line of the log file is prefaced with the thread that was executing when the log entry was written) is never seen again. There are no entries from that thread after this. What I expected to see was something calling plc_tag_destroy.

At least from this, it seems possible that the SEGV or crash problem happened outside the library immediately after the TIMEOUT was returned.

Instead, what I see is this:

thread(0002) 2017-02-17 17:09:19.795 INFO request_release:83 Releasing request.
thread(0002) 2017-02-17 17:09:19.795 INFO refcount_release:91 Refcount is now 0
thread(0002) 2017-02-17 17:09:19.795 INFO refcount_release:94 Calling destructor function.
thread(0002) 2017-02-17 17:09:19.795 INFO session_check_outgoing_data_unsafe:1057 1 unconnected requests in flight.

That is the IO thread noticing that it has an aborted request in its queue (from the tag we just cleaned up) and releasing it. That causes the refcount to drop to zero and we actually destroy (free) the request data. The tag is still around as we never saw a plc_tag_destroy call. But all of its references to outstanding requests were cleaned up. So far, so good.

When I keep looking in the log file, I see a repeating pattern of the IO thread looking at its requests and doing nothing. Eventually 30ms passes and we resend the single remaining request. After a long time, that times out:

thread(0002) 2017-02-17 17:09:20.740 INFO session_check_outgoing_data_unsafe:1057 1 unconnected requests in flight.
thread(0002) 2017-02-17 17:09:20.741 INFO session_check_outgoing_data_unsafe:1057 1 unconnected requests in flight.
thread(0002) 2017-02-17 17:09:20.743 INFO session_check_outgoing_data_unsafe:1057 1 unconnected requests in flight.

That is the IO thread chugging along seeing that it still has a request that is waiting for a response. I.e. we have sent the data in the request and are waiting for the PLC to respond. Note the time.

Then we see a completely different thread give up.

thread(0003) 2017-02-17 17:09:20.743 INFO plc_tag_abort_mapped:354 Starting.
thread(0003) 2017-02-17 17:09:20.743 INFO request_release:83 Releasing request.
thread(0003) 2017-02-17 17:09:20.743 INFO refcount_release:91 Refcount is now 1
thread(0003) 2017-02-17 17:09:20.743 INFO plc_tag_abort_mapped:367 Done.
thread(0003) 2017-02-17 17:09:20.743 INFO plc_tag_read:588 elapsed time 5000ms
thread(0003) 2017-02-17 17:09:20.743 INFO plc_tag_read:591 Done

It looks like I have slightly different debug output for read vs. write. My bad. I need to fix that. But, we clearly time out given the elapsed time printed out. So, as before, we see the request being released and then plc_tag_read returns. The "Done" is printed as the last statement before return, so if things are dying in the library there, we probably have a corrupt stack or something. While that is possible, it seems more likely that something further up the chain is the problem.

This time, it looks like the crash was hard because the library logged nothing. Now, the log is just I/O and that is buffered, so we can lose some output on a crash. However, given that something caught the error (see below), it probably would have flushed stderr automatically.

Here is the immediate next line in the log:

terminate called after throwing an instance of 'int'
on_power_loss:789) >> ended
17-03-17 17:08:55.714464 [info] src/events.cpp(micrologix_read_system_ready:3228) >> system_ready = true
17-03-17 17:09:00.725730 [info] src/events.cpp(micrologix_read_system_ready:3228) >> system_ready = true
17-03-17 17:09:05.731457 [info] src/events.cpp(micrologix_read_system_ready:3228) >> system_ready = true
17-03-17 17:09:10.743108 [info] src/events.cpp(micrologix_read_system_ready:3228) >> system_ready = true
17-03-17 17:09:19.794286 [error] src/plctag.cpp(write_tag:457) >> tag[4] >> plc_tag_write >> status = PLCTAG_ERR_TIMEOUT
17-03-17 17:09:19.794501 [error] src/events.cpp(micrologix_write_heart_beat:3336) >> _plctag.write_tag >> error = -3
17-03-17 17:09:19.948191 [error] src/main.cpp(event_dispatcher:1797) >> MICROLOGIX PLC ERROR
17-03-17 17:09:19.948397 [info] src/events_error.cpp(micrologix_error:201) >> started
17-03-17 17:09:19.948505 [info] src/main.cpp(reset_controller_threads:435) >> controller threads reset
17-03-17 17:09:20.695634 [warning] src/events_other.cpp(error_handling:233) >> SYSTEM FAULT
17-03-17 17:09:20.708312 [info] src/events_other.cpp(error_handling:339) >> error handled
17-03-17 17:09:20.743590 [error] src/plctag.cpp(read_tag:312) >> tag[0] >> plc_tag_read >> status = PLCTAG_ERR_TIMEOUT

That is all your code.

I will try to duplicate this because if it is even a little bit possible that you have discovered a bug in the library, I want to fix it! But, I think you might want to look at the code that is closest to the library API calls and make sure that something weird is not happening there.

The fact that there is something that seems to be killing off threads after control leaves the library makes me wonder if that is the core of the problem... Not sure.

In your code output above, it looks like it catches a write timeout first at 17:09:19.794286. That agrees with the library log.

Then at about 17:09:19.948 it looks like something is starting up...

Then at 17:09:20.695634 we see SYSTEM FAULT. Is that a SEGV? Then we see the read timeout.

In the library log at that time, we see:

thread(0002) 2017-02-17 17:09:20.692 INFO session_check_outgoing_data_unsafe:1057 1 unconnected requests in flight.
thread(0002) 2017-02-17 17:09:20.693 INFO session_check_outgoing_data_unsafe:1057 1 unconnected requests in flight.
thread(0002) 2017-02-17 17:09:20.694 INFO session_check_outgoing_data_unsafe:1057 1 unconnected requests in flight.
thread(0002) 2017-02-17 17:09:20.695 INFO session_check_outgoing_data_unsafe:1057 1 unconnected requests in flight.
thread(0002) 2017-02-17 17:09:20.696 INFO session_check_outgoing_data_unsafe:1057 1 unconnected requests in flight.
thread(0002) 2017-02-17 17:09:20.697 INFO session_check_outgoing_data_unsafe:1057 1 unconnected requests in flight.
thread(0002) 2017-02-17 17:09:20.698 INFO session_check_outgoing_data_unsafe:1057 1 unconnected requests in flight.
thread(0002) 2017-02-17 17:09:20.699 INFO session_check_outgoing_data_unsafe:1057 1 unconnected requests in flight.
thread(0002) 2017-02-17 17:09:20.702 INFO ok_to_resend:802 Request waited 30ms, need to resend.
thread(0002) 2017-02-17 17:09:20.702 INFO session_check_outgoing_data_unsafe:1043 Requeuing unconnected

There's nothing special there. Just the retries on the read request that we are going to time out later. There is no sign of another tag being created or a tag being destroyed.

At the time that your log indicates that something is restarting, there is nothing in the library logs except the IO thread still attempting to send the remaining request. In particular, I do not see any tags being created. The library is not initializing itself (if a new process was started, that would happen). The existing other tag still exists since we see the IO thread gamely trying to send the data.

I will see if I can repro this. Your repro steps are exactly the tests I did in the past, but it is possible that the PLC type is somehow different. I will try with a PLC 5 since that is the closest thing I have to MicroLogix. I will try with my ControlLogix too.

from libplctag.

xinthose avatar xinthose commented on July 22, 2024

Great work, thank you. The "terminate called after throwing an instance of 'int'" could be an uncaught throw in my code. I will re-check it tomorrow. I hope I'm not wasting your time with an issue with my own code.

from libplctag.

xinthose avatar xinthose commented on July 22, 2024

I found the issue. I was not catching the error from the function micrologix_read_system_ready. I thought I was, which is why I could not find the source of the segfault. I hope this was a good exercise at least.
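
In case it helps anyone else, the fix on my side was roughly this (hypothetical sketch; micrologix_read_system_ready is my own function and it throws an int error code, e.g. PLCTAG_ERR_TIMEOUT = -3, when the read fails):

// Rough sketch of the fix on my side (names are from my own code, shown
// here only for illustration).
#include <iostream>

bool micrologix_read_system_ready();  // my function; throws an int error code on failure

void poll_system_ready()
{
    try {
        bool ready = micrologix_read_system_ready();
        std::cout << "system_ready = " << std::boolalpha << ready << "\n";
    } catch (int err) {
        // Without this catch the int escapes the thread and the runtime calls
        // std::terminate -> "terminate called after throwing an instance of 'int'".
        std::cerr << "micrologix_read_system_ready failed, error = " << err << "\n";
    }
}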

from libplctag.
