Skip to content

Conversation

@smoe
Copy link
Collaborator

@smoe smoe commented Jan 23, 2026

Took the patch proposed by @BsAtHome for #3700 and prepared this PR for it. It looks fine to me but I also did not test it. Any takers?

The most important bits are

  • elimination of dynamic memory allocation
  • the declaration of the sessions variable as atomic. It is decreased when the client can no longer be read (within a thread), and increased in sockMain when a new thread is started (outside a thread) and also when the return value of the thread is != 0 (outside a thread).

There are some meta talking points:

  • There is no pthread_join or pthread_detach anywhere.
  • Is the exit(0) upon an acceptance failure (the immediate stop of the controller with no explanation whatsoever) what we truly want?

@smoe smoe added bug 2.9-must-fix must be fixed for the next 2.9 point release 2.10-must-fix labels Jan 23, 2026
@smoe smoe marked this pull request as draft January 23, 2026 11:51
@BsAtHome
Copy link
Contributor

Good to see that you fixed the off-by-one in read() :-)

Seeing pthread_join() normally indicates that the application is well behaved. It clearly is not when exit() is called upon failure. The program is mostly written as a C program masked in a C++ source file. There are many things that should be streamlined, but I'm not sure whether this has any real value at this time. There are many interactions to consider, for example with NLM.
That said, the missing pthread_join() does make it necessary to add pthread_detach() after creation because the resources should be released automatically when a thread ends. It can be added as an else-clause to the if(res != 0) just after the thread is created.

There are also no allowances for signals causing interrupted syscalls. That would simply cause the application to bail and exit.

The exit(0) in failed accept() is interesting because it ignores all network errors that should be treated as EAGAIN (see accept(2)). Your machine will stop accepting (remote) connections when the network wobbles. Is this good/bad/whatever?

That brings us to a general LCNC issue: error handling. There are many more instances in LCNC programs and utilities that are not exactly well-behaved when we talk about error resilience and recovery. But that would require a complete review and well defined strategy how to cope.

@rmu75
Copy link
Collaborator

rmu75 commented Jan 23, 2026

Is this stuff even thread safe? It looks like it uses emcsched.cc, and addProgram there just uses push_back on a std::list... did I muss a lock?. Seems incorrect.

@smoe
Copy link
Collaborator Author

smoe commented Jan 23, 2026

I like to where this cppcheck warning brought us. The pthread_detach should always declare the pthread_t to be reallocatable after it returned, but correct and direct me.

At one of the very last lines, just prior to the invocation of sockMain, another pthread_create is performed. That should be detached, too, but I wait for the clarification on the position of this pthread_detach.

@rmu75, I also just skimmed through emcsched.cc and emcsched.hh - the queueStatus is an enum that may indeed immediately benefit from some thread safety, as in testing for a condition and changing its status should not be interrupted. And the addProgram changes the queue and then triggers a queue sort, which may come as a disturbing inconvenience to anyone iterating through the queue at the same time. Should we have this as a separate issue?

@BsAtHome
Copy link
Contributor

Is this stuff even thread safe? ... Seems incorrect.

You are right! It is incorrect and not thread safe. There are probably many more problems with thread safety. These would not have been seen as the "normal" use is to attach only one remote client to send stuff.

Seen in this light, the choice is to limit to one client only (or use a select/poll construct to serialize access) or to review and fix the entire code path to be thread safe.

@BsAtHome
Copy link
Contributor

I like to where this cppcheck warning brought us. The pthread_detach should always declare the pthread_t to be reallocatable after it returned, but correct and direct me.

A detached thread will have its resources purged when it exits. That means that the return value cannot be retrieved, for which you would have to use a join. In this case, it doesn't matter because no return value is used. Therefore, a simple detach should suffice.

At one of the very last lines, just prior to the invocation of sockMain, another pthread_create is performed. That should be detached, too, but I wait for the clarification on the position of this pthread_detach.

That doesn't matter. That specific thread is the "queue runner" and there is only one during the lifetime of the program. Any resources associated with it will automatically die when the program dies.

@smoe
Copy link
Collaborator Author

smoe commented Jan 23, 2026

We have two separate threads already. One is started in

res = pthread_create(&updateThread, NULL, checkQueue, (void *)NULL);
and the other from within the socksMain we are editing. And both threads are changing the queue, one via parseCommand -> commandSet -> ... -> addProgram, the second via
q.remove(q.front());
.

@smoe
Copy link
Collaborator Author

smoe commented Jan 23, 2026

That doesn't matter. That specific thread is the "queue runner" and there is only one during the lifetime of the program. Any resources associated with it will automatically die when the program dies.

Done it anyway, may eventually avoid false positive memory leaks in valgrind.

@rmu75
Copy link
Collaborator

rmu75 commented Jan 23, 2026

I think the likelihood that concurrency-problems are triggered in real application of schedrmt are very low and in this case it would be acceptable to just document the shortcomings. Introducing a global mutex that is held while processing data from read() should be possible and relatively easy. As no significant amounts of data are moved that should have no performance implications.

Proper way to fix this amounts to a rewrite with boost::asio or equivalent (which would also get rid of the threads btw) (or port to python).

@BsAtHome
Copy link
Contributor

With a global mutex, I think only invocation of parseCommand() in schedrmt.cc:1177 and updateQueue() in schedrmt.cc:1141 need to be protected.

@rmu75
Copy link
Collaborator

rmu75 commented Jan 23, 2026

with something like this

class MutexGuard {
    static pthread_mutex_t mutex_;
public:
    MutexGuard() { while (pthread_mutex_lock(&mutex_)) /* sleep? exit?*/; }
    ~MutexGuard() { pthread_mutex_unlock(&mutex_); }
};

pthread_mutex_t MutexGuard::mutex_ = PTHREAD_MUTEX_INITIALIZER;

it should be only a matter of creating a local MutexGuard object at the start of the functions to be mutex'ed. Don't pthread_exit though.

@BsAtHome
Copy link
Contributor

There are no thread-die-exit-or-else afaics. So just mutex the two functions should handle the problem.

Then, thisQuit() is called from a signal handler and uses exit(0). However, you should use _exit(0) when called from a signal handler.

@smoe smoe force-pushed the emc_race_and_cpcheck_warnings branch 2 times, most recently from e95461e to fc012ca Compare January 25, 2026 14:58
@BsAtHome
Copy link
Contributor

@smoe, did you see my review comments?

@smoe smoe force-pushed the emc_race_and_cpcheck_warnings branch from fc012ca to b820de0 Compare January 25, 2026 15:43
@smoe
Copy link
Collaborator Author

smoe commented Jan 25, 2026

@smoe, did you see my review comments?

Well, yes and no, had not mentally parsed the exit -> _exit one. This is all a bit beyond my daily routines - what is the signal meant to stop? I was wrong to remove the exit from the implementation of shutdown, from how interpret this all now, but what flavour of exit is appropriate when the shutdown code also frees everything?

@smoe
Copy link
Collaborator Author

smoe commented Jan 26, 2026

With a global mutex, I think only invocation of parseCommand() in schedrmt.cc:1177 and updateQueue() in schedrmt.cc:1141 need to be protected.

Don't we need to expand any such global mutex also to emcsched.cc, etc?

@rmu75
Copy link
Collaborator

rmu75 commented Jan 26, 2026

I would keep the mutexes in the main loop of the threads. Blocking read returns data -> acquire mutex -> do stuff -> release mutex -> go back to blocking read of socket.

@smoe
Copy link
Collaborator Author

smoe commented Jan 26, 2026

@rmu75 , @BsAtHome , please kindly help me with a patch.

@smoe
Copy link
Collaborator Author

smoe commented Jan 26, 2026

With a global mutex, I think only invocation of parseCommand() in schedrmt.cc:1177 and updateQueue() in schedrmt.cc:1141 need to be protected.

Don't we need to expand any such global mutex also to emcsched.cc, etc?

I can answer that question for me now - no, we don't, since schedrmt is an application on its own, so there is no other route to invoke the same lower-level routines. Sorry, this seems rather obvious but somehow I needed that.

@smoe
Copy link
Collaborator Author

smoe commented Jan 28, 2026

Why exactly are you const'ifying the arguments? Is the context pointer prone to be misinterpreted and changed in inappropriate ways? Can you please state your case for dong this?

I liked it. It reflects in parts how I think.

On a side note: If there ever was any case for using references, this would be it. That would be a much better solution than to make complex const constructs.

I like references better, indeed.

But, these type of changes are rather irrelevant, do not really improve the code and do not fix real bugs. If you really want to fix the program to be top in class, then you should rewrite the entire code (also mainly because it uses C thinking in a C++ disguise).

Maybe that is something I would indeed want to do.

@smoe
Copy link
Collaborator Author

smoe commented Jan 28, 2026

Please run it with the LCNC cppcheck wrapper scripts/cppcheck.sh. There are several setup thingies in there.

$ cd src
$ ../scripts/cppcheck.sh emc/usr_intf/*.cc

This points to

emc/usr_intf/emcrsh.cc:530:29: warning: Potentially invalid type conversion in old-style C cast, clarify/fix with C++ cast [dangerousTypeCast]
  err = bind(server_sockfd, (struct sockaddr *)&server_address, server_len);
                            ^
emc/usr_intf/schedrmt.cc:331:23: warning: Potentially invalid type conversion in old-style C cast, clarify/fix with C++ cast [dangerousTypeCast]
  bind(server_sockfd, (struct sockaddr *)&server_address, server_len);
                      ^
emc/usr_intf/schedrmt.cc:1213:51: warning: Potentially invalid type conversion in old-style C cast, clarify/fix with C++ cast [dangerousTypeCast]
        int client_sockfd = accept(server_sockfd, (struct sockaddr *)&client_address, &client_len);

which I would then address with reinterpret_cast<sockaddr*> unless suggested otherwise.

If the introduced "const"s for the context are seriously frowned upon then I would remove them. Transitioning to references would come with quite another set of changes.

@BsAtHome
Copy link
Contributor

emc/usr_intf/emcrsh.cc:530:29: warning: Potentially invalid type conversion in old-style C cast, clarify/fix with C++ cast
... which I would then address with reinterpret_cast<sockaddr*> unless suggested otherwise.

Indeed, for C++ you do need that change because cppcheck is picky.

If the introduced "const"s for the context are seriously frowned upon then I would remove them. Transitioning to references would come with quite another set of changes.

I frown upon them (and specifically in the context of fixing these programs). They make no real improvement and just make things look complex. The correct change would be to convert to references.

@smoe smoe force-pushed the emc_race_and_cpcheck_warnings branch from 3117877 to 28a5f88 Compare January 28, 2026 15:56
@rmu75
Copy link
Collaborator

rmu75 commented Jan 28, 2026

I don't understand the mutex decl in emcrsh. If you want to backport the scheduling-stuff, I suggest to split off changes to emcrsh, the remote shell got a bunch of updates since 2.9

@smoe
Copy link
Collaborator Author

smoe commented Jan 28, 2026

I don't understand the mutex decl in emcrsh. If you want to backport the scheduling-stuff, I suggest to split off changes to emcrsh, the remote shell got a bunch of updates since 2.9

The parseCommand also uses strtok, so we should not allow any reentry by a separate thread. Here the diff of emcrsh to master:

--- a/src/emc/usr_intf/emcrsh.cc
+++ b/src/emc/usr_intf/emcrsh.cc
@@ -22,6 +22,8 @@
 #include <sys/socket.h>
 #include <netinet/in.h>
 #include <getopt.h>
+#include <atomic>              // for sessions counter
+#include <mutex>               // parseCommand is not reentrant

 #include "emcglb.h"            // EMC_NMLFILE, TRAJ_MAX_VELOCITY, etc.
 #include "inifile.hh"          // INIFILE
@@ -449,7 +451,7 @@ int enabledConn = -1;
 char pwd[16] = "EMC\0";
 char enablePWD[16] = "EMCTOO\0";
 char serverName[24] = "EMCNETSVR\0";
-int sessions = 0;
+std::atomic_int sessions = 0;
 int maxSessions = -1;

 const char *cmdTokens[] = {
@@ -479,6 +481,8 @@ struct option longopts[] = {
   {"path", 1, NULL, 'd'},
   {0,0,0,0}};

+std::mutex queue_mtx; // controls access to queue
+
 /* format string to outputbuffer (will be presented to user as result of command) */
 #define OUT(...) snprintf(context->outBuf, sizeof(context->outBuf), __VA_ARGS__)

@@ -524,7 +528,7 @@ static int initSocket()
   server_address.sin_addr.s_addr = htonl(INADDR_ANY);
   server_address.sin_port = htons(port);
   server_len = sizeof(server_address);
-  err = bind(server_sockfd, (struct sockaddr *)&server_address, server_len);
+  err = bind(server_sockfd, reinterpret_cast<struct sockaddr*>(&server_address), server_len);
   if (err) {
       rcs_print_error("error initializing sockets: %s\n", strerror(errno));
       return err;
@@ -2917,6 +2921,8 @@ cmdType lookupCommand(char *s)
 // handle the linuxcncrsh command in context->inBuf
 int parseCommand(connectionRecType *context)
 {
+  std::lock_guard<std::mutex> lck(queue_mtx); // Only one thread enters this function at a time.
+                                              // Caveat: Change strtok to strtok_r if this ever changes.
   int ret = 0;
   char *cmdStr;

@@ -3041,15 +3047,9 @@ int sockMain()
         sessions++;
         // enforce limited amount of clients that can connect simultaneously
         if ((maxSessions == -1) || (sessions <= maxSessions)) {
-            pthread_t *thrd;
+            pthread_t thrd;
             connectionRecType *context;

-            thrd = (pthread_t *)calloc(1, sizeof(pthread_t));
-            if (!thrd) {
-                fprintf(stderr, "linuxcncrsh: out of memory\n");
-                exit(1);
-            }
-
             context = (connectionRecType *)malloc(sizeof(connectionRecType));
             if (!context) {
                 fprintf(stderr, "linuxcncrsh: out of memory\n");
@@ -3067,7 +3067,10 @@ int sockMain()
             context->commProt = 0;
             context->inBuf[0] = 0;

-            res = pthread_create(thrd, NULL, readClient, (void *)context);
+            res = pthread_create(&thrd, NULL, readClient, (void *)context);
+            if (!res) {
+                pthread_detach(thrd);
+            }
         } else {
             fprintf(stderr, "linuxcncrsh: maximum amount of sessions exceeded: %d\n", maxSessions);
             res = -1;

@smoe
Copy link
Collaborator Author

smoe commented Jan 28, 2026

No more to add from my side other than testing.

@andypugh
Copy link
Collaborator

andypugh commented Jan 29, 2026

I have something of a mental block with C++ so any review of the code by me would be pointless.
I know that emcrsh is used by several people, and presumably that's just a subset of all the users.

smoe added 3 commits January 29, 2026 01:27
* pthread_detach, error messages
* return EXIT_FAILURE upon failure
* introduced mutex for queue control
* introduced command line usage
@rmu75
Copy link
Collaborator

rmu75 commented Jan 29, 2026

I was a bit confused about the scope of this PR when emcrsh entered the picture.
I checked manually, the changes to emcrsh should also apply to 2.9 (despite there having been many improvements to emcrsh since)

@smoe
Copy link
Collaborator Author

smoe commented Jan 29, 2026

I was a bit confused about the scope of this PR when emcrsh entered the picture.

That is because of thread safety conflating with the initial PR in #3734 (comment) .
We discussed this rather intensively and so I also applied that mutex to the neighbouring code that is very similar in its structure.

And missing pthread_{join,detach} was introduced to this PR in #3734 (comment)

A pthread_detach seems to be also missing for

Error = pthread_create( &thread_socket_server, NULL, (void *(*)(void *))SocketServerTcpMainLoop, (void *)NULL );

Error = pthread_create( &thread_socket_client, NULL, (void *(*)(void *))SocketModbusMasterLoop, (void *)NULL );

int thr_retval = pthread_create(&(_client_tcp_port->threadId), /* ptr to new-thread-id */
(but needs some thinking)
res = pthread_create(&thrd, NULL, readClient, (void *)NULL);

which may indeed be preferable to address in separate, individual PRs.

I never used strtok myself before, which was introduced to this PR in #3734 (comment)
which has an internal state remembering where it was last active so it is passed a NULL in subsequent invocations on the same string. So if my newly acquired insights serve me right, only one single strtok invocation can be executed at the very same time for all threads, and the variant strtok_r now demands that variable to be contributed externally so that problem is gone.

Suspicious invocations of strtok I found in

char* token = strtok(progname_plus_args, " ");

nextdir = strtok(tmpdirs,":"); // first token

s = strtok((char *) iniline, " \t");

char * name = strtok(names, ",");

buff = strtok(work_line, ";");

... others ...
which do not look overly complicated to transition to strtok_r. I would make this a separate strtok_r PR, though.

So, yes, this PR got utterly sidetracked but I loved it and thank @BsAtHome and @rmu75 for training me on it.

I checked manually, the changes to emcrsh should also apply to 2.9 (despite there having been many improvements to emcrsh since)

I would not want to introduce mutexes into a stable version without some serious testing. Maybe it would even be preferable to leave this in 2.9 as it is and have that extra (then tested) safety as a feature for 2.10.

@smoe smoe force-pushed the emc_race_and_cpcheck_warnings branch from 28a5f88 to d6e1674 Compare January 29, 2026 16:07
@smoe
Copy link
Collaborator Author

smoe commented Jan 29, 2026

Rebased to master.

@smoe smoe marked this pull request as ready for review January 29, 2026 17:24
@rmu75
Copy link
Collaborator

rmu75 commented Jan 29, 2026

Suspicious invocations of strtok I found in

char* token = strtok(progname_plus_args, " ");

nextdir = strtok(tmpdirs,":"); // first token

s = strtok((char *) iniline, " \t");

char * name = strtok(names, ",");

buff = strtok(work_line, ";");

... others ...
which do not look overly complicated to transition to strtok_r. I would make this a separate strtok_r PR, though.

I suggest to leave that stuff alone. interp isn't thread safe anyways and can't be made without extensive rearchitecture of the canon interface.

I checked manually, the changes to emcrsh should also apply to 2.9 (despite there having been many improvements to emcrsh since)

I would not want to introduce mutexes into a stable version without some serious testing. Maybe it would even be preferable to leave this in 2.9 as it is and have that extra (then tested) safety as a feature for 2.10.

then we should remove the label "2.9-must-fix".

@smoe smoe removed the 2.9-must-fix must be fixed for the next 2.9 point release label Jan 29, 2026
@smoe
Copy link
Collaborator Author

smoe commented Jan 30, 2026

Hm. I admit not to know about how many threads there are executed in parallel that possibly use strtok, but the problem is real, as in this little test program

#include <stdio.h>  // printf
#include <string.h> // strtok
#include <pthread.h>

const char delim[]=" ,?";
const int num=500;

void * f(void*) {
    for(int i=0; i< num; i++) {
        char a[]="hello world, how are you?";
        char *s1 = strtok(a,delim);
        printf("1: %s\n", s1);
        char *s2 = strtok(NULL,delim);
        printf("2: %s\n", s2);
        char *s3 = strtok(NULL,delim);
        printf("3: %s\n", s3);
    }
    return(nullptr);
}

void * g(void*) {
    char *safeptr = nullptr;
    for(int i=0; i< num; i++) {
        char a[]="hello world, how are you?";
        safeptr = NULL;
        char *s1 = strtok_r(a,   delim,&safeptr);
        printf("1: %s\n", s1);
        char *s2 = strtok_r(NULL,delim,&safeptr);
        printf("2: %s\n", s2);
        char* s3 = strtok_r(NULL,delim,&safeptr);
        printf("3: %s\n", s3);
    }
    return(nullptr);
}

int main() {
    for(int i=0; i<100; i++) {
        pthread_t t;
        int res = pthread_create(&t,NULL,g,NULL);
        if (res) {
           fprintf(stderr,"E: %d -> %d\n", i, res);
        } else {
           pthread_detach(t);
        }
    }
}

produces with strtok (function f)

$ ./t | sort | uniq -c
   2464 1: hello
    203 2: are
     48 2: are you?
    379 2: how
     31 2: how are you?
    395 2: (null)
   1197 2: world
     29 2: world, how are you?
    120 2: you
     29 2: you?
    288 3: are
      1 3: are you
     39 3: are you?
    766 3: how
     36 3: how are you?
    482 3: (null)
    573 3: world
      1 3: world, how are you
     32 3: world, how are you?
    154 3: you
     23 3: you?

and with strtok_r the expected (did not use join to wait, so numbers are not equal)

$ ./t | sort | uniq -c
   2798 1: hello
   2769 2: world
   2737 3: how

and certainly what LinuxCNC is doing is not threadsafe in the sense that it may not be prohibited that the tool table may be edited simultaneously to some thread reading it because some locking mechanism not implemented. What we have here is that someone pressing the button to reread the tool table may lead to some complete nonsense at some other part of LinuxCNC that is auto-updating the contents of an .ini file or whatever. Completely unrelated so it cannot be addressed by protecting critical areas with semaphores. And dangerous. Just because of those routines are using strtok instead of strtok_r..

Since I am for some weird reason enjoying this, I suggest you just let me perform those strtok to strtok_r transitions, and it shall not be to the disadvantage of LinuxCNC.

@rmu75
Copy link
Collaborator

rmu75 commented Jan 30, 2026

It is out of the question that strtok is not thread-safe and concurrent use will mess up the save-pointer or worse. In general, internal linuxcnc interfaces are not thread safe anyway, strtok is not the only problem.

All stuff that is written in python is implicitly mutexed by the global interpreter lock.

The interpreter canon interface has no way to identify any context of a calling interpreter, so it really doesn't make sense to run two interpreters in one process at the same time.

IME concurrency problems do show up in program usage rather sooner than later. The only reason no problems are known in emcrsh and emcsched is because nobody is/was using those with multiple connections doing stuff at the same time.

@rmu75
Copy link
Collaborator

rmu75 commented Jan 30, 2026

Editing and reloading tool table in the same process at the same time from two different threads seems a rather theoretical possibility, why would anybody do that and expect it to work? It won't make sense in any case. Also all stuff that goes through NML also is serialized auomatically, messages are put into a message queue, and one giant semi-endless loop and corresponding switch statement is processing those messages, no danger of anything happening concurrently there.

@BsAtHome
Copy link
Contributor

The question about thread safety is of course a serious one. As said, there are many places that are not thread safe by design or implementation and we have not (yet) seen disasters because most user/usage patterns do not exploit parallelisms that would expose the races.

It should be a long term goal to update and fix all programs to adhere to proper thread safe standards and fix things. But I see no reason to expedite this update process at this moment unless there is a use case that demands it or while we are fixing a particular program anyway. For now, fixing as we get along should be fine.

@rmu75
Copy link
Collaborator

rmu75 commented Jan 30, 2026

Fixing things is of course OK and necessary but we should not begin to redesign stuff to become threadsafe/reentrant just for the heck of it or making stuff multi-threaded needlessly, just the opposite, stuff like emcrsh should be redone to not neet multiple threads IMHO.

@BsAtHome
Copy link
Contributor

Well, I did previously suggest to use a simple select/poll loop to make it simpler.

@smoe
Copy link
Collaborator Author

smoe commented Jan 30, 2026

There is some subtle distinction between multithreading (what in my limited understanding we are mostly discussing) and functions being reentrant (which strtok is not but strtok_r is).

I'll prepare the strtok->strtok_r transition against 2.9. And I am happily anticipating your select/poll loop.

@BsAtHome
Copy link
Contributor

I'll prepare the strtok->strtok_r transition against 2.9. And I am happily anticipating your select/poll loop.

Working on it :-)

@smoe
Copy link
Collaborator Author

smoe commented Feb 1, 2026

I'll prepare the strtok->strtok_r transition against 2.9. And I am happily anticipating your select/poll loop.

Working on it :-)

The select/poll loop, so I hope :-) Have the strtok_r transition prepared locally.

@BsAtHome
Copy link
Contributor

BsAtHome commented Feb 1, 2026

Actually,... I've been doing a rewrite on linuxcncrsh (well, emcrsh.cc) making it single threaded and more of a C++ program. It started out as a small patch, but I went a bit further when I found all the off-by-one and buffer problems. Oh, and strtok is now also history. It is almost done. Currently testing it, smoothing things out and fix inconsistencies.

That whole process uncovered a lot more of funny stuff that may be necessary to do in schedrmt.cc too.

@smoe
Copy link
Collaborator Author

smoe commented Feb 1, 2026

Do you have ideas for including your (some of your) tests as part of the CI ?

@BsAtHome
Copy link
Contributor

BsAtHome commented Feb 1, 2026

Do you have ideas for including your (some of your) tests as part of the CI ?

Not yet, but yes, they need to be added.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Development

Successfully merging this pull request may close these issues.

4 participants