particular / nservicebus.distributor.msmq
Home Page: https://docs.particular.net/nservicebus/scalability-and-ha/distributor/
License: Other
Versions: 4.0.2 & 4.7.5
We've been having some problems with a distributor working correctly in our production environment. We noticed that when work was completed on a worker, it would send 2 worker ready messages back to the distributor. This had the side effect of work getting queued up on some workers while other workers were performing no processing, because the worker ready messages in the distributor storage queue were inaccurate.
It took a while to debug and determine where the problem was. I initially upgraded from 4.0.2 to 4.7.5 to see if the new NServiceBus.Distributor.MSMQ package and session ids resolved the issue, which they did not.
We then noticed that on the endpoint at hand, we had disabled transaction wrapping because we needed information out of the endpoint before it was done processing. We were disabling the transaction scope with:
Configure.Transactions.Enable().Advanced( x => x.DoNotWrapHandlersExecutionInATransactionScope() );
The issue was resolved by removing this config and restructuring the worker to send the information we needed before the message was sent to the distributor. I'm opening a thread here just to bring the issue to light, see if you are aware of it or whether there is a way to resolve it, and perhaps help someone in the future.
It seems like the worker ready mechanism sends a "worker ready" message any time a message is sent from an endpoint. I'm assuming the definition of a message being sent comes from the transaction scope, so that one completed transaction covering all sends (we were sending 2 messages from the handler on the worker) produces only one worker ready message. Once transaction wrapping is disabled, every message sent seems to produce a worker ready message, as I'm assuming NSB has no way to discern whether the message being sent comes from the same handler invocation.
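For illustration, here is a minimal sketch of the handler shape in question (all type names are hypothetical, not our actual code):

using NServiceBus;

// A worker handler that sends two messages. With the default transaction
// scope, both Sends commit together with the receive as one unit; with
// DoNotWrapHandlersExecutionInATransactionScope() each Send is dispatched
// on its own, which appears to produce a worker ready message per send.
public class WorkItemHandler : IHandleMessages<WorkItem>
{
    public IBus Bus { get; set; }

    public void Handle(WorkItem message)
    {
        Bus.Send(new StepOneDone());
        Bus.Send(new StepTwoDone());
    }
}

public class WorkItem : ICommand { }
public class StepOneDone : ICommand { }
public class StepTwoDone : ICommand { }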
Hopefully this helps someone in the future; it took us a little while to determine what was going on, and in the end it turned out to be obvious.
Thanks!
Sean Chambers
Originally from: https://groups.google.com/forum/#!topic/particularsoftware/BMu5TBAE5YA
When the Message Queuing service is restarted, the distributor endpoint stops processing messages. This can cause outages in a production environment.
When using one of the following distributor profiles
When the Message Queuing service is unavailable, the distributor endpoint should shut down just as any other endpoint does when not using either of the master profiles.
When MSMQ is restarted in production, the expected behavior is that all services that depend on it are restarted too. This can be configured with sc.exe:
NOTE: This will not restart your service when MSMQ is killed or crashes, only when you restart MSMQ via services.msc or sc.exe start/stop.
DESCRIPTION:
Modifies a service entry in the registry and Service Database.
USAGE:
sc <server> config [service name] <option1> <option2>...
OPTIONS:
NOTE: The option name includes the equal sign.
A space is required between the equal sign and the value.
type= <own|share|interact|kernel|filesys|rec|adapt|userown|usershare>
start= <boot|system|auto|demand|disabled|delayed-auto>
error= <normal|severe|critical|ignore>
binPath= <BinaryPathName to the .exe file>
group= <LoadOrderGroup>
tag= <yes|no>
depend= <Dependencies(separated by / (forward slash))>
obj= <AccountName|ObjectName>
DisplayName= <display name>
password= <password>
In this scenario:
sc.exe config [servicename] depend= MSMQ/MSDTC
This will make your service dependent on both MSMQ and MSDTC. Mind to use / and not , to separate services (note also the space after the equal sign, as described in the help text above).
If you restart either of those services then your endpoint will be restarted too.
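To confirm the dependency took effect, query the current service configuration (qc is a standard sc.exe subcommand) and check that MSMQ and MSDTC appear under DEPENDENCIES:
sc.exe qc [servicename]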
Installation of NServiceBus fails with an exception thrown by MsmqWorkerAvailabilityManager saying it can't access the queue.
All users of the external distributor (NServiceBus.Distributor.MSMQ) who want to use a non-default container (other than Autofac).
Other containers order components for multiple registrations differently than Autofac. This changes the order of installers, making the satellite queue creator run before the distributor persistence queue creator. The former tries to instantiate the distributor satellite, which in turn tries to instantiate the availability manager, causing the exception.
The Start() method was removed from IWorkerAvailabilityManager.
The distributor stops dispatching messages to workers after it restarts (same machine or another cluster node) if all workers are consuming messages at the moment the distributor stops / quits / crashes.
Ready messages sent to the distributor after a restart are ignored.
Take the following steps to reproduce this issue:
Now the backlog will not be processed and messages pile up.
You have to restart the worker so that it re-enlists itself with the distributor, but this obviously is not sufficient for when the distributor restarts or rolls over to a different node in a cluster.
The solution is to not ignore the ready messages sent by the worker.
A minor issue is that when the worker is restarted, it could temporarily have more messages in its queue than the configured concurrency limit. This is due to the ready message sent at start, which instructs the distributor to enlist the worker for work.
Based on our release policy the following NServiceBus.Distributor.MSMQ versions need patching:
Raised by @jdrat2000
Migrated from Particular/NServiceBus#1966
A customer pointed out that they were running a distributor with N workers. A configuration error was causing excessive ready messages, many hundreds in some cases. This had the effect of bringing down worker endpoints due to storage issues.
It turns out that in this case the cause was a configuration error that had senders accidentally sending messages to workers directly, not through the distributor as they should have. The workers sent ready messages to the distributor for more work, leading to a snowball effect of sorts.
Not sure I'd call this a bug, but would it make sense to safeguard against this behavior? Perhaps messages not from the distributor could cause a warning or exception, and the storage queue could enforce a unique constraint per worker and throw an exception?
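As a rough illustration of such a safeguard, a worker-side check could warn when an incoming message did not come through the distributor. This is only a sketch of the idea; the header name below is invented, not the actual distributor contract:

using NServiceBus;
using NServiceBus.Logging;
using NServiceBus.MessageMutator;

// Sketch only: warn when a message appears to bypass the distributor.
// "NServiceBus.Distributor.Forwarded" is a hypothetical header name.
public class DistributorOriginCheck : IMutateIncomingTransportMessages
{
    static readonly ILog Logger = LogManager.GetLogger(typeof(DistributorOriginCheck));

    public void MutateIncoming(TransportMessage transportMessage)
    {
        if (!transportMessage.Headers.ContainsKey("NServiceBus.Distributor.Forwarded"))
        {
            Logger.Warn("Message received directly, bypassing the distributor.");
        }
    }
}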
The desk case number for this was 3968. The customer was on 4.0.4.
Endless log entries of type Session ids for Worker at '{0}' do not match, so popping next available worker until the distributor for the endpoint is restarted.
Originally reported on the Google group.
The distributor tries to pop a worker, but the worker has a session id not equal to the active session id stored by the distributor for that worker address. An info entry is written to the log with Session ids for Worker at '{0}' do not match, so popping next available worker. It tries the next worker. The same thing happens for all workers in the storage queue. No worker with the active session id is found. The end result is the transaction, including both the message and the expired workers popped from the storage queue, rolling back. The entire thing repeats forever unless the distributor is restarted, which clears the active session id cache. The root cause is that restarting a worker and then restarting its distributor will clear the active session id cache for that worker, and no new messages will be sent out.
How to know if you're affected:
Session ids for Worker at X do not match, so popping next available Worker.
Workaround:
What to do if you're affected:
Raised by @johnsimons
Migrated from Particular/NServiceBus#1756
As part of #1732 we are going to replace MasterNodeConfig with a new separate DistributorConfig:
<DistributorConfig
    Node="localhost"
    QueueName="orders.handler">
</DistributorConfig>
The above configuration will result in:
Raised by @jdrat2000
Migrated from Particular/NServiceBus#2000
When an endpoint starts and fails, logging only indicates a very general error, for example: The queue does not exist or you do not have sufficient permissions to perform the operation. Full stack trace below.
In this case a customer, desk case 4156, was getting this error while troubleshooting startup on a cluster instance. It turned out the problem was simply a missing queue.
Could the logging be improved to include the name of the queue?
2014-02-28 09:38:46,410 [5] FATAL NServiceBus.Satellites.SatelliteLauncher - Satellite NServiceBus.Distributor.DistributorSatellite, NServiceBus.Core, Version=4.2.0.0, Culture=neutral, PublicKeyToken=9fc386479f8a226c failed to start.
System.Messaging.MessageQueueException (0x80004005): The queue does not exist or you do not have sufficient permissions to perform the operation.
at System.Messaging.MessageQueue.MQCacheableInfo.get_Transactional()
at System.Messaging.MessageQueue.get_Transactional()
at NServiceBus.Transports.Msmq.WorkerAvailabilityManager.MsmqWorkerAvailabilityManager.Start() in :line 0
at NServiceBus.Satellites.SatelliteLauncher.StartSatellite(SatelliteContext context) in :line 0
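For illustration, a sketch of the kind of wrapping that would surface the queue name (the helper and the queueName parameter are hypothetical, not the actual NServiceBus code):

using System;
using System.Messaging;

static class QueueChecks
{
    // Re-throws MSMQ access failures with the queue name attached, which is
    // the detail missing from the FATAL entry above.
    public static bool IsTransactional(MessageQueue queue, string queueName)
    {
        try
        {
            return queue.Transactional;
        }
        catch (MessageQueueException ex)
        {
            throw new InvalidOperationException(
                string.Format("Unable to access queue '{0}'. Check that it exists and that the service account has permissions.", queueName),
                ex);
        }
    }
}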
Raised by @kami1eon
Migrated from Particular/NServiceBus#2029
We noticed that in NSB 4.4.2 the Worker profile interface has been deprecated.
As we use the IHandleProfile interface only for worker-specific scenarios, after changing over to MSMQWorker and invoking the NServiceBus host with the NServiceBus.Integration NServiceBus.MSMQWorker profiles, we have noticed that after the initial "Hey distributor, I am available for work, got 10 slots" there is no more begging for additional graft (no sign of "Worker checked in with available capacity: X").
When we use the deprecated Worker interface, everything is fine.
Has anyone seen this before? Am I missing something...
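For reference, the host invocation described above looks like this:
NServiceBus.Host.exe NServiceBus.Integration NServiceBus.MSMQWorker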
With the split of the Gateway from the core, the distributor either has to take a hard dependency on the new Gateway dll/nuget, or we do not enable the Gateway for the Master profile automatically.
When a worker is stopped it should signal the distributor and drain its queue.
Changes
When disabling transactions on the distributor or master role, messages are not forwarded to the workers. The .storage queue does not contain messages after the distributor receives a ready message from the worker.
busConfiguration.Transactions().Disable();
NServiceBus.Distributor.MSMQ.MsmqWorkerAvailabilityManager Worker at 'Samples.Scaleout.Worker2@RAMON-HOME' has been registered with 1 capacity.
Workers can run with transactions disabled, but the distributor and master role do not allow this. Only disable transactions on the worker nodes.
If you want to use the 'master' role then set up a worker process on the same machine as the distributor, where the distributor has transactions enabled and the worker has transactions disabled.
Make sure the worker uses the following additional settings:
<appSettings>
  <add
    key="NServiceBus/Distributor/WorkerNameToUseWhileTesting"
    value="Samples.Scaleout.Worker1" />
</appSettings>
<MasterNodeConfig Node="localhost" />
<UnicastBusConfig
    DistributorControlAddress="Samples.Scaleout.Server.Distributor.control@localhost"
    DistributorDataAddress="Samples.Scaleout.Server@localhost"/>
https://github.com/Particular/docs.particular.net/tree/master/samples/scaleout/distributor/Version_5
We will not change this behavior. Running without transactions can result in loss of messages. If an error occurs during the processing of a message, the message will not be retried and will not be forwarded to the error queue. This unreliable behavior is unwanted from the perspective of the distributor.
Workers can run without transactions, and they are the actual processors of the messages. The workaround provides a way to run the workers without transactions, and to run a worker on the same machine as the distributor with the two in different transactional modes.
Please be aware that when processing fails, the message will not be retried, will not be forwarded to the error queue, and will be gone.
https://groups.google.com/forum/#!msg/particularsoftware/c3-zS5dJ61Q/pCOQaL0tBAAJ
When using the new MSMQDistributor and MSMQWorker profiles, the distributor no longer sends messages to the workers after it has been restarted. When a message is received it clears out the storage queue and doesn't process the message. When the worker is restarted it works as expected.
Steps to reproduce:
Raised by @andreasohlund
Migrated from Particular/NServiceBus#2014
2-4 is usually the optimum. Let's see if we can repro.
Original question on the group: https://groups.google.com/forum/#!msg/particularsoftware/bRuh8wyOSZU/ZztGZAk3d3MJ
When the MasterNodeConfig section is not present, the following error is shown:
2015-12-15 11:52:35.033 ERROR NServiceBus.GenericHost Exception when starting endpoint. System.Configuration.ConfigurationErrorsException: 'MasterNodeConfig.Node' entry should point to a valid host name. Currently it is: [].
The MasterNodeConfig section is not present at all.
To reproduce, use the scaleout sample, comment out the MasterNodeConfig of a worker, and comment out the NServiceBus/Distributor/WorkerNameToUseWhileTesting app setting.
Because the IProfile is now in the host dll, this means that either:
@andreasohlund which one?
Currently, when running as a worker node, if the master node is configured to be on the same machine as the worker, an exception is thrown:
How do we configure the worker to override this?
Repro:
Use the scale out sample in the Update branch:
https://github.com/Particular/NServiceBus.Msmq.Samples/tree/update/ScaleOut
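One possible lead, based on the workaround shown earlier in this document (unverified for this repro): the NServiceBus/Distributor/WorkerNameToUseWhileTesting app setting is what allows a worker to run on the same machine as the distributor, e.g. (the value is a placeholder):

<appSettings>
  <add
    key="NServiceBus/Distributor/WorkerNameToUseWhileTesting"
    value="MyWorker" />
</appSettings>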
Reported by user:
Currently using NServiceBus 4.6.3, but this has also happened with earlier 4.x versions, and I'm wondering if anyone else has seen this before or has an idea. I've Googled this for a while and came up with nothing.
I have a distributor and 2 workers, all on different servers. The distributor does not run an endpoint itself. Most of the time all is well, but sometimes when a request comes in, the distributor will log the message "Session ids for Worker at 'Endpoint@Server' do not match, so popping next available worker" for both workers as fast as it can and no longer processes anything. It appears that maybe it's not actually removing the message and keeps reading the same one over and over. I have to stop the services, clear the distributor storage queues, and restart the services to get it running again. It may run for hours or even days just fine, but will eventually do this again.
The exception at
And when we log it we should at least make sure that we don't throw away the exception details by calling the incorrect method on the logger: Logger.InfoFormat("NextAvailableWorker Exception", e) is wrong. We want Logger.Info("NextAvailableWorker Exception", e).
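To make the difference concrete: with log4net-style loggers, InfoFormat treats the exception as a format argument, and with no {0} placeholder it is silently dropped, while the Info(message, exception) overload logs the full details. A minimal sketch, assuming the NServiceBus.Logging ILog API:

using System;
using NServiceBus.Logging;

class WorkerPopLogging
{
    static readonly ILog Logger = LogManager.GetLogger(typeof(WorkerPopLogging));

    static void LogPopFailure(Exception e)
    {
        // Wrong: 'e' is only a format argument here; without a {0}
        // placeholder the exception and its stack trace never reach the log.
        Logger.InfoFormat("NextAvailableWorker Exception", e);

        // Right: the (message, exception) overload preserves the details.
        Logger.Info("NextAvailableWorker Exception", e);
    }
}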
Any customer attempting to scale out MSMQ as a master node with an attached worker.
Second level retries do not work when the endpoint is deployed as a master node with an attached worker and configured with a MasterNodeConfig section pointing to the local server. The message is sent to the EndpointName.Retries queue, but that queue either does not exist, or exists but the endpoint does not receive messages from it. As a result, faulting messages are passed to SLR but then never reprocessed again, or an exception is thrown that the retries queue does not exist.
Bug found by @dvdstelt when working with a customer, and verified by me. Here is a minimal repro:
NServiceBus.Host.exe NServiceBus.MSMQMaster
WARN NServiceBus.Faults.Forwarder.FaultManager Message with 'c449a8c4-5bc6-4569-a23b-a64a00ad70fa' id has failed FLR and will be handed over to SLR for retry attempt 1.
If you remove the MasterNodeConfig section from the App.config, it will work, advancing through all SLR levels and eventually forwarding the message to the error queue.
using System;
using NServiceBus;

namespace SlrBugRepro
{
    public class EndpointConfig : IConfigureThisEndpoint
    {
        public void Customize(BusConfiguration configuration)
        {
            configuration.UsePersistence<InMemoryPersistence>();
            configuration.EnableInstallers(); // So queues are created when run from cmdline
        }
    }

    public class Startup : IWantToRunWhenBusStartsAndStops
    {
        public IBus Bus { get; set; }

        public void Start()
        {
            Bus.Send("SlrBugRepro", new TestCmd());
        }

        public void Stop() { }
    }

    public class TestCmd : ICommand { }

    public class TestHandler : IHandleMessages<TestCmd>
    {
        public void Handle(TestCmd message)
        {
            Console.WriteLine("Attempting TestCmd");
            throw new Exception("Boom");
        }
    }
}
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<configuration>
<configSections>
<section name="MessageForwardingInCaseOfFaultConfig" type="NServiceBus.Config.MessageForwardingInCaseOfFaultConfig, NServiceBus.Core" />
<section name="TransportConfig" type="NServiceBus.Config.TransportConfig, NServiceBus.Core" />
<section name="SecondLevelRetriesConfig" type="NServiceBus.Config.SecondLevelRetriesConfig, NServiceBus.Core" />
<section name="MasterNodeConfig" type="NServiceBus.Config.MasterNodeConfig, NServiceBus.Core" />
</configSections>
<MessageForwardingInCaseOfFaultConfig ErrorQueue="error" />
<TransportConfig MaxRetries="2" MaximumConcurrencyLevel="1" MaximumMessageThroughputPerSecond="0" />
<SecondLevelRetriesConfig Enabled="true" NumberOfRetries="3" TimeIncrease="00:00:5" />
<MasterNodeConfig Node="DAVIDBOIKE0C30" />
</configuration>
Raised by @andreasohlund
Migrated from Particular/NServiceBus#1279
Have each worker send the calculated critical time as a header in the control message going back to the distributor, so that the distributor can populate its counter. This allows for centralized monitoring of SLAs by just looking at the distributor.
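A rough sketch of the idea (the header name, the store, and the NServiceBus 4-era mutator signature are assumptions for illustration, not the actual implementation):

using System;
using NServiceBus;
using NServiceBus.MessageMutator;

// Stamps the worker's last measured critical time on outgoing transport
// messages so the distributor could feed its performance counter.
// "NServiceBus.Distributor.CriticalTime" is an invented header name.
public class CriticalTimeHeaderMutator : IMutateOutgoingTransportMessages
{
    public static TimeSpan LastMeasuredCriticalTime { get; set; } // hypothetical store

    public void MutateOutgoing(object[] messages, TransportMessage transportMessage)
    {
        transportMessage.Headers["NServiceBus.Distributor.CriticalTime"] =
            LastMeasuredCriticalTime.ToString();
    }
}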
The worker node is not reporting ready status after a message is sent to SLR. This happens because in that case the FinishedMessageProcessing event is not raised.
The solution is to change the ready-state reporter so that it subscribes to StartedMessageProcessing instead.
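A simplified sketch of that change (illustrative, not the actual distributor source):

using NServiceBus.Unicast.Transport;

// Reports readiness when processing starts rather than when it finishes,
// since the SLR hand-off path skips the FinishedMessageProcessing event.
class ReadyMessageReporter
{
    public ReadyMessageReporter(ITransport transport)
    {
        transport.StartedMessageProcessing += (sender, e) => SendReadyMessage();
    }

    void SendReadyMessage()
    {
        // send a capacity-1 ready control message back to the distributor
    }
}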
Raised by @JeffHenson
Migrated from Particular/NServiceBus#2518
NServiceBus 4.7.1
NServiceBus.Distributor.MSMQ 4.4.6
We had an issue this morning when one of our database servers took a dive, causing some of our NSB worker nodes to stop processing messages. The server dying caused every message to exceed the FLRs and be sent off for SLRs. The SLR message was sent and the message was ultimately processed, but the worker node never checked in with the distributor to say it could take more work. This caused all processing to grind to a halt once all worker threads were used, until the worker nodes were restarted.
I've modified the ScaleOut example to reproduce the issue.
https://drive.google.com/file/d/0B68lziNQwCVLYWZrZjZkX01rSnM/view
The reply-to address of the workers is wrongly rewritten to the distributor's control address instead of the main address.
This was broken in 4.4.2 by this change:
Particular/NServiceBus@54a1da3
This is the corresponding issue on the old distributor:
Raised by @andreasohlund
Migrated from Particular/NServiceBus#2048
The code we have to generate unique input queues for workers running locally for demo purposes is constantly confusing users. It's time to kill it.
Full context:
https://groups.google.com/forum/m/?utm_medium=email&utm_source=footer#!msg/particularsoftware/mBeAWMQ5JOE/KsE5gS4DL8UJ