particular / nservicebus.distributor.msmq Goto Github PK

View Code? Open in Web Editor NEW

3.0 3.0 5.0 2.75 MB

Home Page: https://docs.particular.net/nservicebus/scalability-and-ha/distributor/

License: Other

PowerShell 0.11% C# 99.89%

nservicebus.distributor.msmq's People

Contributors

Stargazers

Watchers

Forkers

zwyl2001 votrongdao jberke jango2015 chinamcat

nservicebus.distributor.msmq's Issues

DoNotWrapHandlersExecutionInATransactionScope leads to extra ready messages

Versions: 4.0.2 & 4.7.5

We've been having some problems with a distributor working correctly in our production environment. We noticed that when work was completed on a worker, that it would send 2 worker ready messages back to the distributor which would then have the side effect of work getting queued up on workers while other workers were performing no processing as the worker ready messages in the distributor storage queue was inaccurate.

It took awhile to debug and determine where the problem was, I initially upgraded to 4.7.5 as we're currently on 4.0.2 to see if it resolved the issue with the new NServicebus.Distributor.MSMQ package and session id's which it did not.

We then noticed that on the endpoint at hand, we had disabled transaction wrapping as we needed information out of the endpoint before it was done processing. We were disabling the transaction scope with:

Configure.Transactions.Enable().Advanced( x => x.DoNotWrapHandlersExecutionInATransactionScope() );

The issue was resolved by removing this config and restructuring the worker to send the information we needed before the message was sent to the distributor. I'm opening a thread here just to bring the issue to light and see if you are aware, or if there is even a way to resolve this, or perhaps this may help someone in the future.

It seems like the worker ready mechanism sends a "worker ready" message anytime a "message" is sent from an endpoint. I'm assuming that the definition of a message being sent is from the transaction scope so that one completed send of all messages (we were sending 2 messages from the worker from the handler), it would only send one worker ready message. In our case, it seems like once the transaction wrapping is disabled, every message sent is considered a worker ready message as I'm assuming NSB has no way to discern if the message being sent is from the same handler processing.

Hopefully this helps someone in the future, it took us a little while to determine what was going on and in the end, ended up being obvious.

Thanks!

Sean Chambers

Orignally from: https://groups.google.com/forum/#!topic/particularsoftware/BMu5TBAE5YA

Distributor hangs when restarting/killing Message Queueing service

When Message Queueing service is restarted then the distributor endpoint stops processing messages. This can cause outages in a production environment.

When using one of the following distributor profiles

NServiceBus.Master
NServiceBus.MSMQMaster
NServiceBus.Distributor
NServiceBus.MSMQDistributor

Expected behavior

When Message Queueing service is unavailable the distributor endpoint should shut down just as any other endpoint would do when not using either master profiles.

Tested versions:

NServiceBus v4.6.1 (Profile: NServiceBus.Master)
NServiceBus v4.7.5 (Profile: NServiceBus.Master)
NServiceBus v4.7.5 & NServiceBus.Distributor.MSMQ v4.4.8 (Profile: NServiceBus.MSMQMaster)
NServiceBus v5.2.0 & NServiceBus.Distributor.MSMQ v5.0.0 (Profile: NServiceBus.MSMQMaster)
NServiceBus v4.7.5 & NServiceBus.Distributor.MSMQ v4.4.8 (Profile: NServiceBus.MSMQMaster)
NServiceBus v5.2.14 & NServiceBus.Distributor.MSMQ v5.0.4 (Profile: NServiceBus.MSMQDistributor)

Partial workaround

When MSMQ gets restarted in production then the expected behavior should be that all services that depend on it be restarted too. This can be configured with sc.exe

NOTE: This will not restart your service when MSMQ is killed or crashes. Only when you restart MSMQ via services.msc or sc.exe start/stop

DESCRIPTION:
        Modifies a service entry in the registry and Service Database.
USAGE:
        sc <server> config [service name] <option1> <option2>...

OPTIONS:
NOTE: The option name includes the equal sign.
      A space is required between the equal sign and the value.
 type= <own|share|interact|kernel|filesys|rec|adapt|userown|usershare>
 start= <boot|system|auto|demand|disabled|delayed-auto>
 error= <normal|severe|critical|ignore>
 binPath= <BinaryPathName to the .exe file>
 group= <LoadOrderGroup>
 tag= <yes|no>
 depend= <Dependencies(separated by / (forward slash))>
 obj= <AccountName|ObjectName>
 DisplayName= <display name>
 password= <password>

In this scenario:

sc.exe config [servicename] depend=MSMQ/MSDTC

This will make your service dependent on both MSMQ and MSDTC. Mind to use / and not , to separate services.

if you restart either of those services then your endpoint will be started too.

Do not check persistence queue during construction of MsmqWorkerAvailabilityManager

Symptoms

Installation of NServiceBus fails with exception thrown by MsmqWorkerAvailabilityManager saying it can't access the queue.

Who's affected

All users of external distributor (NServiceBus.Distributor.MSMQ) who want to use a non-default container (other than Autofac)

Root cause

Other containers have different ordering of components for multiple registration than Autofac. It changes the order of installers making satellite queue creator run before distributor persistence queue creator. The former tries to instantiate a distributor satellite which in turn tries to instantiate the availability manager causing the exception.

Possible solutions

Check the persistence queue when first accessed (via property getter)
Bring back the Start() method removed from IWorkerAvailabilityManager

Distributor stops dispatching work after restart if workers are busy

The distributor stops dispatching message to workers after it restarts (same machine or another cluster node) if all workers are consuming message at the moment the distributor stops / quits / crashes.

Ready messages send to the distributor after a restart are ignored.

How to reproduce:

Take the following steps to reproduce this issue:

Take the distributor scaleout sample
Add a thread sleep to the worker handler of 10 seconds
Send 4 messages
- 2 will be dispatched to the workers, 2 remain in the distributor queue
Stop the distributor
Wait until the workers complete
Start the distributor
Distributor now dis

Now the backlog will not be processed and message pile up.

Workaround

You have to restart the worker so that in re-enlist itself at the distributor but this obviously is not suffecient for when the distributor restarts or rolls over to a different node in a cluster.

Solution

The solution is to not ignore the ready messages send by the worker.

A minor issue is that when the worker is restarted it could temporarily have more messages in its queue then the configured concurrency limit is set. This is due the ready message send at start which instructs the distributor to enlist the worker for work.

Affected supported versions

Based on our release policy the following NServiceBus.Distributor.MSMQ versions need patching:

5.0.x => 5.0.4, pull request #35
4.4.x => 4.4.9, pull request #39

In distributor situation, errant direct messages to worker cause extry ready message buildup in distributor

Raised by @jdrat2000
Migrated from by Particular/NServiceBus#1966

Customer pointed out that they were running a distributor with N workers. A configuration error was causing excessive ready messages... many hundreds in some cases. This had the effect of bringing down workers endpoint do to storage issues.

Turns out in this case what caused it was a configuration error that had senders accidentally sending messages to workers directly... not through the distributor as they should have. The workers sent ready messages to the distributor for more work leading to a snowball effect of sorts.

Not sure I'd call this a bug but would it make sense to safeguard against this behavior? Perhaps messages not from the distributor could cause a warning or exception and the storage queue could enforce some unique constraint by worker and throw and exceptions?

The desk case number for this was 3968. Customer was on 4.04.

Distributor stops handing out work to Workers after restart

Symptoms

Endless log entries of type Logs Session ids for Worker at '{0}' do not match, so popping next available worker until distributor for endpoint is restarted.

Orginally reported on the google group.

Root cause

Distributor tries to pop a worker, but the worker has session id not equal to the active session id stored by the distributor for that worker address. A info entry is written to the log with Session ids for Worker at '{0}' do not match, so popping next available worker. Tries next worker. Same thing happens for all workers in the storage queue. No worker with the active session id is found. The end result is the transaction, including both the message and the expired workers popped from the storage queue, rolling back. The entire thing repeats forever unless distributor is restarted, which clears the active session id-cache. The root cause is that restarting a worker and then restarting its distributor will clear the active session id cache for that worker and no new messages will be sent out.

How to know if you're affected:

You are running version 4.3.X, 4.4.X or 5.X and the Distributor repeatedly logs Session ids for Worker at X do not match, so popping next available Worker.
the Distributor's main input queue has a lot of messages that are not being distributed to Workers.

Workaround:

Restarting the Distributor will fix the issue temporarily, but it can re-emerge later if the Distributor and a Worker is restarted in succession.

What to do if you're affected:

Please update your Distributor to the latest version. When upgrading the Distributor between major versions, it is important that you upgrade as prescribed, so that the workers are properly registered with the Distributor.
This fix does not affect the Workers hence, no need to update the Workers.

New Distributor xml configuration

Raised by @johnsimons
Migrated from by Particular/NServiceBus#1756

As part of #1732 we are going to:

Remove DistributorControlAddress as that can be automatically computed, DistributorDataAddress + "distributor.control".
Merge MasterNodeConfig with new separate DistributorConfig

Proposed new config section

<DistributorConfig
    Node="localhost"
    QueueName="orders.handler">
</DistributorConfig>

The above configuration will result in:

DistributorControlAddress = "orders.handler.distributor.control@localhost"
DistributorDataAddress = "orders.handler@localhost"

Improve logging of queue names on startup exceptions

Raised by @jdrat2000
Migrated from by Particular/NServiceBus#2000

When and endpoint starts and fails, logging only indicates and very general error for example..The queue does not exist or you do not have sufficient permissions to perform the operation. Full stack trace below.

In this case a customer, desk case 4156, was getting this error troubleshooting the startup on a cluster instance. It turned out the problem was simply a missing queue.

Could the logging be improved to include the name of the queue?

2014-02-28 09:38:46,410 [5] FATAL NServiceBus.Satellites.SatelliteLauncher - Satellite NServiceBus.Distributor.DistributorSatellite, NServiceBus.Core, Version=4.2.0.0, Culture=neutral, PublicKeyToken=9fc386479f8a226c failed to start.
System.Messaging.MessageQueueException (0x80004005): The queue does not exist or you do not have sufficient permissions to perform the operation.
   at System.Messaging.MessageQueue.MQCacheableInfo.get_Transactional()
   at System.Messaging.MessageQueue.get_Transactional()
   at NServiceBus.Transports.Msmq.WorkerAvailabilityManager.MsmqWorkerAvailabilityManager.Start() in :line 0
   at NServiceBus.Satellites.SatelliteLauncher.StartSatellite(SatelliteContext context) in :line 0

Profile "NServiceBus.MSMQWorker" not reporting for duty with distributor

Raised by @kami1eon
Migrated from by Particular/NServiceBus#2029

We noticed NSB 4.4.2 the Worker profile interface has been deprecated.

As we use the IHandleProfile interface for only worker specific scenarios, after changing over to MSMQWorker and invoking NServiceBus host with NServiceBus.Integration NServiceBus.MSMQWorker profiles we have noticed that after the initial "Hey distributor, I am available for work, got 10 slots" there is no more begging for additional graft (no sign of "Worker checked in with available capacity: X)".

When we are using the depreciated "Worker" interface, everything is fine.
Has anyone seen this before? Am I missing something...

Master profile automatically enables the Gateway

With the split of the Gateway from the core, the distributor either has to have a hard dependency on the new Gateway dll/nuget or we do not enable the Gateway for Master profile automatically.

When worker is stopped it should signal the distributor and drain its queue

When a worker is stopped it should signal the distributor and drain its queue.

Changes

A windows service stop should 'wait' until worker threads are idle
During a stop it should not send a ready message to the distributor when shutting down
Send a message to the distributor that prevents messages to be forwarded to the worker

No messages are processed when disabling transactions on distributor or master role

When disabling transactions on the distributor or master role messages are not forwarded to the workers. The .storage queue does not contain messages after the distributor receives a ready message from the worker.

How to reproduce

Open scaleout distributor sample
- https://github.com/Particular/docs.particular.net/tree/master/samples/scaleout/distributor/Version_5
Disable transactions in the Server project
- busConfiguration.Transactions().Disable();
Run the sample
Check the log on the Server project and see that it receives
- NServiceBus.Distributor.MSMQ.MsmqWorkerAvailabilityManager Worker at 'Samples.Scaleout.Worker2@RAMON-HOME' has been registered with 1 capacity.

Workaround

Workers can run with transactions disabled but the distributor and master role do not allow this. Only disabled transactions on the worker nodes.

If you want to use the 'master' role then setup a worker process on the same machine as the distributor where the distributor has transactions enabled and the worker has transactions disabled.

Make sure the worker uses the following additional settings:

NServiceBus/Distributor/WorkerNameToUseWhileTesting
UnicastBusConfig: DistributorControlAddress & DistributorDataAddress attributes

<appSettings>
    <add 
      key="NServiceBus/Distributor/WorkerNameToUseWhileTesting" 
      value="Samples.Scaleout.Worker1" />
  </appSettings> 
  <MasterNodeConfig Node="localhost" />
  <UnicastBusConfig 
    DistributorControlAddress="Samples.Scaleout.Server.Distributor.control@localhost" 
    DistributorDataAddress="Samples.Scaleout.Server@localhost"/>

https://github.com/Particular/docs.particular.net/tree/master/samples/scaleout/distributor/Version_5

Decision: Won't Fix

We will not change this behavior. Running without transactions can result in loss of messages. If an error occurs during the processing of a message the message will not be retried and will not be forwarded to the error queue. This unreliable behavior is unwanted from the perspective of the distributor.

Workers can run without transactions and they are the actual processors of the message. The workaround provides a method to have the workers run without transactions and have a solution to have a worker run on the same machine as the distributor both having different transactional modes.

Please be aware that the message will not be retried and not forwarded to the error queue when the processing fails and the message will be gone.

References

https://groups.google.com/forum/#!msg/particularsoftware/c3-zS5dJ61Q/pCOQaL0tBAAJ

MSMQDistributor unable to continue work after a restart.

When using the new MSMQDistributor and MSMQWorker profiles, the distributor no longer sends messages to the workers after it has been restarted. When a message is received it clears out the storage queue and doesn't process the message. When the worker is restarted it works as expected.

Steps to reproduce:

Start an instance of an endpoint with NServiceBus.MSMQDistributor
Start an instance of that endpoint with NServiceBus.MSMQWorker.
Restart the Distributor instance.
Send a messages to the distributor.

High cpu reported when running the distributor with 10 threads

Raised by @andreasohlund
Migrated from by Particular/NServiceBus#2014

2-4 is usually the optimum. Let see if we can repro

Original question on the group: https://groups.google.com/forum/#!msg/particularsoftware/bRuh8wyOSZU/ZztGZAk3d3MJ

When MasterNodeConfig section is not present the validation output could be improved

When MasterNodeConfig section is not present the following error is shown:

2015-12-15 11:52:35.033 ERROR NServiceBus.GenericHost Exception when starting endpoint. System.Configuration.ConfigurationErrorsException: 'MasterNodeConfig.Node' entry should point to a valid host name. Currently it is: [].

The MasterNodeConfig section is not present at all.

To reproduce use the scaleout sample and comment MasterNodeConfig of a worker and comment the NServiceBus/Distributor/WorkerNameToUseWhileTesting app setting.

IProfile is now in the host

Because the IProfile is now in the host dll, this means that either:

The Distributor assembly references the host dll
We remove profiles from the Distributor
We create another assembly that includes the distributor profiles

@andreasohlund which one?

When running with the MSMQWorker profile and the MasterNode configuration is set to localhost, worker throws an exception

Currently, when running as a worker node, and if the master node is configured to be on the same machine as that of the worker, an exception is thrown:

See:
https://github.com/Particular/NServiceBus.Distributor.Msmq/blob/develop/src/NServiceBus.Distributor.MSMQ/Features/WorkerNode.cs#L99

How do we configure the worker to override this?

Repro:
Use the scale out sample in the Update branch:
https://github.com/Particular/NServiceBus.Msmq.Samples/tree/update/ScaleOut

Worker session ids seems to get out of sync now and then

Reported by user:

Currently using NServiceBus 4.6.3, but this has also happened with earlier 4.x versions and I'm wondering if anyone else has seen this before or has an idea.  I've Googled this for a while and came up with nothing.

I have a distributor and 2 workers, all on different servers.  This distributor does not run an endpoint.  Most of the time, all is well, but sometimes when a request comes in, the distributor will log the message "Session ids for Worker at 'Endpoint@Server' do not match, so poping next available worker" for both workers as fast as it can and no longer processes anything.  It appears that maybe it's not actually removing the message and keeps reading the same one over and over.  I have to stop the services, clear the distributor storage queues, and restart the services to get it to run again.  It may run for hours or even days just fine, but will eventually do this again.

Distributor stops handing out work to Workers after restart

Symptoms

Endless log entries of type Logs Session ids for Worker at '{0}' do not match, so popping next available worker until distributor for endpoint is restarted.

Orginally reported on the google group.

Root cause

Distributor tries to pop a worker, but the worker has session id not equal to the active session id stored by the distributor for that worker address. A info entry is written to the log with Session ids for Worker at '{0}' do not match, so popping next available worker. Tries next worker. Same thing happens for all workers in the storage queue. No worker with the active session id is found. The end result is the transaction, including both the message and the expired workers popped from the storage queue, rolling back. The entire thing repeats forever unless distributor is restarted, which cleares the active session id-cache. The root cause is that restarting a worker and then restarting its distributor will clear the active session id cache for that worker and no new messages will be sent out.

How to know if you're affected:

You are running version 4.3.X, 4.4.X or 5.X and the Distributor repeatedly logs Session ids for Worker at X do not match, so popping next available Worker.
the Distributor's main input queue has a lot of messages that are not being distributed to Workers.

Workaround:

Restarting the Distributor will fix the issue temporarily, but it can re-emerge later if the Distributor and a Worker is restarted in succession.

What to do if you're affected:

Please update your Distributor to the latest version. When upgrading the Distributor between major versions, it is important that you upgrade as prescribed, so that the workers are properly registered with the Distributor.
This fix does not affect the Workers hence, no need to update the Workers.

Fix logging in NextAvailableWorker pump

The exception at

NServiceBus.Distributor.Msmq/src/NServiceBus.Distributor.MSMQ/MsmqWorkerAvailabilityManager.cs

Line 133 in e131c66

catch (MessageQueueException e)

can happen quite frequently. If there are no avaialble workers, the receive operation will get a timeout exception. That is not something that calls for a INFO message. At best DEBUG.

And when we log it we should at least make sure that we don't throw away the exception details by calling the incorrect method on the logger.
Logger.InfoFormat("NextAvailableWorker Exception", e) is wrong. We want Logger.Info("NextAvailableWorker Exception", e)

SLR does not work for MasterNode

Who's affected

Any customer attempting to scale out MSMQ as a MasterNode with attached Worker.

Symptoms

Second level retries does not work when endpoint is deployed as a master node with attached worker and configured with a MasterNodeConfig section pointing to the local server. The message is sent to the EndpointName.Retries queue, but that queue either does not exist, or exists but the endpoint does not receive messages from it. As a result, faulting messages are passed to SLR but then never reprocessed again, or an exception is thrown that the retries queue does not exist.

Repro

Bug found by @dvdstelt when working with a customer, and verified by me. Here is a minimal repro:

Reference NSB 5.2.14, NSB.Host 6.0.0, NSB.Distributor.MSMQ 5.0.4
Use Code/App.config below.
- In App.config, set MasterNodeConfig's Node attribute to your own machine name.
Build, then run from command line NServiceBus.Host.exe NServiceBus.MSMQMaster
Message will fail through FLR and log will say WARN NServiceBus.Faults.Forwarder.FaultManager Message with 'c449a8c4-5bc6-4569-a23b-a64a00ad70fa' id has failed FLR and will be handed over to SLR for retry attempt 1.
- At this point, the endpoint will not do anything else
- You'll find the message in the Retries queue, but there's no satellite receiving from that queue.
If you remove the MasterNodeConfig section from the App.config, it will work, advancing through all SLR levels and eventually forwarding the message to the error queue.
- In this configuration, if you kill the endpoint after the message is handed off to SLR, you'll probably see that the Retries queue is empty. This is because a satellite is receiving messages on the Retries queue faster than you can kill the process, and storing it in timeout storage, which in this case is in-memory. So in this case, there is something receiving from the Retries queue.

Code

namespace SlrBugRepro
{
    public class EndpointConfig : IConfigureThisEndpoint
    {
        public void Customize(BusConfiguration configuration)
        {
            configuration.UsePersistence<InMemoryPersistence>();
            configuration.EnableInstallers(); // So queues are created when run from cmdline
        }
    }

    public class Startup : IWantToRunWhenBusStartsAndStops
    {
        public IBus Bus { get; set; }

        public void Start()
        {
            Bus.Send("SlrBugRepro", new TestCmd());
        }

        public void Stop() { }
    }

    public class TestCmd : ICommand { }

    public class TestHandler : IHandleMessages<TestCmd> {
        public void Handle(TestCmd message)
        {
            Console.WriteLine("Attempting TestCmd");
            throw new Exception("Boom");
        }
    }
}

App.config

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<configuration>
  <configSections>
    <section name="MessageForwardingInCaseOfFaultConfig" type="NServiceBus.Config.MessageForwardingInCaseOfFaultConfig, NServiceBus.Core" />
    <section name="TransportConfig" type="NServiceBus.Config.TransportConfig, NServiceBus.Core" />
    <section name="SecondLevelRetriesConfig" type="NServiceBus.Config.SecondLevelRetriesConfig, NServiceBus.Core" />
    <section name="MasterNodeConfig" type="NServiceBus.Config.MasterNodeConfig, NServiceBus.Core" />
  </configSections>
  <MessageForwardingInCaseOfFaultConfig ErrorQueue="error" />
  <TransportConfig MaxRetries="2" MaximumConcurrencyLevel="1" MaximumMessageThroughputPerSecond="0" />
  <SecondLevelRetriesConfig Enabled="true" NumberOfRetries="3" TimeIncrease="00:00:5" />
  <MasterNodeConfig Node="DAVIDBOIKE0C30" />
</configuration>

Make the distributor output the criticaltime/time to breach SLA counters

Raised by @andreasohlund
Migrated from by Particular/NServiceBus#1279

Have each worker send back the critical time calculated as header in the control message going back to the distributor so that the distributor can populate it's counter. This allows for centralized monitoring of SLA's by just looking at the distributor

Worker node not reporting in after message is sent to SLR

Worker node is not reporting ready status after a message is sent to SLR. This happens because in that case the FinishedMessageProcessing event is not raised.

The solution is changing the way ready state reporter works so that it subscribes to StartedMessageProcessing instead.

Raised by @JeffHenson
Migrated from by Particular/NServiceBus#2518

NServiceBus 4.7.1
NServiceBus.Distributor.MSMQ 4.4.6

We had an issue this morning when one of our database servers took a dive and it caused some of our NSB worker nodes to stop processing messages. The server dying caused every message to exceed the FLRs and be sent off for the SLRs. The SLR message was sent and the message was ultimately processed but the worker node never checked in to the distributor to say it could take more work. This caused all processing to grind to a halt once all worker threads were used until the worker nodes were restarted.

I've modified the ScaleOut example to reproduce the issue.

https://drive.google.com/file/d/0B68lziNQwCVLYWZrZjZkX01rSnM/view

Workers use the wrong reply to address

The reply to address of the workers are wrongly rewritten the distributors control address instead of the main address

This was broken in 4.4.2 by this change:

Particular/NServiceBus@54a1da3

This is the corresponding issue on the old distributor:

Particular/NServiceBus#2051

Remove the support for running workers and distributor on the same box

Raised by @andreasohlund
Migrated from by Particular/NServiceBus#2048

The code we have to generate unique input queues for workers running locally for demo purposes I constantly confusing users. It's time to kill it.

Full context:
https://groups.google.com/forum/m/?utm_medium=email&utm_source=footer#!msg/particularsoftware/mBeAWMQ5JOE/KsE5gS4DL8UJ

particular / nservicebus.distributor.msmq Goto Github PK

nservicebus.distributor.msmq's People

Contributors

Stargazers

Watchers

Forkers

nservicebus.distributor.msmq's Issues

Expected behavior

Tested versions:

Partial workaround

Symptoms

Who's affected

Root cause

Possible solutions

How to reproduce:

Workaround

Solution

Affected supported versions

Symptoms

Root cause

Proposed new config section

How to reproduce

Workaround

Decision: Won't Fix

References

Symptoms

Root cause

Who's affected

Symptoms

Repro

Code

App.config

Recommend Projects

Recommend Topics

Recommend Org