You mentioned Sebastian, whom I’m not sure I’ve met. Is Sebastian the developer working with you on this, and is he available to join the discussion here? This is likely a very technical issue.
Calling the queue#SystemMessage service might be an important clue, and it fits the idea that this might be a transaction conflict issue: that service updates the SystemMessage record in a separate transaction so that it succeeds or fails independently of whatever calls it.
The problem with that is that record locking across separate transactions is easy to get wrong. There are a few patterns that cause trouble, but a common one looks like this (there is a rough code sketch of this pattern after the list):
- transaction A locks record X (by an update, a for-update find/query, OR a common gotcha in some databases: record X gets locked because another record Y has a foreign key referencing it… for example if you update SystemMessage to set the orderId, which is a foreign key pointing to OrderHeader, then the database locks, or tries to lock, the referenced OrderHeader record too!)
- in the same thread transaction A (from whatever calls queue#SystemMessage) is paused and transaction B (in queue#SystemMessage itself) is run synchronously, so transaction A is waiting for transaction B to complete before it can continue
- transaction B tries to lock record X
- TX A has a lock on record X so TX B is waiting for that lock… AND TX A is waiting for TX B to complete, so there is a deadlock between the record lock on one side and TX A waiting for TX B on the other
- this is not the sort of deadlock that a database can detect (where both sides are locks in the database), so eventually the lock wait will time out, which results in errors being logged
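To make that pattern concrete, here is a minimal plain-JDBC sketch of the same shape. This is not Moqui code (Moqui handles all of this through JTA transaction suspend/resume and the Entity Facade); the table and column names, status values, JDBC URL, and credentials are placeholders I made up just to show how transaction A’s row lock plus its synchronous wait on transaction B turns into a lock wait timeout:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Illustration only: two transactions in ONE thread, where the outer one holds a row
// lock and waits synchronously on the inner one that wants the same row.
public class NestedTxLockWaitSketch {
    // Placeholder connection details; any database with row-level locking behaves this way
    static final String URL = "jdbc:mysql://localhost/moqui";
    static final String USER = "moqui";
    static final String PASS = "moqui";

    public static void main(String[] args) throws Exception {
        try (Connection txA = DriverManager.getConnection(URL, USER, PASS)) {
            txA.setAutoCommit(false);

            // Transaction A locks record X: the update takes a row lock that is held
            // until txA commits or rolls back.
            try (PreparedStatement ps = txA.prepareStatement(
                    "UPDATE SYSTEM_MESSAGE SET STATUS_ID = ? WHERE SYSTEM_MESSAGE_ID = ?")) {
                ps.setString(1, "SmsgCancelled"); // placeholder status value
                ps.setString(2, "100001");        // placeholder record id
                ps.executeUpdate();
            }

            // Same thread: transaction A now waits synchronously for 'transaction B'
            // (standing in for the separate transaction inside queue#SystemMessage),
            // so A cannot commit and release its lock until this call returns.
            updateSameRecordInNewTransaction("100001");

            txA.commit();
        }
    }

    static void updateSameRecordInNewTransaction(String systemMessageId) throws Exception {
        try (Connection txB = DriverManager.getConnection(URL, USER, PASS)) {
            txB.setAutoCommit(false);
            // Transaction B tries to lock the same record X. A holds that lock and is
            // waiting on B, so B blocks here until the database's lock wait timeout
            // (for example 50 seconds by default with MySQL/InnoDB, close to the
            // "around 1 minute later" symptom) and then throws an exception.
            try (PreparedStatement ps = txB.prepareStatement(
                    "UPDATE SYSTEM_MESSAGE SET STATUS_ID = ? WHERE SYSTEM_MESSAGE_ID = ?")) {
                ps.setString(1, "SmsgSent"); // placeholder status value
                ps.setString(2, systemMessageId);
                ps.executeUpdate();
            }
            txB.commit();
        }
    }
}
```

Because the database only sees B waiting on A’s lock (A is waiting in application code, not on a database lock), its deadlock detection never fires, which is exactly why the failure shows up as a timeout rather than an immediate deadlock error.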
When you say the service never returns and there are no error messages, is the actual behavior that around 1 minute later there are log messages in the thread that was running the service, and those log messages show errors, perhaps even a lock wait timeout error?
That would be the first thing I look for, and the easiest way to do that is to filter the log by the name of the thread that was running the service, so you can see just the logs from that thread over a longer period of time.
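If it helps, here is a tiny Java sketch of that filtering; the log file path and thread name below are just placeholder assumptions, so point it at your actual log file and use the thread name that appears on the relevant log lines:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Print only the log lines written by one thread so the whole sequence for that
// thread, including anything logged a minute or so later, can be read together.
public class ThreadLogFilter {
    public static void main(String[] args) throws IOException {
        // Placeholders: pass your real log file and thread name as arguments.
        String logFile = args.length > 0 ? args[0] : "runtime/log/moqui.log";
        String threadName = args.length > 1 ? args[1] : "some-thread-name";

        try (Stream<String> lines = Files.lines(Paths.get(logFile))) {
            lines.filter(line -> line.contains(threadName))
                 .forEach(System.out::println);
        }
    }
}
```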
Because transaction management can get messy there are some tools in Moqui Framework to help track down such issues, especially the Entity Facade lock tracking feature. To enable this, set the env var entity_lock_track to true (as an OS environment variable, or in a Moqui conf XML file). There is some runtime overhead (memory and processing time) for this, so it is not enabled by default.
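For the conf XML option, the entry would look something like the element below. This is from memory, so treat the exact element as an assumption and check MoquiDefaultConf.xml, where entity_lock_track is defined with its default value, for the real form:

```xml
<!-- inside the root moqui-conf element of your conf XML file; the element name here is
     my recollection, verify against MoquiDefaultConf.xml -->
<default-property name="entity_lock_track" value="true"/>
```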
What the entity lock tracking does is track all transactions and the records they lock on the Moqui server, and then log a warning BEFORE an operation that might lock a record when another transaction already has a lock on that record.
The downside is that it produces LOTS of false positives, ie warnings about potential lock conflicts that are not an issue at all, just part of normal expected operation. Still, in a case like this it might tell you exactly what you’re looking for, in a log message right after the “Queueing Cancel System Message…” log message you mentioned.
One little detail: the entity lock tracking feature is on by default in dev mode (using the MoquiDevConf.xml file, which is the default for devs, ie when just running java -jar moqui.war). If this can be reproduced locally, then doing so on a local dev machine is the easier way to go. If it only happens in production and only periodically, you’ll have to turn on entity lock tracking on the production server and wait for it to happen. This is not generally harmful, it mainly increases the memory used and slightly slows things down, so unless the production server is already having trouble keeping up with its load it’s not a problem. FWIW I’ve done this many times on production servers, tricky transaction problems are often intermittent.