How to do Massive, Slow Database Operations Efficiently With Event-Driven Recursion

Scenario: You need to perform a very heavy scripted database operation in ServiceNow, on a very large number of records.

To do this all at once would (1) slow down your instance, (2) blow past the maximum allowable transaction time, SQL count, and similar limits, and (3) probably cause other bad stuff besides.
Each operation takes between 10 and 60 seconds because of the additional business logic that needs to run.
You can't optimize the operations, they're simply slow, and there's a lot of them.

Example: You need to reclassify 100,000 CIs from one class to another (a very heavy operation on a large number of records).
How do you handle this?

I’ve run into this scenario a lot, and on every team I’ve been on (which is a fair number), the go-to answer is to run a scheduled script which performs the operation on a batch (some specific number) of records at a certain interval.

But imagine if you have the job run every 10 minutes, and it processes 20 records per run.
You can imagine a scenario where the instance or the scheduled job is particularly slow due to circumstances outside your control (such as the volume of integrations hitting the instance at that particular moment, or a discovery run happening at the same time as the scheduled script). In that case, a job could take longer than the interval between jobs, and the jobs would begin to “pile up”, resulting in a sort of traffic jam that is both inefficient and a massive drain on instance performance.

So the way to do this safely would be to figure out the longest amount of time you could imagine a single record taking, multiply that by the number of records per batch, and then set the scheduled job interval to a little bit more than that. For example, if you do a 20-record batch and each record takes between 10 and 60 seconds, the worst case is about 20 minutes per batch, so you might want to run one batch every 24 or so minutes.

However, if you need to update 100,000 records at a rate of one every 1.2 minutes (a 20-record batch every 24 minutes), that is 120,000 minutes of around-the-clock processing, or about 83 days - nearly 3 months! This might well be an unacceptable amount of time for these records to be in this sort of sys_class limbo.
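If you’d like to sanity-check that math, here’s the back-of-the-envelope version as a quick background script (the numbers are just the ones from the example above):

var totalRecords = 100000;
var batchSize = 20;
var minutesPerBatch = 24; //worst-case batch time (20 records x 60 seconds), plus a small buffer
var totalMinutes = (totalRecords / batchSize) * minutesPerBatch; //120,000 minutes
gs.info((totalMinutes / 60 / 24) + ' days'); //roughly 83 days of around-the-clock processing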

So, for an operation like this, how can we make it the most efficient that it can be, without bogging down our instance?
The solution: Event-driven recursion.

Using a scheduled job with a gap between each job prevents the “pile-up” effect, which is good, and it gives anything else which might be piling up in the queue (such as a discovery run) some time to run. But, since each job might take a different amount of time, we have to be overly conservative with our intervals.
However, with event-driven recursion, we can set it up so that each batch triggers the next! This would prevent the pile-up, and send the next batch to the “back of the line” in terms of the scheduling queue. The result would be that the batches would run slower when the instance is busy, but much quicker when the queue is empty, preventing the pile-up effect, and preventing bogging down the instance, both without sacrificing much in terms of the performance of our actual operations (which we’d like to get done before Jules Verne’s ‘round-the-world journey finishes).


How is it done?

Event-driven recursion consists of two records, and a trigger:

  1. An Event registry

  2. A Script Action (to be triggered by the event)

  3. A background script, run once, to trigger the event.

Step 1: The Event

Step one is pretty simple. Just go to the Event Registry [sysevent_register] table and create a new Event record. This is a very simple record; all you need is a Name, a Description of the event, and what should trigger it (in the Fired by field).
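For the CI-reclassification example, the Event record might look something like this (the event name is just an example; any unique, descriptive name will do):

  Name:        ciclass.batch.reclassify
  Fired by:    Script Action: Reclassify CI batch
  Description: Fires once per batch of the CI reclassification job. The Script Action in Step 2 re-fires this event after each batch, until no matching records remain.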

Step 2: The Script Action

Next, head over to the Script Action [sysevent_script_action] table, and create a new record. This is the script that will execute each time the event is fired. It’s also the script that’s going to be firing the event.
Give the script a Name, set the order to 1000, and set the Event name field to whatever event you created in Step 1.
Finally, add your script into the Script field.

Your script will have to follow a few basic guidelines, but I’ve provided a basic template you can use below.
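Here’s a minimal sketch of what that template might look like. The table, query, and per-record operation are placeholders based on this article’s CI-reclassification example, and the event name is just an example; swap in your own.

//Script Action template for event-driven recursion.
//"limit" is the batch size; "currentNumber" is a running count of records processed so far.
function eventWrapper(limit, currentNumber) {
  //The name of the event you created in Step 1
  var EVENT_NAME = 'ciclass.batch.reclassify';

  limit = parseInt(limit, 10);
  currentNumber = parseInt(currentNumber, 10);

  var gr = new GlideRecord('cmdb_ci_computer'); //placeholder table
  //The query should only match records that still need to be processed
  gr.addEncodedQuery('sys_class_name=cmdb_ci_computer^osLIKEwindows');
  gr.setLimit(limit); //respect the batch size, or this whole exercise is pointless
  gr.query();

  //If there is no more work to do, stop here so that the event is NOT re-triggered
  if (!gr.hasNext()) {
    gs.info('Event-driven recursion complete. Records processed: ' + currentNumber);
    return;
  }

  while (gr.next()) {
    //Your slow, per-record operation goes here. For the reclassification example,
    //that means setting the new class and updating the record.
    gr.setValue('sys_class_name', 'cmdb_ci_win_server'); //placeholder operation
    gr.update();
    currentNumber++;
  }

  //There may be more records left, so re-trigger the event to queue up the next batch
  gs.eventQueue(EVENT_NAME, null, limit, currentNumber);
}

//Kick off this batch, using the parameters passed in with the event
eventWrapper(event.parm1, event.parm2);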

The important parts of this script are:

  1. The “EVENT_NAME” variable near the top of the function should be updated to contain the name of the event you created in Step 1.

  2. Whatever operation you’re performing, you want to make sure that you respect the limit (otherwise, this whole thing becomes pointless). In the above example, I’ve done so by calling .setLimit(limit) before the query runs.

  3. You must have a check to see if there is more work to do. If there is not, you must return - halting the function before it gets to the point where the event is re-triggered. In the example script, this is the hasNext() check just after the query runs.
    Note: This example script assumes that every record matching the query needs something done to it, and that once that something is done, it will no longer match the query. If that is not true for your use-case, you might consider using something like my paginated GlideRecord utility, and implementing a simple .setPage() method (which I should really add myself, if I ever get around to it), or the query-windowing approach sketched just after this list.

  4. At the end of the function, if you didn’t identify that there is no more work to do in point 3 above, you must re-trigger the event. In the example script, this is the gs.eventQueue() call at the bottom of the eventWrapper() function.

  5. Finally, you must actually call the eventWrapper() function with the event’s first and second parameters (event.parm1 and event.parm2). This is done on the last line of the example script above.
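As promised in the note above, if your records will keep matching the query even after they’ve been processed, one simple way to skip past the ones you’ve already handled (just a sketch; not part of the template above) is to window the query using the running count that’s already being passed as the event’s second parameter:

//In place of the gr.setLimit(limit) line in the template above:
gr.orderBy('sys_id'); //a stable sort order, so the window stays consistent from batch to batch
gr.chooseWindow(currentNumber, currentNumber + limit); //fetch only the next "limit" rows, skipping the ones already handled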

Step 3: The Background Script

The last thing you’ve got to do is kick off this recursive process.
Simply copy the code in your Script Action (the script you adapted from the above code), and on the last line where eventWrapper() is called, replace event.parm1 and event.parm2 with the desired batch size and the number 0 respectively. For example:

function eventWrapper(limit, currentNumber) {
  //Your code here
}

eventWrapper(25, 0);

And that’s it! The script will continue to re-execute itself until it completes processing all records.
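Alternatively, since the Script Action already contains the full script, you could kick off the first batch by simply firing the event once from a background script, passing the batch size and starting count as the event parameters (same example event name as above):

gs.eventQueue('ciclass.batch.reclassify', null, 25, 0); //batch size of 25, starting count of 0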

Risks

It’s important to note that if you use very large batches over a long period of time, this method can still potentially clog up your primary/default event queue.

One solution to this, if you need to do large batches over a long period of time, would be to move the events out of the default queue by creating a dedicated Event Queue. This isn’t terribly difficult to do, and ServiceNow has even helpfully provided an article on exactly how to do this, here.
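If you go that route, you can direct the events to your dedicated queue by passing its name as the fifth argument to gs.eventQueue() - both in the Script Action and wherever you kick off the first batch. The queue name below is just an example:

gs.eventQueue(EVENT_NAME, null, limit, currentNumber, 'ci_reclass_batch'); //'ci_reclass_batch' is the name of your dedicated event queue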