When the Data Sync Falls Behind: How We Built a Self-Healing Process for Henry Schein One
- Sam Heck

- 21 hours ago
By digging into a complex business problem and implementing a robust testing framework, we were able to help Henry Schein One, developer of dental practice software, save hours of manual work each week.
About Henry Schein One
Henry Schein One is a global leader in dental practice management software. Their suite of software covers all parts of the staff and patient experience.
The Challenge
Henry Schein One runs a data sync process as part of one of their practice management applications. It copies data from an application database to a data warehouse that serves user reporting. The process runs on a schedule and always processes data from the end of the last successful sync through the current time. When everything is running normally, this is about a ten-minute window. However, when the sync fails (due to a transient issue, a time window that contains too much data, a bug, etc.), the process falls behind, and subsequent runs then fail because the time window has grown too large.
When the sync falls behind, the schedule has to be stopped and a developer has to take over, manually running the process over shorter windows of time. This is time-consuming, as a single run can take upwards of half an hour when there are many records to process. It is also error-prone, since the developer has to set start and end times by hand and ensure nothing is missed. Once the sync is caught up, they have to remember to turn the schedule back on.
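To make the manual catch-up concrete, here is a minimal Python sketch of what splitting a backlog into shorter windows involves. The real process runs in Pentaho; the function name `split_into_windows` and the 30-minute cap in the example are illustrative assumptions, not values from the actual system.

```python
from datetime import datetime, timedelta
from typing import Iterator, Tuple


def split_into_windows(
    start: datetime, end: datetime, max_window: timedelta
) -> Iterator[Tuple[datetime, datetime]]:
    """Yield contiguous (window_start, window_end) pairs covering [start, end).

    Each window is at most `max_window` long, and each window starts exactly
    where the previous one ended, so no span of time is skipped or repeated.
    """
    if max_window <= timedelta(0):
        raise ValueError("max_window must be positive")
    cursor = start
    while cursor < end:
        window_end = min(cursor + max_window, end)
        yield cursor, window_end
        cursor = window_end


if __name__ == "__main__":
    # Hypothetical backlog: the last successful sync ended 85 minutes ago.
    backlog_start = datetime(2024, 1, 1, 0, 0)
    backlog_end = datetime(2024, 1, 1, 1, 25)
    for window in split_into_windows(backlog_start, backlog_end, timedelta(minutes=30)):
        print(window)
```

Doing this by hand means tracking every boundary yourself, which is exactly where a missed or overlapping window can creep in.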
This data sync process, implemented in Pentaho, is complicated and mature, so making changes and validating their effect is non-trivial. As part of a previous engagement, we built an integration testing framework that automatically runs parts of the sync process against synthetic data. That framework eased the validation problem and put us in a position to safely change how the data sync process calculates its time window.
The Solution
To solve this problem, we needed to make the sync time window calculation dynamic. We had to implement the changes within the existing sync processes, using only the limited data about sync runs and application database state that was available to us. It was also important to keep the new process as simple and understandable as possible, both to reduce the number of edge cases we had to worry about and to make it easy to maintain in the long term. Eventually we landed on the following process, which solved the largest problems:
Normal state: Running less than 15 minutes behind? Process from the last successful sync through now, as usual.
After a failure: Scale the time window down substantially. If the sync failed because it tried to process too much data, a smaller window may be all it needs to succeed.
High-volume signal: When indicators suggest the upcoming window contains a large volume of data, shorten it preemptively, before a failure happens.
Behind but recovering: If the last sync completed in less time than it covered, the process is running efficiently. Expand the window (up to a set maximum) to start catching up.
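The four rules above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Pentaho implementation: only the 15-minute threshold comes from the rules themselves, while the `SyncState` fields, the shrink and growth factors, the minimum and maximum window sizes, and the order in which the rules are checked are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Tuple

NORMAL_LAG = timedelta(minutes=15)  # threshold from the rules above
MIN_WINDOW = timedelta(minutes=1)   # assumed floor
MAX_WINDOW = timedelta(hours=2)     # assumed "set maximum"
SHRINK_FACTOR = 0.25                # assumed scale-down after a failure
GROWTH_FACTOR = 2.0                 # assumed expansion while recovering


@dataclass
class SyncState:
    """Hypothetical summary of what is known about recent sync runs."""
    last_success_end: datetime               # end of the last successfully synced window
    last_window: timedelta                   # size of the most recently attempted window
    last_run_failed: bool                    # did the most recent run fail?
    last_run_duration: Optional[timedelta]   # wall-clock time of the last successful run
    high_volume_expected: bool               # indicator that the next window is heavy


def next_window(state: SyncState, now: datetime) -> Tuple[datetime, datetime]:
    """Return the (start, end) of the next window to process."""
    start = state.last_success_end
    lag = max(now - start, timedelta(0))

    if state.last_run_failed or state.high_volume_expected:
        # After a failure, or ahead of an expected spike: shrink substantially.
        size = state.last_window * SHRINK_FACTOR
    elif lag < NORMAL_LAG:
        # Normal state: process everything since the last success.
        size = lag
    elif (state.last_run_duration is not None
          and state.last_run_duration < state.last_window):
        # Behind but recovering: the last run took less time than it covered.
        size = state.last_window * GROWTH_FACTOR
    else:
        # Behind and not gaining ground: hold the window steady.
        size = state.last_window

    size = max(MIN_WINDOW, min(size, MAX_WINDOW))
    size = min(size, lag)  # never read past the current time
    return start, start + size
```

Because the window always starts at the end of the last successful sync, a failed run is simply retried with a smaller window and no data is skipped.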
We also built a suite of integration tests to run the updated sync window calculation over different situations and ensure that our solution worked as designed.
Outcomes
The dynamic sync window timing was successfully deployed to production. Almost immediately, the sync process had to handle more than 50 million new records from onboarding a new customer. Before the dynamic window was implemented, the sync would have failed when it hit that volume and kept failing until a developer stopped the schedule. That developer would then have spent a good part of the day babysitting the sync: kicking off manual runs and monitoring their progress until the data was current again. Instead, the sync automatically shrank its processing window, kept it small until the large chunk of data had been processed, and then expanded the window to catch back up. No manual intervention was needed, and the team now spends significantly less developer time recovering from errors.
Takeaways
By taking the time to deeply understand the business situation and by implementing a robust testing framework, we were able to efficiently alter a sensitive technical process and save hours of manual work each week.