Blog Post - Duende - 11/19/17 - NFB Silicon Valley & System Lag



Duende


Posted 20 November 2017 - 11:06 AM

Happy rapid onset of fall and getting dark at 5pm and wintry depression! We're rapidly cruising into the holidays, and while that has historically been the time when we attempt to wrap up a bunch of projects before we travel and spend time with family, this year is a bit different. At the beginning of November, Iyara, Omnis and I all attended a regional NFB conference in the San Francisco Bay Area. We got to spend a weekend hanging out, and on Friday we (they) staffed a table at the conference and met a bunch of potential new players, most of whom are visually impaired. To prepare for this, I finally decided to forgo my personal Mushclient settings, and installed our official (Ruthgul's) Mushclient bundle, with the mappers and dozens of plugins. I was shocked at how good it was in terms of features, support and documentation. We also got Omnis to stop using his 15-year-old install of Zmud to play with it, and I think he had a similar experience.

I'm glossing over a lot of details, because almost every sentence in the previous paragraph deserves its own blog post, but I promise I'll make my way around to them eventually. One of the best parts of the weekend was getting to have all 3 of us sit down and discuss the game in physical proximity, which last happened over a year ago, when we attended PAX 2016. Last time we got together, we figured out a bunch of new plans for the future, and solidified a bunch of quality-of-life changes which became a big chunk of 4.6. In the year since, more and more of the game's codebase has been refactored, opened up, documented and fixed up, to the point where the intractable problems of 2016 have become the next goals for 2018.

One of the big intractable problems, and the one we seek to fix most immediately, is the periodic lag spikes. These lag spikes have been occurring since at least 2014, and for an unknown time before that. The spikes get recorded in our server logs, but analyzing the logs hasn't revealed anything helpful about them. They don't appear to be caused by any specific user input or game loop, and they don't appear to be related to any sort of save routine. They occur sporadically, frequently but not exclusively at night (US time). We've had 1200 such events this calendar year, and there isn't any discernible pattern.

One of the great infrastructure features that Umlaut added about 10 years ago was a system profiling harness, which can be used to determine where game latency originates, whether in code or scripts. I've slowly been working on the lag since it really came to my attention 2 or 3 years ago, but we have few leads to go on, and they're all very deep in the code, with no visible components. This makes researching it and adding diagnostics glacial, as any addition will get called millions of times between reboots and probably won't be diagnosing the correct thing. Furthermore, because the lag doesn't occur on the dev server, we can only make changes once per live reboot, resulting in a handful of debug additions per year. That approach hasn't solved the problem yet, obviously, so at the beginning of October we took a different approach, using our new debug tools to collect more diverse data samples. That didn't reveal anything either, so I resorted to asking Umlaut to get on and help, for the first time in many years. We spent a few nights examining the data and writing more profilers, and we think we have a lead on what's causing the lag. Further diagnosis is still limited to one round of changes per reboot, but it's the highest priority we have, so we anticipate potentially having a series of reboots over the coming week to make faster progress, starting tonight.
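For anyone curious what a diagnostic that survives being called millions of times can look like, here is a minimal sketch in Python. It is purely illustrative and is not our profiling harness or anything from the game's codebase: the names watch, slow_events and SLOW_THRESHOLD_SECONDS, and the 0.25 second cutoff, are all made up for the example. The idea is to time a block of code and record an entry only when it runs slower than a threshold, so the common fast path adds almost nothing to the logs.

```python
import time
from collections import deque
from contextlib import contextmanager

# Hypothetical cutoff: anything slower than this counts as a lag event.
SLOW_THRESHOLD_SECONDS = 0.25

# Bounded buffer, so a long uptime cannot grow memory without limit.
slow_events = deque(maxlen=1000)


@contextmanager
def watch(label, threshold=SLOW_THRESHOLD_SECONDS, clock=time.perf_counter):
    """Time the enclosed block and record it only if it exceeds the threshold."""
    start = clock()
    try:
        yield
    finally:
        elapsed = clock() - start
        if elapsed >= threshold:
            # Only slow calls are stored; fast calls cost two clock reads.
            slow_events.append((label, elapsed))
```

Wrapping a suspect section is then a matter of writing `with watch("some_label"):` around it and inspecting slow_events afterwards. The clock parameter is there only so the timer can be swapped out for testing.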

On a side note, we won't have a Thanksgiving Global this year, as we've focused on larger maintenance priorities this month, and we wanted to invest the remaining time for global work in making Christmas really awesome.