Backing track
In this huge section we will talk about how to play a backing track! (Spoiler: it is not just AudioSource.Play.) This deceptively simple task will be a major mindfuck even though the mathematics looks like really, really simple additions and subtractions. I am not even sure this is understandable to anyone other than me.
Your receptor is at "time"
The receptor is the judgement line that runs forward, or, seen the other way, that the notes fall down onto, depending on how you look at it. In this article I assume the receptor is in time units and that it keeps increasing according to delta time each frame. From this point on, think of "receptor time" as the "position" of this receptor; there is no other position that gets converted to time. The advantage of doing it this way is that delta time can be added directly to progress the entire game. I don't know how you represent "where" you are in the song right now; feel free to convert my definition to fit yours instead of reworking your game to use the same unit.
It is used for:
- Judgement: on the frame where you detect an input, you compare the current receptor time against the time of each affected note to decide whether to score it, and with which judgement.
- Rendering: for example, if your receptor is at time 1.5s and a note is at time 2.25s, you can write your rendering logic so that when the receptor time reaches 2s, the note is drawn closer than before. (This is the "note falling towards the receptor" you perceive; actually the receptor runs forward.)
We never ask the audio position back to apply to this time; it is purely delta time based. You have read that asking for the audio time back is not reliable: what we hear from the speaker does not line up with this audio DSP time (it lags behind by the output latency), and the DSP time updates in weird steps that may or may not change even between two adjacent lines of code.
It is better to have only one critical moment where we ensure the game syncs with this wild audio time (the moment we start the music), then let it go. If a buffer overrun/underrun makes the audio lag, or the game lagging makes the audio go ahead, then so be it. That is still better than basing your entire game on audio time! Base your entire game on delta time.
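Here is a minimal sketch of what I mean, in Unity C#. The class and field names are mine, not from any library; the point is only that the receptor is one time value driven by delta time, and both judgement and rendering read from it:

```csharp
using UnityEngine;

// Minimal sketch: the receptor "position" is just a time value driven by delta time.
public class Receptor : MonoBehaviour
{
    // Receptor time in seconds. It can be negative during the intro/READY? animation.
    public double receptorTime;

    void Update()
    {
        // The whole game advances on delta time, never on audio time.
        receptorTime += Time.deltaTime;
    }

    // Judgement: on the frame an input arrived, how far off was it from this note?
    public double TimingErrorFor(double noteTime) => receptorTime - noteTime;

    // Rendering: a note's distance from the judgement line is the remaining time times scroll speed.
    public float DistanceFor(double noteTime, float scrollSpeed)
        => (float)(noteTime - receptorTime) * scrollSpeed;
}
```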
Your task: start playing from ANY receptor time
Of course you start the song at the beginning. But remember that you can pause and resume, or you may even have a note editor that can "test play" from any point. This "play the backing track" routine should be able to handle ANY current receptor position. Starting the game normally then just means moving the receptor to a negative time and calling our magic function, which we are figuring out how to write right now.
PlayScheduled: Worship this method
Unity's built-in PlayScheduled has one thing even Native Audio couldn't do: accuracy (not latency, not immediate response). You can specify a future time and it will try hard to start the audio at exactly that time. (You can still prime the audio playhead to any position; playback will then start from that position at the scheduled time.)
Any audio that you can predict ahead of time (backing track, autoplay assist clap, metronome, etc.) should use this method.
The scheduling is special because it is frame independent. If the time you want happens to land in between frames, the audio will still be able to start at that time. What magic is this? The frame is an abstract concept that Unity added; on the native side there is usually a tighter callback or something running outside the frame. (If you have done microcontroller/Arduino work, you may have heard of "interrupt" signals that make something happen almost instantly.) Unity has platform specific code to interface with the native way of scheduling audio. This method is worth more than it looks.
Your naive solution of waiting inside your update function to play at the right time will always be later than the time you really wanted, because there is no way to ensure a frame lands exactly at that point. The best you can do is assume the next frame arrives 1/60s after the current one in perfect conditions.
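To make it concrete, here is a minimal sketch of scheduling a backing track this way, assuming an AudioSource with its clip already assigned. The ~2 frames of headroom is my own guess at a comfortable lead time, not something Unity requires:

```csharp
using UnityEngine;

public class ScheduledPlay : MonoBehaviour
{
    [SerializeField] AudioSource source; // the backing track, clip already assigned

    void Start()
    {
        // You can prime the playhead to any position in the clip beforehand.
        source.time = 0f;

        // Schedule the start on the DSP timeline. This is frame independent:
        // the audio can begin in between frames.
        double startDsp = AudioSettings.dspTime + (2.0 / 60.0); // ~2 frames of headroom (my assumption)
        source.PlayScheduled(startDsp);
    }
}
```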
The added problem is that the scheduled time must be in the future. How does this affect us?
Assume latency is zero for now
For our sanity, let's assume the music plays immediately and reliably at the exact time specified to PlayScheduled.
Fixing the overtime
Let's think about this simple situation. Your receptor position is at 0s and you want to start here. Ideally, what bit of audio would you like to hear right now?
It's the 0s of the audio file! That is, if the next frame moves the receptor position to 16.66ms, then you would like to be hearing the 16.66ms point of your audio at that moment. The song is supposed to start at receptor time 0s. (Without offset; I will talk about that later.)
But hold on, how can we already be hearing the 0s of the file when, this very frame, we just received the command to play and we have our "must play in the future" rule?
Consider this tactic:
- Set the audio time to 0s.
- Schedule the audio to start 16.66 * 2 ms in the future from now (I assume the game can prepare the audio within 2 frames). Remember that moment in DSP time units. For example, at this line of code you ask for the DSP time and it says 20000 ms; then you are "looking forward to" 20033 ms. PlayScheduled also accepts a time on the DSP timeline, so you tell it 20033 ms too.
- Don't begin your gameplay just yet! The plan is to begin the game at 20033 ms.
- Two update rounds later you check the DSP time again, and it reads:
- 20033 ms (extremely unlikely): Lucky! You are hearing the 0s of the audio right now and the receptor position is also at 0s right now. This frame you do not add delta time to the receptor position, since it is already correct. On the following frames you keep adding delta time, independent of audio time, with the hope that the audio time will keep "remotely sticking" to your delta-time-driven receptor time from now on. At least you got a very good start!
- 20018 ms (oh no): The schedule you set up has not started just yet. But next frame it probably will have started, and the code will have gone over that time! (Let's say 20041 ms.) Since the scheduled audio has not started yet and you are given this final chance to run code before it does, stopping the AudioSource to cancel the schedule and trying again is definitely an option. Well, that may produce an infinite loop, so I suggest just letting this frame pass and checking again next frame.
- 20041 ms (oh no): The audio has already started and there is no stopping it! What is the solution? For this frame, add a custom delta time that accounts for the over-time (that is, 20041 - 20033 = 8 ms). Remember that if it was exact (the first case), we would not add delta time to the receptor time this frame, since it is already correct. This means your game weirdly moves a bit ahead this frame, and from the next frame we can continue adding ~1/60 s of delta time as usual, with the same hope I explained in the first case.
With this tactic, it is possible to circumvent the fact that "scheduled" is only available for audio, by fixing the overtime. There is no "ExecuteCodeScheduled" that would start your gameplay at exactly the time you want. You can cheat your way through because the receptor is also in time units: you force it forward to match at the moment the game loop presents you this chance. This chance is the only exact chance you get. Since it was you who said "20033 ms!" to the ultimate method PlayScheduled, and the DSP time is now a bit over that, you know exactly how far ahead of it we are. I will call this the NOW! moment.
All future frames are left to the destiny of delta time and DSP time. (And to the player's ear, which may or may not detect the drift; better if the player is not trying to perfect attack.)
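Here is how the whole tactic could look in code. This is only a sketch with my own naming, using seconds instead of the milliseconds in the example above, and still assuming ~2 frames is enough headroom:

```csharp
using UnityEngine;

public class OvertimeFix : MonoBehaviour
{
    [SerializeField] AudioSource source;

    double receptorTime;      // the time we want to start playing from (0 in this example)
    double scheduledDsp;      // the DSP time we promised to PlayScheduled
    bool waitingForSchedule;
    bool started;

    public void StartFromZero()
    {
        source.time = 0f;                                     // prime the audio at 0s
        scheduledDsp = AudioSettings.dspTime + (2.0 / 60.0);  // ~2 frames of headroom (assumption)
        source.PlayScheduled(scheduledDsp);
        receptorTime = 0;
        waitingForSchedule = true;                            // don't begin gameplay just yet
        started = true;
    }

    void Update()
    {
        if (!started) return;

        if (waitingForSchedule)
        {
            double now = AudioSettings.dspTime;
            if (now < scheduledDsp) return;       // schedule not met: let this frame pass
            // The NOW! moment: add only the over-time, so the receptor lands where the audio is.
            receptorTime += now - scheduledDsp;   // 0 in the lucky exact case
            waitingForSchedule = false;
            return;
        }

        // Every later frame is pure delta time; we never ask the audio again.
        receptorTime += Time.deltaTime;
    }
}
```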
How about coming back from a pause
When paused you don't pause the AudioSource, you stop it completely. On the frame your player presses resume, we redo everything I said above, except that you prime the audio playhead to our current receptor position when the player wants to play again.
Pausing and resuming is nice this way, since we can fix any drift between the DSP time and the actual receptor time that we use to judge and render. We redo the "fixing the overtime" dance yet again and get another chance to ensure exact sync.
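A sketch of that pause/resume flow, continuing the same idea (again, the names and the 2 frame headroom are my assumptions):

```csharp
using UnityEngine;

public class PauseResume : MonoBehaviour
{
    [SerializeField] AudioSource source;

    double receptorTime;
    double scheduledDsp;
    bool waitingForSchedule;
    bool paused = true;   // start paused; call Resume() to start or resume

    public void Pause()
    {
        source.Stop();    // stop completely; we will re-prime and re-schedule on resume
        paused = true;
    }

    public void Resume()
    {
        source.time = (float)receptorTime;                    // prime the playhead at the receptor
        scheduledDsp = AudioSettings.dspTime + (2.0 / 60.0);  // same ~2 frame headroom as before
        source.PlayScheduled(scheduledDsp);
        waitingForSchedule = true;                            // freeze, then fix the overtime again
        paused = false;
    }

    void Update()
    {
        if (paused) return;

        if (waitingForSchedule)
        {
            double now = AudioSettings.dspTime;
            if (now < scheduledDsp) return;
            receptorTime += now - scheduledDsp;   // the NOW! moment fix, exactly as before
            waitingForSchedule = false;
            return;
        }

        receptorTime += Time.deltaTime;
    }
}
```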
How about starting before the song
The audio may start at receptor time 0s, but you have that intro animation on your cool UI plus the word READY?, and you want the notes scrolling in while READY? is on the screen. (Or at least an empty playfield scrolling in the background.)
This simply means setting the receptor time to a negative value according to the length of your intro animations. However, one thing is different from coming back from a pause, where you primed the audio time ahead: here you cannot prime the audio time to a negative value! The lowest it goes is 0s.
That's right! You must add this "under-time" to the time you give PlayScheduled instead, and then prime the audio at 0s, the smallest it can go.
This means the schedule will be quite far ahead; the NOW! moment is quite far ahead. The game keeps checking the DSP time every frame... then finally there should be a frame where the DSP time returned is a bit over our destined time.
The difference, though, is that we were adding delta time all along before this point, to make the game scroll while READY? was displayed. On the frame where you detect the DSP time going over your scheduled DSP time, it would be perfect to not add delta time to your receptor position, but instead set the receptor position to the time equivalent of the scheduled DSP time plus the difference. "Set" means that no matter where the receptor is right now (it should be close by), it warps to the time you wanted.
This could produce a side effect: a player with a crap phone but good eyes will notice the game jump a bit backward or forward, inconsistent with the frames leading up to the NOW! moment when the song finally begins. If there are no notes around, it may not be a problem; most songs begin their audio content without any notes on screen anyway. I think ensuring sync with the audio is worth this little hack of the receptor time.
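A sketch of this negative start, where the under-time goes into the schedule and the receptor is warped (set, not incremented) at the NOW! moment. As before, the names and the headroom value are mine:

```csharp
using UnityEngine;

public class NegativeStart : MonoBehaviour
{
    [SerializeField] AudioSource source;

    double receptorTime;
    double scheduledDsp;
    bool waitingForSchedule;

    public void StartFrom(double negativeReceptorTime) // e.g. -3.0 for a 3 s intro animation
    {
        receptorTime = negativeReceptorTime;
        source.time = 0f;                                // the playhead cannot go below 0s
        double underTime = -negativeReceptorTime;        // how long until the song should begin
        scheduledDsp = AudioSettings.dspTime + (2.0 / 60.0) + underTime;
        source.PlayScheduled(scheduledDsp);
        waitingForSchedule = true;
    }

    void Update()
    {
        if (waitingForSchedule)
        {
            double now = AudioSettings.dspTime;
            if (now >= scheduledDsp)
            {
                // NOW! moment: don't add delta time, warp the receptor to the exact over-time.
                receptorTime = now - scheduledDsp;
                waitingForSchedule = false;
            }
            else
            {
                // Still in the intro: the receptor already scrolls on delta time.
                receptorTime += Time.deltaTime;
            }
            return;
        }
        receptorTime += Time.deltaTime;
    }
}
```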
You have all cases covered! You can even pause while the game still says READY? and resume, and it still waits appropriately before the song really comes in, because you handled this generically enough.
Offset
The previous section assumes that at receptor time 1s you want to hear the audio at 1s.
But this is often not the case. The audio file often has some silence at the beginning, or some intro, which you would rather shift forward or backward a bit so that the first cymbal lands on the note you placed at time 2s. (Trivia: that is the 2nd measure of a 4/4, 120 BPM song, since one beat takes 0.5s.) This is an offset value per audio file that likely applies to all chart difficulties.
So you just apply this offset before determining what you should do. For example, say you are starting the song at exactly 0s, or maybe the player paused near the start at 0.5s (bitten by a mosquito), but the offset says to delay the song by 4s. This turns the case into "starting before the song", and you must put the "under-time" into the schedule as explained.
Do the same for a negative offset. If the song has a negative offset, then even at receptor time 0s you will already be hearing some way into the song.
Offset answers "at this receptor time, what part of the song should we be hearing?" by moving the song back or forth. Both positive and negative values make sense. Maybe you really need the 5s point of the song to land at measure 0, in which case you move the song earlier. (Or you change your note chart so that the notes start later and not at measure 0.)
Make sure your "move the time back to negative on start for animations and READY?" takes this offset into account as well. That is, subtract the intro length starting from the offset and not from 0s. Otherwise, a song whose offset is pushed far into the future will leave too long a gap between READY? and the actual start.
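To be explicit about the sign convention I am assuming throughout (a positive offset delays the song relative to the chart), here is the tiny conversion in code; the helper name is made up:

```csharp
// Assumed sign convention: a positive offset delays the song relative to the chart.
public static class OffsetMath
{
    // The audio time you would like to hear at a given receptor time.
    // A negative result means we are in the "starting before the song" case.
    public static double AudioTimeFor(double receptorTime, double offset)
        => receptorTime - offset;
}

// Example: offset = 4.0 s, player paused at receptor time 0.5 s
//   OffsetMath.AudioTimeFor(0.5, 4.0) == -3.5
//   -> prime the audio at 0 s and push 3.5 s of under-time into the schedule.
```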
Player adjustable "calibration"
Sometimes it says "input calibration". Sometimes it says "audio calibration". Or "adjust latency"?
Latency compensation
Agree with me first: negative latency doesn't make sense. What latency does is make the audio play later than expected. Therefore, if you say "please compensate for 0.5s of latency", then everything that can be planned ahead of time, which already has to be planned early because of our future-schedule rule, will try to be even earlier by 0.5s, so that after going through the latency it arrives back at the intended time.
It makes no sense to say "I want -0.5s of latency compensation!".
When coming back from a pause (the case where we are already into the song), remember there are several frames where we don't start moving our receptor time just yet, because we are waiting for the NOW! moment. Now you must wait even longer than the promised NOW! moment before you fix the overtime and finally let the game go. Put simply, the game waits until the scheduled time plus the compensated latency. (Do not add anything to the time given to the schedule itself; only the game waits a bit more after the schedule.)
In the previous example, we waited for 20033 ms, and you have arrived at the frame where the DSP time is 20035 ms. But you have stated that the latency compensation is 10 ms. Therefore, even though you know the audio schedule has already executed, this is not yet the frame to add anything to the receptor time. Instead, wait until the frame where the DSP time reads, say, 20055 ms, then perform the overtime fix comparing 20055 against 20043 (the scheduled 20033 plus the 10 ms of compensation), instead of comparing 20035 against 20033 as you would without compensation.
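In code, the only change from the plain resume is the threshold the game waits for; the schedule itself is untouched. A sketch, with my usual made-up names:

```csharp
using UnityEngine;

public class ResumeWithCompensation : MonoBehaviour
{
    [SerializeField] AudioSource source;
    [SerializeField] double latencyCompensation = 0.010; // seconds, player adjustable, never negative

    double receptorTime;      // holds the position we paused at
    double scheduledDsp;
    bool waitingForSchedule;
    bool paused = true;

    public void Resume()
    {
        source.time = (float)receptorTime;
        scheduledDsp = AudioSettings.dspTime + (2.0 / 60.0);
        source.PlayScheduled(scheduledDsp);   // do NOT add the compensation to the schedule itself
        waitingForSchedule = true;
        paused = false;
    }

    void Update()
    {
        if (paused) return;

        if (waitingForSchedule)
        {
            double now = AudioSettings.dspTime;
            double goTime = scheduledDsp + latencyCompensation; // the game waits a bit more than the schedule
            if (now < goTime) return;

            receptorTime += now - goTime;   // fix the overtime against the compensated moment
            waitingForSchedule = false;
            return;
        }

        receptorTime += Time.deltaTime;
    }
}
```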
A possible problem with this fix, taken to the extreme: imagine the player dials in much more latency compensation than needed. After coming back from a pause, they will hear the song playing even though the game hasn't started moving yet! That is weird, of course, and the goal for the player is to adjust the value so that they hear the song exactly when things start moving.
This is a by-product of using the stillness of the receptor at the pause to fix the problem. Could we fix this somehow?
- You can counter this by adding a resume countdown that you are sure is longer than any possible latency compensation the player could set. (So you cap the maximum latency adjustment in the option screen at, say, 3s.) This makes the frozen wait feel natural, since you can mask it inside the countdown. (The schedule is met before the count actually reaches 0.)
- Moving the receptor backward to compensate for latency. However, this allows an exploit where the player pauses and resumes repeatedly in order to move the receptor back. Some games do it anyway as QoL for players, such as DJMAX Respect, where the game not only adds a countdown but even rewinds time for you. You can mask the latency inside the rewind.
- Let the receptor go ahead after resuming. However, you will get to play a silent song for a bit! Why? Because this silence is actually time given for our crap Android phone to finally pour the audio out to our ears. The schedule is already met while it is still silent. Therefore, when the music finally comes back, that section of the song should also be ahead of the section at the point where you paused, because you "paid" for the latency compensation without a weird stop. You do this by intentionally priming the audio position further ahead the more latency you want to compensate, and having it come back precisely with PlayScheduled. (In other words, each time you pause and resume, a bit of the song just after the pause point is never heard again, since the audio was adjusted to jump ahead so that it meets the game at the end of the silent playing.) In this case there is no weird pause while the song plays due to a calibration error; instead, an over-calibration makes the song play in silence for longer.
I prefer the last way. The player is likely focused on the moving notes right after the pause (reading them "by sight" for a while instead of by ear), so it is probably better than startling audio coming back while the notes aren't moving yet, or a receptor position yanked to a different place on its own.
How to implement the silent play solution
After you have decided to play from any receptor time, determine what you should be hearing not right now, but a little into the future: at the scheduling headroom plus the latency compensation from now.
If your receptor is at 2000ms, your latency compensation is 0ms, and the offset is 500ms, normally you would say "I would like to hear 1500ms now". Instead, you set the waiting audio time not to 1500ms but to 1533.33ms. (The core difference is right here.)
Next, try to ask for the "DSP now" as early in the frame as possible, so that it agrees the most with the delta time we are about to work with. (Remember that the DSP time may change each time you ask for it.) Schedule the audio, primed at 1533.33ms, to start at DSP now + 33.33ms. This frame, do nothing to the receptor time. This "do nothing" only makes sense if we managed to ask for the DSP time as early in the frame as possible, as close as possible to the actual frame time, so that we are ready to do something next frame and can expect the delta time to move together with the DSP time.
Next frame, add delta time to the receptor. (That means we are already playing the game; turn on judgement and everything now.) Recall that previously we kept checking the DSP time until it went over DSP + 33.33ms before moving and then fixed the overtime, but since the audio has been forwarded in this approach, we can move immediately. On this frame, the audio we "should" be hearing is around 1516.66ms and the schedule has not been met yet. (33.33ms is roughly 2 frames.) But we are already "playing in silence" right now! (You can hit or miss notes while in silence.) Then, when the schedule finally comes about, the audio will be at the position that accounts for the time we played in silence. That 33.33ms of audio content disappears completely, as if it had played at 0 volume during the silence and then the volume was suddenly restored.
Now let's add a latency compensation of 100ms into this mess. (Remember that negative compensation doesn't make sense.) Set the audio time to 1633.33ms, but still schedule it to start at DSP + 33.33ms (still that minimum future time for PlayScheduled to work). Now, while the schedule is met at the old spot, the audio comes out late because of crap Android latency, so we play in silence even longer; but when the audio does come out, voila, its position is not 1533.33ms but 1633.33ms. Sure, it came out later than scheduled because of latency, but the audio position has been forwarded to counter exactly that. JUST ACCORDING TO KEIKAKU.
One very important thing to notice here is that no matter what ("what" = latency compensation, receptor time, offset), the scheduled time stays a fixed DSP now + 33.33ms. Because we push every other problem into the forwarded audio time instead (and bear with the silence), all the schedule needs is the minimum future time required for PlayScheduled to work. (Please tell me you nodded your head at this point, or else try reading this article again later after you have had some sugar.) Later I will reveal that this is not 100% true (haha), but it is useful to notice it.
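Putting the silent-play approach together, here is a sketch. Everything (offset, latency compensation, scheduling headroom) is folded into the primed audio position, and the schedule stays a fixed DSP now + headroom. Units are seconds, the names are mine, and the 33.33ms headroom is just my assumed minimum for PlayScheduled:

```csharp
using UnityEngine;

public class SilentPlayStart : MonoBehaviour
{
    [SerializeField] AudioSource source;
    [SerializeField] double offset;              // positive delays the song: audio = receptor - offset
    [SerializeField] double latencyCompensation; // never negative

    const double Headroom = 2.0 / 60.0;          // ~33.33 ms for PlayScheduled (my assumption)

    double receptorTime;
    bool playing;

    public void PlayFrom(double startReceptorTime)
    {
        receptorTime = startReceptorTime;

        // What should we be hearing, not right now, but when the schedule is met (plus latency)?
        double primedAudioTime = startReceptorTime - offset + Headroom + latencyCompensation;

        source.time = (float)primedAudioTime;     // assumes this came out >= 0 (negative case comes later)
        source.PlayScheduled(AudioSettings.dspTime + Headroom);

        playing = true;   // from the next frame we are already "playing in silence"
    }

    void Update()
    {
        if (!playing) return;
        receptorTime += Time.deltaTime;  // no waiting, no overtime fix: the forwarded audio absorbs it all
    }
}
```

With PlayFrom(2.0), offset 0.5 and compensation 0, this primes the audio at 1.5333s; with compensation 0.1 it primes at 1.6333s, matching the numbers above.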
Bonus: you noticed that by the moment the schedule is met, we have already been running the receptor for a while. It is still possible to mix in the "overtime fix" approach, by polling the DSP time until it goes over the scheduled moment and then nudging the receptor time like before. (The difference is that we have already been running the receptor time up to this moment, countered by the forwarded waiting audio time; at the moment of this nudge we are still not hearing anything because of latency.) You may skip this if you would rather keep the receptor running smoothly, or if you fear the player will notice the receptor time warp (which affects rendering). The difference from before is that back then we didn't run the receptor, so it was possible to completely mask the "fix" in the stillness. Now it is a trade-off: fix it at this one well-defined moment, at the cost of a bit of discontinuity depending on how bad the phone is; or don't fix it, and even a bad phone continues smoothly past the scheduled moment, but with a drifted game time since it was never corrected.
Let's consider the case where the calculation says "I want to hear a negative ms right now". Say the receptor is at 200ms, but the offset says 500ms! The result is that I want to hear the audio at -300ms right now, which of course doesn't make much sense. The real situation is that the audio was offset so far into the future that 200ms is not even at the beginning of the audio yet; or, inversely, the receptor time is so far back into negative time that, again, we are not yet at the start of the audio once all the offset is accounted for.
If we continue the calculation like before, we would be setting the audio to wait at -300ms + 33.33ms + latency, which is probably still negative... or maybe not, if the latency compensation is set to a large number? My point is that the negative-ness must be judged from the whole expression: receptor time - offset + 33.33ms (minimum future time) + latency. If you get a positive number from this, you can do it like before. The only problem is that it is impossible to prime the waiting audio time at a negative value; that part must go somewhere else.
That somewhere else is the "not 100% true" thing I asked you to remember: the scheduled time that was fixed to DSP now + 33.33ms! You can still imagine that you are before the audio, right? The only remaining thing to do is to add that negative time into the schedule, which becomes DSP now + 33.33ms + (negative time * -1). Yes, just dump it in there and schedule immediately. Then the primed audio time can wait at 0s instead of an impossible negative time. A gotcha here is that when the value comes out positive, nothing is added to DSP now + 33.33ms (because the positive part goes into the primed audio time instead). You can just use an if to separate the negative and positive cases; do not risk brain damage trying to over-generalize this.
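And here is the final case split, folding the negative situation into the same sketch as before (same assumptions, same made-up names):

```csharp
using UnityEngine;

public class GeneralStart : MonoBehaviour
{
    [SerializeField] AudioSource source;
    [SerializeField] double offset;
    [SerializeField] double latencyCompensation;

    const double Headroom = 2.0 / 60.0;  // minimum future time for PlayScheduled (my assumption)

    double receptorTime;
    bool playing;

    public void PlayFrom(double startReceptorTime)
    {
        receptorTime = startReceptorTime;

        double primedAudioTime = startReceptorTime - offset + Headroom + latencyCompensation;
        double scheduledDsp = AudioSettings.dspTime + Headroom;

        if (primedAudioTime >= 0)
        {
            source.time = (float)primedAudioTime;  // positive: it all goes into the primed playhead
        }
        else
        {
            source.time = 0f;                      // the playhead cannot go below 0s
            scheduledDsp += -primedAudioTime;      // dump the negative part into the schedule instead
        }

        source.PlayScheduled(scheduledDsp);
        playing = true;
    }

    void Update()
    {
        if (!playing) return;
        receptorTime += Time.deltaTime;
    }
}
```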
I believe you now have an algorithm that plays the audio correctly from any receptor time, be it positive, negative, before the song, mid-song, or even after the song! (In that last case the primed time will be at the end, and you will hear nothing, given that the AudioSource is not looping.) Congratulations!
Just in case: when resuming from a pause, put in some delay that prevents the player from pressing the pause button again while the schedule hasn't been met yet, or else you may hear the song playing on the pause screen. Or you can take the extra effort to cancel the schedule.
Other kinds of "calibration"
Not yet! Just a bit more.
In many games, when you go into the option screen, you see that it lets you press +/- on a value of unknown unit to "calibrate" the game. "I calibrated to -0.5" and "I calibrated to 0.5" both make sense there. What's going on? Negative latency doesn't make sense, right?
There are reasons games don't outright say what the unit of the calibration is, or just display a test play area of moving bars while letting you press +/-: this topic is such a mindfuck. (Check out how long this page has become just to play a damn piece of music.)
One, it is possible that those values actually add to or subtract from the music's offset, which sometimes results in a perception that latency was fixed. Then it makes sense to have both positive and negative values. In games that do this without any other trick, the compensation may only appear to take effect when restarting from the beginning. A game may also convert a positive calibration into the latency compensation we have worked on all along, but a negative calibration into an offset change.
Adjusting the song offset is useful when the player doesn't trust the offset value chosen by the developers, or simply because some songs sound different to different players: some say a song is late, others say the same song is too early. Offering offset calibration in addition may be a good safety net for your own offset mistakes.
Two, it is also possible that those values are an input/judgement calibration. It makes it so that if the receptor time is exactly at the note time on the frame where you detected an input, you do not get a perfect, because the judgement is offset. This is to help devices with bad input latency. The player, of course, has perfect eyes, so they try to touch the screen when the note really overlaps the receptor. (This is the frame where the receptor time equals the note time; remember how the receptor time drives rendering.) But of course the frame where the input arrives won't be this exact frame; it may be one or more frames later. If you have really good eyes, you will see the note move past the receptor while the input takes its time to arrive, until at some frame you finally detect the input and make the note disappear.
But well, the input will always be too late if you designed the game to be touched on the frame where the note overlaps the receptor. And while I think this is the right design, since it is easier to say that than to explain "well, but your input actually arrives at the code later lol", you cannot make the note disappear on the exact frame the flesh of the finger hits the screen. The notes will inevitably move past for a bit.
A device with 0ms input latency doesn't exist. So you should have some input calibration by default, so that "under the receptor" is actually the real perfect, if you intend to give a perfect score to a player who touches the note while it visually overlaps the receptor. This makes the game player-centric (priority to the visuals the player sees) instead of code-centric (the code doesn't care how long the input takes to arrive).
Otherwise the game turns into the "hey, in this game you need to touch a bit before that line and the game will say perfect" kind that players usually talk about, which is fine too, maybe! Players understand that devices don't all have the same capabilities; it is their own fault for not having the money to buy a better device with better input latency, they understand why they need to touch ahead of time, and then they keep touching ahead of time consistently.
Or you can balance this by putting in a bit of default input calibration, because you know a device with 0ms input latency doesn't exist, while accepting that some cheap phones will still need to hit early, without exposing an input calibration setting at all. Also, offering three calibrations (latency, music offset, and input) is sometimes too much and hard to understand.
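If you do add an input/judgement calibration, the judging change itself is tiny. A sketch, where the sign convention is mine (a positive calibration forgives input that arrives late at the code):

```csharp
public static class Judge
{
    // Judge as if the input had arrived `inputCalibration` seconds earlier than the frame
    // where the code actually saw it. 0 means trust the code, not the finger.
    public static double TimingError(double receptorTimeAtInput, double noteTime, double inputCalibration)
        => (receptorTimeAtInput - inputCalibration) - noteTime;
}
```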
The End
Finally! As a bonus for putting up with me this far: this section was actually a huge rubber duck debugging session for myself. If I hadn't typed all of this out, I could never have gotten past the first few paragraphs in the line of thinking I just presented to you. It just... works.
And sorry that the last few topics completely invalidate the first few lines of reasoning, but without going through those first, I found it impossible to understand the final solution. The end!! Yeah!