Recovery
The Recovery operation ensures that the state of the SDK (and therefore the system using the SDK) is in sync with the state of the feed.
Initial state of the SDK
When the feed is opened, Producers start in the Down state and a recovery is automatically initiated to get the current state.
Recovery timestamp
Each recovery is associated with, and tracked for, a single Producer.
Since the SDK does not persist any information between runs, the user must provide a producerRecoveryFromTimestamp from which the recovery will be made. Once the user has processed all the recovery messages, the SDK is in sync with the state of the feed and the Producer is marked Up. The recovery timestamp can be determined by calling the appropriate method on the Producer object, lastProcessedMessageGenTimestamp. This value should be periodically retrieved and stored to ensure it is available and up-to-date in case a restart of the system is required.
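A minimal sketch of this retrieve-and-store loop, using the lastProcessedMessageGenTimestamp accessor named above behind a local stand-in interface (the real SDK type and signature may differ by version; the file location is hypothetical):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Stand-in for the SDK Producer type described on this page.
interface Producer {
    long lastProcessedMessageGenTimestamp();
}

public final class RecoveryTimestampStore {
    private static final Path STORE = Path.of("recovery-timestamp.txt"); // hypothetical location

    // Periodically persist the generation timestamp of the last processed
    // message so it survives a restart of the system.
    public static void schedulePersistence(Producer producer) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                Files.writeString(STORE, Long.toString(producer.lastProcessedMessageGenTimestamp()));
            } catch (IOException e) {
                // a failed write only leaves a slightly older timestamp behind; log and continue
            }
        }, 10, 10, TimeUnit.SECONDS);
    }

    // On startup, read the stored value to use as producerRecoveryFromTimestamp.
    public static long storedRecoveryFromTimestamp() throws IOException {
        return Long.parseLong(Files.readString(STORE).trim());
    }
}
```

On restart, the stored value is supplied as the producerRecoveryFromTimestamp configuration value before opening the feed.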
If producerRecoveryFromTimestamp is not set, a full odds feed message recovery is performed. The general rule is: the shorter the time frame, the sooner the recovery operation will finish. The customer is expected to make an effort to keep this time frame as short as their circumstances allow, as long recoveries create additional pressure on other producers and increase the probability of other Producers issuing additional recoveries, increasing overall Producer Down-time, which puts downward pressure on the revenue the bookmaker is able to capture.
Each Producer defines the maximum time frame for which a recovery may be requested; it can be consulted via statefulRecoveryWindowMinutes. The actual timestamp requested for each initiated recovery can be inspected in the onRecoveryInitiated callback by consulting timestampRecoveryWasInitiatedFrom on the supplied object.
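The callback can be used to audit how far back each recovery reaches. A sketch, with a stand-in type mirroring the accessor named above (the real SDK callback and event type may differ by version):

```java
// Stand-in for the event supplied to onRecoveryInitiated.
interface RecoveryInitiatedEvent {
    int producerId();
    long timestampRecoveryWasInitiatedFrom();
}

final class RecoveryAudit {
    // Log how many minutes back each recovery reaches, to verify the window
    // stays as short as circumstances allow.
    void onRecoveryInitiated(RecoveryInitiatedEvent event) {
        long ageMinutes = (System.currentTimeMillis()
                - event.timestampRecoveryWasInitiatedFrom()) / 60_000;
        System.out.printf("producer %d: recovery initiated from %d minutes back%n",
                event.producerId(), ageMinutes);
    }
}
```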
Producer Status Change Events
Each time the state of the producer changes, the user is notified via a global event/callback. The current state of the producer can be consulted via the producerStatusChange callback's isDown method. The possible values and their meanings are:
false - Producer Up: notifies the user that the state of the associated Producer is up-to-date, and therefore bets placed on markets associated with that producer can be accepted.
true - Producer Down: notifies the user that the state of the producer is no longer up-to-date and it might not be safe to accept new bets on markets from that producer.
Alternative ways to inspect the Producer's Up or Down state are to consult the relevant methods on ProducerManager or on the Producer itself.
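A sketch of reacting to these notifications, with a stand-in for the producerStatusChange payload (real SDK types may differ by version); it keeps a per-producer flag that gates bet acceptance:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in mirroring the callback payload described above.
interface ProducerStatus {
    int producerId();
    boolean isDown();
    String producerStatusReason();
}

final class BetAcceptanceGate {
    private final Map<Integer, Boolean> acceptingBets = new ConcurrentHashMap<>();

    // Wire this up as the producerStatusChange callback.
    void producerStatusChange(ProducerStatus status) {
        // isDown() == false: markets from this producer are safe to accept bets on.
        acceptingBets.put(status.producerId(), !status.isDown());
        System.out.printf("producer %d is %s (%s)%n", status.producerId(),
                status.isDown() ? "Down" : "Up", status.producerStatusReason());
    }

    boolean canAcceptBets(int producerId) {
        return acceptingBets.getOrDefault(producerId, Boolean.FALSE);
    }
}
```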
The producerStatusChange callback's producerStatusReason method provides additional context for a Producer changing its state. ProducerStatusReasons fall into the following categories:
Initial Producer recovery,
Slow processing of feed messages on the customer end, and
Unreliable delivery of feed messages to the SDK.

Initial Producer Recovery
FirstRecoveryCompleted - this ProducerStatusReason indicates that the Producer has just been marked Up after its initial recovery; it appears only once per Producer during the lifetime of an SDK run.
Slow processing of feed messages on the customer end
Feed messages are time-critical: the bookmaker needs them to make important betting-market decisions in a timely fashion. A bookmaker operating on stale data risks losing revenue by offering bets to punters on events for which the outcomes are already known. The SDK will bring the Producer Down or Up depending on the speed at which feed messages are processed.
Even though slow processing of feed messages on the customer end results in the Producer being marked Down, the remote producer is actually healthy and continues producing messages.
It is possible to increase Producer Up time at the cost of operating on not-fully-up-to-date messages. The tolerance for how far behind messages may fall can be configured via the inactivitySeconds configuration option. It has a sensible default value, which should be carefully considered before modifying, to ensure the right balance is struck between overall Producer Down time and the bookmaker's tolerance for operating on stale data. After bringing the Producer Down due to slow processing of feed messages, the SDK will continue delivering messages, expecting the speed of message processing to recover. In this case no recovery will be issued.
ProducerStatusReasons related to slow message processing are the following:
ProcessingQueueDelayViolation - the Producer is marked Down because message processing is too slow (i.e. messages currently processed by the SDK were generated more than inactivitySeconds (see config options) in the past), or
ProcessingQueueDelayStabilized - the Producer is marked Up again because the speed of message processing has recovered (i.e. messages currently processed by the SDK were generated less than inactivitySeconds (see config options) in the past).
Consulting isDelayed on the ProducerStatusChange callback is an alternative way to find out about slow message processing.
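The rule both reasons encode can be illustrated in a few lines (this mirrors the description above, not SDK internals):

```java
// Processing is considered delayed when the message currently being
// processed was generated more than inactivitySeconds ago.
final class DelayRule {
    static boolean processingDelayViolated(long messageGenTimestampMillis,
                                           int inactivitySeconds) {
        long delayMillis = System.currentTimeMillis() - messageGenTimestampMillis;
        return delayMillis > inactivitySeconds * 1000L;
    }
}
```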
Accessing deeper insight into processing delay
processingQueueDelay - provides the delay, in milliseconds, measured from the last processed message.
lastMessageTimestamp - provides the timestamp at which the last message was received from the producer.
lastProcessedMessageGenTimestamp - provides the generation timestamp of the last message processed by the user.
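Sampled together, these accessors give early warning before a ProcessingQueueDelayViolation is reached. A sketch with a stand-in interface using the names above (signatures are assumptions; the real SDK may differ):

```java
// Stand-in for the delay-insight accessors listed above.
interface ProducerInsight {
    long processingQueueDelay();             // delay in ms from the last processed message
    long lastMessageTimestamp();             // when the last message was received
    long lastProcessedMessageGenTimestamp(); // generation ts of the last processed message
}

final class DelayProbe {
    // Warn once the processing delay crosses 80% of the inactivitySeconds limit.
    static void sample(ProducerInsight producer, int inactivitySeconds) {
        long limitMillis = inactivitySeconds * 1000L;
        long delayMillis = producer.processingQueueDelay();
        if (delayMillis > limitMillis * 8 / 10) {
            System.err.printf("processing delay %d ms is approaching the %d ms limit%n",
                    delayMillis, limitMillis);
        }
    }
}
```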
Considerations
Slow processing of feed messages is usually caused by the customer's registered OddsFeedListener not keeping up with the speed of message delivery.
The customer is highly encouraged to consider implementing concurrent consumption of feed messages on their end.
The current implementation of the SDK delivers messages sequentially within each configured UofSession; messages that take a long time to process in any of the UofSessionMessageHandler methods will therefore delay the processing of subsequent messages, potentially bringing the Producer Down.
Delays are usually caused by:
long-running API calls issued under the hood by the SDK, due to network or server issues, when customer-implemented code exercises methods on the delivered message, or
other long-running operations performed directly in the listener method, such as persisting the message to a database.
To reduce the frequency of Producers going Down due to slow message processing, it is highly recommended to return from the listener method as soon as possible, handling the bulk of potentially slow or blocking operations related to the message asynchronously.
Complexities associated with concurrent consumption
The principal complexity associated with asynchronous handling of messages, when implementing concurrent consumption on the customer end, is preserving the feed delivery order of messages for any given event. The following example illustrates the scenario:
Assume there is a sport event between Real Madrid and Barcelona, and two consecutive OddsChange messages are delivered for the 1x2 market in the user-configured session within the SDK. The first message carries odds of 1.1, 1.3, 1.5, while the second carries odds of 1.2, 1.3, 1.4. If these messages end up being processed by two different threads on the user end and, due to a race condition, the first message gets delayed and is handled after the second one, the bookmaker will be publishing outdated odds, giving a punter an advantage. One way to preserve per-event ordering while still processing concurrently is sketched below.
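A common approach is to route all messages for the same event to the same single-threaded lane, so that different events are processed concurrently while each event's messages stay in order. A minimal sketch (the event-id and handler shapes are hypothetical):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Messages for the same event always land on the same single-threaded lane,
// preserving feed delivery order per event; different events run concurrently.
final class PerEventOrderedDispatcher {
    private final ExecutorService[] lanes;

    PerEventOrderedDispatcher(int laneCount) {
        lanes = new ExecutorService[laneCount];
        for (int i = 0; i < laneCount; i++) {
            lanes[i] = Executors.newSingleThreadExecutor();
        }
    }

    // Called from the (sequential) SDK listener: return quickly, process later.
    void dispatch(String eventId, Runnable handleMessage) {
        int lane = Math.floorMod(eventId.hashCode(), lanes.length);
        lanes[lane].execute(handleMessage);
    }
}
```

Because the SDK delivers messages to the listener sequentially per session, submitting them in arrival order guarantees each lane also receives them in order; the two OddsChange messages in the example above therefore cannot be swapped.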
Solution - Adjust tolerance to slow delivery
The inactivitySeconds configuration option allows the customer to reduce Producer Down-time at the cost of an increased risk of operating on stale market data.
This option ships with a sensible default value and should be changed only after carefully considering the consequences.
Unreliable delivery of feed messages to the SDK
In some cases the SDK will not be able to continue stable message delivery. In these cases a Producer will be marked Down with one of the following ProducerStatusReasons:
AliveIntervalViolation - indicating that consuming messages is not safe (i.e. messages currently received by the SDK were generated more than inactivitySeconds (see config options) in the past),
ConnectionDown - indicating that the connection to the feed went down, or
Other - indicating that the producer will not be producing messages. This can happen for multiple reasons, such as producer maintenance windows.
After the issue is resolved, the SDK will bring the Producer back Up along with the ProducerStatusReason:
ReturnedFromInactivity - the Producer recovered and is back Up again; any missed feed messages were successfully re-delivered and the customer receives up-to-date messages from the feed.
When a Producer is marked Down, the customer does not need to take any action for the SDK to recover. Several mechanisms are implemented within the SDK to handle automatic recoveries and eventually bring Producers back Up again.
Considerations
To isolate unhealthy producers from impacting healthy ones, consider configuring multiple UofSessions or deploying the SDK on multiple server nodes (e.g. if the SDK receives messages from both the LO & CTRL producers, one session could be used for live messages and another for prematch messages). Since the recovery operation can generate a large number of messages in a short time, messages from another producer may take longer to reach the SDK. As a result, that producer can be marked Down due to processing of old messages. This is especially likely during the initial recovery, which sometimes has to recover messages over longer time frames. Once all the messages generated by the recovery (for all producers) are processed, the state of the producers will stabilise. A sketch of such a two-session setup follows.
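The shape of the two-session split, with stand-in builder and message-interest types (the real SDK session builder and enum names may differ by version):

```java
// Stand-ins sketching a two-session setup; illustrative only.
enum MessageInterest { LiveMessagesOnly, PrematchMessagesOnly }

interface UofSessionBuilder {
    UofSessionBuilder setMessageInterest(MessageInterest interest);
    UofSessionBuilder setListener(Object listener);
    void build();
}

final class SessionSetup {
    // One session per traffic type: a recovery flood on the live session
    // cannot delay prematch messages, and vice versa.
    static void configure(UofSessionBuilder live, UofSessionBuilder prematch,
                          Object liveListener, Object prematchListener) {
        live.setMessageInterest(MessageInterest.LiveMessagesOnly)
            .setListener(liveListener)
            .build();
        prematch.setMessageInterest(MessageInterest.PrematchMessagesOnly)
                .setListener(prematchListener)
                .build();
    }
}
```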
If the Producer is down and a recovery for the Producer has started, but no messages related to the currently issued recovery are received during a 5-minute period, a new recovery will be issued with the original recovery timestamp (see "Recovery timestamp"), rendering the current recovery no longer relevant.
Recoveries are initiated by API calls, and even though these are thoroughly tested, they can sometimes fail. In such an occurrence, the SDK will issue another recovery request seconds later (i.e. on the next Alive message). In the unfortunate event of the API failing recovery requests for an extended period of time, there is a risk of the API starting to throttle recovery requests, potentially causing infinite recovery loops the SDK cannot automatically recover from (see "Infinite Recovery Loop"). It is highly recommended to implement a strategy accounting for such a case and to contact support to unblock the bookmaker account. The recommended strategy involves tracking the recoveryIds issued for the Producer for the period of time the Producer is Down (see the "Sports API" documentation regarding "Recovery Requests") to identify such cases as they happen and contact Sportradar support in a timely fashion.
Producer maintenance is a planned activity, usually taking short periods of time. In the unexpected case of a producer being down due to longer-running maintenance, the SDK will continue issuing recovery requests, potentially entering an infinite recovery loop (see "Infinite Recovery Loop") from which it will not be able to recover. The same strategy applies here: track the recoveryIds issued while the Producer is Down to identify such cases and contact Sportradar support in a timely fashion.
Sportradar support history shows that changes in customer network infrastructure sometimes cause the RabbitMQ connection to Sportradar infrastructure to be lost. After not receiving messages for a long enough time frame, the SDK will start issuing recovery requests, potentially entering an infinite recovery loop (see "Infinite Recovery Loop") from which it cannot recover. The root cause of these issues usually lies in firewall settings.
Recovery types
Full recovery
All feed messages since the recovery timestamp (see "Recovery timestamp") are recovered automatically whenever the SDK detects the need to do so. The customer cannot initiate this process on demand.
Manual recovery of a single sport event
Recovery of certain types of feed messages for a single sport event can be requested manually via EventRecoveryRequestIssuer, which provides two options:
recovery of odds messages only, and
recovery of stateful messages only (BetSettlement, RollbackBetSettlement, BetCancel, UndoBetCancel).
A recovery targeting a single event is worth considering if the customer is aware that messages were not processed on their end due to a system failure from which they have since recovered.
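A sketch of issuing both kinds of single-event recovery behind a stand-in interface (the real EventRecoveryRequestIssuer method names and the event-id type may differ by SDK version):

```java
// Stand-in mirroring the two options described above; both calls return an
// id that can be used to track the issued recovery request.
interface EventRecoveryRequestIssuer {
    Long initiateEventOddsMessagesRecovery(String eventId);
    Long initiateEventStatefulMessagesRecovery(String eventId);
}

final class SingleEventRecovery {
    // After recovering from a local outage, re-request messages only for the
    // affected event instead of performing a full feed recovery.
    static void recoverEvent(EventRecoveryRequestIssuer issuer, String eventId) {
        Long oddsRequestId = issuer.initiateEventOddsMessagesRecovery(eventId);
        Long statefulRequestId = issuer.initiateEventStatefulMessagesRecovery(eventId);
        System.out.printf("issued odds recovery %d and stateful recovery %d for %s%n",
                oddsRequestId, statefulRequestId, eventId);
    }
}
```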
Programmatic access to Producers
The Producer is accessible on each message delivered from the feed. Alternatively, producers, their state, and recovery-related information can be accessed via ProducerManager.
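A sketch of a periodic producer report via stand-in types mirroring the access paths above (accessor names here are hypothetical):

```java
import java.util.Collection;

// Stand-ins for the SDK's Producer and ProducerManager access paths.
interface ProducerView {
    int id();
    boolean isProducerDown(); // hypothetical accessor for the Up/Down state
    long lastProcessedMessageGenTimestamp();
}

interface ProducerManagerView {
    Collection<ProducerView> availableProducers();
}

final class ProducerReport {
    static void print(ProducerManagerView manager) {
        for (ProducerView p : manager.availableProducers()) {
            System.out.printf("producer %d: %s, last processed gen ts=%d%n",
                    p.id(), p.isProducerDown() ? "Down" : "Up",
                    p.lastProcessedMessageGenTimestamp());
        }
    }
}
```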
Managing the amount of time a recovery takes
Recovery timestamp (Recommended)
The most effective and safest way to decrease recovery times is to reduce the time frame for which a recovery is requested. See the "Recovery timestamp" section.
maxRecoveryTime and minIntervalBetweenRecoveryRequests
Recovery operations have a time limit indicating the maximum time window in which they must complete, i.e. all the messages related to the recovery are processed and the producer is up-to-date. If the recovery is not completed within the specified time window, it is automatically restarted by the SDK.
Certain permutations of the maxRecoveryTime, minIntervalBetweenRecoveryRequests, and producerRecoveryFromTimestamp configuration values carry a higher risk of the SDK entering an infinite loop of recoveries. This can cause the API to start throttling recovery requests due to too many recoveries being requested, which is likely to bring the SDK into a state from which it can never recover without manual intervention. The SDK provides tested default values for these configuration options which are compatible with the API's throttling behaviour. Changing these values requires extreme care and extensive testing beforehand.
Infinite Recovery Loop
Considerations
Today the SDK does not provide a way of detecting infinite loops of never-ending Producer recoveries, nor a way to auto-recover from such a state if it happens. The suggested best-effort workaround is to acquire information about currently issued recoveries via the SDKProducerStatusListener.onRecoveryInitiated callback, which can provide indirect insight for detecting this problem; this workaround still requires a judgement call and manual intervention.
A good candidate for a Producer being in an infinite recovery loop is one which has not been Up for a significant amount of time and has generated quite a few distinct recoveryInitiationIds. Both the time frame and the number of requestIds generated require a judgement call which should take the current SDK configuration into consideration. A best-effort monitor is sketched below.
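The monitor counts distinct recovery ids observed via onRecoveryInitiated while a Producer is Down and flags the Producer once a configured threshold is exceeded (types and the threshold are assumptions, not SDK API):

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Too many distinct recovery ids without the Producer coming Up suggests an
// infinite recovery loop worth escalating to Sportradar support. The threshold
// is a judgement call and should reflect the current SDK configuration.
final class RecoveryLoopMonitor {
    private final int maxDistinctRecoveries;
    private final Map<Integer, Set<Long>> recoveriesWhileDown = new ConcurrentHashMap<>();

    RecoveryLoopMonitor(int maxDistinctRecoveries) {
        this.maxDistinctRecoveries = maxDistinctRecoveries;
    }

    // Feed from SDKProducerStatusListener.onRecoveryInitiated.
    void onRecoveryInitiated(int producerId, long recoveryInitiationId) {
        Set<Long> ids = recoveriesWhileDown
                .computeIfAbsent(producerId, k -> ConcurrentHashMap.newKeySet());
        ids.add(recoveryInitiationId);
        if (ids.size() > maxDistinctRecoveries) {
            System.err.printf("producer %d: %d recoveries without coming Up - "
                    + "possible infinite recovery loop, contact support%n",
                    producerId, ids.size());
        }
    }

    // Feed from the producerStatusChange callback when the Producer goes Up.
    void onProducerUp(int producerId) {
        recoveriesWhileDown.remove(producerId);
    }
}
```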
The following values help reduce the likelihood of a Producer entering an infinite recovery loop:
Choosing a recovery timestamp (see "Recovery timestamp") closer to the current timestamp,
Setting a higher value for maxRecoveryTime, and
Setting a higher value for minIntervalBetweenRecoveryRequests.
If the API has not yet started throttling requests, the SDK (along with the JVM it is running in) can be restarted right away without adjusting the maxRecoveryTime, minIntervalBetweenRecoveryRequests, and producerRecoveryFromTimestamp configuration values.
If the API has already started throttling requests, the SDK will need to remain shut down for a period of time until the recovery request throttling in the API is lifted.
Overly frequent invocations of event recoveries via EventRecoveryRequestIssuer can also cause recovery request throttling.