Finding your next investment in less than 40 milliseconds

Hader Morsy
Trade Republic Engineering
11 min read · Mar 6, 2023


Our core mission at Trade Republic is to provide easy and fast access to capital markets for millions of Europeans. Over the past years, we’ve built a trading platform that allows our users to easily find and trade the financial instruments they need, from a vast universe of instruments across the US and various European markets.

An important part of that platform is our Instrument Search service, one of the key entry points for our users. In this article, we will explore how the Recommendations team at Trade Republic made significant improvements to the performance of this service, empowering our customers to find their next investment.

Background

Figure 1: Search Service high-level architecture

The core of our Instrument Search system is an AWS OpenSearch cluster, AWS’s offering of managed ElasticSearch. On the supply end, we work with various partners to gather data for financial instruments and build a comprehensive universe across multiple EU jurisdictions. Our systems build an ElasticSearch index from this data on a regular basis, and also consume changes to provide Near Real-Time updates to our dataset. Our customers can then open their mobile applications and type a few characters, or discover instruments by applying filters, to find their next investment. Their requests are sent to our Backend systems, via our gateway, to the Instrument Search service. This service translates their intent into an ElasticSearch query, fires that query against the cluster, and returns the results (after some transformations) to our users.

This is a very standard search set-up. As you can imagine, the list of components here is not unique to Trade Republic by any means. In fact, I’ve seen the exact same system diagrams in talks by other companies working in the Travel, Music, and E-Commerce domains. It’s simple, reliable, scalable, and it lets product teams deliver customer value.

The Task

The Recommendations team here at Trade Republic was working on introducing new personalisation features (which we might discuss some time in the future here) and, like every new feature, that meant a latency increase.

For starters, our gateway would cache API requests to the Instrument Search service to optimise for the most common queries. The caching layer was user-agnostic, caching the most common queries per jurisdiction and per language. Adding a user identifier to the cache key of course meant an explosion in cardinality and, subsequently, memory usage. This had to go if we wanted to truly personalise results for every customer.

Figure 2: Week-over-Week percentage change in traffic corresponding to removal of the caching layer

That is an 80% increase in traffic hitting our search service, which also means the OpenSearch cluster is guaranteed to receive at least as much extra traffic. We expected that increase, given the cache hit ratio on our gateway. We also expected an impact on the performance of the OpenSearch cluster: an increase in search latency due to the additional requests. We did not expect it to be this large.

Figure 3: Higher Percentile Latency increase corresponding to disabling cache

The Challenges

Frequent Spikes

Our first observation was that, rather than just a general increase in latency, there were latency spikes every minute across all percentiles. This indicated that something running every minute was causing those spikes, and that something was the ElasticSearch Refresh.

In short, Refresh operations enable Near Real-Time updates to ElasticSearch Indexes by writing changes to a new segment on the filesystem cache rather than having to commit every change to disk. The catch is that every refresh operation is still relatively expensive. One can trigger refreshes manually from outside the cluster, or schedule a suitable refresh interval. Our refresh interval was, you guessed it, 1 minute.

Like most search systems, our requirements lean more towards availability than consistency. Since we can tolerate more lenient SLAs for data consistency, we decided to test this hypothesis and increased the refresh interval to 5 minutes.
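Applying that change is a single dynamic index settings call. A minimal sketch, using the `instrument-index` name from the examples later in this article rather than our actual production index name:

PUT instrument-index/_settings
{
  "index": {
    "refresh_interval": "5m"
  }
}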

Figure 4: Less frequent spikes corresponding to Refresh operations (99th Percentile)

Refresh intervals are always a trade-off. A very small refresh interval means constant overhead, even when there aren’t many updates, while a larger refresh interval means every refresh contains a lot more data and takes longer to execute. More sophisticated refresh strategies can also help; one we’re particularly excited to test triggers refreshes based on the number of accumulated updates rather than a fixed interval. However, that requires implementing the refresh logic outside of the cluster itself.
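For reference, such an externally driven strategy would disable the scheduled refresh and call the refresh API explicitly. A rough sketch, again using the illustrative `instrument-index` name:

// Disable the scheduled refresh entirely
PUT instrument-index/_settings
{
  "index": {
    "refresh_interval": "-1"
  }
}

// Then let an external process trigger a refresh once enough updates have accumulated
POST instrument-index/_refresh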

Thread Pool Exhaustion

Our second observation was that, while the server-side latency of search queries on the OpenSearch cluster did increase, it didn’t increase as much as the client-side latency did.

Figure 5: OpenSearch server-side latency per node per shard

This pointed directly at an under-provisioned cluster, which made a lot of sense since we had just almost doubled traffic to it. To prove that hypothesis, we added metrics to monitor the search thread pool utilisation.
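Independently of whatever metrics pipeline exports these numbers, the cat thread pool API is a quick way to inspect similar figures by hand; a sketch of such a call:

GET _cat/thread_pool/search?v&h=node_name,active,queue,rejected,completed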

Figure 6: Number of queued searches increasing due to thread pool exhaustion

By default, ElasticSearch sets the size of the search thread pool based on the number of allocated processors, using the formula:

search thread pool size = int((allocated processors × 3) / 2) + 1

Whenever the number of concurrent requests exceeds the thread pool size, i.e. when there are more requests than threads that can handle them, the extra queries are queued, with a default queue size of 1000.

In that respect, tuning an ElasticSearch cluster is similar to tuning any JVM-based server. The request thread pool should be sized based on the maximum number of concurrent requests a single node can serve. And it’s always better to have small, or even zero-size, queues. With read requests that are easily retryable, it’s better to reject incoming requests if the service is not able to handle them and rely on retries instead. This way a busy node does not keep accepting requests that it can’t handle, and they can be retried on a different node. A great article that tackles these issues, and that, frankly, every software engineer should read at least once, is The Tail at Scale.

Returning from that digression, we had three options to try, in that order:

  1. Increase the search thread pool size, or
  2. Vertically scale our OpenSearch nodes, or
  3. Horizontally scale our OpenSearch cluster

The first solution was definitely the preferred one. Since our cluster was read-heavy, we could afford to spend more of the already existing resources on the read path. However, Amazon OpenSearch Service does not provide the flexibility to tune the cluster at that level; it is a managed service with a lot of fixed defaults.

So we had to go with the second option. At the time, we also needed to scale our cluster vertically anyway to cater for an ever-expanding savings offering, on the way to becoming the largest European provider of ETF and Stock savings plans. At least we had an excuse for the larger instances we’re now using.

Figure 7: Latency dropping after increasing thread pool size and vCPU per node
Figure 8: Queueing and Thread Pool metrics

The high percentiles dropped to levels lower than even before we removed the cache. When a cluster is under-provisioned, a perfectly valid solution is to throw more hardware at the problem, as long as the ROI is positive.

Tail Latencies

In large information-retrieval (IR) systems, speed is more than a performance metric; it is a key quality metric, as returning good results quickly is better than returning the best results slowly. — The Tail at Scale

Once the cluster was correctly sized, we could tackle spikes and tail latencies. Variability in latency occurs for various reasons, from resource sharing and garbage collection in JVM-based systems to network blips. All of those can cause uniform traffic to produce a large difference between median and tail latencies.

Tail latencies refer to the highest percentiles of latency that a system experiences. They can have a significant impact on the user experience, especially for a highly interactive system like a Search service.

Let’s assume that 1 second is the threshold for a bad user experience: we see from user data that engagement, conversion, or any other key business metric drops after that threshold. Even if 99% of requests are faster than 1 second (p = P(tᵢ > 1s) = 0.01), a user still makes multiple search requests per session. If a user types a character, deletes a character, or selects some filter only 10 times per session, the probability that the user had at least one bad experience is already ~9.5%, based on the binomial trial:

P(at least one slow request) = 1 − (1 − p)¹⁰ = 1 − 0.99¹⁰ ≈ 0.0956
To counteract high tail latencies, we implemented more eager retries and stricter timeouts. Since the cluster now had the capacity, we could afford to set lower timeouts on OpenSearch queries and retry them earlier on a different, less busy node.
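On the query side, one of those knobs is the request-level timeout in the search body, which bounds how long the cluster spends on a query. A sketch with an illustrative value and field (the eager client-side retries against another node live in the service itself and aren’t shown here):

GET instrument-index/_search
{
  "timeout": "100ms",
  "query": {
    "match": {
      "fullName": "alphabet"
    }
  }
}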

Figure 9: Latency distribution before tail latency optimisation
Figure 10: Latency distribution after optimisation

Search Optimisations

Alongside optimising the OpenSearch cluster itself, another path we explored was optimising how we use the cluster. This means, in simpler terms, optimising the search queries we send to OpenSearch. We found a couple of pieces of low-hanging fruit, mainly searching through fewer fields and optimising fuzzy queries.

Avoiding Joins

During their discovery phase, users can search for financial instruments in various different ways. For example, “Alphabet Inc” was historically called “Google”. There are 2 (public) stocks: class A with the “GOOGL” symbol and class C with the “GOOG” symbol. Moreover, each of those symbols has a unique International Securities Identification Number (ISIN). All of those terms need to be searchable in our index.

Translating this information into OpenSearch documents might look something like this:

{
  "docId": 1,
  "fullName": "Alphabet Inc. Class A Common Stock",
  "shortName": "Alphabet Inc.",
  "commonName": "Google",
  "symbol": "GOOGL",
  "isin": "US02079K3059"
}

{
  "docId": 2,
  "fullName": "Alphabet Inc. Class C Common Stock",
  "shortName": "Alphabet Inc.",
  "commonName": "Google",
  "symbol": "GOOG",
  "isin": "US02079K1079"
}

This means that a single user search would result in multiple sub-queries, one per field, trying to match any of them. That translates into a complex boolean query to OpenSearch, and since those are separate fields, they are indexed separately and thus searched separately.
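As an illustration of the shape of such a query (a sketch, not our exact production query), matching a short input like “goo” against every field separately could look like this:

GET instrument-index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "fullName": { "query": "goo", "operator": "and" } } },
        { "match": { "shortName": { "query": "goo", "operator": "and" } } },
        { "match": { "commonName": { "query": "goo", "operator": "and" } } },
        { "match": { "symbol": "goo" } },
        { "match": { "isin": "goo" } }
      ],
      "minimum_should_match": 1
    }
  }
}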

Since most of our queries try to match against all of those fields, we can utilise the copy_to directive. At index time, we copy all name-related fields into a new `names` field. This allows us to still search against (and score) every name field separately if needed, while at the same time simplifying the search query. The new index mapping would then look like:

PUT instrument-index
{
  "mappings": {
    "properties": {
      "fullName": {
        "type": "text",
        "copy_to": "names"
      },
      "shortName": {
        "type": "text",
        "copy_to": "names"
      },
      "commonName": {
        "type": "text",
        "copy_to": "names"
      },
      "names": {
        "type": "text"
      },
      // Remaining fields
    }
  }
}

And indexing those documents would look like:

PUT instrument-index/_doc/1
{
  "fullName": "Alphabet Inc. Class A Common Stock",
  "shortName": "Alphabet Inc.",
  "commonName": "Google",
  "symbol": "GOOGL",
  "isin": "US02079K3059"
}

PUT instrument-index/_doc/2
{
  "fullName": "Alphabet Inc. Class C Common Stock",
  "shortName": "Alphabet Inc.",
  "commonName": "Google",
  "symbol": "GOOG",
  "isin": "US02079K1079"
}

Now, for most of our user queries, we can just search against the one aggregated `names` field and avoid joining multiple per-field queries.

GET instrument-index/_search
{
  "query": {
    "match": {
      "names": {
        "query": "goo",
        "operator": "and"
      }
    }
  }
}

This, of course, optimises for search time at the cost of a larger index size, which, for our use case, is a worthwhile trade-off.

Fuzzy Queries

With the popularity of search engines like OpenSearch, ElasticSearch, or Solr, it’s easy to forget what powers a lot of their search functionality: Apache Lucene. One of the coolest features that Lucene provides is Fuzzy Queries, which it implements via Levenshtein Automata.

A fuzzy query is mostly used in the context of spell checking, where a user made a typo while searching. In the context of entity search, like at Trade Republic where users search for a specific financial instrument rather than arbitrary text, those typos are omissions, duplications, or transpositions rather than spelling mistakes. This makes fuzzy queries and edit distances the best way to solve the problem, rather than, for example, phonetic search.

Even though Lucene’s implementation of Levenshtein Automata is quite fast (and has stood the test of time), it is still a lot slower than normal queries and has to be used carefully and monitored constantly. The performance of fuzzy queries degrades not with an increase in the number of documents, but rather with an increase in the number of unique terms that the automaton has to be matched against.

One setting that can significantly improve the performance of fuzzy queries is to assume that the first couple of characters are always correct. This is a setting in Lucene fuzzy queries, exposed by OpenSearch, called `prefix_length`. Setting a prefix length of X means that we don’t catch typos in the first X characters, but at the same time the search space shrinks significantly, which gives better performance. The value of X is always dependent on the product use-case: Do users search on desktops with full keyboards or on mobile phones? What’s the distribution of user query lengths? And so on. Based on the answers to those questions, we were able to pick a value of X and set `prefix_length` accordingly.
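As a sketch of what that looks like in a query (the misspelling, the `names` field from the earlier mapping, and the prefix length of 2 are all illustrative; our actual value of X isn’t shown here):

GET instrument-index/_search
{
  "query": {
    "match": {
      "names": {
        "query": "alhpabet",
        "fuzziness": "AUTO",
        "prefix_length": 2
      }
    }
  }
}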

Impact of Query Optimisation

Figure 11: OpenSearch server-side search latency per node per shard
Figure 12: Search Service total latency drop after query optimisations

Conclusions

Observability

Not all systems have the luxury of testing in production, especially systems that mutate state. Search (and, generally, recommendation) engines do have that luxury. However, immutability is not the only prerequisite: to be able to test safely in production, you need the ability to quickly and reliably detect impact on users and roll back changes.

Knowing that, our team spent time and effort improving the observability of our systems before even thinking of making such changes, adding metrics especially around the OpenSearch cluster. While managed cloud services make bootstrapping new products easy and reliable, after a certain point your use case will outpace their fine-tuning capabilities.

Tuning Search Engines

One can spend a lot of time and effort understanding how Search Engines work to try and tune every single analyser and query used, but starting from the fundamentals will always result in a higher impact. While we did see a lot of impact from things like tuning fuzzy searches, an under-provisioned cluster will always be a bottleneck. At the end of the day, OpenSearch is just another JVM-based distributed system. And like other JVM-based distributed systems, you should look first at provisioning capacity correctly, managing tail latencies via eager timeouts and retries, and scaling reads and writes according to your use case. Elastic has great guides to get started tuning your cluster and queries based on your use case, whether it is search-heavy or indexing-heavy.

Caching

Caching is a widely used mechanism, from in-memory caching of user requests to the page cache that ElasticSearch utilises heavily to speed up searches. However, while many use cases do require the introduction of some sort of cache, many more are just trying to treat a symptom of an underlying performance issue rather than the root cause. Think carefully before introducing yet another caching layer. Most of the time your data layer isn’t slow, it’s just not optimised.
