For an investment bank and one of their main pricing and trading systems, I led the design and implementation of a document repository for historical pricing and trading data: Big Data, with many challenges.
While I cannot reveal project-specific details due to non-disclosure agreements, here are some general thoughts on this topic.
Pricing and Trading systems are a bit special.
Financial markets on electronic trading platforms such as Xetra move quickly. When, for example, one of the major German stocks moves, the DAX 30 might move a point up or down, and a derivatives pricing system has to immediately re-calculate thousands of products that are based on the DAX. The market prices of underlyings are not the only frequently changing input data: there are also FX rates, volatility surfaces, and tons of other pricing parameters.
In other applications it can be a challenge to keep a high number of (web) users happy with low response times. But there we still talk about response times in the range of several hundred milliseconds, and if every now and then a few users need to wait a bit longer, it is not a big problem.
In pricing and trading systems, every millisecond matters, for each and every transaction. If derivative prices do not reflect input data changes fast enough, this opens up arbitrage opportunities for other market participants, and that would mean financial losses.
Pricing systems for financial products can easily involve dozens of servers that produce billions of price updates per trading day, i.e. thousands of updates per second! It is not surprising that many pricing systems mainly care about the "here and now", and not so much about documenting what happened a few moments ago, let alone months ago.
This is changing a bit with new regulations by financial supervisory authorities such as the German BaFin.
E.g. the BaFin regulations on algorithmic trading require: "50. It shall be documented in a comprehensive manner why an algorithm has led to a certain trading decision. Hence, all relevant market data as well as unexecuted orders and quotes shall be kept for a period of three months."
The regulatory requirements are a typical use case for document-oriented databases: the pricing and trading context data often follows complex, dynamic data models. The data models change over time, as the business keeps developing new products and new pricing strategies to stay competitive. And historical data shall not be migrated to newer models: it shall document how things "were", not how things "would be" today.
A history repository can be designed as a Java web application that offers clean HTTP REST services: POST/PUT operations for validating, enriching and storing data, GET for retrieving, DELETE for removing. The main payload type should usually be application/json, but consider additional support for binary JSON representations such as Smile or UBJSON, for performance-optimized technical clients.
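Such a service layer can be sketched in a few lines with the JDK's built-in `com.sun.net.httpserver`. The `/documents/{id}` path, the handler logic, and the in-memory map below are purely illustrative assumptions, not the project's actual API:

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Minimal sketch of the repository's REST surface (hypothetical paths). */
public class HistoryRepositoryServer {

    // In-memory stand-in for the real document store.
    static final Map<String, String> store = new ConcurrentHashMap<>();

    // Dispatch one request; validation and enrichment would happen here too.
    static String handle(String method, String docId, String body) {
        switch (method) {
            case "POST":
            case "PUT":
                store.put(docId, body);
                return "stored";
            case "GET":
                return store.getOrDefault(docId, "");
            case "DELETE":
                store.remove(docId);
                return "deleted";
            default:
                return "unsupported";
        }
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/documents/", (HttpExchange ex) -> {
            String docId = ex.getRequestURI().getPath()
                    .substring("/documents/".length());
            String body = new String(ex.getRequestBody().readAllBytes(),
                    StandardCharsets.UTF_8);
            byte[] resp = handle(ex.getRequestMethod(), docId, body)
                    .getBytes(StandardCharsets.UTF_8);
            ex.getResponseHeaders().set("Content-Type", "application/json");
            ex.sendResponseHeaders(200, resp.length);
            try (OutputStream os = ex.getResponseBody()) { os.write(resp); }
        });
        server.start();
    }
}
```

A real implementation would of course add content negotiation (e.g. for Smile or UBJSON), validation, and authentication on top of this skeleton.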
The repository should not dictate specific document structures. Only a few top-level attributes, such as the document date and the upload date, are metadata that the repository should be explicitly aware of.
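A minimal sketch of such a document envelope, with hypothetical field names (`id`, `documentDate`, `uploadDate`); the payload stays an opaque structure that the repository never interprets:

```java
import java.time.Instant;
import java.util.Map;

/** Sketch: only a handful of metadata fields are known to the repository;
 *  everything else is an opaque, client-defined document. */
public class StoredDocument {
    public final String id;
    public final Instant documentDate;        // business date the document refers to
    public final Instant uploadDate;          // when it reached the repository
    public final Map<String, Object> payload; // arbitrary structure, never inspected

    public StoredDocument(String id, Instant documentDate, Instant uploadDate,
                          Map<String, Object> payload) {
        this.id = id;
        this.documentDate = documentDate;
        this.uploadDate = uploadDate;
        this.payload = payload;
    }
}
```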
The underlying document storage system of the repository is an implementation detail that should be hidden from the clients. One option is TokuMX, an alternative MongoDB distribution with a far better storage engine than MongoDB itself. For large repositories, data compression can be one of the killer features. TokuMX compresses the data down to about 10-15% on disk.
In the initial deployment a simple replica set can do the job, however as the data grows consider horizontal scaling through sharding.
It should be a core architectural decision for most systems to never allow any operations directly on their internal databases. Everything needs to run through the repository's REST service layer, not least to be able to easily migrate to other storage systems later, if required.
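One way to keep the storage system exchangeable behind the service layer is a small storage interface; the names and the in-memory implementation below are illustrative assumptions:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class DocumentStores {

    /** The service layer programs against this interface only. */
    public interface DocumentStore {
        void put(String id, String json);
        Optional<String> get(String id);
        void delete(String id);
    }

    /** Trivial in-memory stand-in; a TokuMX/MongoDB-backed implementation
     *  could replace it without touching the REST layer. */
    public static class InMemory implements DocumentStore {
        private final Map<String, String> docs = new HashMap<>();
        public void put(String id, String json) { docs.put(id, json); }
        public Optional<String> get(String id) {
            return Optional.ofNullable(docs.get(id));
        }
        public void delete(String id) { docs.remove(id); }
    }
}
```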
Any history repository should also feature a User Interface (UI) for viewing historical data. It should allow for comparing different data versions, and for drilling down from, e.g., trade data to pricing calculation details, and further to input data such as the volatility surface.
How to combine the best of both worlds, browser-based UIs and rich client integration? Consider the JavaFX WebView component. This is a full-blown WebKit browser that can be seamlessly integrated into Java applications, without any dependency on the browser(s) installed on the client machine.
With this approach the UI can be fully integrated into existing trading systems' rich clients, but is additionally available as a completely independent browser-based application.
Documenting pricing and trading decisions in every small detail for several months is not at all easy if the pricing and trading system has a significant size.
It is impossible (or at least very expensive) to simply store "everything". Billions of price updates per day, each of them with a few hundred KB of input data, would mean storing several hundred TB of data, each and every day!
Luckily, from a technical point of view, not every price recalculation is relevant for trading: if you have 10,000 transactions (quotes and trades) per day and 100 KB of input data per transaction, you end up with 1 GB of data per day. 1,000,000 transactions at 200 KB each would mean 200 GB per day. These numbers are still impressive, but feasible to handle.
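The back-of-the-envelope numbers above can be checked in a few lines (using decimal units, 1 KB = 1,000 bytes):

```java
/** Back-of-the-envelope check of the storage volumes discussed above
 *  (decimal units: 1 KB = 1,000 bytes, 1 GB = 1,000,000,000 bytes). */
public class DataVolume {
    static final long KB = 1_000L;
    static final long GB = 1_000_000_000L;

    /** Raw bytes per day for a given transaction count and context size. */
    static long bytesPerDay(long transactionsPerDay, long bytesPerTransaction) {
        return transactionsPerDay * bytesPerTransaction;
    }

    public static void main(String[] args) {
        // 10,000 transactions x 100 KB each -> 1 GB per day
        System.out.println(bytesPerDay(10_000, 100 * KB) / GB + " GB/day");
        // 1,000,000 transactions x 200 KB each -> 200 GB per day
        System.out.println(bytesPerDay(1_000_000, 200 * KB) / GB + " GB/day");
    }
}
```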
Still, a major problem persists: when calculating a price, the pricing servers usually cannot know (yet) whether this price will lead to a trade or not. As you cannot store each and every tick in the repository, the data needs to be kept temporarily in memory (for a few seconds up to minutes), and should only be stored once it has become relevant.
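This "keep temporarily, persist only when relevant" buffering can be sketched as follows. All names are assumptions; a real system would also need thread safety and, as discussed next, would likely keep the contexts off the heap:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Sketch of a tick buffer: price contexts are held in memory and only
 *  handed over for persistence once a trade makes them relevant. */
public class TickBuffer {

    static final class Entry {
        final String context;        // serialized pricing input data
        final long timestampMillis;  // when the quote was priced
        Entry(String context, long timestampMillis) {
            this.context = context;
            this.timestampMillis = timestampMillis;
        }
    }

    private final Map<String, Entry> byQuoteId = new HashMap<>();
    private final Deque<String> insertionOrder = new ArrayDeque<>();
    private final long maxAgeMillis;

    public TickBuffer(long maxAgeMillis) { this.maxAgeMillis = maxAgeMillis; }

    /** A new price was calculated; buffer its context in case a trade follows.
     *  (Re-quoted ids may be evicted slightly early in this simplified sketch.) */
    public void onPriceUpdate(String quoteId, String context, long nowMillis) {
        byQuoteId.put(quoteId, new Entry(context, nowMillis));
        insertionOrder.addLast(quoteId);
        evictOlderThan(nowMillis - maxAgeMillis);
    }

    /** The quote turned into a trade: its context is now relevant and is
     *  returned for persistence. Returns null if it was already evicted. */
    public String onTrade(String quoteId) {
        Entry e = byQuoteId.remove(quoteId);
        return e == null ? null : e.context;
    }

    private void evictOlderThan(long cutoff) {
        while (!insertionOrder.isEmpty()) {
            Entry e = byQuoteId.get(insertionOrder.peekFirst());
            if (e == null) { insertionOrder.pollFirst(); continue; } // already traded
            if (e.timestampMillis >= cutoff) break;                  // rest is younger
            byQuoteId.remove(insertionOrder.pollFirst());            // too old, drop
        }
    }

    public int size() { return byQuoteId.size(); }
}
```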
For Java-based low-latency systems, garbage collection (GC) is a major source of unwanted delays. Be careful not to change the Java heap memory usage characteristics of the pricing system too much! One option is to buy an optimized JVM such as Azul Zing; another is to temporarily store the documents in off-heap memory. On the Java platform this is possible through sun.misc.Unsafe, which allows for native memory access. If you do this, you obviously need to clean up yourself (no automatic garbage collection). Frameworks such as GridGain or Apache DirectMemory are worth looking into if you don't want to implement everything from scratch.
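Besides sun.misc.Unsafe, a direct ByteBuffer is a simpler standard way to hold serialized documents outside the Java heap, so they do not add to GC pressure. The class below is a minimal illustrative sketch (single slot, no concurrency, fixed capacity):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Sketch: one serialized document kept in native (off-heap) memory. */
public class OffHeapSlot {
    private final ByteBuffer buffer;
    private int length;

    public OffHeapSlot(int capacity) {
        // Direct buffers live in native memory, outside the garbage-collected heap.
        buffer = ByteBuffer.allocateDirect(capacity);
    }

    /** Copy the document's UTF-8 bytes into native memory. */
    public void write(String document) {
        byte[] bytes = document.getBytes(StandardCharsets.UTF_8);
        buffer.clear();
        buffer.put(bytes);
        length = bytes.length;
    }

    /** Copy the bytes back onto the heap, e.g. when the tick became relevant. */
    public String read() {
        byte[] bytes = new byte[length];
        buffer.position(0);
        buffer.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Only the small OffHeapSlot object itself is visible to the collector; the document bytes are not scanned during GC, which is the point of the exercise.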
This article showed some of the technical challenges that financial systems face when it comes to new regulatory requirements.
A history repository that is designed as an independent technical service is not at all specific to one financial system. Other neighboring systems can store data there, too. It would even be possible to use the repository in completely different, non-financial business contexts.
In our project, after a couple of weeks of stable operations, several TB of raw data have been stored in such a history repository. The data growth rate will even increase: we plan to add more data that is not mandatory from a regulatory perspective, but useful to have. Other data is required to replace existing, less powerful auditing facilities.