Trading on the cutting edge

March 31, 2014 07:00 PM

Computer hardware has come a long way. Quad-core machines are common. However, not all quad-core machines are alike. Not only are there differences at the processor level, but developers must code their applications to take advantage of the multiple cores. Expensive quad-core processors include enhancements that a skilled developer can use to maximize performance, but we need to look at each processor separately to assess the potential.

Here, we will review recent hardware enhancements, examine how programming efficiencies can benefit traders and quantify the analytical edge that multicore programming provides.

By the numbers

When shopping for a computer, it is important to identify the characteristics that can make the biggest difference in terms of performance. Some of these key features include number of cores, clock speed, bus speed, hyper-threading and compatibility among components. 

There are two main choices for desktop CPUs: Intel and AMD. These processors are not interchangeable. Moreover, not all CPUs fit all computer makes and models. A CPU attaches to a computer’s motherboard through a special socket. Some manufacturers consistently use the same type of socket in all their products’ motherboards, which can tie a specific CPU type or company to a brand. Finding out whether a certain CPU is compatible with a particular computer model is the most fundamental aspect of CPU selection.

The number of cores, clock speed and hyper-threading capabilities are among the most important specifications. As is often the case with technology, numbers are not everything. Personal needs must be taken into account.

Having more cores on a processor means that it can execute more tasks in parallel. Dual-core processors, which started the multicore revolution, have become the minimum standard on laptops, while quad-core machines have become standard for desktops. It is possible to find CPUs with six or more cores. 

Hyper-threading is an Intel technology that enables a single core to process two threads simultaneously. For example, a six-core processor featuring hyper-threading can process 12 threads at a time. In practice, due to bottlenecks, hyper-threading increases performance by about 30%. Hyper-threading also requires operating system support, in addition to the hardware. 

Then there’s clock speed, which is expressed in gigahertz. This measure describes how many billions of times the computer’s clock pulses in one second. CPUs combining a high clock speed with hyper-threading and multiple cores are the fastest and most powerful. Overclocking allows CPUs to run faster than their regular speed specifications. Overclocking CPUs usually requires more power and will generate more heat. Extreme overclocking can introduce computational errors.

In terms of technology that surpasses the needs of the typical user, Intel remains ahead of AMD in general performance and speed (see “Multiple-Core CPUs,” below). AMD only becomes a competitor when it comes to budget CPUs.

For example, AMD chips do not have hyper-threading and cannot be overclocked. When you buy a quad-core machine, the processor is critical to the speed of the machine when running multicore software. Unlike the i7, the Intel i5 does not offer hyper-threading. If we have properly designed multicore software, a hyper-threaded quad-core processor will generally perform about the same as a six-core AMD chip.
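As a rough back-of-the-envelope check of that comparison, the sketch below applies the roughly 30% hyper-threading gain cited earlier to a quad-core chip. The `HT_GAIN` constant, the `effective_cores` helper and the assumption that throughput scales linearly with core count are simplifications for illustration, not measurements.

```python
HT_GAIN = 0.30  # assumed uplift from hyper-threading, per the ~30% figure in the text


def effective_cores(physical_cores: int, hyper_threading: bool) -> float:
    """Estimate throughput in 'core equivalents', assuming linear scaling."""
    gain = HT_GAIN if hyper_threading else 0.0
    return physical_cores * (1.0 + gain)


quad_ht = effective_cores(4, hyper_threading=True)     # hyper-threaded quad-core
six_plain = effective_cores(6, hyper_threading=False)  # six cores, no hyper-threading

print(f"quad-core with hyper-threading:   {quad_ht:.1f} core equivalents")
print(f"six-core without hyper-threading: {six_plain:.1f} core equivalents")
```

Under these assumptions the quad-core lands at about 5.2 core equivalents, in the same neighborhood as six. Differences in per-core speed between vendors are ignored here.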

Popular high-end machines use six-core or fast quad-core CPUs, such as:

  1. Core i7-3930k and Core i7-3970x: Fastest consumer CPUs on the market with 12 threads and six physical cores.
  2. Core i7-4770k: Haswell processor and excellent value.
  3. Core i7-4770: Recommended minimum processor for traders and an excellent choice for traders on a budget.

My current machine is an Intel six-core i7-990x with 24GB of RAM and a 256GB solid-state hard drive. Newer machines from a custom builder, such as EZ Trading Computers, might come with the Intel Core i7-4960X and benchmark much higher. Because these machines are overclocked to get the best performance, reliability is important. Only purchase from a company that offers a long warranty and technical support. EZ Trading Computers is one such firm, but there are others.

Consider the cloud

Much of the focus in the technology world is on the cloud. The cloud is more than just storage, however. It promises to provide computational power on demand. Services such as Amazon’s EC2 and Microsoft’s Azure promise to provide computational cycles without having to invest in your own server farm.

This is good for businesses because you only pay for what you use. If you need to compute a large result you can launch many “instances” (essentially virtual computers) to compute the result and then terminate them once they’re done. Although this sounds revolutionary, in practice it doesn’t live up to its promises. To see why, we need to explore more about the infrastructure behind cloud computing.

Amazon offers multiple instance types. You can pay for everything from single-thread machines with just a little over 600 MB of RAM all the way up to heavily compute-optimized instances. The single-thread machines are cheap, but aren’t suited for financial computations. For most of our calculations, we would need instances with many cores. Amazon rates each instance by computational power. They call each “unit” of computation an ECU, which is defined as providing “the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.”

Each instance varies primarily by the number of virtual CPUs, memory and “ECU power.” The ECU power means, in theory, that the virtual server, once launched, will provide the equivalent of that many 1.0-1.2 GHz old Xeon processors. It’s worth noting that Xeons are excellent for server applications but tend to be less optimized for numerically intensive computational tasks. 

There is a range of pricing structures, but as an example, for $3.73 per hour, you can launch a 32 virtual CPU Windows server with 108 “ECUs,” 60 GB of RAM and 640 GB of hard-disk storage. This sounds, at first, like the bargain of a lifetime. For less than $4 an hour, we can harness the equivalent of 108 Xeons. 

As a basic test of this solution, a simple program was built that would do 200,000 math operations in a parallel “for loop” over all available threads. Initial results were promising. A new i7-4700MQ laptop did that loop in one minute. A 14 ECU instance took two minutes and a 28 ECU instance took one minute. The eight virtual CPU instance seemed to resemble the computational power of the eight hardware threads in the laptop processor. Further results, though, were disappointing.
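The article does not list the test program, so the following is only a minimal sketch of that kind of benchmark in Python. The workload formula, the function names and the chunking scheme are all assumptions. Worker processes are the default because, in standard CPython, threads do not run CPU-bound code in parallel.

```python
import math
import os
import time
from concurrent.futures import ProcessPoolExecutor

N_OPERATIONS = 200_000  # loop size quoted in the text


def math_chunk(bounds):
    """Run one slice of the loop; the formula is an arbitrary stand-in workload."""
    start, stop = bounds
    return sum(math.sqrt(i) * math.sin(i) for i in range(start, stop))


def split_range(n, parts):
    """Split range(n) into at most `parts` contiguous (start, stop) slices."""
    step = max(1, -(-n // parts))  # ceiling division, never zero
    return [(lo, min(lo + step, n)) for lo in range(0, n, step)]


def timed_parallel_loop(n=N_OPERATIONS, workers=None,
                        executor_cls=ProcessPoolExecutor):
    """Return (loop total, elapsed seconds) using one slice per worker."""
    workers = workers or os.cpu_count() or 1
    t0 = time.perf_counter()
    with executor_cls(max_workers=workers) as pool:
        total = sum(pool.map(math_chunk, split_range(n, workers)))
    return total, time.perf_counter() - t0
```

Calling `timed_parallel_loop()` from a script's `if __name__ == "__main__":` block returns the loop total and the elapsed time. Timings will differ from machine to machine, so no expected figures are given.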

TradersStudio was loaded on four Amazon EC2 instances and a trading system (the Parabolic SAR system, which is notoriously computationally intensive) was run using 100 optimizations over three stocks. The results are as follows:

  • Time on laptop: 42 seconds.
  • Time on EC2 large (4 vCPUs): 1:52 @ $0.466/hr.
  • Time on EC2 2xlarge (8 vCPUs): 1:04 @ $0.932/hr.
  • Time on EC2 4xlarge (16 vCPUs): 0:46 @ $1.864/hr.
  • Time on EC2 8xlarge (32 vCPUs): 0:25 @ $3.728/hr.

Here, Amazon’s computation capabilities start to break down. It took the 16 vCPU instance on Amazon to roughly match the laptop. That instance costs $1.86 per hour. If you’re looking at doing five hours of calculations per day, 200 days of the year, that would be about $1,860 per year. Even going up to 32 vCPUs doesn’t result in much of an improvement, producing the calculation result in about 60% of the laptop’s time for a whopping $3.73 an hour.
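The arithmetic behind these comparisons can be reproduced with a short script. The run times and hourly prices are taken from the list above, and the usage scenario is the five hours per day, 200 days per year mentioned in the text. The variable and function names are illustrative only.

```python
LAPTOP_SECONDS = 42

# (instance label, vCPUs, run time in seconds, price in dollars per hour)
EC2_RESULTS = [
    ("large", 4, 112, 0.466),
    ("2xlarge", 8, 64, 0.932),
    ("4xlarge", 16, 46, 1.864),
    ("8xlarge", 32, 25, 3.728),
]

HOURS_PER_DAY = 5    # usage scenario from the text
DAYS_PER_YEAR = 200


def annual_cost(price_per_hour):
    """Yearly cost of renting one instance under the usage scenario."""
    return price_per_hour * HOURS_PER_DAY * DAYS_PER_YEAR


for name, vcpus, seconds, price in EC2_RESULTS:
    relative = seconds / LAPTOP_SECONDS  # below 1.0 means faster than the laptop
    print(f"{name:>8} ({vcpus:>2} vCPUs): {relative:.2f}x laptop time, "
          f"${annual_cost(price):,.0f} per year")
```

Only the 32 vCPU instance finishes faster than the laptop, at roughly 0.60 of its run time, and it is also the most expensive row.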

Unfortunately, this result was not an outlier. Further tests were also weak. In short, the two largest cloud computing providers weren’t able to beat a modern laptop quad-core processor, which is less powerful than a desktop quad-core and also probably less powerful than the average middle-of-the-road home PC.

The reason for this poor performance is that both Amazon and Windows Azure are subject to throttling. Many virtual servers run on one physical server. That is, one machine in the real world runs many virtual machines for many clients. That physical machine only has so much processing power. If we load the CPU with complex calculations, it can slow down other clients’ servers. To protect against this, large providers throttle users’ CPU usage. If we try to use the server for complex calculations, we trigger this throttling, which slows our speed dramatically. 

Cloud computing seems to be best suited for two groups of people: those who have light server applications (web and database servers) and those who have deep pockets and are willing to pay relatively large amounts of money for the flexibility of being able to add and remove computational power as necessary. Individual traders typically don’t fall into either of those two categories. Maybe in the future, the gains will add up for traders, but for now it seems as though the cloud is still not quite ready for the heavy-duty computations that trading research requires.

Multicore programming

To grasp multicore programming there are a couple of basic concepts that must be understood. Let’s start with the concept of a thread.

A thread is a single sequential flow of control. Threading requires an operating system that supports it. Microsoft Windows is considered a pre-emptive multitasking operating system because Windows’ task scheduler parcels out processor time to all the running programs. These chunks of processor time are called time slices. Programs aren’t in charge of how much processor time they get; the task scheduler is. Because these time slices are so small, you get the illusion that the computer is doing several things at once.

A few definitions help here:

  • A process is a single body of code with a single context (address space). It has at least one thread and can have many.
  • A thread is a “path of execution” through that body of code.
  • Threads share memory, so they have to cooperate to produce the correct result.
  • A thread also has thread-specific data, such as registers, a stack pointer and a program counter.

A multi-threaded application is not the same as a multicore application. All multicore applications are multi-threaded, but multicore applications also require multiple processors, such as those in a quad-core chip. Each “core” is a separate processor, capable of running programs by itself. You get a performance boost when the operating system assigns different processes to different cores. Using multiple threads and multiple processors for even greater performance is called thread-level parallelism (TLP).

With operating system and processor support required, you can’t use multiple threads on everything. Further, not all computing problems benefit from multi-threaded or multicore solutions. We don’t implement multi-threading just because it’s there. You might hurt performance. For example, video codecs may be the worst programs to multi-thread because the data is inherently serial.

A good example of a project to multi-thread would be backtesting a portfolio of markets. We can backtest the markets in individual threads (so eight threads means we can backtest eight markets at once). This type of workload is called an “embarrassingly parallel” workload, which means that little to no effort or change is needed to make the parallelism work.
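A minimal sketch of this pattern in Python follows. It is not TradersStudio code: `backtest_market` is a hypothetical stand-in that simply sums price changes, and the function names and the `markets` dictionary layout are assumptions. The point is only that each market is processed independently, so the work can be handed to a pool of workers without any coordination.

```python
from concurrent.futures import ProcessPoolExecutor


def backtest_market(market):
    """Hypothetical single-market backtest: sums price changes as a toy 'profit'."""
    symbol, prices = market
    profit = sum(later - earlier for earlier, later in zip(prices, prices[1:]))
    return symbol, profit


def backtest_portfolio(markets, workers=8, executor_cls=ProcessPoolExecutor):
    """Backtest every market in its own task; no market needs another's result."""
    with executor_cls(max_workers=workers) as pool:
        return dict(pool.map(backtest_market, markets.items()))
```

With eight workers, up to eight markets are tested at once. In a script, `backtest_portfolio` would be called from inside an `if __name__ == "__main__":` block so that worker processes can start cleanly on every platform.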

That said, multi-threaded code is not simple. It often requires complex coordination of threads. Subtle and difficult-to-find bugs are common because different threads often have to share the same data, so one thread can change data without the other being notified of the change. The general term for this problem is “race condition.” The two threads can get into a “race” to update the same data and the result can be different depending on which thread “wins.” In addition, programs may freeze because one thread cannot finish until it receives data from another thread. 
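To make the race condition concrete, here is a small Python sketch. The `TradeCounter` class and `run_threads` helper are hypothetical names invented for this illustration. The unsafe method shows the unprotected read-modify-write that can lose updates, while the safe method uses a lock so that only one thread at a time changes the shared value.

```python
import threading


class TradeCounter:
    """Shared tally updated by several threads (a hypothetical example)."""

    def __init__(self):
        self.count = 0
        self._lock = threading.Lock()

    def add_unsafe(self, n):
        # Read-modify-write with no coordination: two threads can read the
        # same old value, and one of the two updates is then lost.
        current = self.count
        self.count = current + n

    def add_safe(self, n):
        # The lock lets only one thread at a time run the read-modify-write.
        with self._lock:
            self.count += n


def run_threads(update, n_threads=8, n_updates=10_000):
    """Call update(1) n_updates times from each of n_threads threads."""
    def worker():
        for _ in range(n_updates):
            update(1)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


counter = TradeCounter()
run_threads(counter.add_safe)
print(counter.count)  # 80000: every locked update is counted
```

Passing `add_unsafe` instead may or may not lose updates on a given run, which is exactly what makes such bugs hard to find.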

Perhaps the best lesson here is that to see substantial improvements, and to avoid disastrous consequences, you must design multicore applications from scratch to address these issues. Poorly designed applications will have poor performance, even on a multicore system.

Bottlenecks are another issue, addressed by how you split up the processes. In TradersStudio, when we optimize a session (which is one system running on one or more markets), we either run the session in parallel (running multiple markets at once), or run multiple trials at once. This depends on the number of markets we are running relative to the number of cores available. Depending on the choice, optimization performance can vary by as much as 50%. 
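The article does not spell out the rule TradersStudio uses to make this choice, so the function below is only an assumed rule of thumb, written in Python to illustrate the trade-off. The idea is to parallelize along whichever axis, markets or trials, has enough independent pieces to keep every core busy.

```python
def choose_parallel_mode(n_markets, n_trials, n_cores):
    """Pick the axis that keeps the most cores busy (an assumed heuristic)."""
    if n_markets >= n_cores:
        return "markets"  # enough markets to occupy every core within one trial
    if n_trials >= n_cores:
        return "trials"   # too few markets: run whole trials side by side
    # Neither axis fills the machine; take the one with more pieces.
    return "markets" if n_markets >= n_trials else "trials"


# A 23-market basket on a six-core machine, and three stocks on eight threads
print(choose_parallel_mode(n_markets=23, n_trials=80, n_cores=6))
print(choose_parallel_mode(n_markets=3, n_trials=100, n_cores=8))
```

A production scheduler would likely also weigh memory use and per-market data size, which this sketch ignores.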

The challenges of developing a proper multicore application explain why there are so few in trading, but they are coming. The multicore version of TradersStudio is now in pre-release and is being actively used. It covers portfolio-based testing, money management, walk-forward testing, optimization and all the elements that traders rely on to develop viable trading systems. TradeStation has its charting running on multiple cores, and it is understood that a multicore backtester is in development. MultiCharts and AmiBroker are multicore. 

Trading system benefits

Multicore support is increasingly important. Trading systems have not really advanced in over 20 years because advanced analysis methods such as spectral analysis, support vector machines, neural networks and evolutionary methods such as swarm technology and genetic algorithms all require a lot of processing power. Until recently, these were not feasible to develop on a single-thread machine. 

Here’s an example of a practical benefit of multicore technology. About five years ago, I built an add-in that supported linked optimization of trading systems and automatic selection of parameters based on the shape of the optimization surface. It even performed in-line optimization of multiple systems within the same script. Unfortunately, the technology ran too slowly. With multicore hardware, this tool is now viable. The same is true with neural networks. I’m encouraged to continue the research that is now possible with the new tools.

One of my Treasury bond systems, MAR_Bond, was released in 2010. While this is a robust system, it assumed a natural upward bias in T-bonds. Given the current economic climate, I no longer believe this assumption holds. The problem is that short trades were greatly limited. The goal became to create a symmetrical system that is able to adapt to changes in the T-bond market. The necessary development time was reduced from approximately 20 days to two, thanks to efficiencies in multicore programming. When you consider that time is money, a concept all traders should grasp, the significance of these capabilities is clear.

When developing trading systems, we first develop a premise, and then develop a number of experiments over different ranges of parameters to test that premise. In addition, markets do change over time, and we need to optimize over a large range of values to see if the parameters are stable or if they “drift.” Sometimes we observe changes in our premise. Testing a large set of parameters can confirm our observations.

The recent change in the bond market is a good example of how tweaks in the broad climate can impact a system. To that extent, this system was redesigned to tweak the filters while the same core intermarket relationships are used in both versions. The original system lost a little over $13,000 in 2013. The new system was up a little over $13,000. Short trades that were filtered out in the old system are taken in the new one. These changes help over a longer period as well. The original system made a little over $187,000 since September 1987, while the new system produced $296,000 during the same period. The drawdown is about $3,000 more at $20,125, but the upside improvement warrants the larger risk.

Based on my own experience developing a multicore solution, development has benefited from TradersStudio’s focus on portfolios. We were able to break up the processes and run each market in the portfolio in parallel. A real example, performed on the six-core machine discussed earlier, demonstrates this. This test uses a triple moving average crossover system on a basket of 23 markets, tested on data from Jan. 3, 1991, through Feb. 11, 2014. It was run three ways. On TradersStudio Professional, this test took 46.26 seconds. On TradersStudio Multicore in serial mode, it took 13 seconds. On TradersStudio Multicore on the six-core machine, it took 3.10 seconds. 
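These three timings translate into speedup figures as follows. The calculation uses only the numbers quoted above; the "efficiency" line divides the parallel speedup by the six physical cores and is a standard but simplified measure that ignores hyper-threading.

```python
PRO_SECONDS = 46.26      # TradersStudio Professional
SERIAL_SECONDS = 13.0    # TradersStudio Multicore, serial mode
PARALLEL_SECONDS = 3.10  # TradersStudio Multicore on the six-core machine
CORES = 6

overall_speedup = PRO_SECONDS / PARALLEL_SECONDS
parallel_speedup = SERIAL_SECONDS / PARALLEL_SECONDS
efficiency = parallel_speedup / CORES  # fraction of ideal six-fold scaling

print(f"overall speedup:  {overall_speedup:.1f}x")   # 14.9x
print(f"parallel speedup: {parallel_speedup:.1f}x")  # 4.2x
print(f"efficiency per physical core: {efficiency:.0%}")  # 70%
```

Part of the overall gain therefore comes from the Multicore version being faster even in serial mode, and the rest from running markets in parallel.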

On a more sophisticated optimization of the portfolio-based system, testing the short average from four to 10 in steps of two, the medium average from 10 to 30 in steps of five and the long average from 30 to 60 in steps of 10, we saw further benefits. On TradersStudio Professional, it took 3,616 seconds. The multicore version, on the other hand, took only 219 seconds. This is a speedup of more than 16 times. Such an increase in speed makes developing systems on portfolios of stocks or baskets of commodities efficient because the optimization finds the best set of parameters across the basket.
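The size of this optimization and the resulting speedup can be checked with a few lines of Python. The sketch assumes the parameter ranges are inclusive at both ends, which the text implies but does not state outright.

```python
from itertools import product

short_avgs = range(4, 11, 2)     # 4, 6, 8, 10
medium_avgs = range(10, 31, 5)   # 10, 15, 20, 25, 30
long_avgs = range(30, 61, 10)    # 30, 40, 50, 60

# Every (short, medium, long) combination is one independent trial
grid = list(product(short_avgs, medium_avgs, long_avgs))
print(len(grid))  # 80 parameter combinations

speedup = 3616 / 219  # Professional seconds divided by Multicore seconds
print(f"{speedup:.1f}x")  # 16.5x
```

Each of the 80 combinations is run across the whole basket, so the number of independent backtests grows quickly as parameters or markets are added.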

In the next 10 years, even 28-core machines will be affordable. The cloud will also become more efficient. Anyone will be able to build intraday systems on short time frames, such as tick charts or one-minute bars. Optimizations across large baskets of stocks, such as the entire Russell 2000, will be possible. Walk-forward testing, neural networks, genetic algorithms and machine induction will be available in real time. A smart technology arms race will empower traders to continue pushing the capabilities of hardware. For a quarter century, we have written about the prospects of testing and trading truly sophisticated systems. The day those dreams become a reality is within sight.

Murray A. Ruggiero Jr. is the author of “Cybernetic Trading Strategies” (Wiley). E-mail him at ruggieroassoc@aol.com.

The Xeon Option

Intel offers another high-end processor, the Xeon. There are several considerations that come into play for choosing the i7 over the Xeon for trading.

First, Intel has designed the Xeon specifically for enterprise server use. While these are powerful processors, Intel’s target market for them is data centers and high-performance cloud computing environments. The Intel Core i7 series, on the other hand, was designed for consumer-level computing.

Second, Xeon processors are not designed to be overclocked. While it’s possible to overclock a Xeon, it’s extremely difficult to make stable. You also would have to use a motherboard for which the Xeon is not designed. Traders need stability and reliability! 

On the other hand, Intel Core i7 processors whose model numbers end in “K” (for example, the i7-4770K) are actually “unlocked” by Intel and are designed to be overclocked by enthusiasts with the proper know-how. There is nothing to fear from properly overclocked processors that are designed for the practice. This is why some companies, such as EZ Trading Computers, will offer a five-year warranty on their overclocked processors.

Last, the cost to run the highest-speed Xeon processors is substantially higher than it is to run the fastest i7s. Currently, the fastest i7 we have measured in our testing is our overclocked version of the Intel Core i7-4960X processor at 4.5 GHz. This CPU benchmark testing is done using PassMark’s CPU Mark testing software, which is one of the best and most encompassing benchmark tests. The software generates a benchmark score that can be used for comparison purposes. A benchmark score of at least 7,500 is the absolute minimum for trading.

A representative retail cost of purchasing the Intel Core i7-4960x and the necessary CPU cooler comes to about $1,175. (Note that this is only for illustrative purposes as retail prices do fluctuate.) The Xeon that meets or exceeds the 7,500 CPU benchmark is the Xeon E5-2690v2, which sells for a whopping $2,149. The necessary motherboard for the Xeon version is also more expensive. The build cost is roughly twice as much with a Xeon.
