Citrix Local Host Cache - 2017 Performance Evaluation

By Trentent Tye posted 11-27-2017 12:00 AM

  

Citrix introduced a feature with XenApp/XenDesktop 7.12 called “Local Host Cache” (LHC). When this feature was released, there were some limitations, but as time wore on and Citrix began to understand the technology, limitations were removed or reduced.

There are still some limitations on the technology compared to 6.5 IMA LHC, but it looks much better today than a year ago, and undoubtedly it will be better a year from now. However, the documented limitations of the Local Host Cache have me quite curious.

As an example, going through the Local Host Cache documentation, one of the big “limitations” is the number of VDAs. When LHC was introduced, the limit was 5,000 VDAs, but there was never a limit on the number of RDS sessions. The closest information I could find was this old LHC sizing guide that is now out of date. In it, the author tested 100,000 RDS users and 5,075 VDI VDAs. His findings for RDS sessions were difficult to parse, if only because the raw data is not available, but still, some information can be gleaned from the images.

The ‘theoretical min’ row is the absolute minimum time 100,000 users would take to log on if the environment was able to process 20 launches per second, giving 1 hour 23 minutes 20 seconds. In these tests, the 0 applications row managed 1:30:57 in the 6 vCPU case, and 1:30:48 in the 8 vCPU case. The performance of the Active Directory domain will have some impact on how quickly users are authenticated.
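As a quick sanity check on that figure, here is a minimal Python sketch (not taken from the sizing guide) that recomputes the theoretical minimum and the overhead of the measured 6 vCPU, 0 applications run. The function name is made up for illustration.

```python
from datetime import timedelta


def theoretical_min(users: int, launches_per_second: float) -> timedelta:
    """Shortest possible time to launch `users` sessions at a fixed rate."""
    return timedelta(seconds=users / launches_per_second)


floor = theoretical_min(100_000, 20)
# Measured value quoted above: 0 applications row, 6 vCPU case.
measured = timedelta(hours=1, minutes=30, seconds=57)

print(f"theoretical minimum: {floor}")
print(f"measured:            {measured}")
print(f"overhead:            {measured / floor - 1:.1%}")
```

The measured run lands within roughly ten percent of the floor, which suggests the launch rate, not the broker, was the limiting factor in that row.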

From the images, enumerating 400 applications at 20 requests per second (8 vCPU, LHC on) appears to have taken 1:55:12(!)

I had a couple of concerns regarding the article.

The first concern is that the testing was done on a very old processor, an AMD 8431 released June 1st 2009. This processor was about seven and a half years old when the article came out. It has a single thread rating of 854, while a CPU three years newer (the Intel E5-2680, still quite old!) scores 1674 for single thread performance. Nearly 2x the performance! It would be extremely handy to understand the actual performance limits of LHC in a few CPU scenarios (low, medium and high performing tiers). I would imagine that this is more realistic today than deciding that your controllers will live on hardware that is sorely out of date.

The second concern is that he tested with 7.12. There is nothing wrong with that in itself, as that was the current release when he tested, but 7.14 brought dramatic performance improvements to LHC: enough that a 2x VDA density was achieved for a single zone and 8x the number of VDAs for a site! Retesting with the improved LHC would have been nice, and it is even more important now because these improvements are in the 7.15 LTSR and people may be sizing their brokers and/or farms with outdated and potentially incorrect assumptions.

I had experimented with testing the performance of the brokers at enumerating applications (400 applications actually!) on XenApp 7.13, and I found I could enumerate the applications at a crazy concurrent rate of 200/sec and it completed all requests in less than 1000ms. To offer some real world perspective on a fair-sized environment (concurrent user count of ~14,500), I examined the peak user logon rate and the rate was ~14/sec during peak logon time. This means that over a 15-minute timespan in the morning, this 6.5 environment satisfies 12,600 RDS logons. When I tested a 7.X playground environment, it stayed in lockstep with 6.5 until it outperformed it when pushed with extremely high concurrent logon rates.

With everything I’ve said, I’m going to test the LHC performance on 3 classes of processors.

In order to remove storage from factoring into this testing, I’m going to put the LHC database on a 1.5GB RAMDisk.

I’m going to configure my Broker VM for best performance. The broker will be configured as a 2 socket, 8 core system (16 vCPU total). Each host this VM resides on will have 2 sockets of the processor under test, Hyper-threading enabled, and no other VMs running. With LHC on, the theory is 4 of those cores will go to the SQL Server Express instance. The VM will have 8GB RAM and will be NUMA aware.

In order to test performance, I’m going to use WCAT to spin up a fixed number of concurrent application enumeration requests against the broker.

Citrix has some excellent performance counters that measure the load against the LHC. “Citrix High Availability XML Service – Concurrent Transactions” accurately matched the load WCAT was reporting, so this is excellent! It means that the number of user enumeration requests was spot on. The counter “Citrix High Availability XML Service – Avg. Transaction Time” measured how long it took before I got back a response for the requests. With these two counters I can measure my load and how long it took for my RDS session application enumeration to respond.

I configured WCAT to add 10 concurrent connections every minute for 10 minutes, to a maximum of 100 concurrent users requesting enumeration. Why concurrent? Well, that’s just what WCAT does. However, concurrent testing makes this test very different from what was described in the original Citrix test. The Local Host Cache Sizing article states that testing was not at a fixed concurrent amount, but at a rate. That rate was 20 enumeration attempts per second. My testing at 20 enumeration attempts per second shows that the LHC, on these processors, chews through them like they are nothing. If the processor can finish each request before the next interval begins (one second in this instance), then you’ll just come up with the “theoretical limit.” Example:

Each packet was processed before the “second” was complete, thus the performance will always be at the “theoretical limit” unless the processing time exceeds the “rate.”

Testing “concurrent enumeration,” however, can process more transactions per second because it ensures there are always 5 enumeration requests in flight. If these transactions take less than a second, then more users can be processed per second.

 

In my simplified examples, the rate-based test of 5 enumeration requests each second will only ever complete 5 requests per second. For the concurrent enumeration requests, the range is 9-10 requests per second. Twice the work accomplished!
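The difference between the two load models can be sketched in a few lines of Python. This is a toy model, not WCAT itself: the function names are made up, and the 500-550 ms service times are assumptions picked only because they reproduce the 9-10 requests per second of the simplified example.

```python
def fixed_rate_completions(requests_per_sec: int, service_ms: int, window_ms: int) -> int:
    """Open loop: a new request starts every 1000/requests_per_sec ms,
    whether or not earlier replies have come back."""
    interval_ms = 1000 // requests_per_sec
    starts = range(0, window_ms, interval_ms)
    # Count only the requests that finish inside the measurement window.
    return sum(1 for start in starts if start + service_ms <= window_ms)


def concurrent_completions(clients: int, service_ms: int, window_ms: int) -> int:
    """Closed loop: each client sends its next request the moment the
    previous one returns, so `clients` requests are always in flight."""
    return clients * (window_ms // service_ms)


WINDOW_MS = 60_000  # one minute of simulated load
for service_ms in (500, 550):  # assumed per-request processing time
    fixed = fixed_rate_completions(5, service_ms, WINDOW_MS)
    closed = concurrent_completions(5, service_ms, WINDOW_MS)
    print(f"{service_ms} ms/request: "
          f"fixed rate {fixed / 60:.2f}/s, concurrent {closed / 60:.2f}/s")
```

In this model the fixed-rate test can never report more than the rate it was given, however fast the broker is, while the concurrent test speeds up as the per-request time shrinks.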

Can we find out how many concurrent enumeration requests are completed in a given second then?

Yes! Citrix offers a 3rd counter that I will key in on: “Citrix High Availability XML Service – Transactions/sec.” This counter will tell me how many requests were completed in a given second. This counter provides me with an actual count of the “real work.”

Here is an image of the raw data:

The red line is the number of transactions done per second, the green line is the concurrent number of users requesting enumeration and the blue line is how long it took to satisfy each request.

The raw data:

                E5-2670                             E5-2660v4                           E5-2690v4
Conc. requests  Trans/sec    Avg. Trans. Time (ms)  Trans/sec    Avg. Trans. Time (ms)  Trans/sec    Avg. Trans. Time (ms)
10              45.41884982  212.1025433            81.37907875  117.0314831            104.1664581  94.58304208
20              45.06658696  439.0654575            78.93333422  251.1161924            104.8795065  187.3677897
30              36.41660588  824.0843186            65.67668397  458.6470345            78.37808685  386.0522131
40              33.6832376   1187.350383            63.71642739  636.0645527            74.79906537  546.612349
50              31.48328618  1584.367745            53.93323804  919.7082417            70.31665937  709.232745
60              32.84991648  1806.315163            54.30730723  1100.968744            64.83313495  931.3893972
70              31.44991974  2233.383315            53.36657651  1295.908498            66.74986539  1031.437336
80              30.46667135  2553.245907            51.53109524  1533.056509            65.91644596  1209.430939
90              29.58324755  3105.931052            52.54982264  1705.29403             58.84991703  1568.117465
100             28.1666105   3610.460116            51.45013343  1932.482214            55.3165561   1911.875756
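
One way to cross-check the three counters against each other is Little's law: in steady state, requests in flight should be about throughput times average time per request. The Python sketch below applies that to the E5-2670 columns of the raw data. It is an illustrative check added here, not part of the original test procedure, and it assumes both counters were averaged over the same interval.

```python
# (configured concurrent requests, Trans/sec, Avg. Trans. Time in ms), E5-2670
E5_2670 = [
    (10, 45.41884982, 212.1025433),
    (20, 45.06658696, 439.0654575),
    (30, 36.41660588, 824.0843186),
    (40, 33.6832376, 1187.350383),
    (50, 31.48328618, 1584.367745),
    (60, 32.84991648, 1806.315163),
    (70, 31.44991974, 2233.383315),
    (80, 30.46667135, 2553.245907),
    (90, 29.58324755, 3105.931052),
    (100, 28.1666105, 3610.460116),
]


def implied_concurrency(trans_per_sec: float, avg_time_ms: float) -> float:
    """Little's law: requests in flight = throughput x time per request."""
    return trans_per_sec * avg_time_ms / 1000.0


for configured, tps, avg_ms in E5_2670:
    implied = implied_concurrency(tps, avg_ms)
    error = (implied - configured) / configured
    print(f"configured {configured:>3}  implied {implied:6.1f}  ({error:+.1%})")
```

Every row comes out within a few percent of the configured concurrency, which is a good sign that the counters and the WCAT settings agree with each other.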

 

And now, the pretty graphs:

Average Transaction Time. Lower is better

Number of transactions per second. Higher is better

Per processor comparisons:

Satisfying a theoretical 100,000 XenApp users at 20 concurrent requests would take (minutes:seconds):

 E5-2690v4 15:53
 E5-2660v4 21:06
 E5-2670 36:58
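
These times appear to follow from dividing 100,000 by the Trans/sec value measured at 20 concurrent requests. The Python sketch below reproduces them under that assumption. The helper name is made up, and truncating to whole seconds (rather than rounding) is a guess that happens to match the figures above.

```python
# Trans/sec at 20 concurrent requests, taken from the raw data table.
TPS_AT_20_CONCURRENT = {
    "E5-2690v4": 104.8795065,
    "E5-2660v4": 78.93333422,
    "E5-2670": 45.06658696,
}


def time_to_enumerate(users: int, trans_per_sec: float) -> str:
    """Return mm:ss needed to satisfy `users` requests at a steady throughput."""
    total_seconds = int(users / trans_per_sec)  # truncate to whole seconds
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}:{seconds:02d}"


for cpu, tps in TPS_AT_20_CONCURRENT.items():
    print(f"{cpu:<10} {time_to_enumerate(100_000, tps)}")
```

Swapping in the Trans/sec value for a different concurrency level gives the equivalent estimate for that load.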

 

So, what is all of this information telling us? Knowing your peak concurrent rate is important. Making sure the VM that hosts your LHC is configured correctly and sits on the best hardware possible will give you the best possible performance in the event you need to use your LHC.

These are very surprising results! The high performance processor is over 2x faster than the low performance tier! I wasn’t expecting such a discrepancy given the CPUMark numbers.
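
To see how that gap holds up across the whole test, the Python sketch below divides the E5-2690v4 Trans/sec figures by the E5-2670 figures at each concurrency level. It only uses values from the raw data table, and the variable names are illustrative.

```python
CONCURRENCY = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Trans/sec per concurrency level, copied from the raw data table.
E5_2670_TPS = [45.41884982, 45.06658696, 36.41660588, 33.6832376, 31.48328618,
               32.84991648, 31.44991974, 30.46667135, 29.58324755, 28.1666105]
E5_2690V4_TPS = [104.1664581, 104.8795065, 78.37808685, 74.79906537, 70.31665937,
                 64.83313495, 66.74986539, 65.91644596, 58.84991703, 55.3165561]

# Throughput ratio of the high tier over the low tier at each load level.
speedups = [high / low for high, low in zip(E5_2690V4_TPS, E5_2670_TPS)]

for level, ratio in zip(CONCURRENCY, speedups):
    print(f"{level:>3} concurrent: {ratio:.2f}x")
```

By this measure the ratio is highest at 10-20 concurrent requests, at about 2.3x, and settles to around 2x at the heaviest loads.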

The LHC does appear to operate at optimum performance at less than 20 concurrent connections. Unfortunately, I do not know of a way to govern LHC to a maximum number of concurrent connections. You can create an artificial limit by creating zones to carve up your environment. If the LHC needs to kick in for that scenario, having the environment carved up will reduce the number of connections as the load will be spread over multiple zones, thus improving performance.

In the end, the performance of the LHC appears to be quite good when viewed from a XenApp workload perspective. Ensuring users get their applications efficiently and quickly, even during a major outage like a database outage, is important and the local host cache implementation of XenApp 7.15 looks to be up to the task.


#XenApp
#XenDesktop
#Local_Host_Cache
#VDA

Comments

11-29-2017 05:40 AM

Nice article

100% with Tobias: this article contains a lot of useful, updated information! Thanks for sharing :)

11-28-2017 12:27 PM

great article

This is a very nicely put together and insightful article. It accentuates the need to revisit what often is considered "standard" and to question and retest things that are not likely to hold up over time as technology changes. Daniel Feller at Citrix does this all the time and sometimes, the results are truly surprising. Thanks very much for sharing these insights and above all, the metrics that say it all.