debugging slow logon

I love solving tough problems, or at least casting some light on them. One of my customers (I’ve been back in consulting since 2018) was having serious issues with AD logon. It took more than a minute, with sessions timing out, for users to log on to their workstations. It was a beautiful, overprovisioned setup, and we didn’t really pin down the root cause, apart from updating the fileservers’ Fibre Channel card drivers and starting to move data around between shares. I still believe it is suboptimal to partition the load manually by creating new shares instead of leveraging, if it does the job, the DFS(R?) solution from Microsoft.

The fun, for someone like me, came from the challenge of collecting data in a proprietary environment, especially when you have one vendor for the storage, one for the appliance involved, and another and another, and none of them is responsible for the whole solution… a bit like the BER airport: everyone involved, no one responsible for the overall result. So, how to at least isolate the problem? We had data from the storage itself, all green, all performant, but not much from the client OSs… which in this case are the Microsoft fileservers. So? Well, it turns out Microsoft itself has a management and metrics interface called WMI… Windows Management Instrumentation… good, eh? It exposes the kind of information you see in Task Manager and similar tools. Well, a bunch of skilled hackers came up with 2 nice tools, one built on top of the other.

On one hand there is leoluk/perflib_exporter, which reads the performance data directly, actually bypassing the standard WMI interface (more details on its GitHub page), and delivers a full data dump of ALL the metrics available in the OS. I was having a SysOps orgasm going through it. On the other hand there is martinlindhe/wmi_exporter, which simply reads that dataset and converts it to a format Prometheus understands. Leading to this beautiful chart…


This way we could spot which server was serving, how many file descriptors were open on a certain share… and so on and so on… and yes, I had to come up with the SMB Samba Share data class myself, as that was missing, but it was just a couple of hours of cut&paste work. I now need to find the time to clean it up and get it merged back into the main project.
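To give an idea of what "a format Prometheus understands" means, here is a minimal sketch of the text exposition format in Python. It is not the exporter's code (that one is written in Go); the metric name `smb_share_open_files` and the labels `server` and `share` are made up for illustration, and label values are not escaped.

```python
def render_gauge(name, help_text, samples):
    """Render one gauge in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    Note: label values are not escaped in this sketch.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        # Labels are rendered as key="value" pairs, sorted for stable output.
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical metric name, server name and share name.
    print(render_gauge(
        "smb_share_open_files",
        "Open files per share (hypothetical metric).",
        [({"server": "fs01", "share": "home"}, 42)],
    ))
```

Prometheus scrapes text like this over HTTP, and the labels are what let you slice a chart per server or per share.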

p.s. I didn’t know… Go compiles a Windows .exe binary from Linux without a single complaint (setting GOOS=windows for the build is enough)


Why opensource?

Start from a specific situation: you are on a tight schedule, with pressure from management to approve, or prove an issue in, the software release being tested, and the counterpart R&D team claims that your metrics are wrong. You check the testing application and it turns out they are right… looking into the source code you see the bug, you fix it, and within hours you are back on the main business, finalizing the testing activity. Now, what would have happened with a proprietary testing tool? You couldn’t have checked the code. You would have had to argue with your R&D team about the existence of the issue on one side, and with the vendor on the other. The project would have been delayed or QA skipped entirely, with all the risk of having a potentially broken release in production. Is the proprietary tool worth the money? I preferred to convince my management to spend on R&D resources in my team, and get a solution we could trust. It has proven to be a good choice.


don’t use jMeter to test Apache!

rephrased… “do not use jMeter to test applications with very fast response times”

about httpclient: But why isn’t the problem deterministic? Shouldn’t it never recover once the problem starts to happen? The magic is Java garbage collection. I reproduced the effect by forcing garbage collection. It will clean up CLOSE_WAIT connections. But, to be accurate, JVM garbage collection does not handle socket closing by itself. It only frees memory. It is the Socket object that closes the socket in its finalize method, as discussed here.

CLOSE_WAIT socket state, Socket object, finalize & garbage collection
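For those who have never seen how a CLOSE_WAIT comes to life, here is a small sketch in Python (not the Java httpclient code discussed above). The server closes its end first; the client then reads an end-of-file but keeps holding the socket until it closes it explicitly. That in-between period is the CLOSE_WAIT state, and in the Java case it lasts until the finalizer runs. The function name is made up, and the sketch assumes loopback networking is available.

```python
import socket


def half_closed_demo():
    """Show a client socket lingering after the peer has closed."""
    # Listening socket on an ephemeral loopback port.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)

    client = socket.create_connection(server.getsockname())
    client.settimeout(5)
    conn, _ = server.accept()

    # The server side closes first: the client receives a FIN and its
    # end of the connection moves to CLOSE_WAIT.
    conn.close()
    data = client.recv(1024)  # an empty read means the peer has closed

    # The client still owns the descriptor until it closes explicitly.
    still_open = client.fileno() != -1
    client.close()
    closed = client.fileno() == -1

    server.close()
    return data, still_open, closed


if __name__ == "__main__":
    print(half_closed_demo())
```

The fix on the client side is always the same: close the connection deterministically as soon as you are done with it, instead of waiting for the runtime to do it for you.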

you end up with garbage collection slowdowns on the client side (one more link), and that will affect your metrics collection as well… no matter how high you tune your ulimit (which limits how many sockets your process can have open concurrently)… you simply can’t test such an application with jMeter… it is “too slow”… grinder is a bad choice as well (anything Java based is!)
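If you are curious what limit your load generator is actually running with, this is a quick way to check it from inside a process. A minimal sketch, assuming a Unix-like system (Python's resource module is not available on Windows); the function name is made up.

```python
import resource  # Unix-only module


def fd_limits():
    """Return the (soft, hard) limit on open file descriptors.

    Sockets count against this limit, so it caps the number of
    concurrent connections a single load generator process can hold.
    """
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = fd_limits()
    print(f"open files: soft={soft} hard={hard}")
```

The soft value is the one `ulimit -n` shows in the shell that started the process.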

if you want to do so, a good option is “ab”, aka Apache Benchmark, within the limits of the scenario you can describe… it won’t allow you to do complex load and performance testing scenarios… it is basically hitting ONLY one url… but it can help when you need to stress a specific call, or your apache settings

another good choice might be tSung, written in Erlang… the system creates queues of tasks, and you can simulate a pretty heavy amount of “browser like scenarios”, with multiple parallel requests fired just after the html has been retrieved… to get js, css, and images…

(for Italians only: “tell Brunetta! – il Fatto Quotidiano: Online medical certificates, systems go down“)

webservices? well… in two and a half years at Vodafone we came to the conclusion that grinder with our self-made HTTP-QAT toolkit was the best choice… (create your request from the WSDL with SoapUI, then put it in as a template inside HTTP-QAT, and fire!!!! :p )
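The "request as a template" idea can be sketched in a few lines of Python. This is not HTTP-QAT, just an illustration of the approach: the envelope is what a tool like SoapUI would generate from the WSDL, with a placeholder where the test data goes. The operation `GetCustomer`, its namespace, the `customer_id` placeholder and the endpoint URL are all made up.

```python
from string import Template
from urllib.request import Request
from xml.sax.saxutils import escape

# Hypothetical SOAP 1.1 envelope, with one placeholder for test data.
SOAP_TEMPLATE = Template("""<?xml version="1.0" encoding="utf-8"?>
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
  <soapenv:Body>
    <GetCustomer xmlns="http://example.com/hypothetical">
      <CustomerId>$customer_id</CustomerId>
    </GetCustomer>
  </soapenv:Body>
</soapenv:Envelope>""")


def build_soap_request(url, action, **params):
    """Fill the template and wrap it in a ready-to-send POST request."""
    # Escape the values so test data cannot break the XML.
    safe = {key: escape(str(value)) for key, value in params.items()}
    body = SOAP_TEMPLATE.substitute(safe).encode("utf-8")
    headers = {
        "Content-Type": "text/xml; charset=utf-8",
        "SOAPAction": f'"{action}"',
    }
    return Request(url, data=body, headers=headers)


if __name__ == "__main__":
    req = build_soap_request("http://localhost/ws", "GetCustomer",
                             customer_id="42")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Passing the result to `urllib.request.urlopen` would actually fire the call; the load tool then only has to loop over the test data.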

I’d like to spend time integrating tools like cucumber and tSung:
performance, plus an easy way to describe/document the test scenario

Also, Hadoop with JFreeChart would be a good basis for a rewrite of Ground Report

what’s “ab”?
(read below, 95% of my requests were pretty fast…
we have some long tails on 5% of the overall sent http calls… )

grinder:~ zeph$ ab -c 5 -n 100 http://localhost/
This is ApacheBench, Version 2.3
Copyright 1996 Adam Twiss, Zeus Technology Ltd,
Licensed to The Apache Software Foundation,

Benchmarking localhost (be patient).....done

Server Software:        Apache/2.2.15
Server Hostname:        localhost
Server Port:            80

Document Path:          /
Document Length:        44 bytes

Concurrency Level:      5
Time taken for tests:   0.570 seconds
Complete requests:      100
Failed requests:        0
Write errors:           0
Total transferred:      42500 bytes
HTML transferred:       4400 bytes
Requests per second:    175.45 [#/sec] (mean)
Time per request:       28.498 [ms] (mean)
Time per request:       5.700 [ms] (mean, across all concurrent requests)
Transfer rate:          72.82 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.2      0       1
Processing:     0   15  80.6      1     569
Waiting:        0   15  80.6      1     569
Total:          1   15  80.7      1     570

Percentage of the requests served within a certain time (ms)
  50%      1
  66%      1
  75%      1
  80%      2
  90%      2
  95%     78
  98%    569
  99%    570
 100%    570 (longest request)
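Why look at the percentile table and not just at the mean? A small Python sketch makes the point. The latency samples below are invented for illustration (they are not the raw data behind the ab run above), and the percentile uses the simple nearest-rank definition, which is not necessarily how ab computes its table.

```python
def percentile(samples, p):
    """Nearest-rank percentile: smallest value covering p% of samples."""
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, kept to at least rank 1.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    # Invented latencies in ms: 95 fast requests and a 5% long tail.
    latencies = [1] * 90 + [2] * 5 + [570] * 5

    mean = sum(latencies) / len(latencies)
    print(f"mean={mean} ms")
    for p in (50, 95, 99):
        print(f"p{p}={percentile(latencies, p)} ms")
```

With a tail like this the mean describes nobody: most requests are far faster than it, and the slow ones are far slower. That is why the median and the high percentiles are the numbers worth reading first.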