PSA: So, Opengrok is pretty much I/O bound...

Feb 25, 2011
16,987
1,617
126
Putting this in Programming because OpenGrok is a code browsing / indexing tool:

So, our QA department has an OpenGrok instance (in a VMware VM) with all of the branches of all of our codebases, which reindexes every night. About 400GB total.*

*Only like two guys use it, but that's not my problem.

So, I noticed that when it's sitting on an iSCSI LUN, it takes about 13-14 hours to index. But if DRS has moved it to a host/LUN connected with Fibre Channel, it finishes in under 10 hours. Average CPU load hovers around 12% on an 8-core VM.

I'm not used to seeing that big a performance hit with iSCSI, although I know it's slower. But this made me think.
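One way to put numbers on the iSCSI-vs-local gap before moving anything around is a synthetic random-read test with fio. This is just a sketch: the mount points are made up, and it assumes libaio is available; small random reads are roughly what an indexer's workload looks like.

```shell
# Hypothetical paths -- point these at a file on the iSCSI LUN and one
# on local storage, then compare the reported IOPS and latency.
fio --name=iscsi-randread --filename=/mnt/iscsi-lun/testfile \
    --rw=randread --bs=4k --size=2G --ioengine=libaio --iodepth=32 \
    --direct=1 --runtime=60 --time_based --group_reporting

fio --name=local-randread --filename=/mnt/local-ssd/testfile \
    --rw=randread --bs=4k --size=2G --ioengine=libaio --iodepth=32 \
    --direct=1 --runtime=60 --time_based --group_reporting
```

`--direct=1` bypasses the page cache so you measure the storage path rather than RAM.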

So I took one of our "20% time project" VM hosts, made sure it had enough on-board storage (not a SAN LUN, but 2x 15k drives in RAID-0 with a 200GB SATA SSD set up as a cache drive) to host a clone of our OpenGrok instance, and ran a reindex, which completed in just under 3 hours. (!)

Still never went past 25% CPU use.

I am now migrating the Opengrok data to an mdadm-created RAID-10 of PCI-E SSDs. (Older 350GB Micron P320h's that we had in a closet. So, from around 2012.) This is basically as fast as I can go out of the old testing hardware I have access to, but it should still be pretty enjoyable to watch.
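For reference, a minimal sketch of what that mdadm migration looks like. The device names below are hypothetical (the P320h presents under its own driver's block device names, not these), and chunk-size tuning is omitted:

```shell
# Build a 4-device RAID-10 array out of the PCIe SSDs (example device names)
mdadm --create /dev/md0 --level=10 --raid-devices=4 \
    /dev/diska /dev/diskb /dev/diskc /dev/diskd

# Filesystem and mount point for the OpenGrok data
mkfs.xfs /dev/md0
mount /dev/md0 /opengrok-data

# Persist the array definition so it assembles on reboot
mdadm --detail --scan >> /etc/mdadm.conf
```

RAID-10 gives you the striping you want for indexing throughput without RAID-0's "one dead drive loses everything" failure mode.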

I think I should just quietly point the DNS entry for the opengrok server to the new VM and see if anybody says anything.

tl;dr - Wow. If you are a sysadmin w/ an opengrok instance, get some SSDs for that thing. Yikes.
 

Gryz

Golden Member
Aug 28, 2010
1,551
204
106
Yep.

Everybody always says that running stuff in a VM has almost no extra performance overhead. The typical number you hear is: "programs run about 1 to 5 percent slower in a VM compared to bare metal". And everybody repeats that. And everybody loves VMs.

A friend of mine writes software that has to do with maps and finding your way around the world. He uses an open-source data set for his map data. That data set is quite large: 33GB in its most compressed format.

Every week he downloads that 33GB file from the Internet to one of his own machines. He then runs a conversion program that converts the data from PBF format into his own internal file format. That program takes a few hours to run. He used to run it on his own old, slow PC at home. After a year or so, the people he cooperates with suggested he use the company's machine park for those weekly runs. The machine park consists only of VMs running on the company's own hardware in its own machine room. The rest of the company (which does some other software development) uses only those VMs. So my friend set up his software to run on one of those VMs.

Duration went up fivefold. Instead of a few hours, his conversion program now takes half a day to a day. The system admins got involved. They promised to improve performance; they figured it must be a small thing, because they expected a 5% performance decrease at worst. It turned out they couldn't find anything. I believe they even got a local VMware representative involved. It didn't help. In the end, the only conclusion they could draw was that a fully virtualized environment can have a serious performance impact, especially if you do a lot of I/O and your VMs don't have direct storage.

It all makes a lot of sense to me. There ain't no such thing as a free lunch. When you add layers, when you add abstractions, when you separate hardware, it will all have a performance impact. Believing sales people when they tell you their technology has only benefits and no downsides is a bit naive.

I work for a large technology company. We outsource our IT to another large technology company. (I won't mention the name, because it's HP. It seems lots of companies outsource their IT to HP.) Everything I want to do on the corporate network is slow as thick shit. It's unbelievable. I believe it's because they run all their web servers virtualized, and those servers might be located anywhere in the world. As if RTT has no impact on web applications. As if more layers of abstraction have no performance impact. Accessing random web pages on my PC at home, on a 6.5 Mbps DSL line, feels snappier than any official web application at work. And nobody seems to care.
 
Well, it's still a VM, and it's actually running on an older/slower system CPU-wise now. But yeah, cutting the SAN out of the loop does increase performance, like, a lot.

BTW, 4x PCI-E SSDs passed through to the VM, with mdadm making them a RAID-10 = 70-minute indexing times. THIS is going into production. :-D
 
Gryz said:
I work for a large technology company. We outsource our IT to another large technology company. (I won't mention the name, because it's HP. It seems lots of companies outsource their IT to HP.) Everything I want to do on the corporate network is slow as thick shit. It's unbelievable. I believe it's because they run all their web servers virtualized, and those servers might be located anywhere in the world. As if RTT has no impact on web applications. As if more layers of abstraction have no performance impact. Accessing random web pages on my PC at home, on a 6.5 Mbps DSL line, feels snappier than any official web application at work. And nobody seems to care.

Could also just be poorly written web applications. I see a lot of those. Particularly Java applications in Tomcat containers. You can be running them on your local system and it still "feels" like the web server is in China. On dialup. And Oh. My. God. It's. Full. Of. Javascript.
 

Cogman

Lifer
Sep 19, 2000
10,284
138
106
Gryz said:
Everybody always says that running stuff in a VM has almost no extra performance overhead. [snip] In the end, the only conclusion they could draw was that a fully virtualized environment can have a serious performance impact, especially if you do a lot of I/O and your VMs don't have direct storage.

It just depends.

We run quite a bit of our stuff in VMs and we usually don't see too big of a performance impact (even databases!).

Where we did find a lot of I/O performance problems for our databases was in the network storage layer. Some network storage services don't take kindly to a barrage of reads and writes. We ended up hunting for new NAS solutions for our dev environments because our previous one (NetApp) was dog slow.

With that being said, there is a reason containerization is getting popular. Near-zero overhead while offering most of the benefits of virtualization is pretty sexy.
 

Cogman

Lifer
Sep 19, 2000
10,284
138
106
Could also just be poorly written web applications. I see a lot of those. Particularly Java applications in Tomcat containers. You can be running them on your local system and it still "feels" like the web server is in China. On dialup. And Oh. My. God. It's. Full. Of. Javascript.

The problem is that many devs don't have a good sense of the performance characteristics of web apps. "Why can't a single endpoint make 100 different requests to the database?" You'll find that in any language; however, web devs tend not to be great at this, and Java devs doing web dev stuff tend to really struggle :(.
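The "100 requests per endpoint" pattern is the classic N+1 query problem. A minimal sketch with an in-memory SQLite database (the schema and names are made up for illustration) — against a remote database, the first version pays one network round trip per row, while the second pays one total:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);
""")

def totals_n_plus_one(conn):
    """Anti-pattern: one query to list users, then one query per user."""
    users = conn.execute("SELECT id, name FROM users").fetchall()
    return {
        name: conn.execute(
            "SELECT COALESCE(SUM(total), 0) FROM orders WHERE user_id = ?",
            (uid,),
        ).fetchone()[0]
        for uid, name in users
    }

def totals_single_query(conn):
    """Better: one aggregated query, a single round trip."""
    rows = conn.execute("""
        SELECT u.name, COALESCE(SUM(o.total), 0)
        FROM users u LEFT JOIN orders o ON o.user_id = u.id
        GROUP BY u.id
    """).fetchall()
    return dict(rows)

# Same answer either way; only the number of round trips differs.
assert totals_n_plus_one(conn) == totals_single_query(conn)
```

On localhost both feel instant, which is exactly why the problem only "appears" once the app is deployed against a database on the other side of a network hop.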