Socket/Server performance issue/question

Red Squirrel

No Lifer
May 24, 2003
70,620
13,818
126
www.anyf.ca
I'm writing a server program for testing/learning purposes. I want to start getting into bigger programs, but first I'm coding base C++ libraries for the repetitious stuff so I'm not reinventing the wheel all the time.

All this server does is accept client connections and write whatever data each client sends to that client's own file. I noticed that with 200 concurrent connections sending data as fast as they can, it really bogs down and is slow; I'm only getting about 500KB/minute written to disk across all files. This is basically the structure of the program, and I'm wondering how I could improve it:


1 - A SessionHandler class is created; this houses most of the program state, and a global instance of this class is then declared.

2 - In main, 10 worker threads are started; I'll get to those later.

3 - The socket is initiated and listening.

4 - A loop runs poll() and acts upon any data. This is non-blocking. If a new connection is detected, a client is created and handed to the session handler class, which has an array of all the active connections.

5 - poll() is then called again and loops through each client to see if it has data; if it does, it grabs one byte and stores it in its buffer.

6 - The loop goes back to step 4 at this point.


Now for the worker threads:

Basically each one is just a loop where an int in the session handler keeps track of which session to poll. A poll is done to see if that session has any pending data; if it does, the thread processes it (in this case, just appends it to the file).


I noticed that increasing the number of worker threads makes very little performance difference.

There are a lot of mutexes in this to ensure it's thread safe, but I think that may be what is slowing me down. Maybe I need to somehow reduce the amount of shared memory I'm using.

Ex: Each time a client is accessed (to get or put data) it is locked then unlocked. Each time a session is accessed it is locked and unlocked.
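Simplified, every access currently looks something like this (illustrative names, and using pthread mutexes directly here rather than my wrapper class):

```cpp
#include <pthread.h>
#include <string>

// Illustrative stand-in for the real client class.
struct Client {
    pthread_mutex_t lock;
    std::string buffer;   // received data waiting to be processed
};

// Every get/put is wrapped in its own lock/unlock pair, even for one byte.
char GetByte(Client &c, size_t i)
{
    pthread_mutex_lock(&c.lock);
    char b = c.buffer[i];
    pthread_mutex_unlock(&c.lock);
    return b;
}
```

So with 200 clients sending as fast as they can, that's a lock/unlock round trip per byte.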


Now based on this info, can you think of a better way I should be handling this, or point me to proper resources?


This is rather huge so I don't expect someone to proof it line by line, but here is a link to the program and headers (a lot of them you probably won't need, but it was easier to just include the whole thing - servertest is the actual program folder):

http://www.iceteks.com/misc/servertest.zip


Also, I'm working on changing poll() to select(), so not sure if that will do anything performance-wise, but I realized poll only works in Linux and I'm trying to make this cross-platform.
 

QuixoticOne

Golden Member
Nov 4, 2005
1,855
0
0

5- poll then is called again and loops through each client to see if it has data if it does it grabs one byte and stores in it's buffer.

ONE BYTE?! I'd read all the data that was available and store that.

Code profiling is your friend. If your code is running slowly, compile it with profiling instrumentation and source-level debugging enabled, and look at where it is spending the majority of its time. Improve those areas.

Generally doing highly buffered I/O (i.e. large buffer sizes at once -- many kilobytes or more) is a huge benefit for overall performance, especially for disk I/O operations. If you have the RAM for megabyte level buffers that would not be unreasonable at all.
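For example, a sketch of draining a socket in large chunks instead of byte-at-a-time (names and the buffer size are illustrative; tune the size by profiling):

```cpp
#include <sys/socket.h>
#include <unistd.h>
#include <cerrno>
#include <string>

// Append everything currently readable on fd to out.
// Returns false once the peer has closed the connection (or on error).
bool DrainSocket(int fd, std::string &out)
{
    char buf[16384];                               // one syscall per ~16KB, not per byte
    for (;;) {
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n > 0) {
            out.append(buf, n);                    // stash in the client's buffer
        } else if (n == 0) {
            return false;                          // orderly shutdown by the peer
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            return true;                           // non-blocking socket: drained for now
        } else if (errno == EINTR) {
            continue;                              // interrupted; retry
        } else {
            return false;                          // real error
        }
    }
}
```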

Select can be very inefficient on some platforms especially when only a small set of descriptors need to be checked. It is often best to check them individually if you just have several to several dozen descriptors to look at. I'm sure they've improved that a lot over the years, and I'm sure it is still system dependent as to what is most efficient. But since you're working multi-platform, just profile it and see what works well for your needs.

Doing mmap() or something like that on the files might save you some buffer management code and help performance in some cases. Obviously in some cases it is definitely not the right approach, though.
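A minimal sketch of the mmap() idea for a fixed-size output file (illustrative; for an append-only log that grows constantly, ordinary buffered writes are usually simpler):

```cpp
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstring>

// Write len bytes to path through a memory mapping. Returns true on success.
bool MmapWrite(const char *path, const char *data, size_t len)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return false;
    if (ftruncate(fd, (off_t)len) != 0) { close(fd); return false; }  // size the file first
    void *p = mmap(NULL, len, PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { close(fd); return false; }
    memcpy(p, data, len);          // the kernel writes the pages back lazily
    munmap(p, len);
    close(fd);
    return true;
}
```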

 

Red Squirrel

No Lifer
May 24, 2003
70,620
13,818
126
www.anyf.ca
Is there a way I can do a select and know how many bytes are pending? That's the only reason I'm doing 1 byte, and I had a feeling that could be an issue. Also, how do I do code profiling, and what tools do I need? I've kinda heard of it but not more than that.
 

QuixoticOne

Golden Member
Nov 4, 2005
1,855
0
0
RECV(2) Linux Programmer's Manual RECV(2)

NAME
recv, recvfrom, recvmsg - receive a message from a socket

SYNOPSIS
#include <sys/types.h>
#include <sys/socket.h>

ssize_t recv(int s, void *buf, size_t len, int flags);

ssize_t recvfrom(int s, void *buf, size_t len, int flags,
struct sockaddr *from, socklen_t *fromlen);

ssize_t recvmsg(int s, struct msghdr *msg, int flags);

DESCRIPTION
The recvfrom() and recvmsg() calls are used to receive messages from a
socket, and may be used to receive data on a socket whether or not it
is connection-oriented.

...
All three routines return the length of the message on successful com-
pletion. If a message is too long to fit in the supplied buffer,
excess bytes may be discarded depending on the type of socket the mes-
sage is received from.

If no messages are available at the socket, the receive calls wait for
a message to arrive, unless the socket is non-blocking (see fcntl(2)),
in which case the value -1 is returned and the external variable errno
set to EAGAIN. The receive calls normally return any data available,
up to the requested amount, rather than waiting for receipt of the full
amount requested.

The select(2) or poll(2) call may be used to determine when more data
arrives.

The flags argument to a recv() call is formed by OR'ing one or more of
the following values:

...
MSG_DONTWAIT
Enables non-blocking operation; if the operation would block,
EAGAIN is returned (this can also be enabled using the O_NON-
BLOCK with the F_SETFL fcntl(2)).

...
 

QuixoticOne

Golden Member
Nov 4, 2005
1,855
0
0
GCC(1) GNU GCC(1)

NAME
gcc - GNU project C and C++ compiler
...
-glevel
Request debugging information and also use level to specify how
much information. The default level is 2.

...
-p Generate extra code to write profile information suitable for the
analysis program prof. You must use this option when compiling the
source files you want data about, and you must also use it when
linking.

-pg Generate extra code to write profile information suitable for the
analysis program gprof. You must use this option when compiling
the source files you want data about, and you must also use it when
linking.
...
GPROF(1) GNU GPROF(1)

NAME
gprof - display call graph profile data

DESCRIPTION
"gprof" produces an execution profile of C, Pascal, or Fortran77
programs. The effect of called routines is incorporated in the profile
of each caller. The profile data is taken from the call graph profile
file (gmon.out default) which is created by programs that are compiled
with the -pg option of "cc", "pc", and "f77". The -pg option also
links in versions of the library routines that are compiled for
profiling. "Gprof" reads the given object file (the default is
"a.out") and establishes the relation between its symbol table and the
call graph profile from gmon.out. If more than one profile file is
specified, the "gprof" output shows the sum of the profile information
in the given profile files.

"Gprof" calculates the amount of time spent in each routine. Next,
these times are propagated along the edges of the call graph. Cycles
are discovered, and calls into a cycle are made to share the time of
the cycle.

Several forms of output are available from the analysis.

The flat profile shows how much time your program spent in each
function, and how many times that function was called. If you simply
want to know which functions burn most of the cycles, it is stated
concisely here.

The call graph shows, for each function, which functions called it,
which other functions it called, and how many times. There is also an
estimate of how much time was spent in the subroutines of each
function. This can suggest places where you might try to eliminate
function calls that use a lot of time.

The annotated source listing is a copy of the program's source code,
labeled with the number of times each line of the program was executed.
...
 

Red Squirrel

No Lifer
May 24, 2003
70,620
13,818
126
www.anyf.ca
Great, MSG_DONTWAIT seems to be faster than using select. The only problem is that it does not compile under Windows; what is the numeric value for it so I can just do a #define?
 

QuixoticOne

Golden Member
Nov 4, 2005
1,855
0
0
/usr/include/linux/socket.h:#define MSG_DONTWAIT 0x40 /* Nonblocking io */
/usr/include/bits/fcntl.h:#define O_NONBLOCK 04000
/usr/include/bits/fcntl.h:# define FNONBLOCK O_NONBLOCK

I wouldn't be too optimistic about some of the options working as well under MS Windows as they do on UNIX; it isn't unexpected for Microsoft to leave POSIX/LINUX compatibility a bit broken to favor the use of native Windows APIs instead of portable ones.

Try fcntl() to set the non-blocking option and see if that is supported on your MS Windows platform.
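Something like this, sketched for POSIX (on Windows the usual equivalent is ioctlsocket() with FIONBIO rather than fcntl()):

```cpp
#include <fcntl.h>
#include <unistd.h>

// Put a descriptor into non-blocking mode. Returns true on success.
bool SetNonBlocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);             // read the current flags first
    if (flags == -1) return false;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) != -1;
}
```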

Another option is just to fork N threads, one per socket stream, and have each of them do a blocking socket read on its connected socket, waiting until the next message is received; each thread then handles its own buffering and output queueing locally.
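The body of one such connection thread might look roughly like this (illustrative names; error handling trimmed):

```cpp
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>

// One thread runs this per connected client: block in recv(), append to
// that client's own file, and exit when the peer disconnects. No shared
// buffers on the data path, so no mutexes are needed there.
void HandleConnection(int fd, FILE *out)
{
    char buf[16384];
    ssize_t n;
    while ((n = recv(fd, buf, sizeof(buf), 0)) > 0) {
        fwrite(buf, 1, (size_t)n, out);    // stdio buffers the disk writes
    }
    fclose(out);
    close(fd);
}
```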

You can do MS Windows specific OVERLAPPED I/O with things like ReadFileEx(), WaitForMultipleObjectsEx(), et al. to get access to various kinds of timeouts, asynchronous I/O capabilities, and so on.
http://msdn.microsoft.com/en-u...y/aa365468(VS.85).aspx
http://msdn.microsoft.com/en-u...y/ms687028(VS.85).aspx

Originally posted by: RedSquirrel
Great MSG_DONTWAIT seems to be faster then using select. But only problem is that does not compile under windows, what is the numeric value for that so I can just do a define?

 

Red Squirrel

No Lifer
May 24, 2003
70,620
13,818
126
www.anyf.ca
I'm trying to avoid relying on multithreading, at least for the base class functionality. As it is, I can technically make a single-threaded server app that handles multiple connections, though of course for bigger apps I will be using multithreading.

I managed to get the non-blocking to work in Windows with fcntl. Though I did notice my mutexes are not working as expected. In Linux my lock feature was blocking when I did not want it to (in some cases I want blocking, in some I don't), but it was blocking everywhere. I fixed that (have not tested yet), and in Windows my mutexes are not working at all. So once I get my simple mutex class working I'll retest in Linux and see what kind of performance increase I get.

Also, rather than getting data byte by byte in the same poll function that checks for new connections, I made it so the client class itself has an RXdata function which fills that client's buffer with any pending data. It gets called in the appropriate worker thread instead of the main thread. I have a mutex that ensures a single client is not processed by two threads at once for data I/O, but two separate clients can now receive data at the same time, as opposed to before. I already saw a small increase in performance before fixing the mutexes; I'll report back once I fix those.

My main concern is this working well in Linux; in Windows I want it to work reliably but I don't care as much about performance. I don't code that much for Windows anymore, I just want that option to be open.
 

Red Squirrel

No Lifer
May 24, 2003
70,620
13,818
126
www.anyf.ca
With the mutex fix, and making the client data RX happen at each slice interval for that client's processing, things really sped up. I'm still not transferring MBs per second though, so I'll run a code profiler to dig deeper.
 

Fallen Kell

Diamond Member
Oct 9, 1999
6,212
537
126
Originally posted by: RedSquirrel
All this server does is accept client connection and write any data to it's own file. I noticed with 200 concurrent connections sending data as fast as they can

Well, that there is your problem. Whatever system you are writing to has 200+ open file writes occurring. Your disk is probably thrashing with that many I/O interrupts occurring, and the OS is starting to get overwhelmed because of all the disk I/O threads.

Next time you have that many running, open up a terminal window and run "iostat -x" and see what your "await", "svctm", and "%util" values are for the disk that is storing all those files. I bet you will see some very high numbers for all those things.
 

Red Squirrel

No Lifer
May 24, 2003
70,620
13,818
126
www.anyf.ca
So would this be normal then? Say an FTP server had 200 connections: would it perform just as badly, or should I be able to withstand that? And what about a DoS attack (someone just blasting data at the port to try to overwhelm it)?

Also, iostat does not work on my system; is there an alternative in Fedora Core 5?
 

Red Squirrel

No Lifer
May 24, 2003
70,620
13,818
126
www.anyf.ca
I just ran a code profile and this is the result:

http://www.iceteks.com/misc/codeprofile.log

So does this mean the biggest time consumers are the functions at the top? Would my program be more efficient if I made my own linked list rather than using deques/vectors? The slowdowns seem to actually be due to the STL containers, which surprises me, unless I'm reading this wrong.


 

Crusty

Lifer
Sep 30, 2001
12,684
2
81
What are you using the deque for? Chances are there is a better data structure for you to use.
 

Red Squirrel

No Lifer
May 24, 2003
70,620
13,818
126
www.anyf.ca
Storing a list of bits for bit level manipulation. With the BitStream class I can do stuff like this:


BitStream bs;
bs.WriteInt32(1834);

That will write the binary value of 1834 using 32 bits.

I can later on go something like this:

bs.ReadInt32(0);


There are also push/pop functions for the start and end. Most of the time this class is used in networking, where I fill a buffer, usually with int8s, then pop them out and handle the packet data.

I DID notice that I was actually using insert and not push_front and push_back in my BitStream class, though. Where it applied, I changed that, so it should have helped a bit.

A big thing too is finding a reasonable buffer size to work with when sending/receiving data. Right now I'm using 1KB buffers.



I'm wondering if I should turn my BitStream class into a doubly linked list; would that be more efficient? It does get iterated through sometimes though. How slow is that with a linked list? Are there ways to speed it up, maybe?
 

QuixoticOne

Golden Member
Nov 4, 2005
1,855
0
0
1KB is extremely small for network or disk use... the MTU of 100Mbit Ethernet is 1500 bytes in a single packet, so usually you'd get an appreciable fraction of that at a time unless you're sending really small messages. Of course with TCP you get the whole stream eventually anyway, but my point is that if you're sending messages of many kilobytes, or frequent small ones, a 1KB buffer is going to be small.
 

QuixoticOne

Golden Member
Nov 4, 2005
1,855
0
0
10k sounds good... I'd make it several times longer than the amount of data in one packet / message / transaction, unless you have reason to believe that you'll process packets in real time and double-packet-size buffering is enough.

The deque shouldn't be much slower at 1x the size or 10x the size. Try 10k or whatever, run benchmarks / profiling, see if it helped or hurt, and go from there.

Also, do be sure you're compiling (perhaps when not debugging) at the -O2 or -O3 optimization level for maximum performance. Turn off debugging / assert checking etc. to see if your performance is radically dependent upon these kinds of settings.
 

Red Squirrel

No Lifer
May 24, 2003
70,620
13,818
126
www.anyf.ca
I'm just doing: g++ -o program program.cpp -pthread

Anything else I should be doing for optimization? Do I just add -O3 too? I'm up for anything that optimizes more than the default, even if compiles take longer; I can do non-optimized builds for testing, then at least for production I can do an optimized compile.
 

Crusty

Lifer
Sep 30, 2001
12,684
2
81
After looking over your BitStream class: why are you screwing with the endianness of the bits in the Write/ReadVarInt functions? I didn't look through the rest of your code (maybe the answer is there), but it seems like you are trying to over-engineer things.
 

Red Squirrel

No Lifer
May 24, 2003
70,620
13,818
126
www.anyf.ca
The reason is that the class is made for bit manipulation. It's more or less for networking.

Ex lets say a game packet looks like this:

8bit int: packetID
32bit int: Serial
16bit int: Item Hue
16bit int: item ID
1bit: flag for something1
1bit: flag for something2
1bit: flag for something3
1bit: flag for something4
1bit: flag for something5
1bit: flag for something6
1bit: flag for something7
1bit: flag for something8



So when I receive this packet, with the BitStream class I can access each element as is (normally popping it out as I go). That's pretty much what the class is for. In most cases if I need to store strings I use regular strings.

I fixed up my class a bit though, so that the push function actually calls the deque push function, as I was not taking advantage of that.

And it looks like -O3 won't work; it totally changes the way the program behaves. IP addresses are not even displayed the same or anything. It's messed up.
 

Crusty

Lifer
Sep 30, 2001
12,684
2
81
I understand that, but you still don't explain why you write the bits backwards and then read them backwards again. You are killing your efficiency by changing your bit order back and forth that way. For example, in your write functions you calculate 2^(bits-1) manually, and then in a second loop you start dividing that number by 2 every time. I understand you are generating a bit field for masking the input tw in order to get the bits from it, but why? What's the point of using a list of type bool to represent your data when you can just use a byte array or something similar to store your buffer and its binary data? I see no need to convert it to a list of bools.

A couple of other things: your code files are a mess. You have files named sessionhandler.h that have nothing to do with a SessionHandler class. .h files should declare your classes, and .cpp files should implement them. They should have the same name as the class they define, and you should work with only one class per file; otherwise it's a HUGE pain to find anything in your code. Another example: rslib_p_networking.h defines 3 different classes whose names have nothing to do with the filename.

Each one of those classes should be in its own files. Use subdirectories to help organize your code, something like headers/networking/SocketClient.h and headers/networking/SocketClient.cpp to implement your SocketClient class.

Finally, makefiles are your friend, especially as the code starts getting larger and more intricate. If you are planning on making this open source, or at least getting help from others on it, you need to help them out by providing makefiles to compile your code.
 

Red Squirrel

No Lifer
May 24, 2003
70,620
13,818
126
www.anyf.ca
I always figured .cpp was for the actual program itself and everything else is .h, so I used the naming convention rslib_i_name for implementors and rslib_p_name for prototypes.

And the reason I read it backwards is because the IP address is actually set up that way. When you start decoding it, it's actually backwards; not sure why though.

Keep in mind the header folder is a base set of classes I'll be using in ALL programs I write; I've been working on them and just writing small apps to test them with. Most of the dirty stuff is in those headers so I can avoid reinventing the wheel in other programs. The classes that are in the same file are for the same thing, e.g. rslib_i_network has everything that has to do with networking, and some files may have multiple classes. If I were to put every single class in a separate file it would be a real pain to find stuff, not to mention having 30 tabs open when working on a single thing.


Anyway, this is the code that acts weird with -O3.
Is there a better way to accomplish it? Basically what happens with -O3 is that it returns a long, random number. I can execute the function with the same parameters 30 times and get 30 different numbers.


unsigned long int BitStream::ReadVarInt(int pos, int bits)
{
    if (stream.size() == 0) return 0;
    if (pos < 0 || (size_t)pos >= stream.size()) return 0;

    // Clamp with > (not >=): pos+bits may legally equal stream.size().
    if ((size_t)pos + (size_t)bits > stream.size()) bits = (int)(stream.size() - pos);

    if (bits > MaxBits()) bits = MaxBits();
    if (bits < 1) return 0;

    unsigned long int tmp = 0;

    // Mask for the most significant of the requested bits: 2^(bits-1).
    unsigned long int orer = 1;
    for (int i = 1; i < bits; i++)
    {
        orer *= 2;
    }

    // Walk MSB-first, OR-ing each set bit into the result.
    for (int i = 0; i < bits; i++)
    {
        if (stream[pos + i]) tmp |= orer;
        orer /= 2;
    }

    return tmp;
}
 

Crusty

Lifer
Sep 30, 2001
12,684
2
81
If those header files are going to be compiled as a library for later use, then you should be doing exactly that already. Makefiles make this very easy.

I work on projects with hundreds if not thousands of source files and trust me, it's a LOT easier to find something when you can tell what a file contains just by looking at the filename. If I have to open a file to figure out what code is inside of it, that's one extra step needed to get where I'm going.
 

Red Squirrel

No Lifer
May 24, 2003
70,620
13,818
126
www.anyf.ca
What exactly is a makefile though? Maybe I can look into it once I put this into production. Basically, when it's deemed stable I will move it all into my compiler's header folder and I'll just have to include rslib.h in my main project. Those headers won't be visible during actual projects.

Anyway, with the -O2 optimization I seem to be doing way better. I have 600 concurrent connections blasting out data in 15KB chunks and the server is set up to handle 10KB chunks. It's backlogging a bit, considering I told the client to send more at once, but I have to be ready for anything when it comes to a server application that would normally be on the internet. I'm writing at a few MB per second as opposed to a few KB per second now, so things have really sped up, but it could probably use more optimization. At this point though it's hard to tell if the bottleneck is my program or just the server/HDD. The load is at 17, which is very high.
 

Crusty

Lifer
Sep 30, 2001
12,684
2
81
Get started reading on makefiles by reading the man page for make. It'll point you in the right direction.
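A minimal one for a project like yours might look something like this (file names are assumed, not taken from your zip):

```make
# "make" rebuilds only what changed; "make clean" starts fresh.
CXX      = g++
CXXFLAGS = -O2 -Wall -pthread

servertest: servertest.o
	$(CXX) $(CXXFLAGS) -o servertest servertest.o

servertest.o: servertest.cpp rslib.h
	$(CXX) $(CXXFLAGS) -c servertest.cpp

clean:
	rm -f servertest servertest.o
```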