Our app's UDP discovery on AIX 6.1 is not working

Felecha · Nov 7, 2008

We have an application that runs as an agent, or service. So far it's out on Linux and Solaris, and we are now going to release it on HP-UX and IBM AIX. It's a Java app, so it's the same code on all platforms. Part of its functionality is that when it starts up it opens a UDP listening port on 4446. Then if clients want to discover agents in the local subnet, they send out a UDP broadcast, and any agents on the net respond so the client knows about them. Straightforward enough.

We have test machines for all platforms, one of them here is AIX 6.1. It was newly upgraded from 5.3 just a week ago so we could test on it. I suspect that's the issue.

The AIX 6.1 agent is not responding to discovery requests. All the other platforms are fine, and AIX 6.1 agents in other offices of our company are having no problem. At least 2 others I have consulted with are saying - no problem, our AIX 6.1 agents are discovered fine. I have logged in and seen it for myself on one of them.

I have tried pinging the port with udp and it DOES respond to that. I did

[*****@*****:~] ping -s -U -p 4444 *****
PING *****: 56 data bytes
36 bytes from ***** (10.24.112.23): udp_port=4444.
36 bytes from ***** (10.24.112.23): udp_port=4445.
36 bytes from ***** (10.24.112.23): udp_port=4447.
36 bytes from ***** (10.24.112.23): udp_port=4448.

So the ping "skips over" the 4446, I take that to mean it does not respond if the port is in use

The log for the agent shows

2008-11-07 12:59:07,015 [DEBUG] [UDPMulticastServerThread] com.services.discovery.UDPMulticastServer.run(?) - Not a valid request header: IJz
1234567
2008-11-07 12:59:07,016 [DEBUG] [UDPMulticastServerThread] com.services.discovery.UDPMulticastServer.run(?) - Ignore the request.

So there's the puzzler - the agent is there, it is listening on 4446, what could it be ON THIS BOX that could make a probably pretty ordinary udp message not able to do anything?

* ping udp packets are ok
* other platforms here and elsewhere are ok
* other AIX 6.1 agents are ok
* this box used to work just fine for udp discovery requests when its OS was 5.3
* now it's not working, and it was recently upgraded (wiped clean and fresh OS install)

The IT guy says he did all the patches after the install. He has scratched his head over it, too.

Stumped!

bsobel · Nov 7, 2008

Your answer is in your post. You're receiving the data but not processing it correctly, because you have not accounted for the fact that AIX is big endian. Hence the 'not a valid request header'.

Bill

Felecha · Nov 8, 2008

but "not a valid request header" is for a udp request as a ping from another machine. I was figuring the format of a ping is somehow different from the format of our clients' discovery requests? But even with that:

* There is no response at all when the DISCOVERY request comes from a client ON AIX, not just from linux or solaris or hp-ux.

* I have personally seen discovery working on another AIX 6.1 box in a distant office of our company ... ya don't always trust other engineers in distant offices when they say "Oh, it's working for me". Identical setup as far as I can see on that box and the one I am working with. And identical client requests there and here. I just suspect there is something NOT identical and we haven't found it.

* I get the same "not a valid request header" from agents on the other platforms

The pinging is coming from a solaris box. I got that idea from someone and quickly found that the syntax of his suggested ping command was not working from wherever I was logged in at the time. So I found he had sent from solaris and I then went to solaris and that syntax worked from there. I had to read the man pages a bit to figure out what it was doing. I started out at 4444 and pinged (pang?) that and then 4445 and then 4446, etc.

Nothinman · Nov 8, 2008

Did you run tcpdump or similar on the box while doing a discovery to see if the packets are making it to the box?

Most likely there's something different in the multicast config, you just have to find it.

Felecha · Nov 9, 2008

no, I didnt, I dont know that one. Part of the thing is that I know less than a lot of folks, just 3rd year at it, always going down to Larry's office to ask rookie questions.

But even when the discovery request comes FROM OUR CLIENT APP ON THE AIX BOX itself, still the agent does not indicate it even received the request. So that would kinda negate that reasoning.

But ya know, when I get back tomorrow I want to explore the possibility that the host name is involved. We have had an odd issue with the agent's "agentHostName" value. When the agent starts up it gathers up its environment values and populates its variables, and this box has been coming up with <hostname.company.com> where others just come up as <hostname>. I've been trying to find out where that comes from, cause a couple of other issues have come up around that. I THINK the other AIX box I mentioned, where discovery is working, just has <hostname> for its agentHostName. Now I'm wondering if the particular structure of the discovery packets is that when the request is issued, the client got the <hostname> from some DNS somewhere, and sent out requests to all the <hostname>'s and this machine is listening for <hostname.company.com> and never hears it. But then the ping requests ARE heard and rejected. Maybe the pings just go out to IP's?

That's kind of speculative. I'm taking a known thing - we have seen odd things due to the full domain name - and I'm trying to fit the evidence to work with that.

Felecha · Nov 9, 2008

any other suggestions for how to snoop on whats going on? tcpdump is new to me. The idea of pinging to a specific port helped, and I didnt know about that one. That settled the issue of whether the app even OPENED port 4446. I'm intrigued that the port is there, it's alive, requests can be sent from ON THE SAME BOX and still the port does not hear it

Nothinman · Nov 9, 2008

But even when the discovery request comes FROM OUR CLIENT APP ON THE AIX BOX itself, still the agent does not indicate it even received the request. So that would kinda negate that reasoning.

Which seems to point to the multicast stuff, not that I've ever used multicast.

Now I'm wondering if the particular structure of the discovery packets is that when the request is issued, the client got the <hostname> from some DNS somewhere, and sent out requests to all the <hostname>'s and this machine is listening for <hostname.company.com> and never hears it. But then the ping requests ARE heard and rejected. Maybe the pings just go out to IP's?

Everything is just sent to the IP, the application just uses the hostname to find that IP. Whether or not the hostname is embedded in your app's packets is another thing but shouldn't be a problem at this point. If hostname and hostname.company.com don't point to the same place you've got other issues that need resolving.

Felecha · Nov 9, 2008

Is multicast some kind of key to the puzzle here? I am not at work but I know that the agent log has two lines when a discovery request happens, at least for an agent that is responding:

<timestamp stuff> [UDPMulticastServerThread] Discovery request received from <ipaddress> at 4446
<timestamp stuff> [UDPMulticastServerThread] Responded to discovery request from <ipaddress> at 44520

Or something pretty close to that. I take it there is a UDPMulticastServerThread class in there tasked with this job.

I looked and there is a difference between broadcast and multicast, but since a lot of this is kinda new stuff I didnt get the whole thing too clearly.

It has been my understanding so far that the clients were broadcasting a request - "Anybody out there on the subnet with a UDP listener for these things, at 4446?" I figured it would be sent out to ***.***.***.1:4446 all the way through ***.***.***.255:4446. But the multicast is apparently not quite like that. I found this:

Multicast is a special protocol for use with IP. Multicast enables a single device to communicate with a specific set of hosts, not defined by any standard IP address and mask combination. This allows for communication that resembles a conference call. Anyone from anywhere can join the conference, and everyone at the conference hears what the speaker has to say. The speaker's message isn't broadcasted everywhere, but only to those in the conference call itself. A special set of addresses is used for multicast communication.

OK, but what would that list of selected hosts be? It's tempting to think that the client is first picking out the IPs on the subnet that have agents and then ONLY sending discovery requests out to them. In which case, maybe this one AIX machine is not being found by the FIRST step? If it's not in the multicast list, then of course it would not respond - it wouldn't get the request. By that logic, it gets the ping because the ping goes straight to its IP, but it never gets the multicast.

So again I wonder if the thing of host versus full domain name is the thing - maybe the client is only getting <hostname> from DNS, but somewhere along the line the full domain name is presented and passed over - as if it's looking for folks named John, and skipping over someone named John Smith.

Felecha · Nov 9, 2008

And by the way, I'm a little bit on my own here. I have had some very kindly help from a couple of developers, but higher up, this is not being treated as a high-level bug concern. Since everything works fine on every other AIX box in the company, the thing is "that box must have some kind of environment issue - not a code problem". My concern is that whatever are the special circumstances causing the problem on MY box, what if a customer has the same thing? BOINK! Calls to Tech Support. And I just keep wondering about that domain name thing. Probably few customers would have it, but .....

Felecha · Nov 9, 2008

Wait, what am I saying? It cant be that the client first goes out and figures out which hosts have agents, and then does the discovery from that list. That's what discovery IS .... going out and finding hosts with agents.

Doh ...

Nothinman · Nov 9, 2008

I figured it would be sent out to ***.***.***.1:4446 all the way through ***.***.***.255:4446. But the multicast is apparently not quite like that. I found this:

Even if it was broadcast the list of hosts might be different depending on how your subnet is setup, it won't always be .1-.255.

OK, but what would that list of selected hosts be?

Those that join that specific multicast group.

Felecha · Nov 9, 2008

OK, I guess I'm getting lost. I dont know enough yet on my own. I've been to a handful of websites like

http://www.firewall.cx/multicast-ip-list.php

that say they will explain the whole thing, but I get lost. Oh well.

I put out a question to the company email alias for developers of the agent, asking what it might be about the format of the discovery requests that could go wrong, and give the symptoms I see. Maybe I can add to it, that the difference might be specifically related to multicast.

Thanks

Felecha · Nov 10, 2008

I guess the game winning question would be - can I find out the members of the multicast group? Cause if this box is not on the list, then we've come a long way

Felecha · Nov 10, 2008

I found this on a website

Multicast traffic is sent to a single address but is processed by multiple hosts. Multicasting is similar to a newsletter subscription. As only subscribers receive the newsletter when it is published, only host computers that belong to the multicast group receive and process traffic sent to the group's reserved address. The set of hosts listening on a specific multicast address is called a multicast group. Other important aspects of multicasting include the following:
Group membership is dynamic, allowing hosts to join and leave the group at any time.

which sounds like the members arent DETERMINED by anyone, they JOIN the group.

So my agent's host is supposed to have joined the group at some time? Would there be something in the agent's startup where it sends out a message "Please have the host I am running on added to the multicast group for this area"? Can apps do that?

If so maybe it's the joining of the group that went bad?

Felecha · Nov 10, 2008

reading further, it sounds like the joining is done at the level of ....

I thought at first maybe the local router or something needed to be told - "join this guy up"

But now I see

Following with the previous analogy, you have to tune your radio to hear a program that is transmitted at some specific frequency, in the same way you have to "tune" your kernel to receive packets sent to an specific multicast group. When you do that, it's said that the host has joined that group in the interface you specified.

"Tune your kernel"? So something in the host itself gets set so it hears messages sent to the group?

Who then determines what the multicast address will be that our clients' discovery requests will be sent out to? Somewhere in our code it must be that the designers said "We're gonna use ***.***.***.*** and when our agents start they make the host join that group and our clients will send to that group address." Then if the agent is up it will respond and if not it won't.

If that's right, then our agent does the setting of the group address into some place in the host? And it would never unjoin, we dont want that.

Am I barking up the right tree?

degibson · Nov 10, 2008

Originally posted by: Felecha
Am I barking up the right tree?

Maybe. True IP Multicast is nonstandard, and there are probably lots of ways that an AIX box can be configured in unusual ways...

But, before you go barking up that particular tree, make sure it really is IP Multicast that is in use here -- there is often some confusion about what "Multicast" actually means. Some folks mean it as:
* True IP/Multicast (aka subscribers, member lists, etc.),
* Multiple unicast (e.g. blast packets to some range, say, x.x.x.a - x.x.x.b),
* Some even use it as a broadcast (i.e. x.x.x.255 or whatever your broadcast address is).

For my part, if I was writing a UDP discovery routine for a LAN, I would certainly rely on broadcast rather than true multicast.

It should be easy to find out whether multicast is in use. Look at the client-side discovery code. If it sends a discovery to only one address, then the method is probably broadcast. If it sends it to many addresses in a loop, the method is probably multiple unicast. Otherwise, you might be using true multicast.
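If all you have is the destination address itself (from a decompiled class or a packet capture), a rough Java sketch of the same triage might look like this. The class name DestinationKind and its labels are made up for illustration, and the ".255 means broadcast" guess assumes a /24 subnet, which is not guaranteed.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class DestinationKind {

    // Classifies an IPv4 literal such as "232.0.0.1".
    // Passing a literal means getByName() parses it without a DNS lookup.
    static String classify(String literal) {
        InetAddress addr;
        try {
            addr = InetAddress.getByName(literal);
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("not an IP literal: " + literal, e);
        }
        // True IP multicast: 224.0.0.0 through 239.255.255.255.
        if (addr.isMulticastAddress()) {
            return "multicast";
        }
        // Heuristic only: a last octet of 255 is the broadcast address
        // on a /24 subnet, but other subnet sizes differ.
        byte[] octets = addr.getAddress();
        if (octets.length == 4 && (octets[3] & 0xFF) == 255) {
            return "possibly broadcast";
        }
        return "unicast";
    }

    public static void main(String[] args) {
        for (String a : new String[] {"232.0.0.1", "10.24.112.255", "10.24.112.23"}) {
            System.out.println(a + " -> " + classify(a));
        }
    }
}
```

A "unicast" result for many different addresses in a loop would point to the multiple-unicast case.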

Felecha · Nov 10, 2008

I dont have direct access to the code, I am QA, and the dev office is elsewhere. I often get help at the level of a quick tip from the dev guys, but in this case it's not likely anyone will be assigned to dig into what I am seeing on my test machine in my office, as I said I am the only one reporting this error, a number of other test machines are fine, and it's not a real crucial thing (someone already said "Hey, if a customer has a UDP problem they can always use JINI. We support JINI discovery")

So at the tail end of a release cycle when people are sweating, this level of problem is not going to get that help, I wish I could but I am being realistic.

So I have put it out on our email alias for dev, basically asking "Can it be that the host here is not getting added correctly to whatever is the multicast group, so it's not even hearing the requests?" I hope that someone who writes that particular section might see it and get interested.

That's the down side of a big software company. Lots of power, but lots of powerlessness too

Felecha · Nov 10, 2008

At least a little further ahead

I did find some code, I opened some of the jar files and decompiled some very likely looking classes. Plain as a post, there is code on the agent side that opens a port at 4446 and also clearly joins a group

{
    inboundSocket = new MulticastSocket(4446);
    groupAddress = InetAddress.getByName("232.0.0.1");
    inboundSocket.joinGroup(groupAddress);
}
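For reference, here is a minimal standalone sketch built around those same three calls, which could be run next to the agent to see whether anything sent to the group reaches the box at all. Only the port 4446, the group 232.0.0.1 and the joinGroup call come from the decompiled code; the class name McastListenerSketch, the describe helper and the 1500-byte buffer are assumptions for illustration, not the agent's real code.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.MulticastSocket;
import java.nio.charset.StandardCharsets;

public class McastListenerSketch {
    static final String GROUP = "232.0.0.1"; // from the decompiled agent code
    static final int PORT = 4446;            // from the decompiled agent code

    // Formats one received datagram as a single printable line.
    static String describe(DatagramPacket p) {
        String text = new String(p.getData(), p.getOffset(), p.getLength(),
                StandardCharsets.ISO_8859_1);
        return p.getLength() + " bytes from " + p.getAddress().getHostAddress()
                + ":" + p.getPort() + " -> " + text;
    }

    // joinGroup(InetAddress) is the call seen in the agent; newer JDKs
    // mark it deprecated but still provide it.
    @SuppressWarnings("deprecation")
    public static void main(String[] args) throws IOException {
        InetAddress group = InetAddress.getByName(GROUP);
        try (MulticastSocket socket = new MulticastSocket(PORT)) {
            socket.joinGroup(group);
            byte[] buf = new byte[1500]; // assumed to be large enough
            while (true) {
                DatagramPacket packet = new DatagramPacket(buf, buf.length);
                socket.receive(packet); // blocks until a datagram arrives
                System.out.println(describe(packet));
            }
        }
    }
}
```

If this prints nothing while a client runs discovery, the request is probably not being delivered to the socket, which is a different problem from the agent rejecting it.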

And I learned that netstat -g will show some info. When I start the agent the list looks like this

lo0 ALL-SYSTEMS.MCAST.NET 1
bge0 232.0.0.1 2
bge0 ALL-SYSTEMS.MCAST.NET 1

And when I shut it down

lo0 ALL-SYSTEMS.MCAST.NET 1
bge0 ALL-SYSTEMS.MCAST.NET 1

I feel pretty clear that the agent is telling the host to join that socket to that group and it does in fact join and unjoin

But of course it gets maddening again. The above netstat output is from solaris

On the AIX box I was excited to see netstat -g say:

Virtual Interface Table is empty
Multicast Forwarding Cache is empty

Cool! So it looked like the whole damn thing was dead or something.

But ...

Then I went to the AIX box that works for discovery and it shows the same thing for netstat -g.

Damn!

Anyone know how I could see if the agent on the bad box really is or is not joining the group on itself? I see no other clues in man netstat

I'm making this up as I go ....

Felecha · Nov 10, 2008

finally found it -

netstat -a -I en0 gets the 232.0.0.1 to show.

And starting and stopping the agent makes it show and go just like I saw on solaris.

So this is very strange.

The port is plainly there, the port joins the 232.0.0.1 group address, but a request sent to 232.0.0.1 gets no response at all
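One way to take the real client out of the picture is a tiny probe that sends a single datagram to the group and waits briefly for any reply. This is only a sketch: the class name McastProbeSketch, the "PROBE" payload, the TTL of 1 and the 3 second timeout are all assumptions, and the payload is certainly not the real discovery request format, so at best the agent would be expected to log it as an invalid request rather than answer it.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.MulticastSocket;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;
import java.nio.charset.StandardCharsets;

public class McastProbeSketch {

    // Builds a datagram addressed to the group and port the agent joins.
    static DatagramPacket buildProbe(byte[] payload) {
        try {
            InetAddress group = InetAddress.getByName("232.0.0.1");
            return new DatagramPacket(payload, payload.length, group, 4446);
        } catch (UnknownHostException e) {
            throw new IllegalStateException(e); // not expected for an IP literal
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = "PROBE".getBytes(StandardCharsets.ISO_8859_1); // hypothetical
        try (MulticastSocket socket = new MulticastSocket()) {
            socket.setTimeToLive(1);   // keep the probe on the local subnet
            socket.setSoTimeout(3000); // give up after 3 seconds
            socket.send(buildProbe(payload));
            DatagramPacket reply = new DatagramPacket(new byte[1500], 1500);
            try {
                socket.receive(reply);
                System.out.println("reply from " + reply.getAddress().getHostAddress());
            } catch (SocketTimeoutException e) {
                System.out.println("no reply within 3 seconds");
            }
        }
    }
}
```

Watching the agent log while this runs would show whether the datagram gets as far as the agent's socket.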

Felecha · Nov 11, 2008

got it - someone pointed out that the machine that I keep saying was the same 6.1 version - well, it is 6.1.0.0 and the one I am using is 6.1.1.0.

So I was able to find another 6.1.1.0 box and reproduced the problem there.

So at least it's clear at last that there is an AIX versioning issue here. I can dial back and turn it over to development

Robor · Nov 11, 2008

Congrats on getting to the bottom of it, Felecha. Thanks for posting your troubleshooting too!
