I’m really keen on network monitoring. I have this thing about knowing how things are performing, if everything’s up to spec, and if there are any impending problems that are going to cause way more grief than anyone needs.
If you can picture this, we have four twenty-two inch monitors, set up two by two, as a visual indicator of network status for the entire department. This never fails to impress everyone who sees it. I get the general impression that if the monitors failed to work, people would still be impressed by them. Hint: drop a few bucks on four monitors and a mounting system. Put it up high or on the wall. Have it display something impressive. It’s worth whatever it costs. It’s like the machine that goes BING.
There is a ton of software that will help you monitor your network. I like very little of it. I have tried everything from open-source to five-figure monitoring software. At the moment, the best, most comprehensive package I’ve found is called The Dude. It’s free. Yes, free. It runs under Windows or Linux. It can monitor via plain old ping or do all sorts of SNMP magic. It can notify you via all sorts of paths.
It is the alerting bit that I’m going to address. I used to VPN into work fairly frequently to monitor things. As things were made more reliable, the need for watching lessened. As I stopped watching closely, things started to fail every now and then, just out of spite. After a brief huddle, I figured out how to make The Dude send a text message to my phone when something went down. The message states very plainly what stopped responding and when. Very useful.
The next day at work we had to reboot a server a lot for some installs. DING – “Server RINGO not responding to ping.” DING – “Service ftp on RINGO down.” DING – “Service http on RINGO down.” Yay – it worked!!!
Then there were the three DINGs telling me they had all come back up. “Yay,” I thought, with somewhat less enthusiasm. This could potentially get slightly annoying. Especially as RINGO needed to be rebooted several times before the updates were complete.
It got to the point, over the next few days, where we’d all hear DING and I’d say, “It’s ok, I’ll get it.” Ten times in a row. My coworkers were amused, especially as they didn’t have to carry around my phone. I stopped the chuckling immediately when I told them that some of them were going to have to be on alert after I was nice enough to ‘tune’ the alerting system first. I suspect some of the server reboots after that incident were not as needed as they claimed.
Last Wednesday I pulled up in front of my house and got a few DINGs. I rushed into the house with a phone that couldn’t go DING fast enough to get the next DING out. I attempted to figure out what was allegedly down but every time I got a message up, a new DING preceded a new message.
I tried phoning people still at work. Nope – the DINGing continued. I could barely get a number dialed or hear a voicemail prompt. Have you ever noticed that outgoing voicemail messages are even longer when you’re agitated and really need to get in touch with someone? I did. As did my wife and houseguest, who had to suffer through the constant DINGing as well as my constant cursing about the phone ‘helping’ me by interrupting whatever I was trying to do on it to tell me there was a text message waiting for me. And then cursing because out of three people on staff at that hour, I got three voicemails.
By the time the phone stopped DINGing long enough for me to feel it was safe, the total was ONE HUNDRED NINETY-TWO messages. Only one hundred ninety-two DINGs.
I finally reached someone who told me everything was fine. It appeared that someone may have ‘helped’ us out by unplugging a switch or two, causing all connectivity to appear down. Or someone on the floor made one of those nice little loops, where they plug a hub or switch into itself, causing Network Havoc™.
I figured it was my fault anyway. For some odd reason I have been waking up at horrible hours of the morning and going to work early as a result. The first time, I figured I’d leave an hour early and then forgot I was supposed to (always something to do anyway). The next day, the Wednesday in question, I left an hour early because I came in over an hour early and I had company. So if I had put in a 9-10 hour day, I would have been there when everything went wonky. Mind you, there’s quite a competent crew there so I don’t technically have to feel that bad when I’m not in the office, but it’s my parents’ fault for giving me this nasty work ethic.
At this point I started investigating dependencies: if a switch or firewall goes down, it alerts me, but anything sitting under it that drops off as a result doesn’t. This is a great concept but it fails a bit in the execution. I need to know when any single piece goes down and comes back up.
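For the curious, here is roughly what I have in mind, as a minimal Python sketch. To be clear, this is not how The Dude does it and it doesn’t touch The Dude at all. The class name, the device names SWITCH1 and PAUL, and the parent map are all made up for illustration (RINGO is real enough). The idea is that every state change still gets written to a quiet log, but only the top-most thing that broke gets to make my phone go DING. It assumes the parent map is a plain tree with no loops, and that the upstream device gets reported down before the things behind it.

```python
from typing import Dict, List, Optional, Set


class DependencyAlerter:
    """Hypothetical sketch: log every change, text only the root cause."""

    def __init__(self, parents: Dict[str, Optional[str]]) -> None:
        self.parents = parents        # device -> upstream device (None = top)
        self.down: Set[str] = set()   # devices currently down
        self.muted: Set[str] = set()  # devices whose outage was not texted
        self.texts: List[str] = []    # messages that would go DING
        self.log: List[str] = []      # silent record of every state change

    def _upstream_down(self, name: str) -> bool:
        # Walk up the (assumed loop-free) tree looking for a dead ancestor.
        parent = self.parents.get(name)
        while parent is not None:
            if parent in self.down:
                return True
            parent = self.parents.get(parent)
        return False

    def report(self, name: str, is_up: bool) -> None:
        if is_up == (name not in self.down):
            return  # no state change, so nothing to record or send
        message = f"{name} is {'up' if is_up else 'down'}"
        self.log.append(message)
        if is_up:
            self.down.discard(name)
            if name in self.muted:
                # Its outage was never texted, so keep the recovery quiet too.
                self.muted.discard(name)
                return
        else:
            self.down.add(name)
            if self._upstream_down(name):
                self.muted.add(name)
                return
        self.texts.append(message)


# Hypothetical topology: two servers hanging off one switch.
alerter = DependencyAlerter({"SWITCH1": None, "RINGO": "SWITCH1", "PAUL": "SWITCH1"})
for device in ("SWITCH1", "RINGO", "PAUL"):
    alerter.report(device, is_up=False)
for device in ("SWITCH1", "RINGO", "PAUL"):
    alerter.report(device, is_up=True)

print(len(alerter.texts), "texts,", len(alerter.log), "log entries")
```

In that little run the switch going down and coming back up are the only two texts, while all six state changes land in the log. The catch is the ordering assumption: if RINGO gets polled before the switch does, RINGO still DINGs, which is a big part of why the execution is harder than the concept.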
Until an epiphany sets in, I’m going to have to endure a lot of DINGing. I figure I’ll get a lot more help on this project once I start adding coworkers to the DING list. In the midst of virtualization, I’m hoping that less hardware will mean fewer alerts.