Last week, I decided to buy some log analysis software for Voor Beginners, a collection of web sites about a wide variety of topics that I’ve been working on for the past eight years or so.

Previously, I had used Piwik, an open source solution that uses JavaScript to track visitors; however, for a number of reasons, I decided that I wanted to use software that does not use JavaScript but analyzes the “raw” log files instead. I ended up buying WebLog Expert, a modestly priced program (about $125) that seems to do a decent job, based on what I’ve seen so far.

However…

WebLog Expert offers a variety of ways to access the log files, including FTP and HTTP. I initially chose HTTP because that way the program wouldn’t have to submit a username and password; it would simply download the files from a specific directory on my server. Being somewhat paranoid, I decided to make that directory unavailable to any IP address but my own. At least, that’s what I thought I did…

The next day, I noticed that traffic was unusually low; but when I checked a few pages, everything seemed just fine (naturally, since I was checking from my own IP address). Then it hit me: instead of locking down just one directory, I had accidentally made the entire domain unavailable to any IP address but my own! 🙁

I quickly corrected my mistake and thought that was the end of it; but unfortunately, I was about to learn an expensive lesson about the way Google works!

Because of the IP restriction, my site had returned a “403 Forbidden” error to every “external” visitor, including Google’s spider. Now, when I check Google Webmaster Tools, I see that more than 400 pages (out of a total of about 800) show crawl errors, and these pages appear to have been dropped from the search results entirely! Pages that used to rank well have simply vanished, all because of one stupid mistake…
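(With hindsight, a quick script run from a machine outside my own IP would have shown exactly how widespread the blockage was. The sketch below is purely illustrative: the sitemap URL is a placeholder, it assumes the site publishes a plain XML sitemap, and any tool that reports HTTP status codes would do just as well.)

```python
import re
import urllib.request
from urllib.error import HTTPError, URLError

# Placeholder: point this at the site's real XML sitemap (or any list of URLs).
SITEMAP_URL = "https://www.example.com/sitemap.xml"


def status_of(url):
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status
    except HTTPError as err:
        return err.code          # 4xx/5xx responses raise HTTPError
    except URLError:
        return None              # DNS or connection problem


def main():
    with urllib.request.urlopen(SITEMAP_URL, timeout=10) as response:
        xml_text = response.read().decode("utf-8")
    urls = re.findall(r"<loc>(.*?)</loc>", xml_text)

    counts = {}
    for url in urls:
        code = status_of(url)
        counts[code] = counts.get(code, 0) + 1

    # Run from any IP address other than my own during the misconfiguration,
    # this would have printed something like {403: 800}.
    print(counts)


if __name__ == "__main__":
    main()
```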

I can think of no reason for the pages’ disappearance other than my slip-up. For example, a page about “lenen zonder bkr” (a type of loan) is gone, whereas the page about “hypotheek” (mortgage) can still be found; the entire subsite about “schaken” (chess) is missing, while “sudoku” still ranks nicely. In other words, “commercial intent” does not appear to be the deciding factor. The same goes for the other explanations I’ve considered (page size, number of backlinks, …): the only thing the “missing” pages seem to have in common is that they happened to be spidered during my “momentary lapse of reason”…

I fear that there is very little to be done about the situation, other than to wait and see whether the pages “return” over the next few weeks as Google spiders them anew. I have also sent a “reinclusion request” explaining the problem, but I’d be (happily) surprised if that “solved” things within a matter of days.

Lessons learned:

  1. Think carefully about any “clever tricks” you plan to use, especially about how they might backfire…
  2. If you create a “solution” for a special condition, be sure to test both the special condition (in my case, my own IP) and the “normal” condition (all other IP addresses)! A quick check like the sketch below would have caught my mistake.
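
To make that second lesson concrete: a check like the following, run once from the allowed machine and once from any other machine, would have exposed the problem within a minute. (The two URLs are placeholders, and plain curl would work just as well as Python; the point is simply to compare the HTTP status codes you get from both vantage points.)

```python
import urllib.request
from urllib.error import HTTPError, URLError

# Placeholders: a page inside the restricted directory and an ordinary
# public page on the same domain.
RESTRICTED_URL = "https://www.example.com/logs/access.log"
PUBLIC_URL = "https://www.example.com/"


def status_of(url):
    """Return the HTTP status code for url, or a short error description."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status
    except HTTPError as err:
        return err.code
    except URLError as err:
        return "connection failed: {}".format(err.reason)


if __name__ == "__main__":
    # Expected from the allowed IP:  restricted -> 200, public -> 200
    # Expected from any other IP:    restricted -> 403, public -> 200
    # If the public page also returns 403, the rule is scoped too broadly --
    # which is exactly the mistake I made.
    print("restricted:", status_of(RESTRICTED_URL))
    print("public:    ", status_of(PUBLIC_URL))
```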

(If and when the issue is resolved, and/or if I get a response from the reinclusion team, I will post about it here.)

UPDATE:

About two days after I wrote the above, I started to see pages from the domain in the Google results again, and another two or three days later everything appeared to be back to “normal”… 🙂

Today (June 1, 2010) I received a message from the Webmaster Tools team that sounded rather “standard”: they had checked the site and would consider reinclusion as long as they found no problems. (But as I said a moment ago, the problem already seemed to have resolved itself, “automatically,” after a few days.)

Lesson learned: be very careful when you start excluding visitors (or spiders!), whether through robots.txt or by IP address!