While I think my immediate supervisor's and my work on our Postfix implementation has been fine, our managers' handling of it has been idiosyncratic, to say the least. We initially prepared it as a proper project: full project management plan, communications strategy, risk assessment, and so on. All of that was disregarded, and the thing has been handled by management as an "upgrade". Since we have:
- Changed the hardware (new HP servers)
- Changed the operating system (Linux, from VMS)
- Changed the MTA software (Postfix, from PMDF)
- Changed the method used to find valid email addresses (a batch LDAP query that harvests all the mail aliases in the domain and populates a lookup list hourly, rather than an ugly kludge that required a custom LDAP attribute to be populated at Windows account creation time, found only the primary email address (no aliases), and had to be checked each and every time a message was delivered)
- Changed the server presence in the network (in a DMZ and directly internet-facing, rather than on the internal network with the firewall accepting mail first)
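For the curious, the new lookup-list generation is roughly this shape. This is a sketch only: in production the LDIF would come from an ldapsearch against the directory, and the attribute name (`proxyAddresses`), the sample record, and the file paths here are assumptions, not our actual schema.

```shell
#!/bin/sh
# Sketch of the hourly alias harvest. A sample LDIF record stands in for
# the real directory query so the transformation can be shown standalone.
cat <<'EOF' > /tmp/aliases.ldif
dn: CN=Jane Doe,OU=Staff,DC=example,DC=com
proxyAddresses: smtp:jane.doe@example.com
proxyAddresses: smtp:jdoe@example.com
EOF

# Turn each smtp: alias into a Postfix lookup-table entry ("address OK").
awk -F': smtp:' '/^proxyAddresses: smtp:/ { printf "%s OK\n", $2 }' \
    /tmp/aliases.ldif
# A real run would write this to something like /etc/postfix/relay_recipients,
# compile it with postmap, and point relay_recipient_maps at the result.
```

The point of the design is that the expensive directory walk happens once an hour in batch, and per-message recipient validation is just a local hash lookup.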
Adding to the joy was the fact I suggested - more than once - that we inform all the customers within the organisation of the migration process happening at present, rather than just the few operational groups we knew would have special issues. That was shot down in flames, because it's apparently a great idea not to find out about potential issues as they are happening, but to leave the users in ignorance and present them with a fait accompli - assuming that all was ok. And because this is just an upgrade, that is apparently just a fine way to manage the process.
So, this week, we made our new gateway the primary mail server for our domain. I noticed in the logs a few instances where certain headers were malformed by sending servers, leading to the rejection of messages that may well have been legitimate. But it wasn't until today, when I loosened one rule, that I got a call logged by one of the more important groups in the organisation (who send out briefing emails to external customers), saying, "Gosh, was there a network outage? We suddenly got a deluge of messages that had been held up somewhere." Obviously, those messages started being accepted once I loosened the restriction.
That rule was rejecting thousands of messages a day. When I did my initial testing, I calculated that less than 0.1% of these rejections came from what appeared to be real senders. Unfortunately, that 0.1% also included one external organisation that sent regular messages to that v. important internal group. Now, if the group had been aware that they needed to watch for delayed messages, I probably would have found out on Monday, and I could have crafted an exception for that sender (assuming they couldn't do a fix). There are probably a few other senders in the same boat, hence the relaxation of the restriction.
It means that rather than allow the (appropriately notified) users to be canaries in the mineshaft for those few instances (and normally an hour or so's delay in receiving email is fine while I fix it), I need to allow 60% more spam to flow through, and somehow review tens of thousands of lines of log files to find the "good" senders and make exceptions for them (as well as contacting them to see if they will fix their problem), before reinstating the rule that was doing an extremely nice job of rejecting spam from zombied machines. GRAR.
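At least part of that log review can be automated. Something like the following is the rough shape of it; the log line below is a fabricated sample in stock Postfix smtpd reject format, and real logs and paths will obviously differ:

```shell
#!/bin/sh
# Pull the HELO names behind "Helo command rejected" entries out of a
# Postfix log and rank them by frequency, so the handful of "real" senders
# worth whitelisting stand out from the zombie noise. A fabricated sample
# line stands in for /var/log/maillog so the pipeline runs standalone.
printf '%s\n' \
'May  5 10:01:02 gw postfix/smtpd[1234]: NOQUEUE: reject: RCPT from mail.example.org[192.0.2.10]: 450 4.7.1 <EXCHANGE01>: Helo command rejected: Host not found; from=<a@example.org> to=<b@example.net> proto=ESMTP helo=<EXCHANGE01>' \
| awk '/Helo command rejected/ {
    for (i = 1; i <= NF; i++)
        if ($i ~ /^helo=</) {          # isolate the helo=<...> field
            h = $i
            sub(/^helo=</, "", h); sub(/>$/, "", h)
            print h
        }
  }' | sort | uniq -c | sort -rn
```

Cross-referencing the most frequent names against the client hostnames on the same lines is then a much smaller manual job than eyeballing tens of thousands of raw lines.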
So, if you're a mail admin, make sure your goddamn server sends an FQDN in its HELO (it's in the RFC!), and also make sure that hostname is resolvable via a DNS lookup! I mean, how do you expect your mail to be delivered if we can't check that it's coming from a real host? Give me a break. No wonder the spammers are winning, especially when people who should know better are making that kind of elementary mistake. The poor buggers are probably all using Sendmail.
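For reference, on a recent Postfix the checks in question look something like this in main.cf. This is a sketch, not our production config, and the exceptions file name is hypothetical:

```
# Require HELO/EHLO, and reject hosts that announce a non-FQDN or
# unresolvable name -- with an escape hatch, checked first, for
# known-broken but legitimate senders.
smtpd_helo_required = yes
smtpd_helo_restrictions =
    permit_mynetworks,
    check_helo_access hash:/etc/postfix/helo_exceptions,
    reject_non_fqdn_helo_hostname,
    reject_unknown_helo_hostname
```

An entry in helo_exceptions matching a sender's broken HELO string (e.g. `EXCHANGE01 OK`, compiled with postmap) would let that one host through before the reject rules fire, which is exactly the kind of exception I'd have crafted had the users been warned to report delays.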