At one of my previous jobs, I worked at a midsize e-commerce company where I was responsible for maintaining a small .NET batch processing application. Every evening, this .NET application would process the orders that had come in during that day. The next morning, our users would run reports off the processed data and send the reports wherever they needed to go: other departments, outside vendors, and so on.
Over the years, the previous developers had done an excellent job of securing this application. It was integrated with the company’s Active Directory installation, and each user had access to only the areas of the application that their role within the company required.
Each user was also tied to a very specific role in the database, so they could only access specific information at the database level. Any changes to data were logged in a central location, and we could easily tell which user made which change to which data at a very fine-grained level.
For running the batch processing in the evenings, we created a user, Mr_Roboto, and added him to Active Directory and to the database. Through an automated process, Mr_Roboto was responsible for running the batch jobs at night, and all the logging for those batch jobs was done under his name. We gave Mr_Roboto as little access as needed, and “he” did his job as expected with few problems for years.
Oh, that security hole
In the same company there was a very bad security practice that needed to be remedied. On numerous servers, the same user name and password were used for the System Admin account. If anyone compromised that account on one server, then that person would have administrative access to pretty much any server within the domain. It was a bad setup and needed to be fixed by the system admins, but the issue persisted for reasons unknown to the development team.
Thus, when one of the admins pulled me aside on a Friday and told me that for security reasons they were going to be restricting users on the servers, I couldn’t have been happier. The system admin said they were going to rename accounts and change passwords so that the same user would not have identical access to different servers across the company.
He then asked if this would be an issue for our .NET application. I told him it wouldn’t because we didn’t use the system admin accounts for our application. Change away!
Later that day, I was informed that the changes had been put into effect. I went home for the weekend happy, thinking our company’s servers would be much safer.
The fix that wasn’t
When I arrived at work on Monday, I found that disaster had struck. My inbox had messages from Mr_Roboto showing the evening batch jobs he’d attempted had failed. There were open tickets from multiple users, reporting their batch jobs had not run. I had users calling to ask what was happening. I checked, and all of the batch processes had failed.
My team went into crisis mode immediately and tried to figure out what had happened. Looking over the errors, it became apparent that Mr_Roboto no longer had access to the servers he was supposed to be running on. In fact, Mr_Roboto was no longer in Active Directory.
Horrified, we called up the system admin to find out what was going on. The response: “Yes, I changed Mr_Roboto to Mr_RobotoA and Mr_RobotoB and only gave them access to one of your two processing servers, respectively.”
Our displeasure with this situation was immediately and loudly communicated to the system admin. After a few minutes he agreed to change his “security upgrades” back to the way they were before. It wouldn’t fix the security log-in problem, but at that point we had a larger issue on our hands: A whole weekend’s worth of batch processing still needed our attention.
As a last resort, my boss had the developers use their machines to run the batch processes. Thankfully, by the end of the day we had cleared up the backlog, and our users ran all their reports and sent them to the correct parties.
It’ll be better next time – right?
Our team conducted a postmortem on the situation and came to the following conclusions:
First, as easy as it was to blame the system admin, I should have requested more details before allowing the change. I had wanted to hear, and so took away from the conversation, that the system admin accounts were being changed, but that was not what the system admin was saying. Also, any future changes needed to come with an email listing what was changing and why. We resolved not to stand in the way of any positive changes in the future, but we wanted a clear explanation of what was being changed before anything was done.
Second, we also realized that the system admins didn’t know enough about what we were doing. This was handled by a two-hour meeting with the admin team in which we brought donuts and explained how our application worked and described the incident as “the weekend in which Mr_Roboto got fired.”
Thankfully, positive changes came out of the debacle, such as better communication between the development and system administrator teams. But one huge problem remained: The system administrator accounts were still the same for all the servers. For some reason, the sys admins never fixed this security problem — and hadn’t by the time I left the company. I’m told the failure stands to this day.