This is Hacker Public Radio Episode 3758 for Wednesday, the 28th of December 2022. Today's show is entitled, First Sysadmin Job War Story. It is hosted by Norris, and is about 28 minutes long. It carries a clean flag. The summary is: How I got my first job as a sysadmin and a story about NFS.

Okay, so I thought I'd record a quick holiday episode for HPR, and I'll do kind of a combo story about how I got my first job in tech. I haven't always worked in tech; I'm currently a Linux admin. And then I'll combine that with a bit of a war story about my first week.

I have been a Linux user for a long time, since like 2000, but I didn't have a Linux job that far back. I was working for a place that had a contract with the government, and the contract was going to end. We didn't know the specific date it was going to end, but we knew the job we were there doing was only going to take about 10 years. So we all knew, when we took a job there, that at some point you were going to get laid off, and if you made it until the end, everyone was going to get laid off at the end.

Even though it kind of sucks to be in a job where you can't work there forever, it gives you sort of a unique opportunity to plan for changing careers. Since you can look ahead and know that on or about this year you'll have to do something different, it gives you time to prep for it. So since I'd been a Linux-on-the-desktop hobbyist for a long time, I thought, well, here's my chance to do what I can, and then maybe when I do get laid off, I can find a job as a Linux admin or something. So I started adding to the things I would normally do around the house with Linux. Instead of printers and playing music and configuring X11, I would do things like trying to set up web servers or file servers, or maybe even an LDAP server or stuff like that, and doing virtualization, and whatever I could that I thought a Linux admin might do.
The other thing I started working on was getting some certifications. I started with the Red Hat certifications. At the time it was Red Hat Certified Technician, and they've since changed it to Red Hat Certified System Administrator. I started with that; that's kind of their entry-level cert. And then a few years later, I got the Red Hat Certified Engineer cert.

Eventually I got laid off, just as expected, and I started kind of slowly looking for a tech job. One of the jobs I applied for, they got back to me pretty quick, like the next day. And it turns out the company that called me was a small web development shop. They were staffed to have three Linux admins, and earlier in the year two of them had left, not at the same time, and for different reasons. One had left, and they were kind of dragging their feet a little bit on replacing him, and then another one left, and they started getting serious about replacing them. And then there was a third guy who was kind of a junior admin, a mix of an admin and a developer. He was running the show by himself for a little while, and eventually he had decided to leave too.

So at this point they were desperate to get some new people in because, like I said, they were staffed for three people and they were just a few weeks away from having zero. They were able to hire a Linux admin from a temp IT agency, but he couldn't work there forever. And they had found another, kind of senior, admin, but he wasn't going to be able to start right away, because he had a job and some big projects he wanted to finish.
But they needed someone to start immediately, and since I was laid off, even though I could tell they weren't really sure I could do the job, the fact that I could start immediately really got their attention.

Like I said, it was a small web development shop. They had about 10 developers, a few project managers and designers, and a support desk, so people could call in with support stuff. Most of their applications were PHP applications that ran on Linux. And they were all over the place with Linux versions. Whoever was in charge at the time would deploy whatever Linux version happened to be their favorite at the time. So there was SUSE, there was Ubuntu, there was Debian, there was Red Hat, and others. It was a big, big mix of things. And there was also some Java and a little bit of Windows.

Like I said, they were desperate and I could start right away, so they started interviewing. I got to interview with the guy who was leaving, the last of the three who was still there. It was basically his last week, so I interviewed with him and some of the senior developers who knew a bit about Linux. They were really careful with me, though. I came in and did an interview with the person who was going to be my boss's boss and the developers, and then the guy who was going to be my boss but hadn't started yet wanted to meet me. So we met for a quick lunch interview, because he wanted to make sure we could at least get along and that the things I said made sense to him. Then they wanted to do something a little more technical, so they had someone set up a laptop with a Linux VM on it, and they wrote out a list of tasks for me to do.
So I mean, it was anything from simple stuff like adding users and making sure they could log in, and, for some reason, they wanted me to compile from source a specific version of Apache and PHP. They had all these kind of crazy things that they wanted me to do, and the list was long. They gave me a big long list and like two hours to do it. I didn't finish; the list was too long.

The other thing they wanted to do, after that technical interview, was have me meet with all the managers. So again, it was my boss's boss and his boss's boss, and we all just kind of sat down. They asked me a few technical questions in that, but I think it was mostly just everyone trying to figure out whether I was for real. You know, is it really possible that someone who's never worked in IT before can do the job?

So obviously, since I'm telling the story, they did hire me. My first week there, it was just me and the guy from the temp agency. The third guy, who had helped with the interviews and stuff, was gone; his last day was the Friday before my first day. There was minimal turnover: maybe a 20-page document, and then the two or three weeks that the temp had had before. That was really the extent of the training and turnover.

So, a little bit about the infrastructure there. All of their servers were in a data center that wasn't too far from the office, so we could go to the data center if we needed to. It was like three racks' worth of equipment. It was mostly virtualized. There were a few physical machines for heavy loads like databases, a lot of ESX hosts, since we virtualized on VMware, and some storage and stuff like that. The applications were mostly virtual machines. The PHP applications would all share a directory to get their PHP code from. When I say they get
their code from it, I don't mean they would copy whatever new code was available. I mean they would literally mount this NFS share at, like, /var/www or whatever. That way every application server had the exact same code all the time. And then there were a few other things we had on this NFS server. Config files for some of the load balancers were on there, application logs would be on there. It was just kind of a generic place to put things. Anything that needed to be available to more than one server was probably on this NFS server.

The NFS server was a virtual machine also, and you could tell it had grown over time. There are a few strategies for adding disk space to a virtual machine while it's running, and kind of the easiest one is to just add another disk. So this NFS server had like five disks attached to it, because every time they would add a new project or something for it to do and they didn't have enough space for it, they would just add another virtual disk to it.

Storage for the VMware cluster was kind of an oldish SAN. It was Sun branded, but this was after Oracle had bought Sun, so it was all Sun-branded stuff that was supported by Oracle. To maximize the available space, most of the SAN was RAID 5. They would take a group of disks, put them together in RAID 5, and then export those RAID 5 disk bundles to VMware, and that's where VMware would store the virtual disks for the machines, including all of the application servers and this NFS server.

Even before I started, in the last year or so before I started, there was a history of really poor performance with the PHP applications, and no one really understood why. All of the troubleshooting the previous admins had done just kind of led to dead ends. But one thing we would notice when the applications were running slow is that the load average on
the NFS server would climb. It wouldn't get high, like it wouldn't get into the hundreds or anything, but it would normally run at one or one and a half, and it would go up to like four. We could look at the load average on the NFS server and, based on that, tell how well or poorly the PHP applications were running.

One of our first indicators that things were going poorly was the office staff who processed payments that people would make. A lot of our applications would take payments, and the accounting personnel had a kind of homegrown tool that was also a PHP application and ran on the same infrastructure. They were usually the first to notice that things were going south. We tried to get them to say, "Can you guys check the load average on the NFS server?" but they would usually come in screaming about the load balancer instead. We did have a load balancer, but that was never actually the problem.

It was clear to us that everyone was frustrated with how things were performing, and frustrated with the fact that, despite all of us looking at it, no one could really figure it out. We tried a lot of different things, PHP settings and NFS settings, but nothing helped.

So we had this one application, and it was basically the company's flagship application, its biggest and most popular. If anyone asked, "What does this company do?", they would list things, and this would always be in the list of things that they had made. The application took payments, and for the system it was taking payments for there was an annual deadline, and the deadline was the same for everybody. You could pay any time during the year, but people being people, everyone would wait until the very last day to make the payment. So this particular application ran okay most of the time, but once a year
on the big day, things would get slow. Things would always get slow, and it was known that there was going to be some slowdown and some performance problems, and we'd all be geared up and ready.

This particular year, I'm about 10 days into the job when the big day arrives, and it's terrible. It's awful. I hadn't had much time there, but in my two weeks I had seen some poor performance, and this was absolutely, positively unusable. You would bring up the website, and if you could log in, as soon as you tried to do anything it would just stall and stall and stall. So it was pretty bad, and we were all kind of desperate to figure out a solution just to get us through the day.

So remember that the PHP application servers all had an NFS mount where they kept their code, so that they all got the same code. The developers were pretty insistent that that's how they wanted it, so they could ensure that every application server was running exactly the same code. Well, we talked our manager into letting us, for today only, build some application servers that were exactly the same, except that instead of mounting the NFS server, we just copied all the files over and let these applications run totally off local disk. In reality it's a virtual disk on that same SAN we mentioned before, but it's not touching the NFS server. That quick fix got us some pretty good results. We went from unusable to actually pretty good.

At the time we didn't understand why. In our heads we're thinking, okay, all it's doing is reading PHP, which isn't that big of a load on the NFS server, and yet separating the NFS server from the application fixed the problem. We didn't understand it. One of the things we thought might be an issue was the SAN performance,
but it was the same SAN. The applications reading their content directly from the SAN versus the applications reading their content from an NFS server that's on the SAN was a night-and-day difference.

So a few days after the big day, once we'd all had a minute and could collect our thoughts and calm down and breathe a little bit, we started trying to figure out, okay, what is it about this NFS server? Any time the NFS server is in the mix, performance tanks.

As we're digging in, we start trying to involve the developers a little bit, and one thing this application is doing that we didn't know about is logging. When I say logging, obviously we would look at the PHP logs and the Apache logs. Those are things we were always looking at, trying to figure out why it was slow, and they didn't really lead us anywhere. What we didn't know was that the application had another log that would log every SQL query the application ran. So if you just logged in and searched for your name, the query would be written to the log, and if you made a payment, that query would be written to the log. Every query was written to the log. And I want to say "the logs", but that's wrong: all of those queries went to the same log file. That sounds sort of okay. No, it's not. That's a really bad idea.

NFS doesn't allow multiple clients to write to the same file at the same time. So if a client says, "Hey, I need to write to this log file," the NFS server will lock the file, let the client write to it, and then unlock the file. Because we had multiple application servers trying to write to the exact same file, the NFS server was slowing down the applications so it could queue up the writes. So the reason we saw such big performance gains when we moved off the NFS server is that the application didn't have to wait anymore before it could write to
the query log. Now, writing every query to a log is a bad idea for a lot of reasons, so eventually, once we had heard about this, we were able to talk the developers out of logging this information. But it was a clear win for us, because we were finally able to figure out what it was about this NFS server that made these applications so bad. And this particular application wasn't the only one that was writing to a common log file, but like I said, it was the biggest one, it was the one that caused the most problems, and it was the one that got the most attention.

After that, we were still kind of interested in why the NFS performance was so bad and why it had gotten worse, because the application had been writing to this common log file for years. There was some growth in the application, but not enough growth to explain the performance drop year over year. So even though we had fixed the problem, we knew there had to be something else underlying, because the problem was getting worse and worse and worse.

We had some pretty decent monitoring. Remember I said the load average on the NFS server would go up when performance was bad? You could see it in our monitoring. We could look at graphs of load average and see big spikes on busy days and drop-offs on weekends and stuff like that. And we could zoom the graphs all the way out to like a year, and then we could see big days and small days. When you zoom out to a year, you can see a month at a time, and the lines would be pretty steady from month to month to month, and then you would see kind of a drop, and then steady month to month to month, and you might see a stair-step rise, month to month to
month. A lot of times we would look at those and try to investigate: okay, what happened on this day that caused this stair step? One thing we really noticed was when we finally got rid of that crappy old Sun-slash-Oracle SAN and upgraded to something considerably better. Then you could definitely see the load average on that NFS server change. When I said it used to average one and maybe go up to four, now it was down in the point twos and point threes and might go up to point eight. So that was a huge difference in the application, just from changing the SAN.

But there was another place on the annual graph where we could see a drop in load average, a pretty significant one, maybe about 30 percent, and we couldn't figure out why. A lot of times we could go back and see these stair steps and say, oh, that was the day we changed this application, or that was the day we got a new SAN. But there was one particular day, and it happened to be a few months after this big-day event where everything went south, where we saw a pretty steady, week-over-week, month-over-month 30 percent drop in load average, and we couldn't figure out why.

So I ended up working at this shop for about five years, always still kind of the new guy. If you work in IT, or if you're a sysadmin, you're always kind of the afterthought. No one really thinks about IT unless something's broken. So I was on a team that no one ever thought about, and I was the junior guy on the team that no one ever thought about. I say that to tell you I had to change offices a lot. It was a cube farm kind of place, where there were cubes and desks and offices, and it was always nice to be able to move out of a cube into an office, but
someone else would show up, you know, who wanted my office, so I'd have to move out, because I was the last person that was ever really considered whenever they were thinking about who was going to work in what office. I had to move offices a lot.

One time I was getting ready to move offices again, and I was cleaning out a file cabinet. The folder I was looking through was all kind of random receipts and hardware things and stuff like that. I picked up a receipt and was looking at it, trying to figure out what it was, and it was a receipt for returning a disk to Sun, or to Oracle. I was trying to figure out why we did that, and then I remembered. The guy who was my boss when I first started, the guy who started on the same day as me and didn't really have any good turnover, and who was supposed to be my senior, had done an RMA. One day he was in the data center and he saw on this Sun storage system that one of the disks had a yellow light instead of a green light. So he reported it to Sun, they sent him a replacement disk, and he sent the old bad disk back to Sun.

I was staring at the piece of paperwork that documented that change, and I thought to myself, I wonder if this has anything to do with that unexpected load average drop, the unexpected performance boost on that NFS server. I looked at the date, and it was within a few days of that drop. So finally I was able to piece together that some of that NFS server's disks were on the portion of the storage system that was built using RAID 5, and the disk he replaced was part of that array. So the reason the NFS performance had gotten worse year over year was that at some point during the year, without anyone noticing, a drive failed that was part of a RAID 5 array. If you don't know anything about RAID and RAID 5, what you do need to know is that RAID 5 is fine, but if you lose a single disk out
of a RAID 5 array, all of your data will still be there, but the performance will be terrible. It no longer has an extra disk to write the parity information to. So because this RAID 5 array was running with a bad disk, the performance was terrible, and when he swapped the disk out, that's when we could see the performance increase on the NFS server. We didn't notice at the time, but that's when it happened.

So, kind of a long, rambling story. I don't know if you can learn any lessons from that, except maybe: if you want to change careers, one key to doing that is to plan ahead if you can, but the real key is you have to find someone who's desperate enough to hire someone with no experience. Always, always be careful when you're logging or writing to a network share. And never, ever, ever run RAID 5 in production, period. I'll see you guys next time.

You have been listening to Hacker Public Radio at HackerPublicRadio.org. Today's show was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is. Hosting for HPR has been kindly provided by AnHonestHost.com, the Internet Archive, and rsync.net. Unless otherwise stated, today's show is released under a Creative Commons Attribution 4.0 International license.
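
A note on the RAID 5 point from the story, as a rough sketch rather than a description of how the Sun array in the episode actually worked. RAID 5 keeps an XOR parity block for each stripe, so when one disk is missing, a read of a block that lived on that disk has to read every surviving block in the stripe and XOR them back together. The function names, the four-disk layout, and the fixed parity position below are all assumptions made for illustration; real RAID 5 rotates parity across the disks.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Build one stripe: the data blocks plus one parity block.

    Hypothetical simplified layout: parity is always the last block.
    """
    return list(data_blocks) + [xor_blocks(data_blocks)]


def read_block(stripe, index, failed=None):
    """Read block `index` and return (data, number_of_disk_reads).

    On a healthy array this costs one read. If the block lived on the
    failed disk, every surviving block in the stripe is read and XORed
    to rebuild it.
    """
    if index != failed:
        return stripe[index], 1
    survivors = [block for i, block in enumerate(stripe) if i != failed]
    return xor_blocks(survivors), len(survivors)


# Three data disks plus one parity disk (illustrative values only).
stripe = make_stripe([b"AAAA", b"BBBB", b"CCCC"])

healthy_data, healthy_reads = read_block(stripe, 1)
degraded_data, degraded_reads = read_block(stripe, 1, failed=1)

print(healthy_data, healthy_reads)
print(degraded_data, degraded_reads)
```

The data comes back intact either way, which matches "all of your data will still be there", but in this toy layout the degraded read touches three disks instead of one. That extra work on every access to the missing disk is one plausible reason a degraded array feels slow.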