I have had a dead Celestica D2060 network switch lying in my garage for a while and finally decided to attempt to fix it. The D2060 is a Broadcom Trident2 BCM56854A2 powered device offering 48x10G SFP+ and 6x40G QSFP+ ports, with a control plane based on the Intel Atom Rangeley C2558, a platform notorious for dying due to a CPU bug.
The backstory
I bought a few of these Celestica switches a few years back as I wanted to gain familiarity with whitebox switching platforms and Cumulus Linux. Two of them got installed in a rack and had been working fine for a year or two, then one decided to stop working.
After a quick datacenter trip, I realized the switch was failing to boot properly: it couldn’t initialize the ASIC and was throwing cryptic EEPROM read errors. I replaced the faulty switch with a spare and went home.
This faulty unit had been sitting in my garage since then, collecting dust for a few years until today. I have finally decided to attempt to fix it.
Cumulus Linux
Before I go any further, let’s talk about Cumulus Linux and switching platforms in general.
Most network switches on the market, especially those from traditional vendors like Juniper or Cisco, come with their own vendor-supplied operating systems such as NX-OS, IOS or JunOS.
It is actually possible to buy so-called whitebox switches, which is just a fancy term for a commodity switching platform that does not come with an operating system from the factory. Instead, they are usually supplied with an ONIE environment, which can be used to install/uninstall/rescue whitebox operating systems such as NVIDIA Cumulus Linux or the open-source Microsoft SONiC.
The switch at hand has Cumulus Linux 4.3.0 installed on it, a distribution based on Debian 10 with some fancy services and tooling built on top of it. The main part of the OS is the switchd daemon, which ensures communication and synchronization between the kernel and the actual physical Broadcom chip.
The first test
First of all, I connected a USB-serial converter to the CONSOLE port of the switch to verify what was actually happening. My memory was a bit fuzzy after all those years and I didn’t precisely remember what the issue was. After starting a screen session on my Mac with a 115200 baud rate, I plugged a power cable into one of the power supplies and watched as the switch tried to boot up.
It wasn’t terribly happy to do so. The ismt_smbus module was timing out and several systemd services failed to start; ledmgrd.service, for example, was throwing errors, so switchd didn’t even attempt to start due to a failed dependency.
2024-07-24T12:12:38.691221+00:00 hostname platform-hw-init[405]: /usr/cumulus/bin/decode-syseeprom : ERROR : unknown target, should be one of
2024-07-24T12:12:38.743105+00:00 hostname platform-hw-init[405]: could not find eth0 MAC address in EEPROM ... failed!
2024-07-24T12:12:38.774318+00:00 hostname platform-hw-init[405]: Initializing QSFPs...
2024-07-24T12:12:38.812127+00:00 hostname platform-hw-init[405]: Initializing SFPs...
2024-07-24T12:12:40.356231+00:00 hostname systemd[1]: Failed to start Cumulus Linux switch port setup.
Not good! This meant the switch was completely failing to initialize. All port LEDs on the switch were lit, which just further confirmed the unfortunate.
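When a boot log runs to hundreds of lines, it can help to filter out just the entries that mention a failure. The following is a minimal Python sketch of that idea and was not part of the original troubleshooting. The function name `failing_entries` and the keyword list are assumptions made for illustration, and the sample text is a subset of the log lines quoted above.

```python
import re

# timestamp, host, program name, optional [pid], then the message text.
SYSLOG_RE = re.compile(r"^(\S+) (\S+) ([^\[:\s]+)(?:\[(\d+)\])?: (.*)$")

# Hypothetical keyword list; extend it to match whatever your logs contain.
FAILURE_RE = re.compile(r"error|fail", re.IGNORECASE)


def failing_entries(log_text):
    """Return (program, message) pairs for lines whose message mentions a failure."""
    hits = []
    for line in log_text.splitlines():
        match = SYSLOG_RE.match(line)
        if match and FAILURE_RE.search(match.group(5)):
            hits.append((match.group(3), match.group(5)))
    return hits


# A subset of the log lines quoted in this post.
SAMPLE_LOG = """\
2024-07-24T12:12:38.743105+00:00 hostname platform-hw-init[405]: could not find eth0 MAC address in EEPROM ... failed!
2024-07-24T12:12:38.774318+00:00 hostname platform-hw-init[405]: Initializing QSFPs...
2024-07-24T12:12:40.356231+00:00 hostname systemd[1]: Failed to start Cumulus Linux switch port setup.
"""

if __name__ == "__main__":
    for program, message in failing_entries(SAMPLE_LOG):
        print(f"{program}: {message}")
```

On the sample, this keeps the EEPROM and systemd lines and drops the harmless "Initializing QSFPs..." entry.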
I ran a few further checks and discovered that the i2cdetect -l command was reporting only two adapters:
i2c-1 smbus SMBus iSMT adapter at dffd0000 SMBus adapter
i2c-0 smbus SMBus I801 adapter at f000 SMBus adapter
That was certainly a part of the problem, or at least a symptom of it. There should’ve been dozens.
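The same check can be scripted by counting the adapters that `i2cdetect -l` lists. This is only a rough Python sketch, not a Cumulus Linux tool. The names `parse_i2c_adapters` and `looks_degraded`, and the threshold of 10, are assumptions chosen for illustration; the sample text is the two-adapter output from the broken switch.

```python
import re

# Each `i2cdetect -l` line starts with the bus name, then the type, then a description.
ADAPTER_RE = re.compile(r"^(i2c-\d+)\s+(\S+)\s+(.*)$")


def parse_i2c_adapters(text):
    """Return a list of (bus, type, description) tuples, one per adapter line."""
    adapters = []
    for line in text.splitlines():
        match = ADAPTER_RE.match(line.strip())
        if match:
            adapters.append(match.groups())
    return adapters


def looks_degraded(text, min_expected=10):
    """True if fewer adapters are listed than expected.

    min_expected is a hypothetical threshold; a healthy unit lists dozens.
    """
    return len(parse_i2c_adapters(text)) < min_expected


# Output captured from the broken switch.
BROKEN_OUTPUT = """\
i2c-1 smbus SMBus iSMT adapter at dffd0000 SMBus adapter
i2c-0 smbus SMBus I801 adapter at f000 SMBus adapter
"""

if __name__ == "__main__":
    count = len(parse_i2c_adapters(BROKEN_OUTPUT))
    print(count, "adapters listed; degraded:", looks_degraded(BROKEN_OUTPUT))
```

To run it against a live system, the text could come from `subprocess.run(["i2cdetect", "-l"], capture_output=True, text=True).stdout` instead of the hard-coded sample.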
Intel Atom Rangeley – and LPC clock woes
The Intel Atom C2000 (Rangeley) platform suffers from a widely reported flaw: physical degradation inside the SoC eventually causes the LPC clock to stop ticking. Both the media and plenty of unhappy users have reported on it. As Intel eventually put it in their platform notes:
AVR54. System May Experience Inability to Boot or May Cease Operation
Problem: The SoC LPC_CLKOUT0 and/or LPC_CLKOUT1 signals (Low Pin Count bus clock outputs) may stop functioning.
Luckily for us, there is a workaround in the form of adding a pull-up resistor in between the LPC_CLKOUT pin and a 3.3V power source, as documented on the EEVBlog forums. I couldn’t find any official references from Intel that would describe this workaround in detail, and the precise pull-up resistor value is not known. I have seen anything from 100 ohms to around 1.5k ohms recommended across various forums.
Finding the clock
It was about time to crack open the switch and try to find the magical clock. Equipped with an oscilloscope, I went on a hunt for a 25MHz signal. Unfortunately, I had very limited success on the broken switch.
It makes sense: once the clock has degraded to the point where the switch no longer works correctly, it is either fully dead, has a very low voltage level or suffers from other deformities. There are also many other clocks present in such a complicated system, and I wasn’t able to reliably identify the right one.
After a bit of Googling, I stumbled upon one ServeTheHome forum user who shared their experience with applying the workaround on a Celestica D4040 switch. Looking at the pictures, I realized they use the same – or at least a very similar layout for the Intel Atom carrier board:
The right pad of R562 carries the clock signal and the top pad of C410 is simply the 3.3V supply.
Clock necromancy
Lo and behold, I did indeed find the corresponding components on my D2060 and considered the next steps. I verified that C410 had 3.3V across it. Based on the proximity to the nearby chip, the other end being connected to ground and the thickness of the supplying trace, I was confident I was looking at a power rail (as opposed to a signal trace), with C410 being a decoupling capacitor across it. That meant it would be a decent point to tie the pull-up to.
The other side would obviously have to be connected to one of the pads of R562. I decided to mask off the area with a bit of Kapton tape in order to prevent any accidental shorts. While these pads look quite big in the picture, they are actually 0402 and quite close to the connector, definitely not the easiest combination to solder to.
With the connection points planned out, only one thing remained to resolve: the resistor. Which value should I use, where should I place it and how would I connect it?
To resist or not to, that is the question
I went on a hunt through my drawers and a tape with some 750 ohm 0603 SMD resistors fell into my hands. The value seemed to be within the range of what people had success with, and 0603 is a fairly manageable size.
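As a sanity check on the value, Ohm’s law gives the worst-case load a pull-up places on the 3.3V rail. The sketch below is illustrative only. It assumes the clock pin is driven all the way to ground, which gives an upper bound, and compares 750 ohms with the 100 ohm and 1.5k ohm extremes seen on the forums. The function name `pullup_load` is made up for this example.

```python
SUPPLY_VOLTS = 3.3  # the rail the pull-up is tied to


def pullup_load(resistance_ohms, supply_volts=SUPPLY_VOLTS):
    """Worst-case current (A) and dissipated power (W) through the pull-up.

    Assumes the clock pin sits at 0 V, so the full supply voltage is
    across the resistor. Real figures will be lower.
    """
    current = supply_volts / resistance_ohms        # I = V / R
    power = supply_volts ** 2 / resistance_ohms     # P = V^2 / R
    return current, power


if __name__ == "__main__":
    for ohms in (100, 750, 1500):
        current, power = pullup_load(ohms)
        print(f"{ohms:>5} ohm: {current * 1000:5.1f} mA, {power * 1000:6.1f} mW")
```

For 750 ohms this works out to roughly 4.4 mA and 14.5 mW at most, while 100 ohms would draw about 33 mA and dissipate around 109 mW. That is close to the 100 mW rating that is typical for 0603 parts, although that rating is an assumption here and the datasheet of the actual resistor should be checked.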
I pondered a bit more about how I should affix the resistor – maybe with glue, or tape? Then I realized I was staring at a bunch of perfectly usable pads!
With a bit of scraping, the solder mask came right off. After a bit of tinning, soldering and mod wire work, the end result was born.
Yes, I know, I should have cleaned the flux off the resistor. The truth is, I forgot and didn’t want to take the whole thing apart again. The resistor is a bit crooked; unfortunately, an 0603 part is only just large enough to bridge the two adjacent pads, and it didn’t want to center easily. The pads were way too big as well, and the SMD part simply wanted to flow and stick to one of them.
I used a 30AWG solid-core copper wrapping wire for the connections.
Testing
After putting the switch back together, I hooked up an oscilloscope to the R562 side of the pull-up resistor, powered on the switch and saw a rather ugly-looking signal.
Granted, my probing technique was far from great and I picked up the ground from really far away. Even so, I am pretty sure the signal is not supposed to look even remotely like this. Nonetheless, I did have a stable 25MHz clock of some sort, which was a pretty significant improvement.
To my great surprise, the switch booted up as expected. I actually saw I2C devices connected to the bus this time as well:
# i2cdetect -l
i2c-69 i2c i2c-41-mux (chan_id 19) I2C adapter
i2c-97 i2c i2c-43-mux (chan_id 47) I2C adapter
i2c-59 i2c i2c-40-mux (chan_id 9) I2C adapter
i2c-87 i2c i2c-43-mux (chan_id 37) I2C adapter
i2c-20 smbus i2c-9-mux (chan_id 0) SMBus adapter
i2c-77 i2c i2c-41-mux (chan_id 27) I2C adapter
i2c-10 smbus i2c-9-mux (chan_id 0) SMBus adapter
i2c-67 i2c i2c-40-mux (chan_id 17) I2C adapter
i2c-1 smbus SMBus iSMT adapter at dffd0000 SMBus adapter
i2c-95 i2c i2c-43-mux (chan_id 45) I2C adapter
i2c-57 i2c i2c-40-mux (chan_id 7) I2C adapter
i2c-103 i2c i2c-42-mux (chan_id 53) I2C adapter
...
All the sensors were reporting values (previously only reporting ABSENT) and the system looked fully initialized.
$ net show system
Hostname......... hostname
Build............ Cumulus Linux 4.3.0
Uptime........... 4:36:03.790000
Model............ Cel REDSTONE
CPU.............. x86_64 Intel Atom C2538 2.4 GHz
Memory........... 4GB
Disk............. 14.9GB
ASIC............. Broadcom Trident2 BCM56854
Ports............ 48 x 10G-SFP+ & 6 x 40G-QSFP+
Part Number...... xxxxxxx
Service Tag...... xxxxxxx
Serial Number.... xxxxxxx
Platform Name.... RANGELEY
Product Name..... Redstone-XP D2060
Vendor Name...... CELESTICA
ONIE Version..... 2014.08
Base MAC Address. xxxxxxx
Manufacturer..... CELESTICA

$ net show system sensors
Fan1 (Fan Tray 1 ): OK
Fan2 (Fan Tray 2 ): OK
Fan3 (Fan Tray 3 ): OK
Fan4 (Fan Tray 4 ): OK
Fan5 (Fan Tray 1 ): OK
Fan6 (Fan Tray 2 ): OK
Fan7 (Fan Tray 3 ): OK
Fan8 (Fan Tray 4 ): OK
PSU1 : BAD
PSU2 : OK
PSU1Fan1 (PSU1 Fan ): ABSENT
PSU1Temp1 (PSU1 Inlet Temp Sensor ): ABSENT
PSU1Temp2 (PSU1 Max Temp Sensor ): ABSENT
PSU2Fan1 (PSU2 Fan ): OK
PSU2Temp1 (PSU2 Inlet Temp Sensor ): OK
PSU2Temp2 (PSU2 Max Temp Sensor ): OK
Temp1 (Intel CPU external sensor ): OK
Temp2 (Rear Outlet Air sensor ): OK
Temp3 (Front Outlet Air sensor ): OK
Temp4 (Temp Sensor close to Networking ASIC ): OK
Temp5 (Intel CPU die sensor ): OK
Temp6 (Intel CPU die sensor ): OK
Temp7 (Intel CPU die sensor ): OK
Temp8 (Intel CPU die sensor ): OK
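With this many sensors, the ones that are not OK are easy to miss. Below is a small Python sketch that pulls them out of text shaped like the `net show system sensors` output above. It is not part of Cumulus Linux. The name `failing_sensors` and the regular expression are assumptions based purely on the lines shown in this post, and the sample is a shortened copy of that output.

```python
import re

# Sensor name, optional "(description )", then ": STATE".
SENSOR_RE = re.compile(r"^(\S+)\s*(?:\((.*?)\))?\s*:\s*(\w+)\s*$")


def failing_sensors(text):
    """Return (name, description, state) for every sensor whose state is not OK."""
    problems = []
    for line in text.splitlines():
        match = SENSOR_RE.match(line.strip())
        if match and match.group(3) != "OK":
            name, description, state = match.groups()
            problems.append((name, (description or "").strip(), state))
    return problems


# Shortened copy of the sensor listing from this post.
SAMPLE = """\
Fan1 (Fan Tray 1 ): OK
PSU1 : BAD
PSU2 : OK
PSU1Fan1 (PSU1 Fan ): ABSENT
PSU1Temp1 (PSU1 Inlet Temp Sensor ): ABSENT
Temp4 (Temp Sensor close to Networking ASIC ): OK
"""

if __name__ == "__main__":
    for name, description, state in failing_sensors(SAMPLE):
        print(f"{name}: {state} {description}".rstrip())
```

On the sample, only the PSU1 entries are returned, presumably because only one of the two power supplies was plugged in during the test.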
Success!
I ended up with a working switch that I can run lab experiments on again!
On a final note, I noticed the switch control plane refuses to boot up and output anything to the console without the ASIC daughterboard connected to it. This initially threw me off, as I wanted to test just the carrier board by itself. I am still not entirely certain why; my guess is that the POST process expects something to be present.