I have had a dead Celestica D2060 network switch lying in my garage for a while and decided to finally attempt to fix it. The D2060 is a Broadcom Trident2 BCM56854A2 powered device offering 48x10G SFP+ and 6x40G QSFP+ ports, with an Intel Atom Rangeley C2538 based control plane, a platform notorious for dying due to a CPU bug.

Overview of the opened D2060 switch

Insides of the D2060 switch

The backstory

I bought a few of these Celestica switches a few years back as I wanted to gain familiarity with whitebox switching platforms and Cumulus Linux. Two of them got installed in a rack and worked fine for a year or two, then one of them stopped working.

After a quick datacenter trip, I realized the switch was failing to boot properly: it couldn’t initialize the ASIC and was throwing cryptic EEPROM read errors. I replaced the faulty switch with a spare and went home.

This faulty unit had been sitting in my garage since then, collecting dust for a few years until today. I have finally decided to attempt to fix it.

Cumulus Linux

Before I go any further, let’s talk about Cumulus Linux and switching platforms in general.

Most network switches on the market, especially those from traditional vendors like Juniper or Cisco, come with their own vendor-supplied operating systems such as NX-OS, IOS or JunOS.

It is actually possible to buy so-called whitebox switches, which is just a fancy term for a commodity switching platform that does not come with an operating system from the factory. Instead, they are usually supplied with an ONIE environment, which can be used to install/uninstall/rescue whitebox operating systems such as NVIDIA Cumulus Linux or the open-source Microsoft SONiC.

The switch at hand has Cumulus Linux 4.3.0 installed on it, a distribution based on Debian 10 with some fancy services and tooling built on top of it. The main part of the OS is the switchd daemon, which ensures communication and synchronization between the kernel and the actual physical Broadcom chip.

The first test

Switch failing to boot

First of all, I connected a USB-serial converter to the CONSOLE port of the switch to verify what was actually happening. My memory at that point was a bit fuzzy after all those years and I didn’t precisely remember what the issue was. After starting a screen session on my Mac with a 115200 baud rate, I plugged a power cable into one of the power supplies and watched as the switch tried to boot up.

It wasn’t terribly happy to do so: the ismt_smbus module was timing out and several systemd services failed to start. Among them was ledmgrd.service, whose errors meant switchd did not even attempt to start because of the failed dependency.

2024-07-24T12:12:38.691221+00:00 hostname platform-hw-init[405]: /usr/cumulus/bin/decode-syseeprom : ERROR : unknown target, should be one of
2024-07-24T12:12:38.743105+00:00 hostname platform-hw-init[405]: could not find eth0 MAC address in EEPROM ... failed!
2024-07-24T12:12:38.774318+00:00 hostname platform-hw-init[405]: Initializing QSFPs...
2024-07-24T12:12:38.812127+00:00 hostname platform-hw-init[405]: Initializing SFPs...
2024-07-24T12:12:40.356231+00:00 hostname systemd[1]: Failed to start Cumulus Linux switch port setup.

Not good! This meant the switch was completely failing to initialize. All port LEDs on the switch were lit, which only confirmed the bad news.
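When the boot log runs longer than the excerpt above, it can help to filter it down to the lines that look like failures. The following is only a rough Python sketch: the line layout (timestamp, host, process[pid], message) is assumed from the excerpt, the keyword list is a guess, and `problem_lines` and `SAMPLE` are names made up for illustration.

```python
import re

# Assumed layout, based on the excerpt: "<timestamp> <host> <process>[<pid>]: <message>"
LINE = re.compile(
    r"^(?P<ts>\S+) (?P<host>\S+) (?P<proc>[^\[\s:]+)(?:\[(?P<pid>\d+)\])?: (?P<msg>.*)$"
)
# Guessed keywords that mark a line as a problem; adjust to taste.
PROBLEM = re.compile(r"error|failed", re.IGNORECASE)


def problem_lines(log_text):
    """Return (process, message) pairs for log lines that look like failures."""
    hits = []
    for line in log_text.splitlines():
        match = LINE.match(line)
        if match and PROBLEM.search(match["msg"]):
            hits.append((match["proc"], match["msg"]))
    return hits


# The log excerpt quoted in the post.
SAMPLE = """\
2024-07-24T12:12:38.691221+00:00 hostname platform-hw-init[405]: /usr/cumulus/bin/decode-syseeprom : ERROR : unknown target, should be one of
2024-07-24T12:12:38.743105+00:00 hostname platform-hw-init[405]: could not find eth0 MAC address in EEPROM ... failed!
2024-07-24T12:12:38.774318+00:00 hostname platform-hw-init[405]: Initializing QSFPs...
2024-07-24T12:12:38.812127+00:00 hostname platform-hw-init[405]: Initializing SFPs...
2024-07-24T12:12:40.356231+00:00 hostname systemd[1]: Failed to start Cumulus Linux switch port setup.
"""

for proc, msg in problem_lines(SAMPLE):
    print(f"{proc}: {msg}")
```

Run against the excerpt, this keeps the decode-syseeprom error, the missing MAC address and the failed systemd unit, and drops the two "Initializing" lines.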

I ran a few further checks and discovered that the i2cdetect -l command was reporting only two adapters:

i2c-1 smbus SMBus iSMT adapter at dffd0000 SMBus adapter
i2c-0 smbus SMBus I801 adapter at f000 SMBus adapter

That was certainly a part of the problem, or at least a symptom of it: there should’ve been dozens.
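A check like this is easy to script. The sketch below is a minimal Python example that parses `i2cdetect -l`-style text and counts how many adapters sit behind a mux. It assumes whitespace-separated columns as they appear in this post, and it treats "-mux" in the description as the marker for a mux channel, which is a heuristic taken from the adapter names on this platform; `parse_i2c_adapters`, `mux_channels` and `BROKEN` are illustrative names.

```python
def parse_i2c_adapters(listing):
    """Turn `i2cdetect -l`-style text into a list of adapter dicts."""
    adapters = []
    for line in listing.splitlines():
        parts = line.split(None, 2)  # bus name, adapter type, free-form rest
        if len(parts) < 3 or not parts[0].startswith("i2c-"):
            continue  # skip prompts, blank lines and anything unexpected
        bus, kind, description = parts
        adapters.append({"bus": bus, "kind": kind, "description": description})
    return adapters


def mux_channels(adapters):
    """Adapters that sit behind an I2C mux, judged by their description."""
    return [a for a in adapters if "-mux" in a["description"]]


# The listing from the broken switch, as quoted in the post.
BROKEN = """\
i2c-1 smbus SMBus iSMT adapter at dffd0000 SMBus adapter
i2c-0 smbus SMBus I801 adapter at f000 SMBus adapter
"""

adapters = parse_i2c_adapters(BROKEN)
print(len(adapters), "adapters,", len(mux_channels(adapters)), "behind a mux")
```

On the broken switch only the two root SMBus controllers show up and nothing behind a mux, which fits the platform init never getting as far as the devices that hang off them.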

Intel Atom Rangeley – and LPC clock woes

The Intel Rangeley C2000 platform suffers from a well-known defect: physical degradation eventually causes the LPC clock to stop ticking, as reported by the media and by various unhappy users. As Intel eventually put it in their platform notes:

AVR54. System May Experience Inability to Boot or May Cease Operation Problem: The SoC LPC_CLKOUT0 and/or LPC_CLKOUT1 signals (Low Pin Count bus clock outputs) may stop functioning.

Luckily for us, there is a workaround in the form of adding a pull-up resistor between the LPC_CLKOUT pin and a 3.3V power source, as documented on the EEVBlog forums. I couldn’t find any official references from Intel that describe this workaround in detail, and the precise pull-up resistor value is not known. I have seen anything from 100 ohms to around 1.5k ohms recommended across various forums.

Finding the clock

Overview of the control-plane carrier board with the Atom CPU under the heatsink


It was about time to crack open the switch and try to find the magical clock. Equipped with an oscilloscope, I went on a hunt for a 25MHz signal. Unfortunately, I had very limited success on the broken switch.

It makes sense: once the clock has degraded to the point where the switch no longer works correctly, it is either fully dead, has a very low voltage level or suffers from other deformities. There are also many other clocks present in such a complicated system, and I wasn’t able to reliably identify the right one.


Front side of the ASIC daughterboard


Back side of the ASIC daughterboard

After a bit of Googling, I stumbled upon a ServeTheHome forum user who shared their experience with applying the workaround on a Celestica D4040 switch. Looking at the pictures, I realized it uses the same, or at least a very similar, layout for the Intel Atom carrier board:

The right pad of R562 is the clock signal and the top pad of the C410 is just a 3.3V supply voltage.

Clock necromancy

Lo and behold, I did indeed find the appropriate pads on my D2060 and considered the next steps. I measured C410 and confirmed it had 3.3V across it. Based on the proximity to the nearby chip, its other end being connected to ground and the thickness of the supplying trace, I was confident I was looking at a power rail (as opposed to a signal trace), with C410 being a decoupling capacitor across it. That meant it would be a decent spot to tie the pull-up to.

The other side would obviously have to be connected to one of the pads of R562. I decided to mask off the area with a bit of Kapton tape in order to prevent any accidental shorts. While these pads look quite big in the picture, they are actually 0402 and quite close to the connector, definitely not the easiest combination to solder to.

I masked off nearby pads with Kapton tape to not accidentally short them


With the connection points planned out, only one thing remained to resolve: the resistor. Which value should I use, where should I place it and how would I connect it?

To resist or not to, that is the question

I went on a hunt through my drawers and a tape with some 750 ohm 0603 SMD resistors fell into my hands. The value seemed to be within the range of what people had success with, and 0603 is a fairly manageable size.
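As a rough sanity check on that range, Ohm's law gives the worst-case load the pull-up puts on the clock pin. The sketch below assumes an ideal 3.3V rail and a pin that pulls all the way down to 0V, which a real output will not quite do; `pullup_load` and `SUPPLY_VOLTS` are just illustrative names.

```python
SUPPLY_VOLTS = 3.3  # rail the pull-up is tied to, per the post


def pullup_load(resistance_ohms, supply_volts=SUPPLY_VOLTS):
    """Worst-case current (A) and power (W) in the pull-up while the pin sits at 0 V."""
    current = supply_volts / resistance_ohms  # I = V / R
    power = supply_volts * current            # P = V * I
    return current, power


# Low end, chosen value and high end of the range quoted on the forums.
for ohms in (100, 750, 1500):
    current, power = pullup_load(ohms)
    print(f"{ohms:>5} ohm: {current * 1e3:5.1f} mA, {power * 1e3:6.1f} mW")
```

For the 750 ohm part this works out to about 4.4 mA and roughly 15 mW while the pin is low, compared with 33 mA for a 100 ohm resistor. Since a clock spends only part of each cycle low, the average figures would be lower still.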

I pondered a bit more about how I should affix the resistor – maybe with glue, or tape? Then I realized I was staring at a bunch of perfectly usable pads!

Solder mask scraped off two thieving pads on the PCB


With a bit of scraping, the solder mask came right off. After a bit of tinning, soldering and mod wire work, the end result was born.

Finished workaround, 750 ohm pull-up resistor connected between C410 and R562 pads


Yes, I know, I should have cleaned the flux off the resistor. The truth is, I forgot and didn’t want to take the whole thing apart again. The resistor is a bit crooked; unfortunately, an 0603 part only barely spans the two adjacent pads and it didn’t really want to center with ease. The pads were way too big as well, and the SMD part simply wanted to flow and stick to one of them.

I used 30AWG solid-core copper wire-wrap wire for the connections.

Testing

After putting the switch back together, I hooked up an oscilloscope to the R562 side of the pull-up resistor, powered on the switch and saw a rather ugly-looking signal.

Granted, my probing technique was far from great and I picked up the ground from really far away. I am pretty sure the signal is not supposed to look even remotely like this in any case, though. Nonetheless, I did have a stable 25MHz clock of some sort, a pretty significant improvement.

Clock signal with the workaround applied

To my great surprise, the switch booted up as expected. I actually saw I2C devices connected to the bus this time as well:

# i2cdetect -l
i2c-69 i2c i2c-41-mux (chan_id 19) I2C adapter
i2c-97 i2c i2c-43-mux (chan_id 47) I2C adapter
i2c-59 i2c i2c-40-mux (chan_id 9) I2C adapter
i2c-87 i2c i2c-43-mux (chan_id 37) I2C adapter
i2c-20 smbus i2c-9-mux (chan_id 0) SMBus adapter
i2c-77 i2c i2c-41-mux (chan_id 27) I2C adapter
i2c-10 smbus i2c-9-mux (chan_id 0) SMBus adapter
i2c-67 i2c i2c-40-mux (chan_id 17) I2C adapter
i2c-1 smbus SMBus iSMT adapter at dffd0000 SMBus adapter
i2c-95 i2c i2c-43-mux (chan_id 45) I2C adapter
i2c-57 i2c i2c-40-mux (chan_id 7) I2C adapter
i2c-103 i2c i2c-42-mux (chan_id 53) I2C adapter
...

All the sensors were reporting values (previously only reporting ABSENT) and the system looked fully initialized.

$ net show system
Hostname......... hostname
Build............ Cumulus Linux 4.3.0
Uptime........... 4:36:03.790000

Model............ Cel REDSTONE
CPU.............. x86_64 Intel Atom C2538 2.4 GHz
Memory........... 4GB
Disk............. 14.9GB
ASIC............. Broadcom Trident2 BCM56854
Ports............ 48 x 10G-SFP+ & 6 x 40G-QSFP+
Part Number...... xxxxxxx
Service Tag...... xxxxxxx
Serial Number.... xxxxxxx
Platform Name.... RANGELEY
Product Name..... Redstone-XP D2060
Vendor Name...... CELESTICA
ONIE Version..... 2014.08
Base MAC Address. xxxxxxx
Manufacturer..... CELESTICA

$ net show system sensors
Fan1 (Fan Tray 1 ): OK
Fan2 (Fan Tray 2 ): OK
Fan3 (Fan Tray 3 ): OK
Fan4 (Fan Tray 4 ): OK
Fan5 (Fan Tray 1 ): OK
Fan6 (Fan Tray 2 ): OK
Fan7 (Fan Tray 3 ): OK
Fan8 (Fan Tray 4 ): OK
PSU1 : BAD
PSU2 : OK
PSU1Fan1 (PSU1 Fan ): ABSENT
PSU1Temp1 (PSU1 Inlet Temp Sensor ): ABSENT
PSU1Temp2 (PSU1 Max Temp Sensor ): ABSENT
PSU2Fan1 (PSU2 Fan ): OK
PSU2Temp1 (PSU2 Inlet Temp Sensor ): OK
PSU2Temp2 (PSU2 Max Temp Sensor ): OK
Temp1 (Intel CPU external sensor ): OK
Temp2 (Rear Outlet Air sensor ): OK
Temp3 (Front Outlet Air sensor ): OK
Temp4 (Temp Sensor close to Networking ASIC ): OK
Temp5 (Intel CPU die sensor ): OK
Temp6 (Intel CPU die sensor ): OK
Temp7 (Intel CPU die sensor ): OK
Temp8 (Intel CPU die sensor ): OK
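With this many sensors, it is easier to let a script pick out the ones that are not OK. The sketch below is a minimal Python example. It assumes the "Name (description): STATE" layout shown in this post, and `parse_sensors`, `not_ok` and `SAMPLE` are illustrative names rather than anything Cumulus provides.

```python
def parse_sensors(report):
    """Map sensor name -> state from `net show system sensors`-style text."""
    states = {}
    for line in report.splitlines():
        left, sep, state = line.rpartition(":")  # state follows the last colon
        if not sep or not left.strip():
            continue  # no colon or nothing before it: not a sensor line
        name = left.split()[0]  # first token, e.g. "PSU1Fan1"
        states[name] = state.strip()
    return states


def not_ok(states):
    """Only the sensors whose state is something other than OK."""
    return {name: state for name, state in states.items() if state != "OK"}


# A few lines copied from the output in the post.
SAMPLE = """\
Fan1 (Fan Tray 1 ): OK
PSU1 : BAD
PSU2 : OK
PSU1Fan1 (PSU1 Fan ): ABSENT
Temp1 (Intel CPU external sensor ): OK
"""

for name, state in not_ok(parse_sensors(SAMPLE)).items():
    print(f"{name}: {state}")
```

In the output above, the only entries that are not OK all belong to PSU1, presumably because just one of the two power supplies was plugged in during the test.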

Success!

I ended up with a working switch that I can run lab experiments on again!

On a final note, I noticed the switch control plane refuses to boot up and output anything to the console without the ASIC daughterboard connected to it. This initially threw me off, as I wanted to test just the carrier board by itself. I am still not entirely certain why; I am guessing the POST process expects something to be present.