Connectivity issues between Cohesity C5016 Nodes and Nexus 93180YC-FX3H Switches

Very recently, I was deploying a new Cohesity C5016 appliance with 25Gb NICs, connecting it to a pair of Nexus 93180YC-FX3H switches. When using the 9Ks in a vPC pair, my personal preference is to configure the Cohesity nodes with LACP to get the most bandwidth possible (regardless of whether it’s 10Gb or 25Gb connectivity). Nothing super creative there, and I’ve done this dozens of times in the past with no issue, on both the Cohesity appliances and the Nexus 9Ks. But this time, it was different…

A little background…

This deployment included a pair of Nexus 93180YC-FX3H switches (I love this model, and if you didn’t know what the H stands for, it stands for Half. Go figure). The FX3 is a 48-port 1/10/25GbE switch with 6 40/100GbE uplink ports and a fantastic set of features. But in some cases, we just don’t need all 48 ports on each switch, so to help with the overall cost we use the FX3H model, which is the same switch with only 24 of the ports licensed. In my specific scenario, we were performing a refresh for a past customer, moving them from a 3-tier environment to a pair of new 9Ks, Nutanix nodes for hyperconverged storage and compute, and Cohesity for the backup solution. A perfect trifecta if I say so myself…

When bringing up the C5016 nodes with 25Gb NICs connected redundantly to the Nexus 93180YC-FX3H switches running NX-OS 10.3(4a), I wasn’t getting any connectivity, even though I had visible light on the fiber. So I started the usual troubleshooting: I checked the SFP modules (and swapped them out), checked the fiber (and swapped that out as well), and rebooted both the nodes and the switches. Nothing. But go figure, the Nutanix nodes using the same 25Gb modules and the same switches (even the same ports) worked just fine.

Forward Error Correction

Forward Error Correction (FEC) is a method used in data communication and storage systems to detect and correct a limited number of errors in data without the need for retransmission. It works by adding redundancy to the original data so that errors can be detected and corrected on the receiver’s end. This is particularly useful in high-speed networking environments where retransmissions can be costly in terms of time and bandwidth.
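To make the “adding redundancy” idea concrete, here is a toy illustration in Python using a Hamming(7,4) code, which adds 3 parity bits to every 4 data bits and lets the receiver locate and flip back any single corrupted bit. This is not the code 25Gb Ethernet actually uses (RS-FEC is a Reed-Solomon code that works on multi-bit symbols and can fix several of them per block); it is only a minimal sketch of the same principle, where the receiver repairs the error itself instead of asking for a retransmission.

```python
# Toy illustration only: Hamming(7,4), NOT the RS-FEC used on 25Gb links.
# Bit positions 1..7 are laid out as p1 p2 d1 p3 d2 d3 d4.

def encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def decode(c):
    """Correct up to one flipped bit; return (data_bits, error_position).

    error_position is 0 when the parity checks all pass.
    """
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]  # re-check positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]  # re-check positions 2, 3, 6, 7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]  # re-check positions 4, 5, 6, 7
    pos = s1 + 2 * s2 + 4 * s3      # syndrome = 1-based position of the bad bit
    if pos:
        c[pos - 1] ^= 1             # fix it in place, no retransmission needed
    return [c[2], c[4], c[5], c[6]], pos


if __name__ == "__main__":
    data = [1, 0, 1, 1]
    received = encode(data)
    received[4] ^= 1  # simulate one bit flipped in transit (position 5)
    recovered, pos = decode(received)
    print(recovered, pos)  # [1, 0, 1, 1] 5
```

The cost is visible too: 7 bits on the wire for every 4 bits of data, which is the same bandwidth-for-reliability trade FEC makes on a real link, just far less efficiently.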

Impact of FEC on 25Gb Connectivity on Nexus Switches:

  • Error Correction: FEC enables error detection and correction, which is crucial for maintaining data integrity, especially at higher speeds like 25Gbps. This helps in reducing the bit error rate (BER) and ensures more reliable data transmission.
  • Compatibility: Different FEC modes (such as RS-FEC, a Reed-Solomon code, and FC-FEC, the Fire Code also known as BASE-R FEC) may be required depending on the transceivers and cables used. Ensuring the correct FEC mode is enabled is vital for establishing a stable link.
  • Latency: While FEC improves data integrity, it can introduce a slight increase in latency due to the time required for error correction processes. However, this latency is typically minimal and outweighed by the benefits of reduced errors.
  • Interoperability: For 25Gb links, both ends of the connection (transmitter and receiver) must support and be configured for the same FEC mode. Mismatched FEC configurations can result in link failures or degraded performance.
  • Configuration: On Nexus switches, FEC settings can be configured to match the requirements of the connected devices. Incorrect FEC settings can lead to connectivity issues or suboptimal performance.
  • Performance: With the correct FEC settings, 25Gb links can achieve optimal performance, ensuring high throughput and low error rates. This is critical for applications requiring high bandwidth and low latency.
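The Interoperability point boils down to comparing what each end is actually using. Below is a minimal Python sketch of that comparison. The mapping table is my own assumption, not a vendor API: it groups the Linux ethtool encoding names (RS, BaseR, Off) and NX-OS interface fec keywords (rs-ieee, which this deployment used, plus rs-fec, rs-cons16 and fc-fec, which you should confirm against the fec ? help on your release) into coding families only. Because it cannot tell IEEE RS-FEC from other RS-FEC flavors such as the 25G consortium variant, a match here is necessary but not sufficient for the link to come up.

```python
# Hypothetical helper: groups FEC labels into coding families only.
# Assumption: it cannot distinguish IEEE RS-FEC from other RS-FEC variants,
# so a "match" is necessary but not sufficient for the link to come up.

FEC_FAMILY = {
    # Linux ethtool encoding names (compared case-insensitively)
    "rs": "reed-solomon",
    "baser": "fire-code",
    "off": "none",
    "none": "none",
    # NX-OS interface "fec" keywords (verify against "fec ?" on your release)
    "rs-ieee": "reed-solomon",
    "rs-fec": "reed-solomon",
    "rs-cons16": "reed-solomon",
    "fc-fec": "fire-code",
}


def fec_family(label: str) -> str:
    """Map an FEC label to its coding family, rejecting 'auto' and unknown labels."""
    key = label.strip().lower()
    if key == "auto":
        # "auto" describes negotiation, not the encoding actually on the wire
        raise ValueError("compare the active encoding, not 'auto'")
    if key not in FEC_FAMILY:
        raise ValueError(f"unknown FEC label: {label!r}")
    return FEC_FAMILY[key]


def fec_families_match(nic_active: str, switch_active: str) -> bool:
    """True when both ends of the link report the same FEC coding family."""
    return fec_family(nic_active) == fec_family(switch_active)
```

For example, fec_families_match("RS", "rs-ieee") is True, while fec_families_match("BaseR", "rs-fec") is False and points straight at a mismatch.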

Now, I’ve had this issue in the past when connecting a pair of Cisco Catalyst 9200/9300 switches to the Nexus 93180 over 25Gb uplinks, and had to tune the FEC parameters then, but it just didn’t click that this might be the same issue.

Enter Random Cohesity KB Article

Doing some searching on the Cohesity support site, I stumbled upon a KB article addressing a similar issue with Arista switches. So why not, let’s check it out.

Here’s a quick synopsis of the KB article:

Symptom

The links on the nodes do not come up after upgrading the switch.

Cause

The issue might stem from a mismatch in the Forward Error Correction (FEC) mode between the NICs/SFPs and the switch configuration. In the Cohesity article, the problem involved Arista switches, where FEC mode RS had to be set on the switch ports to match the NIC/SFP configuration.

Resolution

The FEC mode supported is determined by the SFP and switch configuration. Cohesity devices default to negotiating FEC automatically using RS encoding. You might need to force RS encoding on the Nexus switch ports connected to the C5016 series NICs.

Here’s an example of checking and setting the FEC mode:

  1. Check the current FEC mode on the NIC:
   sudo ethtool --show-fec ens13f0

Output:

   FEC parameters for ens13f0:
   Configured FEC encodings: Auto RS
   Active FEC encoding: RS
  2. Set the FEC mode on the Nexus switch:
    The exact keyword for forcing RS encoding varies by switch model and NX-OS release (this deployment ended up using rs-ieee). A typical approach is to set it under the interface:
   interface Ethernet1/1
     fec rs-ieee

Make sure the interface matches the one connected to your C5016 NICs, and adjust the syntax as needed for your Nexus model and NX-OS release.

  3. Upgrade Switch Software if Necessary:
    If the issue persists, check whether a newer NX-OS release is available and recommended by Cisco for your platform. Software upgrades can often resolve compatibility issues and improve overall stability.
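When there are several nodes and ports to check, the step 1 output can be parsed rather than eyeballed. The Python sketch below assumes Python 3.7+ and the ethtool binary are available on the host (that may not be true on a Cohesity node; any Linux host with the NIC works the same way). It only understands the label shown in this post and the “Supported/Configured” spelling that some ethtool versions print; parse_show_fec and read_fec are hypothetical helper names.

```python
import subprocess


def parse_show_fec(text: str) -> dict:
    """Parse `ethtool --show-fec` output into configured and active encodings.

    Accepts "Configured FEC encodings:" and "Supported/Configured FEC encodings:".
    """
    result = {"configured": [], "active": None}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        if not sep:
            continue  # not a "label: value" line
        label = label.strip().lower()
        if label.endswith("configured fec encodings"):
            result["configured"] = value.split()  # e.g. ["Auto", "RS"]
        elif label == "active fec encoding":
            result["active"] = value.strip() or None  # e.g. "RS"
    return result


def read_fec(iface: str) -> dict:
    """Run ethtool for one interface (needs ethtool installed, usually root)."""
    out = subprocess.run(
        ["ethtool", "--show-fec", iface],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_show_fec(out)
```

For the node in step 1, read_fec("ens13f0")["active"] would return "RS", which is the value the switch side needs to line up with.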

Additional Steps and Considerations

  • Validate Fiber Connections: Double-check the physical connections, ensuring that the fibers are clean and properly seated.
  • Check SFP Compatibility: Ensure that the SFP modules used are compatible with both the NICs and the switch. Sometimes, swapping out the SFPs can resolve connectivity issues.
  • Consult Vendor Documentation: Refer to the Nexus 93180YC-FX3H switch documentation for specific configuration commands and troubleshooting steps related to FEC and 25Gb connections.

The Fix

After reading the Cohesity article, and remembering my experience connecting the C9200s to the Nexus, I dug into the port FEC options. I didn’t want to simply disable FEC; I wanted to find the setting that actually matched. After trying each of the available FEC modes on the switch ports, I found that rs-ieee worked, and I was able to establish connectivity between the Cohesity nodes and the Nexus switches.
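Since the nodes are cabled redundantly, the same fec rs-ieee setting presumably needs to go on every Cohesity-facing port on both vPC peers. A small Python sketch that renders those interface stanzas consistently is below. The Ethernet1/x names are placeholders and fec_config is a hypothetical helper; it deliberately emits only the fec line and touches no other interface settings.

```python
def fec_config(interfaces, mode="rs-ieee"):
    """Render NX-OS interface stanzas that set one FEC mode on each port.

    `interfaces` holds placeholder names such as "Ethernet1/1"; replace them
    with the ports actually cabled to the Cohesity nodes.
    """
    lines = []
    for name in interfaces:
        lines.append(f"interface {name}")
        lines.append(f"  fec {mode}")  # same mode everywhere avoids mismatches
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical port list; run once per vPC peer with that switch's ports.
    print(fec_config(["Ethernet1/1", "Ethernet1/2"]))
```

Generating the lines from one list keeps both switches identical, which matters more here than saving keystrokes.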

Conclusion

Quite often, I use my blog to help remind me of situations I’ve encountered; call it my personal knowledge base. As higher-bandwidth connections continue to become the norm, if I run into this again on 25Gb or faster links, this post will remind me to check the FEC!

Thanks for reading, I hope you found this valuable if you ran into a similar situation!

Weekly Tech Tip: Check your FEC!