As network engineers we get involved in all kinds of issues that on the surface appear to be communication problems, but underneath are configuration or system issues.
In this post I will go over an instance in which an ESX host was unable to mount an NFS store, and everything appeared to point to the network as the cause.
The ESX servers have VMkernel IP addresses, but only the management address has a default gateway. Therefore the management address is the one used to communicate with IPs outside of the ESX networks.
For this particular case the IP Address of the NFS target is 10.232.213.102 and the IP Address of the ESX Server is 10.231.222.14.
The NFS target was a NetApp. The following ports were identified from prior NFS connections and from online documentation:
TCP/UDP 111 – RPC bind (portmapper).
TCP/UDP 635 – NFS mount.
TCP/UDP 2049 – NFS server daemon.
The base topology includes a firewall with a security rule in place that allows the NFS communication as well as ICMP. Below is the base topology:
In order to validate the NIC configuration we enabled SSH on the ESX host. To do this, open the vSphere Client, select the host, and navigate to Configuration > Software > Security Profile. Click the SSH label; if the daemon is not running, click “Options” and then “Start”, then click “OK” to close the dialog box and “OK” once more to close the Security Profile configuration window.
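As an aside, SSH can also be enabled from the host’s local console with vim-cmd. A minimal sketch, printed rather than executed here since the commands only exist on an ESXi host:

```shell
# Hypothetical CLI alternative to the vSphere Client steps above
# (requires ESXi Shell access on the host itself).
for CMD in "vim-cmd hostsvc/enable_ssh" "vim-cmd hostsvc/start_ssh"; do
  echo "$CMD"
done
```

My understanding is that enable_ssh flips the service on at the policy level, while start_ssh starts the daemon immediately, mirroring the “Start” button in the Security Profile dialog.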
We then opened a terminal client and used SSH to connect to the host using IP 10.231.222.14.
The commands listed below were used:
vmkping -I vmk0 -s 1472 10.232.213.102
This command sends an ICMP ping to 10.232.213.102 (the NFS target) using vmk0 as the source, with an ICMP payload of 1500 − 28 bytes of overhead = 1472 bytes.
-I Parameter is used to specify the outgoing source interface.
-s Parameter is used to specify the number of ICMP data bytes to send. This can be helpful when the MTU size is in question. For a jumbo-frame configuration use 8972, since adding the 28 bytes of overhead results in a 9000-byte packet.
RESULT: vmkping was successful, confirming routing reachability and the firewall rule.
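The 28 bytes of overhead above are the 20-byte IP header plus the 8-byte ICMP header. A quick sketch of the payload math, with the jumbo-frame variant of the ping left as a comment (vmk1 is a hypothetical jumbo-configured interface, and -d sets “don’t fragment” so an undersized MTU anywhere on the path makes the ping fail instead of silently fragmenting):

```shell
# ICMP payload = MTU - 20 (IP header) - 8 (ICMP header)
STD_MTU=1500; JUMBO_MTU=9000
echo $((STD_MTU - 28))    # 1472: payload for a standard MTU
echo $((JUMBO_MTU - 28))  # 8972: payload for a jumbo MTU
# On the ESXi host (hypothetical vmk1 jumbo interface):
#   vmkping -I vmk1 -d -s 8972 10.232.213.102
```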
esxcfg-vmknic -l
This command displays the vmknics. These are all of the VMkernel interfaces, including the management interface, with their IP addresses and subnet masks.
-l Parameter is used to list the vmknics.
RESULT: In the image below you can see the management vmknic as vmk0 with an MTU of 1500 as well as another vmknic configured with an MTU of 9000.
esxcfg-route -l
This command displays the network routes. In the output you can see that the default route is tied to the management vmknic.
-l Parameter is used to list the route entries.
RESULT:
nc -z 10.232.213.102 2049
This command uses the netcat utility to test connectivity to an IP on a specific port number.
-z Parameter specifies zero-I/O mode: netcat only checks whether the port is open and sends no data.
RESULT: There was no output and the command timed out, indicating that the connection was not successful.
nc -uz 10.232.213.102 2049
This is the same command as before, but the -u parameter makes the connection test use UDP instead of TCP.
RESULT: The output showed a successful port verification test over UDP.
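The two probes above generalize to all three NFS-related ports. The loop below prints the full probe list rather than executing it, since the target is specific to this environment. One caveat: a UDP “success” from nc -uz is weak evidence, because UDP is connectionless and netcat can only report the port closed if an ICMP port-unreachable message comes back.

```shell
# 10.232.213.102 is the NFS target from this post; -w 2 caps each probe at 2 seconds.
TARGET=10.232.213.102
for PORT in 111 635 2049; do
  echo "nc -z -w 2 $TARGET $PORT"    # TCP: connect scan only, no data sent
  echo "nc -uz -w 2 $TARGET $PORT"   # UDP: reported open unless ICMP unreachable returns
done
```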
Our firewall logs showed that the connection attempts over both UDP and TCP were allowed. This ruled out routing and the firewall as likely problems, but it was not sufficient proof that the network was not at fault in this communication flow.
To further validate that the network was operating correctly and that the NFS communication was not being blocked, we researched a way to perform a packet capture directly on the ESX host in question.
VMware provides a “tcpdump” utility to perform a packet capture. The ESXi command to run the utility is “tcpdump-uw”; we used it in the following way:
tcpdump-uw -vv -i vmk0 -s 9014 -w /var/tmp/NFS-PCAP-02222018.pcap port 2049 or port 111 or port 635
-vv Parameter indicates that a fuller protocol decode will be used. A single -v would have produced less detailed verbose output according to information I found online, but I couldn’t get the verbose output to work; note that when writing to a file with -w, packets are saved raw and the decode is not printed to the console.
-i Parameter is used to indicate the interface to capture on.
-s Parameter is used to indicate the snap length (SnapLen), the maximum number of bytes captured per packet. Since in this case we could be dealing with jumbo frames, we specified a size of 9014 (a 9000-byte MTU plus the 14-byte Ethernet header).
port is a capture-filter primitive used to indicate a port number to capture. The “and” and “or” operators are allowed, but keep in mind that joining multiple ports with “and” requires all of those ports to appear in the same packet, which in a flow like this one results in no data being captured.
-w Parameter is used to specify the file used to store the captured data. For our use case we stored the file in the /var/tmp/ folder.
To stop the capture we used CTRL-C.
The output looked as follows:
In this case you can see that we captured 9 packets. To retrieve the file we used SCP to connect to the host.
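The retrieval itself is a single scp from the workstation, since the SSH daemon enabled earlier also services SCP. The command is printed rather than executed here because the host is specific to this environment:

```shell
# Host IP and capture path are the ones used earlier in this post.
echo "scp root@10.231.222.14:/var/tmp/NFS-PCAP-02222018.pcap ."
# To peek at the capture on the host before copying it off:
#   tcpdump-uw -nr /var/tmp/NFS-PCAP-02222018.pcap
```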
Once we got the file, we opened it in Wireshark and identified the following:
- The TCP handshake completed successfully on TCP port 111.
- The portmap operation completed successfully, and the port to be used for mount was identified as 635.
- There was one retransmission during the portmap operation.
- The TCP handshake was successful on port 635.
- The “mount” operation was attempted over port 635 and an error message was sent by the NFS target. The error was “ERR_ACCESS”.
- There was a retransmission of the error packet.
Below is a screenshot of the capture:
Based on the error message we were able to identify that the issue was located at the NFS target. Once the configuration was verified on the NetApp controller, it was found that the permissions were not properly configured to allow the connection from the ESX server.
Once the permissions were corrected the NFS mount operation was successful.
Below are the online resources we used to identify this issue:
https://kb.vmware.com/s/article/1003967
https://communities.vmware.com/thread/474503
http://www.virten.net/2015/02/esxi-network-troubleshooting-commands/
https://kb.vmware.com/s/article/1031186
http://docs.netapp.com/ontap-9/index.jsp?topic=%2Fcom.netapp.doc.dot-cm-nmg%2FGUID-49D0B88F-42CF-4766-A688-1C77A0AE8BD5.html
http://pubs.vmware.com/vsphere-6-5/index.jsp?topic=%2Fcom.vmware.vcli.getstart.doc%2FGUID-C3A44A30-EEA5-4359-A248-D13927A94CCE.html