Friday, August 9, 2019

ORA-12571: TNS:packet writer failure – One of the hardest problems I’ve had to resolve

ORA-12571: TNS:packet writer failure

It is a very complicated problem,
mostly related to getting information correctly back and forth across the network.

- a firewall is getting in the way
- host name resolution failing
- multiple process colliding on the same port number
- somewhere in the infrastructure there is a timeout, so (for example) a network router may drop inactive connections
- anti-virus software (typically includes a software firewall).
- somewhere in the infrastructure there is a timeout, so (for example) a network router may drop inactive connections

It is "something is stopping us from successfully sending information between the db and the client across sqlnet."

ORA-12571: TNS:packet writer failure. This is one of the hardest problems I’ve come across in my career and take us a whopping 16 months to resolve. The issue wasn’t really causing any serious problems on our website, but it was a very annoying problem that we spent a ton of time resolving. I want anyone else experiencing this problem to see the solution, so I posted it on Experts Exchange and this blog. Below is the subsequent question I posted on Experts Exchange, and 16 months later the actual solution.

Question/Problem

We have multiple web servers that are load balanced. Each web server connects to our Oracle Database Server through a firewall and load balancer. Everything was working correctly when our Oracle Database Server was Solaris 8 running Oracle 9i.
Since we migrated to Oracle Enterprise Linux 64-bit 5.4 (RHEL 5) running Oracle 11G update 2 (11.2.0.1.0), we are now seeing ORA-12571: TNS:packet writer failure at Oracle.DataAccess.Client.OracleException.HandleErrorHelper messages. The new Linux Oracle Database Server is on the same subnet as the Solaris Oracle 9i Server, and the path the network traffic takes is identical.
What is particularly strange about it is that it happens most (if not exclusively) at times where our website traffic is low. For example, on our busiest day of the month website traffic wise, we went without a TNS:packet writer failure error for 17 hours (7:00am until midnight). Yet as soon as we went into the early hours of the morning (where are website traffic is low or lower), those errors came back.
When we get one of those errors, the website visitor can just wait 1 -3 secs and refresh the page and everything works correctly.
Here’s some other info:
Web servers are running IIS 6, Oracle ODP Net 10.2.0.1.0. Web server connection string (excluding username, p/w, server, port) is ;Persist Security Info=False;Connection Timeout=30;Connection Lifetime=120;Enlist=False;Pooling=True;Max Pool Size=25;Min Pool Size=5;Incr Pool Size=1;Decr Pool Size=1
On the Database Server
listener.ora connect_timeout=10 – no other listener parameters explicitly set.
sqlnet.ora – no parameters explicitly set
Network port stats on Server and switch do not show dropped packets or anything leading us to believe there is a network problem. Also, like I say, those TNS:packet writer failure messages occur during times of less website traffic.
Does anyone have any idea why we are getting those TNS:packet writer failure messages?

Answer/Solution

We finally discovered the cause of the problem. What was happening is that in periods of inactivity the database connections from the web servers had no activity and were being severed after 1hr by the Cisco Firewall and/or Cisco ACE 4710 load balancer (both have 1hr inactivity timeout thresholds by default).
To get around this problem, we added “SQLNET.EXPIRE_TIME= 10” to the sqlnet.ora and reloaded the listeners. This is a dead connection detection parameter that will tell the database server to check that the connections are still alive. How this helps is that it checks the database connections with the web servers every 10 mins and thus the firewall and/or load balancer will thus see activity and thus NOT severe the connections due to inactivity.
This was actually implemented on our old Oracle database Server (Solaris 8/Oracle 9i) but NOT implemented on the new Linux/Oracle 11G Server. And unfortunately was overlooked until now.

No comments:

Post a Comment

  How to Change Instance Type & Security Group of EC2 in AWS By David Taylor Updated April 29, 2023 EC2 stands for Elastic Compute Cloud...