Vertica.log is the main log file of the Vertica database. After reading my previous post, HP Vertica Database Log Files and Their Locations, you know how to locate vertica.log on each node. In this post we will examine the contents of this log file and use it in several operational scenarios.
Each node in the Vertica cluster writes its own vertica.log file. You can logically divide the contents of vertica.log into two categories.
The first category is information about things happening on the node itself. It provides insight into the state of the processes running on the node, as well as the activity of user sessions that use this specific node as their initiator. Information in this category is unique to each node and is not shared among nodes. You will need to identify the node that is having trouble, SSH to it, and read vertica.log on that specific node to get insight into its behavior.
The second category is information about the cluster state. This is the information that is distributed to the nodes over Spread, a high-performance group communication system used in HP Vertica to provide a messaging service that is resilient to network faults. Information in this category looks very similar on each node at any given moment. If nodes write information such as the "node state map", it will look similar on every node when written at the same time.
Log Entry Structure
Generally, every line of vertica.log follows the same structure.
At the beginning of the line you will see the timestamp of the event the log entry describes.
Closer to the middle you will see a word in square brackets. It indicates the source of the log entry, and its meaning is usually self-explanatory. Examples are: [Comms], [Recover], [Session], [Init], [Shutdown], [ResourceManager], etc.
Further along, you will see a word in angle brackets. It indicates the severity of the message in the log entry. You may see the following severities: <LOG>, <INFO>, <NOTICE>, <WARNING>, <ERROR>, <ROLLBACK>, <INTERNAL>, <FATAL> and <PANIC>. In this list, <LOG> is the lowest severity and <PANIC> is the highest. The <ERROR> and <ROLLBACK> severities generally represent messages related to a single statement. The <PANIC> severity indicates messages related to issues on the node. In most cases, an event with a severity of <PANIC> will cause the node to go down, and in some cases it can cause the cluster to declare itself UNSAFE and shut down.
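To illustrate this structure, here is one of the log lines from the cluster shutdown example later in this post: the timestamp comes first, [Comms] in square brackets is the source, and <INFO> in angle brackets is the severity.
2016-07-12 04:17:05.586 Spread Client:0xabf44a0 [Comms] <INFO> nodeSetNotifier: node v_testdb_node0011 left the cluster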
Looking at vertica.log
Generally, vertica.log is hard to read. It contains output from all Vertica subsystems, resulting in a mix of very different information. When you are troubleshooting something and looking at vertica.log, you will find that only a small portion of the lines is relevant to your current efforts; the majority will be irrelevant. My personal preference in situations like this is to use grep. Here are some operational scenarios you may encounter and explanations of how to deal with them.
When was Vertica restarted?
Every time the Vertica process starts or a log rotation occurs, Vertica leaves an empty line in the log and writes a line with a timestamp and the string "INFO New log". Here is an example of such a line:
2015-09-02 14:09:12.527 INFO New log
The following code snippet shows how to find how many restarts happened.
[root@vertica-testdb-03 catalog]# grep -n "INFO New log" vertica.log
2:2015-09-02 03:13:01.919 INFO New log
424760:2015-09-02 12:54:35.320 INFO New log
479688:2015-09-02 14:03:27.744 INFO New log
482443:2015-09-02 14:09:12.527 INFO New log
In the grep output we can see that the first occurrence of the string "INFO New log" is on line 2; most likely it was generated during a log rotation. The following three lines clearly indicate that the node had three restarts, all on the same day, September 2, 2015.
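If all you need is a count of the occurrences rather than their line numbers, grep can produce it directly (keep in mind that this count also includes lines written during log rotation, such as the first one above):
[root@vertica-testdb-03 catalog]# grep -c "INFO New log" vertica.log
4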
I usually use the nano text editor, so if I want to examine what happened during the "2015-09-02 14:03:27.744" restart, I would open vertica.log and instruct nano to go to line 479688 with the following command:
[root@vertica-testdb-03 catalog]# nano +479688 vertica.log
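If you prefer not to open the whole file in an editor, you can get a similar effect with sed. This is just a sketch; the second line number simply defines an arbitrary window after the restart entry:
[root@vertica-testdb-03 catalog]# sed -n '479688,479750p' vertica.log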
Why was one of the nodes down?
Sometimes one of the nodes goes down. To get a basic understanding of what happened on that node, I would scan its vertica.log for messages with high severities: <INTERNAL>, <FATAL> and <PANIC>. The strings "Shutdown" and "shutting down" will also point you towards the parts of the log you are interested in.
[root@vertica-testdb-03 catalog]# grep -n "FATAL" vertica.log
3850476:2015-09-11 17:00:42.201 Init Session:0x7f27f00133d0-1c0000000003632 <FATAL> @v_testdb_node0003: 57V03/4748: Shutdown in progress. No longer accepting connections
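If you want to scan for all of the severities and strings mentioned above in a single pass, grep's extended regular expressions will do it. This is only a sketch; adjust the pattern list to your needs:
[root@vertica-testdb-03 catalog]# grep -nE '<INTERNAL>|<FATAL>|<PANIC>|Shutdown|shutting down' vertica.log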
Why was my cluster down?
If the Vertica cluster loses several nodes, a portion of the data may become unavailable for queries. In this situation Vertica will declare the cluster UNSAFE and automatically shut down.
When a cluster is shut down automatically, the majority of the nodes have no issues of their own: they were shut down safely, so looking at vertica.log on those nodes will have no value. To troubleshoot this scenario you need to find which nodes went down first and check their vertica.log files.
SSH to any node in the cluster and grep for "left the cluster".
[root@vertica-testdb-03 catalog]# grep "left the cluster" vertica.log
2016-07-12 04:17:05.586 Spread Client:0xabf44a0 [Comms] <INFO> nodeSetNotifier: node v_testdb_node0011 left the cluster
2016-07-12 11:11:30.463 Spread Client:0xabf44a0 [Comms] <INFO> nodeSetNotifier: node v_testdb_node0010 left the cluster
2016-07-12 11:11:36.087 SafetyShutdown:0x7f0f7403f890 [Comms] <INFO> nodeSetNotifier: node v_testdb_node0007 left the cluster
2016-07-12 11:11:36.087 SafetyShutdown:0x7f0f7403f890 [Comms] <INFO> nodeSetNotifier: node v_testdb_node0008 left the cluster
2016-07-12 11:11:36.088 SafetyShutdown:0x7f0f7403f890 [Comms] <INFO> nodeSetNotifier: node v_testdb_node0009 left the cluster
2016-07-12 11:11:36.088 SafetyShutdown:0x7f0f7403f890 [Comms] <INFO> nodeSetNotifier: node v_testdb_node0012 left the cluster
2016-07-12 11:11:36.091 SafetyShutdown:0x7f0f7403f890 [Comms] <INFO> nodeSetNotifier: node v_testdb_node0001 left the cluster
2016-07-12 11:11:36.091 SafetyShutdown:0x7f0f7403f890 [Comms] <INFO> nodeSetNotifier: node v_testdb_node0002 left the cluster
2016-07-12 11:11:36.091 SafetyShutdown:0x7f0f7403f890 [Comms] <INFO> nodeSetNotifier: node v_testdb_node0003 left the cluster
2016-07-12 11:11:36.091 SafetyShutdown:0x7f0f7403f890 [Comms] <INFO> nodeSetNotifier: node v_testdb_node0004 left the cluster
2016-07-12 11:11:36.092 SafetyShutdown:0x7f0f7403f890 [Comms] <INFO> nodeSetNotifier: node v_testdb_node0005 left the cluster
2016-07-12 11:11:36.092 SafetyShutdown:0x7f0f7403f890 [Comms] <INFO> nodeSetNotifier: node v_testdb_node0006 left the cluster
Now you can see that Spread reported v_testdb_node0011 leaving the cluster at 2016-07-12 04:17:05.586. Afterwards, the cluster stayed UP until 2016-07-12 11:11:30.463, when node v_testdb_node0010 also went down. The rest of the nodes reported SafetyShutdown and left the cluster at the same moment, shortly after v_testdb_node0010. Looking at this output, we can be sure that we need to examine vertica.log on v_testdb_node0010 and v_testdb_node0011; those nodes caused the cluster to perform an automatic shutdown. All other nodes were shut down cleanly, so in most cases we are not going to find anything useful in their vertica.log.
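Once you know the time window, a simple way to narrow vertica.log on the suspect nodes to the minutes of interest is to grep for a timestamp prefix, since every entry starts with its timestamp. As a sketch, this matches the 11:10-11:19 window on 2016-07-12:
[root@vertica-testdb-03 catalog]# grep -n "2016-07-12 11:1" vertica.log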
Conclusion
In the majority of troubleshooting cases you will need to examine the content of vertica.log. Since vertica.log contains output from all of Vertica's subsystems, it holds a lot of different information. To find answers to the questions you are interested in, use any file search utility (e.g. grep). It makes sense to build a library of strings to search for; those strings represent keywords or phrases that can be mapped to specific events in the cluster. If you want to be more proactive, you can feed vertica.log into the ELK Stack or any other log analysis software and get much more insight into the operations of your Vertica cluster.
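As a sketch of such a library, you could keep one search string per line in a plain text file (the file name here is just an example, and the strings are the ones used earlier in this post) and pass it to grep with the -f option:
[root@vertica-testdb-03 catalog]# cat vertica-events.txt
INFO New log
left the cluster
Shutdown in progress
<PANIC>
[root@vertica-testdb-03 catalog]# grep -n -f vertica-events.txt vertica.log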
Very useful information.
Can you provide some clear understanding of K-safety? How many nodes can be down while the cluster still runs safely in a 100-node cluster with a K-safety of 2?
The answer is 66.
The minimum number of nodes that must remain alive can be calculated using the formula below:
1 + Floor(Number of Nodes / (K-safety + 1))
= 1 + Floor(100 / 3)
= 1 + 33
= 34
So with 34 nodes remaining alive, up to 100 - 34 = 66 nodes can be down.