Snooping ODBC connection

ODBC is a very popular database access mechanism though it is considered old compared to others such as OLEDB, etc. Data Direct (Progress) has released a light weight version of snoop some time back. It allows us to connect to local port on your computer for logging ODBC activity. You need to run snoop tool as a command line for redirecting from local port connection to host where a database is running.

Shown below is an illustration for connecting to a database (Oracle as an example) from my laptop with local port of 9999 (you may use other available port on your laptop instead).

Steps:

1) Download snoop from Data Direct. Choose a local port and a local directory where log files can be written to.

2) Create a ODBC DSN and run the snoop command as given below:

java Snoop 10.5.XY.ABC 1521 -snoopport 9999 -traceFilePrefix c:\temp\snoop.log

3) You will notice corresponding logs depending on your database activity through ODBC connection.

Download Netezza (not really!!)

Lots of us tried to download Netezza and try out for free. No software of that sort exists as it is entirely based on hardware. Closest one that comes to my mind is Greenplum. With a bit of luck, you can get Greenplum up and running on a Red Hat (or compatible) Linux in a couple of hours:

http://www.greenplum.com/community/downloads/

Lingo:

  • Greenplum calls segment servers, these are similar to Netezza SPU.
  • In a small PC based implementation, you may have one master (no standby required) and two segment servers.
  • No windows version available, though windows client connectivity tools are supplied.

I believe (not sure) that you can deploy Greenplum in production with up to two segment servers. Look and feel is pretty similar to Netezza. Greenplum has a few more bells and whistles, check their administration guide; only one doc that you need to do anything on Greenplum.

Teradata Compression Tool: AtanaSuite

Teradata Compression Tool

Teradata supports compression from version 12 onwards. Unlike some other databases, we cannot do this command:

alter table daily_inventory compress; // We can’t do this in Teradata

We need to specify what column values we like to compress and we can only pick up to 256 of them. Well, what happens if data changes and if my top 256 is no longer valid. I guess, you need to recompress using most recent table definition. Though compression is cumbersome, Teradata highly recommends using compression for large (fact) tables. We all agree that Teradata need to come up with a better tool or command to implement compression.

Tool

AtanaSoft has many tools that only work on Teradata. They came up with a nice tool called “AtanaSuite” and includes an optional add on for compression. This is easy to use and quite intuitive, no training is recommended. They produced a web seminar that can help us to understand this product better.

 

[For security reasons, text on part of this screen is erased].

Connect and analyze

It is as simple as connect this tool to Teradata and start analyzing. This tool is based on .NET and requires little configuration to connect. Start a new session with “File/New/Compression tool”. This gives an above screen in which “selection” and “options” tabs are enabled.

This tool will list all tables in that schema. Choose a table that you like to compress. This is where the second tab comes handy. As you can see this tool allows:

  • Sample data from a table (like 1%).
  • Sample using first N rows (such as 1000 rows).
  • Full table.

Once you run this tool, it returns quickly; that is, if you have a table > 1TB, it completes full table analysis within a couple of hours. After completion, this tool gives DDL for creating a backup table (under scripts tab). This script does not result in changing the original table itself.

What if scenario

As you can see in the second image, you can ask this tool what if “Change CHAR to VARCHAR” is applied on a table. What compression rate can be expected. Likewise, a number of other options are present in this second tab. You can choose as many check boxes as applicable to see what compression ratio can be obtained.

You may also assign dollar value for storage and will tell you how much storage savings are expected.

Compression ratios

From my experience, these compression rates are quite typical. Obviously, this depends greatly on your data and compression algorithm picked up by database:

  • Teradata: less than 1 to 1.5.
  • Oracle: About 2, depends on PCTFREE, etc., among other factors.
  • Datallegro: Typically 4, can be up to 6.
  • Vertica: Up to 10 or more.

As I said, this is just my observation. Your mileage may vary.

Performance versus storage space

Compression is meant for saving storage. I am yet to come across a situation where database performance improves after compression. Vendors always tell you performance (throughput or response time) greatly improves after compression.

Netezza architecture

IBM published Netezza architecture on their red book site. Nothing different, IBM talks about “FAST”, possibly improved architecture and / or algorithms.

Netezza architecture from IBM

Netezza models: New branding

Looks like Netezza will be branded as

  • Skimmer
  • TwinFin
  • Cruiser

Look at the attached picture, I do not know complete technical details yet.

New Netezza branding by IBM

Netezza Query Monitoring

Netezza query monitoring Netezza twinfin uses a little different architecture compared to old Mustang.

Each node maps to eight disks with 1 TB on each disk on TwinFin 12 system. That is, if we have a TwinFin 12, total capacity is 12 x 8 = 96TB. Here are steps that I used for monitoring purpose; I may update these findings as I find more info:

1) Once a SQL is submitted, corresponding plan is generated at /nz/data/plans/nnnn. Last four digits correspond to the plan that Netezza generates. Check how many steps that plan has by just opening the plan file. As mentioned in my previous posting, each snippet consists of a number and ending with 5xx number. If a snippet is 3, this is a sub section within that plan starting with 3 [01] and ending with 503. As you can recall, entire snippet work is done in parallel. Once this snippet is done, Netezza moves to the next snippet and so on, till completion.

2) Just like in other data appliances, Netezza typically creates a temp table involving two tables that are joined. This table is again joined with the next one, till all of them are done. This mostly involves hash joins.

3) Smaller tables involve replication or broadcast across all SPU. Larger tables involve use of sampling of data. Note that when Netezza data is loaded, it involves a quick statistics update known as express stats.

4) Assuming that we have only one query is running, you can easily evaluate SPU usage with this script at Netezza host:

/nz/kit/bin/nzstats -type spuPartition | grep -i swap

If more than one query is running, you should still get some idea on what is happening. If a SPA is getting filled up quickly compared to others indicate either data distribution problem or data that you are selecting is not evenly spread out on all SPU / SPA.

5) Check GRA and monitor, you can do this through Netezza windows tool.

Checking Netezza SPU swap partition error

You may come across this error due to data distribution. This error is displayed as:

SPU swap partition : Disk temporary work space is full

This is displayed both to the client and in the logs (pg.log or pg.log.x). A couple of simple steps are given below:

1)      If your query consists of sub queries, run each of them. I prefer that you create a table for each sub query section. That is, if your sub query has “select name, make ….from table A order by 1, 2”, create a table using the same SQL with ‘create table t_1 as ‘ followed by that SQL. If this table is created, check distribution.

2)      You can check data distribution is a couple of ways; easy way is to use Netezza administrator. Right click on that table from database tab and choose ‘record distribution’. If data is distributed evenly across all SPU, you should see vertical bars correspondingly.

3)      If step (1) creates a SPU swap partition error, this requires further investigation. This query most likely available under ‘Query history’ under session branch in Netezza administrator. Double click that query and it shows summary.

 

Likely, this points to a problem. In my case, no rows returned and only 5 snippets completed out of expected 7.

(4) If all sub queries and SQL code blocks are completed without a problem when used with ‘create table as’, probably the final SQL block is where you need to check.

Netezza Explain Plan

 
Each query that is submitted to Netezza results in a corresponding query plan. These plans are kept under /nz/data/plans and the archived ones are located under /nz/kit/log/planshist on the Netezza host. Archived files are kept up to a certain number and are purged; something like 5000 files. To save them on a permanent basis, we need to run a scheduler like a cron to move to a directory.

Query plan consists of snippets, beginning with a number ‘n’ (such as 1, 2 or 3, based on how many snippets a code consists of) and ends with 500+n.
 
Illustration (“conf” on the first line is artificially introduced by me):
         1[00]: spu ScanNode table 10001 memoryMode=yes flags=0×4 index=0 cost=24 conf=85
         1[01]: spu ProjectNode, 6 cols, projectFlags=0×0   
                    0: 1: 2: 3: 4: 5:       
         1[02]: spu HashNode table 0 flags=1 numFields=6: 0 0 4 5 6 0        
         1[03]: spu HashHistNode table 10001    
         1[04]: spu TableSizeNode table 10001    
         1[05]: spu ReturnNode     
       501[00]: dbs TableSizeNode table 10001    

 
As can be seen above, this code snippet starts with “1[00]” and ends with “501[00]“. Note that all steps in between are done in parallel; meaning, 1[00], 1[01], 1[02], etc., is all executed in parallel. “spu” refers to code executed at SPU and “dbs” is done at node or host level.
 
All other snippet blocks are executed one after another until end; basically top to bottom, within a snippet code is executed in parallel and thus node does not wait for data as such, it gets streamed as soon as available. Note that snippet block is starts with “5XY”, assuming that we have less than one hundred blocks for snippets!
 
A couple of quick points:
1) All table scans are full table scans; amount of data that is returned to SPU memory is dependent on where filter and other conditions. This results in less amount of data.
2) conf in explain plan refers to confidence level. If we are choosing a year worth of data from a four year sales table, Netezza may assume that it is choosing 25% of data. In this case, confidence level can be anything between 100 to 25%, or even lower, based on data distribution. Basically, Netezza does some kind of sampling based on our where clause. If Netezza does not have required statistics, it may do a quick statistics sampling. This can be seen as a query on its own in Netezza administrator.
3) Look for “DownloadTableNode”, this means that data is getting redistributed on SPU. This is usually a bad thing to happen. Note that double redistribution of two tables is also possible. This means that both tables are relocated based on a common column on another SPU.
4) Netezza sorts data on SPU and brings back this data to host. It does a sort merge, rather than a hash join on returned data OR a nested loop. There is no point trying to do anyting other than sort merge. See my previous posting on “hidden” Netezza parameter which you can change at session level.
5) Just like in any database, time expected, cost and other factors are estimates.
6) Update of statistics is probably one of the most important tasks for dba. A table is recognized as a fact table based on Netezza settings; at nzsql, use ‘show FACTREL_SIZE_THRESHOLD;’ to see this value (OR) you can set it using ‘set FACTREL_SIZE_THRESHOLD = xxxxxx’, you can set it to different numbers.
7) Dimension tables, if small, are broadcasted first, before starting of the query. For fact tables, all statistics details are not generated as it is too expensive for Netezza to accomplish this; instead, Netezza does a dynamic sampling like in Oracle.
8) Each query with some complexity, results in this type of activities:
 a) Create hash for selected first table
 b) Create another hash for next table
 c) hash join both tables into another table on node
 d) hash another table data and hash join with (c) above
 e) repeat above steps until done
This creates a number of temporary tables and are dropped once they are no longer required.
 
Check my earlier posting on hidden Netezza parameters, you can set them at session level to check their impact.

Teradata Architecture (Hardware)

This is a high level architecture for Teradata 5xxx series, 5625 series employ smaller hard drives and thus the architecture is slightly different. Note that number of arrays correspond to number of nodes; in this case, only two arrays are shown for two nodes.

Netezza data types and functions

Here is a list of data types along with associated info. Note that this is generated by “SQuirrel” tool using Netezza JDBC. These data types are likely applicable to both 5.x and 6.x.
TYPE_NAME DATA_TYPE PRECISION NULLABLE CASE_SENSITIVE MAXIMUM_SCALE
NVARCHAR -9 [NVARCHAR] 16000 true true 0
NCHAR -8 [ROWID] 16000 true true 0
BOOLEAN -7 [BIT] 1 true false 0
BYTEINT -6 [TINYINT] 4 true false 0
BIGINT -5 [BIGINT] 20 true false 0
CHAR 1 [CHAR] 64000 true true 0
NUMERIC 2 [NUMERIC] 38 true false 37
DECIMAL 3 [DECIMAL] 38 true false 37
INTEGER 4 [INTEGER] 11 true false 0
SMALLINT 5 [SMALLINT] 6 true false 0
FLOAT 6 [FLOAT] 7 true false 6
REAL 7 [REAL] 7 true false 14
DOUBLE 8 [DOUBLE] 15 true false 14
VARCHAR 12 [VARCHAR] 64000 true true 0
DATE 91 [DATE] 10 true false 0
TIME 92 [TIME] 15 true false 6
TIMESTAMP 93 [TIMESTAMP] 26 true false 6
INTERVAL YEAR 101 [UNKNOWN] 50 true false 0
INTERVAL MONTH 102 [UNKNOWN] 50 true false 0
INTERVAL DAY 103 [UNKNOWN] 50 true false 0
INTERVAL HOUR 104 [UNKNOWN] 50 true false 0
INTERVAL MINUTE 105 [UNKNOWN] 50 true false 0
INTERVAL SECOND 106 [UNKNOWN] 50 true false 0
INTERVAL YEAR TO MONTH 107 [UNKNOWN] 50 true false 0
INTERVAL DAY TO HOUR 108 [UNKNOWN] 50 true false 0
INTERVAL DAY TO MINUTE 109 [UNKNOWN] 50 true false 0
INTERVAL DAY TO SECOND 110 [UNKNOWN] 50 true false 0
INTERVAL HOUR TO MINUTE 111 [UNKNOWN] 50 true false 0
INTERVAL HOUR TO SECOND 112 [UNKNOWN] 50 true false 0
INTERVAL MINUTE TO SECOND 113 [UNKNOWN] 50 true false 0
TIMETZ 1266 [UNKNOWN] 21 true false 6
           
Supported refactoring: If Netezza allows or disallows database commands. For instance, we can not create an index.
           
Key Value        
supportsAddColumn FALSE        
supportsAlterColumnNull FALSE        
supportsAlterColumnType FALSE        
supportsCreateIndex FALSE        
supportsDropColumn FALSE        
supportsDropIndex FALSE        
supportsAddForeignKeyConstraint TRUE        
supportsAddPrimaryKey TRUE        
supportsAddUniqueConstraint TRUE        
supportsAlterColumnDefault TRUE        
supportsAlterSequence TRUE        
supportsCreateSequence TRUE        
supportsCreateTable TRUE        
supportsCreateView TRUE        
supportsDropConstraint TRUE        
supportsDropPrimaryKey TRUE        
supportsDropSequence TRUE        
supportsDropView TRUE        
supportsRenameColumn TRUE        
supportsRenameTable TRUE        
supportsRenameView TRUE        
           
           
Numeric Functions String Functions        
 acos  lcase        
 asin  length        
 atan  locate        
 atan2  ltrim        
 ceiling  rtrim        
 cos  substring        
 cot  ucase        
 degrees concat        
 exp          
 floor          
 log          
 log10          
 mod          
 pi          
 power          
 radians          
 rand          
 round          
 sign          
 sin          
 sqrt          
 tan          
 truncate          
abs          

 

Follow

Get every new post delivered to your Inbox.