Teradata Compression Tool: AtanaSuite

Teradata Compression Tool

Teradata supports compression from version 12 onwards. Unlike some other databases, we cannot do this command:

alter table daily_inventory compress; // We can’t do this in Teradata

We need to specify what column values we like to compress and we can only pick up to 256 of them. Well, what happens if data changes and if my top 256 is no longer valid. I guess, you need to recompress using most recent table definition. Though compression is cumbersome, Teradata highly recommends using compression for large (fact) tables. We all agree that Teradata need to come up with a better tool or command to implement compression.

Tool

AtanaSoft has many tools that only work on Teradata. They came up with a nice tool called “AtanaSuite” and includes an optional add on for compression. This is easy to use and quite intuitive, no training is recommended. They produced a web seminar that can help us to understand this product better.

 

[For security reasons, text on part of this screen is erased].

Connect and analyze

It is as simple as connect this tool to Teradata and start analyzing. This tool is based on .NET and requires little configuration to connect. Start a new session with “File/New/Compression tool”. This gives an above screen in which “selection” and “options” tabs are enabled.

This tool will list all tables in that schema. Choose a table that you like to compress. This is where the second tab comes handy. As you can see this tool allows:

  • Sample data from a table (like 1%).
  • Sample using first N rows (such as 1000 rows).
  • Full table.

Once you run this tool, it returns quickly; that is, if you have a table > 1TB, it completes full table analysis within a couple of hours. After completion, this tool gives DDL for creating a backup table (under scripts tab). This script does not result in changing the original table itself.

What if scenario

As you can see in the second image, you can ask this tool what if “Change CHAR to VARCHAR” is applied on a table. What compression rate can be expected. Likewise, a number of other options are present in this second tab. You can choose as many check boxes as applicable to see what compression ratio can be obtained.

You may also assign dollar value for storage and will tell you how much storage savings are expected.

Compression ratios

From my experience, these compression rates are quite typical. Obviously, this depends greatly on your data and compression algorithm picked up by database:

  • Teradata: less than 1 to 1.5.
  • Oracle: About 2, depends on PCTFREE, etc., among other factors.
  • Datallegro: Typically 4, can be up to 6.
  • Vertica: Up to 10 or more.

As I said, this is just my observation. Your mileage may vary.

Performance versus storage space

Compression is meant for saving storage. I am yet to come across a situation where database performance improves after compression. Vendors always tell you performance (throughput or response time) greatly improves after compression.

Teradata Architecture (Hardware)

This is a high level architecture for Teradata 5xxx series, 5625 series employ smaller hard drives and thus the architecture is slightly different. Note that number of arrays correspond to number of nodes; in this case, only two arrays are shown for two nodes.

Teradata Single Join Index

Colocation of data is important in data appliances as all required data is located in the same SPU (Netezza) or on the same AMP (Teradata). Note that SPU is no longer a valid terminology in the new Netezza TwinFin architecture. If we are joining on a column on two tables, how do we ensure that data is located on the same AMP. Though we can create a new table with a different data distribution, single join index allows us to specify primary index (this is similar to ‘distribute on’ clause in Netezza). This way, we can join two tables effeciently as newly indexed data ensures that data is colocated.

Teradata Compression Tool

Teradata Compression Tool

Teradata supports compression from version 12 onwards. Unlike some other databases, we cannot do this command:

alter table daily_inventory compress; // We can’t do this in Teradata

We need to specify what column values we like to compress and we can only pick up to 256 of them. Well, what happens if data changes and if my top 256 is no longer valid. I guess, you need to recompress using most recent table definition. Though compression is cumbersome, Teradata highly recommends using compression for large (fact) tables. We all agree that Teradata need to come up with a better tool or command to implement compression.

Tool

AtanaSoft has many tools that only work on Teradata. They came up with a nice tool called “AtanaSuite” and includes an optional add on for compression. This is easy to use and quite intuitive, no training is recommended. They produced a web seminar that can help us to understand this product better.

 

[For security reasons, text on part of this screen is erased].

Connect and analyze

It is as simple as connect this tool to Teradata and start analyzing. This tool is based on .NET and requires little configuration to connect. Start a new session with “File/New/Compression tool”. This gives an above screen in which “selection” and “options” tabs are enabled.

 

This tool will list all tables in that schema. Choose a table that you like to compress. This is where the second tab comes handy. As you can see this tool allows:

  • Sample data from a table (like 1%).
  • Sample using first N rows (such as 1000 rows).
  • Full table.

Once you run this tool, it returns quickly; that is, if you have a table > 1TB, it completes full table analysis within a couple of hours. After completion, this tool gives DDL for creating a backup table (under scripts tab). This script does not result in changing the original table itself.

What if scenario

As you can see in the second image, you can ask this tool what if “Change CHAR to VARCHAR” is applied on a table. What compression rate can be expected. Likewise, a number of other options are present in this second tab. You can choose as many check boxes as applicable to see what compression ratio can be obtained.

You may also assign dollar value for storage and will tell you how much storage savings are expected.

Compression ratios

From my experience, these compression rates are quite typical. Obviously, this depends greatly on your data and compression algorithm picked up by database:

  • Teradata: less than 1 to 1.5.
  • Netezza: Version (TwinFin) 5.x onwards; table compression by default. Typically between 1 and 1.5.
  • Oracle: About 2, depends on PCTFREE, etc., among other factors.
  • Datallegro: Typically 4, can be up to 6.
  • Vertica: Up to 10 or more.

As I said, this is just my observation. Your mileage may vary.

Performance versus storage space

Compression is meant for saving storage. I am yet to come across a situation where database performance improves after compression. Vendors always tell you performance (throughput or response time) greatly improves after compression.

Teradata versus Netezza

I was checking a small table with over 26,000 rows in Teradata. Irrespective of data skew or lack of index or other features, I expected this table to return data within a couple of seconds. I noticed a simple “select *” from Teradata was forcing data to an AMP (similar to SPU in Netezza). Query results returned in about 2 minutes later. This never happens in Netezza, data simply streams from SPU to host without storing data any where. 

* Netezza: within a couple of seconds or less.

* Teradata v12: More than 2 minutes.

I do not buy an argument that this table is not based on mulitset, partition, NUSI, JI or any other exotic operations. I expect any appliance or database to return data within a few seconds, including Oracle.  Surprisingly, not so with Teradata?!

Comparison of Teradata against Netezza

Teradata: AMP; one AMP to many disks; up t0 four columns for distribution; No key may be specified for distribution; PK and FK enforcement; Most database features such as indexes, stored procedures, functions, partitions, etc.; ODBC, JDBC and .NET support; Good support with libraries; Not such a big platform for third party support. Compression in Teradata requires us to specify for each column up to 256 high cardinality values. Somehow, you need to find high occuring values (check out Atanasoft tool with compression optional module in this blog) and create a table structure by specifying all these values. You may ask, what happens a year from now when a fact table is massive and data distribution changed? Looks like we need to rebuild that table with new structure.

Netezza: SPU; one SPU to one disk; up to four columns for distribution; No key may be specified for distribution; PK and FK defined at table level, no enforcement; Lacks most database features comared to Teradata; ODBC and JDBC support; Relatively small set of libraries and code base; Emerging as a platform for third party tools, especially FPGA based programming and analytics.

Netezza materialized views (MV) are pretty lame in stark comparison to quite impressive join indexes in Teradata. Netezza MV is limited to just a single table, no where clause and even restrictions of aggregates such as sum, count, etc.. Teradata supports a few combination of join indexes, including Aggregate Join Index (AJI) and Single Table Join Index (STJI) among others. Teradata AJIs can and usually do cut down query processing on aggregations from a few hours to seconds! Teradata supports many database features including a large set of User Defined Functions (UDF). Netezza also supports UDF, developing them is not as easy. Netezza compression is built in from Netezza 4.5 onwards, achieves typically over 30% savings.

More on Teradata

How rows are stored in Teradata.

Teradata uses primary index (Teradata also uses primary key; this is not to be confused with primary index. Primary key is related to PK / FK type of relationships, whereas primary index is for accessing data. Primary index is essential for Teradata. Primary key is optional) to store data. As a new row is inserted into Teradata table, PE (parsing engine or PE; mostly used for preliminary checks on SQL and generates query plan based on table statistics maintained by Teradata) hashes each row based on primary index. For example, if primary index is customer number, a corresponding hash is generated. Since each row is only stored on one AMP (similar to SPU in Netezza), a simple calculation is done to see on which this row is to be stored.

Shows basic idea behind distribution of data on Teradata AMPs

Teradata data distribution on AMPs

If a table uses multiset, exact same data in a new row are tolerated while inserting that data into a table. If the table is defined with set, Teradata takes that row to an AMP where it could be inserted. Upon finding that there is another row with the same data, it will drop this duplicate data.

Why do we need spool in Teradata

In Teradata, whenever a query is executed, it goes to AMP to get data. This data is always assembled on spool on AMP. If two tables are joined, one a small one and the other pretty large one; in this case, small table is copied on all AMPs so that larger table can be joined without redistributing data. Why distribute?

Think of a Teradata with 32 AMPs, each AMP holds a small set (1/32 of a table roughly). Imagine a large table (Table A) with millions of rows and another table with relatively large rows (Table B). If both tables are distributed on AMP using same column(s), there is no problem in joining these two tables irrespective of join condition. This is obvious from the fact that corresponding data is on the same AMP. If a join has to take place, depending on the size of a table, joining is very fast. If Table A and B are distributed using different primary index, these table data is most likely not on the same AMP (because of distribution key, in this case primary key, is different for these tables). In this case, data has to get redistributed on corresponding AMP spool spaces, this is very expensive. If one of the tables (Table B) is small, Teradata may decide to copy on all AMPs as mentioned above; this is relatively quick. Note that this principle applies to most data appliances, including Netezza and Datallegro.

From explain plan below, you will notice that records are duplicated on all AMPs. You will also notice local processing as opposed to distributed processing of data on all AMPs.                 Sample explain plan, output with colored text from Teradata SQL assistant.:

SQL: explain select count(*) from Heicht.manufacturer a, Heicht.category b where a.category_id=b.category_id

 

  1) First, we lock a distinct Heicht.”pseudo table” for read on a RowHash to prevent global deadlock for Heicht.a.

  2) Next, we lock a distinct Heicht.”pseudo table” for read on a RowHash to prevent global deadlock for Heicht.b.

  3) We lock Heicht.a for read, and we lock Heicht.b for read.

  4) We do an all-AMPs RETRIEVE step from Heicht.b by way of an all-rows scan with a condition of (“NOT (Heicht.b.category_ID IS NULL)”) into Spool 4 (all_amps), which is duplicated on all AMPs.  The size of Spool 4 is estimated with low confidence to be 1,024 rows (17,408 bytes).  The estimated time for this step is 0.01 seconds.

  5) We do an all-AMPs JOIN step from Spool 4 (Last Use) by way of an all-rows scan, which is joined to Heicht.a by way of an all-rows scan.  Spool 4 and Heicht.a are joined using a single partition hash_ join, with a join condition of (“Heicht.a.category_ID = category_ID”). The result goes into Spool 3 (all_amps), which is built locally on the AMPs.  The size of Spool 3 is estimated with index join confidence to be 156,800 rows (2,352,000 bytes).  The estimated time for this step is 0.03 seconds.

  6) We do an all-AMPs SUM step to aggregate from Spool 3 (Last Use) byway of an all-rows scan.  Aggregate intermediate Results are computed globally, then placed in Spool 5.  The size of Spool 5 is estimated with high confidence to be 1 row (23 bytes).  The estimated time for this step is 0.01 seconds.

  7) We do an all-AMPs RETRIEVE step from Spool 5 (Last Use) by way of an all-rows scan into Spool 1 (group_amps), which is built locally on the AMPs.  The size of Spool 1 is estimated with high confidence to be 1 row (25 bytes).  The estimated time for this step is 0.00 seconds.

  8) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.

  -> The contents of Spool 1 are sent back to the user as the result of

     statement 1.  The total estimated time is 0.05 seconds.

Here estimated time of 0.05 seconds is simply unrealistic; as statistics are not accurate or complete, explain reports low level confidence. Can you guess what is wrong with NULLs in primary index; recall, primary index is used by Teradata to distribute data by hashing. If this column is allowed to have NULL values, all records go into the same AMP. These results can result in skewing and this is a major performance bottleneck. Let me reiterate, traditional data appliances performance is heavily reliant on data distribution. This is equally applicable to all operations that we perform on these appliances.

One more reason for Unique Secondary Index

Here is a scenario where unique secondary index is useful. Imagine a table with a Non-unique primary index on customer ID (think of duplicate customer records with the same customer ID) and we received an exactly the same record as this table is defined as a multiset table (see above for definition of a multiset table). In the same record, we have an order number, which is unique within this table. How do we ensure that this record is thrown out, since we do not like to have duplicate records with an order number.

Customer ID (NUPI) Customer Name Customer Region ID Order number (?) Date and other info

 

One solution: Create a unique secondary index on Order number.

Note that Teradata 13.x onwards, we can create a table without specifying a primary index (similar thing exists in Netezza, basically rows are inserted in a round robin); data is distributed on AMP randomly, with relatively decent distribution.

Multi Load versus Fast Load versus Tpump in Teradata

Feature Multi Load Fast Load Tpump
Load Process Similar to bulk loading for DML (insert, delete and update). Multiload and Fastload move data in big blocks (64K or more). Get data to AMP; Hash and redistribute data among AMPs. Need to drop secondary indexes and recreate them (this is also preferred way in Oracle and other databases). FastLoad creates a session on each AMP. Also known as trigger pump is based on insert / delete / update SQL operations on data. Moves one row at a time using hash level locks. Obviously, allows empty tables before start of load.
Useful for Already populated tables. Runs at block level and thus very quick. Allows multiset (duplicate rows). When table is empty or daily full loads. FastLoad can convert some data types on the fly (text to numeric, etc). Fastload is fastest(?) of all. Batch loads with small volume. Instead of ETL and batch process update, TPump allows slow and continuous updates. Does not lock table for loading purpose, can be run even while updates are going on. Allows RI to exist.
Tables Can load up to 5 tables at a time. Loads one table at a time. It can load up to 60 tables at time, if data is from a single source.
Number of connections Up to 15 multiloads at the same time. Up to 15 fastloads at the same time. Unlimited
NUSI and USI Only NUSI allowed Neither allowed Allows both
Limitations Does not support selects. Does not support most functions. Requires use of primary index. Table need to be empty. No secondary indexes allowed. No support for RI, triggers, multiset (duplicate rows). SQL functions are mostly not supported;
Data formats Text or delimited or fixed formats. Text or delimited or fixed formats. Supports  binary and binary with market, text, unformat and vartext.
Follow

Get every new post delivered to your Inbox.