How rows are stored in Teradata.
Teradata uses primary index (Teradata also uses primary key; this is not to be confused with primary index. Primary key is related to PK / FK type of relationships, whereas primary index is for accessing data. Primary index is essential for Teradata. Primary key is optional) to store data. As a new row is inserted into Teradata table, PE (parsing engine or PE; mostly used for preliminary checks on SQL and generates query plan based on table statistics maintained by Teradata) hashes each row based on primary index. For example, if primary index is customer number, a corresponding hash is generated. Since each row is only stored on one AMP (similar to SPU in Netezza), a simple calculation is done to see on which this row is to be stored.

Teradata data distribution on AMPs
If a table uses multiset, exact same data in a new row are tolerated while inserting that data into a table. If the table is defined with set, Teradata takes that row to an AMP where it could be inserted. Upon finding that there is another row with the same data, it will drop this duplicate data.
Why do we need spool in Teradata
In Teradata, whenever a query is executed, it goes to AMP to get data. This data is always assembled on spool on AMP. If two tables are joined, one a small one and the other pretty large one; in this case, small table is copied on all AMPs so that larger table can be joined without redistributing data. Why distribute?
Think of a Teradata with 32 AMPs, each AMP holds a small set (1/32 of a table roughly). Imagine a large table (Table A) with millions of rows and another table with relatively large rows (Table B). If both tables are distributed on AMP using same column(s), there is no problem in joining these two tables irrespective of join condition. This is obvious from the fact that corresponding data is on the same AMP. If a join has to take place, depending on the size of a table, joining is very fast. If Table A and B are distributed using different primary index, these table data is most likely not on the same AMP (because of distribution key, in this case primary key, is different for these tables). In this case, data has to get redistributed on corresponding AMP spool spaces, this is very expensive. If one of the tables (Table B) is small, Teradata may decide to copy on all AMPs as mentioned above; this is relatively quick. Note that this principle applies to most data appliances, including Netezza and Datallegro.
From explain plan below, you will notice that records are duplicated on all AMPs. You will also notice local processing as opposed to distributed processing of data on all AMPs. Sample explain plan, output with colored text from Teradata SQL assistant.:
SQL: explain select count(*) from Heicht.manufacturer a, Heicht.category b where a.category_id=b.category_id
1) First, we lock a distinct Heicht.”pseudo table” for read on a RowHash to prevent global deadlock for Heicht.a.
2) Next, we lock a distinct Heicht.”pseudo table” for read on a RowHash to prevent global deadlock for Heicht.b.
3) We lock Heicht.a for read, and we lock Heicht.b for read.
4) We do an all-AMPs RETRIEVE step from Heicht.b by way of an all-rows scan with a condition of (“NOT (Heicht.b.category_ID IS NULL)”) into Spool 4 (all_amps), which is duplicated on all AMPs. The size of Spool 4 is estimated with low confidence to be 1,024 rows (17,408 bytes). The estimated time for this step is 0.01 seconds.
5) We do an all-AMPs JOIN step from Spool 4 (Last Use) by way of an all-rows scan, which is joined to Heicht.a by way of an all-rows scan. Spool 4 and Heicht.a are joined using a single partition hash_ join, with a join condition of (“Heicht.a.category_ID = category_ID”). The result goes into Spool 3 (all_amps), which is built locally on the AMPs. The size of Spool 3 is estimated with index join confidence to be 156,800 rows (2,352,000 bytes). The estimated time for this step is 0.03 seconds.
6) We do an all-AMPs SUM step to aggregate from Spool 3 (Last Use) byway of an all-rows scan. Aggregate intermediate Results are computed globally, then placed in Spool 5. The size of Spool 5 is estimated with high confidence to be 1 row (23 bytes). The estimated time for this step is 0.01 seconds.
7) We do an all-AMPs RETRIEVE step from Spool 5 (Last Use) by way of an all-rows scan into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with high confidence to be 1 row (25 bytes). The estimated time for this step is 0.00 seconds.
8) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.05 seconds.
Here estimated time of 0.05 seconds is simply unrealistic; as statistics are not accurate or complete, explain reports low level confidence. Can you guess what is wrong with NULLs in primary index; recall, primary index is used by Teradata to distribute data by hashing. If this column is allowed to have NULL values, all records go into the same AMP. These results can result in skewing and this is a major performance bottleneck. Let me reiterate, traditional data appliances performance is heavily reliant on data distribution. This is equally applicable to all operations that we perform on these appliances.
One more reason for Unique Secondary Index
Here is a scenario where unique secondary index is useful. Imagine a table with a Non-unique primary index on customer ID (think of duplicate customer records with the same customer ID) and we received an exactly the same record as this table is defined as a multiset table (see above for definition of a multiset table). In the same record, we have an order number, which is unique within this table. How do we ensure that this record is thrown out, since we do not like to have duplicate records with an order number.
| Customer ID (NUPI) |
Customer Name |
Customer Region ID |
Order number (?) |
Date and other info |
One solution: Create a unique secondary index on Order number.
Note that Teradata 13.x onwards, we can create a table without specifying a primary index (similar thing exists in Netezza, basically rows are inserted in a round robin); data is distributed on AMP randomly, with relatively decent distribution.
Multi Load versus Fast Load versus Tpump in Teradata
| Feature |
Multi Load |
Fast Load |
Tpump |
| Load Process |
Similar to bulk loading for DML (insert, delete and update). Multiload and Fastload move data in big blocks (64K or more). |
Get data to AMP; Hash and redistribute data among AMPs. Need to drop secondary indexes and recreate them (this is also preferred way in Oracle and other databases). FastLoad creates a session on each AMP. |
Also known as trigger pump is based on insert / delete / update SQL operations on data. Moves one row at a time using hash level locks. Obviously, allows empty tables before start of load. |
| Useful for |
Already populated tables. Runs at block level and thus very quick. Allows multiset (duplicate rows). |
When table is empty or daily full loads. FastLoad can convert some data types on the fly (text to numeric, etc). Fastload is fastest(?) of all. |
Batch loads with small volume. Instead of ETL and batch process update, TPump allows slow and continuous updates. Does not lock table for loading purpose, can be run even while updates are going on. Allows RI to exist. |
| Tables |
Can load up to 5 tables at a time. |
Loads one table at a time. |
It can load up to 60 tables at time, if data is from a single source. |
| Number of connections |
Up to 15 multiloads at the same time. |
Up to 15 fastloads at the same time. |
Unlimited |
| NUSI and USI |
Only NUSI allowed |
Neither allowed |
Allows both |
| Limitations |
Does not support selects. Does not support most functions. Requires use of primary index. |
Table need to be empty. No secondary indexes allowed. No support for RI, triggers, multiset (duplicate rows). |
SQL functions are mostly not supported; |
| Data formats |
Text or delimited or fixed formats. |
Text or delimited or fixed formats. |
Supports binary and binary with market, text, unformat and vartext. |