Oracle to Netezza migration

Major steps involved in Oracle to Netezza migration are these:
1) Capture DDL, you may use “dbms_metadata.get_ddl” to capture all table definitions. Change varchar2 to varchar, does not really matter if char is more than 16 characters; stored the same way if more than 16 characters. Number is either numeric or float type. Timestamp can be either data or time. This is the most time consuming part and mostly done manually though tools are available, such as ErWin reverse engineering.

2) Most important for Netezza is distribution of data on SPU or disk. Choose distribution for each table, you may like to pick up only a handful fields for distribution of data for tables across the entire database. Recommended to use only one distribution key (field) per table, not multiple fields. For small tables, Netezza broadcasts to host where data gets sent to all SPU or nodes; this is similar to replicate in other appliances.

3) With steps (1) and (2) above, DDL are created on Netezza. Create a user and / or group and assign permissions.

4) You may like to NFS mount Netezza binaries on Oracle host (OR) Oracle file system on Netezza. I prefer the first one, if Oracle system folks do not object.
Mouting NFS from NZ host to Oracle:
Oracle to Netezza migration typically involves spooling data out as a flat text file and using nzload to seed data. 
A) These are contents of /etc/export after making changes for allowing Oracle system to access NFS mount. Last line is added for this purpose. IP address for illustration purpose: Oracle (10.11.12.13) & Netezza (10.10.11.11).
[root@NZ_host1 ~]# cat /etc/exports
# NPS setup — do not edit below this line — NFS
/nz/export      10.0.0.2/16(rw,async,insecure,anonuid=500,anongid=500)
# NPS setup — do not edit above this line — NFS
# For exporting as NFS mount on to Oracle
/nz/kit         10.11.12.13/255.255.255.255(sync,no_root_squash,rw)

B) Issue this command at Netezza host
[root@NZ_host1 ~]# exportfs -ra

C) Make a new directory on Oracle, for example: /home/oracle/NZ_bins

D) On Oracle system, do a mount command
[root@ora_dw1 /]# mount -t nfs 10.10.11.11:/nz/kit /home/oracle/NZ_bins

Once mounted, you should see binaries under bin directory (/home/oracle/NZ_bins). If you are trying to load using nzload, you may encounter a problem with password. You need to cache password on the local host first.

[oracle@ora_dw1 bin]$ nzpassword add -u admin -pw password -host 10.10.11.11
[oracle@ora_dw1 bin]$ nzpassword

Host User
————- —–
10.10.11.11 admin
[oracle@ora_dw1 bin]$ ./nzload -host 10.10.11.11 -u nzadmin1 -pw zzadmin1 -db dw_db1 -t day_time_t -delim ‘|’ -df /tmp/1

Above should load data into remote netezza database. If done, go to Netezza host and use nzsql to find number of rows inserted. Check on Oracle row count
5) NZ load can be streamed using UNIX named pipes. That is, first do ‘mkfifo /oraExport/fifo.tmp’ for example. Then, spool Oracle data and write to this file (spool /oraExport/fifo.tmp). Assuming Netezza binaries are available (see above steps), load directly on to Netezza host. Though these steps can be put in a batch file or shell command, due to nature of loads that they may not go perfectly if scripted.

Shortcut to phenomenal Oracle 11g performance (?)

As you know, Oracle provides way too many parameters, pretty much infinite combinations if you consider all possibilities of normal parameters and hidden ones. No one probably considers more than a handful to see how Oracle delivers results. After grappling with various settings and tuning tricks, I tried the following parameters. Each of them, based on the query, reduced response from days (yes, days) to seconds of response time. My environment is a mid sized DW database close to 5TB data. Good thing about them is that we can change these parameters and check results at session level:   

 
 
 
 

  • alter session set “_gby_hash_aggregation_enabled” = FALSE;

(Hash group by aggregation is not chosen in this case). 

  • alter session set optimizer_features_enable=’10.2.0.4′;

(To see acceptable values, give a wrong version like ’1.1′ and Oracle will suggest version numbers). 

  • alter session set “_query_rewrite_fudge” = 75;

(This favors materialized view query rewrite; of course, we can also give a hint to use materialized view, that is manually choosing which MV to specify. Whereas a general hint will favor appropriate MV, if available). 

  • alter session set optimizer_cost_index_cost_adj=50;

(Favors use of indexes as opposed to full table scanning. In this case, use of indexes is preferred twice the usual cost of index usage). 

  • alter session set “_optimizer_cost_based_transformation”=off;

(Disables subquery view, in all cases that I came across resulted in severe degradation of performance). 

Switching with these parameters is just a matter of seconds and we can quickly assess impact by either running the query itself or by pulling explain plan. I prefer this way of getting explain plan (run these commands at SQL prompt): 

  • explain plan for your_sql_query;
  • select * from table(dbms_xplan.display); 

 

Make sure to switch back to previous parameter settings. Most tools also allow before and after switching of parameters at a session level. I never used more than one parameter switching for any troubling query. As you guessed, there is no one parameter that worked for all queries; it all depends on the plan. 

Another cool feature that I experimented with in Oracle 11g is SPM (SQL Plan Management). More in on this in my next blog item.

Yet another advantage of Bitmap Index: No FTS for NULL values

We know that bitmap indexes are very useful in a large database environment. Apart from usual benefits, I realized that some of tables contained NULL values on indexed columns. This resulted in full table scans (FTS) and in some cases resulted in unacceptable response time. I created a bitmap index, which was a b-tree index, this has avoided FTS and has responded immediately.

Testing:
Very simple. Just create a table from dba_objects. Create an index and search for a value, it is likely to use index. Explain plan or trace file gives this info. Insert a NULL value and check if NULL is searched, it will do an FTS. Drop this B-tree index and create a bitmap index on the same column. You will see that an index scan is done on this column.

One disadvantage:
Avoid mixing Bitmap and B-tree indexes. This can result in performance degradation as bitmap indexes are converted into rowid scans in your queries. This slows down the process.

Operating Oracle 11gR2 as Oracle 10g or 9i

Oracle 11gR2 is very sophisticated with various options. Sometimes, Oracle 11gR2 or even Oracle 10g does not produce optimum query plan. Though there are a few ways to get around like using hints or changing parameter settings at database level, one of the quick and easy way to check if a query gets completed is by switching optimizer to 10g or even 9i.

From my observation, DW queries involving 3NF or transactional table layout benefit by this switch over. Here is why this is desirable in some cases:

  • End user can simply ‘alter session’ command to switch over, no database setup changes required. No systems administration need to get involved.
  • Though Oracle 11g sometimes skips temporary table creation during query processing (new optimizer feature), in some cases, Oracle 9i optimization works well.
  • One can also stablize query plan (details outside scope of this posting).
  • Using materialized view (MV) sometimes is not useful as Oracle 11g optimizer thinks that ignoring MV is a better alernative. 

SQL> show parameter optimizer_features_enable

NAME                                 TYPE                             VALUE
———————————— ——————————– ——————————
optimizer_features_enable            string                           11.2.0.1

Let us check what values we can set

SQL> alter session set optimizer_features_enable=’9.5′;
ERROR:
ORA-00096: invalid value 9.5 for parameter optimizer_features_enable, must be from among 11.2.0.1.1, 11.2.0.1, 11.1.0.7, 11.1.0.6, 10.2.0.5, 10.2.0.4, 10.2.0.3, 10.2.0.2, 10.2.0.1, 10.1.0.5, 10.1.0.4, 10.1.0.3, 10.1.0, 9.2.0.8, 9.2.0, 9.0.1, 9.0.0, 8.1.7, 8.1.6, 8.1.5, 8.1.4, 8.1.3, 8.1.0, 8.0.7, 8.0.6, 8.0.5, 8.0.4, 8.0.3, 8.0.0

SQL>   alter session set optimizer_features_enable=’9.2.0′;

Session altered.

SQL> show parameter optimizer_f

NAME                                 TYPE                             VALUE
———————————— ——————————– ——————————
optimizer_features_enable            string                           9.2.0

After running your queries, we can set it back either by altering session back or just terminating session. Second choice is not a good option if you are planning to do this often through your tool. Each session creation is resource intensive and introduces latency in overall query performance.

Oracle versus Teradata

I recently worked with both Oracle 11g and Teradata 12. Here is my opinion (see disclaimer under ‘Blog Owner’ link):

Oracle 11g is arguably most sophisticated widely available database software. It supports many features that are (possibly) not present in any database.

Teradata is more like Netezza or Datallegro in its appearance, from architecture point of view. After working for a while on Teradata, I am under the impression Oracle or Netezza is superior. More importantly, Teradata is pretty much a black box; whereas in Netezza, you can see what is really going on to a greater extent.

 

Oracle 11g

Teradata 12

Information

Just do google on any Oracle topics, you will find thousands of
relevant web pages. Some of them with thorough analysis.

Almost no other databse comes close to it in this respect. Teradata has almost no information available easily.

Technical pool

Afer SQL server, a large number of people are
experienced with Oracle. Most of Oracle experienced folks have much deeper
understanding than their counterparts.

No comparison here.

Workload management

Easy to implement,
feature rich, allows very complex settings within a reasonable time limit.

Workload management is
complex, will probably take much more time to implement.

Parallel Processing

Oracle 11g with OEM
(Oracle Enterprise Manager) makes easy to monitor queries at a very low
level. It is an amazing product, no other database
provider has anything close to it.

Parallel by design. View
point is a similar tool, not any where close to the
sophistication level Oracle has implemented. OEM is not free, licensing required from Oracle.

Explain plan or query
plan

Though intimidating at
first few times, once some one pays attention to
details, you will know at a very detailed level. Almost all particulars are
pretty clear and can be monitored as a query makes progress.

Explain plan gives good
timing information, easier to read & understand compared to Oracle’s
explain plan. Query monitoring is not as good. Lots of times, monitored timing is out of synch with explain plan; same as Oracle in this respect.

Joins

Hash, sort / merge and
nested loops, mostly.

Hash joins. Have not seen much of merge or nested loops.

Query

Extensive query features.
Oracle has added a vast query capabilities,
including a very large set of analytics. Oracle SQL is like Lexus of
databases.

Teradata query capabilities are quite thorough too, not close to
Oracle’s by any means. Documentation is poor, with limited number of
examples.

Inside out

With Oracle, you can find
pretty much everything at every level. This makes it more lot more fun and
fulfilling. Enormous information available.

To a large extent a black
box. You do not even have access to AMP or node (this is similar to SPU in Netezza. On Netezza and Datallegro, you can go to SPU level and even run SQL
commands locally).

Materialized view

Oracle supports wide
variety of MV, including features to limit refresh. Almost any SQL can be
used to create an MV.

Aggregate Join Index is
equivalent to MV. Only supports sum and avg
functions. Gets refreshed automatically. AJI performance is far superior
compared to Oracle’s, as AJI is accessed directly by Teradata.
In Oracle’s case, MV has to still go through Oracle (relatively large) SGA,
etc.

Result Cache

This is where Oracle
catches up with most data appliances. Result cache is amazing in its
performance.

No cache, by default
everything is parallelized. No need for result cache.

Partitioning

Way too many options and
features.

Available, not as
flexible as Oracle.

Query rewriting

Optimizer uses query
rewrite features to make use of underlying Materialized views,
this is sometimes not the case in Oracle.

Nice feature, effective
most of the time. Can reduces time response from
hours to seconds.

OLTP

Yes. In fact, it is even
possible to use Oracle database for both OLTP and OLAP (DW) at the same time.

OLTP is not simply
possible.

Table locking

Uses version control
(read consistency). We can even write to the same table that we are reading
from, in the same session. No problem.

Very inflexible, poor
locking. If we are creating an index, that table gets locked out.

Skewing

Not applicable. Queries
typically always get completed, they may take more time.

Skewing can really screw
up Teradata performance, in many cases, query
fails. There is no easy way to proceed, rearranging of data or rewriting of
the query is the only way.

Compression

Easy to use at table
level with just one keyword ‘for all’. Other options available, I did not use
any others.

We need to find repeating
values (cardinality) and use it a table definition. Not sure if any other
option is available.

 

Teradata is yet to make compression available that is easy to use, which is a pretty standard in many databases. Compression syntax with Teradata is difficult to use and we have to know list of values that are likely to repeat. Aggregate Join Index is a bit lame compared to what users expect or need. Everything in Teradata looks like hash based (yet to spend more time with Teradata), Oracle has quite a number of index types.

To be fair, Teradata Aggregate Join Index (AJI) is extremely fast compared to non-AJI. It also supports Oracle OLAP like feature automatically. As mentioned above, AJI gets refreshed whenever base tables are updated. One thing Teradata has pulled out very well is performance, quite a few times it can beat Oracle in terms of both response and throughput.

I admit that these observations are based based on relatively short span of attention with Teradata. One thing that amazes me is that Teradata delivers pretty good performance for some queries where Oracle does not match. To get good Oracle response, we have to tune queries or change parameters; see my recent postings above.

Just like other data appliances, Teradata expects us to treat database as a black box. Ask any one who worked on an Oracle RAC with 3 nodes or more. They must have encountered numerous system issues, including gc contention and stabilityproblems. Teradata is very stable, never crashed or even paused. As a user or as an admin, you are not expected to know system particulars; Oracle is almost exactly opposite of it!

Exadata2

Recently Oracle announced a number database releases, including Oracle Exadata2. Apparently, it can compress up to 10 times DW data. This means, up to 50 tera bytes can fit in flash memory at storage side. OLTP compression is about 3 times. I am not sure how extensively their result sets were, I thought compression beyond 5 is difficult to achieve. Here is an architecture of a single rack exadata.

Exadata2 architecture

Exadata2 architecture

As I pointed out earlier, there are no big customers using Exadata. I wonder why this was not sold (or given for free) to a high profile customer and get them present their findings.

Oracle is trying to sell Exadata as *the* database, supporting both OLTP and DW. This will not fly. In many medium to large companies, OLTP and OLAP departments live in different worlds. We can not even get OLTP folks standardize on data definitions and source data. Try to get them use a single database for all their needs?

Exadata: Quite a lemon?

Oracle has one of the largest followings, I am proud to be one of them. Oracle has about half of all enterprise database installations and their documentation is pretty extensive. Online support is incredible, probably not many software companies match.

We know I/O is the choke point for a large databases; CPU, hardware and even software caught up. Data appliances tried many ways to minimize this trouble spot. Database design now going closer to hardware implementations with dba or programmers thinking on how data is stored and how to access this data quickly.

Some pointers:

  • Exadata uses hardware and storage (V1 based on HP; V2, just released, is based on Sun), with Oracle as a database engine. Oracle is burdened with so many features, I wonder if it is possible to make it as nimble as other databases with smaller footprint. I was involved in a POC (proof of concept) for Oracle versus mySQL for a web implementation. mySQL outperformed in many benchmark tests; later we fine tuned Oracle and it matched or exceeded. One might argue that Oracle features are not to be compared with mySQL, I agree.
  • A couple of my observations on Exadata:
    Oracle Exadata is not ready. Only two customers for Exadata V1 (?), no major US installations. Since Oracle closely guards their marketing success, this is my assumption. It is possible companies like Sears, which has a large collection of database appliances, might be exploring Exadata for a production rollout.
  • Oracle is well endowed, that may be the problem. Their recent acquisition of mySQL and enterprise DB should change a bit. None of these databases are known to be a good fit for DW databases.
  • Netezza has one more point, Oracle did not quite catch up. I have not seen any bench marks of Exadata against other appliances, all are Exadata versus (presumably) Oracle. I looked at Exadata V2 FAQ, it looked same except flash memory is added and now switched to Sun hardware.
  • Netezza has 1:1 database and storage mapping. That is, each SPU has everything on it. Exadata follows datallegro and other architectures by having 1:12 database and storage mapping. This looks like a higher number and scalability is in doubt.
  • Oracle is considered as too complicated and complex by many. In fact, there is a claim that Oracle database documentation can fill up a couple of large size rooms to the ceiling! Oracle’s database schema is larger than many white boards and that is only for core tables. We heard almost all this from Netezza, more smiles. Did you Netezza requires “less than one dba” to manage their appliance (from Netezza users guide).
  • Exadata supports up to a peta database or more. Do you really think that we need that much data on any one production database.

Exadata is another branding for Oracle on steroids. Oracle will certainly make extensive improvements and their marketing teams will push this product pretty well. That takes time, so Exadata is a lemon for now :-)

SQL Tuning

Here are a few observations that I made, some are based on books & online articles. These are mostly applicable to Oracle:

  • Make the most often used column in a where clause as the leading column in a concatenated index. Some times, we may need to use a dummy condition such as dept_id>0 and other conditions.
  • Avoid using upper or lower functions on columns which are indexed. Oracle ignores indexes and does a full table scan. Also, avoid substr function on indexed columns, instead use SQL like ‘%string%’.
  • EXISTS is a faster and a cheaper alternative because the optimizer realizes that when the sub-query has been satisfied once, there is no need to proceed any further, and the next driving row can be fetched.
  • UNION ALL returns all rows (no duplicates eliminated) and thus faster, compared to UNION (fetches unique rows).
  • HAVING filter “after” rows are returned by query, try to put this in WHERE clause so that condition is applied without returning rows.
  • Use cluster columns, if they go together as one, for instance emp id and emp name.
  • Use dummy where clause to force use of indexes on character columns, see example here. SELECT ename FROM emp ORDER BY ename;   // This uses full table access// Use this instead. SELECT ename FROM emp WHERE ename > TO_CHAR(0);  // This uses fast full index scan.
  • Use function based indexes on columns which are often used with a function, such as nvl(comm,o) function in queries.
  • Avoid using functions on indexed columns, instead try using between or some other where clause, if possible.
  • Use BIND variable where possible. We could also set ‘FORCE’ for cursor sharing, see below.
  • Create indexes on columns, which are often used for sorting purpose.
  • Use decode instead of sub query joins, etc.
  • Use partitions for large tables.
  • Use index organized tables for queries involving exact match and range searches especially on the primary key.
  • Set optimizer_index_cost_adj to a lower OR higher number depending on our desire to increase or decrease the affinity for indices.
  • Force cursor sharing if possible.

Hash Joins: Hash joins are used for joining large data sets. The optimizer uses the smaller of the two tables/data sources to build a hash table on the join key in memory.It then scans the larger table, probing the hash table to find the joined rows.This operation involves joining two sets of rows and returning the result. Apart from this the result needs to be SORTED for unique values i.e. an operation involving sorting a set of rows to eliminate duplicates.If the table is small enough to fit into the memory,then the cost is limited to a single read pass over the data for the two tables.But if,the hash table is too big to fit in to the memory,optimizer chooses partitioning.

When the Optimizer Uses Hash Joins:

The optimizer uses a hash join to join two tables if they are joined using an equijoin and if either of the following are true:

  1. A large amount of data needs to be joined.
  2. A large fraction of the table needs to be joined.

Clusters

You can create a CLUSTER as follows,

Suppose you want to create a cluster named EMP_DEPT for EMP and DEPT tables.

  • First create a cluster named EMP_DEPT.

CREATE CLUSTER emp_dept(deptno number(2));

  • Now create the cluster tables as follows,

CREATE TABLE dept (

deptno NUMBER(2) PRIMARY KEY, . . . )

CLUSTER emp_dept (deptno);

CREATE TABLE emp (

empno NUMBER(5) PRIMARY KEY,

ename VARCHAR2(15) NOT NULL,

. . .

deptno NUMBER(2) REFERENCES dept)

CLUSTER emp_dept (deptno);

  • In this cluster DEPTNO is the cluster key. Be sure to choose your cluster key carefully and it should represent the join condition between the cluster tables.

Some Points to remember about Clusters:

  • Clusters should be used for tables predominantly used in join queries, they help in increasing the performance. They should be preferred in Data Warehouse Environments where the data in the cluster tables is not likely to change often.
  • Clusters can degrade the performance of your database when used in OLTP (Online Transaction Processing) environments where the data in the cluster tables is likely to be modified often.

Function based Indexes

Function Based Index can be created as shown below:

CREATE INDEX sal_comm_finx ON EMP ( sal * nvl(comm,0));

Function-based indexes facilitate queries that qualify a value returned by a function or expression. The value of the function or expression is pre-computed and stored in the index. Specific features of function-based indexing include:

  1. Indexes can be created where the search key is an expression.
  2. They provide an efficient mechanism for evaluating predicates involving functions.
  3. Case-insensitive sorts can be Performed.
  4. Descending order indexes can be created. They are treated as a special case of function-based indexes.

Index Organized Tables (IOT)

You can create index-organized tables in the following way.

CREATE TABLE docindex( token CHAR(20),

doc_id NUMBER,

token_frequency NUMBER,

token_offsets VARCHAR2(512),

CONSTRAINT pk_docindex PRIMARY KEY (token, doc_id))

ORGANIZATION INDEX TABLESPACE ind_tbs;

Index-organized tables are like regular tables with a primary key index on one or more of its columns. However, instead of maintaining two separate storage spaces for the table and B*tree index, an index-organized table only maintains a single B*tree index containing the primary key of the table and other column values.

Index-organized tables provide fast key-based access to table data for queries involving exact match and range searches especially on the primary key as seen above. Changes to the table data (such as adding new rows, updating rows, or deleting rows) result only in updating the index structure (because there is no separate table storage area). Index-organized tables are suitable for accessing data by way of primary key or any key that is a valid prefix of the primary key. There is no duplication of key values and storage requirements are reduced because a separate index structure containing the key values and ROWID is not created.

* Oracle evaluates AND condition from right to left, other databases do from left to right.

* While coding SQL, use consistent spacing, capitalization, etc. This way, Oracle uses existing cached queries (soft parsing versus hard parsing).

* Change SQL condition from NOT to greater sign as in this ‘select * from emp where grade not 1’ to ‘select * from emp where grade > 1’;

* Change OR to IN condition, this is typically faster.

* Like ‘A’ and = ‘A’ are different as like allows trailing spaces in where condition.

* Use LIKE, instead of substring or partial string comparison.

* Use UNION ALL, instead of UNION as UNION uses distinct.

* Use OR condition instead of UNION, this is faster as UNION scans table for every sub select statement. If we have a query “sub query A UNION sub query B”, UNION condition scans table twice. Going by the same logic, “select * from emp where emp_name = ‘Mike’ and not emp_no = ‘007’;” is  better than “select * from emp where emp_name = ‘Mike’ MINUS select * from emp where emp_no = ‘007’;

* Some times negative conditions are confusing, resolve them by rewriting. For instance, ‘NOT (field_A and field_B)’ is same as ‘(NOT A) OR (NOT B)’.

Java code through JDBC does an auto commit for each insert, instead we can do “connection.setAutoCommit(False);”

Avoid creation of index if queries or application code can be developed differently.

When use ‘Explain plan’ waits can be identified this way: Dbfile sequential read usually refers to index access; db file scattered read refers to table scans.

Two things that we could quickly check for Oracle performance problems: are statistics up to date and are indexes valid.

For unique index, statistics info is not useful. For non unique index, selectivity is an important aspect. If we are trying to retrieve a small number of rows, it is a good idea to have index on that column. If we are trying to get lots of rows, indexing can have negative performance effect. Since it is difficult to predict selectivity for distributed data, histograms can be used.

Code optimization: Lots of times it is better to use “exists”, compared to join. Having is probably the worst as it filters returned results in the last.

Nested loop: Used when a small number of rows to get joined to a large number of rows. Nested loop is usually preferred by Oracle. First table having a small number of rows, second table having a large number of sorted rows. We need to have an index on larger table (for sorting), smaller table gets (typically also sorted) gets joined.

Hash join: When both tables are about the same. Hash join typically requires one table in a sorted order. Hash join creates a temporary table in memory while constructing a hash join, so hash join is less preferred compared to nested loops. Netezza and other data appliances use Hash joins extensively.

Sort Merge: Most inefficient as it involves sorting both tables first.

Semi join: Uses EXISTS or IN, which are not really joins as such.

Typically, EXISTS is faster than IN. Try avoiding != or not in. Cartesian join is same as ANSI SQL cross join.

Avoid using functions in a where clause (better to use function based indexes).

IOT packs less number of rows in a block (this is a disadvantage), however, it allows better concurrency. For normal indexes, locks and latches are noticed if many users use same index to access data.

Dropping a partition invalidates a global index, whereas local partition index has no such problem.

B tree indexes mostly in OLTP, automatically restructured and space is not recovered.  Bitmap indexes mostly in DW, we need to rebuild them frequently.

IOT mostly in DW, some OLTP, use it in the order in which indexes are arranged. Clustered indexes are also used mostly in DW. Partitions are useful in both DW and OLTP.

Cluster may result in row may be else where placed when inserts are encountered. Materialized views may be created on OLTP database so that DW queries can use materialized views (instead of base tables).

Reverse index option when there is lots of contention for sequential insertion going to the same block. If data goes to same block, there is contention and results in waits. Reverse index eliminates this issue. Useful in a RAC environment. We can also “parallel” option to get index creation faster in a multi CPU environment.

For IOT tables, we need to specify IOT option at the time of table creation.

How to create a Cluster: Create a cluster, create a clustered index, add one or more tables to cluster.

To create a materialized view, have privileges for create any table and create any materialized view (explicit privileges are required on underlying table)

Fast Full Index scan: This does not guarantee getting values in any order, so do not expect an order while scanning an indexed table. Fast full index scan just reads all leaf nodes, instead of parsing up and down the tree.

If statistics are not available, Oracle uses dynamic sampling of data.

DB_FILE_MULTIBLOCK_READ_COUNT: Keep this a large number for DW databases.

SORT_AREA_SIZE: Determines maximum sort area for each session.

HASH_JOINED_ENABLED: For hash joins, typically used in OLTP databases

OPTIMIZER_INDEX_CACHING favors nested loops instead of hash joins or sort merge. Nested loops are efficient for one small table, another table having a large number of rows.

OPTIMIZER_INDEX_COST_ADJ: Parameter in favor or opposing index. If this value is set (to high?), it is likely to do full tables scans, instead of using an index.

OPTIMIZER_MAX_PERMUTATIONS: To have in a limited permutations computations.

Performance tuning: start with wait events (from OEM) or spotlight. Statspack allows snapshots at different times for comparison; statspack stores all information in its own schema and its own tablespace (perfstat), it has to be run on the database server, can not be executed from client;

Histograms give a better statistics for values that are spread out; To generate histograms, use ‘analyze table table_name compute statistics for columns col_name’;

DBMS statistics can be used to collect statistics for table, NOT for indexes (meaning, in parallel?) or clusters. DBMS stats runs in parallel. Since DBMS generates for CBO, it includes rows, blocks, etc., but not chained rows. We can gather statistics for Tables, indexes, schema, database or system.

Set first rows in Oracle for CRM applications or interactive applications. Cursor sharing to force and set timed_statistics to true. Time statistics measures wait events and has minimum impact on database performance.

Most database implementations use dedicated servers (as opposed to shared servers, which support connection pooling). This is not a concern as most application servers support connection pooling any way. Only thing that affects in a major way is that if user is away and a session is inactive, it still uses resources.

Database Technology

[Your comments are very much appreciated. Please feel free to suggest any topics of your interest in this domain].

Any major technology (IT) based solution will have database as a crucial component. Probably, this will remain that way for many years and it has been that way for decades. As we push towards more sophisticated, diverse and integrated solutions, database plays even more important role. On the application levels, one may use different languages (C, C++, Java, Ruby, etc) or techniques (Corba, Ajax or others) or protocols (TCP, HTTP, etc), one thing that is relatively constant is way in which you access data. The last mile has to be SQL, that is how it hits database. Databases will grow in every way (size, features, availability, business continuity, our expectations and even price). A few years ago, database with a few giga bytes in size is considered huge. Nowadays, Tera bytes and a few hundred Tera bytes is very common. I am creating this site to talk about database technology, giving overview from start to implementation. Before even we talk about technology, we need to look at problem that it intends to solve. We need a solution that would either improve existing one, create a new one or provide an alternative solution. I like to look at ‘holistic’ approach for such a solution:

  • Study business case & their requirements: Understand, analyze, design and implement a business solution. This is handled by conceptual design & logical database design.
  • Implement a database solution: This is where physically database gets created. Even before embarking on this journey, plan what is required and visualize system before anything is acquired. All technical details are ironed out in this phase.
  • Accessing database: This is where applications are developed and data is accessed. As this is the last stage of any development and deployment, most folks give little attention to this phase. Unfortunately, this is the one that directly impacts users, including business sponsors.

Now let us dig into details a bit more.

Business Facing

Just like any major investments, DB technology and its implementation is very much business driven. Traditionally there has been so much disconnect between business and techie world, there is a discomfort between them even during any meetings! More and more people are tech savvy nowadays. However, normal users are more exposed to Internet based technology, not so much as database side.

Get business needs

Imagine that we are given a task of coming up with a solution, for sake of simplicity, let us assume it is a new system you need to implement this solution. We need to have some idea what is involved. Is this about customers, products, pricing, et cetera. What are important aspects you like to capture in this process. Similar to ‘memory maps’ or ‘flow charts’, database technology uses “Entity Relationship” diagram (ERD, for short). However, ERD does not imply flow or logical movement from one entity to another (meaning, all items in ERD are at same level, we do not start from one end and complete at another end. This kind of interpretation may make sense from functional implementation level, such as customer placing an order. This comes later in this topic).

Data Model

It is important to realize that any non-trivial project requires planning, analysis and clear thought out process. After a project details are spelled out to us, we need to plan for two major components: process & data . Though they are dependent (process uses data, data used by process), they can be treated independent

Conceptual => Logical => Physical

It is important to appreciate that each phase is completed before embarking on next phase of data model. This is often iterative process.


Performance


This is perhaps the most discussed & debated topic in database implementation. Very often customers and business users blame it on person who managing databases. Poor DBA folks are taciturn in accepting blame, even though any system that is designed is always based on so many compromises.

  • Business users often see poor performance when they do not get results back.
    • Management with (not much) technical background would conclude: Let us upgrade CPU, bigger and better horse power! Think just for a moment, is that a good solution? Some times it is, lots of times it may not be. When you have high processing speed, it could process results quickly resulting in much worse I/O. That is, if Input / Output systems are not scaled properly, we will have lots more data than it can efficiently handle resulting in a huge performance downgrade. There could be more contention for resources as CPU is churning out more instructions. When some one gives this crazy idea of upgrading to better CPU, adjust their expectations.
  • Throughput is another way to notice poor performance. Just like a water faucet with larger pipe attached gives more water, though system giving data out may be so slow that you could see data getting refreshing on your screen slowly.
    • In Data warehouse or Decision Support Systems, performance is enhanced greatly based on two important factors: Parallel Query Processing and Parallel Disk Access. This is where Netezza greatly surpasses any expectations! Each query given to Netezza is divided into ‘sub’ query to hundreds of mini-hosts (each have their own hard drive, memory, CPU and even their own OS). Each of these units process query at the same time from their pool of data and collate all that back for completion of query. This results in austounding response time. Even a table with Tera bytes of data returns results in less than one second, always!! Even well tuned any other database system could take minutes or half-hour. In this respect, Netezza is faster by 1000 times or more. See more about Netezza below.
  • (Oracle has a cool feature, that allows us to ‘hint’ at first nrows)

Database Hardware & Systems Architecture: Netezza, Oracle, Sybase and Red Brick


Netezza

We have recently purchased Netezza and a number of DW projects are implemented on this platform. Netezza is super in saving costs, performance, admin (DBA & sys admin) support, ease of operation, simplicity (in understanding system) and easier upgrades. Any person (even without IT experience) can understand system at higher level. If there is only one reason, why any one should go for Netezza, it’s performance is exceptional among all database products.

Netezza uses many small units (known as SPU) with their own hard drive, memory, OS (DBOS) and network facility. These SPUs are interconnected, so that they can all share data and can be controlled by main host unit. Host runs on Linux and sends / receives queries. Netezza does not cache anything. Every query is pretty much a full table scan. Think about a mesh with each connecting point of horizontal & vertical lines is where a SPU resides. Imagine, if you have 100 SPUs on your Netezza system, data is usually spread across evenly against all SPUs. If you run a query to get data, all 100 SPUs send that data back. As you can imagine, this system *can be* up to 100 times faster, in this case.

SPU is divided into 3 areas for storage purpose. Main, back up and temp space. Same SPU never acts as a backup for same main storage. Reason is simple: if hard drive goes down, we like to have backup else where (on some other SPU), not on the same hard drive (then, essentially you lose entire data located on that SPU, as both online and backup are not available).

Let me give an analogy while performance of Netezza so amazing. Think of a hypothetical customer care scenario where a single person supports phone calls. She refers to a massive book catalog to find price and details of any product (think of this as a traditional database technology). She is now given ten interns, all are given a small section of the catalog. Any customer calls in, same customer care person assigns same work to all ten interns. They go through their small sections and give details back to her. She collates all that information and gives back to customer. Theoretically (and lots of times practically too), this is ten times faster than one person doing all this alone. Note that there is some time spent while collating (or aggregating) those details back. This is how Netezza works.

  • It is not that easy to divide queries into mini-queries and forward them to all SPUs. Need to have a pretty good query processor.
  • It is easy to collate or aggregate results back, especially those involving arithmetic operations. Suppose we are summing all sales, if each SPU returns sales_spu1, sales_spu2, sales_spu3…..sales_spuN. Even if any sales summation returned from any SPU can be added in any combination to get grand total (of course, we have to get results from all of them). Another way of saying it is: s1+s2+s3+…+sn is also same as s1+sn+s4+s2+s3+….s(n-1). Thus, we do not have to arrange these results from each SPU in any particular manner (unless operations such as sorting is requested in any query).

For some queries that were taking many hours on another system was tested on Netezza. Netezza just completed in a few minutes! Netezza is based on Postgres and uses only a small set of it. There is also another competing product: Datallegro (www.datallegro.com). There is even a rumor that many folks in datallegro came from Netezza. Netezza apparently has very small number of employees (a couple of hundred), yet sells & supports products that are worth a million dollars (smaller configuration costs around 500,000 or so).

However, there are a few things that do not impress many of us:

  1. It crashes way too often. This is NOT a problem as such, since entire database comes back online within a few minutes without even any one noticing. This may make lots of people discredit this product as a serious competitor.
  2. There is no good support for all database functionality. There are no triggers, no SP, no RI (yes, no referrential integriy; You can declare RI for a table and used only to enforce it for DDL. No data integrity checks while inserting / updating data), no other esoteric database objects and native code.
  3. Same goes with database admin level: no back up (provides binary, for all practical purposes, very limited), no granular admin of users & roles, etc.
  4. Overall, a bit more unstable product.

At the same time, it allows administrators to check various database related activities. It has even a windows based tool to check & administer various aspects (users, roles, tables, SPU, etc). You could check (SQL, database activity and various other db related) logs, including dynamically created C/C++ code Netezza uses for all queries. It is amazing that a query plan that Netezza creates is pretty extensive and detailed.

Netezza apparently has more competitors now: including Datallegro, IBM, etc.

Follow

Get every new post delivered to your Inbox.