Pages

Wednesday, 29 May 2019

How to delete duplicate records in teradata table ?

How to delete duplicate records in  teradata table ?

For deleting duplicate records we have different ways.I have a Product_table.

Table name: product_table

product_id product_name
       2             pqr
       2             pqr
       8             klm
       8             klm
       99           qqqq

Using rowid remove the duplicate records.This query will work only SQL

Syntax: delete from where rowid not in( select max(rowid) from group by );

Query: delete from product_table where rowid not in(select max(rowid) from
product_table group by product_id);

This is another way to delete duplicates. This will work in teradata.
here first create a set table with the definition  then insert the records into set table. Al ready we know the set table does not allow duplicate records. 

create set table product_table_t as (select * from product_table) with no data;

insert into product_table_t select * from product_table;

drop table product_table;

rename table product_table_t as product_table

using distinct, eliminating duplicate records

create temporary table with no data

Query : create table product_t as (select * from product_table) with no data

insert records in to table product_t with using distinct

insert into product_t select distinct * from product_table

drop the the product_table

rename table

rename table product_t as product

Please comment here if you know any other method of removing duplicates.

Difference between ROW_NUMBER(),RANK(),DENSE_RANK

Difference between ROW_NUMBER(),RANK(),DENSE_RANK

Let us assume, i have a emp table columns ename,sal. The emp table shows below.

ENAME
SAL
SMITH800
ALLEN1600
WARD1250
JONES2975
MARTIN1250
BLAKE2850
CLARK2450
SCOTT3000
KING5000
TURNER1500
ADAMS1100
JAMES950
FORD3000
MILLER1300
Now lets query the table to get the salaries of all employee name with their salaries in descending order.
Query like this.
SELECT ENAME,SAL,ROW_NUMBER() OVER(ORDER BY SAL DESC) ROW_NUMBER, RANK() OVER(ORDER BY SAL DESC)RANK, DENSE_RANK() OVER(ORDER BY SAL DESC) DENSE_RANK FROM EMP
Out put

ENAMESALROW_NUMBERRANKDENSE_RANK
KING5000111
SCOTT3000222
FORD3000322
JONES2975443
BLAKE2850554
CLARK2450665
ALLEN1600776
TURNER1500887
MILLER1300998
WARD125010109
MARTIN125011109
ADAMS1100121210
JAMES950131311
SMITH800141412
So question is which one to use? 
Its all depends on your requirement and business rule you are following.
1. Row_number to be used only when you just want to have serial number on result set. It is not as intelligent as RANK and DENSE_RANK.
2. Choice between RANK and DENSE_RANK depends on business rule you are following. Rank leaves the gaps between number when it sees common values in 2 or more rows. DENSE_RANK don't leave any gaps between ranks.
So while assigning the next rank to the row RANK will consider the total count of rows before that row and DESNE_RANK will just give next rank according to the value.
So If you are selecting employee’s rank according to their salaries you should be using DENSE_RANK and if you are ranking students according to there marks you should be using RANK(Though it is not mandatory, depends on your requirement.)

Sunday, 31 March 2019

Full Load vs. Incremental Load in Datawarehousing

Full Load vs. Incremental Load



Full Load:
Truncates all rows and loads from scratch.
Requires more time.
Can easily be guaranteed
Can be lost.



Incremental Load:
New records and updated ones are loaded.
Requires less time.
Difficult. ETL must check for new/updated rows.
Retained.