Pages

Wednesday, 29 May 2019

How to delete duplicate records in teradata table ?

How to delete duplicate records in  teradata table ?

For deleting duplicate records we have different ways.I have a Product_table.

Table name: product_table

product_id product_name
       2             pqr
       2             pqr
       8             klm
       8             klm
       99           qqqq

Using rowid remove the duplicate records.This query will work only SQL

Syntax: delete from where rowid not in( select max(rowid) from group by );

Query: delete from product_table where rowid not in(select max(rowid) from
product_table group by product_id);

This is another way to delete duplicates. This will work in teradata.
here first create a set table with the definition  then insert the records into set table. Al ready we know the set table does not allow duplicate records. 

create set table product_table_t as (select * from product_table) with no data;

insert into product_table_t select * from product_table;

drop table product_table;

rename table product_table_t as product_table

using distinct, eliminating duplicate records

create temporary table with no data

Query : create table product_t as (select * from product_table) with no data

insert records in to table product_t with using distinct

insert into product_t select distinct * from product_table

drop the the product_table

rename table

rename table product_t as product

Please comment here if you know any other method of removing duplicates.

Difference between ROW_NUMBER(),RANK(),DENSE_RANK

Difference between ROW_NUMBER(),RANK(),DENSE_RANK

Let us assume, i have a emp table columns ename,sal. The emp table shows below.

ENAME
SAL
SMITH800
ALLEN1600
WARD1250
JONES2975
MARTIN1250
BLAKE2850
CLARK2450
SCOTT3000
KING5000
TURNER1500
ADAMS1100
JAMES950
FORD3000
MILLER1300
Now lets query the table to get the salaries of all employee name with their salaries in descending order.
Query like this.
SELECT ENAME,SAL,ROW_NUMBER() OVER(ORDER BY SAL DESC) ROW_NUMBER, RANK() OVER(ORDER BY SAL DESC)RANK, DENSE_RANK() OVER(ORDER BY SAL DESC) DENSE_RANK FROM EMP
Out put

ENAMESALROW_NUMBERRANKDENSE_RANK
KING5000111
SCOTT3000222
FORD3000322
JONES2975443
BLAKE2850554
CLARK2450665
ALLEN1600776
TURNER1500887
MILLER1300998
WARD125010109
MARTIN125011109
ADAMS1100121210
JAMES950131311
SMITH800141412
So question is which one to use? 
Its all depends on your requirement and business rule you are following.
1. Row_number to be used only when you just want to have serial number on result set. It is not as intelligent as RANK and DENSE_RANK.
2. Choice between RANK and DENSE_RANK depends on business rule you are following. Rank leaves the gaps between number when it sees common values in 2 or more rows. DENSE_RANK don't leave any gaps between ranks.
So while assigning the next rank to the row RANK will consider the total count of rows before that row and DESNE_RANK will just give next rank according to the value.
So If you are selecting employee’s rank according to their salaries you should be using DENSE_RANK and if you are ranking students according to there marks you should be using RANK(Though it is not mandatory, depends on your requirement.)

Sunday, 31 March 2019

Full Load vs. Incremental Load in Datawarehousing

Full Load vs. Incremental Load



Full Load:
Truncates all rows and loads from scratch.
Requires more time.
Can easily be guaranteed
Can be lost.



Incremental Load:
New records and updated ones are loaded.
Requires less time.
Difficult. ETL must check for new/updated rows.
Retained.

Tuesday, 6 November 2018

What is Software?

A Set of  executable programmes or collection of programmes in a computer is called Software.

Friday, 24 August 2018

Difference between Data warehouse and Business Intelligence

Difference between Data warehouse and Business Intelligence 
Data warehouse is a way of storing data and creating information through leveraging data marts. Data marts are segments or categories of information and/or data that are grouped together to provide 'information' into that segment or category. Data warehouse does not require Business Intelligence to work. Reporting tools can generate reports from the DW.

Business Intelligence is the leveraging of DW to help make business decisions and recommendations. Information and data rules engines are leveraged here to help make these decisions along with statistical analysis tools and data mining tools.
The Business Intelligence tool that we are going to learn is SSIS which uses SQL as the Backend, Sonow let us learn the SQL BasicsDifference between Data warehouse and Business Intelligence
Data warehouse is a way of storing data and creating information through leveraging data marts. Data marts are segments or categories of information and/or data that are grouped together to provide 'information' into that segment or category. Data warehouse does not require Business Intelligence to work. Reporting tools can generate reports from the DW.

Business Intelligence is the leveraging of DW to help make business decisions and recommendations. Information and data rules engines are leveraged here to help make these decisions along with statistical analysis tools and data mining tools.
The Business Intelligence tool that we are going to learn is SSIS which uses SQL as the Backend, Sonow let us learn the SQL Basics

Sunday, 29 July 2018

What is a Dimension Table and Types of Dimension table


What is a Dimension Table:

If a Table contains “primary keys” and it gives the detailed information about business then such a table is called dimension table.
A Dimension table is a table which holds a list of attributes or qualities of the dimension most often used in queries and reports.
E.g. “Store” dimension can have attributes
Street,
Block Number,
City,
Region,
Country where it is located in addition to its name.

Dimension tables are ENTRY POINTS into the fact table.

1. The number of rows selected and processed from the fact table depends on the conditions (“WHERE” clauses) the user applies on the dimensional attributes selected.

2. Dimension tables are typically DE-NORMALIZED in order to reduce the number of joins in resulting queries. Dimension table attributes are generally STATIC, DESCRIPTIVE fields describing aspects of the dimension Dimension tables typically designed to hold IN-FREQUENT CHANGES to attribute values over time using SCD concepts Dimension tables are TYPICALLY used in GROUP BY SQL queries Every column in the dimension table is TYPICALLY either the primary key or a dimensional attribute Every non-key column in the dimension table is typically used in the GROUP BY clause of a SQL Query.

Types of Dimension table:

1.Conformed Dimension: 
If a dimension table is shared by multiple fact tables then that dimension is known as conformed dimension table.

2.Junk dimension: 
Junk dimensions are dimensions that contain miscellaneous data like flags, gender, text values etc and which are not useful to generate reports.



3.Slowly changing dimension: 
If the data values are changed slowly in a column or in a row over the period of time then that dimension table is called as slowly changing dimension. 

Ex: Interest rate, Address of customer etc 

There are three types of SCD’s:

Type – 1 SCD: A type-1 dimension keeps the most recent data in the target.

Type – II SCD: keeps full history in the target. For every update it keeps a new record in the target.

Type – III SCD: keeps the current and previous information in the target (partial history).

Note: To implement SCD2 we use Surrogate key.

Example1:
Example2:



        Surrogate key:

1. Surrogate Key is an artificial identifier for an entity. In surrogate key values are generated by the system sequentially (Like Identity property in SQL Server and Sequence in Oracle). They do not describe anything.
2. Joins between fact and dimension tables should be based on surrogate keys
3. Surrogate keys should not be composed of natural keys glued together
4. Users should not obtain any information by looking at these keys
5. These keys should be simple integers
6. Using surrogate key will be faster
7. Can handle Slowly Changing dimensions



4.Degenerated dimension: 
A degenerate dimension is data that is dimensional in nature but stored in a fact table. For example, if you have a dimension that only has Order Number and Order Line Number, you would have a 1:1 relationship with the Fact table. Therefore, this would be a degenerate dimension and Order Number and Order Line Number would be stored in the Fact table. Fast Changing Dimension: A fast changing dimension is a dimension whose attribute or attributes for a record (row) change rapidly over time. Example: Age of associates, Income, Daily balance etc.



5.Roll playing dimension:
One dimension plays multiple roles to retrieve data from fact tables.


6.Dirty dimension:
In this dimension table records are maintained more than once by the difference of non-key attributes.




Surrogate key in Datawarehousing


Surrogate key:

1. Surrogate Key is an artificial identifier for an entity. In surrogate key values are generated by the system sequentially (Like Identity property in SQL Server and Sequence in Oracle). They do not describe anything. 
2. Joins between fact and dimension tables should be based on surrogate keys 
3. Surrogate keys should not be composed of natural keys glued together 
4. Users should not obtain any information by looking at these keys 
5. These keys should be simple integers 
6. Using surrogate key will be faster 
7. Can handle Slowly Changing dimensions well