
Sunday, 14 September 2025

Scenarios for ETL testing

Scenarios in ETL Testing and SQL Examples

1. Validating Row Counts: verify that the number of records matches between source and target tables.

SELECT COUNT(*) AS source_count FROM source_table;
SELECT COUNT(*) AS target_count FROM target_table;

2. Verifying Data Completeness: ensure all records in the source exist in the target.

SELECT source.id
FROM source_table source
LEFT JOIN target_table target ON source.id = target.id
WHERE target.id IS NULL;

3. Checking Data Transformation: validate transformations such as currency conversions or date formats.

SELECT source.id, source.salary * 0.85 AS expected_salary, target.salary
FROM source_table source
JOIN target_table target ON source.id = target.id
WHERE target.salary != source.salary * 0.85;

4. Testing Aggregated Data: compare sums, averages, and other aggregates.

SELECT SUM(sales_amount) AS source_total FROM source_table;
SELECT SUM(sales_amount) AS target_total FROM target_table;

5. Validating Data Integrity: check for duplicates or null values in critical fields.

SELECT id, COUNT(*) FROM target_table GROUP BY id HAVING COUNT(*) > 1;
SELECT id FROM target_table WHERE critical_field IS NULL;

6. Handling Complex Joins: join multiple tables to validate hierarchical data.

SELECT o.order_id, c.customer_name, p.product_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products p ON o.product_id = p.product_id;

7. Using Subqueries: extract intermediate results for validation.

SELECT *
FROM (SELECT order_id, SUM(order_amount) AS total FROM orders GROUP BY order_id) sub
WHERE total > 10000;
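The completeness check in scenario 2 compares only the key column. As a variant, full rows can be compared with the set operator EXCEPT (MINUS in Oracle and Teradata); this sketch assumes source_table and target_table share the same column layout:

-- Rows in the source that are missing from the target,
-- or that differ in any column after loading.
SELECT * FROM source_table
EXCEPT
SELECT * FROM target_table;

-- Rows in the target with no exact match in the source
-- (possible duplicates or unexpected inserts).
SELECT * FROM target_table
EXCEPT
SELECT * FROM source_table;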

ETL testing challenges

What is ETL Testing?

ETL testing validates the processes involved in extracting data from various sources, transforming it into a usable format, and loading it into a target system such as a data warehouse. The goal is to ensure that the data is accurate, complete, and consistent throughout its lifecycle.

Importance of Data Quality in ETL

Data quality is the cornerstone of effective decision-making. Poor-quality data can lead to incorrect insights, financial losses, and compliance issues. ETL testing ensures:

• Accuracy: data is correct and free from errors.
• Completeness: no data is lost or truncated during the ETL process.
• Consistency: data adheres to predefined rules and formats.
• Timeliness: data is available within the required timeframes.

Key Areas Where ETL Testing Ensures Data Quality

1. Source-to-Target Validation
   • Verifies that data extracted from the source matches the data loaded into the target system.
   • Ensures no records are lost, duplicated, or mismatched.
2. Data Transformation Validation
   • Confirms that business rules and logic applied during transformation are executed correctly.
   • Validates data mappings and derived fields.
3. Data Integrity Checks
   • Tests referential integrity to ensure relationships between tables are preserved.
   • Verifies constraints such as primary keys, foreign keys, and unique keys.
4. Performance and Scalability
   • Ensures that the ETL process performs well under varying data volumes.
   • Checks that systems can handle increased loads without compromising quality.
5. Error Handling and Logging
   • Validates the system's ability to identify, log, and recover from errors during the ETL process.
   • Ensures data anomalies are flagged for review.
6. Metadata Validation
   • Ensures that metadata, such as data types and lengths, is consistent across systems.

Tools and Techniques for ETL Testing

Various tools and techniques assist in ETL testing:

• Manual Testing: useful for small datasets but time-intensive for large-scale ETL processes.
• Automated Testing Tools: tools such as Apache NiFi, Talend, Informatica, and QuerySurge facilitate automated validations.
• SQL Queries: widely used for source-to-target validation and transformation checks (a small sketch follows at the end of this post).

Challenges in ETL Testing

Despite its importance, ETL testing comes with challenges:

• Large data volumes make manual testing impractical.
• Diverse data sources and formats require complex validation techniques.
• Errors must be identified and addressed in real time while the ETL process runs.
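As an illustration of the SQL-based validation mentioned above, here is a minimal sketch of a referential-integrity check. The table and column names (orders, customers, customer_id) are assumed for the example:

-- Orphaned foreign keys: orders whose customer_id
-- has no matching row in the customers table.
SELECT o.order_id, o.customer_id
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;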

Wednesday, 29 May 2019

How to delete duplicate records in a Teradata table?


There are several ways to delete duplicate records. Suppose I have a product table:

Table name: product_table

product_id   product_name
2            pqr
2            pqr
8            klm
8            klm
99           qqqq

The first approach uses ROWID to remove the duplicate records. ROWID is an Oracle pseudo-column, so this query works in Oracle SQL, not in Teradata.

Syntax: delete from <table_name> where rowid not in (select max(rowid) from <table_name> group by <column_list>);

Query: delete from product_table where rowid not in (select max(rowid) from product_table group by product_id);

This is another way to delete duplicates, and it works in Teradata. First create a SET table with the same definition, then insert the records into the SET table. As we already know, a SET table does not allow duplicate rows, so Teradata silently discards the duplicates during the INSERT...SELECT.

create set table product_table_t as (select * from product_table) with no data;

insert into product_table_t select * from product_table;

drop table product_table;

rename table product_table_t as product_table;

The third approach uses DISTINCT to eliminate the duplicate records.

First, create an empty copy of the table:

create table product_t as (select * from product_table) with no data;

Then insert the records into product_t using DISTINCT:

insert into product_t select distinct * from product_table;

Drop the original product_table:

drop table product_table;

Finally, rename the new table back to the original name:

rename table product_t as product_table;
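One more method uses Teradata's QUALIFY clause with ROW_NUMBER(). This is a minimal sketch that keeps one row per product_id in a new table; the ORDER BY inside the window is arbitrary here, since the duplicate rows are identical:

create table product_table_dedup as (
  select *
  from product_table
  qualify row_number() over (partition by product_id order by product_id) = 1
) with data;

Then drop product_table and rename product_table_dedup, as in the methods above.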

Please comment here if you know any other method of removing duplicates.

Difference between ROW_NUMBER(), RANK() and DENSE_RANK()


Let us assume I have an emp table with two columns, ename and sal. The emp table is shown below.

ENAME     SAL
SMITH     800
ALLEN     1600
WARD      1250
JONES     2975
MARTIN    1250
BLAKE     2850
CLARK     2450
SCOTT     3000
KING      5000
TURNER    1500
ADAMS     1100
JAMES     950
FORD      3000
MILLER    1300
Now let's query the table to get all employee names and their salaries in descending order of salary, along with all three ranking functions:

SELECT ENAME, SAL,
       ROW_NUMBER() OVER (ORDER BY SAL DESC) AS ROW_NUMBER,
       RANK()       OVER (ORDER BY SAL DESC) AS RANK,
       DENSE_RANK() OVER (ORDER BY SAL DESC) AS DENSE_RANK
FROM EMP;
Output:

ENAME     SAL    ROW_NUMBER   RANK   DENSE_RANK
KING      5000   1            1      1
SCOTT     3000   2            2      2
FORD      3000   3            2      2
JONES     2975   4            4      3
BLAKE     2850   5            5      4
CLARK     2450   6            6      5
ALLEN     1600   7            7      6
TURNER    1500   8            8      7
MILLER    1300   9            9      8
WARD      1250   10           10     9
MARTIN    1250   11           10     9
ADAMS     1100   12           12     10
JAMES     950    13           13     11
SMITH     800    14           14     12
So which one should you use?
It all depends on your requirement and the business rule you are following.
1. ROW_NUMBER() should be used only when you just want a serial number on the result set. It is not as intelligent as RANK() and DENSE_RANK().
2. The choice between RANK() and DENSE_RANK() depends on the business rule you are following. RANK() leaves gaps in the numbering when it sees common values in two or more rows; DENSE_RANK() does not leave any gaps.
So when assigning the next rank to a row, RANK() considers the total count of rows before that row, while DENSE_RANK() simply gives the next rank according to the value.
So if you are ranking employees according to their salaries you could use DENSE_RANK(), and if you are ranking students according to their marks you could use RANK() (though this is not mandatory; it depends on your requirement).
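A common practical use of these functions is finding the Nth-highest salary. Here is a minimal sketch using DENSE_RANK() against the emp table above (the derived-table form is standard SQL):

-- Employees earning the second-highest salary.
-- DENSE_RANK() makes the tied rows SCOTT and FORD (3000)
-- share rank 2, so both are returned.
SELECT ENAME, SAL
FROM (
  SELECT ENAME, SAL,
         DENSE_RANK() OVER (ORDER BY SAL DESC) AS SAL_RANK
  FROM EMP
) ranked
WHERE SAL_RANK = 2;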

Sunday, 31 March 2019

Full Load vs. Incremental Load in Data Warehousing

Full Load vs. Incremental Load



Full Load:
• Rows loaded: truncates all rows in the target and loads everything from scratch.
• Time: requires more time.
• Guarantee: a complete, consistent copy of the source can easily be guaranteed.
• History: can be lost, since the target is rebuilt from the current source.

Incremental Load:
• Rows loaded: only new and updated records are loaded.
• Time: requires less time.
• Guarantee: difficult; the ETL must reliably detect new and updated rows.
• History: retained.
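To make the difference concrete, here is a minimal sketch of an incremental load using a last-modified timestamp and a MERGE statement. The table names (stg_orders, dw_orders, etl_load_control), the updated_at column, and the high-water-mark pattern are all assumptions for illustration; exact MERGE syntax varies by database:

-- Load only rows changed since the last recorded load time,
-- updating existing target rows and inserting new ones.
MERGE INTO dw_orders tgt
USING (
  SELECT order_id, order_amount, updated_at
  FROM stg_orders
  WHERE updated_at > (SELECT MAX(loaded_until) FROM etl_load_control)
) src
ON (tgt.order_id = src.order_id)
WHEN MATCHED THEN
  UPDATE SET order_amount = src.order_amount,
             updated_at   = src.updated_at
WHEN NOT MATCHED THEN
  INSERT (order_id, order_amount, updated_at)
  VALUES (src.order_id, src.order_amount, src.updated_at);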

Tuesday, 6 November 2018

What is Software?

A set of executable programs, or a collection of programs, on a computer is called software.

Friday, 24 August 2018

Difference between Data warehouse and Business Intelligence

Difference between Data warehouse and Business Intelligence 
A data warehouse is a way of storing data and creating information by leveraging data marts. Data marts are segments or categories of data that are grouped together to provide information about that segment or category. A data warehouse does not require Business Intelligence to work; reporting tools can generate reports directly from the DW.

Business Intelligence is the leveraging of the DW to help make business decisions and recommendations. Information and data rules engines are leveraged here, along with statistical analysis and data mining tools, to help make these decisions.
The Business Intelligence tool that we are going to learn is SSIS, which uses SQL as the backend. So now, let us learn the SQL basics.