Testing
For Testers
Sunday, 14 September 2025
Scenarios for ETL testing
Scenarios in ETL Testing and SQL Examples
1. Validating Row Counts:
• Verify that the number of records matches between source and target tables.
SELECT COUNT(*) AS source_count FROM source_table;
SELECT COUNT(*) AS target_count FROM target_table;
2. Verifying Data Completeness:
• Ensure all records in the source exist in the target.
SELECT source.id
FROM source_table source
LEFT JOIN target_table target
ON source.id = target.id
WHERE target.id IS NULL;
3. Checking Data Transformation:
• Validate transformations like currency conversions or date formats.
SELECT source.id, source.salary * 0.85 AS expected_salary, target.salary
FROM source_table source
JOIN target_table target
ON source.id = target.id
WHERE target.salary != source.salary * 0.85;
4. Testing Aggregated Data:
• Compare sums, averages, and other aggregates.
SELECT SUM(sales_amount) AS source_total
FROM source_table;
SELECT SUM(sales_amount) AS target_total
FROM target_table;
5. Validating Data Integrity:
• Check for duplicates or null values in critical fields.
SELECT id, COUNT(*)
FROM target_table
GROUP BY id
HAVING COUNT(*) > 1;
SELECT id
FROM target_table
WHERE critical_field IS NULL;
6. Handling Complex Joins:
• Join multiple tables to validate hierarchical data.
SELECT o.order_id, c.customer_name, p.product_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products p ON o.product_id = p.product_id;
7. Using Subqueries:
• Extract intermediate results for validation.
SELECT *
FROM (SELECT order_id, SUM(order_amount) AS total
FROM orders
GROUP BY order_id) sub
WHERE total > 10000;
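The scenario queries above can also be automated from a script. A minimal sketch in Python using sqlite3 (the tables, columns, and sample rows here are illustrative assumptions, not part of a real warehouse):

```python
import sqlite3

# In-memory database with illustrative source/target tables.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE source_table (id INTEGER, sales_amount REAL)")
cur.execute("CREATE TABLE target_table (id INTEGER, sales_amount REAL)")
cur.executemany("INSERT INTO source_table VALUES (?, ?)",
                [(1, 100.0), (2, 250.0), (3, 75.0)])
cur.executemany("INSERT INTO target_table VALUES (?, ?)",
                [(1, 100.0), (2, 250.0)])  # row 3 was dropped during the load

# Scenario 1: row counts.
src_count = cur.execute("SELECT COUNT(*) FROM source_table").fetchone()[0]
tgt_count = cur.execute("SELECT COUNT(*) FROM target_table").fetchone()[0]
print("counts match:", src_count == tgt_count)

# Scenario 2: completeness - source ids missing from the target.
missing = cur.execute("""
    SELECT s.id FROM source_table s
    LEFT JOIN target_table t ON s.id = t.id
    WHERE t.id IS NULL""").fetchall()
print("missing ids:", [row[0] for row in missing])
```

The same pattern extends to the transformation and aggregate checks: run the query, compare the result to the expectation, and fail the test on a mismatch.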
ETL testing challenges
What is ETL Testing?
ETL testing validates the processes involved in Extracting data from various sources, Transforming it into a usable format, and Loading it into a target system such as a data warehouse. The goal is to ensure that the data is accurate, complete, and consistent throughout its lifecycle.
Importance of Data Quality in ETL
Data quality is the cornerstone of effective decision-making. Poor-quality data can lead to incorrect insights, financial losses, and compliance issues. ETL testing ensures:
• Accuracy: Data is correct and free from errors.
• Completeness: No data is lost or truncated during the ETL process.
• Consistency: Data adheres to predefined rules and formats.
• Timeliness: Data is available within the required timeframes.
Key Areas Where ETL Testing Ensures Data Quality
1. Source-to-Target Validation
o Verifies that data extracted from the source matches the data loaded into the target system.
o Ensures no records are lost, duplicated, or mismatched.
2. Data Transformation Validation
o Confirms that business rules and logic applied during transformation are executed correctly.
o Validates data mappings and derived fields.
3. Data Integrity Checks
o Tests for referential integrity to ensure relationships between tables are preserved.
o Verifies constraints such as primary keys, foreign keys, and unique keys.
4. Performance and Scalability
o Ensures that the ETL process performs well under varying data volumes.
o Checks that systems can handle increased loads without compromising quality.
5. Error Handling and Logging
o Validates the system's ability to identify, log, and recover from errors during the ETL process.
o Ensures data anomalies are flagged for review.
6. Metadata Validation
o Ensures that metadata, such as data types and lengths, are consistent across systems.
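Metadata validation can itself be scripted. A hedged sketch using sqlite3's PRAGMA table_info (the two tables, their columns, and the expectation of identical declared types are assumptions for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE source_table (id INTEGER, name TEXT, salary REAL)")
# The salary column's type drifted to TEXT in the target - a metadata defect.
cur.execute("CREATE TABLE target_table (id INTEGER, name TEXT, salary TEXT)")

def schema(table):
    # PRAGMA table_info returns one row per column; keep (name, declared type).
    return [(row[1], row[2]) for row in
            cur.execute(f"PRAGMA table_info({table})")]

# Pair up source and target columns and report any name/type mismatch.
mismatches = [(s, t) for s, t in
              zip(schema("source_table"), schema("target_table")) if s != t]
print(mismatches)
```

In other engines the same idea uses the information_schema views instead of PRAGMA.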
Tools and Techniques for ETL Testing
Various tools assist in ETL testing, such as:
• Manual Testing: Useful for small datasets but time-intensive for large-scale ETL processes.
• Automated Testing Tools: Tools like Apache NiFi, Talend, Informatica, and QuerySurge facilitate automated validations.
• SQL Queries: Widely used for source-to-target validation and transformation checks.
Challenges in ETL Testing
Despite its importance, ETL testing comes with challenges:
• Large data volumes make manual testing impractical.
• Diverse data sources and formats require complex validation techniques.
• Errors must be identified and addressed in real time while the ETL process is running.
Wednesday, 29 May 2019
How to delete duplicate records in a Teradata table?
There are several ways to delete duplicate records. As an example, take a product_table:
Table name: product_table
product_id product_name
2 pqr
2 pqr
8 klm
8 klm
99 qqqq
The first approach removes duplicates using the rowid pseudo-column. Note that rowid is vendor-specific (Oracle, for example, supports it); this query will not work in Teradata.
Syntax: DELETE FROM <table> WHERE rowid NOT IN (SELECT MAX(rowid) FROM <table> GROUP BY <key_columns>);
Query: DELETE FROM product_table WHERE rowid NOT IN (SELECT MAX(rowid) FROM product_table GROUP BY product_id);
Another way to delete duplicates, which does work in Teradata, is to use a SET table: first create a SET table with the same definition, then insert the records into it. As we already know, a SET table does not allow duplicate rows, so the INSERT...SELECT silently discards them.
CREATE SET TABLE product_table_t AS (SELECT * FROM product_table) WITH NO DATA;
INSERT INTO product_table_t SELECT * FROM product_table;
DROP TABLE product_table;
RENAME TABLE product_table_t AS product_table;
A third way eliminates duplicates using DISTINCT.
Create a temporary table with no data:
CREATE TABLE product_t AS (SELECT * FROM product_table) WITH NO DATA;
Insert the distinct records into product_t:
INSERT INTO product_t SELECT DISTINCT * FROM product_table;
Drop the original product_table and rename the new table:
DROP TABLE product_table;
RENAME TABLE product_t AS product_table;
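The DISTINCT-into-a-staging-table approach works in most SQL engines, so it can be tried outside Teradata too. A sketch in Python with sqlite3 (table and column names follow the example above; the rename syntax here is SQLite's, not Teradata's):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE product_table (product_id INTEGER, product_name TEXT)")
cur.executemany("INSERT INTO product_table VALUES (?, ?)",
                [(2, "pqr"), (2, "pqr"), (8, "klm"), (8, "klm"), (99, "qqqq")])

# Stage the distinct rows, drop the original, then rename the staging table.
cur.execute("CREATE TABLE product_t AS SELECT DISTINCT * FROM product_table")
cur.execute("DROP TABLE product_table")
cur.execute("ALTER TABLE product_t RENAME TO product_table")

rows = cur.execute("SELECT * FROM product_table ORDER BY product_id").fetchall()
print(rows)  # each product appears once
```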
Please comment here if you know any other method of removing duplicates.
Difference between ROW_NUMBER(), RANK(), and DENSE_RANK()
Let us assume an emp table with columns ename and sal, shown below.
ENAME | SAL |
---|---|
SMITH | 800 |
ALLEN | 1600 |
WARD | 1250 |
JONES | 2975 |
MARTIN | 1250 |
BLAKE | 2850 |
CLARK | 2450 |
SCOTT | 3000 |
KING | 5000 |
TURNER | 1500 |
ADAMS | 1100 |
JAMES | 950 |
FORD | 3000 |
MILLER | 1300 |
The query looks like this:
SELECT ENAME, SAL,
       ROW_NUMBER() OVER (ORDER BY SAL DESC) ROW_NUMBER,
       RANK() OVER (ORDER BY SAL DESC) RANK,
       DENSE_RANK() OVER (ORDER BY SAL DESC) DENSE_RANK
FROM EMP;
Output:
ENAME | SAL | ROW_NUMBER | RANK | DENSE_RANK |
---|---|---|---|---|
KING | 5000 | 1 | 1 | 1 |
SCOTT | 3000 | 2 | 2 | 2 |
FORD | 3000 | 3 | 2 | 2 |
JONES | 2975 | 4 | 4 | 3 |
BLAKE | 2850 | 5 | 5 | 4 |
CLARK | 2450 | 6 | 6 | 5 |
ALLEN | 1600 | 7 | 7 | 6 |
TURNER | 1500 | 8 | 8 | 7 |
MILLER | 1300 | 9 | 9 | 8 |
WARD | 1250 | 10 | 10 | 9 |
MARTIN | 1250 | 11 | 10 | 9 |
ADAMS | 1100 | 12 | 12 | 10 |
JAMES | 950 | 13 | 13 | 11 |
SMITH | 800 | 14 | 14 | 12 |
It all depends on your requirement and the business rule you are following.
1. ROW_NUMBER should be used only when you just want a serial number on the result set. It is not as intelligent as RANK or DENSE_RANK.
2. The choice between RANK and DENSE_RANK depends on the business rule you follow. RANK leaves gaps in the numbering when it sees a common value in two or more rows; DENSE_RANK leaves no gaps between ranks.
So when assigning the next rank to a row, RANK considers the total count of rows before that row, while DENSE_RANK simply gives the next rank according to the value.
So if you are ranking employees by their salaries you might use DENSE_RANK, and if you are ranking students by their marks you might use RANK (though neither is mandatory; it depends on your requirement).
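The output table above can be reproduced with any engine that supports window functions, including SQLite 3.25+. A small sketch with a subset of the sample salaries (the four rows chosen are assumptions taken from the table above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE emp (ename TEXT, sal INTEGER)")
cur.executemany("INSERT INTO emp VALUES (?, ?)",
                [("KING", 5000), ("SCOTT", 3000), ("FORD", 3000), ("JONES", 2975)])

# SCOTT and FORD tie on 3000: ROW_NUMBER still increments, RANK gives both 2
# and then skips to 4, DENSE_RANK gives both 2 and continues with 3.
rows = cur.execute("""
    SELECT ename, sal,
           ROW_NUMBER() OVER (ORDER BY sal DESC) AS rn,
           RANK()       OVER (ORDER BY sal DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY sal DESC) AS drnk
    FROM emp
    ORDER BY sal DESC""").fetchall()
for r in rows:
    print(r)
```

Note that the order of the two tied rows (and hence which of them gets ROW_NUMBER 2) is not deterministic without a tie-breaking column in the ORDER BY.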
Sunday, 31 March 2019
Full Load vs. Incremental Load in Datawarehousing
Full Load vs. Incremental Load
Full Load:
• Truncates all target rows and loads everything from scratch.
• Requires more time.
• Source-to-target consistency can easily be guaranteed.
• Target history can be lost, since the target is rebuilt on every load.
Incremental Load:
• Only new and updated records are loaded.
• Requires less time.
• Consistency is harder to guarantee; the ETL must correctly detect new and updated rows.
• Target history is retained.
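An incremental load is commonly driven by a "last loaded" watermark. A minimal illustration in Python with sqlite3 (the updated_at column, the watermark value, and the sample rows are assumptions for the sketch):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE source_table (id INTEGER, val TEXT, updated_at TEXT)")
cur.execute("CREATE TABLE target_table (id INTEGER, val TEXT, updated_at TEXT)")
cur.executemany("INSERT INTO source_table VALUES (?, ?, ?)",
                [(1, "a", "2019-03-01"), (2, "b", "2019-03-15"),
                 (3, "c", "2019-03-30")])

# Full load: truncate the target and reload everything.
cur.execute("DELETE FROM target_table")
cur.execute("INSERT INTO target_table SELECT * FROM source_table")

# Incremental load: only rows newer than the watermark of the last run.
watermark = "2019-03-30"
cur.execute("INSERT INTO source_table VALUES (4, 'd', '2019-03-31')")
cur.execute("INSERT INTO target_table "
            "SELECT * FROM source_table WHERE updated_at > ?", (watermark,))
count = cur.execute("SELECT COUNT(*) FROM target_table").fetchone()[0]
print(count)  # 3 rows from the full load plus 1 new row
```

A production incremental load would also have to handle updates to existing rows (for example with a merge/upsert) rather than plain inserts.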
Tuesday, 6 November 2018
What is Software?
A set of executable programmes, or a collection of programmes, on a computer is called software.
Friday, 24 August 2018
Difference between Data warehouse and Business Intelligence
Data warehouse is a way of storing data and creating information through leveraging data marts. Data marts are segments or categories of information and/or data that are grouped together to provide 'information' into that segment or category. Data warehouse does not require Business Intelligence to work. Reporting tools can generate reports from the DW.
Business Intelligence is the leveraging of DW to help make business decisions and recommendations. Information and data rules engines are leveraged here to help make these decisions along with statistical analysis tools and data mining tools.
The Business Intelligence tool that we are going to learn is SSIS, which uses SQL as the backend. So now, let us learn the SQL basics.