Project 08

Data Deduplication Tool

This project focuses on cleaning structured data by identifying and removing duplicate records from a CSV file. The tool supports configurable deduplication logic, allowing duplicates to be detected based on full-row matches or selected key columns. It is designed to improve data quality prior to downstream analysis, import, or reporting.

C# (.NET)

System.IO

LINQ

Technical Highlights

Read CSV data into in-memory structures to support row-level comparison and processing.
Implemented flexible deduplication strategies based on entire rows or specific column combinations.
Used hash-based lookups (HashSet<string>) to efficiently track and eliminate duplicate records.
Leveraged LINQ for grouping, comparison, and filtering where appropriate.
Generated a clean CSV output containing only unique rows while preserving the original schema.
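
The hash-based approach described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `CsvDeduplicator` class and `Deduplicate` method are hypothetical names, and the sketch assumes a simple comma-separated file with a header row and no quoted fields or embedded commas.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; names are not taken from the project.
public static class CsvDeduplicator
{
    // Returns the header plus the first occurrence of each unique row.
    // With no keyColumns, the entire row is used as the comparison key.
    // Assumption: plain comma-separated values, no quoting or escaping.
    public static List<string> Deduplicate(IEnumerable<string> lines, params string[] keyColumns)
    {
        var result = new List<string>();
        var seen = new HashSet<string>(StringComparer.Ordinal);
        int[] keyIndexes = Array.Empty<int>();
        bool headerRead = false;

        foreach (var line in lines)
        {
            if (!headerRead)
            {
                // Resolve the key column names to positions in the header.
                var header = line.Split(',');
                keyIndexes = keyColumns.Select(name => Array.IndexOf(header, name)).ToArray();
                if (keyIndexes.Any(i => i < 0))
                    throw new ArgumentException("Unknown key column.");
                result.Add(line); // keep the header so the schema is preserved
                headerRead = true;
                continue;
            }

            var fields = line.Split(',');

            // Build the comparison key: the full line, or the selected fields
            // joined with a separator that is unlikely to appear in the data.
            var key = keyIndexes.Length == 0
                ? line
                : string.Join("\u001F", keyIndexes.Select(i => i < fields.Length ? fields[i] : ""));

            // HashSet.Add returns false when the key has been seen before.
            if (seen.Add(key))
                result.Add(line);
        }

        return result;
    }
}
```

With System.IO, a call such as `File.WriteAllLines("clean.csv", CsvDeduplicator.Deduplicate(File.ReadLines("input.csv"), "email"))` would write the deduplicated rows; the file names and the `email` column are placeholders. Keeping the first occurrence of each key preserves the original row order.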

Project Takeaways

Gained practical experience performing data-cleaning operations programmatically.
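Learned how hashing enables fast duplicate detection at scale: a hash set checks membership in constant time on average, so duplicates can be found in a single pass over the data.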
Explored how LINQ can simplify comparison and aggregation logic in structured data processing.
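
As a rough illustration of the LINQ point, grouping rows by a composite key and keeping the first row of each group expresses the same deduplication declaratively. The `LinqDedup` class and its methods are hypothetical names, and the sketch assumes the rows have already been split into fields and that every row has all the key columns.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; names are not taken from the project.
public static class LinqDedup
{
    // Builds a composite key from the selected column positions.
    private static string KeyOf(string[] row, int[] keyIndexes) =>
        string.Join("\u001F", keyIndexes.Select(i => row[i]));

    // Keeps the first row of each group. GroupBy on in-memory sequences
    // yields groups in order of first appearance, so row order is preserved.
    public static List<string[]> DistinctByColumns(IEnumerable<string[]> rows, params int[] keyIndexes) =>
        rows.GroupBy(r => KeyOf(r, keyIndexes))
            .Select(g => g.First())
            .ToList();

    // Counts the rows that would be dropped as duplicates.
    public static int CountDuplicates(IEnumerable<string[]> rows, params int[] keyIndexes) =>
        rows.GroupBy(r => KeyOf(r, keyIndexes))
            .Sum(g => g.Count() - 1);
}
```

Grouping also makes it easy to report how many duplicates were removed, which a plain HashSet pass does not track without an extra counter.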