How to code your own JavaScript de-duplicator


Duplicate data is one of those problems that shows up constantly in real-world development: a user submits the same email address twice, an API returns overlapping records across paginated responses, a CSV export contains repeated IDs, or a text list has been copied and pasted one too many times. In all these cases, you need a reliable way to remove duplicates — ideally without a server round-trip, without loading a library, and with full control over the logic.

This article covers two approaches. First, a full-featured de-duplicator tool built in vanilla JavaScript — useful for text lists, email lists, URLs, and similar line-by-line data. Second, the modern JavaScript methods for deduplicating arrays in code: Set, filter(), reduce(), and Map. Both are purely client-side with no external dependencies.

When do you need to deduplicate data?

Before reaching for a solution, it’s worth understanding the common scenarios where deduplication is needed:

  • Email list cleaning. Mailing lists accumulate duplicates over time from imports, form submissions, and manual additions. Sending to duplicates wastes send credits and skews analytics.
  • API response merging. When combining results from multiple paginated API calls or merging data from two endpoints, duplicate records are common.
  • Form input validation. If a user can add items to a list (tags, skills, products), you’ll want to prevent or clean duplicate entries before saving.
  • ID deduplication. Database queries, export scripts, and ETL pipelines frequently produce datasets where the same record ID appears multiple times.
  • URL normalisation. Crawl data, sitemap generators, and link checkers often produce lists containing the same URL in slightly different forms.

Full-featured de-duplicator tool

The tool below handles line-by-line text data — paste in a list of emails, URLs, IDs, or names, choose your options, and get a clean deduplicated output.

HTML

The markup is minimal: two <textarea> elements (input and output), three checkboxes for options, and a button to trigger the process.

<p>
    <textarea name="masterlist" id="masterlist" rows="16" style="width:100%"></textarea>
</p>
<p>
    <label><input type="checkbox" name="caps" id="caps" value="" checked> Ignore capitals (results in lower case)</label><br>
    <label><input type="checkbox" name="kpblanks" id="kpblanks" value=""> Keep blanks at line starts</label><br>
    <label><input type="checkbox" name="sort" id="sort" value=""> Sort results</label>
</p>
<input type="submit" class="button" value="De-duplicate" onclick="deduplicate()">
<a name="startresults"></a>
<p name="removed" id="removed"></p>
<textarea name="output" id="output" rows="16" style="display:none;width:100%" onclick="this.focus();this.select()"></textarea>

JavaScript — line by line

The deduplicate() function works by exploiting a property of JavaScript objects: object keys are unique. By using the processed line as both the key and value of an associative structure, any line that appears more than once simply overwrites its previous entry rather than adding a new one. Here’s the full function followed by a walkthrough of each stage:

function deduplicate() {
    var txt = document.getElementById( 'masterlist' ).value;

    // Escape angle brackets in the input so HTML or script tags can't be
    // interpreted by the browser when results are written back to the DOM
    txt = txt.replace( new RegExp( '>', 'g' ), '&gt;' );
    txt = txt.replace( new RegExp( '<', 'g' ), '&lt;' );

    // Split the textarea value into an array, one entry per line
    var masterarray  = txt.split( '\n' );
    var itemsInArray = masterarray.length;

    // Plain objects used as associative maps: keys are the processed lines
    var dedupe      = {};
    var i           = 0;
    var editedArray = {};

    while ( i < itemsInArray ) {
        // Strip trailing whitespace from each line
        masterarray[ i ] = masterarray[ i ].replace( /\s+$/, '' );

        // Normalise tabs to single spaces
        masterarray[ i ] = masterarray[ i ].replace( new RegExp( '\t', 'g' ), ' ' );

        // Handle leading whitespace based on the "keep blanks" option
        if ( ! document.getElementById( 'kpblanks' ).checked ) {
            masterarray[ i ] = masterarray[ i ].replace( /^\s+/, '' );
        } else {
            if ( masterarray[ i ].match( /^ +/ ) ) {
                var spc = masterarray[ i ].match( /^ +/ );
                // Convert leading spaces to non-breaking spaces so the indent
                // is preserved as part of the deduplication key
                spc[ 0 ] = spc[ 0 ].replace( / /g, '\u00a0' );
                masterarray[ i ] = masterarray[ i ].replace( /^\s+/, spc[ 0 ] );
            }
        }

        // Optionally convert to lowercase for case-insensitive deduplication
        var ulc = document.getElementById( 'caps' ).checked
            ? masterarray[ i ].toLowerCase()
            : masterarray[ i ];

        // Using the processed value as an object key guarantees uniqueness
        editedArray[ ulc ] = ulc;
        dedupe[ ulc ]      = '0';

        i++;
    }

    // Collect the unique values into a plain array, skipping blank lines
    var uniques = [];

    for ( var key in dedupe ) {
        if ( editedArray[ key ] !== '' ) {
            uniques.push( editedArray[ key ] );
        }
    }

    // Optionally sort the results alphabetically (case-insensitive)
    if ( document.getElementById( 'sort' ).checked ) {
        uniques.sort( function( x, y ) {
            var a = String( x ).toUpperCase();
            var b = String( y ).toUpperCase();
            if ( a > b ) { return 1; }
            if ( a < b ) { return -1; }
            return 0;
        } );
    }

    // Display the results
    var ulen  = uniques.length;
    var rmvd  = itemsInArray - ulen;
    var thelist = uniques.join( '\n' );

    document.getElementById( 'removed' ).innerHTML    = itemsInArray + ' original lines, ' + rmvd + ' removed, ' + ulen + ' remaining.';
    document.getElementById( 'output' ).value         = thelist;
    document.getElementById( 'output' ).style.display = 'block';

    window.location = '#startresults';
}

Stage 1 — HTML escaping. Before processing, any < and > characters in the input are escaped. This prevents any HTML or script tags in the input data from being interpreted by the browser when results are written back to the DOM.

Stage 2 — Splitting into lines. txt.split('\n') converts the textarea string into an array where each element is one line of input. This is the core data structure the rest of the function operates on.

Stage 3 — Normalisation. Each line has trailing whitespace stripped (/\s+$/) and tabs converted to spaces. If the "keep blanks at line starts" option is off, leading whitespace is also stripped. This normalisation step is important — without it, "hello" and "hello " would be treated as different values.
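
To see the normalisation in isolation, here is a minimal sketch of what happens to a single line when the keep-blanks option is off (the sample string is made up):

```javascript
var line = '  hello\t world  ';

line = line.replace( /\s+$/, '' ); // strip trailing whitespace
line = line.replace( /\t/g, ' ' ); // tabs to single spaces
line = line.replace( /^\s+/, '' ); // strip leading whitespace (keep-blanks off)

console.log( JSON.stringify( line ) ); // "hello  world"
```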

Stage 4 — Deduplication via object keys. The processed line (optionally lowercased) is used as a key in the editedArray and dedupe objects. Because JavaScript object keys must be unique, assigning the same key multiple times simply overwrites the existing entry. The result after the loop is that dedupe contains only one entry per unique line.
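
The key-overwrite behaviour is easy to demonstrate on its own:

```javascript
var seen = {};

seen[ 'alice@example.com' ] = 'alice@example.com';
seen[ 'bob@example.com' ]   = 'bob@example.com';
seen[ 'alice@example.com' ] = 'alice@example.com'; // same key: overwrites, adds nothing

console.log( Object.keys( seen ).length ); // 2
```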

Stage 5 — Collecting and sorting. A for...in loop over dedupe pushes each unique value into the uniques array. If sorting is enabled, a case-insensitive alphabetical sort is applied using toUpperCase() for consistent comparison.

Stage 6 — Output. The results are joined back into a newline-separated string and written to the output textarea. A summary showing how many lines were removed is displayed above the results.

Alternate lightweight version

If you need a minimal version that preserves case and skips all the options, here's a single-function implementation. This was written as a code-golfing exercise — it's intentionally compact, not production-recommended, but useful to understand as an illustration of how concise the core deduplication logic actually is:

<p><textarea id="j" rows="16" style="width:100%"></textarea></p>
<p><button onclick="d();">Process</button></p>
<p><textarea id="l" rows="16" style="width:100%"></textarea></p>
function d() {
    var a = document.getElementById( 'j' ).value.split( '\n' );
    var b = {};
    a.forEach( function( c ) { b[ c ] = 1; } );
    a = [];
    for ( var k in b ) { a.push( k ); }
    document.getElementById( 'l' ).value = a.join( '\n' );
}

The same object-key trick is used here: each line from the split array is used as a key in b, overwriting duplicates. Then for...in collects the unique keys back into an array.
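
One caveat with the object-key trick is worth knowing: JavaScript engines iterate integer-like keys in ascending numeric order before all other keys, so lines that look like whole numbers can come back reordered:

```javascript
var b = {};
[ '10', 'apple', '2' ].forEach( function( c ) { b[ c ] = 1; } );

var out = [];
for ( var k in b ) { out.push( k ); }

console.log( out ); // ['2', '10', 'apple']: numeric-looking keys jump to the front
```

If exact line order matters for numeric data, the Set approach described in the next section preserves insertion order instead.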

Modern JavaScript alternatives for array deduplication

If you're deduplicating a JavaScript array in code (rather than processing text input), the language gives you several cleaner options. All of the following work in every modern browser and in Node.js without any polyfills.

Using Set (simplest approach)

Set is a built-in JavaScript object that stores only unique values. Converting an array to a Set and back is the most concise deduplication approach available:

const input    = [ 'alice@example.com', 'bob@example.com', 'alice@example.com', 'carol@example.com' ];
const unique   = [ ...new Set( input ) ];

console.log( unique );
// ['alice@example.com', 'bob@example.com', 'carol@example.com']

new Set(input) creates a Set containing only the unique values from the array. The spread operator [...set] converts it back to a plain array. This preserves insertion order and handles strings, numbers, and booleans correctly. Comparison uses the SameValueZero algorithm (strict equality, except that NaN counts as equal to itself), so a Set won't deduplicate objects by their content — only by reference.
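
Two edge cases illustrate that comparison behaviour:

```javascript
// Objects are compared by reference, so structurally identical objects both survive
const objs = [ { id: 1 }, { id: 1 } ];
console.log( [ ...new Set( objs ) ].length ); // 2

// NaN, however, counts as equal to itself under SameValueZero, so it does deduplicate
const nums = [ 1, NaN, 1, NaN ];
console.log( [ ...new Set( nums ) ] ); // [ 1, NaN ]
```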

Using filter() and indexOf()

This approach uses Array.filter() to keep only the first occurrence of each value. It's slightly more verbose than Set but gives you more control and works in environments that predate ES6:

const input  = [ 'alice', 'bob', 'alice', 'carol', 'bob' ];
const unique = input.filter( function( value, index, self ) {
    return self.indexOf( value ) === index;
} );

console.log( unique );
// ['alice', 'bob', 'carol']

indexOf() returns the index of the first occurrence of a value. For duplicate values, the second occurrence will have a different index than what indexOf() returns, so the filter() callback returns false for it and it's excluded from the result.

This approach has O(n²) time complexity because indexOf() scans the entire array for each element. For large arrays (tens of thousands of items or more), the Set approach is significantly faster.

Using reduce()

Array.reduce() gives you full control over the accumulation logic, which is useful when you need to deduplicate while simultaneously transforming or filtering the data:

const input  = [ 'alice', 'bob', 'alice', 'carol', 'bob' ];
const unique = input.reduce( function( accumulator, current ) {
    if ( accumulator.indexOf( current ) === -1 ) {
        accumulator.push( current );
    }
    return accumulator;
}, [] );

console.log( unique );
// ['alice', 'bob', 'carol']

The accumulator starts as an empty array. For each item in the input, it's only pushed to the accumulator if it's not already present. Like the filter approach, this is O(n²) — fine for small datasets, but consider Set for large ones.

Deduplicating objects by a property

When working with arrays of objects, you'll typically want to deduplicate by a specific property (e.g. an ID or email). Set won't help here since it compares objects by reference. A Map is the cleanest solution:

const users = [
    { id: 1, name: 'Alice' },
    { id: 2, name: 'Bob' },
    { id: 1, name: 'Alice (duplicate)' },
    { id: 3, name: 'Carol' },
];

const unique = [ ...new Map( users.map( function( user ) {
    return [ user.id, user ];
} ) ).values() ];

console.log( unique );
// [{ id: 1, name: 'Alice' }, { id: 2, name: 'Bob' }, { id: 3, name: 'Carol' }]

Map uses the specified property (user.id) as the key. Because Map keys are unique, later entries with the same key overwrite earlier ones. .values() extracts the deduplicated objects, and the spread operator converts them back to an array. This runs in O(n) time.

Choosing the right approach

Here's a quick reference for picking the right method:

  • Deduplicating a text list (emails, URLs, IDs) in a UI: the full-featured tool above
  • Deduplicating a simple array of strings or numbers: Set
  • Deduplicating with custom comparison logic: filter() or reduce()
  • Deduplicating an array of objects by a property: Map
  • Large datasets (>10,000 items) where performance matters: Set or Map (both O(n))
  • ES5 compatibility required: filter() + indexOf()

Browser compatibility and performance notes

Set and Map are supported in all modern browsers and in Node.js v4+, though IE11's implementations are partial (for example, IE11's Set constructor ignores an iterable argument). Array.filter(), reduce(), and indexOf() are available in IE9+ and every modern environment. Note that both the spread operator ([...set]) and Array.from() are ES6 features that IE11 lacks, so if you genuinely need IE11 support, stick with the filter() + indexOf() approach or load polyfills.

For the line-by-line text tool, performance is limited primarily by the textarea input size and DOM manipulation, not by the deduplication algorithm itself. For pure array deduplication in code, Set and Map use hash-based lookups and run in O(n) time, making them the right choice for any dataset larger than a few hundred items.
