Yet another interview question deep dive: Intersection of Two Arrays

Can we learn something practical from a pure computer science interview question? In this article we will solve an interview question and optimize it – but we will also look at a practical way to handle data in real life.

The problem of finding intersecting numbers between two arrays is simple. Assuming an input of two number arrays, return an array that is composed of numbers they have in common. A number might appear more than once. If it appears twice in each array, the resulting array will have two instances of the same number.

Let’s look at an example:

nums1 = [1, 3, 5, 1, 7, 9, 3]

nums2 = [2, 1, 9, 7, 1, 3]

The intersection between them is:

[1, 1, 9, 7, 3]

We have two instances of 1, once instance of 7 and one instance of 9. 3 appears only once in the result despite appearing twice in nums1, because nums2 has only one instance of 3.

Naïve Solution

As usual when tackling a new problem, we try to solve it as naively as possible. We will iterate over the first array and for each number in the first array we will search for a similar number in the second array. If we find one, we will push it in the new array we intend to return. We will also take it out of the second array, so we won’t count it twice…

It will look something like this:


function naiveIntersection(nums1: number[], nums2: number[]): number[] {
    // iterate over nums1
    return nums1.reduce((intersection, valueInNums1) => { 
        // search for the current value in nums2 
        const index = nums2.indexOf(valueInNums1); 
        // if we found one
        if (index > -1) { 
            // push the number to the intersection array
            intersection.push(valueInNums1); 
            // get the number out of nums2
            nums2.splice(index, 1); 
        }
        return intersection;
    }, []);
};

The complexity is O(n*m), n being the length of one array and m the length of the second array. That’s because we have a nested loop (reduce and indexOf are iterating over the data set).

Optimized Solution

You’d might say – let iterate over the longer array first and then the loop that repeats itself will run more efficiently. Well… that’s still O(n*m), so we won’t get much benefit from this strategy.

We can get an O(n+m) complexity by using a different approach. If we iterate over nums1 once, and keep a counter for each number in a hash table (hash table – fancy name for an object), we would be able to create the intersection array with one sweep over nums2. One loop for nums1 and another for nums2 – no nested loops, better complexity, right?

Here’s an implementation for this one:

function optimizedIntersection(nums1: number[], nums2: number[]): number[] {
    // create the frequency map in one sweep of nums1
    const nums1Counter = nums1.reduce((counterMap, value) => {
        counterMap.set(value, counterMap.get(value) ? counterMap.get(value) + 1 : 1);
        return counterMap;
    }, new Map());

    // create the new array in one sweep of nums2
    return nums2.reduce((intersection, val) => {
        if (nums1Counter.get(val)) {
            intersection.push(val);
            nums1Counter.set(val, nums1Counter.get(val) - 1);
        }
        return intersection;
    }, []);
}

This simple solution is more efficient by far when it comes to longer arrays. How long? When tested on nums1.length => 10 and nums2.length => 1000 the difference was that the optimized solution ran ~75% faster (figure 1).

Figure 1: Performance benchmark of the optimized solution vs the naive solution. Source code is here.

You can play with the performance benchmark here (change the sizes of the arrays for instance).

Small Numbers Solution

We could optimize more if we know certain things about our data. For instance, if we knew one of our datasets is small or if we knew we had lots of repeating numbers (e.g. the count maps would be much smaller than the lengths of the arrays) we could do something like this:

function frequencyMap(nums: number[]) {
    const map = new Map<number, number>();
    for (const num of nums) {
        map.set(num, map.has(num) ? map.get(num) + 1 : 1);
    }
    return map;
}

function intersect(nums1: number[], nums2: number[]): number[] {
    const nums1CounterMap = frequencyMap(nums1);
    const nums2CounterMap = frequencyMap(nums2);
    
    let intersection = [];
    
    for (const [num, count] of nums1CounterMap.entries()) {
        if (nums2CounterMap.has(num)) {
            const commonCount = Math.min(count, nums2CounterMap.get(num));
            intersection = [...intersection, ...new Array(commonCount).fill(num)];
        }
    }
    
    return intersection;
};

The magic here happens in the loop that iterates over nums1CounterMap.entries(). If nums2CounterMap also has this number, we find how many times it repeats and just add a whole array of repeating numbers in one go into the intersection array.

This way, if we have a lot of repeats or one of our arrays is small, we get much better results because there are much less iterations while building the array.

Note that we still go over the data O(m+n), so we didn’t improve the complexity – we just used a trick for characteristics of specific possible data.

Figure 2: Running the test for nums1 with 100 members and nums2 with 1000 members.
Figure 3: Running the test for nums1 with 10 members and nums2 with 1000 members
Figure 4: Running the test for nums1 and nums2 both with 100,000 members

Figures 2, 3 and 4 visualize the differences when optimizing for certain data characteristics. You can see that changing the data’s characteristics changes how better the different algorithms play. Eventually for bigger numbers, optimized solution and small numbers are roughly the same. You can play with the numbers here.

Very Big num2 Array

Another variation of this problem might be the following: what happens if nums2 is being fetched from a file on disk that is too big for memory?

One fun fact is that JavaScript is limited (by default) to 2 GB. You can play with it with certain flags, but it highly not recommended to load big data sets into memory in most cases.

What can we do then? We load nums2 in chunks and test each chunk vs. nums1.

That means 2 things:

  1. We cannot use the Small Numbers solution because in this solution, we count all of the numbers in nums2 in advance. If all the numbers in nums2 are unique, it will result in us storing all of nums2 in memory again (inside counterMap).
  2. We need to use a tool that allows us to get the file in chunks. The most prominent tool (and the only one I can think of because I never needed anything else) is: Nodejs Streams.

How to stream a file in Nodejs?

Using streams is pretty simple. In the case of reading a file, all you need to do is create a file readStream using node’s fs.createReadStream. Let’s first see the code without streaming:

const fs = require('fs');
const server = require('http').createServer();

server.on('request', async (req, res) => {
    res.writeHead(200);

    fs.readFile('./data2.file', async (err, data) => {
        if (err) throw err;
        res.end(data);
    });
});

server.listen(8000);

This code, on request to the server, reads the file without a stream. Trying to read a big file will result in a server that is stuck. I’ve taken a screenshot after waiting for the server for around 20 minutes:

Figure 5: Waiting for the server to respond after asking it to consume an 18.3GB file using fs.readFile

Legends say, this server is still trying to consume this file till today 😉

Now, doing the same with a stream is simple. We replace readFile with createDataStream and pipe res to it so it will output the contents of the file it processes:

const fs = require('fs');
const server = require('http').createServer();

server.on('request', async (req, res) => {
    res.writeHead(200);
    const data = fs.createReadStream('./data2.file');
    data.pipe(res);
});

server.listen(8000);

Figure 6 shows that this code completed the task in 90 seconds:

Figure 6: Completing the 18.3Gb file read took 90.32 seconds using streaming.

How to use Streams to complete our task?

Now that we know how to stream a file, we can use this technique to output our intersection result. There are several ways we can do this.

To make things simple, let’s make use of the functions we already have. What we’d like to do it get the biggest possible chunk in our stream and run it in our function.

We will have to change our functions though… so let’s take the optimizedIntersection function and use it.

Our algorithm should be as follows:

  1. Create a frequencyMap of nums1
  2. Start streaming nums2 from a file
  3. On every chunk, iterate over current chunk and add intersecting numbers.
  4. For simplicity sake, we’ll run this until the stream ends

So our optimizedIntersection function will look like this:

function optimizedIntersection(nums1: Number[], nums2Path: string) {
    // create the frequency map in one sweep of nums1
    const nums1Counter = frequencyMap(nums1);

    // create the new array while reading nums2 from file
    const nums2FileStream = fs.createReadStream(nums2Path);

    let intersection = []; // the array

    // listen to new chunks coming in and add the relevant elements to the array
    nums2FileStream.on('data', (data) => {
        const nums2 = data.toJSON().data;
        const chunkIntersection = nums2.reduce((intersection, val) => {
            if (nums1Counter.get(val)) {
                intersection.push(val);
                nums1Counter.set(val, nums1Counter.get(val) - 1);
            }
            return intersection;
        }, []);
        intersection = [...intersection, ...chunkIntersection];
    });
    
    // when we are done, we can log it or do anything else we want
    nums2FileStream.on('end', function() {
        console.log('Finished streaming stuff: ', intersection);
    });

    // we now return the stream where users can listen to some event and on stream end...
    return nums2FileStream;
}

Summary

Interview questions might not seem very practical most of the times. When digging deeper and asking followup questions, one might get some practical practices from doing simple code practice.

In this article we saw how to classically solve the Intersection of Two Arrays problem. The solution of O(m*n) was optimized to O(m+n). In addition, the optimization could be done better if we knew how our data might look. For instance, there’s a solution that is optimized for the case of sorted arrays.

Finally, we meddled a bit with streams in the case when we had a large file to consume that could not be kept in memory.

There are several ways to implement almost everything. The more complex that thing is – the more ways there are to do it. I’d love to read your suggestions to solution to the various problems presented here.

Thanks to Miki Ezra Stanger for the very kind and helpful review.

Featured Photo by Denys Nevozhai on Unsplash

0 0 votes
Article Rating
Subscribe
Notify of
guest

2 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Miki
Miki
2 years ago

There’s one more optimization you can do here:
In the optimized solution, instead of creating a new array in each iteration (as is done with “intersection = […intersection, …newValues]”), you could just push numbers into the original array, then return it.
By using a single array instead of creating n or m different ones, you should see a significant performance improvement – especially when working with very large datasets.

Another one, which I’m not sure about, is to use an array with predefined number of cells, instead of just an empty one: nums1.reduce(() => { … }, new Array(Math.max(m, n)).
This should allow the engine to allocate adjacent bytes for the result, and thus require less movement/search over memory. I’m not sure how much of an improvement it’ll translate to in most cases though.