Can we learn something practical from a pure computer science interview question? In this article we will solve an interview question and optimize it – but we will also look at a practical way to handle data in real life.
The problem of finding intersecting numbers between two arrays is simple. Assuming an input of two number arrays, return an array that is composed of numbers they have in common. A number might appear more than once. If it appears twice in each array, the resulting array will have two instances of the same number.
Let’s look at an example:
nums1 = [1, 3, 5, 1, 7, 9, 3]
nums2 = [2, 1, 9, 7, 1, 3]
The intersection between them is:
[1, 1, 9, 7, 3]
We have two instances of 1, one instance of 7 and one instance of 9. 3 appears only once in the result despite appearing twice in nums1, because nums2 has only one instance of 3.
Naïve Solution
As usual when tackling a new problem, we try to solve it as naively as possible. We will iterate over the first array, and for each number in it we will search for the same number in the second array. If we find one, we will push it into the new array we intend to return. We will also take it out of the second array, so we won't count it twice…
It will look something like this:
function naiveIntersection(nums1: number[], nums2: number[]): number[] {
  // iterate over nums1
  return nums1.reduce((intersection, valueInNums1) => {
    // search for the current value in nums2
    const index = nums2.indexOf(valueInNums1);
    // if we found one
    if (index > -1) {
      // push the number to the intersection array
      intersection.push(valueInNums1);
      // take the number out of nums2 so it isn't counted twice (note: this mutates nums2)
      nums2.splice(index, 1);
    }
    return intersection;
  }, [] as number[]);
}
The complexity is O(n*m), n being the length of one array and m the length of the second array. That's because we have a nested loop (reduce and indexOf each iterate over one of the arrays).
Optimized Solution
You might say – let's iterate over the longer array first, so the inner loop runs over the shorter one. Well… that's still O(n*m), so we won't gain much from this strategy.
We can get O(n+m) complexity by using a different approach. If we iterate over nums1 once and keep a counter for each number in a hash table (hash table – a fancy name for an object, or in our case a Map), we can create the intersection array with one sweep over nums2. One loop for nums1 and another for nums2 – no nested loops, better complexity, right?
Here’s an implementation for this one:
function optimizedIntersection(nums1: number[], nums2: number[]): number[] {
  // create the frequency map in one sweep of nums1
  const nums1Counter = nums1.reduce((counterMap, value) => {
    counterMap.set(value, (counterMap.get(value) ?? 0) + 1);
    return counterMap;
  }, new Map<number, number>());
  // create the new array in one sweep of nums2
  return nums2.reduce((intersection, val) => {
    if (nums1Counter.get(val)) {
      intersection.push(val);
      nums1Counter.set(val, (nums1Counter.get(val) ?? 0) - 1);
    }
    return intersection;
  }, [] as number[]);
}
This simple solution is far more efficient for longer arrays. How much? When tested with nums1.length of 10 and nums2.length of 1000, the optimized solution ran ~75% faster (figure 1).
You can play with the performance benchmark here (change the sizes of the arrays for instance).
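If you want to measure it yourself outside that benchmark, here's a minimal timing sketch, assuming both naiveIntersection and optimizedIntersection from above are in scope; the array sizes and the 0–99 value range are arbitrary choices for illustration:

// build two random test arrays (sizes are arbitrary for illustration)
const buildArray = (length: number) =>
  Array.from({ length }, () => Math.floor(Math.random() * 100));

const a = buildArray(10);
const b = buildArray(1000);

// the naive version mutates its second argument, so pass a copy to keep the comparison fair
let start = performance.now();
naiveIntersection(a, [...b]);
console.log(`naive: ${(performance.now() - start).toFixed(3)}ms`);

start = performance.now();
optimizedIntersection(a, b);
console.log(`optimized: ${(performance.now() - start).toFixed(3)}ms`);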
Small Numbers Solution
We could optimize further if we know certain things about our data. For instance, if we know one of our datasets is small, or that we have lots of repeating numbers (i.e. the count maps would be much smaller than the lengths of the arrays), we could do something like this:
function frequencyMap(nums: number[]) {
  const map = new Map<number, number>();
  for (const num of nums) {
    map.set(num, (map.get(num) ?? 0) + 1);
  }
  return map;
}

function intersect(nums1: number[], nums2: number[]): number[] {
  const nums1CounterMap = frequencyMap(nums1);
  const nums2CounterMap = frequencyMap(nums2);
  let intersection: number[] = [];
  for (const [num, count] of nums1CounterMap.entries()) {
    if (nums2CounterMap.has(num)) {
      const commonCount = Math.min(count, nums2CounterMap.get(num)!);
      intersection = [...intersection, ...new Array(commonCount).fill(num)];
    }
  }
  return intersection;
}
The magic here happens in the loop that iterates over nums1CounterMap.entries(). If nums2CounterMap also has the current number, we find how many times it appears in both arrays and add that many repetitions to the intersection array in one go. This way, if we have a lot of repeats or one of our arrays is small, we get much better results because there are far fewer iterations while building the array.
Note that we still go over the data in O(m+n), so we didn't improve the complexity – we just exploited the characteristics of specific kinds of data.
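To make the idea concrete, here's a quick sketch using the arrays from the opening example, assuming the frequencyMap and intersect functions above are in scope. Each map ends up with only five distinct keys, so the final loop runs over distinct values rather than over every element:

const nums1 = [1, 3, 5, 1, 7, 9, 3];
const nums2 = [2, 1, 9, 7, 1, 3];

// each map has only 5 distinct keys, no matter how long the arrays get
console.log(frequencyMap(nums1)); // Map { 1 => 2, 3 => 2, 5 => 1, 7 => 1, 9 => 1 }
console.log(frequencyMap(nums2)); // Map { 2 => 1, 1 => 2, 9 => 1, 7 => 1, 3 => 1 }

// the intersection is built from the distinct keys only
console.log(intersect(nums1, nums2)); // [1, 1, 3, 7, 9]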
Figures 2, 3 and 4 visualize the differences when optimizing for certain data characteristics. You can see that changing the data's characteristics changes how well the different algorithms perform. Eventually, for bigger datasets, the optimized solution and the small-numbers solution perform roughly the same. You can play with the numbers here.
Very Big nums2 Array
Another variation of this problem might be the following: what happens if nums2 is fetched from a file on disk that is too big to fit in memory?
One fun fact is that JavaScript's memory is limited (by default) to around 2 GB. You can play with this limit using certain flags (e.g. Node's --max-old-space-size), but it is highly inadvisable to load big data sets into memory in most cases.
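If you're curious what the limit actually is on your machine, here's a small sketch that reads it from V8; v8.getHeapStatistics is part of Node's standard library, and the flag value in the comment is just an example:

const v8 = require('v8');

// heap_size_limit is the maximum heap size V8 will allow for this process
const { heap_size_limit } = v8.getHeapStatistics();
console.log(`Heap limit: ${Math.round(heap_size_limit / 1024 / 1024)} MB`);

// the limit can be raised when starting the process, for example:
// node --max-old-space-size=4096 index.js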
What can we do then? We can load nums2 in chunks and test each chunk against nums1.
That means two things:
- We cannot use the Small Numbers solution, because that solution counts all of the numbers in nums2 in advance. If all the numbers in nums2 are unique, we would end up storing all of nums2 in memory again (inside nums2CounterMap).
- We need a tool that allows us to read the file in chunks. The most prominent tool (and the only one I can think of, because I never needed anything else) is Node.js Streams.
How to stream a file in Node.js?
Using streams is pretty simple. In the case of reading a file, all you need to do is create a file read stream using Node's fs.createReadStream. Let's first see the code without streaming:
const fs = require('fs');
const server = require('http').createServer();

server.on('request', async (req, res) => {
  res.writeHead(200);
  // read the whole file into memory before responding
  fs.readFile('./data2.file', async (err, data) => {
    if (err) throw err;
    res.end(data);
  });
});

server.listen(8000);
This code, on every request to the server, reads the file without a stream. Trying to read a big file this way results in a server that gets stuck. I took a screenshot after waiting for the server for around 20 minutes:
Legends say this server is still trying to consume that file to this day 😉
Now, doing the same with a stream is simple. We replace readFile with createReadStream and pipe the resulting stream into res, so it outputs the contents of the file as it reads them:
const fs = require('fs');
const server = require('http').createServer();

server.on('request', async (req, res) => {
  res.writeHead(200);
  // stream the file and pipe its chunks straight into the response
  const data = fs.createReadStream('./data2.file');
  data.pipe(res);
});

server.listen(8000);
Figure 6 shows that this code completed the task in 90 seconds:
How to use Streams to complete our task?
Now that we know how to stream a file, we can use this technique to output our intersection result. There are several ways we can do this.
To make things simple, let's make use of the functions we already have. What we'd like to do is take the biggest possible chunk from our stream and run our function on it.
We will have to change our functions though… so let's take the optimizedIntersection function and adapt it.
Our algorithm should be as follows:
- Create a frequencyMap of nums1
- Start streaming nums2 from a file
- On every chunk, iterate over the current chunk and add the intersecting numbers.
- For simplicity's sake, we'll run this until the stream ends
So our optimizedIntersection function will look like this:
function optimizedIntersection(nums1: number[], nums2Path: string) {
  // create the frequency map in one sweep of nums1
  const nums1Counter = frequencyMap(nums1);
  // create the new array while reading nums2 from file
  const nums2FileStream = fs.createReadStream(nums2Path);
  let intersection: number[] = []; // the array
  // listen to new chunks coming in and add the relevant elements to the array
  nums2FileStream.on('data', (data) => {
    // Buffer#toJSON exposes the chunk's bytes as a plain number array
    const nums2 = data.toJSON().data;
    const chunkIntersection = nums2.reduce((acc, val) => {
      if (nums1Counter.get(val)) {
        acc.push(val);
        nums1Counter.set(val, nums1Counter.get(val) - 1);
      }
      return acc;
    }, [] as number[]);
    intersection = [...intersection, ...chunkIntersection];
  });
  // when we are done, we can log the result or do anything else we want
  nums2FileStream.on('end', function() {
    console.log('Finished streaming stuff: ', intersection);
  });
  // return the stream so callers can listen to events like 'end' themselves
  return nums2FileStream;
}
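For completeness, here's a minimal usage sketch. The file path and the listeners are illustrative assumptions – the function already logs the result on 'end', but because it returns the stream, callers can attach their own listeners too:

// assumed: nums1 is in memory, nums2 lives in a (possibly huge) file on disk
const nums1 = [1, 3, 5, 1, 7, 9, 3];
const stream = optimizedIntersection(nums1, './data2.file');

// the returned stream lets callers react to completion (or errors) themselves
stream.on('end', () => console.log('done building the intersection'));
stream.on('error', (err) => console.error('failed to read nums2:', err));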
Summary
Interview questions might not seem very practical most of the time. But by digging deeper and asking follow-up questions, one can extract practical lessons from a simple coding exercise.
In this article we saw how to classically solve the Intersection of Two Arrays problem. The O(m*n) solution was optimized to O(m+n). In addition, the optimization could be taken further if we knew how our data might look. For instance, there's a solution that is optimized for the case of sorted arrays.
Finally, we meddled a bit with streams for the case where we had a large file to consume that could not be kept in memory.
There are several ways to implement almost everything. The more complex that thing is, the more ways there are to do it. I'd love to read your suggestions for solutions to the various problems presented here.
Thanks to Miki Ezra Stanger for the very kind and helpful review.
Featured Photo by Denys Nevozhai on Unsplash
There’s one more optimization you can do here:
In the optimized solution, instead of creating a new array in each iteration (as is done with “intersection = […intersection, …newValues]”), you could just push numbers into the original array, then return it.
By using a single array instead of creating n or m different ones, you should see a significant performance improvement – especially when working with very large datasets.
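As a rough sketch of that suggestion (the intersectWithPush name is made up for illustration, and it reuses the frequencyMap helper from the article), the small-numbers loop could push into a single result array instead of spreading into a new one on each match:

function intersectWithPush(nums1: number[], nums2: number[]): number[] {
  const nums1CounterMap = frequencyMap(nums1);
  const nums2CounterMap = frequencyMap(nums2);
  const intersection: number[] = [];
  for (const [num, count] of nums1CounterMap.entries()) {
    const otherCount = nums2CounterMap.get(num);
    if (otherCount !== undefined) {
      // push directly into one array instead of rebuilding it with spreads
      const commonCount = Math.min(count, otherCount);
      for (let i = 0; i < commonCount; i++) {
        intersection.push(num);
      }
    }
  }
  return intersection;
}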
Another one, which I'm not sure about, is to use an array with a predefined number of cells instead of just an empty one: nums1.reduce(() => { … }, new Array(Math.max(m, n))).
This should allow the engine to allocate adjacent bytes for the result, and thus require less movement/search over memory. I'm not sure how much of an improvement it'll translate to in most cases though.
You are correct. I'm a big fan of pre-allocation for performance boosts.
I didn't want to delve too much into it, but if it is of interest to anyone, here is a link to articles in which I talk about it and how to measure its implications: https://yonatankra.com/?s=pre+allocation.