Favorite Technical Interview Questions?

Just ask Java newbies (and j2ee guys) about static, pass by and immutability and then give them sample code to hang themselves on. Beside 1 person, after over 150+ interviews I’ve yet to find a single other person (I suspect the one got prepped) to get all three right.. I’d guess and say avg get 0.75 right out of the 3. Yes that means most don’t even get 1 right.

Have to wonder what they teach in university these days as jvm & understanding how memory management works is crucial building blocks.
Yeah SA standards have dropped; there's a considerable bunch that exit without even a working comprehension of any language; I was certainly surprised to see how prevalent the copying of assignments has become; and of course dumbfounded by how they end up with degrees.
 
Last edited:
Yeah SA standards have dropped; there's a considerable bunch that exit without even a working comprehension of any language; I was certainly surprised to see how prevalent the copying of assignments has become; and of course dumbfounded by how they end up with degrees.

Hell no, I built a template engine and dynamic html renderer from scratch today. Started with the schema and worked my way up. 1kb of js in total. (the backend Devs needed a custom single purpose page, so that's why no react/vue)

Copying assignments? What bulldust is that? I see it this way... A task / assignment is an opportunity to learn, to explore, to test your metal and creativity.

You guys inspire me to learn more, so that when the time comes, I can pass it on to someone else.
 
Yeah SA standards have dropped; there's a considerable bunch that exit without even a working comprehension of any language; I was certainly surprised to see how prevalent the copying of assignments has become; and of course dumbfounded by how they end up with degrees.

The copying was already starting to be an issue back when I was teaching. They get degrees because with 100% marks for the practicals, they may only need 30-40% to pass their exams. And usually in an exam there is 30-40% of memory and/or basics before the 40% of “are you competent”, followed by the last 20-30% for those who really went beyond the scope of the course.

One typically only has to look at who got ~100% for their pracs and 30% for their exams to see who cheated.

If one does the work themselves their is a lot to be learned, otherwise it just serves to dilute the reputation of the degree.
 
The copying was already starting to be an issue back when I was teaching. They get degrees because with 100% marks for the practicals, they may only need 30-40% to pass their exams. And usually in an exam there is 30-40% of memory and/or basics before the 40% of “are you competent”, followed by the last 20-30% for those who really went beyond the scope of the course.

One typically only has to look at who got ~100% for their pracs and 30% for their exams to see who cheated.

If one does the work themselves their is a lot to be learned, otherwise it just serves to dilute the reputation of the degree.
The copying that I have been privy to certainly wouldn't classify as something difficult to automatically detect; especially considering the institutions involved. Makes me wonder about how bad this could be at the lesser ranked varsities in SA.

Plus one wonders why this hasn't been more publicly addressed, when the divergence between practical and examination scores would be so easy to spot.
 
1) How would you multiply two 4x4 matrices quickly?
2) How about two 10000x10000 matrices?
3) How about two100000x1000000000 matrices?
 
1) How would you multiply two 4x4 matrices quickly?
2) How about two 10000x10000 matrices?
3) How about two100000x1000000000 matrices?
SIMD? e.g. Intel's dgemm or with Apple, the Accelerate (SIMD) framework has an operation called vDSP_mmulD
 
SIMD? e.g. Intel's dgemm or with Apple, the Accelerate (SIMD) framework has an operation called vDSP_mmulD

Definitely along those lines for (1). The overhead of a DGEMM call would erase the benefit of SIMD for something as small as a 4x4, so the question would then move onto an inline implementation and what sequence of SIMD instructions to use. It’s not trivial SIMD code since one needs to transpose the matrix data and there will also be cross lane operations.
 
Definitely along those lines for (1). The overhead of a DGEMM call would erase the benefit of SIMD for something as small as a 4x4, so the question would then move onto an inline implementation and what sequence of SIMD instructions to use. It’s not trivial SIMD code since one needs to transpose the matrix data and there will also be cross lane operations.
Yeah dgemm is not trivial but Apple Accelerate certainly is; vDSP_mmulD is simple.
Haven't had a look but I guess OpenCV GSL should also have something similar.

As for the heavy matrix processing loads were you leaning towards mesh network processing (e.g. hypercube network), or something a bit lighter?
 
Yeah dgemm is not trivial but Apple Accelerate certainly is; vDSP_mmulD is simple.
Haven't had a look but I guess OpenCV GSL should also have something similar.

As for the heavy matrix processing loads were you leaning towards mesh network processing (e.g. hypercube network), or something a bit lighter?

vDSP_mmulID would have a similar issue to DGEMM. They’re both general implementations for arbitrary sized matrixes, so in practice it will be a bunch of for loops that will only do one iteration for the 4x4 case, or even some unrolled loops that do 0 iterations followed by a residual loop that does 1, etc.

For the heavy matrix processing, I was thinking more along the lines of how the calculation would be distributed and reduced over many machines. Also, ideally the machines would have some powerful GPUs in them.

A very good candidate would be able to tell me how to optimize for the various bandwidth vs compute trade offs between PCIE bandwidth, network bandwidth, and computational horsepower, and also be able to estimate how long the operation would take on the proposed hardware and whether the computation is limited by bandwidth or compute.
 
Last edited:
vDSP_mmulID would have a similar issue to DGEMM. They’re both general implementations for arbitrary sized matrixes, so in practice it will be a bunch of for loops that will only do one iteration for the 4x4 case, or even some unrolled loops that do 0 iterations followed by a residual loop that does 1, etc.

For the heavy matrix processing, I was thinking more along the lines of how the calculation would be distributed and reduced over many machines. Also, ideally the machines would have some powerful GPUs in them.

A very good candidate would be able to tell me how to optimize for the various bandwidth vs compute trade offs between PCIE bandwidth, network bandwidth, and computational horsepower, and also be able to estimate how long the operation would take on the proposed hardware and whether the computation is limited by bandwidth or compute.
Not sure I can appreciate the issue re vDSP_mmulD, am I missing something because two 4x4 matrices is only 16 values in/out. As for the heavy loads, yeah I thought as much hence I inferred distributed processing; as for your requirement it's sounds awfully specific but I guess that's what you need.

For fun I'd probably just want to see how I could use something like an applicative functor, re it lends itself to concurrent product over types, is easy to compose, and parallelise.
Edit: Obviously this is an area that has already seen research.
 
Last edited:
Not sure I can appreciate the issue re vDSP_mmulD, am I missing something because two 4x4 matrices is only 16 values in/out. As for the heavy loads, yeah I thought as much hence I inferred distributed processing; as for your requirement it's sounds awfully specific but I guess that's what you need.

For fun I'd probably just want to see how I could use something like an applicative functor, re it lends itself to concurrent product over types, is easy to compose, and parallelise.
Edit: Obviously this is an area that has already seen research.

For the 4x4, it’s 2x16 values in and 1x16 values out. So, it’s really simple and a good implementation for it will be loop free. The API calls you mention don’t assume 4x4 so they will have many loops, most of which are redundant. The cost of running through all that code and skipping all those loops will likely take away any advantage to using them. A fair chunk of the advantage may even be gone just from the “call” instruction.
 
For the 4x4, it’s 2x16 values in and 1x16 values out. So, it’s really simple and a good implementation for it will be loop free. The API calls you mention don’t assume 4x4 so they will have many loops, most of which are redundant. The cost of running through all that code and skipping all those loops will likely take away any advantage to using them. A fair chunk of the advantage may even be gone just from the “call” instruction.
Ok, same page... sure re 4x4, but for me that's just a simple transversal in/out to match the format of vDSP_mmulD, naturally the approach with the larger matrices would be different.
Swift:
import Accelerate
var m1 = [[1, 2, 3, 4],[5, 6, 7, 8],[9, 10, 11, 12],[13, 13, 15, 16]]
  .flatMap{$0}.map(Double.init)
var m2 = [[16, 15, 14, 13],[12, 11, 10, 9],[8, 7, 6, 5],[4, 3, 2, 1]]
  .flatMap{$0}.map(Double.init)
var mr = [Double](repeating : 0.0, count : 16)
vDSP_mmulD(m1, 1, m2, 1, &mr, 1, 4, 4, 4)
var mresult = stride(from: 0, to: mr.count, by: 4).map { Array(mr[$0..<$0+4]) }

Understandably I have no clue about the loads that you handle, but just considering the level of concern I can only imagine it requires a lot of finesse with algorithmic processing efficiencies.
 
Understandably I have no clue about the loads that you handle, but just considering the level of concern I can only imagine it requires a lot of finesse with algorithmic processing efficiencies.

For real-time processing it is important that things like the 4x4 are as fast as possible (4x4 is specifically used a lot in games), although time sensitive linear algebra is also used for time sensitive trading or any real-time machine learning application.

For large scale number crunching, being 20% more efficient is roughly equivalent to having a 20% larger super computer. Given that this can cost circa $100m, an extra 20% overall is effectively worth $20m. :)
 
For real-time processing it is important that things like the 4x4 are as fast as possible (4x4 is specifically used a lot in games), although time sensitive linear algebra is also used for time sensitive trading or any real-time machine learning application.

For large scale number crunching, being 20% more efficient is roughly equivalent to having a 20% larger super computer. Given that this can cost circa $100m, an extra 20% overall is effectively worth $20m. :)
Yeah I get the general idea, certainly is a vastly different set of concerns to the average code I write.
Btw the Swift example is just to illustrate that the vDSP_mmulD was an option, naturally that code could be further refined, still as it stands my runtimes didn't seem all that shabby (~ 0.000147ms) across a few runs.
 
Yeah I get the general idea, certainly is a vastly different set of concerns to the average code I write.
Btw the Swift example is just to illustrate that the vDSP_mmulD was an option, naturally that code could be further refined, still as it stands my runtimes didn't seem all that shabby (~ 0.000147ms) across a few runs.

That’s 147ns, so roughly 500 clocks on a Xeon Skylake. 4 loads should take 5 clocks (from L1). 4 vector multiply-adds should take 20 clocks, so it should take roughly 25 clocks total or 8ns. There is also capacity to do more than one of these in parallel if need be. If the one matrix needs to be transposed, probably another 4 clocks or so (so 9-10ns), so what you have there is roughly an order of magnitude slower. As the matrix gets bigger I would expect the hand-rolled vs vDSP implementation to converge.
 
"Explain the difference between value and reference types"

That seems to trip up about half of the candidates.
 
"Explain the difference between value and reference types"

That seems to trip up about half of the candidates.

Yup! Also you can ask in what situations can the compiler elide the copy for value types. Eg, can it elide it in C/C++ if the value is declared “const”?
 
Top
Sign up to the MyBroadband newsletter
X