Sometime one shouldn’t be too optimistic with .NET Fx optimization. I knew that the extension method IEnumerable.Count() was optimized to check if the enumerable was a collection, in such case the Collection.Count property was called instead (see Count() impl at the end of the post).
So when it comes to set operations such as Intersect Union… if one of the sequence involved was an HashSet, I assumed that the obvious optimization of using the O(1) HashSet.Contains() magic, was implemented. But actually, by decompiling I discovered that it was not implemented. A Set was created each time the code achieve a common set operation, no matter the type of sequences. In the case where one wishes to do set operations with an existing HashSet this is unfortunate. The gain one can expect is around 2.7, with variations depending on the sequences collateral size. This comes from the fact that with the HashSet optimization, the operation cost is O(other sequence size).
If the HashSet is bigger than the other sequence, gains factor can be high (like 15x).
If the HashSet is smaller than the other sequence, gains factor tends to 1x.
If the HashSet size is comparable to the other sequence size, the gain factor is around 2.7.
Here are quantitative measures:
nbElemHashsetA=1000 nbElemB=1000 intersectionSize=800 nbLoop=10000
Gain : 283%
nbElemHashsetA=1000 nbElemB=1000 intersectionSize=200 nbLoop=10000
Gain : 270%
nbElemHashsetA=1000 nbElemB=100 intersectionSize=50 nbLoop=10000
Gain : 1539%
nbElemHashsetA=100 nbElemB=1000 intersectionSize=50 nbLoop=10000
Gain : 127%
Here is the code optimization:
internal static IEnumerable MyIntersect(this IEnumerable first, IEnumerable second) {
So when it comes to set operations such as Intersect Union… if one of the sequence involved was an HashSet, I assumed that the obvious optimization of using the O(1) HashSet.Contains() magic, was implemented. But actually, by decompiling I discovered that it was not implemented. A Set was created each time the code achieve a common set operation, no matter the type of sequences. In the case where one wishes to do set operations with an existing HashSet this is unfortunate. The gain one can expect is around 2.7, with variations depending on the sequences collateral size. This comes from the fact that with the HashSet optimization, the operation cost is O(other sequence size).
If the HashSet is bigger than the other sequence, gains factor can be high (like 15x).
If the HashSet is smaller than the other sequence, gains factor tends to 1x.
If the HashSet size is comparable to the other sequence size, the gain factor is around 2.7.
Here are quantitative measures:
nbElemHashsetA=1000 nbElemB=1000 intersectionSize=800 nbLoop=10000
Gain : 283%
nbElemHashsetA=1000 nbElemB=1000 intersectionSize=200 nbLoop=10000
Gain : 270%
nbElemHashsetA=1000 nbElemB=100 intersectionSize=50 nbLoop=10000
Gain : 1539%
nbElemHashsetA=100 nbElemB=1000 intersectionSize=50 nbLoop=10000
Gain : 127%
Here is the code optimization:
internal static IEnumerable MyIntersect(this IEnumerable first, IEnumerable second) {