String Interning - The Next Step

I watched this latest Code Cop episode from Nick Chapsas and had to add some of my own thoughts. I usually agree with Nick, but on rare occasions I either disagree with him or, as is the case today, feel he didn't go deep enough.

The two things I felt should have been covered were:

  • Explain alternatives
  • Explain when it DOES make sense to intern a string

So, alternatives are important because, while the original post may have been an April Fools' "prank", the problem it purports to solve is a REAL problem. This, I think, is a big part of why Nick felt the need to tackle it in the first place. However, because it is a real problem, it also requires a deeper dive.

I can see his video having one of two effects on people watching it. Either they stop trying to manage their strings better, or they start trying to justify why they should be interning strings.

For me, the sort of example provided in the original LinkedIn advice is bad for a different and very simple reason: human error. If you have the exact same string literal in multiple places in your code, those places are likely NOT one line after another. That means you can make typos, or update the literal in one place and not in another. Realistically, the fragility of the code is going to become an issue MUCH sooner than the memory allocations will. Generally speaking, what you want for these literals is a single variable or constant, possibly static (depending on the use case), which you then reference everywhere you need that value.

Having it in a single (possibly static) variable means that, as with string interning, there is only a single instance of that string in memory. And you can't make typos when referencing it, because a misspelled variable name is a compile-time error.
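To make the difference concrete, here is a minimal sketch. The names `Messages` and `UserNotFound` are hypothetical and exist only for illustration.

```csharp
using System;

public static class Messages
{
    // Hypothetical constant: the one place where the text is spelled out.
    public const string UserNotFound = "User not found";
}

public static class Program
{
    public static void Main()
    {
        // Fragile: the same literal typed twice, and the typo compiles without complaint.
        string a = "User not found";
        string b = "User not fuond";
        Console.WriteLine(a == b); // False

        // Robust: both usages point at the same constant, so a misspelled
        // identifier (e.g. Messages.UserNotFuond) would fail to compile.
        string c = Messages.UserNotFound;
        string d = Messages.UserNotFound;
        Console.WriteLine(object.ReferenceEquals(c, d)); // True
    }
}
```

Whether you use `const` or `static readonly` depends on the use case. Either way there is one definition to update and one string instance at run time.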

As for when to use string interning? Well, firstly, when it isn't a string literal. Beyond that, to make it worthwhile, I would say you want a non-literal value which will also be frequently accessed, perhaps across multiple threads at a time, and in a scenario where a static value could not, for some reason, stand in for it. You likely also want it to be a value which would otherwise stay in memory for a "long time". Long time is in quotes because it really depends on the situation. 5ms can be a "long time" in a hot code path which is executed thousands of times a second, whereas the same code in a function which is called once a minute as part of a much larger application would be better off simply allocating the memory as needed and letting it get garbage collected later.
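As a rough illustration of what interning a non-literal value looks like, the sketch below builds the same text twice at run time and then passes both copies through `string.Intern`. The helper `BuildTenantKey` and the "tenant-" prefix are made up for the example.

```csharp
using System;

public static class InternDemo
{
    // Hypothetical helper: the result is built at run time, so the
    // compiler cannot intern it for us the way it does with literals.
    public static string BuildTenantKey(int id)
    {
        return string.Concat("tenant-", id.ToString());
    }

    public static void Main()
    {
        string first = BuildTenantKey(42);
        string second = BuildTenantKey(42);

        Console.WriteLine(first == second);                       // True: same characters
        Console.WriteLine(object.ReferenceEquals(first, second)); // False: two separate allocations

        // string.Intern returns the pooled instance for this text,
        // adding it to the pool first if it is not there yet.
        string internedFirst = string.Intern(first);
        string internedSecond = string.Intern(second);

        Console.WriteLine(object.ReferenceEquals(internedFirst, internedSecond)); // True
    }
}
```

Keep in mind that each call still allocates the temporary string before the lookup happens, and that, per the `String.Intern` documentation, interned strings are unlikely to be released before the runtime shuts down. That is part of why the "long time" qualifier above matters.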

An example might be something like a tenant name or identifier in a multi-tenant web service. These are values which wouldn't be known at compile time, but almost every call will be associated with a tenant at some level. Logs, lookups and new records in the DB will often require this value, and it may get calculated and stored multiple times throughout the lifecycle of a single request. If the web server is busy enough, and a tenant is likely enough to make repeated calls in a short period, then you might notice some gains in a memory-constrained system.
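A minimal sketch of what that could look like follows. `TenantResolver`, its `Resolve` method, and the idea that the tenant arrives as a raw header value are all assumptions made for the example, not something taken from a real framework.

```csharp
using System;

public static class TenantResolver
{
    // Hypothetical helper: normalizes a raw value (for example from a
    // request header) and returns the interned instance, so every part of
    // the request that resolves the tenant ends up sharing one string.
    public static string Resolve(string rawHeaderValue)
    {
        string normalized = rawHeaderValue.Trim().ToLowerInvariant();
        return string.Intern(normalized);
    }
}

public static class Program
{
    public static void Main()
    {
        // Two unrelated components resolving the same tenant independently.
        string fromMiddleware = TenantResolver.Resolve(" Contoso ");
        string fromService = TenantResolver.Resolve("CONTOSO");

        Console.WriteLine(fromMiddleware);                                         // contoso
        Console.WriteLine(object.ReferenceEquals(fromMiddleware, fromService));    // True
    }
}
```

Note that the normalization still allocates on every call. Interning only ensures that the copy which gets held onto is the shared one.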

I'll admit, I struggled to come up with a good example, so I took an example which could work and I'm focusing on why it could work in some scenarios. The important parts here are that we're assuming:
  • That there is a single value which could not be known at compile time and thus cannot be a string literal
  • The value is likely to be needed in multiple places in the code, and likely places which are unaware of each other (different parts of the ASP.NET pipeline for example, such as multiple middleware components or injectable services)
  • The value itself couldn't be a static value
    • A multi-tenant system will have multiple tenants for example, so a single static variable could not be used
  • The value would need to be calculated frequently enough that it would not make sense to simply use dependency injection
    • For example, we could save the multiple allocations on a single call by using IoC to track a scoped object which holds the value, but it would still be calculated for every call. 
    • This could become wasteful if an object with the same value were likely to be instantiated many times concurrently.

I say the multi-tenant example is not a good one because, while a tenant would meet those criteria, a multi-tenant system would generally be built to scale, handling an unknown but potentially very large number of tenants. And we would generally not want to intern them all.

Otherwise, the example gets the basics right. And some multi-tenant systems may be run on-premises or in smaller, private clouds with a few high-volume consumers. In those cases, it would be a good fit.
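If the number of tenants is not known to be small, one possible guard rail is to skip `string.Intern` and keep a capped pool of your own, so that only a limited number of values are ever held onto. The sketch below is just that, a sketch: `BoundedStringPool` is a hypothetical class, and the cap is deliberately "soft" (concurrent callers could briefly push it slightly past the limit).

```csharp
using System;
using System.Collections.Concurrent;

public sealed class BoundedStringPool
{
    private readonly ConcurrentDictionary<string, string> _pool =
        new ConcurrentDictionary<string, string>();
    private readonly int _capacity;

    public BoundedStringPool(int capacity)
    {
        _capacity = capacity;
    }

    // Returns the pooled instance if one exists. If the pool is full,
    // the caller's own string is returned unchanged and nothing is stored.
    public string GetOrAdd(string value)
    {
        if (_pool.TryGetValue(value, out var existing))
        {
            return existing;
        }

        // Soft cap: this check and the add below are not atomic together.
        if (_pool.Count >= _capacity)
        {
            return value;
        }

        return _pool.GetOrAdd(value, value);
    }
}

public static class Program
{
    public static void Main()
    {
        var pool = new BoundedStringPool(capacity: 1);

        string a1 = new string('a', 3);
        string a2 = new string('a', 3);
        string b1 = new string('b', 3);
        string b2 = new string('b', 3);

        // "aaa" fits in the pool, so both lookups share one instance.
        Console.WriteLine(object.ReferenceEquals(pool.GetOrAdd(a1), pool.GetOrAdd(a2))); // True

        // The pool is now full, so "bbb" is never pooled.
        Console.WriteLine(object.ReferenceEquals(pool.GetOrAdd(b1), pool.GetOrAdd(b2))); // False
    }
}
```

This keeps memory bounded, but it also shows how quickly the "simple" optimization grows extra moving parts.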

It is important that I clarify my objective here. I agree with Nick: you most likely want to leave string interning up to the compiler and not deal with it yourself. Even when I can fabricate a scenario, it needs guard rails to make any level of sense. And software almost always gets pushed outside the guard rails once it goes to production.

Furthermore, the amount of sense it makes, even with guard rails, is questionable. While interning may be more memory efficient for that variable, it may not be as CPU efficient. What ultimately makes this example highly suspect is that the tenant name or ID is EXTREMELY unlikely to be the only string we need to build and store throughout the lifecycle of a complex enough call, but it IS likely to be the only variable worth interning, or one of a very small number of them. So, if the system is memory constrained in the first place (which is the problem you'd be hoping to solve), you're likely to end up stressing the CPU more without eliminating memory as the bottleneck. Put another way, you make the memory usage a bit better at the cost of making everything else worse.
