Entity ID Allocation Schemes

akent99 · 06-07-2017

We are considering some changes to how IDs are allocated for entities. This includes well-known entities (Product, Category, Customer, Order, etc.) as well as secondary entities (newsletter subscriptions, authentication rules, etc.). This blog post discusses the pros and cons of different strategies, with a focus on Magento. We wanted to open this discussion up for community feedback. The change is being considered as part of possible future database API work. Comments are welcome!

Please note that these techniques are well-known approaches, built into technologies such as the Java Persistence API (JPA). This discussion is about Magento's adoption of these strategies, not the strategies themselves.

Auto Increment IDs

Today, many entities use auto-increment IDs allocated by MySQL: when you insert a record into a table, MySQL allocates a unique ID automatically if the ID column is omitted or NULL.

One consequence of this approach is that when creating new entities, you do not know the entity ID until after the record is inserted. Application logic must insert a record and then ask what ID was allocated. This works, but it has a few undesirable side effects.

If you need to insert related records (records that reference each other using IDs), your application logic must insert one record, then wait for the response to work out the ID allocated, insert the next record, wait for the response, etc. This means your business logic structure is somewhat dictated by interactions with the database.
When you construct an object in memory, it is only “complete” after it is stored to the database (because the ID is unknown). So code has to know how “complete” the in-memory representation is.
Saving records to the database changes the record contents (adds the ID). That feels “strange.”
Plugins can get a bit funky because the record fed into the save() function is not complete.
It violates the “Command” pattern (a software design pattern used by larger patterns such as CQRS that Magento is aiming to follow with the new database API).
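
To make the first point concrete, here is a minimal sketch of the insert-then-ask round trip. It uses Python with an in-memory SQLite database purely as a stand-in for MySQL, and the customer and address tables and the save_customer_with_address function are hypothetical names invented for this illustration, not Magento code.

```python
import sqlite3

# SQLite stands in for MySQL here; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        email TEXT NOT NULL
    );
    CREATE TABLE address (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        city TEXT NOT NULL
    );
""")


def save_customer_with_address(conn, email, city):
    # Step 1: insert the parent. Its ID is unknown until the database answers.
    cur = conn.execute("INSERT INTO customer (email) VALUES (?)", (email,))
    customer_id = cur.lastrowid

    # Step 2: only now can the child row reference the parent.
    cur = conn.execute(
        "INSERT INTO address (customer_id, city) VALUES (?, ?)",
        (customer_id, city),
    )
    address_id = cur.lastrowid
    conn.commit()
    return customer_id, address_id


customer_id, address_id = save_customer_with_address(conn, "a@example.com", "Austin")
```

Note how the order of the two inserts, and the read of the allocated ID in between, is dictated by the database rather than by the business logic.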

To be clear, Magento works today using the MySQL ID allocation approach – it would be wrong to say the approach is unacceptable. However, with the investigation into the new Magento persistence API, it is being reviewed to see if there is a better approach.

IDs vs Row IDs and Content Staging

It is worth a brief sidebar on how Content Staging works in Magento Enterprise Edition. Content Staging allows database changes to be future-dated, so they come into effect at some future point in time (such as during a special sale event). This is a very popular Enterprise Edition feature.

The Content Staging implementation separates row IDs (a unique ID generated per table row) from entity IDs (IDs that are shared across versions of the record, exposed to external applications and APIs). Content Staging then associates a time span with each record. A time point is specified in queries against the database to identify the right version of the entity (or no entity if it is not supposed to exist at that time point). Internally, for performance, whenever possible Magento performs join queries on the row ID rather than the entity ID.

An implication of this is that the entity ID cannot be allocated by inserting records into a time-versioned table, because the entity ID is shared between multiple database records (different versions of the same entity). So there is already some concept of IDs not being allocated by insertion into the entity's own table. To be precise, a separate non-versioned table is used to allocate entity IDs. This ID, once allocated, is used when saving versioned entity records.
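
The separation of row IDs from entity IDs can be sketched roughly as follows. This is an illustration only, again using Python and SQLite as a stand-in; the table names (product_entity_id, product_version), the column names (valid_from, valid_to) and the helper functions are assumptions made up for the example and are not the actual Content Staging schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Non-versioned table whose only job is to hand out entity IDs.
    CREATE TABLE product_entity_id (
        entity_id INTEGER PRIMARY KEY AUTOINCREMENT
    );
    -- Versioned table: one row per version, each with its own row ID.
    CREATE TABLE product_version (
        row_id INTEGER PRIMARY KEY AUTOINCREMENT,
        entity_id INTEGER NOT NULL REFERENCES product_entity_id(entity_id),
        valid_from INTEGER NOT NULL,  -- inclusive start of the time span
        valid_to INTEGER NOT NULL,    -- exclusive end of the time span
        price REAL NOT NULL
    );
""")

FOREVER = 2**31 - 1  # hypothetical "no end date" marker


def allocate_entity_id(conn):
    # The entity ID comes from the non-versioned table, not the versioned one.
    cur = conn.execute("INSERT INTO product_entity_id DEFAULT VALUES")
    return cur.lastrowid


def add_version(conn, entity_id, valid_from, valid_to, price):
    conn.execute(
        "INSERT INTO product_version (entity_id, valid_from, valid_to, price) "
        "VALUES (?, ?, ?, ?)",
        (entity_id, valid_from, valid_to, price),
    )


def price_at(conn, entity_id, time_point):
    # A time point selects the right version, or nothing if none applies.
    row = conn.execute(
        "SELECT price FROM product_version "
        "WHERE entity_id = ? AND valid_from <= ? AND ? < valid_to",
        (entity_id, time_point, time_point),
    ).fetchone()
    return None if row is None else row[0]


product_id = allocate_entity_id(conn)
add_version(conn, product_id, 100, 200, 20.0)      # regular price
add_version(conn, product_id, 200, 300, 15.0)      # future-dated sale price
add_version(conn, product_id, 300, FOREVER, 20.0)  # back to regular price
```

All three rows share one entity ID but have distinct row IDs, which is why the entity ID has to be allocated somewhere other than the versioned table.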

ID Allocation Strategies

In order to fully form an entity before saving it to disk, a unique ID must be allocated. The question is how to allocate this ID. Two well-known strategies are UUIDs and having a sequence number table that remembers the next ID to allocate. The following discussion explores these two approaches further.

UUIDs

Several UUID variants have been standardized over the years. The classic time-based variant combines a timestamp (with sub-second granularity) with something unique about the machine that allocated it (such as the MAC address) to form an ID that is unique across all machines in the world; other variants rely on random numbers instead. There are many articles on the internet describing how this is done. The important characteristics for Magento are that the values are relatively random and fairly long (16 bytes, normally displayed as a 36-character string composed of 32 hexadecimal characters and 4 hyphens).
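
As a quick illustration of the size figures above, the following sketch uses Python's standard uuid module (chosen only for brevity; any language with a UUID library would do). It generates one time-based UUID and one random UUID and exposes the binary and text forms.

```python
import uuid

# Version 1: timestamp plus a node value (typically derived from the MAC address).
time_based = uuid.uuid1()

# Version 4: built from random bits instead of time and machine identity.
random_based = uuid.uuid4()

raw = random_based.bytes   # compact binary form
text = str(random_based)   # canonical text form, e.g. for display or logs
hex_digits = text.replace("-", "")
```

The binary form is 16 bytes, while the text form is 36 characters (32 hexadecimal digits plus 4 hyphens), which is the size difference that matters when choosing a column type.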

The following are some positive characteristics of using UUIDs.

Any application can allocate a UUID, including applications external to Magento such as a PIM or ERP integration. UUIDs are commonly used in the computing industry.
They can help with security, as IDs are harder to guess than sequence numbers.

They do have a number of negative characteristics, however.

They are longer than integer identifiers, so they take up more storage space.
Their random nature makes B-tree index structures less efficient for updating and querying. (Hash indexes are not affected in the same way.)
They are less convenient for humans – no human is going to remember a complete UUID (although they may remember the last few digits).
They take up more space to display (e.g., in debug messages, when querying MySQL tables directly, etc.).

See also the following blog post from Percona on ways to optimize usage of UUIDs: https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/.
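
One commonly described optimization for time-based (version 1) UUIDs is to store them as 16 raw bytes with the timestamp fields reordered, so that values created later sort later and B-tree inserts land near the end of the index. The sketch below illustrates that general idea in Python; it is not a reproduction of the linked post, it assumes version 1 UUIDs only, and the helper names (make_uuid1, to_ordered_bytes, from_ordered_bytes) are hypothetical.

```python
import uuid


def make_uuid1(timestamp, clock_seq=0, node=0x001122334455):
    """Build a version 1 UUID from an explicit 60-bit timestamp (demo only)."""
    time_low = timestamp & 0xFFFFFFFF
    time_mid = (timestamp >> 32) & 0xFFFF
    time_hi_version = ((timestamp >> 48) & 0x0FFF) | 0x1000  # version 1
    clock_seq_low = clock_seq & 0xFF
    clock_seq_hi_variant = ((clock_seq >> 8) & 0x3F) | 0x80  # RFC 4122 variant
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi_variant, clock_seq_low, node))


def to_ordered_bytes(u):
    """Move the high timestamp bits to the front so byte order follows time."""
    b = u.bytes
    # Standard layout: time_low (4) | time_mid (2) | time_hi_version (2) | rest (8)
    return b[6:8] + b[4:6] + b[0:4] + b[8:]


def from_ordered_bytes(data):
    """Inverse of to_ordered_bytes: restore the standard UUID layout."""
    return uuid.UUID(bytes=data[4:8] + data[2:4] + data[0:2] + data[8:])


# Increasing timestamps chosen to cross the field boundaries.
timestamps = [1, 2**32, 2**32 + 5, 2**48, 2**59]
uuids = [make_uuid1(t) for t in timestamps]
ordered = [to_ordered_bytes(u) for u in uuids]
```

In the standard layout the fastest-changing timestamp bits come first, which is what scatters consecutive values across the index; the reordered form keeps them together. Random (version 4) UUIDs do not benefit from this trick.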

Sequence Numbers

Another approach is to use a sequence number. A table or similar persistent storage can be used to remember the next ID to allocate. (One such scheme is already supported in Magento via the Sequence class.)

It is worth noting that you don’t need a database update per ID allocated. You would typically keep an in-memory cache of, say, the next 100 values, so that you only need to update the database once for every 100 values allocated. You can end up with gaps in the sequence if cached values are lost (for example, in a power outage), but that is not common. So the overhead of allocating an ID can be kept pretty low.

It is also worth noting that some IDs can be lost if business logic allocated an ID but then failed to store the record in the database for some reason. This is okay, as long as it is not common to discard an ID.
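
The block-caching idea can be sketched as follows. This is a simplified illustration in Python with SQLite as the persistent store, not Magento's Sequence class; the BlockSequence class, the sequence table and the block size of 100 are assumptions taken from the description above.

```python
import sqlite3


class BlockSequence:
    """Hands out IDs from an in-memory block, touching the database once per block."""

    def __init__(self, conn, name, block_size=100):
        self.conn = conn
        self.name = name
        self.block_size = block_size
        self._next = 0
        self._limit = 0  # exclusive upper bound of the reserved block
        conn.execute(
            "CREATE TABLE IF NOT EXISTS sequence ("
            "name TEXT PRIMARY KEY, next_value INTEGER NOT NULL)"
        )
        conn.execute(
            "INSERT OR IGNORE INTO sequence (name, next_value) VALUES (?, 1)",
            (name,),
        )
        conn.commit()

    def _reserve_block(self):
        # One transaction reserves block_size IDs. The UPDATE runs first so
        # the write lock is held before the new value is read back.
        with self.conn:
            self.conn.execute(
                "UPDATE sequence SET next_value = next_value + ? WHERE name = ?",
                (self.block_size, self.name),
            )
            (end,) = self.conn.execute(
                "SELECT next_value FROM sequence WHERE name = ?", (self.name,)
            ).fetchone()
        self._next = end - self.block_size
        self._limit = end

    def next_id(self):
        if self._next >= self._limit:
            self._reserve_block()
        value = self._next
        self._next += 1
        return value


conn = sqlite3.connect(":memory:")
seq = BlockSequence(conn, "product")
first_ids = [seq.next_id() for _ in range(3)]

# Simulate a restart: the cached remainder of the first block is lost.
restarted = BlockSequence(conn, "product")
after_restart = restarted.next_id()
```

After the simulated restart the next ID comes from a fresh block, so the unused values of the first block become a gap in the sequence. That is the trade-off described above: far fewer database writes in exchange for occasional gaps.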

The following are some positive characteristics of sequence numbers.

Values are small (a fixed size integer) making them storage efficient.
They are close to the current database ID allocation strategy.
Values are in a generally incrementing sequence, which can help with database efficiency in query operations.
Values are short making them easier to remember and making debugging easier for humans.
They can remain integers, like today, rather than changing the database field to binary or string.

However, there are negatives:

All applications that wish to allocate an ID need to talk to a central service. This could be a separate REST call, or in the case of a PIM or ERP could create extra integration complexity.
There are no security benefits, as valid values are easy to guess compared to UUIDs.

It is worth noting that objects can have other attributes that are unique IDs. For example, products have the SKU field, which is also unique. Thus, if an application (such as a PIM or ERP) wanted to use UUIDs, this can still be done by using a product attribute. The discussion in this blog post is focused on the Magento ID that is used internally.

Opinions

For a change, rather than express a single point of view from the Magento architects, this blog post shares three different views. Let us know if you find this approach more interesting.

Alan Kent’s Opinion

Pre-allocation of IDs (allocating IDs before saving records to the database) does have benefits over allocating IDs on record insertion. The question I always ask myself is if the benefit of a change is worth the impact. In this case, with Content Staging in Enterprise Edition, some of these changes have already been made, and I dislike having code (ID allocation in this case) behave in different ways in different areas of the codebase (Community vs Enterprise Edition). So, I am coming around to the arguments put forward to make this change.

If pre-allocation is the way we go, I prefer sequence numbers to UUIDs because they are easier for humans to deal with. Debug messages are easier to read, database tables are easier to query, and you can display them to end users in the Admin. They are also closer to how Magento behaves today, so there is less ripple. For example, the Admin displays IDs at times – changing them to UUIDs would require the IDs to be hidden as they are too long for display in the Admin.

Performance-wise there are negatives to UUIDs (more storage required, less efficient to index) and positives (no central coordinating resource to interact with, easier integration with external systems).

Pre-allocating IDs for records can also help in situations like batch operations where records need to cross reference each other. It is also more portable across different database technologies, helping to loosen the dependence on MySQL.
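
As a rough sketch of the batch case, consider two records that reference each other. With pre-allocated IDs both can be fully formed in memory and written in one batch, with no insert-then-update dance. The example uses Python and SQLite as stand-ins; the tables, the next_id helper and its starting value of 1000 are hypothetical, and next_id could equally be backed by a sequence table or a UUID generator.

```python
import sqlite3
from itertools import count

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        id INTEGER PRIMARY KEY,
        email TEXT NOT NULL,
        default_address_id INTEGER
    );
    CREATE TABLE address (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        city TEXT NOT NULL
    );
""")

# Stand-in ID source for the demo (hypothetical).
_ids = count(1000)


def next_id():
    return next(_ids)


# Both IDs are known up front, so the records can point at each other
# before anything is written to the database.
customer_id = next_id()
address_id = next_id()
customers = [(customer_id, "a@example.com", address_id)]
addresses = [(address_id, customer_id, "Austin")]

with conn:  # a single transaction, no reads between the writes
    conn.executemany(
        "INSERT INTO customer (id, email, default_address_id) VALUES (?, ?, ?)",
        customers,
    )
    conn.executemany(
        "INSERT INTO address (id, customer_id, city) VALUES (?, ?, ?)",
        addresses,
    )
```

Nothing in this code depends on a database-specific way of reporting generated keys, which is the portability point made above.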

Igor Miniailo’s Opinion

I definitely like the idea of having persistence-agnostic entities, where the business logic knows nothing about the fact that the entity will be persisted. Database-generated IDs are a good example of a leaky abstraction, because most developers or architects implicitly rely on a database implementation to provide the ID. What if an implementation does not have the ability to generate a unique ID? This sounds like a Liskov Substitution Principle violation just waiting to happen.

Also, persistence ignorance will help us implement a better validation process in the future. The entity will be in a valid state as soon as it is created, rather than only after it has been persisted. This aligns with Domain Driven Design (DDD) architectural principles.

Persistence-agnostic entities are much easier to cover with unit tests as well.

Also, returning a newly created ID from the create operation violates CQS (Command Query Separation), as the Create method ought to be a Command and commands are not supposed to return a value. From the perspective of encapsulation this is problematic. If we look only at the method signature, we could be tricked into believing that the create method is a Query and therefore has no side effects.
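
To illustrate the CQS point, here is a minimal sketch of a repository whose create method is a pure command. It is written in Python for brevity; the Product and ProductRepository classes are hypothetical, an in-memory dictionary stands in for the database, and a random UUID is used only as one possible pre-allocated ID.

```python
import uuid
from dataclasses import dataclass


@dataclass
class Product:
    id: str
    sku: str


class ProductRepository:
    def __init__(self):
        self._rows = {}  # in-memory stand-in for a table

    def create(self, product: Product) -> None:
        """Command: changes state and returns nothing."""
        if product.id in self._rows:
            raise ValueError(f"duplicate id {product.id}")
        self._rows[product.id] = product

    def get(self, product_id: str) -> Product:
        """Query: returns data and changes nothing."""
        return self._rows[product_id]


repo = ProductRepository()
product = Product(id=str(uuid.uuid4()), sku="TSHIRT-RED-M")
result = repo.create(product)   # nothing to hand back: the caller already has the ID
loaded = repo.get(product.id)
```

Because the caller supplies the ID, create has no reason to return anything, and the two method signatures make the command/query split visible.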

Between pre-generated sequence numbers and UUIDs, I prefer to use UUIDs. Having generated sequences will require a dedicated REST service responsible for ID reservation that must be called before the entity creation call. This is inconsistent with our goal of having coarse-grained services where we create an entity and save all its attributes with one service call.

Also, by using UUIDs, we avoid a possible bottleneck in the ID generator service. So, by its nature, it is a more scalable solution.

Anton Kril’s Opinion

I support pre-generated IDs. I will add the immutability argument to Alan's and Igor's list: with pre-generated IDs it is easier to make entities and DTOs immutable. The benefits of object immutability are well described by many authors on the internet. See the example by Vinai Kopp: https://www.slideshare.net/vinaikopp/architecture-inthesmallslides?#26. In short, immutability makes it easier to reason about the code.
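
A small sketch of the immutability argument: when the ID is supplied at construction time, the entity never needs to be mutated by the save operation. The example uses a frozen Python dataclass; the Customer class and the save function are hypothetical, and a dictionary stands in for the persistence layer.

```python
import dataclasses
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    id: uuid.UUID
    email: str


def save(store, customer):
    # Persisting does not need to touch the entity: the ID is already set.
    store[customer.id] = customer


store = {}
customer = Customer(id=uuid.uuid4(), email="a@example.com")
save(store, customer)

# Any attempt to modify the saved entity is rejected.
try:
    customer.id = uuid.uuid4()
    mutated = True
except dataclasses.FrozenInstanceError:
    mutated = False

# Changes are expressed as new objects that keep the same identity.
updated = dataclasses.replace(customer, email="b@example.com")
```

With database-allocated IDs the same class could not be frozen, because save would have to write the new ID back into the object.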

Additionally, rule 2.2 in our Technical Guidelines states that objects should be ready to use after instantiation. It reduces the complexity of code and makes interfaces more natural and easier to use. The current approach of assigning the ID after the entity is persisted contradicts this rule.

As for UUIDs versus sequence numbers, I lean towards UUIDs. It is a popular and well-known approach, and allows you to avoid extra WebApi calls (the identifier retrieval call). While I agree with the validity of the “inconvenience” argument against it, I think it does not outweigh the benefits of UUIDs.

Your Opinion!

This blog post describes a possible future change in the way ID allocation for entities will work. Pre-allocating IDs solves a number of issues with the current scheme.

For a change, we included the perspectives of three different architects in this post. Why not add your perspective in the comments below? Which of the above approaches resonates with you, and why?