Entity ID Allocation Schemes

akent99 · ‎06-07-2017

We are considering some changes in allocating IDs for entities. This includes well known entities (Product, Category, Customer, Order, etc.) and other secondary entities (newsletter subscriptions, authentication rules, etc.). This blog post discusses pros and cons of different strategies with a focus on Magento. We wanted to open this discussion up for community feedback. This is being considered as a part of possible future database API work. Comments are welcome!

Please note that these techniques are well known approaches, built into technologies like the Java JPA. This discussion is about Magento adoption of these strategies, not the strategies themselves.

Auto Increment IDs

Today, many entities use auto-increment IDs allocated by MySQL – when you insert a record into a table, MySQL allocates a unique ID automatically if the ID column is NULL.

One of the side effects of this approach is when creating new entities, you do not know the entity ID until after the record is inserted. Application logic must insert a record and then ask what ID was allocated. This works, but there are a few undesirable side effects.

If you need to insert related records (records that reference each other using IDs), your application logic must insert one record, then wait for the response to work out the ID allocated, insert the next record, wait for the response, etc. This means your business logic structure is somewhat dictated by interactions with the database.
When you construct an object in memory, it is only “complete” after it is stored to the database (because the ID is unknown). So code has to know how “complete” the in-memory representation is.
Saving records to the database changes the record contents (adds the ID). That feels “strange.”
Plugins can get a bit funky because the record fed into the save() function is not complete.
It violates the “Command” pattern (a software design pattern used by larger patterns such as CQRS that Magento is aiming to follow with the new database API).
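The insert-then-ask flow described above can be sketched as follows. This is only an illustration: SQLite stands in for MySQL, and the table and column names are made up for the example rather than taken from the Magento schema.

```python
import sqlite3

# SQLite's AUTOINCREMENT column stands in for a MySQL AUTO_INCREMENT column.
# Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)"
)
conn.execute(
    "CREATE TABLE customer_order ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, customer_id INTEGER NOT NULL, total REAL)"
)

# The in-memory record has no ID yet, so it is not "complete".
customer = {"id": None, "name": "Alice"}

# Step 1: insert the parent, then ask the database which ID it allocated.
cur = conn.execute("INSERT INTO customer (name) VALUES (?)", (customer["name"],))
customer["id"] = cur.lastrowid  # saving changed the record's contents

# Step 2: only now can the child record reference the parent.
order = {"id": None, "customer_id": customer["id"], "total": 42.0}
cur = conn.execute(
    "INSERT INTO customer_order (customer_id, total) VALUES (?, ?)",
    (order["customer_id"], order["total"]),
)
order["id"] = cur.lastrowid
conn.commit()
```

The order of the two inserts is forced by the database, not by the business logic, which is the coupling the list above describes.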

To be clear, Magento works today using the MySQL ID allocation approach – it would be wrong to say the approach is unacceptable. However, with the investigation into the new Magento persistence API, it is being reviewed to see if there is a better approach.

IDs vs Row IDs and Content Staging

It is worth briefly taking a side bar on how Content Staging works in Magento Enterprise Edition. Content Staging allows database changes to be future-dated, so they come into effect at some future time point (such as during a special sale event). This is a very popular Enterprise Edition feature.

The Content Staging implementation separates row IDs (a unique ID generated per table row) from entity IDs (IDs that are shared across versions of the record, exposed to external applications and APIs). Content Staging then associates a time span with each record. A time point is specified in queries against the database to identify the right version of the entity (or no entity if it is not supposed to exist at that time point). Internally, for performance, whenever possible Magento performs join queries on the row ID rather than the entity ID.

An implication of this is the entity ID cannot be allocated by inserting records into a time versioned table as the entity ID is shared between multiple database records (different versions of the same entity). So there is already some concept of IDs not being allocated by insertion into a table. To be precise, a separate non-versioned table is used to allocate entity IDs. This ID, once allocated, is used when saving versioned entity records.
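As a rough sketch of the idea (not the actual Content Staging schema), the example below allocates entity IDs from a separate non-versioned table and stores each version as its own row with its own row ID. The table names, column names, and the use of plain integers as time points are all assumptions made for illustration, and SQLite stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Non-versioned table whose only job is to allocate entity IDs (hypothetical name).
CREATE TABLE product_sequence (entity_id INTEGER PRIMARY KEY AUTOINCREMENT);

-- Versioned table: one row (and one row ID) per version of an entity.
CREATE TABLE product_version (
    row_id     INTEGER PRIMARY KEY AUTOINCREMENT,
    entity_id  INTEGER NOT NULL,
    valid_from INTEGER NOT NULL,  -- inclusive time point
    valid_to   INTEGER NOT NULL,  -- exclusive time point
    price      REAL NOT NULL
);
""")


def allocate_entity_id(conn):
    """Allocate an entity ID without touching the versioned table."""
    cur = conn.execute("INSERT INTO product_sequence DEFAULT VALUES")
    return cur.lastrowid


def add_version(conn, entity_id, valid_from, valid_to, price):
    """Store one version of the entity and return its row ID."""
    cur = conn.execute(
        "INSERT INTO product_version (entity_id, valid_from, valid_to, price) "
        "VALUES (?, ?, ?, ?)",
        (entity_id, valid_from, valid_to, price),
    )
    return cur.lastrowid


def price_at(conn, entity_id, time_point):
    """Return the price in effect at time_point, or None if no version applies."""
    row = conn.execute(
        "SELECT price FROM product_version "
        "WHERE entity_id = ? AND valid_from <= ? AND ? < valid_to",
        (entity_id, time_point, time_point),
    ).fetchone()
    return row[0] if row else None


entity_id = allocate_entity_id(conn)
regular_row = add_version(conn, entity_id, 0, 100, 20.0)   # regular price
sale_row = add_version(conn, entity_id, 100, 200, 15.0)    # future-dated sale price
```

Both versions share one entity ID but have different row IDs, and the time point passed to the query selects which version is visible.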

ID Allocation Strategies

In order to fully form an entity before saving it to disk, a unique ID must be allocated. The question is how to allocate this ID. Two well-known strategies are UUIDs and having a sequence number table that remembers the next ID to allocate. The following discussion explores these two approaches further.

UUIDs

There have been several variations of standards for allocating UUIDs, but the idea is to combine a timestamp (with sub-second granularity) and something unique about the machine that allocated it (such as the MAC address) to form an ID that is unique across all machines in the world. There are many articles on the internet describing how this is done. The important characteristics for Magento are that the values are relatively random and fairly long (16 bytes, normally displayed as a 36-character string composed of 32 hexadecimal characters and 4 hyphens).
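To make the sizes concrete, here is a small sketch using Python's standard uuid module. The timestamp-plus-machine scheme described above corresponds to version 1 UUIDs; version 4 UUIDs are built from random bits instead. Neither is a statement about which version Magento would choose.

```python
import uuid

time_based = uuid.uuid1()    # version 1: timestamp plus a node value (often MAC-derived)
random_based = uuid.uuid4()  # version 4: random bits

for value in (time_based, random_based):
    # Each UUID is 16 bytes, shown as 32 hexadecimal characters plus 4 hyphens.
    print(value.version, len(value.bytes), len(str(value)), str(value))
```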

The following are some positive characteristics of using UUIDs.

Any application (including applications external to Magento, such as a PIM or ERP integration) can allocate a UUID. UUIDs are commonly used in the computing industry.
They can help with security, as IDs are harder to guess (unlike sequence numbers).

They do have a number of negative characteristics, however.

They are longer than integer identifiers, so they take up more storage space.
Their random nature makes B-tree index structures less efficient for updating and querying. (Hash indexes are not affected in the same way.)
They are less convenient for humans: no human is going to remember a complete UUID (though they may remember the last few digits).
They take up more space to display (e.g., in debug messages, when querying MySQL tables directly, etc.).

See also the following blog post from Percona on ways to optimize usage of UUIDs: https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/.
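One commonly described optimization for version 1 UUIDs is to store them as 16 raw bytes with the timestamp fields moved so the most significant part of the time comes first, which makes newly generated values sort close together in a B-tree. The sketch below illustrates that reordering. It is not copied from the linked post, and the helper names make_v1 and to_ordered_bytes are invented for this example.

```python
import uuid


def make_v1(timestamp, node=0x0123456789AB, clock_seq=0):
    """Build a version 1 UUID from a 60-bit timestamp (deterministic, for illustration)."""
    time_low = timestamp & 0xFFFFFFFF
    time_mid = (timestamp >> 32) & 0xFFFF
    time_hi_version = ((timestamp >> 48) & 0x0FFF) | 0x1000  # version 1
    clock_seq_hi_variant = ((clock_seq >> 8) & 0x3F) | 0x80  # RFC 4122 variant
    clock_seq_low = clock_seq & 0xFF
    return uuid.UUID(
        fields=(time_low, time_mid, time_hi_version,
                clock_seq_hi_variant, clock_seq_low, node)
    )


def to_ordered_bytes(value):
    """Reorder the 16 bytes so the high-order time bits come first."""
    raw = value.bytes  # layout: time_low(4) time_mid(2) time_hi(2) clock_seq(2) node(6)
    return raw[6:8] + raw[4:6] + raw[0:4] + raw[8:]


# Two timestamps one tick apart, chosen so that time_low wraps around.
earlier = make_v1(0x1E7ABCDFFFFFFFF)
later = make_v1(0x1E7ABCDFFFFFFFF + 1)

print(earlier.bytes < later.bytes)                        # standard layout
print(to_ordered_bytes(earlier) < to_ordered_bytes(later))  # reordered layout
```

In the standard layout the low-order time bits lead, so the later UUID can sort before the earlier one. After reordering, byte order follows time order. This only helps time-based UUIDs, since random (version 4) UUIDs have no time component to move.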

Sequence Numbers

Another approach is to use a sequence number. A table or similar persistent storage can be used to remember the next ID to allocate. (One such scheme is already supported in Magento via the Sequence class.)

It is worth noting that you don’t need a database update per ID allocated – you would typically have an in-memory cache of say the next 100 values, so that you only need to update the database once every 100 values allocated. You can end up with gaps in the sequence in the case of power outages, but that is not common. So the overhead of allocating an ID can be kept pretty low.
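A minimal sketch of block allocation is shown below, assuming a single process and using SQLite in place of MySQL. The BlockSequence class and the id_sequence table are hypothetical names for this example and are not the Magento Sequence class.

```python
import sqlite3


class BlockSequence:
    """Hands out IDs from an in-memory block; updates the database once per block."""

    def __init__(self, conn, name, block_size=100):
        self.conn = conn
        self.name = name
        self.block_size = block_size
        self.next_id = 0
        self.block_end = 0   # exclusive; an empty block forces a reservation on first use
        self.db_updates = 0  # counts database round trips, for illustration only

    def _reserve_block(self):
        # Reserve the next block_size values in one transaction.
        with self.conn:
            self.conn.execute(
                "UPDATE id_sequence SET next_value = next_value + ? WHERE name = ?",
                (self.block_size, self.name),
            )
            (new_next,) = self.conn.execute(
                "SELECT next_value FROM id_sequence WHERE name = ?", (self.name,)
            ).fetchone()
        self.next_id = new_next - self.block_size
        self.block_end = new_next
        self.db_updates += 1

    def next(self):
        if self.next_id >= self.block_end:
            self._reserve_block()
        value = self.next_id
        self.next_id += 1
        return value


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE id_sequence (name TEXT PRIMARY KEY, next_value INTEGER NOT NULL)"
)
conn.execute("INSERT INTO id_sequence VALUES ('product', 1)")
conn.commit()

seq = BlockSequence(conn, "product")
ids = [seq.next() for _ in range(250)]  # 250 IDs, 3 database updates

# A fresh allocator (for example after a restart) starts at the next block,
# so the unused values 251 to 300 become a gap in the sequence.
restarted = BlockSequence(conn, "product")
first_after_restart = restarted.next()
```

With several application servers, the reservation step would also need to be atomic across processes, which this single-connection sketch does not attempt to show.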

It is also worth noting that some IDs can be lost if business logic allocated an ID but then failed to store the record in the database for some reason. This is okay, as long as it is not common to discard an ID.

The following are some positive characteristics of sequence numbers.

Values are small (a fixed size integer) making them storage efficient.
They are close to the current database ID allocation strategy.
Values are in a generally incrementing sequence, which can help with database efficiency in query operations.
Values are short making them easier to remember and making debugging easier for humans.
They can remain integers, like today, rather than changing the database field to binary or string.

However, there are negatives:

All applications that wish to allocate an ID need to talk to a central service. This could be a separate REST call, or in the case of a PIM or ERP could create extra integration complexity.
There are no security benefits as legal values are easy to guess compared to UUIDs.

It is worth noting that objects can have other attributes that are unique IDs. For example, products have the SKU field that is also unique. Thus, if an application (such as a PIM or ERP) wanted to use UUIDs, this can still be done by using a product attribute. The discussion in this blog post is focused on the Magento ID that is used internally.

Opinions

For a change, rather than express a single point of view from the Magento architects, this blog post shares three different views. Let us know if you find this approach more interesting.

Alan Kent’s Opinion

Pre-allocation of IDs (allocating IDs before saving records to the database) does have benefits over allocating IDs on record insertion. The question I always ask myself is if the benefit of a change is worth the impact. In this case, with Content Staging in Enterprise Edition, some of these changes have already been made, and I dislike having code (ID allocation in this case) behave in different ways in different areas of the codebase (Community vs Enterprise Edition). So, I am coming around to the arguments put forward to make this change.

If pre-allocation is the way we go, I prefer sequence numbers to UUIDs because they are easier for humans to deal with. Debug messages are easier to read, database tables are easier to query, and you can display them to end users in the Admin. They are also closer to how Magento behaves today, so there is less ripple. For example, the Admin displays IDs at times – changing them to UUIDs would require the IDs to be hidden as they are too long for display in the Admin.

Performance-wise there are negatives to UUIDs (more storage required, less efficient to index) and positives (no central coordinating resource to interact with, easier integration with external systems).

Pre-allocating IDs for records can also help in situations like batch operations where records need to cross reference each other. It is also more portable across different database technologies, helping to loosen the dependence on MySQL.
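As an illustration of the batch case, the sketch below builds a set of cross-referencing records entirely in memory and then writes them in one transaction. The new_id function is a placeholder: it uses a UUID here only to keep the example short, and a call to a sequence allocator would fit the same shape. The schema is made up, and SQLite stands in for MySQL.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE category (id TEXT PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE product (
    id          TEXT PRIMARY KEY,
    category_id TEXT NOT NULL REFERENCES category(id),
    sku         TEXT NOT NULL
);
""")


def new_id():
    """Placeholder ID source; a pre-allocated sequence number would work the same way."""
    return str(uuid.uuid4())


# All records, including their cross references, are complete before any insert.
category = {"id": new_id(), "name": "Shoes"}
products = [
    {"id": new_id(), "category_id": category["id"], "sku": sku}
    for sku in ("SHOE-1", "SHOE-2", "SHOE-3")
]

# One transaction, no waiting for the database to hand back an ID between inserts.
with conn:
    conn.execute("INSERT INTO category (id, name) VALUES (:id, :name)", category)
    conn.executemany(
        "INSERT INTO product (id, category_id, sku) VALUES (:id, :category_id, :sku)",
        products,
    )

linked = conn.execute(
    "SELECT COUNT(*) FROM product p JOIN category c ON c.id = p.category_id"
).fetchone()[0]
```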

Igor Miniailo’s Opinion

I definitely like the idea of having persistence-agnostic entities, where the business logic knows nothing about the fact that the entity will be persisted. Database-generated IDs are a good example of a leaky abstraction, because most developers or architects implicitly rely on a database implementation to provide the ID. What if an implementation does not have the ability to generate a unique ID? This sounds like a Liskov Substitution Principle violation just waiting to happen.

Also, persistence ignorance will help us implement a better validation process in the future. The entity will be in a valid state as soon as it is created, rather than only after it has been persisted. This aligns with Domain-Driven Design (DDD) architectural principles.

Persistence-agnostic entities are much easier to cover with unit tests as well.

Also, returning a newly created ID from the create operation violates CQS (Command Query Separation), as the create method ought to be a Command, and commands are not supposed to return a value. From the perspective of encapsulation this is problematic. If we look only at the method signature, we could be tricked into believing that the create method is a Query and therefore has no side effects.
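A small sketch of what this separation could look like when the caller supplies the ID is shown below. The CreateCustomer and CustomerStore names are invented for the example and are not part of any Magento API.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class CreateCustomer:
    """Command object: it already carries the ID, so nothing needs to be returned."""
    customer_id: str
    email: str


class CustomerStore:
    """Hypothetical in-memory store used only to show the method signatures."""

    def __init__(self):
        self._rows = {}

    def handle(self, command: CreateCustomer) -> None:
        """Command: changes state and returns nothing."""
        self._rows[command.customer_id] = {"email": command.email}

    def email_of(self, customer_id: str) -> str:
        """Query: returns data and changes nothing."""
        return self._rows[customer_id]["email"]


store = CustomerStore()
command = CreateCustomer(customer_id=str(uuid.uuid4()), email="alice@example.com")
result = store.handle(command)            # the command returns None
email = store.email_of(command.customer_id)  # the caller already knows the ID
```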

Between pre-generated sequence numbers and UUIDs, I prefer to use UUIDs. Having generated sequences will require a dedicated REST service responsible for ID reservation that must be called before the entity creation call. This is inconsistent with our goal of having coarse-grained services where we create an entity and save all its attributes with one service call.

Also, with UUIDs we avoid a possible bottleneck in the ID generator service. So, by its nature, it is a more scalable solution.

Anton Kril’s Opinion

I support pre-generated IDs. I will add the immutability argument to Alan's and Igor's list: with pre-generated IDs it is easier to make entities and DTOs immutable. The benefits of object immutability are well described by many authors on the internet. See the example by Vinai Kopp: https://www.slideshare.net/vinaikopp/architecture-inthesmallslides?#26. In short, immutability makes it easier to reason about the code.
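As a rough illustration of the immutability point, the sketch below uses a frozen Python dataclass. Because the ID is known at construction time, there is no need for a setter that fills it in after saving. The Product class here is hypothetical and unrelated to Magento's own Product model.

```python
import uuid
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Product:
    """Immutable entity: every field, including the ID, is set at construction."""
    id: str
    sku: str
    price: float


product = Product(id=str(uuid.uuid4()), sku="SHOE-1", price=20.0)

# A change produces a new object; the original is left untouched.
discounted = replace(product, price=15.0)

try:
    product.price = 0.0  # frozen dataclasses reject attribute assignment
except FrozenInstanceError:
    mutation_blocked = True
else:
    mutation_blocked = False
```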

Additionally, rule 2.2 in our Technical Guidelines states that objects should be ready to use after instantiation. It reduces the complexity of code and makes interfaces more natural and easier to use. The current approach of assigning the ID after the entity is persisted contradicts this rule.

As for UUIDs versus sequence numbers, I lean towards UUIDs. It is a popular and well-known approach, and allows you to avoid extra WebApi calls (the identifier retrieval call). While I agree with the validity of the “inconvenience” argument against it, I think it does not outweigh the benefits of UUIDs.

Your Opinion!

This blog post describes a possible future change in the way ID allocation for entities will work. Pre-allocating IDs solves a number of issues with the current scheme.

For a change, we included the perspective of three different architects in this post. Why not add your perspective in the comments below! Which of the above approaches resonates with you and why?

Navarr · ‎06-07-2017

While UUIDs might not be as efficient, they seem like the future. They also shut down theoretical attack vectors that rely on knowing that objects are referenced by an incrementing ID.

andrewhowdencom · ‎06-07-2017

Being able to schedule the creation of a record from another service (some sort of product management) and knowing that its ID will be persistent through the stack would make it much easier to debug through multiple applications. UUIDs also allow for retries, which sequential IDs do not (as I understand it) as they require the knowledge of state that retrying denies.

+1 for UUID.

akent99 · ‎06-07-2017

To clarify, there are two issues here.

Should we allocate ids in advance (vs when the record is inserted)? I think that would address the concern of keeping an id through the full stack and the retry concern.
UUID vs small integers. You can preallocate small integers just like UUIDs. The difference is only in that you have to talk to a central service to get a small integer, whereas anyone can create a UUID without talking to that central service.

Going with UUIDs avoids the central allocation of ids, but means you would never show product ids to merchants any more e.g. in Admin (because they are so long). A project I was on that used UUIDs I personally found painful for debugging just because they were so long and hard to query for. E.g. With an 80 column terminal window, doing a query in MySQL you will use up half the screen width immediately with the id column.

So maybe there are two separate points in comments:

(1) Do you support preallocation of ids (whatever form they take)?

(2) Is a central id allocation service too painful compared to using UUIDs? (This is really about the length of the id: are shorter ids valuable enough to justify a central allocation service?)

jhodgie · ‎06-09-2017

1. I think Magento should use preallocation of ids.

2. I think Magento should use UUIDs and then do something like what git does with its commitish and only show the necessary length of string to be unique while still useful in places you want to show it.

akent99 · ‎06-10-2017

Extra reading links from twitter:

mamut · ‎06-14-2017

I was happy to flag it as "nice to have" as it would make things easier for providing non-MySQL implementations of repository service contracts, building master-master replication or introducing sharding to databases. But then reality called back and said that this would probably be a very painful process for existing and running stores - just imagine rekeying a whole database from simple ints to UUIDs on bigger stores with hundreds of thousands of products and god knows how many orders.
For sure the UUID way seems more flexible and scalable compared to central allocation. Reading related sources shows that UUIDs were not designed to be good primary keys, and commonly accepted standard algorithms for generating UUIDs produce suboptimal results. This would require implementing non-standard algorithms that provide values that are considered good primary keys (like KSUID mentioned in https://segment.com/blog/a-brief-history-of-the-uuid/). But this would mean non-standard implementations also being required in additional systems (like PIM, ERP) in order to enforce the same ID generation algorithm (one that produces good primary keys).

(In short, by a good primary key I mean values that are monotonic / k-sortable.)

akent99 · ‎06-14-2017

Good point on upgrade issues. Fewer issues if we keep the existing ids.

We can of course include an upgrade script, but the upgrade would be substantial as we would need to patch all existing ids and all references to those ids in the schema. A less exciting yet practical consideration.

aleron75 · ‎09-05-2017

Hello.

I think that almost all are agreeing on the fact that having pre-generated IDs would be a good step forward.

Based on the fact that, as solution provider, we manage several environments (dev + staging + prod), having UUIDs would allow us to port contents more easily between them. Thus my vote goes to UUIDs even if I acknowledge their cons.

My best,

-Alessandro

Tiago Sampaio · ‎08-31-2020

Did it have any updates? I'd like to know if this discussion stepped forward.

-Tiago

aleron75 · ‎08-31-2020

As far as I know, no. But nowadays the best place to discuss such improvements is the app design channel on Magento Slack: https://magentocommeng.slack.com/archives/CBSL1DF8B

My best.

Alessandro