The Proposed Magento Persistence Layer

akent99 · 08-18-2017

This blog post summarizes current thoughts on the proposed persistence layer for Magento 2. This work is not currently assigned to a specific release. This post is part of a series that shares internal thinking as it develops, so that we can gather community feedback along the way.

For those interested in history, during Magento 2.0 development it was decided not to tackle the separation of the database persistence layer from business logic, for timeline reasons. This work commenced in Magento 2.1 as a part of the Staging feature of Magento Enterprise Edition, but the code was not stabilized for general usage. Developers outside the core team were advised not to use the new APIs in 2.1 as they were likely to undergo further change. This blog post describes the proposed nature of such changes.

Objective

There are a number of objectives behind this work. Some key objectives include:

Remove from business logic the hard-coded dependency on MySQL, opening the opportunity to support a range of relational and non-relational database technologies, which in turn opens the opportunity for various performance improvements.
Improve performance for web API requests getting data from the database by reducing the amount of code between the web API request coming in and the underlying database technology. One significant inefficiency here is due to the use of model classes that load up a complete instance in memory, even if the web API call does not need all the data.
Simplify development by providing a higher-level API for business logic to use.

Logical and Physical Schemas

It is proposed to introduce two levels of schema.

The logical schema is defined in terms of entities and attributes. The persistence layer API used by business logic uses logical schema concepts.
The physical schema is defined per underlying storage technology. For example, when using MySQL for storage, the physical schema is the MySQL database schema of tables with columns and rows.

Semantics of the persistence API are defined based on the logical schema. All supported storage technology adapters must conform to the logical schema semantics.
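To make the two levels concrete, here is a minimal PHP sketch. The array layout, the table and column names, and the physicalColumnsFor() helper are all hypothetical and only illustrate the idea; they are not the proposed Magento configuration format.

```php
<?php
// Hypothetical logical schema: entities and attributes only, no storage details.
$logicalSchema = [
    'customer' => [
        'attributes' => ['id' => 'int', 'email' => 'string', 'first_name' => 'string'],
    ],
];

// Hypothetical physical mapping owned by one storage adapter (a relational one here).
$mysqlMapping = [
    'customer' => [
        'table'   => 'customer_entity',
        'columns' => ['id' => 'entity_id', 'email' => 'email', 'first_name' => 'firstname'],
    ],
];

/**
 * Translate logical attribute names into physical column names.
 * Rejects attributes that the logical schema does not declare.
 */
function physicalColumnsFor(array $logical, array $mapping, string $entity, array $attributes): array
{
    $columns = [];
    foreach ($attributes as $attribute) {
        if (!isset($logical[$entity]['attributes'][$attribute])) {
            throw new InvalidArgumentException("Unknown attribute: $entity.$attribute");
        }
        $columns[$attribute] = $mapping[$entity]['columns'][$attribute];
    }
    return $columns;
}

// Business logic only ever names logical attributes; the adapter resolves the rest.
$columns = physicalColumnsFor($logicalSchema, $mysqlMapping, 'customer', ['id', 'first_name']);
```

In this sketch, swapping the storage technology would mean supplying a different mapping while the logical schema and the calling code stay the same.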

Master and Index Data

Master data is data that is managed (inserted, updated, deleted) directly by business logic. Inside the persistence layer, updates to master data can trigger indexers to update index structures (as is done today in Magento). That is, the current indexer infrastructure will become a part of the persistence layer.

Master data is the source of truth, and represents the precise semantics of entities for business purposes without concern for performance. The purpose of indexes is to improve performance for consumers on the Magento store front. When querying, the persistence layer may choose to use index data to speed up a query as long as the semantics of the query do not change, although there can be a time lag before indexes are refreshed from master data.

In the short term, it is anticipated that master data will continue to be stored in a relational database (with full transactional support) whereas indexes may be stored in a wider range of technologies. The goal is to open up the opportunity for innovation by the community at this level – for example, adding support for an in-memory catalog implementation in Redis for improved performance.

Entities and Attributes

At the logical schema level, the persistence layer will operate on “entities” and “attributes” (rather than “tables” and “columns” in MySQL). An “attribute value set” is a set of attribute name/value pairs (rather than a “row” in MySQL). (The term “tuple” is being considered as a shorthand for “attribute value set”.)

There are three ways attributes can be declared:

Core attributes are defined in the module where the entity is declared.
Extension attributes are defined in other modules that wish to extend the set of core attributes.
Custom attributes can be added via the Admin.

While the declaration of such attributes is different, a goal of the persistence layer is to provide consistent access to attributes independent of how they are declared.

Entities can be declared with scoped attributes (such as by store view or by website), as currently supported by the EAV model (see the previous “Magento EAV Review” blog post for more details), or with simple scalar types.

Complex Data Types

Support is being considered for attributes with complex data types, such as arrays of nested attribute value sets. This would allow, for example, product entities to hold an array of image information, where each image can further hold an array of information about resolutions and dimensions of all image variations available.

Such nested attributes would only be able to be retrieved by retrieving the enclosing entity (you cannot query and return a set of images if they are nested in product entities – you must query for products (at the entity level), then get the image information from within the product entities).

At the physical schema level, if a JSON storage engine is used, the complex data types may be encoded directly into the same JSON document. For relational tables, a separate table may be used, or the value could be encoded as JSON and stored in a column – although querying on such data may not be supported.

Queries can specify entity level existence filter criteria (find all products that have an image with a nested image type attribute of PNG). What is not planned to be supported is filter criteria that must apply across multiple attributes within the same nested object (“find all products with a nested image that is both a PNG image and has a DPI of 200, where image type and DPI are two separate attributes in the same nested repeating group”). Put simply, the goal is to allow a complex data structure to be stored within one entity when appropriate, not to get carried away with an overly complex and powerful query language.
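The difference between the two kinds of filter can be shown with a small in-memory sketch. The product data and the hasNested() and findSkus() helpers are hypothetical; the point is only that each criterion is tested independently at the entity level.

```php
<?php
// Hypothetical product tuples with a nested, repeating "images" attribute.
$products = [
    ['sku' => 'A', 'images' => [['type' => 'PNG', 'dpi' => 72], ['type' => 'JPG', 'dpi' => 200]]],
    ['sku' => 'B', 'images' => [['type' => 'JPG', 'dpi' => 200]]],
    ['sku' => 'C', 'images' => [['type' => 'PNG', 'dpi' => 200]]],
];

/** Entity-level existence test: does ANY nested value set have $attribute equal to $value? */
function hasNested(array $entity, string $group, string $attribute, $value): bool
{
    foreach ($entity[$group] ?? [] as $nested) {
        if (($nested[$attribute] ?? null) === $value) {
            return true;
        }
    }
    return false;
}

/** Return the SKUs of products that satisfy every existence criterion independently. */
function findSkus(array $products, array $criteria): array
{
    $skus = [];
    foreach ($products as $product) {
        $matches = true;
        foreach ($criteria as $criterion) {
            list($attribute, $value) = $criterion;
            $matches = $matches && hasNested($product, 'images', $attribute, $value);
        }
        if ($matches) {
            $skus[] = $product['sku'];
        }
    }
    return $skus;
}

// Supported style: "has an image of type PNG".
$pngSkus = findSkus($products, [['type', 'PNG']]);

// Two independent criteria: product A matches even though no single image
// is both PNG and 200 DPI. A correlated filter would return only C.
$pngAnd200 = findSkus($products, [['type', 'PNG'], ['dpi', 200]]);
```

Product A appearing in the second result is exactly the behavior the paragraph above describes: criteria are not correlated within one nested object.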

The choice of what data types to nest within the parent entity is up to the application. It is likely that most existing tables in Magento would remain as separate entities – nesting of complex data types is likely to be the exception rather than the rule. It may be more common for queries to return nested data types by aggregating data than to store master data with complex data types.

Complex data types are being considered, but may not be supported due to the degree of additional complexity they can introduce when querying the underlying database. This complexity might be offset by not supporting queries on attributes within complex data types.

Removal of PHP Model Class Dependencies

One goal of the new persistence layer is to speed up query evaluation by bypassing the need for model classes. There will still be service contracts for specific entities, but the current repository service contracts backed by model classes will likely be phased out. Code performing queries will be encouraged (possibly mandated) to specify all wanted attributes, allowing the persistence layer to avoid the overhead of returning attributes that are not needed by the current request.

This may require support for declaring attributes within the database schema (that is, inside the persistence layer) that are computed from other attributes of the current entity. For example, there could be a full name attribute computed from first and last name attributes. This technique can also be used as a version migration strategy – even if the database schema is changed, the old attributes can sometimes be retained for a period of time by computing them from new attributes.
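As a rough illustration of the full name example, the sketch below declares a computed attribute next to the stored ones and resolves it on request. The declaration format and the selectAttributes() helper are assumptions made for this example, not a proposed API.

```php
<?php
// Hypothetical declaration: computed attribute name => source attributes and a function.
$computed = [
    'full_name' => [
        'from'    => ['first_name', 'last_name'],
        'compute' => function (array $v): string {
            return trim($v['first_name'] . ' ' . $v['last_name']);
        },
    ],
];

/** Return only the requested attributes, computing derived ones from stored values. */
function selectAttributes(array $stored, array $computed, array $wanted): array
{
    $result = [];
    foreach ($wanted as $name) {
        if (isset($computed[$name])) {
            // Pass only the declared source attributes to the compute function.
            $inputs = array_intersect_key($stored, array_flip($computed[$name]['from']));
            $result[$name] = $computed[$name]['compute']($inputs);
        } else {
            $result[$name] = $stored[$name];
        }
    }
    return $result;
}

$stored = ['id' => 7, 'first_name' => 'Ada', 'last_name' => 'Lovelace'];
$row = selectAttributes($stored, $computed, ['id', 'full_name']);
```

The same mechanism covers the migration case: if a stored attribute is later split or renamed, the old name can be kept as a computed attribute for a while.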

Without model classes, entities will still be declared using configuration files. A generic “attribute value set” (a tuple of attribute names and values) will instead be introduced for retrieved data, and a generic “attribute operation set” (a tuple of attribute names and attribute update operations) will be defined for updating data. Operations can set an attribute to a value or remove the attribute. (Removing an attribute is different to setting it to null, as described in the “EAV Review” blog post referenced above.)

Note: It may be desirable to merge the attribute value set and attribute operation set types, but that is left for a later discussion.

Indexes will similarly be defined with configuration files. Indexer code (in PHP) will be driven from configuration files as much as possible to allow new attributes to be defined without PHP code changes.

This whole area, however, is an important one to explore further. Removing PHP classes for models can improve performance, but also implies fewer points for plugin methods to tap into. (“Save” events would still be supported per entity.) Feedback is welcome on how often people are putting plugins directly on model classes and why, to make sure such needs are not missed in the new approach.

Attribute Value Set Tuples

As mentioned above, an attribute value set is a tuple of attribute name/value pairs. One aspect still under consideration is how to deal with scope values in an attribute value set. For a store front operation, the scope is always known and so the returned data will have all scope based defaults resolved. The result of a query will always be a simple set of name/value pairs.

In the Admin however, management of scopes is important. When retrieving data, it is important to understand what scopes have an attribute value defined versus which ones use the default scope. This additional information needs to be encoded within the attribute value set information.

It is still being considered whether to use different attribute value set definitions - one for when the scope is known and a different type when the data can contain scopes. When scopes are used, each attribute value may be represented as an array of values, indexed by scope. For example, the value of a “description” attribute may be a PHP array indexed by store view id where the values in the array are the description text (so you have a description per store view).
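A minimal sketch of the scoped representation follows. It assumes, purely for illustration, that key 0 holds the default-scope value and that other keys are store view ids; the helper names are hypothetical.

```php
<?php
// Assumption for this sketch: key 0 holds the default-scope value,
// other keys are store view ids that override it.
$description = [
    0 => 'Default description',
    2 => 'Beschreibung',
];

/** Store front view: resolve the value for one store view, falling back to the default. */
function resolveForStoreView(array $scopedValue, int $storeViewId)
{
    if (array_key_exists($storeViewId, $scopedValue)) {
        return $scopedValue[$storeViewId];
    }
    return $scopedValue[0] ?? null;
}

/** Admin view: is the value defined at this scope, or inherited from the default? */
function isInherited(array $scopedValue, int $storeViewId): bool
{
    return !array_key_exists($storeViewId, $scopedValue);
}

$german  = resolveForStoreView($description, 2); // defined for store view 2
$french  = resolveForStoreView($description, 3); // not defined, uses the default
$inherit = isInherited($description, 3);
```

The store front would only ever see the resolved scalar, while the Admin needs the whole array to show which scopes override the default.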

Entity update operations also need to deal with the concept of scope.

Attribute Value Set Tuples and Service Contract Data Entities

You can think of an attribute value set like a row in a relational database table. The implementation class could implement the PHP ArrayAccess interface, allowing code to reference attributes by name easily ($prod['name']). Core, extension, and custom attributes are planned to be defined in a single namespace for easier and more consistent access, unlike today where there are different approaches to retrieve core, extension, and custom attributes. (As a reminder, everything in this blog post represents current thoughts and is subject to change. Feedback is welcome to shape these decisions.)
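One way such a class might look is sketched below. The AttributeValueSet class name and its behavior on missing attributes are assumptions for illustration only; ArrayAccess itself is the standard PHP interface.

```php
<?php
/**
 * Hypothetical attribute value set: a thin ArrayAccess wrapper around name/value pairs.
 */
class AttributeValueSet implements ArrayAccess
{
    private $values;

    public function __construct(array $values = [])
    {
        $this->values = $values;
    }

    public function offsetExists($name): bool
    {
        // array_key_exists keeps "set to null" distinct from "not present".
        return array_key_exists($name, $this->values);
    }

    #[\ReturnTypeWillChange]
    public function offsetGet($name)
    {
        if (!array_key_exists($name, $this->values)) {
            // Assumed behavior: asking for an attribute that was not loaded is an error.
            throw new OutOfBoundsException("Attribute not loaded: $name");
        }
        return $this->values[$name];
    }

    public function offsetSet($name, $value): void
    {
        if ($name === null) {
            throw new InvalidArgumentException('Attributes must be set by name');
        }
        $this->values[$name] = $value;
    }

    public function offsetUnset($name): void
    {
        unset($this->values[$name]);
    }
}

$prod = new AttributeValueSet(['sku' => 'A-1', 'name' => 'Widget']);
$prod['color'] = 'red';   // same access style regardless of how the attribute was declared
$name = $prod['name'];
```

Throwing on an attribute that was not requested is one possible design; it would surface queries that forgot to list an attribute instead of silently returning null.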

Data entity interfaces defined in service contracts (e.g. Magento\Catalog\Api\Data\ProductInterface) are planned to be retained as a type safe way to set attributes. However, instead of having a dedicated DTO class per interface, the data entity methods such as setName() would be implemented by calling a set() method on attribute value set tuples. For example:

class ProductTupleFacade implements ProductInterface, TupleFacadeInterface {
    private $tuple;
    // ...

    // From ProductInterface
    public function setName(string $name) {
        $this->tuple[ProductInterface::NAME] = $name;
    }

    // From ProductInterface
    public function getName() : string {
        return $this->tuple[ProductInterface::NAME];
    }

    // From TupleFacadeInterface
    public function tuple() {
        return $this->tuple;
    }
}

Given a data entity, it is straightforward to retrieve the tuple and manipulate it directly if desired. REST (and SOAP) web API infrastructure will also be extended to understand the attribute value set and attribute operation set tuples as native data types. The goal is to de-serialize requests directly into the same tuple data structure that the persistence API uses to avoid data copying and improve performance, but with an implementation of the service contract interfaces for type safety and backwards compatibility. This area is however still under review due to the possible separation of APIs to retrieve data for use by the store front (no scopes) vs Admin (with scopes).

Id Allocation

A previous post described a proposal around changing the id allocation strategy from allocating ids upon insertion of records into the database to a scheme of pre-allocation. This change is also a part of the proposed persistence API changes. Please refer to the previous blog post for further details.

Query API

A generic query API is a major part of the new persistence API. The API will take a name and filter criteria as input parameters. The filter criteria will be similar to SQL constraints (the WHERE clause), but will be abstracted from SQL so filters can be mapped to other storage engine query strings as required.
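To show what "abstracted from SQL" could mean in practice, the sketch below represents filter criteria as plain data and hands the same structure to two adapters. The filter format and both function names are hypothetical; the real API may look quite different.

```php
<?php
// Hypothetical storage-neutral filter: a list of [attribute, operator, value], ANDed together.
$filter = [
    ['price', '<', 50],
    ['status', '=', 'enabled'],
];

/** One possible adapter: render the filter as a parameterized SQL WHERE clause. */
function toSqlWhere(array $filter): array
{
    $allowed = ['=', '!=', '<', '>', '<=', '>='];
    $clauses = [];
    $params = [];
    foreach ($filter as $condition) {
        list($attribute, $operator, $value) = $condition;
        if (!in_array($operator, $allowed, true) || !preg_match('/^[a-z_]+$/', $attribute)) {
            throw new InvalidArgumentException('Unsupported filter condition');
        }
        $clauses[] = "`$attribute` $operator ?";
        $params[] = $value;   // values travel as bound parameters, never inline
    }
    return [implode(' AND ', $clauses), $params];
}

/** Another adapter: evaluate the same filter against an in-memory tuple. */
function matches(array $tuple, array $filter): bool
{
    foreach ($filter as $condition) {
        list($attribute, $operator, $value) = $condition;
        if (!array_key_exists($attribute, $tuple)) {
            return false;
        }
        $actual = $tuple[$attribute];
        switch ($operator) {
            case '=':  $ok = $actual == $value; break;
            case '!=': $ok = $actual != $value; break;
            case '<':  $ok = $actual < $value; break;
            case '>':  $ok = $actual > $value; break;
            case '<=': $ok = $actual <= $value; break;
            case '>=': $ok = $actual >= $value; break;
            default:   throw new InvalidArgumentException('Unsupported operator');
        }
        if (!$ok) {
            return false;
        }
    }
    return true;
}

list($where, $params) = toSqlWhere($filter);
$hit  = matches(['price' => 20, 'status' => 'enabled'], $filter);
$miss = matches(['price' => 80, 'status' => 'enabled'], $filter);
```

Because the caller never writes SQL, an adapter for a non-relational engine only needs its own translation of the same structure.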

In addition, queries on time-versioned resources (supported by the Staging feature of Enterprise Edition) are required to specify the time point the search should be conducted at (the default being “now”). The id allocation strategy blog post discusses how time-versioned entities are represented in Magento today.
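The effect of the time point parameter can be sketched as follows. The representation used here (one record per version with a half-open validity interval of Unix timestamps) is an assumption for the example and is not necessarily how Staging stores versions.

```php
<?php
// Assumed representation: each version is valid for the half-open interval [from, to).
$versions = [
    ['name' => 'Widget',      'from' => 0,          'to' => 1500000000],
    ['name' => 'Widget Sale', 'from' => 1500000000, 'to' => 1510000000],
    ['name' => 'Widget',      'from' => 1510000000, 'to' => PHP_INT_MAX],
];

/** Return the version valid at $timePoint, defaulting to the current time. */
function versionAt(array $versions, ?int $timePoint = null): ?array
{
    $timePoint = $timePoint ?? time();
    foreach ($versions as $version) {
        if ($version['from'] <= $timePoint && $timePoint < $version['to']) {
            return $version;
        }
    }
    return null;
}

$duringCampaign = versionAt($versions, 1505000000); // explicit time point
$current        = versionAt($versions);             // defaults to "now"
```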

One important restriction is that the query language does not support join operators via the API. Instead a named query view must be defined containing any joins. That means query API calls can provide an entity name or query view name. Inside the persistence layer, indexes may be used to improve the performance of queries, but that is transparent to the caller.

A likely short-term restriction of query views is that views will be restricted to a single storage engine: you cannot specify a join across MySQL and Elasticsearch. To combine data from multiple sources, an indexer would be used to extract data from multiple storage engines, with the result stored in a single storage engine, ready for access via the query API.

Finally, the query API will also accept a list of attributes that the caller wishes returned. This allows the persistence engine to determine which values need to be retrieved from the database, which is particularly important with EAV attributes in master data, as each attribute may trigger an additional LEFT JOIN operator.
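The cost argument can be made visible with a small query builder sketch. Table names, column names, and attribute ids below are simplified placeholders rather than the real Magento schema, and buildSelect() is a hypothetical helper.

```php
<?php
/**
 * Build a SELECT that adds one LEFT JOIN per requested EAV attribute.
 * Table and column names are simplified placeholders, not the real Magento schema.
 */
function buildSelect(array $wanted, array $staticColumns, array $eavAttributeIds): string
{
    $select = [];
    $joins = [];
    foreach ($wanted as $name) {
        if (in_array($name, $staticColumns, true)) {
            // Stored directly on the entity table: no join needed.
            $select[] = "e.$name";
        } elseif (isset($eavAttributeIds[$name])) {
            $alias = "v_$name";
            $id = (int) $eavAttributeIds[$name];
            $select[] = "$alias.value AS $name";
            $joins[] = "LEFT JOIN product_value AS $alias"
                . " ON $alias.entity_id = e.entity_id AND $alias.attribute_id = $id";
        } else {
            throw new InvalidArgumentException("Unknown attribute: $name");
        }
    }
    return trim('SELECT ' . implode(', ', $select) . ' FROM product AS e ' . implode(' ', $joins));
}

$static = ['entity_id', 'sku'];
$eav = ['name' => 73, 'description' => 75, 'color' => 93]; // arbitrary ids for the example

$narrow = buildSelect(['sku'], $static, $eav);                  // no joins at all
$wide   = buildSelect(['sku', 'name', 'color'], $static, $eav); // two joins
```

A caller that asks only for sku gets a single-table query, which is the saving the attribute list is meant to enable.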

To expose the query API via web APIs as efficiently as possible, a query service (a service contract that REST etc. URLs can bind to) will be defined that will connect directly to the persistence layer query API. By using the same data types for tuples, this will avoid inefficiencies of internal data copying during query evaluation.

Sharding

Magento Enterprise Edition includes support for table sharding, where some complete tables are moved off to a separate database server, such as checkout related tables. This means that what used to be a simple join now has to span database servers. Foreign key constraints are also more problematic.

Table sharding will be implemented within the persistence layer so that business logic does not need to change when table sharding is enabled.
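As a very small sketch of what "inside the persistence layer" could mean here, the routing decision can live in one lookup that callers never see. The entity names, connection names, and connectionFor() helper are illustrative assumptions.

```php
<?php
// Hypothetical routing table: entity name => connection name.
// Entities that are not listed stay on the default connection.
$shardMap = ['quote' => 'checkout', 'sales_order' => 'sales'];

/** Pick the connection for an entity; business logic only ever passes the entity name. */
function connectionFor(string $entity, array $shardMap): string
{
    return $shardMap[$entity] ?? 'default';
}

$quoteConnection   = connectionFor('quote', $shardMap);
$productConnection = connectionFor('catalog_product', $shardMap);

// With sharding disabled the map is simply empty and every entity uses 'default'.
$unsharded = connectionFor('quote', []);
```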

Row sharding (storing different rows of one table in different database servers) could also be supported in the persistence layer, although there is no immediate plan to do so.

Update API

In addition to the generic query API (low level) and service contract (high level), there will be a generic update API and service. The update service calls the update API but adds validation support and possibly event support. (It is still under consideration at exactly which level events will slot in.) Business logic is expected to use the update service in order for data validation to be performed (and possibly for events to be triggered).

The update API is for manipulating master data entities. It supports operations on entities such as insert/create, update, and delete. Because of the proposed change to id allocation mentioned previously, save() in the old repository interfaces would typically be replaced by separate insert and update operations. An “upsert” operation is also being considered (insert if record with that key is not found in the database, update the existing record if it exists).
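The difference between the three operations can be sketched with an in-memory store. The InMemoryEntityStore class and its method names are hypothetical, and the ids are supplied by the caller to mirror the pre-allocation proposal.

```php
<?php
/** Hypothetical in-memory store showing separate insert, update, and upsert semantics. */
class InMemoryEntityStore
{
    private $rows = [];

    public function insert(int $id, array $values): void
    {
        if (isset($this->rows[$id])) {
            throw new RuntimeException("Entity $id already exists");
        }
        $this->rows[$id] = $values;
    }

    public function update(int $id, array $values): void
    {
        if (!isset($this->rows[$id])) {
            throw new RuntimeException("Entity $id not found");
        }
        // Only the listed attributes change; the others keep their stored values.
        $this->rows[$id] = array_merge($this->rows[$id], $values);
    }

    public function upsert(int $id, array $values): void
    {
        if (isset($this->rows[$id])) {
            $this->update($id, $values);
        } else {
            $this->insert($id, $values);
        }
    }

    public function get(int $id): array
    {
        return $this->rows[$id];
    }
}

$store = new InMemoryEntityStore();
$store->insert(1001, ['sku' => 'A-1', 'name' => 'Widget']); // id allocated before insertion
$store->update(1001, ['name' => 'Widget v2']);
$store->upsert(1002, ['sku' => 'B-1']);   // behaves as insert
$store->upsert(1002, ['name' => 'Gadget']); // behaves as update
```

With pre-allocated ids the caller always knows the key up front, which is what makes the explicit insert/update split (and an upsert) practical.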

When inserting entities, the same Attribute Value Set tuple class as querying could be used. That is, you provide a set of attribute names and values to insert. For updates, because of the fallback scheme for scopes, there is also the need to be able to express “remove an attribute value”. So rather than specifying values for attributes, updates require operations on attributes – set attribute to specified value and remove attribute value. Thus, there may be the need to support separate classes for Attribute Value Set tuples and Attribute Operation Set tuples. (Setting an attribute to “null” is not the same as removing the attribute due to the EAV fallback scheme for default attribute values.) This is an ongoing area of investigation.
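The distinction between "set to null" and "remove" is easier to see in code. The operation format and the applyOperations() and resolve() helpers below are assumptions for this sketch; the fallback shown is a simplified two-level version of the scope scheme.

```php
<?php
// Hypothetical operation set: attribute name => operation.
// "set" carries a value (which may be null); "remove" deletes the stored value
// so that lookups fall back to the default scope.
$operations = [
    'name'        => ['op' => 'set', 'value' => 'Widget (DE)'],
    'description' => ['op' => 'remove'],
    'badge'       => ['op' => 'set', 'value' => null],
];

/** Apply operations to the values stored for one scope; unlisted attributes are untouched. */
function applyOperations(array $scopeValues, array $operations): array
{
    foreach ($operations as $attribute => $operation) {
        if ($operation['op'] === 'set') {
            $scopeValues[$attribute] = $operation['value'];
        } elseif ($operation['op'] === 'remove') {
            unset($scopeValues[$attribute]);
        } else {
            throw new InvalidArgumentException('Unknown operation: ' . $operation['op']);
        }
    }
    return $scopeValues;
}

/** Resolve an attribute for the scope, falling back to the default scope when absent. */
function resolve(array $scopeValues, array $defaults, string $attribute)
{
    if (array_key_exists($attribute, $scopeValues)) {
        return $scopeValues[$attribute];
    }
    return $defaults[$attribute] ?? null;
}

$defaults  = ['name' => 'Widget', 'description' => 'Default text', 'badge' => 'New'];
$storeView = ['name' => 'Old name', 'description' => 'Override text', 'price' => 10];
$storeView = applyOperations($storeView, $operations);

$description = resolve($storeView, $defaults, 'description'); // removed, so the default applies
$badge       = resolve($storeView, $defaults, 'badge');       // explicitly null, no fallback
```

A plain value set cannot express the "remove" case, which is why a separate operation set type is being considered.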

Finally, the update API will support partial updates of entities. Only attributes listed in an update request will be modified by update operations.

Like the query API, it is planned for the update API and update service contract to use the same tuple types so the same data structure can be passed straight through from the presentation layer down to the persistence layer with minimal data copying.

Conclusions

This blog post provides an update of where we are heading with the Magento persistence layer. It is by no means final. It introduces more terminology and concepts, not all of which are baked yet. The new persistence layer is designed to abstract semantics away from the current physical database structure to allow future flexibility in storage technologies without change to business logic code. It is also designed to support higher performance API access to the storage engines and provide a simpler, more consistent understanding of the database schema by hiding underlying EAV tables from business logic.

What is the likely impact of the new query and update APIs to third party extensions and sites built by solution partners? Existing code is likely to need some restructuring, but (at least in the short term) the database schema itself is not being modified – just the API to access it. Preserving the current underlying table structures also means that existing code should be able to run alongside code using the newer API for a period of time. (This is yet to be proven in practice, but is the plan.)

As always, feedback is welcome. The reason for these blog posts is both to inform and to collect feedback. By discussing early, we hope to get feedback on issues so they can be considered early in the development lifecycle. Future posts are planned to get into sample code snippets in specific areas to make the concepts presented in this post more concrete.