cancel
Showing results for 
Search instead for 
Did you mean: 

[Architecture Proposal] Simplify URL Rewrites and make them better for SEO

[Architecture Proposal] Simplify URL Rewrites and make them better for SEO

Feature request from IvanChepurnyi, posted on GitHub Mar 04, 2016

It is a proposal for the future pull request, upon product team and dev team approval

1. Scalability Problem

At the moment, Magento database gets unmanageable, when you have URL rewrites for products and categories combined. The problem is in the way how URL rewrite table is build up. Becuase URL rewrite generator creates an entry in a database table for each combination of product and category. So it makes databases with only 3000 of products and 200 categories grow into 600,000 record table for URL rewrites. That limits most of the Magento merchants to create SEO valuable URL structure when they URL rewrites lookup takes more time that it could.

Solution

Use combination algorithm at runtime instead of generating all possible combination of URL. I already had PoC with one of my existing customers on 1.x; I have plans to port to Magento 2.0 if core team would be interested in such performance improvement contribution.

How does combination algorithm work? Let consider we have the following URL: path1/path2/path3/path4

Then we can chain multiple processors for URL match. In default implementation there will be 2:

  1. Full match processor
  2. Combined processor

Full match processor

Simply lookups the full URL rewrite, if it does not return match, then the next one is used

Combined processor

Splits the given URL in two parts: path1/path2/path3 and path4. If there is a URL rewrite for the first part and it has the flag, that it is combinable with another record, the processor then makes a request to find URL rewrite target for path4. By having specific merge algorithms assigned to it (category + product or even more in feature), it will create a virtual rewrite record that will be used to forward the request to a needed target path

How does it affect Canonical Url for a Product?

This algorithm allows to specify canonical category even at runtime, and you don't need to create a separate index for it.

How does it affect Duplicates of URL?

Because path1/path2/path3 and path4 should be unique in the main table, combining them together does not bring any issue, except if there is path1/path2/path3/path4 rewrite exists.

How does it affect Current Installments?

It will use the same tables that exist atm.

2. Internationalization Problem

Because Magento URL generator removes all the characters from non-latin alphabet, it makes impossible to use all the possibilities of allowed URI specification (RFC3986)

Solution

By lifting the logic behind the algorithm of URL generator we can gain much broad SEO possibilities, including ones with different writing schemes (Cyrillic, Chinese, Arabic, etc.). As you can easily use UTF-8 characters in URL path as soon as you escape with rawurlencode when building the rewritten URL, then on the processing convert it back with rawurldecode. SEO experts will love that as it gives better search results rankings in non-latin alphabets. This approach does make it even possible to generate URL rewrite lookup in much easier faster manner, as you can simply replace all the defined prohibited characters with "-" (dash). So it allows moving URL generation logic to the pure database level.

How does it affect Current Installments?

There are no changes to already translated URLs, it might re-index it a bit differently, but then redirect rewrite history will make auto-redirect.

3. Large Varchar Index Problem

As request path is a string match, when you have a huge database it makes it much slower to lookup the required record.

Solution

There is a simple solution that can help improve performance dramatically on such datasets. As all of us know CRC32 is an integer checksum of 32 bits (perfect for INT column). MySQL has a built-in function for calculating it and PHP has it as well. It is not widely used as it has much higher collision probability than other checksum algorithms. But it is a perfect solution for fast closest match calculation. It can have 5 billion different checksums calculated based on the input. The probability that a 100,000 record in the database will collide is 1 to 50. So there might be the worst scenario 50 records returned for one record lookup, but you can add a single char field to reduce that to only URLs starting with the same letter. Then on PHP side, you only walk over an array of returned result and check the closest match. This option allows managing millions of URL rewrite entries with reasonable lookup time.

How does it affect Current Installments?

It won't change existing table structure as the new table will be introduced, and existing records will be inserted by a simple SQL query during migration. New records will be handled in the URL indexer itself.

17 Comments
apiuser
New Member

Comment from schmengler, posted on GitHub Mar 09, 2016

Isn't (1) basically what Magento 1 EE already does?

apiuser
New Member

Comment from IvanChepurnyi, posted on GitHub Mar 09, 2016

@schmengler Can you please elaborate on this? I don't understand how this can be the same as in EE 1.x? You don't have combinable URL rewrites in any of the core version atm.

apiuser
New Member

Comment from schmengler, posted on GitHub Mar 09, 2016

What about Enterprise_Catalog_Model_Urlrewrite_Matcher_Product ? It splits the request path at the last "/" and looks up the parts separateley. AFAIK there are no combined values stored in the database.

apiuser
New Member

Comment from IvanChepurnyi, posted on GitHub Mar 09, 2016

@schmengler Sorry, my fault that I didn't check if EE has something similar in 1.x branch. Seems there are some nice improvements in it. Just checked it, and true there is kind of a similar approach, but not exactly what I am talking about here.

My proposal is to create chained URL rewrite processors with priorities, that are decoupled from each other:

  1. Full request processor that will match the full request path as it is
  2. Combined request processor that will match request path except the last item after slash and if it is a combinable URL rewrite (special flag in record metadata), then try to find the combination of last item
  3. Add any of your custom URL rewrite processors to the chain for your extension or project customization (layered nav, virtual pseudo categories like brands, etc).

The only thing that looks similar to EE 1.x is a part related to combining request paths, but it is still a different approach than the one that is described here.

apiuser
New Member

Comment from antboiko, posted on GitHub Mar 09, 2016

@IvanChepurnyi Hi Ivan, thank you for all proposals regarding URL rewrites.

Both "Internationalization Problem" and "Large Varchar Index Problem" definitely make sense. They both will be a subject for internal discussion and validation.

Regarding "Scalability Problem" proposal, this one was a part of EE 1.13.x. Currently we have few solutions to be considered for solving this.

Ivan, I will keep you informed of the progress on this matter.

Best, Anton.

apiuser
New Member

Comment from IvanChepurnyi, posted on GitHub Mar 09, 2016

@antboiko

Thanks for the official answer. There is something very important to consider when you read about scalability problem. Maybe I was not that clear in the description, but the combination processor is only tiny part of proposed solution. The whole Scalability solution goes far beyond category + product URL lookups. As at the moment (even with EE1.13.x) I cannot create request path match processor that would give me full control over how request_path is split up. I cannot have something like path1/path2/path3/path4 where path1 is a category and path2, path3, path4 are applied filter URL slugs. At the moment to make it possible I have to wrap the whole Magento\UrlRewrite\Model\UrlFinderInterface implementation, but that should not be the case. That is why I propose to create a layer on top of the storage that will do that and will allow chaining of the processors and keep the same models for storage as it is done now. Doing it the proposed way will make it simple for third-party vendors to add own URL processors without substituting the whole storage logic. Adding this layer would not break any backward compatibility, as every chain processor will still be using Magento\UrlRewrite\Model\UrlFinderInterface.

apiuser
New Member

Comment from antboiko, posted on GitHub Apr 19, 2016

@IvanChepurnyi

Hello Ivan,

Following up on our discussion let me inform you that your proposals were approved. I would like to summarize on what we agreed regarding the future PR:

1. Scalability Problem. This proposal makes sense as it will decrease the amount of records in DB. However it may cause the collision. Please make sure the solution covers this case as well and prevent occurrence of collided URLs. 2. Internalization Problem. Definitely will be a good option. Currently we have a transliteration mechanism for URL keys specified using non-ASCII characters. It would be great to have this option proposed by you as the one which user can turn on and use instead of transliteration mechanism. 3. Large Varchar Index Problem. Definitely makes sense. Can be implemented in the way you described it.

Thank you for your initiative aimed to improve and simplify URL Rewrites mechanism.

We look forward to your contribution.

Best regards, Anton.