Feature request from IvanChepurnyi, posted on GitHub Mar 04, 2016
It is a proposal for the future pull request, upon product team and dev team approval
1. Scalability Problem
At the moment, Magento database gets unmanageable, when you have URL rewrites for products and categories combined. The problem is in the way how URL rewrite table is build up. Becuase URL rewrite generator creates an entry in a database table for each combination of product and category. So it makes databases with only 3000 of products and 200 categories grow into 600,000 record table for URL rewrites. That limits most of the Magento merchants to create SEO valuable URL structure when they URL rewrites lookup takes more time that it could.
Solution
Use combination algorithm at runtime instead of generating all possible combination of URL. I already had PoC with one of my existing customers on 1.x; I have plans to port to Magento 2.0 if core team would be interested in such performance improvement contribution.
How does combination algorithm work? Let consider we have the following URL:
path1/path2/path3/path4
Then we can chain multiple processors for URL match. In default implementation there will be 2:
Full match processor
Combined processor
Full match processor
Simply lookups the full URL rewrite, if it does not return match, then the next one is used
Combined processor
Splits the given URL in two parts: path1/path2/path3 and path4 . If there is a URL rewrite for the first part and it has the flag, that it is combinable with another record, the processor then makes a request to find URL rewrite target for path4 . By having specific merge algorithms assigned to it (category + product or even more in feature), it will create a virtual rewrite record that will be used to forward the request to a needed target path
How does it affect Canonical Url for a Product?
This algorithm allows to specify canonical category even at runtime, and you don't need to create a separate index for it.
How does it affect Duplicates of URL?
Because path1/path2/path3 and path4 should be unique in the main table, combining them together does not bring any issue, except if there is path1/path2/path3/path4 rewrite exists.
How does it affect Current Installments?
It will use the same tables that exist atm.
2. Internationalization Problem
Because Magento URL generator removes all the characters from non-latin alphabet, it makes impossible to use all the possibilities of allowed URI specification (RFC3986)
Solution
By lifting the logic behind the algorithm of URL generator we can gain much broad SEO possibilities, including ones with different writing schemes (Cyrillic, Chinese, Arabic, etc.). As you can easily use UTF-8 characters in URL path as soon as you escape with rawurlencode when building the rewritten URL, then on the processing convert it back with rawurldecode. SEO experts will love that as it gives better search results rankings in non-latin alphabets.
This approach does make it even possible to generate URL rewrite lookup in much easier faster manner, as you can simply replace all the defined prohibited characters with "-" (dash). So it allows moving URL generation logic to the pure database level.
How does it affect Current Installments?
There are no changes to already translated URLs, it might re-index it a bit differently, but then redirect rewrite history will make auto-redirect.
3. Large Varchar Index Problem
As request path is a string match, when you have a huge database it makes it much slower to lookup the required record.
Solution
There is a simple solution that can help improve performance dramatically on such datasets. As all of us know CRC32 is an integer checksum of 32 bits (perfect for INT column). MySQL has a built-in function for calculating it and PHP has it as well. It is not widely used as it has much higher collision probability than other checksum algorithms. But it is a perfect solution for fast closest match calculation. It can have 5 billion different checksums calculated based on the input. The probability that a 100,000 record in the database will collide is 1 to 50. So there might be the worst scenario 50 records returned for one record lookup, but you can add a single char field to reduce that to only URLs starting with the same letter. Then on PHP side, you only walk over an array of returned result and check the closest match. This option allows managing millions of URL rewrite entries with reasonable lookup time.
How does it affect Current Installments?
It won't change existing table structure as the new table will be introduced, and existing records will be inserted by a simple SQL query during migration. New records will be handled in the URL indexer itself.
... View more