cancel
Showing results for 
Search instead for 
Did you mean: 

[Architecture Proposal] Simplify URL Rewrites and make them better for SEO

[Architecture Proposal] Simplify URL Rewrites and make them better for SEO

Feature request from IvanChepurnyi, posted on GitHub Mar 04, 2016

It is a proposal for the future pull request, upon product team and dev team approval

1. Scalability Problem

At the moment, Magento database gets unmanageable, when you have URL rewrites for products and categories combined. The problem is in the way how URL rewrite table is build up. Becuase URL rewrite generator creates an entry in a database table for each combination of product and category. So it makes databases with only 3000 of products and 200 categories grow into 600,000 record table for URL rewrites. That limits most of the Magento merchants to create SEO valuable URL structure when they URL rewrites lookup takes more time that it could.

Solution

Use combination algorithm at runtime instead of generating all possible combination of URL. I already had PoC with one of my existing customers on 1.x; I have plans to port to Magento 2.0 if core team would be interested in such performance improvement contribution.

How does combination algorithm work? Let consider we have the following URL: path1/path2/path3/path4

Then we can chain multiple processors for URL match. In default implementation there will be 2:

  1. Full match processor
  2. Combined processor

Full match processor

Simply lookups the full URL rewrite, if it does not return match, then the next one is used

Combined processor

Splits the given URL in two parts: path1/path2/path3 and path4. If there is a URL rewrite for the first part and it has the flag, that it is combinable with another record, the processor then makes a request to find URL rewrite target for path4. By having specific merge algorithms assigned to it (category + product or even more in feature), it will create a virtual rewrite record that will be used to forward the request to a needed target path

How does it affect Canonical Url for a Product?

This algorithm allows to specify canonical category even at runtime, and you don't need to create a separate index for it.

How does it affect Duplicates of URL?

Because path1/path2/path3 and path4 should be unique in the main table, combining them together does not bring any issue, except if there is path1/path2/path3/path4 rewrite exists.

How does it affect Current Installments?

It will use the same tables that exist atm.

2. Internationalization Problem

Because Magento URL generator removes all the characters from non-latin alphabet, it makes impossible to use all the possibilities of allowed URI specification (RFC3986)

Solution

By lifting the logic behind the algorithm of URL generator we can gain much broad SEO possibilities, including ones with different writing schemes (Cyrillic, Chinese, Arabic, etc.). As you can easily use UTF-8 characters in URL path as soon as you escape with rawurlencode when building the rewritten URL, then on the processing convert it back with rawurldecode. SEO experts will love that as it gives better search results rankings in non-latin alphabets. This approach does make it even possible to generate URL rewrite lookup in much easier faster manner, as you can simply replace all the defined prohibited characters with "-" (dash). So it allows moving URL generation logic to the pure database level.

How does it affect Current Installments?

There are no changes to already translated URLs, it might re-index it a bit differently, but then redirect rewrite history will make auto-redirect.

3. Large Varchar Index Problem

As request path is a string match, when you have a huge database it makes it much slower to lookup the required record.

Solution

There is a simple solution that can help improve performance dramatically on such datasets. As all of us know CRC32 is an integer checksum of 32 bits (perfect for INT column). MySQL has a built-in function for calculating it and PHP has it as well. It is not widely used as it has much higher collision probability than other checksum algorithms. But it is a perfect solution for fast closest match calculation. It can have 5 billion different checksums calculated based on the input. The probability that a 100,000 record in the database will collide is 1 to 50. So there might be the worst scenario 50 records returned for one record lookup, but you can add a single char field to reduce that to only URLs starting with the same letter. Then on PHP side, you only walk over an array of returned result and check the closest match. This option allows managing millions of URL rewrite entries with reasonable lookup time.

How does it affect Current Installments?

It won't change existing table structure as the new table will be introduced, and existing records will be inserted by a simple SQL query during migration. New records will be handled in the URL indexer itself.

17 Comments
New Member

Comment from peterjaap, posted on GitHub Mar 04, 2016

This sounds amazing and would tackle the core_url_rewrite problem. Not to mention open up possibilities to be more dynamic in your filtering options and category naming/structure.

+1

New Member

Comment from sergeifilippov, posted on GitHub Mar 04, 2016

:+1:

New Member

Comment from keyurshah, posted on GitHub Mar 05, 2016

that sounds awesome and would help with performance.

New Member

Comment from aleron75, posted on GitHub Mar 05, 2016

+1

New Member

Comment from rianorie, posted on GitHub Mar 05, 2016

+1

New Member

Comment from amacgregor, posted on GitHub Mar 06, 2016

+1

New Member

Comment from tgerulaitis, posted on GitHub Mar 06, 2016

:+1:

New Member

Comment from roman204, posted on GitHub Mar 06, 2016

+1

New Member

Comment from mcspronko, posted on GitHub Mar 09, 2016

+1

New Member

Comment from spras, posted on GitHub Mar 09, 2016

:+1: