Arabic Case Insensitive In Database Systems: How To Solve Alef With and Without Hamza Problem

If you are dealing with Arabic fields in databases, you most likely ran into this problem. You have characters which are considered the same in Arabic (like ‘ا’ and ‘أ’)  but dealt differently in database systems. The use of case insensitive collation like ‘utf8_unicode_ci’ won’t solve the problem. The ‘ا’ and ‘أ’ aren’t considered equal in the character mapping of ‘utf8_unicode_ci’

There are three solutions for this problem:

  1. Create a custom collation
  2. Add a normalized field
  3. Use regular expressions in queries

I will show you how to apply these solutions for MySQL, if you are using other DBS, check the documentation for similar solutions. Each solution will be applied to the following example table:

+----+--------------+
| id | name         |
+----+--------------+
|  1 | احمد        |
|  2 | أحمد        |
|  3 | أسامه       |
|  4 | أسامة       |
|  5 | اسامه       |
|  6 | اسَامه       |
+----+--------------+
6 rows in set

1. Create a custom collation

This is the recommended solution for most cases. You won’t change any data in your database. All you are going to do is simply telling the DB to treat these characters as one. The steps we will use here are based on MySQL documentation “Adding a UCA Collation to a Unicode Character Set“.

First of all, we need to modify a configuration file called “Index.xml” to add our collation. The location of the file could vary from one system to another. To get the location of the file in your system, run the following query on “information_schema” database:

SHOW VARIABLES LIKE 'character_sets_dir';

+--------------------+----------------------------+
| Variable_name      | Value                      |
+--------------------+----------------------------+
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------+----------------------------+
1 row in set (0.00 sec)

So on my pc, the file path is ‘/usr/share/mysql/charsets/Index.xml’. Backup the file, open it, and go the element <charset name=”utf8″>. We will add our collation as a child element to this charset.

We need an ID and a name for this collation, the range of IDs from 1024 to 2047 is reserved for user-defined collations, so choose a number in this range. I named our collation “utf8_arabic_ci”.

Let’s add the collation to the Index.xml file and then explain what the collation rules mean. Add the following :

<charset name="utf8">
.
.
.
  <collation name="utf8_arabic_ci" id="1029">
   <rules>
     <reset>\u0627</reset> <!-- Alef 'ا' -->
     <i>\u0623</i>        <!-- Alef With Hamza Above 'أ' -->
     <i>\u0625</i>        <!-- Alef With Hamza Below 'إ' -->
     <i>\u0622</i>        <!-- Alef With Madda Above 'آ' -->
   </rules>
   <rules>
     <reset>\u0629</reset> <!-- Teh Marbuta 'ة' -->
     <i>\u0647</i>        <!-- Heh 'ه' -->
   </rules>
   <rules>
     <reset>\u0000</reset> <!-- Ignore Tashkil -->
     <i>\u064E</i>        <!-- Fatha 'َ' -->
     <i>\u064F</i>        <!-- Damma 'ُ' -->
     <i>\u0650</i>        <!-- Kasra 'ِ' -->
     <i>\u0651</i>        <!-- Shadda 'ّ' -->
     <i>\u064F</i>        <!-- Sukun 'ْ' -->
     <i>\u064B</i>        <!-- Fathatan 'ً' -->
     <i>\u064C</i>        <!-- Dammatan 'ٌ' -->
     <i>\u064D</i>        <!-- Kasratan 'ٍ' -->
   </rules>
 </collation>
</charset>

This added part tells the DBS that “utf8_arabic_ci” (or whatever you name it) is a child of utf8 char set, adding the following rules:

  1. ‘أ’, ‘إ’, ‘آ’ is the same character as ‘ا’ (All the Alef forms are one character)
  2. ‘ه’ is the same character as ‘ة’ (so “نسمة” is the same as “نسمه”)
  3. The tashkil characters are ignored (as if they aren’t there)

You can add more rules as you like. If you find an important rule that should be added to our collation, please add it in a comment. We will have to restart the MySQL server to use our collation, on my Linux I use:

sudo service mysql restart

Now, go to the table with Arabic names and change the column collation to our new collation. I will do this to my example table using the following query:

ALTER TABLE persons MODIFY name VARCHAR(50) CHARACTER SET 'utf8' COLLATE 'utf8_arabic_ci';

Query OK, 6 rows affected (0.89 sec) 
Records: 6 Duplicates: 0 Warnings: 0

Note: If you are using PHPMyAdmin, don’t expect to find the custom collation in the list of collations. You will have to write an SQL query like the above to change the column to the new collation.

Now every thing should be set. I will execute a search query and see if it is working as it should:

SELECT * FROM persons WHERE name = "اسامة";

+----+--------------+
| id | name         |
+----+--------------+
|  3 | أسامه        |
|  4 | أسامة        |
|  5 | اسامه        |
|  6 | اسَامه        |
+----+--------------+
4 rows in set (0.00 sec)

Sure enough, the variations of Alef are shown, also the Teh and the Heh, and the Tashkil is ignored.

This was the first solution and I think the most convenient one. But what about if you can’t access the character sets files (like when you are using a shared hosting for example), then the second solution should be suitable for you.

2. Add a normalized field

This solution won’t require editing configuration files, but it will require adding an additional column to the table and some data processing. The idea is simple, add a new column and fill it with the text in a “normalized state”, then use the normalized column in your queries, better explained by example.

I will use some PHP code in this example to add the normalized column to our table. Consider this PHP function:

function normalize_name($name) {
    $patterns     = array( "/إ|أ|آ/" , "/ه/", "/َ|ً|ُ|ِ|ٍ|ٌ|ّ/" ); 
    $replacements = array( "ا" ,   "ة"      , ""           );
    return preg_replace($patterns, $replacements, $name);
}

What this function does is replacing all occurrences of ‘ه’ with ‘ة’, all forms of Alef with ‘ا’, and removes the tashkil. Let’s see it in action:

normalize_name("أحمد");  // return: احمد
normalize_name("آمنة");  // return: امنة
normalize_name("أسامه"); // return: اسامة
normalize_name("مٌحَمَّد");  // return: محمد

Ok, now we got our normalize function. Let’s add a new column in our table and call it “normalized_name” and fill it using this function. We will have now a table with the following data:

1  احمد   احمد
2  أحمد   احمد
3  أسامه  اسامة
4  أسامة  اسامة
5  اسامه  اسامة
6  اسَامه  اسامة

The English columns names will break in the last layout so I kept only the data to avoid confusion. Now we got our normalized data. How do we use it to solve our problem?

If the user searched for the name “آسامه”, we will pass this name to the normalize function first, which will result in “اسامة”, then create a search query with this normalized name and display the original name in the result:

SELECT name FROM persons WHERE normalized_name = "اسامة";

+--------------+
| name         |
+--------------+
| أسامه       |
| أسامة       |
| اسامه       |
| اسَامه       |
+--------------+

That’s it. We added the normalized field to the table, searched for the normalized name, and displayed the original name. This is the second solution.

3. Use Regular Expressions in queries

In this solution, you won’t be changing any configuration files nor add extra columns to your tables. But Regexp search is slower than the ordinary search, you will lose the advantage of using indices, unlike the former solutions. Also, you won’t be able to ignore Tashkil, and it isn’t easy to generate a regex pattern for all possible cases. Using Regular Expressions is not recommended in the case of large databases or when you care about performance in general, nevertheless, you might find it useful for special kind of queries.

Regex isn’t in the standard SQL, but most database systems will provide it with a different syntax.  To apply this solution, we will replace our query string with a regexp pattern.

Regular expressions in MySQL is done using ‘REGEXP‘ or its synonym ‘RLIKE‘. You can browse the MySQL documentation for regexp to find out more about its syntax. To search for all occurrences of “اسامة”, we will use the following pattern:

"[ا|أ|إ|آ]سام[ه|ة]"

This pattern simply means: Look for any form of Alef at the beginning, and Teh or Heh at the end. Let’s try it out:

SELECT name FROM persons WHERE name REGEXP "[ا|أ|إ|آ]سام[ه|ة]";

+------------+
| name       |
+------------+
| أسامه     |
| أسامة     |
| اسامه     |
+------------+

Sure enough, all occurrences of “اسامة” are shown (except the one with the tashkil as mentioned before). Now, we will have to write a function to generate this pattern for every search query. I will give you an example to do this in PHP, but this is just an example and most likely you will need a different approach:

function generate_pattern($search_string) {
  $patterns     = array( "/(ا|إ|أ|آ)/", "/(ه|ة)/" ); 
  $replacements = array( "[ا|إ|أ|آ]", "[ة|ه]" ); 
  return preg_replace($patterns, $replacements, $search_string);
}

This will generate the following patterns:

generate_pattern("أسامة"); // return '[ا|إ|أ|آ]س[ا|إ|أ|آ]م[ة|ه]'
generate_pattern("أسامه"); // return '[ا|إ|أ|آ]س[ا|إ|أ|آ]م[ة|ه]'
generate_pattern("احمد");  // return '[ا|إ|أ|آ]حمد'

Notice that the function replaced the Alef in the middle of ‘اسامة’ as well. You might strict it to only replace Alef at the beginning of a word, and also restrict replacement of Heh at the end of a word. I will leave it you to adjust the pattern to your case.

Conclusion

That will be it for all the possible solutions I know of to fix the Arabic insensitive search problem. If you know another solution please tell us in a comment and share your experience.

Leave a Reply

Your email address will not be published. Required fields are marked *