Arabic Case Insensitive In Database Systems: How To Solve Alef With and Without Hamza Problem

If you are dealing with Arabic fields in databases, you have most likely run into this problem: characters that are considered the same in Arabic (like ‘ا’ and ‘أ’) are treated as different characters by database systems. Using a case-insensitive collation like ‘utf8_unicode_ci’ won’t solve the problem, because ‘ا’ and ‘أ’ aren’t considered equal in the character mapping of ‘utf8_unicode_ci’.

There are three solutions for this problem:

  1. Create a custom collation
  2. Add a normalized field
  3. Use regular expressions in queries

I will show you how to apply these solutions in MySQL; if you are using another DBMS, check its documentation for similar features. Each solution will be applied to the following example table:

+----+--------------+
| id | name         |
+----+--------------+
|  1 | احمد        |
|  2 | أحمد        |
|  3 | أسامه       |
|  4 | أسامة       |
|  5 | اسامه       |
|  6 | اسَامه       |
+----+--------------+
6 rows in set

1. Create a custom collation

This is the recommended solution for most cases. You won’t change any data in your database; you are simply telling the DB to treat these characters as one. The steps here are based on the MySQL documentation, “Adding a UCA Collation to a Unicode Character Set”.

First of all, we need to modify a configuration file called “Index.xml” to add our collation. The location of the file varies from one system to another. To find it on your system, run the following query:

SHOW VARIABLES LIKE 'character_sets_dir';

+--------------------+----------------------------+
| Variable_name      | Value                      |
+--------------------+----------------------------+
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------+----------------------------+
1 row in set (0.00 sec)

So on my machine, the file path is ‘/usr/share/mysql/charsets/Index.xml’. Back up the file, open it, and go to the element <charset name="utf8">. We will add our collation as a child element of this charset.

We need an ID and a name for this collation. The range of IDs from 1024 to 2047 is reserved for user-defined collations, so choose a number in that range. I named our collation “utf8_arabic_ci”.

Let’s add the collation to the Index.xml file, then explain what the collation rules mean. Add the following:

<charset name="utf8">
.
.
.
  <collation name="utf8_arabic_ci" id="1029">
   <rules>
     <reset>\u0627</reset> <!-- Alef 'ا' -->
     <i>\u0623</i>        <!-- Alef With Hamza Above 'أ' -->
     <i>\u0625</i>        <!-- Alef With Hamza Below 'إ' -->
     <i>\u0622</i>        <!-- Alef With Madda Above 'آ' -->
   </rules>
   <rules>
     <reset>\u0629</reset> <!-- Teh Marbuta 'ة' -->
     <i>\u0647</i>        <!-- Heh 'ه' -->
   </rules>
   <rules>
     <reset>\u0000</reset> <!-- Ignore Tashkil -->
     <i>\u064E</i>        <!-- Fatha 'َ' -->
     <i>\u064F</i>        <!-- Damma 'ُ' -->
     <i>\u0650</i>        <!-- Kasra 'ِ' -->
     <i>\u0651</i>        <!-- Shadda 'ّ' -->
     <i>\u0652</i>        <!-- Sukun 'ْ' -->
     <i>\u064B</i>        <!-- Fathatan 'ً' -->
     <i>\u064C</i>        <!-- Dammatan 'ٌ' -->
     <i>\u064D</i>        <!-- Kasratan 'ٍ' -->
   </rules>
 </collation>
</charset>

This added part tells the DBMS that “utf8_arabic_ci” (or whatever you name it) is a collation of the utf8 character set, adding the following rules:

  1. ‘أ’, ‘إ’, and ‘آ’ are the same character as ‘ا’ (all the Alef forms are one character)
  2. ‘ه’ is the same character as ‘ة’ (so “نسمة” is the same as “نسمه”)
  3. The tashkil characters are ignored (as if they weren’t there)

You can add more rules as you like; if you find an important rule that should be added to our collation, please share it in a comment. We have to restart the MySQL server for the new collation to take effect. On my Linux machine, I use:

sudo service mysql restart

Now, go to the table with Arabic names and change the column collation to our new collation. I will do this to my example table using the following query:

ALTER TABLE persons MODIFY name VARCHAR(50) CHARACTER SET 'utf8' COLLATE 'utf8_arabic_ci';

Query OK, 6 rows affected (0.89 sec) 
Records: 6 Duplicates: 0 Warnings: 0

Note: If you are using phpMyAdmin, don’t expect to find the custom collation in its list of collations. You will have to run an SQL query like the one above to change the column to the new collation.

Now everything should be set. Let’s execute a search query and see if it works as it should:

SELECT * FROM persons WHERE name = "اسامة";

+----+--------------+
| id | name         |
+----+--------------+
|  3 | أسامه        |
|  4 | أسامة        |
|  5 | اسامه        |
|  6 | اسَامه        |
+----+--------------+
4 rows in set (0.00 sec)

Sure enough, all the variations of Alef are matched, Teh Marbuta matches Heh, and the tashkil is ignored.

This was the first solution, and I think the most convenient one. But what if you can’t access the character set files (for example, on shared hosting)? Then the second solution should suit you.

2. Add a normalized field

This solution doesn’t require editing configuration files, but it does require adding an extra column to the table and some data processing. The idea is simple: add a new column, fill it with the text in a normalized form, then use the normalized column in your queries. It is best explained by example.

I will use some PHP code in this example to add the normalized column to our table. Consider this PHP function:

function normalize_name($name) {
    $patterns     = array( "/إ|أ|آ/u", "/ه/u", "/َ|ً|ُ|ِ|ٍ|ٌ|ّ|ْ/u" );
    $replacements = array( "ا",        "ة",    "" );
    return preg_replace($patterns, $replacements, $name);
}

This function replaces all forms of Alef with ‘ا’, every occurrence of ‘ه’ with ‘ة’, and removes the tashkil. Let’s see it in action:

normalize_name("أحمد");  // return: احمد
normalize_name("آمنة");  // return: امنة
normalize_name("أسامه"); // return: اسامة
normalize_name("مٌحَمَّد");  // return: محمد

OK, now we have our normalize function. Let’s add a new column to our table, call it “normalized_name”, and fill it using this function. We will now have a table with the following data:

1  احمد   احمد
2  أحمد   احمد
3  أسامه  اسامة
4  أسامة  اسامة
5  اسامه  اسامة
6  اسَامه  اسامة

The English column names would break this layout with right-to-left text, so I kept only the data to avoid confusion. Now we have our normalized data. How do we use it to solve our problem?

If the user searches for the name “آسامه”, we first pass the name to the normalize function, which results in “اسامة”, then build a search query with this normalized name and display the original name in the result:

SELECT name FROM persons WHERE normalized_name = "اسامة";

+--------------+
| name         |
+--------------+
| أسامه       |
| أسامة       |
| اسامه       |
| اسَامه       |
+--------------+

That’s it. We added the normalized field to the table, searched for the normalized name, and displayed the original name. This is the second solution.
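To make the whole flow concrete, here is a self-contained sketch that simulates the table with a PHP array instead of a real database. The array stands in for the persons table; in a real application the lookup would be the SQL query shown earlier against the normalized_name column.

```php
// Same normalization as before: unify Alef forms, map Heh to Teh
// Marbuta, and strip tashkil (the /u flag makes the patterns UTF-8 aware).
function normalize_name($name) {
    $patterns     = array( "/إ|أ|آ/u", "/ه/u", "/َ|ً|ُ|ِ|ٍ|ٌ|ّ|ْ/u" );
    $replacements = array( "ا",        "ة",    "" );
    return preg_replace($patterns, $replacements, $name);
}

// Stand-in for the persons table: id => name
$persons = array(
    1 => "احمد",  2 => "أحمد",  3 => "أسامه",
    4 => "أسامة", 5 => "اسامه", 6 => "اسَامه",
);

// "Filling" the normalized_name column
$normalized = array_map('normalize_name', $persons);

// Searching: normalize the user input, then compare it against the
// normalized column (in SQL: WHERE normalized_name = ?)
$needle  = normalize_name("آسامه");                // "اسامة"
$matches = array_keys($normalized, $needle, true); // ids 3, 4, 5, 6
```

The important property is that both sides of the comparison go through the same function, so any two spellings that normalize to the same string will match.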

3. Use Regular Expressions in queries

In this solution, you won’t change any configuration files or add extra columns to your tables. But regexp search is slower than ordinary search: unlike the former solutions, you lose the advantage of indexes. You also can’t easily ignore tashkil, and generating a pattern that covers all possible cases isn’t trivial. Regular expressions aren’t recommended for large databases, or whenever you care about performance; nevertheless, you might find them useful for special kinds of queries.

Regex isn’t part of standard SQL, but most database systems provide it, each with a slightly different syntax. To apply this solution, we replace our query string with a regexp pattern.

Regular expression matching in MySQL is done using ‘REGEXP‘ or its synonym ‘RLIKE‘. You can browse the MySQL documentation for REGEXP to find out more about its syntax. To search for all occurrences of “اسامة”, we will use the following pattern (note that inside square brackets characters are listed without separators; a ‘|’ there would be matched literally):

"[اأإآ]سام[هة]"

This pattern simply means: look for any form of Alef at the beginning, and Heh or Teh Marbuta at the end. Let’s try it out:

SELECT name FROM persons WHERE name REGEXP "[اأإآ]سام[هة]";

+------------+
| name       |
+------------+
| أسامه     |
| أسامة     |
| اسامه     |
+------------+

Sure enough, all occurrences of “اسامة” are shown (except the one with tashkil, as mentioned before). Now we need a function that generates this pattern for every search query. Here is an example in PHP, though you will most likely need to adapt it:

function generate_pattern($search_string) {
  $patterns     = array( "/(ا|إ|أ|آ)/u", "/(ه|ة)/u" );
  $replacements = array( "[اإأآ]", "[ةه]" );
  return preg_replace($patterns, $replacements, $search_string);
}

This will generate the following patterns:

generate_pattern("أسامة"); // return '[اإأآ]س[اإأآ]م[ةه]'
generate_pattern("أسامه"); // return '[اإأآ]س[اإأآ]م[ةه]'
generate_pattern("احمد");  // return '[اإأآ]حمد'

Notice that the function replaced the Alef in the middle of ‘اسامة’ as well. You might restrict it to replace Alef only at the beginning of a word, and Heh/Teh Marbuta only at the end of a word. I will leave it to you to adjust the pattern to your case.
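As one illustration of that adjustment (a sketch, not the only way to do it), the generator below uses lookarounds so that Alef forms are only replaced at the start of a word, and Heh/Teh Marbuta only at the end of one:

```php
// Stricter pattern generator: (?<!\S) means "not preceded by a
// non-space character" (i.e. start of string or after whitespace),
// and (?!\S) means "not followed by a non-space character".
function generate_strict_pattern($search_string) {
    // Replace Alef forms only when they start a word
    $pattern = preg_replace('/(?<!\S)[اإأآ]/u', '[اإأآ]', $search_string);
    // Replace Heh/Teh Marbuta only when they end a word
    return preg_replace('/[هة](?!\S)/u', '[ةه]', $pattern);
}

generate_strict_pattern("أسامه"); // '[اإأآ]سام[ةه]' – the middle Alef is untouched
```

Compare this with the earlier function, which turned the middle Alef of ‘أسامة’ into a character class as well.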

Conclusion

That’s it for all the solutions I know of to the Arabic case-insensitive search problem. If you know another solution, please tell us in a comment and share your experience.

How to fix a flash memory corrupting files in Linux

I recently encountered a USB flash drive that corrupts any file written to it. The drive reported a large capacity but was useless. After googling around and using some Linux tools, I discovered it was a counterfeit. However, I managed to recover the working part; although it was only a fraction of the size declared by the drive, it was better than nothing. I will show you how to do this in Linux in simple steps. For Windows users, check out this link.

We will use the open source F3 tools created by Michel Machado from Digirati. The main tools are already in the Debian/Ubuntu repositories; if you are using a different distribution, you can build them from source as explained later. If you want a thorough test of your flash drive, this is the reliable one. Just install the package:

sudo apt-get install f3

Now, after mounting the flash drive, fill it using:

f3write /media/[MOUNT FOLDER]

then make the check:

f3read /media/[MOUNT FOLDER]

It might take some time depending on the size of your flash drive. If the drive passes this test (Data LOST is 0), you can rest assured that it is OK. On the other hand, if you get data losses, the drive is corrupted. Here is the output I got from testing an old corrupted 1 GB USB flash drive:

                  SECTORS      ok/corrupted/changed/overwritten
Validating file 1.h2w ... 1021326/   991976/      0/      2

  Data OK: 498.69 MB (1021326 sectors)
Data LOST: 484.36 MB (991978 sectors)
               Corrupted: 484.36 MB (991976 sectors)
        Slightly changed: 0.00 Byte (0 sectors)
             Overwritten: 1.00 KB (2 sectors)
Average reading speed: 21.21 MB/s

From this result, you can see that half of my flash drive is corrupted. But the good news is that we can use the working part and exclude the corrupted part. Let’s see how.

The same developer made a couple of other tools, f3probe and f3fix. You won’t find them in the Debian/Ubuntu repositories, as they are considered experimental (they haven’t been tested at large scale). We will have to build them from source, so make sure you have the build tools; we will also install some extra dev packages.

UPDATE: Since Ubuntu Xenial (16.04 LTS), f3probe and f3fix are included in the f3 package. You can skip ahead to the f3probe usage after installing the package. Thanks to ComputingFroggy for this info.

Download the latest release of f3 (version 6.0 at the time of writing) from https://github.com/AltraMayor/f3/releases. The next step is to get the dependencies. For Ubuntu users (or Debian with the testing repository):

sudo apt-get install libudev1 libudev-dev libparted0-dev

Note: Debian Jessie users not using the testing repository will find the missing packages in the backports repository.

We have our dependencies now. Extract f3, navigate to the folder, and type:

make experimental

NOTE: If you are using the latest development version, or any version newer than v6.0, type make extra instead.

Hopefully the build process will finish without errors and you will see output like:

cc -std=c99 -Wall -Wextra -pedantic -MMD -ggdb -c -o libutils.o libutils.c
cc -std=c99 -Wall -Wextra -pedantic -MMD -ggdb -c -o libdevs.o libdevs.c
cc -std=c99 -Wall -Wextra -pedantic -MMD -ggdb -c -o libprobe.o libprobe.c
cc -std=c99 -Wall -Wextra -pedantic -MMD -ggdb -c -o f3probe.o f3probe.c
cc -o f3probe libutils.o libdevs.o libprobe.o f3probe.o -lm -ludev
cc -std=c99 -Wall -Wextra -pedantic -MMD -ggdb -c -o f3brew.o f3brew.c
cc -o f3brew libutils.o libdevs.o f3brew.o -lm -ludev
cc -std=c99 -Wall -Wextra -pedantic -MMD -ggdb -c -o f3fix.o f3fix.c
cc -o f3fix libutils.o f3fix.o -lparted

And you will find the binaries in your folder:

> find . -executable
.
./log-f3wr
./f3write.h2w
./f3probe
./f3brew
./f3fix

If you encounter any errors, you most likely have missing dependencies. Search for the dev packages in your distribution’s repositories.

Now everything should be set. We will start with f3probe. Insert the flash drive without mounting it, then run:

lsblk

to check that we have the right device; otherwise we risk messing with other devices. The output should be something like:

NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sr0 11:0 1 1024M 0 rom
sdc 8:32 1 986M 0 disk
└─sdc1 8:33 1 985M 0 part
sda 8:0 0 298.1G 0 disk
├─sda4 8:4 0 7.4G 0 part [SWAP]

...

According to this output, my flash partition is /dev/sdc1. But f3probe needs the device itself, /dev/sdc. Again, double-check that you are passing the right device. Beware that f3probe will destroy any data on the flash drive. Now run:

./f3probe /dev/sdc

In my case the output was:

f3 probe 6.0
Copyright (C) 2010 Digirati Internet LTDA.
This is free software; see the source for copying conditions.

WARNING: Probing normally takes from a few seconds to 15 minutes, but
         it can take longer. Please be patient.

Probe finished, recovering blocks... Done

Bad news: The device `/dev/sdc' is a counterfeit of type limbo

You can "fix" this device using the following command:
f3fix --last-sec=1015872 /dev/sdc

Device geometry:
                 *Usable* size: 496.03 MB (1015873 blocks)
                Announced size: 986.00 MB (2019328 blocks)
                        Module: 1.00 GB (2^30 Bytes)
        Approximate cache size: 0.00 Byte (0 blocks), need-reset=yes
           Physical block size: 512.00 Byte (2^9 Bytes)

Probe time: 16.10s

f3probe did a good job: in just 16 seconds it identified the usable size, confirmed that the device is a counterfeit, and told us how to fix it. Unplug your flash drive and insert it again without mounting it, then run the command from the previous output. In my case it was:

> sudo ./f3fix --last-sec=1015872 /dev/sdc
F3 fix 6.0
Copyright (C) 2010 Digirati Internet LTDA.
This is free software; see the source for copying conditions.

Drive `/dev/sdc' was successfully fixed

Now your flash drive should be fixed. Format it as usual and it should be almost ready. Here is a screenshot of the drive’s partition table from GParted after formatting; it shows how f3fix created a new partition containing only the working part and excluding the lossy part:

Gparted screenshot of the fixed flash

It is a good idea to run f3write/f3read on the working part to make sure it is working properly. Sometimes the first test fails after fixing. From the f3 docs:

If you get some sectors corrupted, repeat the f3write/f3read test. Some drives recover from these failures on a second full write cycle. However, if the corrupted sectors persist, the drive is junk because not only is it a fake drive, but its real memory is already failing.

Here is my output after testing:

                  SECTORS      ok/corrupted/changed/overwritten
Validating file 1.h2w ... 1013280/        0/      0/      0

  Data OK: 494.77 MB (1013280 sectors)
Data LOST: 0.00 Byte (0 sectors)
               Corrupted: 0.00 Byte (0 sectors)
        Slightly changed: 0.00 Byte (0 sectors)
             Overwritten: 0.00 Byte (0 sectors)
Average reading speed: 15.32 MB/s

I got 500 MB working out of my old corrupted 1 GB flash drive. Better than nothing, I guess. I hope you got something useful out of your flash drive too. A lesson for the future: don’t buy used or cheap flash memory unless you test it first.

References:

  1. F3 – an alternative to h2testw
  2. askubuntu.com: Check the real size of a USB thumb drive

Limiting excerpt and title length in wordpress

Sometimes you have a web page layout that breaks if the content exceeds a certain size. For example, if the excerpt of a post is too long, it could break your layout. Let’s see how to apply a word limit to the excerpt and the title.

Limit The Excerpt:

The excerpt is straightforward: WordPress has a filter called excerpt_length. To limit the excerpt to 20 words, add the following code to the functions.php located in your theme folder, /wp-content/themes/[THEME]/functions.php:

function custom_excerpt_length( $length ) {
	return 20;
}
add_filter( 'excerpt_length', 'custom_excerpt_length', 999 );

Limit The Title:

The title is a bit trickier. There is a WordPress filter called the_title, but if we use it the way we used the excerpt filter, the limit will apply across the whole website, even on single post pages.
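For completeness, if you did want to stay with the the_title filter, you could guard it with WordPress conditional tags so single views keep their full titles. This is only a sketch of the idea (the function name is mine; it assumes the standard is_singular() and is_admin() conditional tags):

```php
// Sketch: trim titles in lists/archives only, leaving single
// post/page views and the admin area untouched.
function my_limited_title( $title ) {
	if ( is_singular() || is_admin() ) {
		return $title;
	}
	return wp_trim_words( $title, 4, '...' );
}
add_filter( 'the_title', 'my_limited_title' );
```

Note that this filter also runs in other contexts (navigation menus, for example), which is one reason editing the template directly, as described next, gives more precise control.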

If we want to apply the limit on a certain page, we have to find the code responsible for displaying the title on that page. For example, in Twenty Sixteen, the default WordPress theme, the post title code is located in wp-content/themes/twentysixteen/template-parts/content.php.

Once you locate the file, look for the the_title() function, which is responsible for displaying the post title. We need to modify the output of this function to apply our word limit.

We could create a custom function to apply the limit, but there is already a built-in WordPress function called wp_trim_words(). From the WordPress documentation:

This function is localized. For languages that count ‘words’ by the individual character (such as East Asian languages), the $num_words argument will apply to the number of individual characters.

Sounds good. Now let’s use it to wrap our title. First, we need the_title() to return the title instead of echoing it directly (the default). To do this, pass false as the third argument; keep the first and second arguments as they are:

the_title( '', '', false )

Now we will pass the returned title from the_title() to wp_trim_words():

echo wp_trim_words( the_title( '', '', false ), 4, '...' );

Where:
4: the number of words to keep
‘…’: the string that indicates there is more in the title. An ellipsis is the default, so this argument can be omitted.

Keep in mind that any edits to theme files will be lost when the theme is updated, so it is better to create a child theme to preserve your edits.

WordPress header parser error

I had this weird problem when I uploaded my WordPress plugin, RSS from Aljazeera, to the WordPress plugin directory. After installing the plugin, WordPress showed a notice that the plugin needs to be updated to version 0.0(!), and clicking update didn’t have any effect.

At first glance, I thought it was a problem in the readme.txt file in the SVN repository (WordPress provides a repository for plugins hosted on the WordPress plugin directory), but I checked it and everything was right. Then I checked the plugin PHP files and everything seemed OK. After hours of experimenting and googling, I finally found the problem.

This was the header of the PHP file in the plugin:


/*
 * Plugin Name: RSS from Aljazeera
 * Plugin URI : https://wordpress.org/plugins/aljazeera-rss/
 * Description: Display the latest Arabic RSS feed from aljazeera.net news website
 * Version    : 1.2
 * Author     : Ahmed Essam
 * Author URI : http://www.ahmedspace.com
 * License    : GPL2
 * License URI: https://www.gnu.org/licenses/gpl-2.0.html
 */

My mistake was adding padding spaces to align all the “:” in the same column. The WordPress parser only accepts header values where the “:” immediately follows the field name; spaces after the “:” are fine. I changed the header to:


/*
 * Plugin Name: RSS from Aljazeera
 * Plugin URI:  https://wordpress.org/plugins/aljazeera-rss/
 * Description: Display the latest Arabic RSS feed from aljazeera.net news website
 * Version:     1.2
 * Author:      Ahmed Essam
 * Author URI:  http://www.ahmedspace.com
 * License:     GPL2
 * License URI: https://www.gnu.org/licenses/gpl-2.0.html
 */

Then everything worked as it should.
Apparently you have to keep this format for every header parsed by WordPress, including theme headers.
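To see why the padded form fails, here is a small self-contained demo. The pattern below is my simplified approximation of what WordPress’s header parser (get_file_data()) builds; the exact regex is an assumption, but it captures the “colon must touch the field name” behavior:

```php
// Simplified approximation of how a header field is matched; the real
// pattern lives in WordPress's get_file_data().
function find_header_value($field, $file_contents) {
    $pattern = '/^[ \t\/*#@]*' . preg_quote($field, '/') . ':(.*)$/mi';
    if (preg_match($pattern, $file_contents, $matches)) {
        return trim($matches[1]);
    }
    return null;
}

// Padded form from my broken header: the colon is separated from the
// field name, so the field is never found.
find_header_value('Version', " * Version    : 1.2"); // null
// With the colon adjacent to the name, parsing works:
find_header_value('Version', " * Version:     1.2"); // "1.2"
```

The character class at the start of the pattern swallows comment decoration (spaces, tabs, `/`, `*`, `#`, `@`), but nothing is allowed between the field name and the colon, which is exactly why the aligned header reads as version 0.0.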
Happy coding!