I recently ran into a problem when rendering PDF files that contained Arabic letters. I had a list of texts in a database which had to be printed out to a PDF file using PDFSharp. But the words looked completely wrong.
This is how I solved it.
The Arabic alphabet consists of letters that take different forms (glyphs). Depending on whether a letter stands alone or is at the beginning, middle or end of a word, it is drawn with a different glyph, and Unicode has a separate presentation-form code point for each of these. But when Arabic text is stored in a Unicode string, it is normally stored in the general Unicode forms, one code point per letter, independent of the letter's position in the word.
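To make the difference between the general form and the positional glyphs concrete, here is a small Python sketch (Python is used only for illustration; it is not part of my PDFSharp code). It takes the letter BEH as an example and prints the four code points from the Arabic Presentation Forms-B block next to the general form. The names `BEH`, `BEH_FORMS` and `describe` are made up for this example.

```python
import unicodedata

# General (stored) form of the letter BEH.
BEH = "\u0628"

# Positional glyphs of BEH in the Arabic Presentation Forms-B block.
BEH_FORMS = {
    "isolated": "\uFE8F",
    "final": "\uFE90",
    "initial": "\uFE91",
    "medial": "\uFE92",
}


def describe(ch: str) -> str:
    """Return the code point and official Unicode name of a character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print("general ", describe(BEH))
for position, glyph in BEH_FORMS.items():
    print(f"{position:8}", describe(glyph))
    # Compatibility normalization maps every positional glyph
    # back to the general form that is stored in the database.
    assert unicodedata.normalize("NFKC", glyph) == BEH
```

The database holds only the general form; picking one of the four positional glyphs is the job of whatever renders the text.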
This was all new to me, as someone who does not understand Arabic. But I got a great deal of help from an Arabic colleague. I also looked a lot at this page about Arabic script in Unicode.
On my first attempt at rendering a PDF with Arabic words, all letters were printed in the wrong direction (left-to-right), and they were all printed as individual letters; none of them were joined into words. This was clearly wrong and unacceptable for the customer.
Rendering “this is a test phrase” would look like this:
Upon analysis, I found the cause: The text strings were stored in a database in general Unicode form.
As a format, PDF files store text strings exactly as in the input (general Unicode form), and PDF viewers render the text exactly as it is stored in the PDF.
PDFSharp does not do any glyph conversion while rendering. Nor does it support right-to-left (RTL) rendering out of the box. In fact, PDFSharp does not support RTL at all.
An HTML-to-PDF library like wkhtmltopdf does render the texts correctly. But it works by rendering an HTML page using the WebKit rendering engine and then making a PDF of that page. In this process, WebKit does the conversion of glyphs. However, wkhtmltopdf is not thread-safe or scalable, and I did not want to make an intermediate HTML page.
To render a PDF file containing correctly written Arabic words, I ended up adding the following workarounds to my PDF rendering code.
1. Detecting if a string consists of only Arabic letters. If not, skip the next steps.
2. Replacing individual letters with contextually correct Arabic glyphs.
3. Reversing the string to emulate RTL.
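The last step, reversing, is the simplest of the three. A plain reversal (`text[::-1]` in Python) is enough for bare letters. The sketch below is a slightly more careful variant and is an assumption on my part, not something the workaround strictly requires: it keeps combining marks (such as vowel signs) attached to their base letter, since a naive reversal would move them onto the neighbouring letter. The function name `reverse_for_ltr_renderer` is hypothetical.

```python
import unicodedata


def reverse_for_ltr_renderer(text: str) -> str:
    """Reverse text so that a left-to-right renderer draws it right-to-left.

    Combining marks stay behind the base letter they belong to.
    """
    clusters = []
    for ch in text:
        # unicodedata.combining() is non-zero for combining marks.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))


# BEH + FATHA (a combining vowel sign) + TEH
sample = "\u0628\u064E\u062A"
visual = reverse_for_ltr_renderer(sample)
# TEH now comes first, and FATHA still follows BEH.
assert visual == "\u062A\u0628\u064E"
```

Note that a reversal like this only makes sense for strings that are purely Arabic, which is why step 1 comes first: embedded Latin words or numbers would be reversed as well.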
With this in place, rendering “this is a test phrase” now looks like this:
Here is the code for detecting if a string is purely Arabic characters.
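As a rough illustration of such a check, here is a minimal sketch in Python rather than .NET. It rests on two assumptions that you may need to adjust: "Arabic" means the basic Arabic Unicode block (U+0600 to U+06FF, which also contains Arabic punctuation and digits), and whitespace between words is allowed. The function name `is_arabic_only` is made up for this example.

```python
ARABIC_BLOCK_START = 0x0600
ARABIC_BLOCK_END = 0x06FF


def is_arabic_only(text: str) -> bool:
    """Return True if text contains Arabic characters and nothing else.

    Whitespace is ignored; an empty or whitespace-only string is not Arabic.
    """
    found_arabic = False
    for ch in text:
        if ch.isspace():
            continue
        if not (ARABIC_BLOCK_START <= ord(ch) <= ARABIC_BLOCK_END):
            return False  # any other character disqualifies the string
        found_arabic = True
    return found_arabic
```

Strings that fail the check are rendered unchanged, so mixed Arabic and Latin text is deliberately left alone rather than reversed incorrectly.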
This is the code for replacing Arabic letters with correct glyphs.
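The idea behind the replacement can be sketched as follows, again in Python and under stated assumptions, so treat it as an illustration rather than the exact converter. Instead of a hand-written table, the sketch builds its lookup table from Python's `unicodedata` module, which records for each glyph in the Arabic Presentation Forms-B block which general letter it belongs to and whether it is the isolated, final, initial or medial form. A letter that has an initial form can join to the letter after it; a letter that has a final form can join to the letter before it. The names `build_forms_table`, `FORMS` and `shape` are hypothetical. Known limitations: it ignores vowel marks between letters, does not build lam-alef ligatures, and only covers the basic Arabic letters.

```python
import unicodedata


def build_forms_table() -> dict:
    """Map each general Arabic letter to its positional glyphs."""
    table = {}
    for cp in range(0xFE80, 0xFEFD):
        # Decomposition looks like "<initial> 0628".
        parts = unicodedata.decomposition(chr(cp)).split()
        if len(parts) != 2:
            continue  # skip ligatures such as lam-alef and unassigned code points
        position = parts[0].strip("<>")
        base = chr(int(parts[1], 16))
        table.setdefault(base, {})[position] = chr(cp)
    return table


FORMS = build_forms_table()


def shape(text: str) -> str:
    """Replace general Arabic letters with contextually correct glyphs."""
    out = []
    for i, ch in enumerate(text):
        forms = FORMS.get(ch)
        if forms is None:
            out.append(ch)  # spaces, digits, anything not in the table
            continue
        prev_forms = FORMS.get(text[i - 1], {}) if i > 0 else {}
        next_forms = FORMS.get(text[i + 1], {}) if i + 1 < len(text) else {}
        # The previous letter must be able to join forward, this one backward.
        joins_prev = "initial" in prev_forms and "final" in forms
        # This letter must be able to join forward, the next one backward.
        joins_next = "initial" in forms and "final" in next_forms
        if joins_prev and joins_next:
            out.append(forms["medial"])
        elif joins_prev:
            out.append(forms["final"])
        elif joins_next:
            out.append(forms["initial"])
        else:
            out.append(forms["isolated"])
    return "".join(out)


# SEEN, LAM, ALEF, MEEM in general form
for glyph in shape("\u0633\u0644\u0627\u0645"):
    print(unicodedata.name(glyph))
```

In the example word, ALEF only joins to the letter before it, so the MEEM that follows it ends up in its isolated form. The shaping has to run before the string is reversed, because it relies on the logical order of the letters.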
Please let me know if you use the glyph converter and notice any errors in the converted glyphs. I would be happy to correct them for everyone else.