With the information revolution in full swing, electronic documents have become the principal medium of information. Business and non-business organizations alike are producing new electronic documents at a breakneck pace. At the same time, there is strong worldwide demand for Arabic applications. Arabic support has become a basic requirement in every environment, whether desktop, Internet, intranet, computer telephony, or even hand-held consumer electronics.
Arabic document verification differs fundamentally from English document verification. While the latter relies mainly on spell checking, an Arabic spell checker can fulfill this task only partially; it has to interact heavily with grammar checking. In many cases an Arabic word is correct as a standalone unit, i.e. spelling-wise, but incorrect in the context in which it is embedded. This is mainly due to the high degree of context sensitivity of Arabic syntax.
Sakhr recognized these facts at a very early stage, realized the importance of having an automatic corrector at both the spelling and grammar levels, and developed an excellent spell/grammar checker named Sahehly.
Because of the complexity of the Arabic language, building an Arabic grammar checker requires substantial research and linguistic resources. Basically, there are three main methodologies for such an NLP solution: syntax-based, statistics-based, and rule-based.
In the syntax-based approach, a text has to be fully analyzed morphologically and syntactically. This implies the development of a lexical database, a morphological analyzer, and a syntactic parser.
In statistics-based checking, by contrast, the availability of large text corpora has encouraged researchers to devise statistical models that extract valuable linguistic knowledge. Finally, the classical rule-based approach matches a set of rules against a text that has been processed by a morphological/lexical analyzer. The main challenge of this approach is that all rules have to be developed manually. The rule-based methodology employs the error anticipation technique, which relies on negative knowledge (descriptions of anticipated errors) for detection and diagnosis.
The rule-based approach has many advantages. A sentence does not have to be complete to be checked. The checker is easy to configure, as each rule has an expressive description and can be turned on and off individually. Unlike statistical systems, it can provide detailed error messages with helpful comments, even explaining grammar rules. The recommended approach is to build it incrementally, starting with just one rule and then extending the rule set to maximize coverage.
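The properties listed above (anticipated-error rules, per-rule descriptions, individual on/off switches, no need for a complete sentence) can be illustrated with a minimal Python sketch. It is not Sahehly's implementation: the class names, rule identifiers, and the two token-sequence rules are hypothetical, and real rules would operate on morphologically analyzed tokens rather than on literal word pairs.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Rule:
    rule_id: str
    description: str        # explanation shown to the user
    wrong: Tuple[str, ...]  # anticipated erroneous token sequence
    right: Tuple[str, ...]  # suggested replacement
    enabled: bool = True


@dataclass
class Finding:
    rule_id: str
    position: int           # index of the first offending token
    message: str
    suggestion: str


class RuleBasedChecker:
    """Toy error-anticipation checker: every rule describes one known mistake."""

    def __init__(self, rules: List[Rule]) -> None:
        self.rules: Dict[str, Rule] = {r.rule_id: r for r in rules}

    def set_enabled(self, rule_id: str, enabled: bool) -> None:
        self.rules[rule_id].enabled = enabled

    def check(self, text: str) -> List[Finding]:
        tokens = text.split()
        findings: List[Finding] = []
        for rule in self.rules.values():
            if not rule.enabled:
                continue
            width = len(rule.wrong)
            # Slide a window over the tokens; fragments are checked like full sentences.
            for i in range(len(tokens) - width + 1):
                if tuple(tokens[i:i + width]) == rule.wrong:
                    findings.append(
                        Finding(rule.rule_id, i, rule.description, " ".join(rule.right))
                    )
        return findings


# Hypothetical rules, hand-written from two errors in the document's sample sentence.
RULES = [
    Rule("PREP_DUAL",
         "A dual noun after a preposition takes the genitive ending.",
         ("على", "ضفتا"), ("على", "ضفتي")),
    Rule("NUM_GENDER",
         "Numbers from three to ten take the opposite gender of the counted noun.",
         ("سبع", "آلاف"), ("سبعة", "آلاف")),
]

SAMPLE = "كان المصريين القدماء يعيشوا على ضفتا نهر النيل منذ سبع آلاف سنة"
FRAGMENT = "على ضفتا"  # an incomplete sentence can still be checked

checker = RuleBasedChecker(RULES)
for finding in checker.check(SAMPLE):
    print(finding.position, finding.rule_id, finding.suggestion, "|", finding.message)
```

Calling `checker.set_enabled("NUM_GENDER", False)` switches that single rule off without touching the others, which is the configurability the rule-based approach is credited with. Growing coverage then amounts to appending further `Rule` entries.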
Besides the traditional rule-based NLP solution, we also used heuristic rules and applied state-of-the-art techniques, including machine learning.
Sahehly deals with common Arabic writing errors, which users often commit because they cannot differentiate between some similar spelling cases and therefore confuse them while writing.
Here are some features of the Spelling part of Sahehly:
Sahehly recognizes grammatical errors in the text, which users make when they are unfamiliar with the rules of Arabic grammar. The corrector takes into account the rules governing the verbal sentence and the nominal sentence, as well as prepositions, adverbs, numbers, etc. For example, given the sentence "كان المصريين القدماء يعيشوا على ضفتا نهر النيل منذ سبع آلاف سنة", the corrector suggests the following corrected sentence: "كان المصريون القدماء يعيشون على ضفتي نهر النيل منذ سبعة آلاف سنة"
Here are some of the grammar cases covered by Sahehly:
Sahehly also handles incorrect Arabic diacritics and suggests alternatives for wrongly diacritized words. For example, if the user enters the phrase "فَوْهة المدفع", the program suggests the correctly diacritized form "فُوَّهة المِدْفَع".
Sahehly takes into consideration the context in which the wrong word occurs. For example, if you enter the phrase "وزارة الاعلام" or "ترفع الاعلام", Sahehly detects the missing hamza and corrects it according to the context of the phrase: "الإعلام" in the first phrase and "الأعلام" in the second.
الجحر الأسود: البحر الأسود، الحجر الأسود
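One simple way to picture context-sensitive correction, using the hamza example above, is to rank the candidate spellings of an ambiguous word by the cue words found near it. The Python sketch below is only an illustration under that assumption: the cue table, the window size, and the function name `rank_candidates` are hypothetical, and the document does not state how Sahehly actually scores context.

```python
from typing import Dict, List, Set

AMBIGUOUS = "الاعلام"  # typed without hamza
MEDIA = "الإعلام"      # "information / media"
FLAGS = "الأعلام"      # "flags"
MINISTRY = "وزارة"     # "ministry"
RAISE = "ترفع"         # "raise"

# Hypothetical resources: candidate spellings and the context words that favor each.
CANDIDATES: Dict[str, List[str]] = {AMBIGUOUS: [MEDIA, FLAGS]}
CONTEXT_CUES: Dict[str, Set[str]] = {MEDIA: {MINISTRY}, FLAGS: {RAISE}}


def rank_candidates(tokens: List[str], index: int, window: int = 2) -> List[str]:
    """Return the candidate spellings for tokens[index], best context match first."""
    options = CANDIDATES.get(tokens[index], [])
    start = max(0, index - window)
    neighbours = set(tokens[start:index]) | set(tokens[index + 1:index + 1 + window])

    def score(candidate: str) -> int:
        # Number of cue words for this candidate that occur in the window.
        return len(CONTEXT_CUES.get(candidate, set()) & neighbours)

    # sorted() is stable, so candidates with equal scores keep their listed order.
    return sorted(options, key=score, reverse=True)


print(rank_candidates([MINISTRY, AMBIGUOUS], 1))
print(rank_candidates([RAISE, AMBIGUOUS], 1))
```

Returning the full ranked list rather than a single word matches the behavior described in the text, where the user chooses among suggested alternatives.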
Sahehly is a rule-based grammar checker for Modern Standard Arabic. It helps the user write a sentence by analyzing each word and accepting the sentence only if it is grammatically correct.
The main features of the system are that (1) it performs a complete grammatical analysis of sentences, and (2) it checks each sentence for common grammatical errors and offers suggestions for improvement. The design of the whole system is shown in the figure below.
The system is composed of five sub-systems:
It is a morphology-based spell checker that automatically corrects the spelling mistakes that emerge while editing Arabic text. The program provides correct alternatives for misspelled words, so that the user can choose one of them. The corrector is distinguished by features such as accurate context-based suggestions, idiom recognition, and diacritics recognition.
It resolves the basic problem of handling unvowelized Arabic text automatically by simulating the mental process that native Arabic speakers use to interpret undiacritized text and supply the missing vowels.
It identifies all possible fully diacritized forms of the input word, along with morphological information such as the root, morphological pattern (MP), prefix, suffix, and more. It has the following features:
o E.g. رجل ، الإنترنت ، الإدام ، إداوة ،…
o E.g. عِلْمٌ، عِلم، علم
o E.g. for the word “ورجالنا”, the root is “رجل”, the morphological pattern is “فِعَال”, the part of speech is “جمع تكسير لاسم ذات” (broken plural of a concrete noun), the prefix is “و”, and the suffix is “نا” …
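The kind of record such an analysis produces can be sketched in Python as follows. This is a toy affix-stripping lookup, not Sahehly's analyzer: the `Analysis` fields mirror the attributes named in the example above, while the prefix list, suffix list, and one-entry lexicon are hypothetical. The sketch neither generates diacritized forms nor checks whether a prefix and suffix may legally combine.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Analysis:
    prefix: str
    stem: str
    suffix: str
    root: str
    pattern: str
    pos: str


CONJ = "و"      # conjunction prefix, "and"
POSS = "نا"     # possessive suffix, "our"
STEM = "رجال"   # "men"
WORD = CONJ + STEM + POSS

# Hypothetical, deliberately tiny resources.
PREFIXES: List[str] = ["", CONJ, "ف", "ب"]
SUFFIXES: List[str] = ["", POSS, "هم", "ها"]
LEXICON: Dict[str, Tuple[str, str, str]] = {
    # stem: (root, morphological pattern, part of speech)
    STEM: ("رجل", "فِعَال", "جمع تكسير لاسم ذات"),
}


def analyze(word: str) -> List[Analysis]:
    """Return every prefix/stem/suffix split whose stem is in the lexicon."""
    results: List[Analysis] = []
    for prefix in PREFIXES:
        for suffix in SUFFIXES:
            if not (word.startswith(prefix) and word.endswith(suffix)):
                continue
            if len(prefix) + len(suffix) >= len(word):
                continue  # nothing would be left for the stem
            stem = word[len(prefix):len(word) - len(suffix)]
            entry = LEXICON.get(stem)
            if entry is not None:
                results.append(Analysis(prefix, stem, suffix, *entry))
    return results


for analysis in analyze(WORD):
    print(analysis)
```

Returning a list reflects the fact that an undiacritized Arabic word can have several valid analyses; with this one-entry lexicon the example word yields a single result.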
Many grammatical errors can be described as violations of formal constraints between different syntactic categories. The constraints may concern agreement or order between the elements of sub-phrases. The grammar checking system aims to cover the basic grammar rules for the nominal sentence and the verbal sentence.
In our implementation, error detection is embedded within the grammar rules. Each rule consists of two parts: the constraints (condition) part and the success/fail action part. If any constraint is not satisfied, the whole rule fails, and an error message reporting the type of error that occurred is issued.
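The two-part rule structure described above might look like the following Python sketch, in which a rule holds a list of named constraints and its result carries either a success message or the type of error that caused the failure. All names here (`GrammarRule`, `RuleResult`, `agree_on`) and the hand-written feature dictionaries are assumptions for illustration; in a real system the features would come from the morphological analyzer.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Features = Dict[str, str]
Constraint = Callable[[Features, Features], bool]


@dataclass
class RuleResult:
    ok: bool
    message: str


@dataclass
class GrammarRule:
    name: str
    constraints: List[Tuple[str, Constraint]]  # (error type, predicate)

    def apply(self, head: Features, dependent: Features) -> RuleResult:
        # Constraints part: the first unsatisfied constraint fails the whole rule.
        for error_type, holds in self.constraints:
            if not holds(head, dependent):
                # Fail action: report which type of error occurred.
                return RuleResult(False, f"{self.name}: {error_type} mismatch")
        # Success action.
        return RuleResult(True, f"{self.name}: ok")


def agree_on(feature: str) -> Constraint:
    """Build a constraint requiring both words to share a feature value."""
    def constraint(head: Features, dependent: Features) -> bool:
        return head[feature] == dependent[feature]
    return constraint


NOUN_ADJECTIVE = GrammarRule(
    "noun-adjective agreement",
    [(f, agree_on(f)) for f in ("gender", "number", "definiteness")],
)

# Hand-written features for "the city" and two forms of "the big".
city = {"form": "المدينة", "gender": "f", "number": "sg", "definiteness": "def"}
big_m = {"form": "الكبير", "gender": "m", "number": "sg", "definiteness": "def"}
big_f = {"form": "الكبيرة", "gender": "f", "number": "sg", "definiteness": "def"}

print(NOUN_ADJECTIVE.apply(city, big_m).message)
print(NOUN_ADJECTIVE.apply(city, big_f).message)
```

Keeping the error type next to each constraint is what lets the failure message name the specific violation (here, gender) instead of reporting only that the rule failed.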