It is preferable to read the PDF statement.

Cuber QQ is poor at English writing, and while preparing this contest he realized that he makes so many grammar mistakes that an auto-correction engine is needed. Instead of using online tools such as ''Microsoft Aim Writing'' or ''Grammarly'', he was interested in building a new engine on his own.
In particular, he adopted a naive sequence-to-sequence model that takes a sequence, which is usually a sentence, and predicts for each token, which is usually a word or character, whether there is something wrong with it, and if so, what it should be replaced with. Here are several examples:
- In ''Cuber QQ was one of the admirers Quber CC.'', ''admirers'' should be replaced with ''admirers of''.
- In ''Cuber QQ confess his love to Cuber QQ just now.'', ''confess'' should be replaced with ''confessed''.
- In ''Quber CC said that they are being and always will be good friends.'', ''are being'' should be replaced with ''are''.
You might notice that, in this sequence-to-sequence model, the phrase to be replaced must be at least one token long, and the replacement must also be at least one token long. This is related to the architecture and training approach of his model; we will not go into the machine-learning details here, as that would make the statement tedious. The problem, however, is that the training data does not conform to this format. In the training data, a sequence with flaws can be annotated with three types of annotations: add, delete and replace. Concretely,
- $A$ $l$ $s_1$ $s_2$ $\cdots$ $s_v$: to add sequence $s$ before position $l$.
- $D$ $l$ $r$: to delete from $l$-th token to the $r$-th token, inclusive.
- $R$ $l$ $r$ $s_1$ $s_2$ $\cdots$ $s_v$: to replace sub-sequence from $l$-th token to $r$-th token, inclusive, with sequence $s$.
All the annotations are applied directly to the original sequence, i.e., indices such as $l$ and $r$ refer to the original indices, not the indices after modification.
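The three annotation types and the original-index convention can be sketched as follows. This is a minimal illustration, not part of the problem's required solution; the tuple encoding (`("A", l, seq)`, `("D", l, r)`, `("R", l, r, seq)`) and the assumption that annotations do not overlap are ours.

```python
def apply_annotations(tokens, annotations):
    """Apply add/delete/replace annotations whose indices all refer to
    the ORIGINAL 1-indexed token positions (annotations assumed
    non-overlapping, as in the problem's training data)."""
    # additions[i]: tokens to insert before original position i
    additions = {}
    # keep[i]: the tokens emitted in place of original token i
    keep = {i: [tokens[i - 1]] for i in range(1, len(tokens) + 1)}
    for ann in annotations:
        if ann[0] == "A":                  # add seq before position l
            _, l, seq = ann
            additions[l] = list(seq)
        elif ann[0] == "D":                # delete tokens l..r, inclusive
            _, l, r = ann
            for i in range(l, r + 1):
                keep[i] = []
        else:                              # replace tokens l..r with seq
            _, l, r, seq = ann
            for i in range(l, r + 1):
                keep[i] = []
            keep[l] = list(seq)
    out = []
    for i in range(1, len(tokens) + 2):    # position n+1 = append at end
        out.extend(additions.get(i, []))
        if i <= len(tokens):
            out.extend(keep[i])
    return out
```

For instance, annotating the second example sentence with `("R", 3, 3, ["confessed"])` turns ''confess'' into ''confessed'' while every index still points into the original sequence.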
As ''add'' and ''delete'' will not be supported by the model, the preprocessing step needs to rewrite all ''add'' and ''delete'' annotations as ''replace'' annotations. Furthermore, as there are many ways to achieve this, Cuber QQ wants to find the cheapest one, i.e., after the annotation rewriting, the total number of replaced tokens should be as small as possible. If there is a tie, the number of annotation records should be as small as possible. In case there is still a tie, any one of them is acceptable.
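To make the rewriting concrete, here is a naive per-annotation conversion: an add absorbs one neighbouring token into a replace, and a delete absorbs the token just outside its range. This sketch uses our own tuple encoding, assumes non-overlapping annotations and that the whole sequence is never deleted, and is deliberately NOT the cheapest rewriting the problem asks for (it never merges adjacent annotations, which is where the optimization lies).

```python
def rewrite_naive(tokens, annotations):
    """Rewrite each add/delete annotation as a replace, independently.
    Assumes 1-indexed, non-overlapping annotations over `tokens`, and
    that a delete never removes the entire sequence."""
    n = len(tokens)
    out = []
    for ann in annotations:
        if ann[0] == "A":                  # add seq before position l
            _, l, seq = ann
            if l <= n:                     # replace token l with seq + itself
                out.append(("R", l, l, list(seq) + [tokens[l - 1]]))
            else:                          # l = n+1: append after the end
                out.append(("R", n, n, [tokens[n - 1]] + list(seq)))
        elif ann[0] == "D":                # delete tokens l..r
            _, l, r = ann
            if r < n:                      # absorb the token to the right
                out.append(("R", l, r + 1, [tokens[r]]))
            else:                          # r = n: absorb the token to the left
                out.append(("R", l - 1, r, [tokens[l - 2]]))
        else:                              # replaces pass through unchanged
            out.append(ann)
    return out
```

Each add costs one extra replaced token and each delete costs one, so this gives a valid upper bound on the answer; the actual task is to do better by merging annotations that touch adjacent tokens.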