Grammar Checker
We write a grammar and a parser to parse the POS tag sequence.
Data
Input data: sentences with POS tags
The input is a TSV (tab-separated values) file like the sample:
|id|label|sentence|pos|
|--|-----|--------|---|
|73|0|Many thanks in advance for your cooperation .|JJ NNS IN NN IN PRP$ NN .|
|74|1|At that moment we saw the bus to come .|IN DT NN PRP VBD DT NN TO VB .|
The id column is the unique id for each sentence. The label column indicates whether a sentence contains grammar errors (1 means having errors and 0 means error-free). The pos column contains the POS tags for each token in the sentence, also separated by a single space.
The POS tags follow the Penn Treebank (PTB) tagging scheme, described here
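As a rough illustration of the input format, the sketch below reads such a file with the standard csv module. It assumes the file starts with a header row naming the four columns shown above; the function name read_rows and the keys of the returned dictionaries are hypothetical and not taken from the project code.

```python
import csv
import io


def read_rows(handle):
    """Read tab-separated rows into dictionaries (assumes a header row)."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        rows.append({
            "id": row["id"],
            "label": int(row["label"]),          # 1 = has errors, 0 = error-free
            "tokens": row["sentence"].split(" "),
            "tags": row["pos"].split(" "),       # one PTB tag per token
        })
    return rows


# In-memory sample mirroring the first row of the table above.
sample = (
    "id\tlabel\tsentence\tpos\n"
    "73\t0\tMany thanks in advance for your cooperation .\tJJ NNS IN NN IN PRP$ NN .\n"
)
rows = read_rows(io.StringIO(sample))
```

For a real file, the same function could be called on a handle opened with open(path, newline="", encoding="utf-8").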
Tasks
Task 1: Building a toy grammar
- We wrote a toy CFG for English in NLTK’s .cfg format.
Task 2: Constituency parsing
- We used the chart parser from NLTK to parse each of the POS sequences in the dataset with the toy grammar we wrote in task 1. We stored results in a TSV file with three columns:
Column name | Description
---|---
id | The id of the input sentence.
ground_truth | The ground truth label of the input sentence, copied from the dataset.
prediction | 1 if the sentence has grammar errors, 0 if not. In other words, 0 if the POS sequence can be parsed successfully with our grammar and parser, and 1 otherwise.
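To make Tasks 1 and 2 concrete, here is a minimal sketch of parsing POS sequences with NLTK's chart parser. The grammar is a tiny hypothetical example written only for this illustration (it is not the contents of grammars/toy.cfg), and the function name predict is likewise an assumption. The terminals are PTB tags rather than words.

```python
import nltk

# Hypothetical toy grammar: terminals are quoted PTB POS tags.
TOY_GRAMMAR = nltk.CFG.fromstring("""
S -> NP '.' | NP VP '.' | PP NP VP '.'
NP -> 'PRP' | 'NN' | 'DT' 'NN' | 'JJ' 'NNS' | 'PRP$' 'NN' | NP PP
PP -> 'IN' NP
VP -> 'VBD' NP | 'VBD' NP PP | 'VBD' 'TO' 'VB'
""")

PARSER = nltk.ChartParser(TOY_GRAMMAR)


def predict(pos_sequence):
    """Return 0 if the tag sequence parses with the grammar, 1 otherwise."""
    tags = pos_sequence.split()
    try:
        # parse() yields parse trees; one tree is enough to accept the sequence.
        first_tree = next(iter(PARSER.parse(tags)), None)
    except ValueError:
        # NLTK raises ValueError when a tag is not covered by the grammar.
        return 1
    return 0 if first_tree is not None else 1


ok = predict("JJ NNS IN NN IN PRP$ NN .")            # sample row 73
bad = predict("IN DT NN PRP VBD DT NN TO VB .")      # sample row 74
unknown = predict("XX YY")                           # tags outside the grammar
```

With this particular grammar, the first sample sequence is accepted and the second is rejected because nothing can consume the trailing TO VB after the object noun phrase. A real grammar would need far broader coverage.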
Task 3: Evaluation and error analysis
- We evaluated the performance of our grammar checker by calculating its precision and recall on the data available to us. To do that, we compared the system's prediction for each sentence with the corresponding label in the dataset.
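The comparison can be sketched as follows, assuming that the positive class is label 1 (the sentence has errors). The function name precision_recall and the example lists are hypothetical and are not results from the project.

```python
def precision_recall(labels, predictions):
    """Compute precision and recall, treating 1 (has errors) as positive."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # errors correctly flagged
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # correct sentences flagged
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # errors missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels and predictions, for illustration only.
labels = [1, 1, 0, 0, 1]
predictions = [1, 0, 1, 0, 1]
precision, recall = precision_recall(labels, predictions)
```

Here there are two true positives, one false positive and one false negative, so both precision and recall come out to 2/3.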
Report and Results
Further details and results can be found here
Contributors
Leen Alzebdeh @Leen-Alzebdeh
Sukhnoor Khehra @Sukhnoor-K
Resources Consulted
Jurafsky, D., & Martin, J. H. (2009). Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. Pearson Prentice Hall.
GitHub Copilot
Libraries
We run this project using the standard Python libraries csv and sys, plus the third-party library nltk.
Instructions to execute code
- Ensure Python is installed, as well as the Python Standard Library.
- Ensure the library nltk is installed; it can be installed using the following command:
pip install --user -U nltk
- Ensure you have input data in the format outlined above, in a file ‘data/train.tsv’.
Example usage: use the following command in the current directory.
python3 src/main.py data/train.tsv grammars/toy.cfg output/train.tsv