Thursday, February 3, 2022

What is SQuAD

The Stanford Question Answering Dataset (SQuAD) is a set of question-answer pairs that present a strong challenge for NLP models. Whether you’re just interested in learning about a popular NLP dataset or planning to use it in one of your projects, here are all the basics you should know.


What task does SQuAD present? As implied by its name, SQuAD focuses on the task of question answering. It tests a model’s ability to read a passage of text and then answer questions about it (flashback to reading comprehension on the SAT). It’s a relatively straightforward task: the model gets a passage and a question, and it has to pick out the part of the passage that answers the question.
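To make the task concrete, here is a minimal Python sketch. The passage, question, and answer are made up for illustration (this is not the example from the SQuAD paper), and the record is a simplified, flattened version of the dataset's JSON layout, where each answer stores its text plus a character offset called `answer_start`. The helper name `answer_span` is hypothetical.

```python
# A made-up passage, laid out like a simplified SQuAD record.
context = "The Nile flows north through Egypt and empties into the Mediterranean Sea."
answer_text = "the Mediterranean Sea"

example = {
    "context": context,
    "question": "Which sea does the Nile empty into?",
    "answers": [
        # answer_start is the character offset of the answer in the passage.
        {"text": answer_text, "answer_start": context.find(answer_text)},
    ],
}


def answer_span(record):
    """Return the slice of the passage that the first answer points to."""
    answer = record["answers"][0]
    start = answer["answer_start"]
    end = start + len(answer["text"])
    return record["context"][start:end]


print(answer_span(example))
```

The point of the sketch is that the answer is not free text: it is a span that can be recovered from the passage by its start offset and length.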


How was SQuAD created? To compile SQuAD, the creators sampled 536 articles from the top 10,000 Wikipedia articles. From these sampled articles, they extracted a total of 23,215 individual paragraphs (making sure to filter out paragraphs that were too short). They split the dataset by article, so that 80% of the articles went into the training set, 10% into a development set, and 10% into a test set.
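Splitting by article, rather than by paragraph or question, means every paragraph from a given article lands in the same split. Here is a rough Python sketch of that idea. The function name `split_by_article`, the seed, and the rounding behavior are my own assumptions, and the integer ids are placeholders for real article titles, so this will not reproduce the actual SQuAD split.

```python
import random


def split_by_article(article_ids, train_frac=0.8, dev_frac=0.1, seed=0):
    """Shuffle article ids and cut them into train/dev/test groups.

    Working at the article level keeps all paragraphs of one article
    together. The fractions and seed are illustrative assumptions.
    """
    ids = list(article_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_frac)
    n_dev = int(len(ids) * dev_frac)
    train = ids[:n_train]
    dev = ids[n_train:n_train + n_dev]
    test = ids[n_train + n_dev:]  # whatever is left over
    return train, dev, test


# Placeholder ids standing in for the 536 sampled articles.
train, dev, test = split_by_article(range(536))
print(len(train), len(dev), len(test))
```

Because the fractions are rounded down, the leftover articles all end up in the test group, so the three groups are only approximately 80/10/10.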


Annotating SQuAD. The most important part of creating a dataset, annotating it, was done by Mechanical Turk workers. Classic! I’m seeing Mechanical Turk make a cameo in a lot of these NLP papers. These workers were selected only if they had a history of high-quality work (as measured by their HIT approval rate). For each selected paragraph, the workers were asked to come up with and answer up to 5 questions on the content of the paragraph. They were given a text field to type each question, and they highlighted the answer in the paragraph itself. The creators of SQuAD pushed the workers to phrase the questions in their own words, even disabling the copy-paste functionality. Noooooo! Not my copy-paste tools!



References:

https://towardsdatascience.com/the-quick-guide-to-squad-cae08047ebee
