4th Week Report – WEBM&DIA

At the start of this week, we ran into our first issue. Our dataset has around 1.3 million unique entries, so many that we could not run and test the pre-processing code, since a full run takes hours. We therefore had to down-sample the majority class of our dataset. Since we knew there were around 80k questions with target=1, which marks a question as toxic, we wanted around 80k questions with target=0 as well, to obtain a balanced dataset. To do that, we used the resample() function from sklearn.utils.
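The down-sampling step can be sketched roughly as follows. This is a minimal illustration, not our exact code: the helper name downsample_majority, the toy data and the random_state value are assumptions made for the example, and it assumes that target=1 is the minority class.

```python
import numpy as np
from sklearn.utils import resample


def downsample_majority(texts, targets, random_state=42):
    """Keep all target=1 rows and an equally sized random sample of target=0 rows.

    Hypothetical helper for illustration; assumes target=1 is the minority class.
    """
    texts = np.asarray(texts, dtype=object)
    targets = np.asarray(targets)

    minority_mask = targets == 1
    n_minority = int(minority_mask.sum())

    # Sample the majority class without replacement down to the minority size.
    majority_texts, majority_targets = resample(
        texts[~minority_mask],
        targets[~minority_mask],
        replace=False,
        n_samples=n_minority,
        random_state=random_state,
    )

    balanced_texts = np.concatenate([texts[minority_mask], majority_texts])
    balanced_targets = np.concatenate([targets[minority_mask], majority_targets])
    return balanced_texts, balanced_targets


# Toy data standing in for the real questions: 10 toxic, 90 non-toxic.
texts = [f"question {i}" for i in range(100)]
targets = np.array([1] * 10 + [0] * 90)

bal_texts, bal_targets = downsample_majority(texts, targets)
```

Sampling with replace=False means no majority-class question is duplicated, and fixing random_state keeps the subset reproducible between runs.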

As for Natural Language and Conversational Systems, we were introduced to PL3: “Recognize named entities on Twitter with LSTMs”, in which we use a recurrent neural network to solve the Named Entity Recognition (NER) problem, a common task in natural language processing systems. The solution to this task is based on neural networks, specifically on Bi-Directional Long Short-Term Memory networks (Bi-LSTMs).
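To give an idea of what such a model looks like, here is a minimal sketch of a Bi-LSTM tagger that outputs one tag score vector per token. PyTorch is used purely for illustration and is an assumption, as are the class name BiLSTMTagger, the layer sizes and the random input; this is not the code from the lab.

```python
import torch
from torch import nn


class BiLSTMTagger(nn.Module):
    """Illustrative token tagger: embedding -> Bi-LSTM -> per-token tag scores."""

    def __init__(self, vocab_size, n_tags, embedding_dim=32, hidden_dim=64, pad_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        # bidirectional=True reads each sentence left-to-right and right-to-left.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Forward and backward states are concatenated, hence 2 * hidden_dim.
        self.classifier = nn.Linear(2 * hidden_dim, n_tags)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)   # (batch, seq_len, embedding_dim)
        outputs, _ = self.lstm(embedded)       # (batch, seq_len, 2 * hidden_dim)
        return self.classifier(outputs)        # (batch, seq_len, n_tags)


torch.manual_seed(0)
model = BiLSTMTagger(vocab_size=100, n_tags=5)

# Two made-up sentences of 7 token ids each (id 0 is reserved for padding).
batch = torch.randint(1, 100, (2, 7))
logits = model(batch)
predicted_tags = logits.argmax(dim=-1)  # one tag index per token
```

Because the network is bidirectional, the score for each token depends on both the words before it and the words after it, which helps when deciding whether a word is part of an entity.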

In addition, we continued working on the state-of-the-art section of our article, describing the general idea of what we are going to do and which articles we will use for reference and for comparison with our project.