Recommendations for a context-aware text classifier

submitted 2 years ago by Bluetreefrog@lemmy.world to c/machinelearning@lemmy.world


I've got a bot running/in development to detect and flag toxic content on Lemmy but I'd like to improve on it as I'm getting quite a few false positives. I think that part of the reason is that what constitutes toxic content often depends on the parent comment or post.

During a recent postgrad assignment I was taught (and saw for myself) that a bag of words model usually outperforms LSTM or transformer models for toxic text classification, so I've run with that, but I'm wondering if it was the right choice.

Does anyone have any ideas on what kind of model would be most suitable to include a parent as context, but to not explicitly consider whether the parent is toxic? I'm guessing some sort of transformer model, but I'm not quite sure how it might look/work.
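To make the question concrete, here is a minimal sketch of one way parent context could be added while staying with a bag of words. It is an illustration under stated assumptions, not the bot's actual code: the `c:` and `p:` prefixes, the tokenizer, and the function names are all hypothetical. The parent's words land in their own feature namespace, so they only act as extra inputs for the reply's label and the parent itself is never scored.

```python
import re
from collections import Counter

# Hypothetical, deliberately simple tokenizer: lowercase words, digits and apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text: str) -> list[str]:
    """Split text into lowercase tokens."""
    return TOKEN_RE.findall(text.lower())


def context_features(comment: str, parent: str = "") -> Counter:
    """Bag-of-words counts with separate namespaces for comment and parent.

    Tokens from the comment are prefixed with "c:" and tokens from the
    parent with "p:" (both prefixes are assumptions for this sketch), so a
    downstream classifier can weight the same word differently depending
    on where it appeared.
    """
    features = Counter(f"c:{tok}" for tok in tokenize(comment))
    features.update(f"p:{tok}" for tok in tokenize(parent))
    return features


feats = context_features("You would say that", parent="Cats are better than dogs")
print(feats)
```

One caveat: a purely linear model over these counts adds up the parent and comment evidence independently, so it cannot learn that a phrase is only toxic in reply to a particular kind of parent. That interaction is the sort of thing a transformer fed both texts together might capture.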


vluz@kbin.social 1 points 2 years ago* (last edited 2 years ago)

Oof, pop-culture references are hard and I had not considered that at all.
Thanks for the examples, I'll have a think on how to deal with those.

My only insight is one you already had.
Score at least the parent comment as well, and then use that output to dampen or amplify the final result for the reply.
Sorry for being no help at all.
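As a rough sketch of the dampen/amplify idea, assuming both scores are probabilities in [0, 1]: the function name, the `weight` parameter, and the choice that a toxic parent amplifies rather than dampens are all assumptions for illustration, not something taken from the ToxTest code.

```python
def adjust_with_parent(comment_score: float, parent_score: float,
                       weight: float = 0.5) -> float:
    """Scale a comment's toxicity score using its parent's score.

    A parent score above 0.5 amplifies the comment score and a parent score
    below 0.5 dampens it. `weight` (hypothetical, 0..1) sets how strongly
    the parent can move the result. The output is clamped to [0, 1].
    """
    # Maps parent_score 0.0 -> 1 - weight, 0.5 -> 1.0, 1.0 -> 1 + weight.
    factor = 1.0 + weight * (parent_score - 0.5) * 2.0
    return min(1.0, max(0.0, comment_score * factor))


# A borderline reply under a neutral, a toxic, and a clean parent.
for parent in (0.5, 1.0, 0.0):
    print(parent, adjust_with_parent(0.6, parent))
```

Whether a toxic parent should push the reply's score up or down probably depends on the data, so the sign and size of `weight` would need tuning against labelled parent/reply pairs.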

My project is very basic but I'll post it here for any insight you might get out of it.
I teach Python in a variety of settings and this is part of a class.

The data used is from Kaggle: https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge/
The original data came from the Wikipedia toxic comments dataset.
Several users have also posted code there, which is very helpful for some insight into the problem.

The data is dirty and needs cleaning up, so I've done that and posted the result on HF here:
https://huggingface.co/datasets/vluz/Tox

The model is a very basic TensorFlow implementation intended for teaching TF basics.
https://github.com/vluz/ToxTest
Some of the helper scripts are very wonky and need fixing before I present this in class.

Here are my weights after 30 epochs:
https://huggingface.co/vluz/toxmodel30

And here it is running on an HF Space:
https://huggingface.co/spaces/vluz/Tox

this post was submitted on 11 Aug 2023
