Pushshift Reddit Dataset on Hugging Face

There are two main ways of accessing the Reddit comment and submission database. Scores and other metadata, such as edits to a submission's selftext or a comment's body field, may not reflect what is displayed by Reddit.

Pushshift Archive, ~2005-06 to 2023-03: Pushshift was a social media data collection, analysis, and archiving platform that collected Reddit data from 2015 onward and made it available to everyone. The goal of the project is to provide a feature-rich API for searching Reddit comments and submissions, and to give the ability to aggregate the data in various ways to make interesting discoveries within it.

Widely employed by numerous LLMs [9; 79], these datasets contribute to the models' training by exposing them to a diverse array of textual genres and subject matter.

Related questions: How to select a good model on the Hugging Face platform? What is the best way to represent the sentiment change over time? What is the best method for labelling the dataset? One approach described is to use a general BERT model for initial classification and then use these labels to fine-tune the final transformer model. Would you be able to prevent Pushshift from logging the true text of your comments if you started every ...

For practical application, using Python with Pushshift to access Reddit data simplifies data extraction, enabling specific queries such as searching comments or submissions, filtering by subreddit, or excluding certain authors.
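The kinds of query mentioned here (searching comments or submissions, filtering by subreddit, excluding certain authors) can be sketched as a small URL builder. This is only an illustration: the base URL, the parameter names (q, subreddit, author, size) and the "!" negation prefix are assumptions modelled on the historical public Pushshift API, which may no longer be reachable, and build_query_url is a hypothetical helper. The sketch builds the URL without sending any request.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Assumption: base URL of the historical Pushshift search API.
BASE_URL = "https://api.pushshift.io/reddit/search"


def build_query_url(kind, query=None, subreddit=None, exclude_authors=(), size=25):
    """Build (but do not send) a search URL for comments or submissions."""
    if kind not in ("comment", "submission"):
        raise ValueError("kind must be 'comment' or 'submission'")
    params = {"size": size}
    if query:
        params["q"] = query
    if subreddit:
        params["subreddit"] = subreddit
    if exclude_authors:
        # Assumption: a leading "!" negates a value; values are comma-separated.
        params["author"] = ",".join("!" + name for name in exclude_authors)
    return f"{BASE_URL}/{kind}/?{urlencode(params)}"


url = build_query_url(
    "comment",
    query="gravity",
    subreddit="explainlikeimfive",
    exclude_authors=["AutoModerator"],
)
# Decode the query string again to inspect the parameters that would be sent.
decoded = parse_qs(urlparse(url).query)
print(url)
print(decoded)
```

Keeping URL construction separate from the network call makes the query easy to inspect and test before pointing it at whichever endpoint is actually available.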
Currently, data is copied into Pushshift at the time it is posted to Reddit.

In a dataset derived from the Reddit submission dumps, the subreddit field is always explainlikeimfive, indicating which subreddit the question came from.

These datasets include a wide range of literary genres, including novels, essays, poetry, history, science, philosophy, and more.

One way of accessing the data involves downloading full Reddit submission and comment dumps from https://files.
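Once a dump is downloaded, filtering it by subreddit might look like the sketch below. It assumes the dump has already been decompressed to newline-delimited JSON (one object per line), and the sample records and field names (subreddit, author, body, score) are illustrative. iter_records is a hypothetical helper, not part of any Pushshift tooling.

```python
import io
import json


def iter_records(lines, subreddit=None):
    """Yield JSON objects from an iterable of lines, optionally keeping one subreddit."""
    wanted = subreddit.lower() if subreddit else None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the whole pass
        if not isinstance(record, dict):
            continue
        name = str(record.get("subreddit") or "").lower()
        if wanted is None or name == wanted:
            yield record


# Illustrative in-memory stand-in for a decompressed dump file.
sample = io.StringIO(
    '{"subreddit": "explainlikeimfive", "author": "a", "body": "Why is the sky blue?", "score": 3}\n'
    '{"subreddit": "askscience", "author": "b", "body": "Rayleigh scattering.", "score": 7}\n'
    "not json\n"
)
eli5 = list(iter_records(sample, subreddit="explainlikeimfive"))
print(eli5)
```

Because the function streams line by line, the same loop works on an open file handle for a dump too large to hold in memory.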