As promised, more data from the blogs and education category is now in our granary! To give an idea of the task we're facing, the data from this category alone spans 2.9 million files, and that's just a fraction of what we've collected. Another new set of data covers job listings. As a result, our project currently holds the largest collection of Polish text data!
Happy Easter!
We wish you much peace and joy in the coming days!
In the meantime, we can report the import of more data. As promised, another batch from the blogs and education category, which, together with the previous texts, gives us more than 145 GB of text data. You can find more details on our dashboard: Speakleash Dashboard – Streamlit
141 GB
Another 3 datasets are already in our granary! They come from general media outlets as well as blog-related sites. Our collection currently stands at 141 GB, and you can be sure there will be further growth in these areas, media and blogs alike, in the near future.
Below you can see the distribution of the categories in a pie chart.
We don’t stop
We have big plans and an amazing team, but the amount of data is too large for the current team to reach our ambitious goal on schedule.
Therefore, if you know Python and love data, please write to us. We need your help right now!
Ending on a positive note, another 6 GB from the legal category is already in our SpeakLeash. For more details, visit our dashboard (https://speakleash.streamlit.app/).
Spring has come!
We welcome spring with more great news! Thanks to data acquired from the media and online-store categories, we have exceeded 120 GB of data! A big thank you to the whole team for their hard work, which is an inspiration to all of us.
How much do you think we will be able to collect this spring?
Another milestone
After months of research and talks, we can say we have reached a milestone in our mission: over 100 GB of pure text data! It includes Wikipedia, theses, and novels. What do you think? What data would you like to add to train the first Polish GPT? Don't hesitate to look it up here: https://speakleash.streamlit.app/.
BIG ANNOUNCEMENT!
From now on, you can see a live dashboard on our webpage (https://speakleash.streamlit.app/)! It lets you track how our work is going, from the amount of data collected to its distribution across industries and much more. You can also apply filters to fit your needs. If you have any questions about the dashboard or SpeakLeash in general, don't hesitate to ask.
Social & GitHub are live!
We are happy to announce that our social platforms & GitHub are live! You can find the links in the Community & Contact section. If you want to stay up to date on our progress, make sure to follow us.
The project to build a dataset of at least 1 TB of diverse Polish texts for language modeling now has an official blog! Keep an eye on it for the latest news about the dataset's development. Here is the SpeakLeash.org blog RSS feed.