As promised, more data from the blogs and education category is now in our granary! To give an idea of the task we're facing, the data from this category alone spans 2.9 million files, and that's just a fraction of what we've collected. Another new set of data covers job listings. As a result, our project currently holds the largest collection of Polish text data!
Happy Easter!
We wish you much peace and joy in the coming days!
In the meantime, we can report the import of more data. As promised, another batch from the blogs and education category, which, together with the previous texts, gives us more than 145 GB of text data. You can find more details on our dashboard: Speakleash Dashboard – Streamlit
141 GB
Another 3 datasets are already in our granary! They come from general media outlets as well as blog-related sites. Our collection currently stands at 141 GB, and you can be sure there will be further growth in these areas, media and blogs alike, in the near future.
Below you can see the distribution of the categories in a pie chart.
We don’t stop
We have big plans and an amazing team, but the amount of data is too large for the current team to reach our ambitious goal on schedule.
Therefore, if you know Python and love data, please write to us. We need your help right now!
Ending on a positive note, another 6 GB from the legal category is already in our SpeakLeash. For more details, visit our dashboard (https://speakleash.streamlit.app/).
Spring has come!
We welcome spring with more great news! Thanks to data acquired from the media and online-store categories, we have exceeded 120 GB of data! A big thank you to the whole team for their hard work, which is an inspiration to all of us.
How much do you think we will be able to collect this spring?
Another milestone
After months of research and talks, we can say we have reached a milestone in our mission: over 100 GB of pure text data! It includes Wikipedia, theses, and novels. What do you think? What data would you like to add to train the first Polish GPT? Don't hesitate to look it up here: https://speakleash.streamlit.app/.
BIG ANNOUNCEMENT!
From now on, you can see a live dashboard on our webpage (https://speakleash.streamlit.app/)! It lets you track how our work is going, from the amount of data collected to its distribution across industries and much more. You can also apply filters to fit your needs. If you have any questions about the dashboard or SpeakLeash in general, don't hesitate to ask.
Social & GitHub are live!
We are happy to announce that our social platforms & GitHub are live! You can find the links in the Community & Contact section. If you want to stay up to date on our progress, make sure to follow us.
The project to build a dataset of at least 1 TB of diverse Polish texts for language modeling now has an official blog! Keep an eye on it for the latest news about the dataset's development. Here is the SpeakLeash.org blog RSS feed.