SpeakLeash /ˈspix.lɛʂ/, a.k.a. Spichlerz (Polish for granary), is a new initiative to create a Polish Large Language Model (LLM). LLMs are multi-purpose, transformer-based models used for natural language generation and processing.
Our goal is to build new data sets and catalog existing ones so that researchers can conduct bleeding-edge research on language modeling. The data sets curated under SpeakLeash come with manifests that describe their licensing and basic statistics, helping researchers pick the data that best fits their needs.
Thanks to direct discussions with foreign LLM developers, e.g. Big Science (BLOOM) and EleutherAI (GPT-J/GPT-NeoX-20B), we have received a wealth of information and access to (open) tools for building diverse text data sets. We hope that these research groups will include our data sets in their work, resulting in first-class support for the Polish language in ongoing and future projects.
The applications of LLMs are wide-ranging, from content generation (e.g. articles, journals, memos) to very advanced predictions in medicine (e.g. prediction of subsequent COVID-19 variants). What will you do with it?