

A large-scale, permissively licensed code dataset for training Code LLMs.

The Stack is a dataset of over 6TB of permissively licensed source code files covering 358 programming languages. Created as part of the BigCode Project, it is designed for training Large Language Models for Code (Code LLMs), which synthesize programs from natural language descriptions and code snippets. Each data point carries provenance information so that the original licenses can be honored. The dataset is updated regularly to reflect data removal requests, and it aims to promote reproducibility in Code LLM training by offering an open, large-scale resource. Users must agree to the terms of use, which include attribution and updating to newer versions of the dataset.
Provides detailed information about the origin and licensing of each code file, ensuring compliance with original licenses.
The dataset is regularly updated to enact validated data removal requests, ensuring the dataset remains current and compliant.
Covers 358 programming languages, providing a broad spectrum for training versatile code LLMs.
Implements near-deduplication to remove redundant code, improving training efficiency and model generalization.
Includes metadata such as file size, language, extensions, and repository information, enriching the training data.
Enables the loading and processing of the dataset in a streaming fashion, reducing memory footprint and facilitating large-scale training.
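The near-deduplication feature listed above can be illustrated with a small sketch. This is not The Stack's actual pipeline: it is a minimal example that compares files by Jaccard similarity over token shingles, and the function names, the shingle size `k`, and the `threshold` value are all assumptions chosen for illustration. A production system would typically use an approximate method rather than this pairwise comparison.

```python
def shingles(text, k=5):
    """Return the set of k-token shingles (overlapping token windows) of a file."""
    tokens = text.split()
    if len(tokens) < k:
        # Short files become a single shingle; empty files have none.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(files, threshold=0.85, k=5):
    """Keep each file only if it is not too similar to a file already kept.

    Hypothetical helper: compares every file against all kept files,
    which is quadratic and only suitable for small illustrative inputs.
    """
    kept = []
    kept_shingles = []
    for text in files:
        current = shingles(text, k)
        if any(jaccard(current, seen) >= threshold for seen in kept_shingles):
            continue  # near-duplicate of an earlier file, skip it
        kept.append(text)
        kept_shingles.append(current)
    return kept
```

Calling `drop_near_duplicates` on a list of file contents returns the list with later near-duplicates removed; lowering `threshold` makes the filter more aggressive.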
Install the 'datasets' library: `pip install datasets`.
Import the `load_dataset` function: `from datasets import load_dataset`.
Load the full dataset: `ds = load_dataset("bigcode/the-stack", split="train")`.
Load a specific language subset: `ds = load_dataset("bigcode/the-stack", data_dir="data/dockerfile", split="train")`.
Enable dataset streaming to avoid downloading the entire dataset at once: `ds = load_dataset("bigcode/the-stack", streaming=True, split="train")`.
Iterate through the dataset samples: `for sample in ds: print(sample["content"])`.
Access file content via the 'content' feature: `sample["content"]`
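The steps above can be combined into one short script. The following is a sketch under stated assumptions: `open_stack_stream` and `preview` are hypothetical helper names, not part of the `datasets` library, and the real call needs network access and may require accepting the dataset's terms of use and authenticating with the Hugging Face Hub first. The stand-in records at the bottom exist only so the helper can be tried offline.

```python
from itertools import islice


def open_stack_stream(language_dir=None):
    """Open The Stack in streaming mode (hypothetical wrapper).

    Requires network access and the `datasets` library.
    """
    # Imported lazily so the helper below works without the library installed.
    from datasets import load_dataset

    kwargs = {"split": "train", "streaming": True}
    if language_dir is not None:
        kwargs["data_dir"] = language_dir  # e.g. "data/dockerfile"
    return load_dataset("bigcode/the-stack", **kwargs)


def preview(samples, limit=3, max_chars=80):
    """Return the first `limit` file contents, each cut to `max_chars` characters."""
    return [sample["content"][:max_chars] for sample in islice(samples, limit)]


# Stand-in records with the same "content" field, used here instead of a
# real download; replace with open_stack_stream("data/dockerfile") to read
# actual samples.
fake_samples = [
    {"content": "FROM python:3.11\n"},
    {"content": "RUN pip install datasets\n"},
]
print(preview(fake_samples, limit=1))
```

Because `preview` stops after `limit` records, it pairs naturally with streaming mode: only the first few samples are fetched rather than the whole dataset.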
Verified feedback from other users.
"The Stack is highly regarded for its comprehensive coverage and permissive licensing, making it a valuable resource for code LLM training, but requires careful handling of data removal requests."
