The Stack

The Stack is a dataset of over 6 TB of permissively licensed source code files covering 358 programming languages. It was created as part of the BigCode Project for training Large Language Models for Code (Code LLMs), that is, AI systems that synthesize programs from natural language descriptions and code snippets. Each data point carries provenance information so that users can adhere to the original licenses. The dataset is updated regularly to reflect data removal requests, and it aims to make code LLM training reproducible by offering an open, large-scale resource. Users must agree to terms of use, which include providing attribution and updating to the latest version of the dataset.

About The Stack

Core Capabilities

Main Tasks

Autocomplete code

Pre-training

Key Features

Data Provenance

Regular Updates

Language Diversity

Near-Deduplication

Metadata Richness

Streaming Support

Use Cases

Code Completion

Documentation Generation

Code Translation

Bug Detection

Code Summarization

Natural Language to Code

Quick Start Guide
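
A minimal sketch of a first look at the data, assuming the dataset is read through the Hugging Face `datasets` library under the id `bigcode/the-stack`, that you have accepted the terms of use and are logged in to the Hub, and that records expose the fields `content`, `max_stars_repo_name`, `max_stars_repo_path`, and `max_stars_repo_licenses`. The helper names `stream_the_stack` and `preview` are hypothetical and exist only for this illustration. Streaming avoids downloading the full dataset before reading the first records.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def stream_the_stack(language_dir: str = "data/python"):
    """Return a streaming iterator over one language subset of The Stack.

    Assumption: the `datasets` package is installed and the current user has
    accepted the dataset's terms of use on the Hugging Face Hub.
    """
    # Imported lazily so the rest of this sketch works without `datasets`.
    from datasets import load_dataset

    return load_dataset(
        "bigcode/the-stack",
        data_dir=language_dir,  # assumed layout: one directory per language
        split="train",
        streaming=True,  # iterate without downloading everything first
    )


def preview(records: Iterable[Dict[str, Any]], n: int = 3, width: int = 80) -> List[Dict[str, Any]]:
    """Summarize the first n records: provenance fields plus a short snippet."""
    summaries = []
    for rec in islice(records, n):
        summaries.append(
            {
                "repo": rec.get("max_stars_repo_name"),
                "path": rec.get("max_stars_repo_path"),
                "licenses": rec.get("max_stars_repo_licenses"),
                "snippet": (rec.get("content") or "")[:width],
            }
        )
    return summaries
```

With access configured, `preview(stream_the_stack())` would return summaries of the first three Python files. Because `preview` accepts any iterable of dictionaries, it can also be tried on hand-written sample records before requesting access.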



