Skip to content

lzwjava/zz

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ZZ

Dataset processing and training utilities for machine learning projects.

Setup

pip install -r requirements.txt

Directory Structure

scripts/
  download/   # Dataset download scripts
  extract/    # Data extraction scripts
  analysis/   # Training analysis and evaluation
logs/         # Training logs and outputs

Pending Features

  • RAG (Retrieval-Augmented Generation) — local knowledge base Q&A using vector search over your own documents
  • Knowledge Base Management — ingest PDFs, text files, and other documents into a vector store (e.g. FAISS)
  • LLM Integration — connect local or API-based models for document-grounded answers
  • Agent Support — tool-calling agents that can query knowledge bases and external APIs
  • Web UI — chat interface for interacting with the knowledge base

Usage

Download Datasets

# Download FineWeb dataset
python scripts/download/download_fineweb.py --limit 1000 --output output.txt

# Download with wget scripts
bash scripts/download/wget_fineweb_1.sh

Extract Data

# Extract from parquet files
python scripts/extract/extract_parquet.py

# Extract FineWeb data
python scripts/extract/extract_fineweb.py

Analysis

# Calculate training duration
python scripts/analysis/calculate_duration.py

# Evaluate training metrics
python scripts/analysis/evaluate.py --file logs/train_log_openweb.txt

About

zz: Dataset processing and training utilities for machine learning projects.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors