- TU-Darmstadt
- Darmstadt
- http://akshitac8.github.io/
- @akshitac8
Stars
Strategic research thinking agents for Claude Code — idea evaluation, project triage, and structured brainstorming. Helps you decide which papers to write, not just how to write them.
[ICLR 2026] Official repo for "FrameThinker: Learning to Think with Long Videos via Multi-Turn Frame Spotlighting"
[CVPR'25] Official implementation of the paper "Contextual AD Narration with Interleaved Multimodal Sequence"
[ECCV 2024] Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs
The official code of "Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning"
PyTorch implementation of Audio Flamingo: Series of Advanced Audio Understanding Language Models
[ICCV25 Oral] Token Activation Map to Visually Explain Multimodal LLMs
A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, llama-3.2-vision, qwen-vl, qwen2-vl, phi3-v, etc.
🔥🔥 First-ever hour-scale video understanding models
This repository contains the official implementation of "FastVLM: Efficient Vision Encoding for Vision Language Models" - CVPR 2025
Official PyTorch implementation of One-Minute Video Generation with Test-Time Training
Qwen3-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
[ICML 2025] Official PyTorch implementation of LongVU
Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis
Story-Based Retrieval with Contextual Embeddings. Largest freely available movie video dataset. [ACCV'20]
Code for "AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling"
SpeechGPT Series: Speech Large Language Models
LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.
Official PyTorch implementation of the paper "Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs"
[ACCV 2024] Official Implementation of "AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description". Junyu Xie, Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman
🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
[ICCV 2025] Official Implementation of "Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation". Junyu Xie, Tengda Han, Max Bain, Arsha Nagrani, Eshika Khandelwal, Gül Varol, W…
【ICLR 2024, Spotlight】Sentence-level Prompts Benefit Composed Image Retrieval
OpenTAD is an open-source temporal action detection (TAD) toolbox based on PyTorch.
The suite of modeling video with Mamba
Official Implementation of LADS (Latent Augmentation using Domain descriptionS)
This repo contains the projects 'Virtual Normal', 'DiverseDepth', and '3D Scene Shape'. They address monocular depth estimation and 3D scene reconstruction from a single image.
Official PyTorch implementation of StyleGAN3