Topic: Multimodal Machine Learning Applications

This cluster of papers focuses on the development and improvement of visual question answering systems, image captioning techniques, and neural networks for understanding and generating descriptions of images and videos. The research involves semantic reasoning, multimodal fusion, scene graph generation, attention mechanisms, and deep learning approaches to bridge the gap between vision and language.
Latest Papers
The 'Ribet' Breakthrough: How a Complex User Session Pushed Our LLM to New Frontiers of Abstraction

article Full Text OpenAlex

PathRAG: Pruning Graph-based Retrieval Augmented Generation with Relational Paths

article Full Text OpenAlex

STRIDE-QA: Visual Question Answering Dataset for Spatiotemporal Reasoning in Urban Driving Scenes

article Full Text OpenAlex

VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation

article Full Text OpenAlex

S^2-KD: Semantic-Spectral Knowledge Distillation Spatiotemporal Forecasting

article Full Text OpenAlex

LogisticsVLN: Vision-Language Navigation for Low-Altitude Terminal Delivery Based on Agentic UAVs

article Full Text OpenAlex

Structured Language Interpretation Using Small Language Models for Real-Time Systems

article Full Text OpenAlex

ICL CHARACTERIZATION OF MULTI-MODAL GEO-FOUNDATION MODELS: WHEN CAN VISION-LANGUAGE TRANSFORMERS LEARN GEOSPATIAL TASKS?

article Full Text OpenAlex

Report on the Workshop on Explainability in Information Retrieval (WExIR) at SIGIR 2025

article Full Text OpenAlex

Merlin: a computed tomography vision–language foundation model and dataset

article Full Text OpenAlex

Highly Cited Papers from the Past 5 Years
MizAR 60 for Mizar 50

preprint Full Text OpenAlex 75523 FWCI 9086.3469

Attention Is All You Need

preprint Full Text OpenAlex 6534 FWCI 139.8739

Survey of Hallucination in Natural Language Generation

review Full Text OpenAlex 3195 FWCI 369.5325

Contextual Personal Intelligence: A New Paradigm for AI That Evolves With You

preprint Full Text OpenAlex 2957 FWCI 0

Learning to Prompt for Vision-Language Models

article Full Text OpenAlex 2575 FWCI 231.1565

Hierarchical Text-Conditional Image Generation with CLIP Latents

preprint Full Text OpenAlex 2273 FWCI 0

Swin Transformer V2: Scaling Up Capacity and Resolution

article Full Text OpenAlex 2145 FWCI 118.1071

PaLM: Scaling Language Modeling with Pathways

preprint Full Text OpenAlex 2127 FWCI 0

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

preprint Full Text OpenAlex 2105 FWCI 0

A Metaverse: Taxonomy, Components, Applications, and Open Challenges

article Full Text OpenAlex 1728 FWCI 168.16