Topic: Multimodal Machine Learning Applications

This cluster of papers focuses on the development and improvement of visual question answering systems, image captioning techniques, and neural networks for understanding and generating descriptions of images and videos. The research involves semantic reasoning, multimodal fusion, scene graph generation, attention mechanisms, and deep learning approaches to bridge the gap between vision and language.
Latest Publications
Paper 1 — BIFACE-Based Sentence Coordinate Documents: Human-Readable Surfaces and AI+AGI-Referable Coordinates Across Documents, Code, Media, and Conversations

article Full Text OpenAlex

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

article Full Text OpenAlex

KG-CMI: Knowledge Graph Enhanced Cross-Mamba Interaction for Medical Visual Question Answering

article Full Text OpenAlex

Distributed Adaptive Speculative Decoding: Accelerating Large Language Model Inference With Context-Aware Draft Selection

article Full Text OpenAlex

Capabilities of GPT‑5 on multimodal medical reasoning

article Full Text OpenAlex

The 'Ribet' Breakthrough: How a Complex User Session Pushed Our LLM to New Frontiers of Abstraction

article Full Text OpenAlex

PathRAG: Pruning Graph-based Retrieval Augmented Generation with Relational Paths

article Full Text OpenAlex

OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model

article Full Text OpenAlex

STRIDE-QA: Visual Question Answering Dataset for Spatiotemporal Reasoning in Urban Driving Scenes

article Full Text OpenAlex

Highly Cited Publications (Past 5 Years)
MizAR 60 for Mizar 50

preprint Full Text OpenAlex 76304 citations FWCI 8841.5671

Attention Is All You Need

preprint Full Text OpenAlex 6569 citations FWCI 170.6129

Survey of Hallucination in Natural Language Generation

review Full Text OpenAlex 3521 citations FWCI 370.0931

Affordance-Compiled Intelligence: Observable-Only Cognitive Impedance Matching for No-Meta LLM-Integrated Systems

preprint Full Text OpenAlex 2995 citations FWCI 1622.453

Learning to Prompt for Vision-Language Models

article Full Text OpenAlex 2690 citations FWCI 231.7849

Hierarchical Text-Conditional Image Generation with CLIP Latents

preprint Full Text OpenAlex 2287 citations FWCI 0

Swin Transformer V2: Scaling Up Capacity and Resolution

article Full Text OpenAlex 2199 citations FWCI 118.1024

PaLM: Scaling Language Modeling with Pathways

preprint Full Text OpenAlex 2131 citations FWCI 0

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

preprint Full Text OpenAlex 2108 citations FWCI 0

A Metaverse: Taxonomy, Components, Applications, and Open Challenges

article Full Text OpenAlex 1759 citations FWCI 168.3309