Topic: Multimodal Machine Learning Applications

This cluster of papers focuses on the development and improvement of visual question answering systems, image captioning techniques, and neural networks for understanding and generating descriptions of images and videos. The research involves semantic reasoning, multimodal fusion, scene graph generation, attention mechanisms, and deep learning approaches to bridge the gap between vision and language.
Latest publications
VERA: Explainable Video Anomaly Detection via Verbalized Learning of Vision-Language Models

article

Fine-Grained Visual-Language Alignment for Remote Sensing Image-Text Retrieval

article

MVTamperBench: Evaluating Robustness of Vision-Language Models

article

MedPix 2.0: A Comprehensive Multimodal Biomedical Data Set for Advanced AI Applications with Retrieval Augmented Generation and Knowledge Graphs

article

GIT-CXR: End-to-End Transformer for Chest X-Ray Report Generation

article

Do Language Models Understand Time?

article

A Comprehensive Overview of Large Language Models

article

LMOD: A Large Multimodal Ophthalmology Dataset and Benchmark for Large Vision-Language Models

article

When language and vision meet road safety: Leveraging multimodal large language models for video-based traffic accident analysis

article

Advancing Object Detection in Transportation with Multimodal Large Language Models (MLLMs): A Comprehensive Review and Empirical Testing

review

Highly cited publications from the past 5 years
Learning Transferable Visual Models From Natural Language Supervision

preprint, 4258 citations, FWCI 0

Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

article, 2860 citations, FWCI 153.978

Exploring Simple Siamese Representation Learning

article, 2573 citations, FWCI 226.29

Transformers in Vision: A Survey

review, 2182 citations, FWCI 72.514

Hierarchical Text-Conditional Image Generation with CLIP Latents

preprint, 2015 citations, FWCI 0

PaLM: Scaling Language Modeling with Pathways

preprint, 1959 citations, FWCI 0

Survey of Hallucination in Natural Language Generation

review, 1780 citations, FWCI 48.869

Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

article, 1767 citations, FWCI 93.437

RepVGG: Making VGG-style ConvNets Great Again

article, 1710 citations, FWCI 85.318

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

preprint, 1533 citations, FWCI 0