AI Blog #2: Rapid Listing - Building Multi-modal AI Application [English Blog]
![AI Blog #2: Rapid Listing - Building Multi-modal AI Application [English Blog]](https://static.chotot.com/storage/chotot-blog/2025/07/banner.png)
A technical journey from traditional programming to LLM integration: team collaboration, challenges, innovation, and our user-first approach.
Transforming Traditional Marketplace Listings
What began as “Garage Sale” (Chợ Phiên) evolved into “Rapid Listing” (Đăng nhanh bằng AI)—a name that better reflected our ambitious vision of integrating AI technology into the core functionality of our Chợ Tốt mobile app. This project represented more than just a new feature; it was a multi-modal AI product, marking a fundamental shift from traditional deterministic programming to the probabilistic world of Large Language Models (LLMs).
At Chợ Tốt, our user-first philosophy drives every innovation. We recognized that traditional listing creation was time-consuming and often intimidating for sellers. This user pain point became the catalyst for Rapid Listing—transforming a complex, multi-step process into something as simple as pointing a camera at an item.
AI Quét Là Bán ("AI: Scan It, Sell It")
The Technical Foundation: From Idea to Implementation for iOS App
Brainstorming and Architecture Design
The concept was revolutionary for our marketplace—users could simply point their camera at an item, and our AI would automatically generate a high-quality listing with extracted product information, category mapping, and pricing suggestions.
The initial architecture involved several key components:
- Real-time camera feed processing using Apple’s AVFoundation and Core Media frameworks
- Multi-modal data extraction (text, audio, video, image)
- Hybrid intelligence model combining on-device and cloud AI
- Probabilistic UI design to handle non-deterministic AI responses
Research and Dataset Preparation
Our Data Engineering team executed a comprehensive data collection and annotation pipeline:
- Multi-category taxonomy spanning diverse product verticals
- Large-scale dataset with thousands of labeled video samples
- Automated quality validation pipelines ensuring consistent annotation standards
This dataset became the foundation for training our AI models to accurately extract attributes and map products to Chợ Tốt’s specific category taxonomy.
Technical Challenges and Solutions
Transitioning from Deterministic to Probabilistic Programming
The Problem: Traditional mobile app development follows a predictable, deterministic pattern: fetch data from an API, decode it into a strict data model, and display it in a predefined UI. LLM-driven development shatters this paradigm.
Note: This blog highlights case studies and architecture decisions for iOS, but the same principles and workflow apply to Android.
The Solution: We developed a flexible data ingestion layer that could handle:
- Variable AI outputs with optional fields
- Unexpected response formats
- Partial or incomplete data
- Resilient parsing mechanisms
Instead of “fail-fast” decoding, we embraced graceful degradation and suggestion-based interfaces.
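As a minimal sketch of what such resilient parsing could look like in Swift, the decoder below treats every field as optional and never fails on a single bad value. The `ListingDraft` type, its field names, and the `parseDraft` helper are hypothetical and only illustrate the idea; they are not the production model.

```swift
import Foundation

/// Hypothetical draft model: every field is optional because the AI may omit it.
struct ListingDraft: Decodable {
    var title: String?
    var category: String?
    var price: Double?

    private enum CodingKeys: String, CodingKey {
        case title, category, price
    }

    init(from decoder: Decoder) throws {
        let container = try decoder.container(keyedBy: CodingKeys.self)
        // `try?` turns a missing key, a null, or a wrong type into nil
        // instead of aborting the whole decode.
        title = try? container.decode(String.self, forKey: .title)
        category = try? container.decode(String.self, forKey: .category)
        // Assumption: the model may return the price as a number or as a string.
        if let number = try? container.decode(Double.self, forKey: .price) {
            price = number
        } else if let text = try? container.decode(String.self, forKey: .price) {
            price = Double(text)
        } else {
            price = nil
        }
    }
}

/// Returns nil only when the payload is not a JSON object at all.
func parseDraft(from json: String) -> ListingDraft? {
    try? JSONDecoder().decode(ListingDraft.self, from: Data(json.utf8))
}
```

With this shape, a response such as `{"title": "iPhone 13", "price": "12500000", "extra": true}` still yields a usable draft: the unknown key is ignored, the string price is converted, and the missing category simply stays empty for the user to fill in.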
Real-time Multi-modal Data Processing
The Problem: Unlike our previous straightforward use of AVFoundation for video playback, Rapid Listing demanded a sophisticated real-time data extraction pipeline.
At Chợ Tốt, we have a dedicated Video Hub section where users can upload and share videos related to their listings, enhancing the overall user experience and engagement on the platform. We leveraged this existing engineering knowledge and expertise to build the Rapid Listing feature, applying advanced real-time video processing to connect with our AI services while ensuring a seamless user experience.
For Rapid Listing, we needed more than that:
- Process CMSampleBuffer objects (a type from Apple’s Core Media framework) to handle streams directly from the camera
- Extract high-quality frames without blocking the main thread
- Synchronize video, audio, and image data streams
- Intelligently select relevant frames to minimize costs and latency
Sample buffers are Core Foundation objects that the system uses to move media sample data through the media pipeline. – CMSampleBuffer Document
The Solution: We architected a multi-tiered processing system:

Hybrid Intelligence Architecture
Relying solely on server-side AI would introduce unacceptable latency and a poor user experience. We implemented a hybrid model to mitigate this:
Tier 1: On-Device Intelligence (TensorFlow Lite)
- Real-time object detection and user guidance
- Product category mapping
- Local feedback enhancement
- ~26ms inference time on 224x224 image crops
Tier 2: Remote LLM for Deep Inference
- Complex attribute extraction
- Advanced and fine-grained category mapping
- Price estimation and validation
- Quality assurance against platform guidelines
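One way to picture how the two tiers could fit together is a merge step: the on-device guess is shown first, and the remote result refines it when it arrives. The sketch below is illustrative only. The type names, the `mergeTiers` function, and the 0.6 confidence threshold are assumptions, not values from our production system.

```swift
/// Hypothetical Tier 1 output: a coarse category guess from the on-device model.
struct OnDeviceGuess {
    let category: String
    let confidence: Double  // 0.0 ... 1.0
}

/// Hypothetical Tier 2 output: the remote LLM may omit any field.
struct RemoteAnalysis {
    var category: String?
    var attributes: [String: String] = [:]
    var suggestedPrice: Double?
}

/// What the UI would display at any moment.
struct ListingSuggestion: Equatable {
    var category: String?
    var attributes: [String: String]
    var suggestedPrice: Double?
}

/// Prefers remote fields when present; otherwise falls back to a
/// sufficiently confident on-device guess (threshold is an assumption).
func mergeTiers(onDevice: OnDeviceGuess?,
                remote: RemoteAnalysis?,
                minConfidence: Double = 0.6) -> ListingSuggestion {
    let localCategory = onDevice.flatMap {
        $0.confidence >= minConfidence ? $0.category : nil
    }
    return ListingSuggestion(
        category: remote?.category ?? localCategory,
        attributes: remote?.attributes ?? [:],
        suggestedPrice: remote?.suggestedPrice
    )
}
```

Calling `mergeTiers` with `remote: nil` gives the early, on-device-only view; calling it again once the remote analysis returns gives the refined one, so the UI never has to wait on the network to show something useful.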
System Architecture Diagrams
System Architecture Breakdown

This diagram shows the real-time multi-modal capture process.
Capture Phase: Users point their camera at items, triggering continuous video streaming and audio recording from the device sensors.
Real-time Processing: The on-device TensorFlow Lite model performs object detection (~26ms latency) in a continuous loop, while the real-time UI provides immediate category guidance and visual feedback. The UI also shows question prompts so that users can optionally describe aloud the product they want to sell.
Intelligent Selection: The system automatically selects the best frames from the video stream and segments relevant audio portions, preparing optimized multi-modal data for further cloud processing and listing generation.

This system enables users to create optimized listings using AI-powered analysis of mobile camera and audio inputs.
The architecture balances on-device processing for speed with cloud-based AI for sophisticated analysis, creating an efficient pipeline from raw mobile content to professional marketplace listings.
Key Technical Hurdles
Our existing Video Hub experience provided the foundation, but Rapid Listing required real-time processing capabilities.
We optimized our video processing pipeline to handle:
- Continuous frame extraction from camera streams
- Intelligent frame selection
- Memory-efficient processing to prevent device overheating
- Seamless integration with AI inference services
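To illustrate the frame selection idea in the list above, here is a small sketch of one possible heuristic: keep the highest-scoring frames while enforcing a minimum time gap, so near-duplicate frames are not all sent to the cloud. This is not our actual selection algorithm. `FrameCandidate`, the `sharpness` score, and `selectFrames` are hypothetical names, and the score stands in for whatever quality metric a real pipeline computes.

```swift
/// Hypothetical per-frame metadata; `sharpness` stands in for any quality score.
struct FrameCandidate: Equatable {
    let timestamp: Double   // seconds since capture started
    let sharpness: Double   // higher is better
}

/// Greedily keeps the best-scoring frames that are at least `minGap`
/// seconds apart, up to `maxCount`, and returns them in capture order.
func selectFrames(from candidates: [FrameCandidate],
                  maxCount: Int,
                  minGap: Double) -> [FrameCandidate] {
    var selected: [FrameCandidate] = []
    for frame in candidates.sorted(by: { $0.sharpness > $1.sharpness }) {
        if selected.count >= maxCount { break }
        // Skip frames too close in time to one we already kept.
        let tooClose = selected.contains {
            abs($0.timestamp - frame.timestamp) < minGap
        }
        if !tooClose { selected.append(frame) }
    }
    return selected.sorted { $0.timestamp < $1.timestamp }
}
```

Capping the number of frames bounds upload size and inference cost, while the time gap favors frames that show the item from different moments of the capture.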
UI/UX for Probabilistic Systems
The non-deterministic nature of LLM responses forced a fundamental rethink of our UI/UX approach:
- Traditional approach: Fixed data model → Rigid interface
- AI-first approach: Variable data → Adaptive, suggestion-based UI
We designed the interface to present AI-generated listings as pre-filled, editable drafts, managing user expectations while inviting review and refinement.
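A suggestion-based field can be modeled as a small state machine. The sketch below is one hedged way to express it in Swift; `FieldState` and its method names are hypothetical and shown only to make the "pre-filled, editable draft" idea concrete.

```swift
/// Hypothetical state for one form field in an AI-generated draft.
enum FieldState<Value: Equatable>: Equatable {
    case empty              // the AI returned nothing; the user fills it in
    case suggested(Value)   // pre-filled by the AI, awaiting review
    case confirmed(Value)   // accepted or edited by the user

    var value: Value? {
        switch self {
        case .empty:
            return nil
        case .suggested(let v), .confirmed(let v):
            return v
        }
    }

    /// True while the field shows an unreviewed AI suggestion.
    var needsReview: Bool {
        if case .suggested = self { return true }
        return false
    }

    /// A user edit always wins over the AI suggestion.
    mutating func userSet(_ newValue: Value) {
        self = .confirmed(newValue)
    }

    /// A late AI response must not overwrite what the user confirmed,
    /// and a missing suggestion leaves the field unchanged.
    mutating func applySuggestion(_ newValue: Value?) {
        if case .confirmed = self { return }
        guard let newValue = newValue else { return }
        self = .suggested(newValue)
    }
}
```

The UI can then style `suggested` fields differently (for example with a review hint) and treat `confirmed` fields as ordinary user input.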
Integration Challenges
Coordinating on-device ML models, real-time camera processing, and remote LLM services required meticulous optimization:
- Threading management to prevent UI blocking
- Memory optimization for ML model efficiency
- Network resilience for remote AI calls
- Error handling for probabilistic systems
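For the network resilience point, a common general-purpose technique is retrying with exponential backoff. The sketch below shows the idea in a synchronous form for brevity. The function names, the default delays, and the attempt limit are illustrative assumptions rather than our production settings, and a real app would more likely use an asynchronous variant.

```swift
import Foundation

/// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
func backoffDelay(attempt: Int,
                  base: TimeInterval = 0.5,
                  cap: TimeInterval = 8.0) -> TimeInterval {
    min(cap, base * pow(2.0, Double(attempt)))
}

/// Runs `operation` up to `maxAttempts` times, pausing between failures.
/// `pause` is injectable so the waiting strategy can be swapped out.
func withRetry<T>(maxAttempts: Int = 3,
                  pause: (TimeInterval) -> Void = { Thread.sleep(forTimeInterval: $0) },
                  operation: () throws -> T) throws -> T {
    precondition(maxAttempts > 0, "maxAttempts must be positive")
    var failures = 0
    while true {
        do {
            return try operation()
        } catch {
            failures += 1
            // Give up and surface the last error once the budget is spent.
            if failures >= maxAttempts { throw error }
            pause(backoffDelay(attempt: failures - 1))
        }
    }
}
```

With the defaults, a call that fails twice and then succeeds pauses for 0.5 and then 1.0 seconds before the final attempt.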
Project Management: Agile in an AI World
We adopted Kanban methodology to manage the inherent uncertainty of R&D and AI development.

This approach proved essential for:
- Cross-team coordination
- Iterative development with rapid prototyping and testing
- Stakeholder alignment across multiple work-groups
- Risk management in uncertain AI development timelines
Collaboration was crucial: this was a full cross-workgroup initiative involving multiple teams.
The Result
Rapid Listing with AI (Đăng nhanh bằng AI) helps sellers create listings more easily and conveniently with AI assistance. The AI feature is integrated directly into our current listing flow, automatically suggesting and optimizing content to save users' time and improve listing quality.
Learn more at our Official Help Guide: Đăng nhanh bằng AI
Looking Forward
The success of Rapid Listing reinforces our commitment to user-first innovation. Every AI integration we plan stems from real user needs we've identified across Chợ Tốt's platform. This project proved that when we combine cutting-edge technology with deep user empathy, we can create experiences that truly transform how people interact with our marketplace.
Acronyms and Technical Terms
- Rapid Listing: feature "Đăng nhanh bằng AI"
- AI: Artificial Intelligence - Computer systems that can perform tasks typically requiring human intelligence
- API: Application Programming Interface - A set of protocols for building software applications
- AVFoundation and Core Media: Apple’s multimedia frameworks for working with audiovisual media.
- LLM: Large Language Model - AI models trained on vast amounts of text data
- ML: Machine Learning - A subset of AI that enables computers to learn from data
- R&D: Research & Development - Work directed toward innovation and new product development
- TensorFlow Lite: Google’s lightweight machine learning framework for mobile devices. It's now called LiteRT (or Lite Runtime). Cross-platform on both Android and iOS.
- UI: User Interface - The visual elements through which users interact with an application
- UX: User Experience - The overall experience a user has when interacting with a product
Thank you!