CLIP Image-Text Retrieval System: Building a Cross-Modal Semantic Search Engine

Published: 2026/6/20 6:34:11
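1. Introduction

CLIP (Contrastive Language-Image Pre-training) is a cross-modal model introduced by OpenAI in 2021. It maps images and text into a shared semantic space, which makes it possible to search images with text and text with images.

Application scenarios:

- E-commerce product search: "red dress" → retrieves all red dress images
- Asset management: enter a description → find matching design assets
- Content moderation: image → retrieve similar violating content
- Medical image retrieval: symptom description → find images of similar cases

2. How CLIP Works

2.1 Dual-Tower Architecture

```text
Image → Image Encoder (ViT/ResNet) → image embedding ─┐
                                                      ├→ contrastive learning → cosine similarity
Text  → Text Encoder (Transformer) → text embedding  ─┘
```

2.2 Contrastive Learning Objective

```text
Given a batch of N (image, text) pairs:
- Positive pairs: matched (image_i, text_i)
- Negative pairs: unmatched (image_i, text_j), j ≠ i

Loss: symmetric cross-entropy
L = 0.5 * (L_image_to_text + L_text_to_image)

Goal: maximize the cosine similarity of positive pairs and minimize that of negative pairs
```

3. Environment Setup

```bash
pip install torch torchvision
pip install transformers
pip install faiss-gpu  # GPU build of FAISS
pip install pillow numpy
```

4. Image Feature Extraction

```python
import os

import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor


class CLIPFeatureExtractor:
    """CLIP feature extractor."""

    def __init__(self, model_name="openai/clip-vit-large-patch14"):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = CLIPModel.from_pretrained(model_name).to(self.device)
        self.processor = CLIPProcessor.from_pretrained(model_name)
        self.model.eval()

    @torch.no_grad()
    def encode_image(self, image_path: str) -> np.ndarray:
        """Extract features for a single image."""
        image = Image.open(image_path).convert("RGB")
        inputs = self.processor(images=image, return_tensors="pt")
        inputs = {k: v.to(self.device) for k, v in inputs.items()}
        features = self.model.get_image_features(**inputs)
        features = features / features.norm(dim=-1, keepdim=True)  # L2 normalization
        return features.cpu().numpy().flatten()

    @torch.no_grad()
    def encode_text(self, text: str) -> np.ndarray:
        """Extract features for a text query."""
        # truncation=True keeps long queries within CLIP's 77-token limit
        inputs = self.processor(
            text=[text], return_tensors="pt", padding=True, truncation=True
        )
        inputs = {k: v.to(self.device) for k, v in inputs.items()}
        features = self.model.get_text_features(**inputs)
        features = features / features.norm(dim=-1, keepdim=True)
        return features.cpu().numpy().flatten()

    @torch.no_grad()  # required: .numpy() fails on tensors that track gradients
    def encode_batch_images(self, image_paths: list, batch_size=32) -> np.ndarray:
        """Extract features for many images, batch by batch."""
        all_features = []
        for i in range(0, len(image_paths), batch_size):
            batch_paths = image_paths[i:i + batch_size]
            images = [Image.open(p).convert("RGB") for p in batch_paths]
            inputs = self.processor(images=images, return_tensors="pt", padding=True)
            inputs = {k: v.to(self.device) for k, v in inputs.items()}
            features = self.model.get_image_features(**inputs)
            features = features / features.norm(dim=-1, keepdim=True)
            all_features.append(features.cpu().numpy())
        return np.vstack(all_features)
```

5. Vector Indexing and Retrieval

5.1 Building the Index with FAISS

```python
import faiss


class CLIPSearchEngine:
    """Image-text search engine built on CLIP + FAISS."""

    def __init__(self, model_name="openai/clip-vit-large-patch14"):
        self.extractor = CLIPFeatureExtractor(model_name)
        self.index = None
        self.image_paths = []
        self.dimension = 768  # ViT-L/14 output dimension

    def build_index(self, image_dir: str):
        """Build the image index."""
        # Collect all image files
        extensions = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}
        self.image_paths = [
            os.path.join(image_dir, f)
            for f in os.listdir(image_dir)
            if os.path.splitext(f)[1].lower() in extensions
        ]
        print(f"Indexing {len(self.image_paths)} images...")

        # Extract features
        features = self.extractor.encode_batch_images(self.image_paths)
        features = features.astype("float32")

        # Build the FAISS index
        if len(self.image_paths) < 10000:
            # Small scale: exact search.
            # Inner product = cosine similarity (vectors are already normalized).
            self.index = faiss.IndexFlatIP(self.dimension)
        else:
            # Large scale: approximate search with IVF
            nlist = min(int(len(self.image_paths) ** 0.5), 1000)
            quantizer = faiss.IndexFlatIP(self.dimension)
            # IndexIVFFlat defaults to L2 distance, so request inner product explicitly
            self.index = faiss.IndexIVFFlat(
                quantizer, self.dimension, nlist, faiss.METRIC_INNER_PRODUCT
            )
            self.index.train(features)
            self.index.nprobe = 20  # number of clusters probed per query

        self.index.add(features)
        print(f"Index built: {self.index.ntotal} vectors")

    def save_index(self, path: str):
        """Save the index to disk."""
        faiss.write_index(self.index, f"{path}.index")
        np.save(f"{path}.paths.npy", np.array(self.image_paths))

    def load_index(self, path: str):
        """Load the index from disk."""
        self.index = faiss.read_index(f"{path}.index")
        self.image_paths = np.load(f"{path}.paths.npy").tolist()

    def _search(self, features: np.ndarray, top_k: int) -> list:
        """Run a FAISS query and map result indices back to image paths."""
        query = features.astype("float32").reshape(1, -1)
        scores, indices = self.index.search(query, top_k)
        results = []
        for score, idx in zip(scores[0], indices[0]):
            if idx < 0:  # FAISS pads with -1 when fewer than top_k hits exist
                continue
            results.append({"path": self.image_paths[idx], "score": float(score)})
        return results

    def search_by_text(self, query: str, top_k=10) -> list:
        """Search images with a text query."""
        return self._search(self.extractor.encode_text(query), top_k)

    def search_by_image(self, image_path: str, top_k=10) -> list:
        """Search images with an image query."""
        return self._search(self.extractor.encode_image(image_path), top_k)
```

5.2 Usage Example

```python
# Build the index
engine = CLIPSearchEngine()
engine.build_index("/data/product_images")
engine.save_index("product_index")

# Text-to-image search
results = engine.search_by_text("red dress, fashionable style", top_k=5)
for r in results:
    print(f"{r['score']:.3f} | {r['path']}")

# Image-to-image search
results = engine.search_by_image("query.jpg", top_k=5)
for r in results:
    print(f"{r['score']:.3f} | {r['path']}")
```

6. Web API Service

```python
import os
import shutil
import tempfile

from fastapi import FastAPI, File, UploadFile
from fastapi.responses import JSONResponse

# CLIPSearchEngine is the class from section 5.1; it must be defined in or
# imported into this module (app.py).
app = FastAPI()
engine = CLIPSearchEngine()
engine.load_index("product_index")


@app.get("/search/text")
async def search_text(query: str, top_k: int = 10):
    results = engine.search_by_text(query, top_k)
    return JSONResponse(content={"results": results})


@app.post("/search/image")
async def search_image(file: UploadFile = File(...), top_k: int = 10):
    # Write the upload to a temporary file so it can be opened by path
    with tempfile.NamedTemporaryFile(delete=False, suffix=".jpg") as tmp:
        shutil.copyfileobj(file.file, tmp)
        tmp_path = tmp.name
    try:
        results = engine.search_by_image(tmp_path, top_k)
    finally:
        os.unlink(tmp_path)  # remove the temporary file even if the search fails
    return JSONResponse(content={"results": results})
```

```bash
# File uploads in FastAPI also need: pip install fastapi uvicorn python-multipart

# Start the service
uvicorn app:app --host 0.0.0.0 --port 8000

# Test
curl "http://localhost:8000/search/text?query=blue%20sneakers&top_k=5"
```

7. Performance Optimization

| Optimization | When to use | Effect |
| --- | --- | --- |
| FAISS IVF | >10K images | About 10x faster search |
| FAISS PQ | >1M images | 8-16x less memory |
| GPU index | Real-time search | Millisecond-level responses |
| Feature cache | Repeated queries | Avoids re-encoding |
| Batch encoding | Index building | About 5x higher throughput |

8. Summary

The core advantage of CLIP image-text retrieval is semantic understanding: no exact keyword match is needed, and anything semantically related to the query can be retrieved. Combined with the FAISS vector search library, it can deliver millisecond-level retrieval over millions of images.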
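
To make the loss from section 2.2 concrete, here is a minimal pure-Python sketch of the symmetric cross-entropy on a toy batch. It is an illustration only, not CLIP's training code: the helper names `cross_entropy_rows` and `clip_symmetric_loss` are made up for this example, the embeddings are assumed to be L2-normalized already, and the `temperature` value of 0.07 is an assumption that the article does not specify.

```python
import math


def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_symmetric_loss(image_emb, text_emb, temperature=0.07):
    """L = 0.5 * (L_image_to_text + L_text_to_image) for normalized embeddings."""
    # logits[i][j] = cosine similarity of image i and text j, divided by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_emb]
        for img in image_emb
    ]
    # Transposing the matrix gives the text-to-image direction
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(logits_t))


# Toy batch of N = 2 pairs with 2-dimensional unit vectors (hypothetical data)
images = [[1.0, 0.0], [0.0, 1.0]]
texts_matched = [[1.0, 0.0], [0.0, 1.0]]  # text_i aligned with image_i
texts_swapped = [[0.0, 1.0], [1.0, 0.0]]  # text_i aligned with the wrong image

loss_matched = clip_symmetric_loss(images, texts_matched)
loss_swapped = clip_symmetric_loss(images, texts_swapped)
print(f"matched: {loss_matched:.6f}, swapped: {loss_swapped:.6f}")
```

The loss is close to zero when each image is most similar to its own text and grows large when the pairs are swapped, which is the behavior the objective is meant to reward.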
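
Section 5.1 relies on the fact that, for L2-normalized vectors, the inner product equals the cosine similarity, which is why an inner-product FAISS index can be used for cosine search. The short sketch below checks this on two hand-picked vectors; the helper names are hypothetical and the code uses only the standard library.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine similarity computed from the raw, unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]

cos_ab = cosine_similarity(a, b)
ip_ab = dot(l2_normalize(a), l2_normalize(b))
print(cos_ab, ip_ab)
```

Both values agree up to floating-point rounding, so normalizing once at encoding time lets the index skip the norm computation at query time.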
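
The "feature cache" row in section 7 can be sketched as a small LRU cache wrapped around any text-encoding function. This is one possible approach under stated assumptions, not part of the article's code: `CachedTextEncoder` is a hypothetical helper, and `fake_encode` stands in for a real encoder such as the `encode_text` method from section 4 so that the example runs without a model.

```python
from collections import OrderedDict


class CachedTextEncoder:
    """LRU cache around an encode_text-style callable (hypothetical helper)."""

    def __init__(self, encode_fn, max_size=1024):
        self.encode_fn = encode_fn
        self.max_size = max_size
        self._cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def encode(self, text):
        if text in self._cache:
            self._cache.move_to_end(text)  # mark as most recently used
            self.hits += 1
            return self._cache[text]
        self.misses += 1
        features = self.encode_fn(text)
        self._cache[text] = features
        if len(self._cache) > self.max_size:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return features


calls = []


def fake_encode(text):
    """Stand-in encoder that records each call; a real one would run CLIP."""
    calls.append(text)
    return [float(len(text))]


cached = CachedTextEncoder(fake_encode, max_size=2)
cached.encode("red dress")
cached.encode("red dress")      # served from the cache
cached.encode("blue sneakers")
cached.encode("green hat")      # cache is full, so "red dress" is evicted
print(cached.hits, cached.misses, len(calls))
```

With the classes from this article, the wrapper could be constructed as `CachedTextEncoder(engine.extractor.encode_text)`. Cached arrays are shared between callers, so they should be treated as read-only.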