I believe most students have heard of the famous BERT algorithm. It is Google's blockbuster pre-trained model for the NLP field: it has refreshed the records on a number of NLP tasks and achieved state-of-the-art results.
But many deep-learning beginners find the BERT model hard to set up and difficult to get started with; an ordinary person might need several days of study just to stand up a working model.
No problem. Today we will introduce a module that lets you build a BERT-based question-and-answer search engine in 3 minutes: the bert-as-service project. This open-source project lets you quickly stand up a BERT service on a multi-GPU machine (fine-tuned models are supported), and it can be used by multiple clients concurrently.
1. Preparation
Before starting, make sure Python and pip are installed on your computer. If not, install them first.
(Option 1) If your goal with Python is data analysis, you can simply install Anaconda, which bundles both.
(Option 2) In addition, the VSCode editor is recommended; it has many advantages.
Please choose one of the following ways to open a command line and install the dependencies:
1. On Windows, open Cmd (Start → Run → cmd).
2. On macOS, open Terminal (Command + Space, then type Terminal).
3. If you use the VSCode editor or PyCharm, you can use the built-in Terminal directly.
pip install bert-serving-server  # server
pip install bert-serving-client  # client
Please note the server's version requirements: Python >= 3.5 and TensorFlow >= 1.10.
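The two packages can also be installed on different machines: the client package is independent of the server and does not require TensorFlow, so it can run on a plain CPU machine. This is also what makes the remote usage shown later possible.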
In addition, download a pre-trained BERT model; the models can be downloaded from https://github.com/hanxiao/bert-as-service#install .
You can also download these pre-trained models by replying bert-as-service in the backend of the Python Practical Dictionary official account.
When the download is complete, extract the zip file into a folder, for example /tmp/english_L-12_H-768_A-12/ .
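For example, on the command line (the URL below is the uncased English Base model published by the google-research/bert repository; swap in whichever model you chose):

wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
unzip uncased_L-12_H-768_A-12.zip -d /tmp/
# the model files end up in /tmp/uncased_L-12_H-768_A-12/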
2. Basic usage of bert-as-service
After installation, enter the following command to start the BERT service:
bert-serving-start -model_dir /tmp/english_L-12_H-768_A-12/ -num_worker=4
-num_worker=4 means the service starts with four workers, so it can handle up to four requests concurrently. Concurrent requests beyond the 4th are queued in the load balancer until a worker is free.
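As a minimal sketch of what concurrent use looks like (assuming the server above is running locally on its default ports), several threads can each open their own client:

from threading import Thread
from bert_serving.client import BertClient

def encode_job(name):
    bc = BertClient()                 # each thread opens its own client
    vec = bc.encode(['hello world'])  # blocks until a worker is free
    print('%s got vectors of shape %s' % (name, vec.shape))
    bc.close()

threads = [Thread(target=encode_job, args=('client-%d' % i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()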
The server prints startup logs as it loads the model; once all workers are listed as ready, the service is available.
Now you can simply encode sentences, as shown below:
from bert_serving.client import BertClient

bc = BertClient()
bc.encode(['First do it', 'then do it right', 'then do it better'])
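encode() returns a numpy array with one fixed-length vector per input sentence; with the 12-layer Base model used here, each vector has 768 dimensions.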
As a feature of BERT, you can join two sentences with ||| (with a space before and after it) to get the encoding of a sentence pair, for example:
bc.encode(['First do it ||| then do it right'])
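This returns a single vector for the sentence pair, which is handy for sentence-pair tasks such as similarity or entailment.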
Using the BERT service remotely
You can also start the service on one (GPU) machine and call it from another (CPU) machine, as shown below:
# on another CPU machine
from bert_serving.client import BertClient

bc = BertClient(ip='xx.xx.xx.xx')  # ip address of the GPU machine
bc.encode(['First do it', 'then do it right', 'then do it better'])
3. Build a Q&A search engine
We will use bert-as-service to find the question in an FAQ list that is most similar to the question the user enters, and return the corresponding answer.
The FAQ list can also be downloaded by replying bert-as-service in the backend of the Python Practical Dictionary official account.
First, load all the questions and display basic statistics:
import numpy as np

prefix_q = '##### **Q:** '
with open('README.md') as fp:
    questions = [v.replace(prefix_q, '').strip() for v in fp
                 if v.strip() and v.startswith(prefix_q)]
    print('%d questions loaded, avg. len of %d' % (len(questions), np.mean([len(d.split()) for d in questions])))
# 33 questions loaded, avg. len of 9
A total of 33 questions are loaded, with an average length of 9 words.
Then start a BERT service with the pre-trained model uncased_L-12_H-768_A-12:
bert-serving-start -num_worker=1 -model_dir=/data/cips/data/lab/data/model/uncased_L-12_H-768_A-12

(The client below connects on ports 4000/4001; if your server is not running on those ports, start it with matching -port and -port_out options, or drop those arguments from the client to use the defaults.)
Next, encode our questions as vectors:
bc = BertClient(port=4000, port_out=4001)
doc_vecs = bc.encode(questions)
Finally, we are ready to receive user queries and perform a simple "fuzzy" search. Whenever a new query arrives, we encode it as a vector and compute its dot product with doc_vecs, normalized by the norm of each document vector (the query's own norm is the same for every document, so leaving it out does not change the ranking). We then sort the results in descending order and return the top N most similar questions:
topk = 5

while True:
    query = input('your question: ')
    query_vec = bc.encode([query])[0]
    # compute normalized dot product as score
    score = np.sum(query_vec * doc_vecs, axis=1) / np.linalg.norm(doc_vecs, axis=1)
    topk_idx = np.argsort(score)[::-1][:topk]
    for idx in topk_idx:
        print('> %s\t%s' % (score[idx], questions[idx]))
Done! Now run the code, enter a query, and see how this search engine handles fuzzy matching:
The complete code is as follows, 23 lines in total (it can also be downloaded by replying the keyword in the backend):
import numpy as np
from bert_serving.client import BertClient
from termcolor import colored

prefix_q = '##### **Q:** '
topk = 5

with open('README.md') as fp:
    questions = [v.replace(prefix_q, '').strip() for v in fp
                 if v.strip() and v.startswith(prefix_q)]
    print('%d questions loaded, avg. len of %d' % (len(questions), np.mean([len(d.split()) for d in questions])))

with BertClient(port=4000, port_out=4001) as bc:
    doc_vecs = bc.encode(questions)

    while True:
        query = input(colored('your question: ', 'green'))
        query_vec = bc.encode([query])[0]
        # compute normalized dot product as score
        score = np.sum(query_vec * doc_vecs, axis=1) / np.linalg.norm(doc_vecs, axis=1)
        topk_idx = np.argsort(score)[::-1][:topk]
        print('top %d questions similar to "%s"' % (topk, colored(query, 'green')))
        for idx in topk_idx:
            print('> %s\t%s' % (colored('%.1f' % score[idx], 'cyan'), colored(questions[idx], 'yellow')))
Simple enough? Of course, this is just a simple example of building a QA search model on top of a pre-trained BERT model.
You can also fine-tune the model so that it performs better overall: put your data in a directory and then run run_classifier.py (from the google-research/bert repository) to fine-tune the model.
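As a rough sketch only (the task name, paths, and checkpoint name below are placeholders; the flags are those of run_classifier.py in google-research/bert, and bert-as-service documents serving a fine-tuned checkpoint via -tuned_model_dir):

# fine-tune on your own classification data (placeholder paths/task)
python run_classifier.py \
  --task_name=MRPC \
  --do_train=true \
  --data_dir=/path/to/your/data \
  --vocab_file=/tmp/english_L-12_H-768_A-12/vocab.txt \
  --bert_config_file=/tmp/english_L-12_H-768_A-12/bert_config.json \
  --init_checkpoint=/tmp/english_L-12_H-768_A-12/bert_model.ckpt \
  --max_seq_length=128 \
  --output_dir=/path/to/fine_tuned/

# then serve the fine-tuned checkpoint (the checkpoint name is an example)
bert-serving-start -model_dir=/tmp/english_L-12_H-768_A-12/ \
  -tuned_model_dir=/path/to/fine_tuned/ \
  -ckpt_name=model.ckpt-343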