
qa_kafka's Introduction

Step 1 : Get Data

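The collected data is in epubs; the final processed version is a .json file.

Challenge: how to collect, parse, and organize the data? Data quality determines retrieval quality.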
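
As a rough illustration of the epub-to-JSON step, the sketch below treats an EPUB as a ZIP archive of (X)HTML files and writes one text record per file. It is not the repository's actual parser: the function names `epub_to_records` and `save_records` and the `source`/`text` record fields are hypothetical, and files are visited in name order rather than in the book's reading order.

```python
import json
import zipfile
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Normalise internal whitespace so each fragment is a clean phrase.
        if not self._skip_depth and data.strip():
            self.parts.append(" ".join(data.split()))


def epub_to_records(epub_file):
    """Return one {"source", "text"} record per non-empty (X)HTML file (hypothetical schema)."""
    records = []
    with zipfile.ZipFile(epub_file) as archive:
        for name in sorted(archive.namelist()):
            if not name.lower().endswith((".xhtml", ".html", ".htm")):
                continue
            parser = _TextExtractor()
            parser.feed(archive.read(name).decode("utf-8", errors="replace"))
            parser.close()
            text = " ".join(parser.parts)
            if text:
                records.append({"source": name, "text": text})
    return records


def save_records(records, json_path):
    """Write the records as a UTF-8 JSON array."""
    with open(json_path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, ensure_ascii=False, indent=2)
```

A real pipeline would likely also need to drop front matter, merge or split sections by heading, and clean up extraction noise, which is where most of the data-quality work mentioned above tends to go.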

Step 2 : Create DB

get_chunks.py

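  1. Create the database: create_db_from_json
    1. Read the data: create_db_from_json
    2. Deduplicate the data with MinHash-LSH: LSH_dedup > LSHHash.py
    3. Embedding: BAAI/bge-large-en-v1.5
  2. Load the database
    1. FAISS.load_local
    2. Split the user's response into chunks -> user_response
    3. Retrieve the relevant text from the db -> docs
    4. Assemble the context and let an LLM judge the response: llama.cpp + "qwen1_5-4b-chat-q6_k.gguf". The prompt is: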
       messages = [
           {"role": "system",
            "content": "For this task, you're going to evaluate a user's response to a computer science interview question. "
                       "\nYour evaluation should be based solely on the provided context. "
                       "If the response is correct, provide 'right' as the value, if it's incorrect, provide 'wrong', "
                       "and if you're unable to make a decision, provide 'don't know'."
                       " Remember, your assessment should be based only on the given context and only from right / wrong / don't know."
                       f"\n\ncontext : {context}"
                       f"\n\nquestion : {question}"},
           {"role": "user", "content": answer},
       ]
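
To make the MinHash-LSH deduplication step more concrete, here is a minimal, dependency-free sketch of the general technique. It is not the code in LSHHash.py, which is not shown here; the function names, the 3-word shingles, and the `num_perm`, `bands`, and `threshold` values are all illustrative assumptions.

```python
import hashlib
import re


def shingles(text, k=3):
    """Set of k-word shingles, lowercased and with punctuation dropped."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items, num_perm=64):
    """One minimum per salted hash function; similar sets share many minima."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_dedup(texts, num_perm=64, bands=16, threshold=0.8):
    """Keep the first text of each near-duplicate group, preserving input order."""
    rows = num_perm // bands
    buckets = {}  # (band index, band values) -> indices into `kept`
    kept = []     # (text, signature) pairs that survived so far
    for text in texts:
        sig = minhash_signature(shingles(text), num_perm)
        keys = [(b, tuple(sig[b * rows:(b + 1) * rows])) for b in range(bands)]
        # Only texts sharing at least one band bucket are compared in full.
        candidates = {i for key in keys for i in buckets.get(key, ())}
        if any(estimated_jaccard(sig, kept[i][1]) >= threshold for i in candidates):
            continue
        kept.append((text, sig))
        for key in keys:
            buckets.setdefault(key, []).append(len(kept) - 1)
    return [text for text, _ in kept]
```

The banding is what keeps this cheaper than comparing every pair: two chunks are only compared if some band of their signatures matches exactly. More bands with fewer rows each catch looser duplicates at the cost of more candidate comparisons.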
    
    

Speed:

llama_print_timings:        load time =   33183.27 ms
llama_print_timings:      sample time =       1.06 ms /     2 runs   (    0.53 ms per token,  1890.36 tokens per second)
llama_print_timings: prompt eval time =   27894.66 ms /   342 tokens (   81.56 ms per token,    12.26 tokens per second)
llama_print_timings:        eval time =    1778.37 ms /     1 runs   ( 1778.37 ms per token,     0.56 tokens per second)
llama_print_timings:       total time =   29713.67 ms /   343 tokens
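
The per-token figures in this log can be re-derived from the raw millisecond and token counts. The short sketch below does that for the prompt-evaluation line and shows how much of the run it accounts for. The regular expression and the `parse_timings` helper are illustrative, not part of llama.cpp.

```python
import re

# Values copied from the log above; the parenthesised rates are omitted.
LOG = """\
llama_print_timings: prompt eval time =   27894.66 ms /   342 tokens
llama_print_timings:        eval time =    1778.37 ms /     1 runs
llama_print_timings:       total time =   29713.67 ms /   343 tokens
"""

PATTERN = re.compile(r"llama_print_timings:\s*(.+?) time\s*=\s*([\d.]+) ms\s*/\s*(\d+)")


def parse_timings(log):
    """Map each stage name to (milliseconds, count)."""
    return {name: (float(ms), int(count)) for name, ms, count in PATTERN.findall(log)}


timings = parse_timings(LOG)
prompt_ms, prompt_tokens = timings["prompt eval"]
total_ms, _ = timings["total"]

print(f"{prompt_ms / prompt_tokens:.2f} ms per prompt token")               # 81.56
print(f"{prompt_tokens / (prompt_ms / 1000):.2f} prompt tokens per second")  # 12.26
print(f"{prompt_ms / total_ms:.0%} of total time is prompt evaluation")      # 94%
```

Since the model only has to emit a one- or two-token verdict, almost all of the latency in this run comes from reading the 342-token prompt, so shortening the retrieved context is the most direct way to reduce it.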

Pipeline output:

question: 
	What is the difference between Event Time and Processing Time in the context of Kafka?

user response chunk: 
	Event Time: Refers to the time at which an event actually occurred or was generated.
	Event Time: Typically associated with the timestamp embedded within the data of each event.

context retrieved: 
	Topic: Stream Processing What Is Stream Processing?
	It is worth noting that neither the definition of event streams nor the attributes we later listed say anything about the data contained in the events or the number of events per second. The data differs from system to system — events can be tiny (sometimes only a few bytes) or very large (XML messages with many headers); they can also be completely unstructured, key-value pairs, semi-structured JSON, or structured Avro or Protobuf messages. While it is often assumed that data streams are “big data” and involve millions of events per second, the same techniques we’ll discuss apply equally well (and often better) to smaller streams of events with only a few events per second or minute.
	Topic: Stream Processing Stream-Processing Concepts Time
	Stream-processing systems typically refer to the following notions of time:
	Topic: Stream Processing What Is Stream Processing?
	Let’s start at the beginning: What is a data stream (also called an event stream or streaming data)? First and foremost, a data stream is an abstraction representing an unbounded dataset. Unbounded means infinite and ever growing. The dataset is unbounded because over time, new records keep arriving. This definition is used by Google, Amazon, and pretty much everyone else.

Answer: 
	right
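
The run above can be read as three small steps around the model call: split the user's response into chunks, join the retrieved documents into a context string, and map the model's raw text onto one of the three allowed labels. The sketch below illustrates those steps only. The helper names, the two-sentences-per-chunk rule, and the "Topic: ..." context layout are assumptions inferred from the sample output, not the code in get_chunks.py.

```python
import re


def split_response(response, sentences_per_chunk=2):
    """Split a response into chunks of a few sentences each (assumed chunking rule)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


def build_context(docs):
    """Join (topic, text) pairs into 'Topic: ...' blocks, one pair per block."""
    return "\n".join(f"Topic: {topic}\n{text}" for topic, text in docs)


def parse_verdict(raw):
    """Map raw model output onto right / wrong / don't know."""
    cleaned = raw.strip().lower().strip("'\"` .")
    for label in ("right", "wrong", "don't know"):
        if cleaned.startswith(label):
            return label
    # Anything unexpected is treated as undecided rather than guessed.
    return "don't know"


chunks = split_response(
    "Event Time refers to when an event occurred. "
    "It is usually a timestamp in the record. "
    "Processing Time is when the application handles the event."
)
context = build_context([("Stream Processing Stream-Processing Concepts Time",
                          "Stream-processing systems typically refer to the following notions of time:")])
print(len(chunks), parse_verdict(" 'Right'.\n"))
```

Falling back to "don't know" on unparseable output keeps a malformed generation from being counted as either a correct or an incorrect answer.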

Contributors

yiandli
