
paul-graham-gpt's Introduction

Paul Graham GPT

AI-powered search and chat for Paul Graham's essays.

All code & data used is 100% open-source.

Dataset

The dataset is a CSV file containing all text & embeddings used.

Download it here.

I recommend getting familiar with fetching, cleaning, and storing data as outlined in the scraping and embedding scripts below, but feel free to skip those steps and just use the dataset.

How It Works

Paul Graham GPT provides 2 things:

  1. A search interface.
  2. A chat interface.

Search

Search was created with OpenAI Embeddings (text-embedding-ada-002).

First, we loop over the essays and generate embeddings for each chunk of text.

Then in the app we take the user's search query, generate an embedding, and use the result to find the most similar passages from the essays.

The comparison is done using cosine similarity across our database of vectors.

Our database is a Postgres database with the pgvector extension hosted on Supabase.

Results are ranked by similarity score and returned to the user.
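
Roughly, the search flow looks like the sketch below. This is an illustration, not the repo's exact code: the function name pg_search and its parameters follow the schema.sql function quoted in the issues further down, the OpenAI calls assume the v3 openai package, and the threshold value is made up for illustration.

// search.ts — illustrative sketch of the search flow (not the repo's exact code)
import { createClient } from "@supabase/supabase-js";
import { Configuration, OpenAIApi } from "openai";

const openai = new OpenAIApi(
  new Configuration({ apiKey: process.env.OPENAI_API_KEY })
);

const supabase = createClient(
  process.env.NEXT_PUBLIC_SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

export const searchEssays = async (query: string) => {
  // Embed the query with the same model used for the essay chunks.
  const embeddingRes = await openai.createEmbedding({
    model: "text-embedding-ada-002",
    input: query,
  });
  const [{ embedding }] = embeddingRes.data.data;

  // Ask Postgres (via the pg_search function from schema.sql) for the chunks
  // whose embeddings are closest to the query by cosine distance.
  const { data: chunks, error } = await supabase.rpc("pg_search", {
    query_embedding: embedding,
    similarity_threshold: 0.5, // assumed value, purely for illustration
    match_count: 5,
  });

  if (error) throw error;
  return chunks; // already ordered by similarity
};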

Chat

Chat builds on top of search. It uses search results to create a prompt that is fed into GPT-3.5-turbo.

This allows for a chat-like experience where the user can ask questions about the essays and get answers.
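
In sketch form, the chat step takes the top search results, stuffs them into a prompt, and asks the model to answer from that context. The code below is a rough illustration rather than the repo's implementation: it calls the chat completions endpoint directly with fetch, and the system prompt wording and Chunk shape are assumptions.

// answer.ts — illustrative sketch of the chat step
type Chunk = { essay_title: string; content: string };

export const answerQuestion = async (question: string, chunks: Chunk[]) => {
  // Concatenate the most relevant passages as context for the model.
  const context = chunks
    .map((c) => `From "${c.essay_title}":\n${c.content}`)
    .join("\n\n");

  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: "gpt-3.5-turbo",
      messages: [
        {
          role: "system",
          content:
            "Answer the question using only the provided passages from Paul Graham's essays.",
        },
        { role: "user", content: `${context}\n\nQuestion: ${question}` },
      ],
      temperature: 0,
    }),
  });

  const json = await res.json();
  return json.choices[0].message.content as string;
};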

Running Locally

Here's a quick overview of how to run it locally.

Requirements

  1. Set up OpenAI

You'll need an OpenAI API key to generate embeddings.

  2. Set up Supabase and create a database

Note: You don't have to use Supabase. Use whatever method you prefer to store your data. But I like Supabase and think it's easy to use.

There is a schema.sql file in the root of the repo that you can use to set up the database.

Run that in the SQL editor in Supabase as directed.

I recommend turning on Row Level Security and setting up a service role to use with the app.
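
With that setup, server-side code talks to the database through a client created from the service role key, which bypasses Row Level Security. A minimal sketch (the file name and export are illustrative, not necessarily how the repo organizes it):

// supabase-admin.ts — illustrative: a server-side Supabase client built from
// the service role key. This key bypasses Row Level Security, so it must never
// be shipped to the browser.
import { createClient } from "@supabase/supabase-js";

export const supabaseAdmin = createClient(
  process.env.NEXT_PUBLIC_SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);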

Repo Setup

  1. Clone repo
git clone https://github.com/mckaywrigley/paul-graham-gpt.git
  2. Install dependencies
npm i
  3. Set up environment variables

Create a .env.local file in the root of the repo with the following variables:

OPENAI_API_KEY=

NEXT_PUBLIC_SUPABASE_URL=
SUPABASE_SERVICE_ROLE_KEY=

Dataset

  1. Run scraping script
npm run scrape

This scrapes all of the essays from Paul Graham's website and saves them to a JSON file (see the sketches after these steps).

  2. Run embedding script
npm run embed

This reads the JSON file, generates embeddings for each chunk of text, and saves the results to your database (roughly as sketched after these steps).

There is a 200ms delay between each request to avoid rate limiting.

This process will take 20-30 minutes.
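
For orientation, here are rough sketches of what the two scripts do. They are illustrations only, not the repo's code: the real scripts split essays into smaller chunks and parse the pages differently, the selectors and file paths below are guesses, and the column names follow the pg_search function in schema.sql.

// scrape.ts — illustrative sketch of the scraping step; assumes axios and
// cheerio are installed, and the selectors are guesses about paulgraham.com's markup.
import axios from "axios";
import * as cheerio from "cheerio";
import fs from "fs";

const BASE_URL = "http://www.paulgraham.com";

const scrape = async () => {
  // The articles index page links to every essay.
  const { data: indexHtml } = await axios.get(`${BASE_URL}/articles.html`);
  const $ = cheerio.load(indexHtml);

  const links = $("a")
    .map((_, el) => $(el).attr("href"))
    .get()
    .filter((href): href is string => !!href && href.endsWith(".html"));

  const essays: { essay_url: string; essay_title: string; content: string }[] = [];
  for (const href of links) {
    const { data: html } = await axios.get(`${BASE_URL}/${href}`);
    const $$ = cheerio.load(html);
    essays.push({
      essay_url: `${BASE_URL}/${href}`,
      essay_title: $$("title").text(),
      content: $$("body").text().trim(),
    });
  }

  fs.writeFileSync("scripts/pg.json", JSON.stringify(essays, null, 2));
};

scrape();

// embed.ts — illustrative sketch of the embedding step; assumes the openai v3
// SDK and that env vars from .env.local are already loaded. The real script
// splits each essay into smaller chunks and stores extra columns
// (essay_date, content_length, content_tokens, ...).
import fs from "fs";
import { createClient } from "@supabase/supabase-js";
import { Configuration, OpenAIApi } from "openai";

const openai = new OpenAIApi(
  new Configuration({ apiKey: process.env.OPENAI_API_KEY })
);
const supabase = createClient(
  process.env.NEXT_PUBLIC_SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

const embed = async () => {
  const chunks = JSON.parse(fs.readFileSync("scripts/pg.json", "utf8"));

  for (const chunk of chunks) {
    // One embedding per chunk of essay text.
    const res = await openai.createEmbedding({
      model: "text-embedding-ada-002",
      input: chunk.content,
    });
    const [{ embedding }] = res.data.data;

    // Store the chunk and its vector in the "pg" table.
    const { error } = await supabase.from("pg").insert({ ...chunk, embedding });
    if (error) console.error(error);

    // 200ms pause between requests to stay under rate limits.
    await new Promise((resolve) => setTimeout(resolve, 200));
  }
};

embed();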

App

  1. Run app
npm run dev

Credits

Thanks to Paul Graham for his writing.

I highly recommend you read his essays.

3 years ago they convinced me to learn to code, and it changed my life.

Contact

If you have any questions, feel free to reach out to me on Twitter!

Notes

I sacrificed composability for simplicity in the app.

Yes, you can make things more modular and reusable.

But I kept pretty much everything in the homepage component for the sake of simplicity.

paul-graham-gpt's People

Contributors

khairulhaaziq, mckaywrigley


paul-graham-gpt's Issues

error 42P01

Hello!

Thanks for this! I am a newbie trying to recreate it on my server.

I am getting this error:

error {
  code: '42P01',
  details: null,
  hint: null,
  message: 'relation "public.pg" does not exist'
}

I am not sure where to start with this. Thanks!

Got a question for cosine similarity calculation

create or replace function pg_search (
  query_embedding vector(1536),
  similarity_threshold float,
  match_count int
)
returns table (
  id bigint,
  essay_title text,
  essay_url text,
  essay_date text,
  essay_thanks text,
  content text,
  content_length bigint,
  content_tokens bigint,
  similarity float
)
language plpgsql
as $$
begin
  return query
  select
    pg.id,
    pg.essay_title,
    pg.essay_url,
    pg.essay_date,
    pg.essay_thanks,
    pg.content,
    pg.content_length,
    pg.content_tokens,
    1 - (pg.embedding <=> query_embedding) as similarity
  from pg
  where 1 - (pg.embedding <=> query_embedding) > similarity_threshold
  order by pg.embedding <=> query_embedding
  limit match_count;
end;
$$;

For this search function: since the expression 1 - (pg.embedding <=> query_embedding) returns the cosine distance, where 0 indicates perfect similarity and 1 indicates perfect dissimilarity, a smaller number means a more similar match. Shouldn't we therefore use 1 - (pg.embedding <=> query_embedding) < similarity_threshold instead of 1 - (pg.embedding <=> query_embedding) > similarity_threshold to get the most related content?

npm run embed has an error: embeddingResponse.data is '', undefined is not iterable

I ran this command npm run embed and got the following error message.

This line of code results in an error:
const embeddingResponse = await openai.createEmbedding
It's probably caused by embeddingResponse.data being an empty string ('').

I have configured .env.local.

OPENAI_API_KEY= mykey

NEXT_PUBLIC_SUPABASE_URL=Project URL
SUPABASE_SERVICE_ROLE_KEY=Project API keys service_role (secret)

I have already executed the four SQL commands in Supabase. There is a table called "pg" inside. Do I need to set permissions for it?

$ npm run embed

> [email protected] embed
> tsx scripts/embed.ts

Loaded env from .env.local

error:
paul-graham-gpt/scripts/embed.ts:30
      const [{ embedding }] = embeddingResponse.data.data;
                              ^


TypeError: undefined is not iterable (cannot read property Symbol(Symbol.iterator))
    at generateEmbeddings (/paul-graham-gpt/scripts/embed.ts:30:31)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at <anonymous> (paul-graham-gpt/scripts/embed.ts:60:3)

Node.js v18.14.0



The printed embeddingResponse log is:


embeddingResponse
 {
  status: 200,
  statusText: 'Connection established',
  headers: {},
  config: {
    transitional: {
      silentJSONParsing: true,
      forcedJSONParsing: true,
      clarifyTimeoutError: false
    },
    adapter: [Function: httpAdapter],
    transformRequest: [ [Function: transformRequest] ],
    transformResponse: [ [Function: transformResponse] ],
    timeout: 0,
    xsrfCookieName: 'XSRF-TOKEN',
    xsrfHeaderName: 'X-XSRF-TOKEN',
    maxContentLength: -1,
    maxBodyLength: -1,
    validateStatus: [Function: validateStatus],
    headers: {
      Accept: 'application/json, text/plain, */*',
      'Content-Type': 'application/json',
      'User-Agent': 'OpenAI/NodeJS/3.1.0',
      Authorization: 'Bearer sk-mykey',
      'Content-Length': 822,
      host: 'api.openai.com'
    },
    method: 'post',
    data: `{"model":"text-embedding-ada-002","input":"(Someone fed my essays into GPT to make something that could answer questions based on them, then asked it where good ideas come from. The answer was ok, but not what I would have said. This is what I would have said.)The way to get new ideas is to notice anomalies: what seems strange, or missing, or broken? You can see anomalies in everyday life (much of standup comedy is based on this), but the best place to look for them is at the frontiers of knowledge. Knowledge grows fractally. From a distance its edges look smooth, but when you learn enough to get close to one, you'll notice it's full of gaps. These gaps will seem obvious; it will seem inexplicable that no one has tried x or wondered about y. In the best case, exploring such gaps yields whole new fractal buds."}`,
    url: 'https://api.openai.com/v1/embeddings'
  },
  request: <ref *1> ClientRequest {
    _events: [Object: null prototype] {
      abort: [Function (anonymous)],
      aborted: [Function (anonymous)],
      connect: [Function (anonymous)],
      error: [Function (anonymous)],
      socket: [Function (anonymous)],
      timeout: [Function (anonymous)],
      finish: [Function: requestOnFinish]
    },
    _eventsCount: 7,
    _maxListeners: undefined,
    outputData: [],
    outputSize: 0,
    writable: true,
    destroyed: false,
    _last: true,
    chunkedEncoding: false,
    shouldKeepAlive: false,
    maxRequestsOnConnectionReached: false,
    _defaultKeepAlive: true,
    useChunkedEncodingByDefault: true,
    sendDate: false,
    _removedConnection: false,
    _removedContLen: false,
    _removedTE: false,
    strictContentLength: false,
    _contentLength: 822,
    _hasBody: true,
    _trailer: '',
    finished: true,
    _headerSent: true,
    _closed: false,
    socket: Socket {
      connecting: false,
      _hadError: false,
      _parent: null,
      _host: null,
      _closeAfterHandlingError: false,
      _readableState: [ReadableState],
      _events: [Object: null prototype],
      _eventsCount: 8,
      _maxListeners: undefined,
      _writableState: [WritableState],
      allowHalfOpen: false,
      _sockname: null,
      _pendingData: null,
      _pendingEncoding: '',
      server: null,
      _server: null,
      parser: null,
      _httpMessage: [Circular *1],
      write: [Function: writeAfterFIN],
      [Symbol(async_id_symbol)]: 57,
      [Symbol(kHandle)]: null,
      [Symbol(lastWriteQueueSize)]: 0,
      [Symbol(timeout)]: null,
      [Symbol(kBuffer)]: null,
      [Symbol(kBufferCb)]: null,
      [Symbol(kBufferGen)]: null,
      [Symbol(kCapture)]: false,
      [Symbol(kSetNoDelay)]: true,
      [Symbol(kSetKeepAlive)]: true,
      [Symbol(kSetKeepAliveInitialDelay)]: 60,
      [Symbol(kBytesRead)]: 39,
      [Symbol(kBytesWritten)]: 1121
    },
    _header: 'POST https://api.openai.com/v1/embeddings HTTP/1.1\r\n' +
      'Accept: application/json, text/plain, */*\r\n' +
      'Content-Type: application/json\r\n' +
      'User-Agent: OpenAI/NodeJS/3.1.0\r\n' +
      'Authorization: Bearer sk-mykey\r\n' +
      'Content-Length: 822\r\n' +
      'host: api.openai.com\r\n' +
      'Connection: close\r\n' +
      '\r\n',
    _keepAliveTimeout: 0,
    _onPendingData: [Function: nop],
    agent: Agent {
      _events: [Object: null prototype],
      _eventsCount: 2,
      _maxListeners: undefined,
      defaultPort: 80,
      protocol: 'http:',
      options: [Object: null prototype],
      requests: [Object: null prototype] {},
      sockets: [Object: null prototype],
      freeSockets: [Object: null prototype] {},
      keepAliveMsecs: 1000,
      keepAlive: false,
      maxSockets: Infinity,
      maxFreeSockets: 256,
      scheduling: 'lifo',
      maxTotalSockets: Infinity,
      totalSocketCount: 1,
      [Symbol(kCapture)]: false
    },
    socketPath: undefined,
    method: 'POST',
    maxHeaderSize: undefined,
    insecureHTTPParser: undefined,
    joinDuplicateHeaders: undefined,
    path: 'https://api.openai.com/v1/embeddings',
    _ended: true,
    res: IncomingMessage {
      _readableState: [ReadableState],
      _events: [Object: null prototype],
      _eventsCount: 4,
      _maxListeners: undefined,
      socket: [Socket],
      httpVersionMajor: 1,
      httpVersionMinor: 1,
      httpVersion: '1.1',
      complete: true,
      rawHeaders: [],
      rawTrailers: [],
      joinDuplicateHeaders: undefined,
      aborted: false,
      upgrade: false,
      url: '',
      method: null,
      statusCode: 200,
      statusMessage: 'Connection established',
      client: [Socket],
      _consuming: true,
      _dumped: false,
      req: [Circular *1],
      responseUrl: 'https://api.openai.com/v1/embeddings',
      redirects: [],
      [Symbol(kCapture)]: false,
      [Symbol(kHeaders)]: {},
      [Symbol(kHeadersCount)]: 0,
      [Symbol(kTrailers)]: null,
      [Symbol(kTrailersCount)]: 0
    },
    aborted: false,
    timeoutCb: null,
    upgradeOrConnect: false,
    parser: null,
    maxHeadersCount: null,
    reusedSocket: false,
    host: '127.0.0.1',
    protocol: 'http:',
    _redirectable: Writable {
      _writableState: [WritableState],
      _events: [Object: null prototype],
      _eventsCount: 3,
      _maxListeners: undefined,
      _options: [Object],
      _ended: true,
      _ending: true,
      _redirectCount: 0,
      _redirects: [],
      _requestBodyLength: 822,
      _requestBodyBuffers: [],
      _onNativeResponse: [Function (anonymous)],
      _currentRequest: [Circular *1],
      _currentUrl: 'https://api.openai.com/v1/embeddings',
      [Symbol(kCapture)]: false
    },
    [Symbol(kCapture)]: false,
    [Symbol(kBytesWritten)]: 0,
    [Symbol(kEndCalled)]: true,
    [Symbol(kNeedDrain)]: false,
    [Symbol(corked)]: 0,
    [Symbol(kOutHeaders)]: [Object: null prototype] {
      accept: [Array],
      'content-type': [Array],
      'user-agent': [Array],
      authorization: [Array],
      'content-length': [Array],
      host: [Array]
    },
    [Symbol(errored)]: null,
    [Symbol(kUniqueHeaders)]: null
  },
  data: ''
}



'relation "public.pg" does not exist'

After creating a Supabase db and storing the creds in .env.local, when I run npm run embed, I get the following error:

error {
  code: '42P01',
  details: null,
  hint: null,
  message: 'relation "public.pg" does not exist'
}
