
markovify's Introduction


Markovify

Markovify is a simple, extensible Markov chain generator. Right now, its primary use is for building Markov models of large corpora of text and generating random sentences from them. In theory, however, it could be used for other applications.

Why Markovify?

Some reasons:

  • Simplicity. "Batteries included," but it is easy to override key methods.

  • Models can be stored as JSON, allowing you to cache your results and save them for later.

  • Text parsing and sentence generation methods are highly extensible, allowing you to set your own rules.

  • Relies only on pure-Python libraries, and very few of them.

  • Tested on Python 3.7, 3.8, 3.9, and 3.10.

Installation

pip install markovify

Basic Usage

import markovify

# Get raw text as string.
with open("/path/to/my/corpus.txt") as f:
    text = f.read()

# Build the model.
text_model = markovify.Text(text)

# Print five randomly-generated sentences
for i in range(5):
    print(text_model.make_sentence())

# Print three randomly-generated sentences of no more than 280 characters
for i in range(3):
    print(text_model.make_short_sentence(280))

Notes:

  • The usage examples here assume you are trying to markovify text. If you would like to use the underlying markovify.Chain class, which is not text-specific, check out the (annotated) source code.

  • Markovify works best with large, well-punctuated texts. If your text does not use periods to delineate sentences, put each sentence on its own line and use the markovify.NewlineText class instead of the markovify.Text class.

  • If you have accidentally read the input text as one long sentence, markovify will be unable to generate new sentences from it, due to a lack of beginning and ending delimiters. This issue can occur if you have read a newline-delimited file using markovify.Text instead of markovify.NewlineText. To check for this, the expression [key for key in txt.chain.model.keys() if "___BEGIN__" in key] returns all possible sentence-starting states; it should yield more than one result.

  • By default, the make_sentence method tries a maximum of 10 times per invocation to make a sentence that does not overlap too much with the original text. If it is successful, the method returns the sentence as a string. If not, it returns None. To increase or decrease the number of attempts, use the tries keyword argument, e.g., .make_sentence(tries=100).

  • By default, markovify.Text tries to generate sentences that do not simply regurgitate chunks of the original text. The default rule is to suppress any generated sentence that exactly overlaps the original text by 15 words or 70% of the sentence's word count. You can change this rule by passing max_overlap_ratio and/or max_overlap_total to the make_sentence method. Alternatively, this check can be disabled entirely by passing test_output=False.

Advanced Usage

Specifying the model's state size

State size is the number of preceding words that the probability of the next word depends on.

By default, markovify.Text uses a state size of 2, but you can instantiate a model with a different state size. For example:

text_model = markovify.Text(text, state_size=3)

Combining models

With markovify.combine(...), you can combine two or more Markov chains. The function accepts two arguments:

  • models: A list of markovify objects to combine. Can be instances of markovify.Chain or markovify.Text (or their subclasses), but all must be of the same type.
  • weights: Optional. A list — the exact length of models — of ints or floats indicating how much relative emphasis to place on each source. Default: [ 1, 1, ... ].

For instance:

model_a = markovify.Text(text_a)
model_b = markovify.Text(text_b)

model_combo = markovify.combine([ model_a, model_b ], [ 1.5, 1 ])

This code snippet would combine model_a and model_b, placing 50% more weight on the connections from model_a.

Compiling a model

Once a model has been generated, it may also be compiled for improved text generation speed and reduced size.

text_model = markovify.Text(text)
text_model = text_model.compile()

Models may also be compiled in-place:

text_model = markovify.Text(text)
text_model.compile(inplace = True)

Currently, compiled models may not be combined with other models using markovify.combine(...). If you wish to combine models, do that first and then compile the result.

Working with messy texts

Starting with v0.7.2, markovify.Text accepts two additional parameters: well_formed and reject_reg.

  • Setting well_formed = False skips the step in which input sentences are rejected if they contain one of the 'bad characters', i.e., ()[]'".

  • Setting reject_reg to a regular expression of your choice lets you change the input-sentence rejection pattern. This applies only if well_formed is True and the expression is non-empty.

Extending markovify.Text

The markovify.Text class is highly extensible; most methods can be overridden. For example, the following POSifiedText class uses NLTK's part-of-speech tagger to generate a Markov model that obeys sentence structure better than a naive model. (It works; however, be warned: pos_tag is very slow.)

import markovify
import nltk
import re

class POSifiedText(markovify.Text):
    def word_split(self, sentence):
        words = re.split(self.word_split_pattern, sentence)
        words = [ "::".join(tag) for tag in nltk.pos_tag(words) ]
        return words

    def word_join(self, words):
        sentence = " ".join(word.split("::")[0] for word in words)
        return sentence

Or, you can use spaCy, which is much faster:

import markovify
import re
import spacy

nlp = spacy.load("en_core_web_sm")

class POSifiedText(markovify.Text):
    def word_split(self, sentence):
        return ["::".join((word.orth_, word.pos_)) for word in nlp(sentence)]

    def word_join(self, words):
        sentence = " ".join(word.split("::")[0] for word in words)
        return sentence

The most useful markovify.Text methods you can override are:

  • sentence_split
  • sentence_join
  • word_split
  • word_join
  • test_sentence_input
  • test_sentence_output

For details on what they do, see the (annotated) source code.

Exporting

It can take a while to generate a Markov model from a large corpus. Sometimes you'll want to generate once and reuse it later. To export a generated markovify.Text model, use my_text_model.to_json(). For example:

corpus = open("sherlock.txt").read()

text_model = markovify.Text(corpus, state_size=3)
model_json = text_model.to_json()
# In theory, here you'd save the JSON to disk, and then read it back later.

reconstituted_model = markovify.Text.from_json(model_json)
reconstituted_model.make_short_sentence(280)

>>> 'It cost me something in foolscap, and I had no idea that he was a man of evil reputation among women.'

You can also export the underlying Markov chain on its own — i.e., excluding the original corpus and the state_size metadata — via my_text_model.chain.to_json().

Generating markovify.Text models from very large corpora

By default, the markovify.Text class loads, and retains, your textual corpus, so that it can compare generated sentences with the original (and only emit novel sentences). However, with very large corpora, loading the entire text at once (and retaining it) can be memory-intensive. To overcome this, you can (a) tell Markovify not to retain the original:

with open("path/to/my/huge/corpus.txt") as f:
    text_model = markovify.Text(f, retain_original=False)

print(text_model.make_sentence())

And (b) read in the corpus line-by-line or file-by-file and combine them into one model at each step:

import os

combined_model = None
for (dirpath, _, filenames) in os.walk("path/to/my/huge/corpus"):
    for filename in filenames:
        with open(os.path.join(dirpath, filename)) as f:
            model = markovify.Text(f, retain_original=False)
            if combined_model:
                combined_model = markovify.combine(models=[combined_model, model])
            else:
                combined_model = model

print(combined_model.make_sentence())

Markovify In The Wild

Have other examples? Pull requests welcome.

Thanks

Many thanks to the following GitHub users for contributing code and/or ideas:

Initially developed at BuzzFeed.

markovify's People

Contributors

ammgws, avinassh, ben-bay, brienna, bziarkowski, erichalldev, eumiro, ewgraf, fitnr, jsvine, kade-robertson, maryszmary, matthewscholefield, monosans, ntratcliff, nwithan8, orf, otakumegane, pmlandwehr, portasynthinca3, rokala, schollz, serhiistets, smalawi, stepjue, swartzcr, sylv-lej, terisikk, thallada, weiss-d


markovify's Issues

Make markovify.Text accept a file-like object or generator to reduce memory footprint when using large files?

I've written a script to recursively find all files with a given extension, generate a chain for each, and (once all files have an associated chain), combine them into one mega-chain and store it.
I'm running this on a very large directory (~1.4 GB), and while writing my script I was aware that holding all of that in RAM (as markovify.Text only accepts strings) would probably be an issue.
I was correct; not two seconds after running it, the process was killed.
Is there a way to modify .Text and .NewlineText so they can accept (and properly process, of course), a generator or file-like object to iterate over?
I have no problem implementing this myself and filing a pull request, I'm just unsure how to deal with sentence splitting along chunks.

make_sentence_with_start() returns error if only one word is used

It used to just use memory and do nothing, but now it gives:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Me\AppData\Local\Programs\Python\Python35-32\lib\site-packages\mark
ovify\text.py", line 140, in make_sentence_with_start
    return self.make_sentence(init_state, **kwargs)
  File "C:\Users\Me\AppData\Local\Programs\Python\Python35-32\lib\site-packages\mark
ovify\text.py", line 122, in make_sentence
    words = prefix + self.chain.walk(init_state)
  File "C:\Users\Me\AppData\Local\Programs\Python\Python35-32\lib\site-packages\mark
ovify\chain.py", line 114, in walk
    return list(self.gen(init_state))
  File "C:\Users\Me\AppData\Local\Programs\Python\Python35-32\lib\site-packages\mark
ovify\chain.py", line 103, in gen
    next_word = self.move(state)
  File "C:\Users\Me\AppData\Local\Programs\Python\Python35-32\lib\site-packages\mark
ovify\chain.py", line 89, in move
    choices, weights = zip(*self.model[state].items())
KeyError: ('I',)

This is in Python 3.5.2

Code:

with open("BlocSpeaks.txt", "r", encoding="utf-8") as txt:
    text = txt.read()

text_model = markovify.Text(text)

text_model.make_sentence_with_start("I")
#Returns error as seen above

text_model.make_sentence_with_start("I dont")

'''"I dont play BLOC tho fucking homofag I can not longer stand Commy's defiance he is unkillable He only was because she hadn\u2019t done anything."'''

text_model.make_sentence_with_start("I will")
'''I will ban you without a special kind of dull tbh back If you need to check bloc?'''

multiple sentences?

Perhaps I am missing something, but is there a way to create multiple sentences? I was able to use the example code to create individual sentences, but when I put two sentences next to each other they seem massively disjointed. Just wondering if there is a way to generate a paragraph.

Example of how to use .make_sentence(init_state = something)?

What exactly must the format of the something be? I tried putting in a string of a word, a tuple with the word, and a list with the word. Despite trying a lot of formats, I keep getting this error:

line 213, in make_sentence
    words = prefix + self.chain.walk(init_state)
  File "C:\Users\JGC\AppData\Local\Programs\Python\Python37\lib\site-packages\markovify\chain.py", line 137, in walk
    return list(self.gen(init_state))
  File "C:\Users\JGC\AppData\Local\Programs\Python\Python37\lib\site-packages\markovify\chain.py", line 126, in gen
    next_word = self.move(state)
  File "C:\Users\JGC\AppData\Local\Programs\Python\Python37\lib\site-packages\markovify\chain.py", line 112, in move
    choices, weights = zip(*self.model[state].items())
KeyError: 'was'

I'm using markovify 0.8.0 on a python 3.7 environment.

markovify.combine(...) always returns a markovify.Text regardless of input type

While all the inputs to markovify.combine must be the same class, and that class doesn't have to be a Text, the output is always a Text. This means that any methods I've overridden in my own custom class disappear when I combine two models. While not necessarily a bug, this should at least be noted or documented for combine().

Issue with make_sentence_with_start

  File "/mnt/c/Users/Asdf/Documents/1 - Git/AsdfTestBot/src/atbMarkov.py", line 38, in getMarkov
    ret = builtins.markov.make_sentence_with_start(lastWord, tries=30)
  File "/usr/local/lib/python3.4/dist-packages/markovify/text.py", line 156, in make_sentence_with_start
    return self.make_sentence(init_state, **kwargs)
  File "/usr/local/lib/python3.4/dist-packages/markovify/text.py", line 124, in make_sentence
    words = prefix + self.chain.walk(init_state)
  File "/usr/local/lib/python3.4/dist-packages/markovify/chain.py", line 114, in walk
    return list(self.gen(init_state))
  File "/usr/local/lib/python3.4/dist-packages/markovify/chain.py", line 103, in gen
    next_word = self.move(state)
  File "/usr/local/lib/python3.4/dist-packages/markovify/chain.py", line 89, in move
    choices, weights = zip(*self.model[state].items())
KeyError: ('___BEGIN__', 'men')

I am always sending a single word -- I used the normal method to get the sentence "Men, all this stuff you hear about America not wanting to fight, wanting to fight, wanting to fight, wanting to stay out of him all the time if he expects to keep on advancing." I then used make_sentence_with_start("men", tries=30), and that stack trace above is the result. This happens for any word.

What am I doing wrong?

AttributeError: 'Text' object has no attribute 'parsed_sentences'

Hi, I'm trying to combine some text models and this is the error that I got.

Traceback (most recent call last):
  File "generate_files.py", line 57, in <module>
    main()
  File "generate_files.py", line 49, in main
    combined_model = markovify.combine([last_day_model, last_week_model, last_month_model, last_six_month_model], [10, 5, 2, 1])
  File "/home/lorenzo/project/m5bot/lib/python3.4/site-packages/markovify/utils.py", line 50, in combine
    combined_sentences += m.parsed_sentences
AttributeError: 'Text' object has no attribute 'parsed_sentences'

Something similar to begins with

Hi, this may be the wrong place to open this issue as it is more of a question/feature request, but is there any way to make something similar to the make_sentence_with_start method that makes a sentence ending with a certain string? Someone told me I could just use the beginning-with sentence and reverse it, but the grammar would then be completely off, I am sure. Is there a way to make it so that it creates a sentence that ENDS with a certain string? Any help would be greatly appreciated. Thank you :)

make_sentence_that_contains

Is it possible to create a Markov chain that contains a key word?

I'm thinking of a chat bot situation. For example, the word "computer" is used in the utterance, then this command would be called: text_model.make_sentence_that_contains("computer")

If so, could you please either add this to markovify, or show me how I could write the code for my personal use.

Many thanks.

KeyError: ('___BEGIN__', '___BEGIN__')

I'm using a simple text_model.make_short_sentence(140). Nothing new here.

I can't seem to figure this out, it was working just fine. Here's the traceback:

Traceback (most recent call last):
  File "bot.py", line 51, in <module>
    main()
  File "bot.py", line 34, in main
    markov_text = marko.gen_markov(f=filepath)
  File "/home/icyphox/code/memebot/modules/marko.py", line 18, in gen_markov
    model = markovify.Text(text)
  File "/home/icyphox/code/memebot/.env/lib/python3.6/site-packages/markovify/text.py", line 37, in __init__
    self.chain = chain or Chain(self.parsed_sentences, state_size)
  File "/home/icyphox/code/memebot/.env/lib/python3.6/site-packages/markovify/chain.py", line 45, in __init__
    self.precompute_begin_state()
  File "/home/icyphox/code/memebot/.env/lib/python3.6/site-packages/markovify/chain.py", line 80, in precompute_begin_state
    choices, weights = zip(*self.model[begin_state].items())
KeyError: ('___BEGIN__', '___BEGIN__')

Merge models

Hi, if i have separate models built and saved as json, how can i merge them before generating a sentence?
Thanks

Creating chains from very large files.

I am working on a project for generating HTML pages from a Markov chain. My current process is to copy the text of all .html pages in a folder into a single .txt file, then create a Markov model from that file.

The problem with this is that markovify loads the entire file into memory to create the model, and I have ~50GB of files.

Is there any way to create the model line-by-line / in chunks / read from disk instead of the whole thing at once?

Shouldn't loading in a previous model take far less time?

As in, when I do this:

corpus = open('sherlock.txt').read()
corpus = re.sub(r'\s+', ' ', corpus).strip()
text_model = POSifiedText(corpus, state_size=4)
model1_json = text_model.chain.to_json()
with open('data.txt', 'w') as outfile:
    json.dump(model1_json, outfile)

with open('data.txt') as json_file:
    model2_json = json.load(json_file)

and....

NEW1_MODEL = POSifiedText.from_json(model2_json)
NEW2_MODEL = POSifiedText(corpus, state_size=4)

How come NEW1_MODEL takes the exact same amount of time to generate sentences as NEW2_MODEL? Shouldn't it take far less time, because it's using an already-generated model rather than a fresh corpus? Is there a huge error in my reasoning here?

Model portability

Hi, I was wondering how difficult it would be to extend the underlying classes, so that the model can be stored, say, redis.

NLP Example doesn't work

The NLP example provided in the README currently causes markovify to spit out a load of Nones instead of sentences.

Your definition of Markov states is problematic

I noticed that in chain.py --> line 66, you defined your "follow" as a single element in a "run".

However, the real state could be a bigram or n-gram, depending on the state size. So clearly there is a mismatch between the current state and the following state it is transitioning to.

Did I misunderstand you?

make_sentence_that_contains

I'm sorry to re-open this feature request (#52), but I've been trying the suggested approach of using a while loop, and with a corpus of 6 MB it takes almost 30 seconds (or more) to generate a sentence containing the specific word.
Would it be possible to implement this feature in the library so that creating such sentences is faster?

Thank you very much, and continue the good work!

markovify.Text(...) giving an error

The markovify.Text(...) method gives an error:
AttributeError: 'module' object has no attribute 'Text'

I'm using python 2.7...I just did pip install markovify and tried to run the sample code.

Problems with json exporting

  1. The README mentions markovify.Text.from_json(), which doesn't exist -- should be something like markovify.Text.from_chain(markovify.Chain.from_json()).

  2. If you use text_model = markovify.NewlineText(text, state_size=3), export it as json and then load it back, when you use text_model.make_short_sentence(140) you get verbatim sentences out of the corpus. My corpus is composed of lines with fewer than 140 characters, but in any case, if I don't export the text_model to json and then load it back, sentences are generated correctly.

Test fails straight out of the box

I've cloned the repository and tried running the unittest test.test_itertext. This test doesn't require setting up the sherlock model. It reads the text files that come with the package and builds the models inside the test, so I didn't have any input into it. The error I keep getting is this:

(base) C:\Users\JGC\Desktop\Trabalhos\Python\markovify>python -m unittest test.test_itertext
EE.E
======================================================================
ERROR: test_from_json_without_retaining (test.test_itertext.MarkovifyTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\test\test_itertext.py", line 25, in test_from_json_without_retaining 
    original_model = markovify.Text(f, retain_original=False)
  File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\markovify\text.py", line 53, in __init__
    parsed = parsed_sentences or self.generate_corpus(input_text)
  File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\markovify\text.py", line 152, in generate_corpus
    for line in text:
  File "C:\Users\JGC\anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3552: character maps to <undefined>

======================================================================
ERROR: test_from_mult_files_without_retaining (test.test_itertext.MarkovifyTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\test\test_itertext.py", line 37, in test_from_mult_files_without_retaining
    models.append(markovify.Text(f, retain_original=False))
  File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\markovify\text.py", line 53, in __init__
    parsed = parsed_sentences or self.generate_corpus(input_text)
  File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\markovify\text.py", line 152, in generate_corpus
    for line in text:
  File "C:\Users\JGC\anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3552: character maps to <undefined>

======================================================================
ERROR: test_without_retaining (test.test_itertext.MarkovifyTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\test\test_itertext.py", line 18, in test_without_retaining
    senate_model = markovify.Text(f, retain_original=False)
  File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\markovify\text.py", line 53, in __init__
    parsed = parsed_sentences or self.generate_corpus(input_text)
  File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\markovify\text.py", line 152, in generate_corpus
    for line in text:
  File "C:\Users\JGC\anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3552: character maps to <undefined>

----------------------------------------------------------------------
Ran 4 tests in 0.725s

FAILED (errors=3)

Running a conda 3.7.6 environment on Windows 10.

The method make_sentence_with_start() seems like it only finds sentences that start with the seed.

It seems like make_sentence_with_start() is only looking through the corpus for lines/sentences that begin with the string you pass it, as opposed to searching the entire sentence for that word. For instance, I can add "I like to program with markov chains", and then trying make_sentence_with_start("markov") will fail. However, if the log contains "markov chains are neat", then the chain will succeed, because the sentence begins with "markov". Is there any way to make it search the entire sentence for the word, instead of just the beginning? For reference, here's the method we're using to build a Markov chain:

def markovstring(prefix=None):
    with open ("log.txt", 'r', encoding='utf8') as file:
        text = file.read()
        text_model = markovify.NewlineText(text)
        if prefix is not None:
            try:
                message = text_model.make_sentence_with_start(prefix)
                for x in range(10):
                    message = text_model.make_sentence_with_start(prefix)
                if message is not None:
                    return message
                else:
                    return "Cannot create a chain with that string " + prefix
            except:
                return "Cannot create a chain with the string " + prefix
        else:
            message = text_model.make_sentence()
            while message is None:
                message = text_model.make_sentence()
            return message

NewlineText bug

AttributeError: 'module' object has no attribute 'NewlineText' happens on a text file with newlines

import markovify

with open("phrases.txt") as f:
    content = f.read()

comment_model = markovify.NewlineText(content)
print(comment_model.make_sentence())

aaaand it won't open up the text file even with Text. Returns None.

Cannot combine Text models if they contain ()"'[] in the string

Environment

  • markovify-0.6.4
  • Python 3.6.3
  • Arch Linux

Description

If you are creating a new Text or NewlineText model and the string contains any of ()"'[], you will get a KeyError when trying to combine it with another model.

Here's some example code that demonstrates this issue. (Note that you can get titles-33(1)(a).txt from https://github.com/wragge/sekritfiles/blob/master/data/titles-33(1)(a).txt.)

#!/usr/bin/env python3

import markovify

with open('titles-33(1)(a).txt') as f:
    text = f.read()

model = markovify.NewlineText(text)
print(model.make_sentence())

broken_strings = [
        'this string is not broken',
        'this string contains (',
        '(',
        ')',
        '[',
        ']',
        '"',
        "'",
        ]

for broken in broken_strings:
    try:
        new_model = markovify.NewlineText(broken)
        combined_model = markovify.combine([model, new_model])
    except KeyError:
        print('Broken string:', broken)

Stacktrace

Traceback (most recent call last):
  File "./broken.py", line 23, in <module>
    new_model = markovify.NewlineText(broken)
  File "<redacted>/env/lib/python3.6/site-packages/markovify/text.py", line 36, in __init__
    self.chain = chain or Chain(self.parsed_sentences, state_size)
  File "<redacted>/env/lib/python3.6/site-packages/markovify/chain.py", line 45, in __init__
    self.precompute_begin_state()
  File "<redacted>/env/lib/python3.6/site-packages/markovify/chain.py", line 80, in precompute_begin_state
    choices, weights = zip(*self.model[begin_state].items())
KeyError: ('___BEGIN__', '___BEGIN__')

proper way to use to_json?

I am trying to generate a corpus with markovify and then save it as JSON to a file. Is this possible? I've been trying to use generate_corpus() and to_json() but keep getting an error that corpus must be a list of lists.

Thanks

Question about storing models as JSON

The readme mentions :

Models can be stored as JSON, allowing you to cache your results and save them for later.

Just to clarify, is this only referring to markovify.Chain instances? Looking at the source I can only find methods for loading a Markov chain from a file/object, but not for markovify.Text models.

What is valid text input for markovify.Text?

I'm trying out this functionality in conjunction with gensim's doc2vec (https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb). I'm getting an error when I try to use the ACL IMDB prepared text from the gensim example. I've attached a sample document (sandbox.txt) that causes this error, as well as two samples that work (jayz.txt & sherlock.txt).

I'd like to identify what the difference is between these files that triggers the error in one but not the other two. Thanks!

jayz.txt
sandbox.txt
sherlock.txt


IndexError                                Traceback (most recent call last)
in ()
      4 text = f.read()
      5
----> 6 testModel = markovify.Text(text)
      7 tool = grammar_check.LanguageTool("en-US")
      8

C:\Anaconda\lib\site-packages\markovify\text.pyc in __init__(self, input_text, state_size, chain)
     22         self.rejoined_text = self.sentence_join(map(self.word_join, runs))
     23         state_size = state_size or 2
---> 24         self.chain = chain or Chain(runs, state_size)
     25
     26     def sentence_split(self, text):

C:\Anaconda\lib\site-packages\markovify\chain.pyc in __init__(self, corpus, state_size, model)
     36         """
     37         self.state_size = state_size
---> 38         self.model = model or self.build(corpus, self.state_size)
     39
     40     def build(self, corpus, state_size):

C:\Anaconda\lib\site-packages\markovify\chain.pyc in build(self, corpus, state_size)
     46         appears.
     47         """
---> 48         if (type(corpus) != list) or (type(corpus[0]) != list):
     49             raise Exception("corpus must be list of lists")
     50

IndexError: list index out of range

Doesn't work with http://www.anc.org/data/oanc/ngram/

Even with #93, it still returns None.

import string

import requests
from bs4 import BeautifulSoup

params = {
    'key': '* love *',
    'print_max': 100,
    'freq_threshold': 0,
    'output_style': 'sentence',
    'output_aux': 0,
    'print_format': 'text',
    'sort': None
}

r = requests.get('http://www.anc.org/cgi-bin/ngrams.cgi', params=params)
text = BeautifulSoup(r.text, 'html.parser').text
for punc in string.punctuation:
    text = text.replace(punc, '')

using make_sentence_with_start?

What is the proper way to call this function? When I try to pass a string of two words, I get an error:

Also, is there a way to seed it with only one word? Say I want the generated text to start with "You"; is there a way to do this?

Thank you

Traceback (most recent call last):
  File "reddit.py", line 6, in <module>
    print(text_model.make_sentence_with_start('you need'))
  File "C:\Python3\lib\site-packages\markovify\text.py", line 139, in make_sentence_with_start
    return self.make_sentence(init_state, **kwargs)
  File "C:\Python3\lib\site-packages\markovify\text.py", line 122, in make_sentence
    words = prefix + self.chain.walk(init_state)
  File "C:\Python3\lib\site-packages\markovify\chain.py", line 98, in walk
    return list(self.gen(init_state))
  File "C:\Python3\lib\site-packages\markovify\chain.py", line 87, in gen
    next_word = self.move(state)
  File "C:\Python3\lib\site-packages\markovify\chain.py", line 73, in move
    choices, weights = zip(*self.model[state].items())
KeyError: ('you', 'need')
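Reading the traceback, the KeyError means the two-word start passed to make_sentence_with_start never appears as a chain state in the model. A sketch of the membership check, with a plain dict standing in for text_model.chain.model (the structure here is hypothetical, inferred from the traceback):

```python
# Toy stand-in for a markovify chain model: state tuple -> follower counts.
model = {
    ("___BEGIN__", "___BEGIN__"): {"You": 3, "I": 5},
    ("You", "need"): {"sleep.": 1},
}

def can_start_with(model, *words):
    """Return True if the given words form a known chain state."""
    return tuple(words) in model
```

Note the check is case-sensitive, which is one common reason a seemingly present phrase raises KeyError.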

Feature request: generate sentence from given input word

Does the API provide a way to generate a sentence that contains the input word?

eg.
Input word: accessed
Output: He went to the bank because somebody had accessed his account.

If the lib doesn't currently support this but it can be achieved with a simple modification, feel free to point me in the right direction and I'll have a go.
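As far as I know this isn't built in, but a simple retry loop gets close. Here `make_sentence` is a stand-in for a model's method; with a real model you would pass `text_model.make_sentence`:

```python
# Hypothetical workaround: keep generating until a sentence contains the
# target word, giving up after a fixed number of tries.
def sentence_containing(make_sentence, word, tries=100):
    for _ in range(tries):
        sentence = make_sentence()
        if sentence is not None and word in sentence.split():
            return sentence
    return None
```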

make_short_sentence can get stuck in loop

make_short_sentence is not capped at a number of attempts the way make_sentence is, so it can get stuck in an infinite loop when make_sentence cannot produce a sentence short enough.
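A bounded variant, sketched: cap the attempts and return None instead of spinning forever. `make_sentence` below stands in for the model's method; this is an illustration of the fix, not the library's code:

```python
# Bounded replacement for an unbounded make_short_sentence loop.
def make_short_sentence_bounded(make_sentence, max_chars, tries=100):
    for _ in range(tries):
        sentence = make_sentence()
        if sentence is not None and len(sentence) <= max_chars:
            return sentence
    return None  # give up instead of looping forever
```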

make_sentence never succeeds with seemingly normal input

This is the corpus, which always fails: https://my.mixtape.moe/synvpe.txt
This is a slightly modified corpus, which always succeeds, but always gives the same output: https://my.mixtape.moe/yrlbvb.txt
This is the code I'm using to test the markov output:

#!/usr/bin/env python3
import markovify

with open("tootstr.txt") as f:
    toots_str = f.read()

model = markovify.NewlineText(toots_str)
print(model.make_sentence(tries=100000))

Is there anything I'm doing wrong here? I'm using this code to create a bot. I've created over a dozen other bots using the exact same source code, and this is the first to have this issue.

make_sentence_with_start() doesn't work

Is it really broken? I can use make_sentence(), but when I try make_sentence_with_start() I get errors:

Traceback (most recent call last):
  File "C:\test.py", line 12, in <module>
    string = text_model.make_sentence_with_start("Hello")
  File "c:\python\lib\site-packages\markovify\text.py", line 254, in make_sentence_with_start
    output = self.make_sentence(init_state, **kwargs)
  File "c:\python\lib\site-packages\markovify\text.py", line 197, in make_sentence
    words = prefix + self.chain.walk(init_state)
  File "c:\python\lib\site-packages\markovify\chain.py", line 118, in walk
    return list(self.gen(init_state))
  File "c:\python\lib\site-packages\markovify\chain.py", line 107, in gen
    next_word = self.move(state)
  File "c:\python\lib\site-packages\markovify\chain.py", line 93, in move
    choices, weights = zip(*self.model[state].items())
KeyError: ('___BEGIN__', 'Hello')

Code:

import markovify
import codecs

with codecs.open("file.txt", encoding='utf-8') as f:
    text = f.read()

# Build the model.
text_model = markovify.Text(text)

#string = text_model.make_sentence(tries=100)
string = text_model.make_sentence_with_start("Hello")

Markovify 0.7.2, Python 3.7.4

POSified text conflicting with json?

I'm working on having my text_model save to/load from json. If I use text_model = markovify.Text(foo) everything works OK, but if I use text_model = POSifiedText(foo), after reconstituting it back from json I get output like this:

I::PRP want::VBP to::TO share::NN it::PRP with::IN whomever::NN wants::VBZ it.::JJ
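One likely cause, worth checking: the model was reconstituted via the base markovify.Text rather than the POSifiedText subclass, so the subclass's word_join (which strips the tags) never runs. A toy version of that join, assuming the usual word::TAG scheme:

```python
# Toy sketch of a POSifiedText-style word_join that drops the "::TAG"
# suffixes. If the base class's default space-join runs instead, the tags
# leak into the output exactly as shown above.
def word_join(words):
    return " ".join(word.split("::")[0] for word in words)
```

Reloading with `POSifiedText.from_json(saved_json)` instead of `markovify.Text.from_json(saved_json)` should keep the subclass's split/join methods in play.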

"All `models` must be of the same type."

Hello,

There’s currently a check in utils.combine that raises an exception if all models are not of the same type. Given that get_model_dict is called on each model, why do we need to assert that all models are of the same type?

My problem is that I inherit classes from Text and type(my_model) doesn’t take that into account. See a minimal example below:

class MyText(markovify.Text):
    pass

class MyOtherText(MyText):
    pass

markovify.combine([MyText("foo"), MyOtherText("bar")])
# Raises: ValueError: All `models` must be of the same type.

Question about combining models

If I do markovify.combine() with no weighting, is it effectively the same as training one model on the texts of all the combined models? I ask because I'd like to train a model on a lot of text files, and it works out easier to create separate models and then combine them, as long as that works the way I'm expecting.
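In a count-based chain, combining with equal weights amounts to summing per-state transition counts, which is what training once on the concatenated corpora would produce. A toy illustration with plain dicts standing in for the internal models (not markovify's actual API):

```python
from collections import Counter

# Merge two toy chain models by summing each state's follower counts,
# mirroring what an equal-weight combine should do.
def combine_counts(model_a, model_b):
    combined = {}
    for state in set(model_a) | set(model_b):
        counts = Counter(model_a.get(state, {})) + Counter(model_b.get(state, {}))
        combined[state] = dict(counts)
    return combined
```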

small datasets

I've constructed a VERY small corpus of sentences.

I like you
you like pie

From these, I thought I would be able to generate

I like pie

but make_sentence(test_output=False) only generates the two sentences which are in the original corpus.
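A guess at the explanation: with the default state_size=2, the only continuation of the state ('I', 'like') seen in training is 'you', so "I like pie" is unreachable; with state_size=1 the single-word state 'like' can be followed by either 'you' or 'pie'. A toy walk over those one-word states (sentinel names borrowed from markovify, but the code is a stand-alone illustration, not the library):

```python
import random

# Transitions learned from "I like you" / "you like pie" with state size 1.
transitions = {
    "___BEGIN__": ["I", "you"],
    "I": ["like"],
    "you": ["like", "___END__"],
    "like": ["you", "pie"],
    "pie": ["___END__"],
}

def walk(rng):
    """Random walk from BEGIN to END, returning the visited words."""
    words, state = [], "___BEGIN__"
    while True:
        state = rng.choice(transitions[state])
        if state == "___END__":
            return " ".join(words)
        words.append(state)

rng = random.Random(0)
sentences = {walk(rng) for _ in range(200)}
```

With these one-word states, "I like pie" comes out readily alongside the two training sentences.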

"___BEGIN__" in output of make_sentence_with_start()

With Python 2.7.9 and markovify v0.7.0, when I use model.make_sentence_with_start() with strict=True, I am getting the literal string ___BEGIN__ as the first word of the sentence. Occasionally I even get this with strict=False, but much less often (my corpus text is ~6 MB and many sentences start with and contain the word I requested).

For example, it produced ___BEGIN__ Who is right? with this:

import markovify

text = open('corpus.txt').read()
model = markovify.Text(text, state_size=3)
print(model.make_sentence_with_start('Who'))
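Until the underlying bug is sorted out, a blunt client-side workaround is to strip the leaked token from the output. This is only a band-aid (strip_begin is a hypothetical helper, not a library fix):

```python
BEGIN = "___BEGIN__"

# Remove a leaked ___BEGIN__ token from the front of a generated sentence;
# pass None (a failed generation) straight through.
def strip_begin(sentence):
    if sentence and sentence.startswith(BEGIN):
        return sentence[len(BEGIN):].lstrip()
    return sentence
```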

Scoring generated text

Perhaps I'm missing something, but is there a function here to score the generated text? I'm planning to use Markovify to predict whether a given text is anomalous relative to the trained model.

State Size Limit

Is there a state size limit?

A state size > 5 just takes forever to run.
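There is no hard limit that I know of, but for a fixed corpus the number of distinct states grows with state size until almost every state is unique, so the model gets large and slow to build, and generation mostly replays the training text. A toy counter illustrating the growth (count_states is a hypothetical helper, not part of markovify):

```python
# Count distinct n-gram states in a word sequence; as state_size rises,
# nearly every state becomes unique.
def count_states(words, state_size):
    return len({
        tuple(words[i:i + state_size])
        for i in range(len(words) - state_size + 1)
    })
```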

Issue with saving model to json

I'm using Python 2.7, and when I try to save the model to JSON I get the following error:

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 40-42: invalid continuation byte

Full traceback:

Traceback (most recent call last):
  File "python/markov.py", line 23, in <module>
    model_json = text_model.to_json()
  File "/usr/local/lib/python2.7/dist-packages/markovify/text.py", line 57, in to_json
    return json.dumps(self.to_dict())
  File "/usr/local/lib/python2.7/dist-packages/markovify/text.py", line 49, in to_dict
    "chain": self.chain.to_json(),
  File "/usr/local/lib/python2.7/dist-packages/markovify/chain.py", line 124, in to_json
    return json.dumps(list(self.model.items()))
  File "/usr/lib/python2.7/json/__init__.py", line 244, in dumps
    return _default_encoder.encode(obj)
  File "/usr/lib/python2.7/json/encoder.py", line 207, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python2.7/json/encoder.py", line 270, in iterencode
    return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 40-42: invalid continuation byte
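This looks like the classic Python 2 bytes-versus-unicode trap: the corpus was read as raw bytes that are not valid UTF-8, and json.dumps trips over them at serialization time. Reading the file as unicode up front, with an explicit encoding and an error policy for stray bytes, sidesteps it; a sketch using io.open, which behaves the same on Python 2 and 3:

```python
import io

# Read a corpus as unicode so json serialization never has to guess;
# errors="replace" substitutes U+FFFD for undecodable bytes.
def read_text(path, encoding="utf-8"):
    with io.open(path, encoding=encoding, errors="replace") as f:
        return f.read()
```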

Seed first word of sentence

Hello,

I was just wondering if it is possible to seed the first word of the sentence you want to create?

So I can have my sentence always start with the word I want?

Many thanks!
