
markovify's Introduction


Markovify

Markovify is a simple, extensible Markov chain generator. Right now, its primary use is for building Markov models of large corpora of text and generating random sentences from them. In theory, however, it could be used for other applications.

Why Markovify?

Some reasons:

  • Simplicity. "Batteries included," but it is easy to override key methods.

  • Models can be stored as JSON, allowing you to cache your results and save them for later.

  • Text parsing and sentence generation methods are highly extensible, allowing you to set your own rules.

  • Relies only on pure-Python libraries, and very few of them.

  • Tested on Python 3.7, 3.8, 3.9, and 3.10.

Installation

pip install markovify

Basic Usage

import markovify

# Get raw text as string.
with open("/path/to/my/corpus.txt") as f:
    text = f.read()

# Build the model.
text_model = markovify.Text(text)

# Print five randomly-generated sentences
for i in range(5):
    print(text_model.make_sentence())

# Print three randomly-generated sentences of no more than 280 characters
for i in range(3):
    print(text_model.make_short_sentence(280))

Notes:

  • The usage examples here assume you are trying to markovify text. If you would like to use the underlying markovify.Chain class, which is not text-specific, check out the (annotated) source code.

  • Markovify works best with large, well-punctuated texts. If your text does not use periods to delineate sentences, put each sentence on its own line and use the markovify.NewlineText class instead of the markovify.Text class.

  • If you have accidentally read the input text as one long sentence, markovify will be unable to generate new sentences from it, due to a lack of beginning and ending delimiters. This issue can occur if you have read a newline-delimited file using markovify.Text instead of markovify.NewlineText. To check for this, the expression [key for key in txt.chain.model.keys() if "___BEGIN__" in key] returns all possible sentence-starting states; it should yield more than one result.

  • By default, the make_sentence method tries a maximum of 10 times per invocation to make a sentence that does not overlap too much with the original text. If it is successful, the method returns the sentence as a string. If not, it returns None. To increase or decrease the number of attempts, use the tries keyword argument, e.g., .make_sentence(tries=100).

  • By default, markovify.Text tries to generate sentences that do not simply regurgitate chunks of the original text. The default rule is to suppress any generated sentence that exactly overlaps the original text by 15 words or 70% of the sentence's word count. You can change this rule by passing max_overlap_ratio and/or max_overlap_total to the make_sentence method. Alternatively, this check can be disabled entirely by passing test_output=False.

Advanced Usage

Specifying the model's state size

State size is the number of preceding words that the probability of the next word depends on.

By default, markovify.Text uses a state size of 2, but you can instantiate a model with a different state size. For example:

text_model = markovify.Text(text, state_size=3)

Combining models

With markovify.combine(...), you can combine two or more Markov chains. The function accepts two arguments:

  • models: A list of markovify objects to combine. Can be instances of markovify.Chain or markovify.Text (or their subclasses), but all must be of the same type.
  • weights: Optional. A list — the exact length of models — of ints or floats indicating how much relative emphasis to place on each source. Default: [ 1, 1, ... ].

For instance:

model_a = markovify.Text(text_a)
model_b = markovify.Text(text_b)

model_combo = markovify.combine([ model_a, model_b ], [ 1.5, 1 ])

This code snippet would combine model_a and model_b, placing 50% more weight on the connections from model_a.

Compiling a model

Once a model has been generated, it may also be compiled for improved text generation speed and reduced size.

text_model = markovify.Text(text)
text_model = text_model.compile()

Models may also be compiled in-place:

text_model = markovify.Text(text)
text_model.compile(inplace = True)

Currently, compiled models may not be combined with other models using markovify.combine(...). If you wish to combine models, do that first and then compile the result.

Working with messy texts

Starting with v0.7.2, markovify.Text accepts two additional parameters: well_formed and reject_reg.

  • Setting well_formed = False skips the step in which input sentences are rejected if they contain one of the 'bad characters', i.e., ()[]'".

  • Setting reject_reg to a regular expression of your choice lets you change the input-sentence rejection pattern. This applies only if well_formed is True and the expression is non-empty.

Extending markovify.Text

The markovify.Text class is highly extensible; most methods can be overridden. For example, the following POSifiedText class uses NLTK's part-of-speech tagger to generate a Markov model that obeys sentence structure better than a naive model. (It works; however, be warned: pos_tag is very slow.)

import markovify
import nltk
import re

class POSifiedText(markovify.Text):
    def word_split(self, sentence):
        words = re.split(self.word_split_pattern, sentence)
        words = [ "::".join(tag) for tag in nltk.pos_tag(words) ]
        return words

    def word_join(self, words):
        sentence = " ".join(word.split("::")[0] for word in words)
        return sentence

Or, you can use spaCy, which is much faster:

import markovify
import re
import spacy

nlp = spacy.load("en_core_web_sm")

class POSifiedText(markovify.Text):
    def word_split(self, sentence):
        return ["::".join((word.orth_, word.pos_)) for word in nlp(sentence)]

    def word_join(self, words):
        sentence = " ".join(word.split("::")[0] for word in words)
        return sentence

The most useful markovify.Text methods you can override are:

  • sentence_split
  • sentence_join
  • word_split
  • word_join
  • test_sentence_input
  • test_sentence_output

For details on what they do, see the (annotated) source code.

Exporting

It can take a while to generate a Markov model from a large corpus. Sometimes you'll want to generate once and reuse it later. To export a generated markovify.Text model, use my_text_model.to_json(). For example:

corpus = open("sherlock.txt").read()

text_model = markovify.Text(corpus, state_size=3)
model_json = text_model.to_json()
# In theory, here you'd save the JSON to disk, and then read it back later.

reconstituted_model = markovify.Text.from_json(model_json)
reconstituted_model.make_short_sentence(280)

>>> 'It cost me something in foolscap, and I had no idea that he was a man of evil reputation among women.'

You can also export the underlying Markov chain on its own — i.e., excluding the original corpus and the state_size metadata — via my_text_model.chain.to_json().

Generating markovify.Text models from very large corpora

By default, the markovify.Text class loads, and retains, your textual corpus, so that it can compare generated sentences with the original (and only emit novel sentences). However, with very large corpora, loading the entire text at once (and retaining it) can be memory-intensive. To overcome this, you can (a) tell Markovify not to retain the original:

with open("path/to/my/huge/corpus.txt") as f:
    text_model = markovify.Text(f, retain_original=False)

print(text_model.make_sentence())

And (b) read in the corpus line-by-line or file-by-file and combine them into one model at each step:

import os

combined_model = None
for (dirpath, _, filenames) in os.walk("path/to/my/huge/corpus"):
    for filename in filenames:
        with open(os.path.join(dirpath, filename)) as f:
            model = markovify.Text(f, retain_original=False)
            if combined_model:
                combined_model = markovify.combine(models=[combined_model, model])
            else:
                combined_model = model

print(combined_model.make_sentence())

Markovify In The Wild

Have other examples? Pull requests welcome.

Thanks

Many thanks to the following GitHub users for contributing code and/or ideas:

Initially developed at BuzzFeed.

markovify's People

Contributors

ammgws, avinassh, ben-bay, brienna, bziarkowski, erichalldev, eumiro, ewgraf, fitnr, jsvine, kade-robertson, maryszmary, matthewscholefield, monosans, ntratcliff, nwithan8, orf, otakumegane, pmlandwehr, portasynthinca3, rokala, schollz, serhiistets, smalawi, stepjue, swartzcr, sylv-lej, terisikk, thallada, weiss-d


markovify's Issues

Make markovify.Text accept a file-like object or generator to reduce memory footprint when using large files?

I've written a script to recursively find all files with a given extension, generate a chain for each, and (once all files have an associated chain), combine them into one mega-chain and store it.
I'm running this on a very large directory (~1.4 GB), and while writing my script I was aware that holding all of that in RAM (as markovify.Text only accepts strings) would probably be an issue.
I was correct; not two seconds after running it, the process was killed.
Is there a way to modify .Text and .NewlineText so they can accept (and properly process, of course), a generator or file-like object to iterate over?
I have no problem implementing this myself and filing a pull request, I'm just unsure how to deal with sentence splitting along chunks.

make_sentence_with_start() returns error if only one word is used

It used to just use memory and do nothing, but now it gives:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Me\AppData\Local\Programs\Python\Python35-32\lib\site-packages\mark
ovify\text.py", line 140, in make_sentence_with_start
    return self.make_sentence(init_state, **kwargs)
  File "C:\Users\Me\AppData\Local\Programs\Python\Python35-32\lib\site-packages\mark
ovify\text.py", line 122, in make_sentence
    words = prefix + self.chain.walk(init_state)
  File "C:\Users\Me\AppData\Local\Programs\Python\Python35-32\lib\site-packages\mark
ovify\chain.py", line 114, in walk
    return list(self.gen(init_state))
  File "C:\Users\Me\AppData\Local\Programs\Python\Python35-32\lib\site-packages\mark
ovify\chain.py", line 103, in gen
    next_word = self.move(state)
  File "C:\Users\Me\AppData\Local\Programs\Python\Python35-32\lib\site-packages\mark
ovify\chain.py", line 89, in move
    choices, weights = zip(*self.model[state].items())
KeyError: ('I',)

This is in Python 3.5.2

Code:

with open("BlocSpeaks.txt", "r", encoding="utf-8") as txt:
    text = txt.read()

text_model = markovify.Text(text)

text_model.make_sentence_with_start("I")
#Returns error as seen above

text_model.make_sentence_with_start("I dont")

'''"I dont play BLOC tho fucking homofag I can not longer stand Commy's defiance he is unkillable He only was because she hadn\u2019t done anything."'''

text_model.make_sentence_with_start("I will")
'''I will ban you without a special kind of dull tbh back If you need to check bloc?'''

multiple sentences?

Perhaps I am missing something, but is there a way to create multiple sentences? I was able to use the example code to create individual sentences, but when I put two sentences next to each other they seem massively disjointed. Just wondering if there is a way to generate a paragraph.

Example of how to use .make_sentence(init_state = something)?

What exactly must the format of the something be? I tried putting in a string of a word, a tuple with the word, and a list with the word. Despite trying a lot of formats, I keep getting this error:

line 213, in make_sentence
    words = prefix + self.chain.walk(init_state)
  File "C:\Users\JGC\AppData\Local\Programs\Python\Python37\lib\site-packages\markovify\chain.py", line 137, in walk
    return list(self.gen(init_state))
  File "C:\Users\JGC\AppData\Local\Programs\Python\Python37\lib\site-packages\markovify\chain.py", line 126, in gen
    next_word = self.move(state)
  File "C:\Users\JGC\AppData\Local\Programs\Python\Python37\lib\site-packages\markovify\chain.py", line 112, in move
    choices, weights = zip(*self.model[state].items())
KeyError: 'was'

I'm using markovify 0.8.0 on a python 3.7 environment.

markovify.combine(...) always returns a markovify.Text regardless of input type

While all the inputs to markovify.combine must be the same class, and that class doesn't have to be a Text, the output is always a Text. This means that any methods I've overridden in my own custom class disappear when I combine two models. While not necessarily a bug, this should at least be noted or documented for combine().

Issue with make_sentence_with_start

  File "/mnt/c/Users/Asdf/Documents/1 - Git/AsdfTestBot/src/atbMarkov.py", line 38, in getMarkov
    ret = builtins.markov.make_sentence_with_start(lastWord, tries=30)
  File "/usr/local/lib/python3.4/dist-packages/markovify/text.py", line 156, in make_sentence_with_start
    return self.make_sentence(init_state, **kwargs)
  File "/usr/local/lib/python3.4/dist-packages/markovify/text.py", line 124, in make_sentence
    words = prefix + self.chain.walk(init_state)
  File "/usr/local/lib/python3.4/dist-packages/markovify/chain.py", line 114, in walk
    return list(self.gen(init_state))
  File "/usr/local/lib/python3.4/dist-packages/markovify/chain.py", line 103, in gen
    next_word = self.move(state)
  File "/usr/local/lib/python3.4/dist-packages/markovify/chain.py", line 89, in move
    choices, weights = zip(*self.model[state].items())
KeyError: ('___BEGIN__', 'men')

I am always sending a single word -- I used the normal method to get the sentence "Men, all this stuff you hear about America not wanting to fight, wanting to fight, wanting to fight, wanting to stay out of him all the time if he expects to keep on advancing." I then used make_sentence_with_start("men", tries=30), and that stack trace above is the result. This happens for any word.

What am I doing wrong?

AttributeError: 'Text' object has no attribute 'parsed_sentences'

Hi, I'm trying to combine some text models and this is the error that I got.

Traceback (most recent call last):
  File "generate_files.py", line 57, in <module>
    main()
  File "generate_files.py", line 49, in main
    combined_model = markovify.combine([last_day_model, last_week_model, last_month_model, last_six_month_model], [10, 5, 2, 1])
  File "/home/lorenzo/project/m5bot/lib/python3.4/site-packages/markovify/utils.py", line 50, in combine
    combined_sentences += m.parsed_sentences
AttributeError: 'Text' object has no attribute 'parsed_sentences'

Something similar to begins with

Hi, this may be the wrong place to open this issue as it is more of a question/feature request, but is there any way to make something similar to the make_sentence_with_start method that makes a sentence ending with a certain string? Someone told me I could just use the beginning-with sentence and reverse it, but the grammar would then be completely off, I am sure. Is there a way to make it so that it creates a sentence that ENDS with a certain string? Any help would be greatly appreciated. Thank you :)

make_sentence_that_contains

Is it possible to create a Markov chain that contains a key word?

I'm thinking of a chat bot situation. For example, the word "computer" is used in the utterance, then this command would be called: text_model.make_sentence_that_contains("computer")

If so, could you please either add this to markovify, or show me how I could write the code for my personal use.

Many thanks.

KeyError: ('___BEGIN__', '___BEGIN__')

I'm using a simple text_model.make_short_sentence(140). Nothing new here.

I can't seem to figure this out, it was working just fine. Here's the traceback:

Traceback (most recent call last):
  File "bot.py", line 51, in <module>
    main()
  File "bot.py", line 34, in main
    markov_text = marko.gen_markov(f=filepath)
  File "/home/icyphox/code/memebot/modules/marko.py", line 18, in gen_markov
    model = markovify.Text(text)
  File "/home/icyphox/code/memebot/.env/lib/python3.6/site-packages/markovify/text.py", line 37, in __init__
    self.chain = chain or Chain(self.parsed_sentences, state_size)
  File "/home/icyphox/code/memebot/.env/lib/python3.6/site-packages/markovify/chain.py", line 45, in __init__
    self.precompute_begin_state()
  File "/home/icyphox/code/memebot/.env/lib/python3.6/site-packages/markovify/chain.py", line 80, in precompute_begin_state
    choices, weights = zip(*self.model[begin_state].items())
KeyError: ('___BEGIN__', '___BEGIN__')

Merge models

Hi, if i have separate models built and saved as json, how can i merge them before generating a sentence?
Thanks

Creating chains from very large files.

I am working on a project for generating HTML pages from a Markov chain. My current process is to copy the text of all .html pages in a folder into a single .txt file, then create a Markov model from that file.

The problem with this is that markovify loads the entire file into memory to create the model, and I have ~50GB of files.

Is there any way to create the model line-by-line / in chunks / read from disk instead of the whole thing at once?

Shouldn't loading in a previous model take far less time?

As in, when I do this:

corpus = open('sherlock.txt').read()
corpus = re.sub(r'\s+', ' ', corpus).strip()
text_model = POSifiedText(corpus, state_size=4)
model1_json = text_model.chain.to_json()
with open('data.txt', 'w') as outfile:
    json.dump(model1_json, outfile)

with open('data.txt') as json_file:
    model2_json = json.load(json_file)

and....

NEW1_MODEL = POSifiedText.from_json(model2_json)
NEW2_MODEL = POSifiedText(corpus, state_size=4)

How come NEW1_MODEL takes the exact same amount of time to generate sentences as NEW2_MODEL? Shouldn't it take far less time, because it's using an already-generated model rather than a fresh corpus? Is there a huge error in my reasoning here?

Model portability

Hi, I was wondering how difficult it would be to extend the underlying classes, so that the model can be stored, say, redis.

NLP Example doesn't work

The NLP example provided in the README currently causes markovify to spit out a load of Nones instead of sentences.

Your definition of Markov states is problematic

I noticed that in chain.py --> line 66, you defined your "follow" as a single element in a "run".

However, the real state could be a bigram or n-gram, depending on the state size. So clearly there is a mismatch between the current state and the following state it is transitioning to.

Did I misunderstand you?

make_sentence_that_contains

I'm sorry to re-open this feature request (#52), but I've been trying the suggested approach of using a while loop, and with a corpus of 6 MB it takes almost 30 seconds (or more) to generate a sentence containing the specific word.
Would it be possible to implement this feature in the library so that creating such sentences is faster?

Thank you very much, and continue the good work!

markovify.Text(...) giving an error

The markovify.Text(...) method gives an error:
AttributeError: 'module' object has no attribute 'Text'

I'm using python 2.7...I just did pip install markovify and tried to run the sample code.

Problems with json exporting

  1. The README mentions markovify.Text.from_json(), which doesn't exist -- should be something like markovify.Text.from_chain(markovify.Chain.from_json()).

  2. If you use text_model = markovify.NewlineText(text, state_size=3), export it as json and then load it back, when you use text_model.make_short_sentence(140) you get verbatim sentences out of the corpus. My corpus is composed of lines with fewer than 140 characters, but in any case, if I don't export the text_model to json and then load it back, sentences are generated correctly.

Test fails straight out of the box

I've cloned the repository and tried running the unittest test.test_itertext. This test doesn't require setting up the sherlock model. It reads the text files that come with the package and builds the models inside the test, so I didn't have any input into it. The error I keep getting is this:

(base) C:\Users\JGC\Desktop\Trabalhos\Python\markovify>python -m unittest test.test_itertext
EE.E
======================================================================
ERROR: test_from_json_without_retaining (test.test_itertext.MarkovifyTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\test\test_itertext.py", line 25, in test_from_json_without_retaining 
    original_model = markovify.Text(f, retain_original=False)
  File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\markovify\text.py", line 53, in __init__
    parsed = parsed_sentences or self.generate_corpus(input_text)
  File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\markovify\text.py", line 152, in generate_corpus
    for line in text:
  File "C:\Users\JGC\anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3552: character maps to <undefined>

======================================================================
ERROR: test_from_mult_files_without_retaining (test.test_itertext.MarkovifyTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\test\test_itertext.py", line 37, in test_from_mult_files_without_retaining
    models.append(markovify.Text(f, retain_original=False))
  File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\markovify\text.py", line 53, in __init__
    parsed = parsed_sentences or self.generate_corpus(input_text)
  File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\markovify\text.py", line 152, in generate_corpus
    for line in text:
  File "C:\Users\JGC\anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3552: character maps to <undefined>

======================================================================
ERROR: test_without_retaining (test.test_itertext.MarkovifyTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\test\test_itertext.py", line 18, in test_without_retaining
    senate_model = markovify.Text(f, retain_original=False)
  File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\markovify\text.py", line 53, in __init__
    parsed = parsed_sentences or self.generate_corpus(input_text)
  File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\markovify\text.py", line 152, in generate_corpus
    for line in text:
  File "C:\Users\JGC\anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3552: character maps to <undefined>

----------------------------------------------------------------------
Ran 4 tests in 0.725s

FAILED (errors=3)

Running a conda 3.7.6 environment on Windows 10.

The method make_sentence_with_start() seems like it only finds sentences that start with the seed.

It seems like make_sentence_with_start() is only looking through the corpus for lines/sentences that begin with the string you pass it, as opposed to searching the entire sentence for that word. For instance, I can add "I like to program with markov chains", and then trying make_sentence_with_start("markov") will fail. However, if the log contains "markov chains are neat", then the chain will succeed, because the sentence begins with "markov". Is there any way to make it search the entire sentence for the word, instead of just the beginning? For reference, here's the method we're using to build a Markov chain:

def markovstring(prefix=None):
    with open ("log.txt", 'r', encoding='utf8') as file:
        text = file.read()
        text_model = markovify.NewlineText(text)
        if prefix is not None:
            try:
                message = text_model.make_sentence_with_start(prefix)
                for x in range(10):
                    message = text_model.make_sentence_with_start(prefix)
                if message is not None:
                    return message
                else:
                    return "Cannot create a chain with that string " + prefix
            except:
                return "Cannot create a chain with the string " + prefix
        else:
            message = text_model.make_sentence()
            while message is None:
                message = text_model.make_sentence()
            return message

NewlineText bug

AttributeError: 'module' object has no attribute 'NewlineText' happens on a text file with newlines

import markovify

with open("phrases.txt") as f:
    content = f.read()

comment_model = markovify.NewlineText(content)
print(comment_model.make_sentence())

aaaand it won't open up the text file even with Text. Returns None.

Cannot combine Text models if they contain ()"'[] in the string

Environment

  • markovify-0.6.4
  • Python 3.6.3
  • Arch Linux

Description

If you are creating a new Text or NewlineText model and the string contains any of ()"'[], you will get a KeyError when trying to combine it with another model.

Here's some example code that demonstrates this issue. (Note that you can get titles-33(1)(a).txt from https://github.com/wragge/sekritfiles/blob/master/data/titles-33(1)(a).txt.)

#!/usr/bin/env python3

import markovify

with open('titles-33(1)(a).txt') as f:
    text = f.read()

model = markovify.NewlineText(text)
print(model.make_sentence())

broken_strings = [
        'this string is not broken',
        'this string contains (',
        '(',
        ')',
        '[',
        ']',
        '"',
        "'",
        ]

for broken in broken_strings:
    try:
        new_model = markovify.NewlineText(broken)
        combined_model = markovify.combine([model, new_model])
    except KeyError:
        print('Broken string:', broken)

Stacktrace

Traceback (most recent call last):
  File "./broken.py", line 23, in <module>
    new_model = markovify.NewlineText(broken)
  File "<redacted>/env/lib/python3.6/site-packages/markovify/text.py", line 36, in __init__
    self.chain = chain or Chain(self.parsed_sentences, state_size)
  File "<redacted>/env/lib/python3.6/site-packages/markovify/chain.py", line 45, in __init__
    self.precompute_begin_state()
  File "<redacted>/env/lib/python3.6/site-packages/markovify/chain.py", line 80, in precompute_begin_state
    choices, weights = zip(*self.model[begin_state].items())
KeyError: ('___BEGIN__', '___BEGIN__')

proper way to use to_json?

I am trying to generate a corpus with markovify and then save it as JSON to a file. Is this possible? I've been trying to use generate_corpus() and to_json() but keep getting an error that corpus must be a list of lists.

Thanks

Question about storing models as JSON

The readme mentions :

Models can be stored as JSON, allowing you to cache your results and save them for later.

Just to clarify, is this only referring to markovify.Chain instances? Looking at the source I can only find methods for loading a Markov chain from a file/object, but not for markovify.Text models.

What is valid text input for markovify.Text?

I'm trying out this functionality in conjunction with gensim's doc2vec (https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb). I'm getting an error when I try to use the ACL IMDB prepared text from the gensim example. I've attached a sample document (sandbox.txt) that causes this error, as well as two samples that work (jayz.txt & sherlock.txt).

I'd like to identify what the difference is between these files that triggers the error in one but not the other two. Thanks!

jayz.txt
sandbox.txt
sherlock.txt


IndexError                                Traceback (most recent call last)
in ()
      4 text = f.read()
      5
----> 6 testModel = markovify.Text(text)
      7 tool = grammar_check.LanguageTool("en-US")
      8

C:\Anaconda\lib\site-packages\markovify\text.pyc in __init__(self, input_text, state_size, chain)
     22         self.rejoined_text = self.sentence_join(map(self.word_join, runs))
     23         state_size = state_size or 2
---> 24         self.chain = chain or Chain(runs, state_size)
     25
     26     def sentence_split(self, text):

C:\Anaconda\lib\site-packages\markovify\chain.pyc in __init__(self, corpus, state_size, model)
     36         """
     37         self.state_size = state_size
---> 38         self.model = model or self.build(corpus, self.state_size)
     39
     40     def build(self, corpus, state_size):

C:\Anaconda\lib\site-packages\markovify\chain.pyc in build(self, corpus, state_size)
     46         appears.
     47         """
---> 48         if (type(corpus) != list) or (type(corpus[0]) != list):
     49             raise Exception("corpus must be list of lists")
     50

IndexError: list index out of range

Doesn't work with http://www.anc.org/data/oanc/ngram/

Even with #93, it still returns None.

import string

import requests
from bs4 import BeautifulSoup

params = {
    'key': '* love *',
    'print_max': 100,
    'freq_threshold': 0,
    'output_style': 'sentence',
    'output_aux': 0,
    'print_format': 'text',
    'sort': None
}

r = requests.get('http://www.anc.org/cgi-bin/ngrams.cgi', params=params)
text = BeautifulSoup(r.text, 'html.parser').text
for punc in string.punctuation:
    text = text.replace(punc, '')

using make_sentence_with_start?

What is the proper way to call this function? When I try to pass a string of two words, I get an error:

Also, is there a way to seed it with only one word? Say I want the generated text to start with "You"; is there a way to do this?

Thank you

Traceback (most recent call last):
  File "reddit.py", line 6, in <module>
    print(text_model.make_sentence_with_start('you need'))
  File "C:\Python3\lib\site-packages\markovify\text.py", line 139, in make_sentence_with_start
    return self.make_sentence(init_state, **kwargs)
  File "C:\Python3\lib\site-packages\markovify\text.py", line 122, in make_sentence
    words = prefix + self.chain.walk(init_state)
  File "C:\Python3\lib\site-packages\markovify\chain.py", line 98, in walk
    return list(self.gen(init_state))
  File "C:\Python3\lib\site-packages\markovify\chain.py", line 87, in gen
    next_word = self.move(state)
  File "C:\Python3\lib\site-packages\markovify\chain.py", line 73, in move
    choices, weights = zip(*self.model[state].items())
KeyError: ('you', 'need')
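Reading the traceback, the KeyError means the two-word start passed to make_sentence_with_start never appears as a chain state in the model. A sketch of the membership check, with a plain dict standing in for text_model.chain.model (the structure here is hypothetical, inferred from the traceback):

```python
# Toy stand-in for a markovify chain model: state tuple -> follower counts.
model = {
    ("___BEGIN__", "___BEGIN__"): {"You": 3, "I": 5},
    ("You", "need"): {"sleep.": 1},
}

def can_start_with(model, *words):
    """Return True if the given words form a known chain state."""
    return tuple(words) in model
```

Note the check is case-sensitive, which is one common reason a seemingly present phrase raises KeyError.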

Feature request: generate sentence from given input word

Does the API provide a way to generate a sentence that contains the input word?

eg.
Input word: accessed
Output: He went to the bank because somebody had accessed his account.

If the lib doesn't currently support this but it can be achieved with a simple modification, feel free to point me in the right direction and I'll have a go.
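As far as I know this isn't built in, but a simple retry loop gets close. Here `make_sentence` is a stand-in for a model's method; with a real model you would pass `text_model.make_sentence`:

```python
# Hypothetical workaround: keep generating until a sentence contains the
# target word, giving up after a fixed number of tries.
def sentence_containing(make_sentence, word, tries=100):
    for _ in range(tries):
        sentence = make_sentence()
        if sentence is not None and word in sentence.split():
            return sentence
    return None
```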

make_short_sentence can get stuck in loop

make_short_sentence is not capped at a number of attempts the way make_sentence is, so it can get stuck in an infinite loop when make_sentence cannot produce a sentence short enough.
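A bounded variant, sketched: cap the attempts and return None instead of spinning forever. `make_sentence` below stands in for the model's method; this is an illustration of the fix, not the library's code:

```python
# Bounded replacement for an unbounded make_short_sentence loop.
def make_short_sentence_bounded(make_sentence, max_chars, tries=100):
    for _ in range(tries):
        sentence = make_sentence()
        if sentence is not None and len(sentence) <= max_chars:
            return sentence
    return None  # give up instead of looping forever
```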

make_sentence never succeeds with seemingly normal input

This is the corpus, which always fails: https://my.mixtape.moe/synvpe.txt
This is a slightly modified corpus, which always succeeds, but always gives the same output: https://my.mixtape.moe/yrlbvb.txt
This is the code I'm using to test the markov output:

#!/usr/bin/env python3
import markovify

with open("tootstr.txt") as f:
    toots_str = f.read()

model = markovify.NewlineText(toots_str)
print(model.make_sentence(tries=100000))

Is there anything I'm doing wrong here? I'm using this code to create a bot. I've created over a dozen other bots using the exact same source code, and this is the first to have this issue.

make_sentence_with_start() doesn't work

Is it really broken? I can use make_sentence(), but when I try make_sentence_with_start() I get errors:

Traceback (most recent call last):
  File "C:\test.py", line 12, in <module>
    string = text_model.make_sentence_with_start("Hello")
  File "c:\python\lib\site-packages\markovify\text.py", line 254, in make_sentence_with_start
    output = self.make_sentence(init_state, **kwargs)
  File "c:\python\lib\site-packages\markovify\text.py", line 197, in make_sentence
    words = prefix + self.chain.walk(init_state)
  File "c:\python\lib\site-packages\markovify\chain.py", line 118, in walk
    return list(self.gen(init_state))
  File "c:\python\lib\site-packages\markovify\chain.py", line 107, in gen
    next_word = self.move(state)
  File "c:\python\lib\site-packages\markovify\chain.py", line 93, in move
    choices, weights = zip(*self.model[state].items())
KeyError: ('___BEGIN__', 'Hello')

Code:

import markovify
import codecs

with codecs.open("file.txt", encoding='utf-8') as f:
    text = f.read()

# Build the model.
text_model = markovify.Text(text)

#string = text_model.make_sentence(tries=100)
string = text_model.make_sentence_with_start("Hello")

Markovify 0.7.2, Python 3.7.4

POSified text conflicting with json?

I'm working on having my text_model save to/load from json. If I use text_model = markovify.Text(foo) everything works OK, but if I use text_model = POSifiedText(foo), after reconstituting it back from json I get output like this:

I::PRP want::VBP to::TO share::NN it::PRP with::IN whomever::NN wants::VBZ it.::JJ
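One likely cause, worth checking: the model was reconstituted via the base markovify.Text rather than the POSifiedText subclass, so the subclass's word_join (which strips the tags) never runs. A toy version of that join, assuming the usual word::TAG scheme:

```python
# Toy sketch of a POSifiedText-style word_join that drops the "::TAG"
# suffixes. If the base class's default space-join runs instead, the tags
# leak into the output exactly as shown above.
def word_join(words):
    return " ".join(word.split("::")[0] for word in words)
```

Reloading with `POSifiedText.from_json(saved_json)` instead of `markovify.Text.from_json(saved_json)` should keep the subclass's split/join methods in play.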

"All `models` must be of the same type."

Hello,

There’s currently a check in utils.combine that raises an exception if all models are not of the same type. Given that get_model_dict is called on each model, why do we need to assert that all models are of the same type?

My problem is that I inherit classes from Text and type(my_model) doesn’t take that into account. See a minimal example below:

class MyText(markovify.Text):
    pass

class MyOtherText(MyText):
    pass

markovify.combine([MyText("foo"), MyOtherText("bar")])
# Raises: ValueError: All `models` must be of the same type.

Question about combining models

If I do markovify.combine() with no weighting, is it effectively the same as training one model on the texts of all the combined models? I ask because I'd like to train a model on a lot of text files, and it works out easier to create separate models and then combine them, as long as that works the way I'm expecting.
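In a count-based chain, combining with equal weights amounts to summing per-state transition counts, which is what training once on the concatenated corpora would produce. A toy illustration with plain dicts standing in for the internal models (not markovify's actual API):

```python
from collections import Counter

# Merge two toy chain models by summing each state's follower counts,
# mirroring what an equal-weight combine should do.
def combine_counts(model_a, model_b):
    combined = {}
    for state in set(model_a) | set(model_b):
        counts = Counter(model_a.get(state, {})) + Counter(model_b.get(state, {}))
        combined[state] = dict(counts)
    return combined
```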

small datasets

I've constructed a VERY small corpus of sentences.

I like you
you like pie

From these, I thought I would be able to generate

I like pie

but make_sentence(test_output=False) only generates the two sentences which are in the original corpus.
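A guess at the explanation: with the default state_size=2, the only continuation of the state ('I', 'like') seen in training is 'you', so "I like pie" is unreachable; with state_size=1 the single-word state 'like' can be followed by either 'you' or 'pie'. A toy walk over those one-word states (sentinel names borrowed from markovify, but the code is a stand-alone illustration, not the library):

```python
import random

# Transitions learned from "I like you" / "you like pie" with state size 1.
transitions = {
    "___BEGIN__": ["I", "you"],
    "I": ["like"],
    "you": ["like", "___END__"],
    "like": ["you", "pie"],
    "pie": ["___END__"],
}

def walk(rng):
    """Random walk from BEGIN to END, returning the visited words."""
    words, state = [], "___BEGIN__"
    while True:
        state = rng.choice(transitions[state])
        if state == "___END__":
            return " ".join(words)
        words.append(state)

rng = random.Random(0)
sentences = {walk(rng) for _ in range(200)}
```

With these one-word states, "I like pie" comes out readily alongside the two training sentences.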

"___BEGIN__" in output of make_sentence_with_start()

With Python 2.7.9 and markovify v0.7.0, when I use model.make_sentence_with_start() with strict=True, I am getting the literal string ___BEGIN__ as the first word of the sentence. Occasionally I even get this with strict=False, but much less often (my corpus text is ~6 MB and many sentences start with and contain the word I requested).

For example, it produced ___BEGIN__ Who is right? with this:

import markovify

text = open('corpus.txt').read()
model = markovify.Text(text, state_size=3)
print(model.make_sentence_with_start('Who'))
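Until the underlying bug is sorted out, a blunt client-side workaround is to strip the leaked token from the output. This is only a band-aid (strip_begin is a hypothetical helper, not a library fix):

```python
BEGIN = "___BEGIN__"

# Remove a leaked ___BEGIN__ token from the front of a generated sentence;
# pass None (a failed generation) straight through.
def strip_begin(sentence):
    if sentence and sentence.startswith(BEGIN):
        return sentence[len(BEGIN):].lstrip()
    return sentence
```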

Scoring generated text

Perhaps I'm missing something, but is there a function here to score the generated text? I'm planning to use Markovify to predict whether a given text is anomalous relative to the trained model.

State Size Limit

Is there a state size limit?

A state size > 5 just takes forever to run.
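There is no hard limit that I know of, but for a fixed corpus the number of distinct states grows with state size until almost every state is unique, so the model gets large and slow to build, and generation mostly replays the training text. A toy counter illustrating the growth (count_states is a hypothetical helper, not part of markovify):

```python
# Count distinct n-gram states in a word sequence; as state_size rises,
# nearly every state becomes unique.
def count_states(words, state_size):
    return len({
        tuple(words[i:i + state_size])
        for i in range(len(words) - state_size + 1)
    })
```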

Issue with saving model to json

I'm using Python 2.7, and when I try to save the model to JSON I get the following error:

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 40-42: invalid continuation byte

Full traceback:

Traceback (most recent call last):
  File "python/markov.py", line 23, in <module>
    model_json = text_model.to_json()
  File "/usr/local/lib/python2.7/dist-packages/markovify/text.py", line 57, in to_json
    return json.dumps(self.to_dict())
  File "/usr/local/lib/python2.7/dist-packages/markovify/text.py", line 49, in to_dict
    "chain": self.chain.to_json(),
  File "/usr/local/lib/python2.7/dist-packages/markovify/chain.py", line 124, in to_json
    return json.dumps(list(self.model.items()))
  File "/usr/lib/python2.7/json/__init__.py", line 244, in dumps
    return _default_encoder.encode(obj)
  File "/usr/lib/python2.7/json/encoder.py", line 207, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python2.7/json/encoder.py", line 270, in iterencode
    return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 40-42: invalid continuation byte
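This looks like the classic Python 2 bytes-versus-unicode trap: the corpus was read as raw bytes that are not valid UTF-8, and json.dumps trips over them at serialization time. Reading the file as unicode up front, with an explicit encoding and an error policy for stray bytes, sidesteps it; a sketch using io.open, which behaves the same on Python 2 and 3:

```python
import io

# Read a corpus as unicode so json serialization never has to guess;
# errors="replace" substitutes U+FFFD for undecodable bytes.
def read_text(path, encoding="utf-8"):
    with io.open(path, encoding=encoding, errors="replace") as f:
        return f.read()
```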

Seed first word of sentence

Hello,

I was just wondering if it is possible to seed the first word of the sentence you want to create?

So I can have my sentence always start with the word I want?

Many thanks!
