
doc2vec's Issues

Example plot

Hi.

I did not manage to reproduce the example graph in the README:

https://github.com/bnosac/doc2vec/blob/master/tools/example-viz.png

the one that precedes this code:

library(doc2vec)
library(word2vec)
library(uwot)
library(dbscan)
data(be_parliament_2020, package = "doc2vec")
x      <- data.frame(doc_id = be_parliament_2020$doc_id,
                     text   = be_parliament_2020$text_nl,
                     stringsAsFactors = FALSE)
x$text <- txt_clean_word2vec(x$text)
x      <- subset(x, txt_count_words(text) < 1000)

d2v    <- paragraph2vec(x, type = "PV-DBOW", dim = 50, 
                        lr = 0.05, iter = 10,
                        window = 15, hs = TRUE, negative = 0,
                        sample = 0.00001, min_count = 5, 
                        threads = 1)
model  <- top2vec(d2v, 
                  control.dbscan = list(minPts = 50), 
                  control.umap = list(n_neighbors = 15L, n_components = 3), umap = tumap, 
                  trace = TRUE)
info   <- summary(model, top_n = 7)
info$topwords

I tried several functions from the textplot package, without success.

Thanks for any hints.
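For what it's worth, a minimal sketch of one way to approximate that figure (an assumption about how such a plot could be made, not the actual code behind example-viz.png): project the document vectors to 2D with uwot and colour them by HDBSCAN cluster.

library(ggplot2)
emb   <- as.matrix(d2v, which = "docs")
emb2d <- uwot::tumap(emb, n_neighbors = 15, n_components = 2)
cl    <- dbscan::hdbscan(emb2d, minPts = 50)$cluster
plotdata <- data.frame(x = emb2d[, 1], y = emb2d[, 2], cluster = factor(cl))
ggplot(plotdata, aes(x = x, y = y, colour = cluster)) +
  geom_point(size = 0.5) +
  theme_minimal()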

predict.paragraph2vec crashes with words greater than 103 chars long

It took me a little while to hunt down the cause of this crash...

It does this on my machine at the very least. This is on R 3.6.3.


library(doc2vec)

corpus <- data.frame(doc_id = 1, text = "here are some words for training the model")
model  <- paragraph2vec(x = corpus, type = "PV-DM", dim = 10, iter = 20, min_count = 1)

# this text will successfully run
successtext <- "GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG"
nchar(successtext)
predict(model, newdata = list(a=successtext), type = "embedding", which = "docs")

# this text will cause a crash
failtext <- "GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG"
nchar(failtext)
predict(model, newdata = list(a=failtext), type = "embedding", which = "docs")
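A hedged workaround sketch until the root cause is fixed, assuming the crash comes from a fixed-size word buffer inside the C++ code (the safe length limit is an assumption): truncate tokens before calling predict().

# hypothetical helper, not part of the package; 100 characters is an assumed safe cap
truncate_tokens <- function(tokens, max_chars = 100) substr(tokens, 1L, max_chars)
predict(model, newdata = list(a = truncate_tokens(failtext)),
        type = "embedding", which = "docs")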

Error: Too many files are open

library(doc2vec)
library(stringr)

text <- c("We gather tonight knowing that this generation of heroes has made the United States safer and more respected around the world. For the first time in nine years, there are no Americans fighting in Iraq. (Applause.) For the first time in two decades, Osama bin Laden is not a threat to this country. (Applause.) Most of al Qaeda's top lieutenants have been defeated. The Taliban's momentum has been broken, and some troops in Afghanistan have begun to come home. These achievements are a testament to the courage, selflessness and teamwork of America's Armed Forces. At a time when too many of our institutions have let us down, they exceed all expectations. They're not consumed with personal ambition. They don't obsess over their differences. They focus on the mission at hand. They work together. Imagine what we could accomplish if we followed their example. (Applause.) Think about the America within our reach A country that leads the world in educating its people. An America that attracts a new generation of high-tech manufacturing and high-paying jobs. A future where we're in control of our own energy, and our security and prosperity aren't so tied to unstable parts of the world. An economy built to last, where hard work pays off, and responsibility is rewarded. We can do this. I know we can, because we've done it before. At the end of World War II, when another generation of heroes returned home from combat, they built the strongest economy and middle class the world has ever known. My grandfather, a veteran of Patton's Army, got the chance to go to college on the GI Bill. My grandmother, who worked on a bomber assembly line, was part of a workforce that turned out the best products on Earth. The two of them shared the optimism of a nation that had triumphed over a depression and fascism. They understood they were part of something larger; that they were contributing to a story of success that every American had a chance to share -- the basic American promise that if you worked hard, you could do well enough to raise a family, own a home, send your kids to college, and put a little away for retirement. The defining issue of our time is how to keep that promise alive. No challenge is more urgent. No debate is more important. We can either settle for a country where a shrinking number of people do really well while a growing number of Americans barely get by, or we can restore an economy where everyone gets a fair shot, and everyone does their fair share, and everyone plays by the same set of rules. What's at stake aren't Democratic values or Republican values, but American values. And we have to reclaim them. Let's remember how we got here. Long before the recession, jobs and manufacturing began leaving our shores. Technology made businesses more efficient, but also made some jobs obsolete. Folks at the top saw their incomes rise like never before, but most hardworking Americans struggled with costs that were growing, paychecks that weren't, and personal debt that kept piling up. In 2008, the house of cards collapsed. We learned that mortgages had been sold to people who couldn't afford or understand them. Banks had made huge bets and bonuses with other people's money. Regulators had looked the other way, or didn't have the authority to stop the bad behavior. It was wrong. It was irresponsible. And it plunged our economy into a crisis that put millions out of work, saddled us with more debt, and left innocent, hardworking Americans holding the bag. 
In the six months before I took office, we lost nearly 4 million jobs. And we lost another 4 million before our policies were in full effect. Those are the facts. But so are these: In the last 22 months, businesses have created more than 3 million jobs. Last year, they created the most jobs since 2005. American manufacturers are hiring again, creating jobs for the first time since the late 1990s. Together, we've agreed to cut the deficit by more than $2 trillion. And we've put in place new rules to hold Wall Street accountable, so a crisis like this never happens again. (Applause.) The state of our Union is getting stronger. And we've come too far to turn back now. As long as I'm President, I will work with anyone in this chamber to build on this momentum. But I intend to fight obstruction with action, and I will oppose any effort to return to the very same policies that brought on this economic crisis in the first place. (Applause.) No, we will not go back to an economy weakened by outsourcing, bad debt, and phony financial profits. Tonight, I want to speak about how we move forward, and lay out a blueprint for an economy that's built to last -- an economy built on American manufacturing, American energy, skills for American workers, and a renewal of American values. Now, this blueprint begins with American manufacturing. On the day I took office, our auto industry was on the verge of collapse. Some even said we should let it die. With a million jobs at stake, I refused to let that happen. In exchange for help, we demanded responsibility. We got workers and automakers to settle their differences. We got the industry to retool and restructure. Today, General Motors is back on top as the world's number-one automaker. Chrysler has grown faster in the U.S. than any major car company. Ford is investing billions in U.S. plants and factories. And together, the entire industry added nearly 160,000 jobs. Tonight marks the eighth year that I've come here to report on the State of the Union. And for this final one, I'm going to try to make it a little shorter. I know some of you are antsy to get back to Iowa. I've been there. I'll be shaking hands afterwards if you want some tips. (Laughter.) And I understand that because it's an election season, expectations for what we will achieve this year are low. But, Mr. Speaker, I appreciate the constructive approach that you and the other leaders took at the end of last year to pass a budget and make tax cuts permanent for working families. So I hope we can work together this year on some bipartisan priorities like criminal justice reform -- (applause) -- and helping people who are battling prescription drug abuse and heroin abuse. (Applause.) So, who knows, we might surprise the cynics again. But tonight, I want to go easy on the traditional list of proposals for the year ahead. Don't worry, I've got plenty, from helping students learn to write computer code to personalizing medical treatments for patients. And I will keep pushing for progress on the work that I believe still needs to be done. Fixing a broken immigration system. (Applause.) Protecting our kids from gun violence. (Applause.) Equal pay for equal work. (Applause.) Paid leave. (Applause.) Raising the minimum wage. (Applause.) All these things still matter to hardworking families. They're still the right thing to do. And I won't let up until they get done. But for my final address to this chamber, I don't want to just talk about next year. 
I want to focus on the next five years, the next 10 years, and beyond. I want to focus on our future. We live in a time of extraordinary change -- change that's reshaping the way we live, the way we work, our planet, our place in the world. It's change that promises amazing medical breakthroughs, but also economic disruptions that strain working families. It promises education for girls in the most remote villages, but also connects terrorists plotting an ocean away. It's change that can broaden opportunity, or widen inequality. And whether we like it or not, the pace of this change will only accelerate. America has been through big changes before -- wars and depression, the influx of new immigrants, workers fighting for a fair deal, movements to expand civil rights. Each time, there have been those who told us to fear the future who claimed we could slam the brakes on change; who promised to restore past glory if we just got some group or idea that was threatening America under control. And each time, we overcame those fears. We did not, in the words of Lincoln, adhere to the dogmas of the quiet past. Instead we thought anew, and acted anew. We made change work for us, always extending America's promise outward, to the next frontier, to more people. And because we did -- because we saw opportunity where others saw only peril -- we emerged stronger and better than before. What was true then can be true now. Our unique strengths as a nation -- our optimism and work ethic, our spirit of discovery, our diversity, our commitment to rule of law -- these things give us everything we need to ensure prosperity and security for generations to come. In fact, it's that spirit that made the progress of these past seven years possible.
It's how we recovered from the worst economic crisis in generations. It's how we reformed our health care system, and reinvented our energy sector; how we delivered more care and benefits to our troops and veterans, and how we secured the freedom in every state to marry the person we love. But such progress is not inevitable. It's the result of choices we make together. And we face such choices right now. Will we respond to the changes of our time with fear, turning inward as a nation, turning against each other as a people? Or will we face the future with confidence in who we are, in what we stand for, in the incredible things that we can do together? So let's talk about the future, and four big questions that I believe we as a country have to answer -- regardless of who the next President is, or who controls the next Congress. First, how do we give everyone a fair shot at opportunity and security in this new economy? (Applause.) Second, how do we make technology work for us, and not against us -- especially when it comes to solving urgent challenges like climate change? (Applause.) Third, how do we keep America safe and lead the world without becoming its policeman? (Applause.) And finally, how can we make our politics reflect what's best in us, and not what's worst? Let me start with the economy, and a basic fact: The United States of America, right now, has the strongest, most durable economy in the world. (Applause.) We're in the middle of the longest streak of private sector job creation in history. (Applause.) More than 14 million new jobs, the strongest two years of job growth since the 90s, an unemployment rate cut in half. Our auto industry just had its best year ever. (Applause.) That's just part of a manufacturing surge that's created nearly 900,000 new jobs in the past six years. And we've done all this while cutting our deficits by almost three-quarters. (Applause.) Anyone claiming that America's economy is in decline is peddling fiction. (Applause.) Now, what is true -- and the reason that a lot of Americans feel anxious -- is that the economy has been changing in profound ways, changes that started long before the Great Recession hit; changes that have not let up. Today, technology doesn't just replace jobs on the assembly line, but any job where work can be automated. Companies in a global economy can locate anywhere, and they face tougher competition. As a result, workers have less leverage for a raise. Companies have less loyalty to their communities. And more and more wealth and income is concentrated at the very top. All these trends have squeezed workers, even when they have jobs; even when the economy is growing. It's made it harder for a hardworking family to pull itself out of poverty, harder for young people to start their careers, tougher for workers to retire when they want to. And although none of these trends are unique to America, they do offend our uniquely American belief that everybody who works hard should get a fair shot. For the past seven years, our goal has been a growing economy that works also better for everybody. We've made progress. But we need to make more. And despite all the political arguments that we've had these past few years, there are actually some areas where Americans broadly agree. We gather tonight knowing that this generation of heroes has made the United States safer and more respected around the world. For the first time in nine years, there are no Americans fighting in Iraq. (Applause.) 
For the first time in two decades, Osama bin Laden is not a threat to this country. (Applause.) Most of al Qaeda's top lieutenants have been defeated. The Taliban's momentum has been broken, and some troops in Afghanistan have begun to come home. These achievements are a testament to the courage, selflessness and teamwork of America's Armed Forces. At a time when too many of our institutions have let us down, they exceed all expectations. They're not consumed with personal ambition. They don't obsess over their differences. They focus on the mission at hand. They work together. Imagine what we could accomplish if we followed their example. (Applause.) Think about the America within our reach: A country that leads the world in educating its people. An America that attracts a new generation of high-tech manufacturing and high-paying jobs. A future where we're in control of our own energy, and our security and prosperity aren't so tied to unstable parts of the world. An economy built to last, where hard work pays off, and responsibility is rewarded. We can do this. I know we can, because we've done it before. At the end of World War II, when another generation of heroes returned home from combat, they built the strongest economy and middle class the world has ever known. (Applause.) My grandfather, a veteran of Patton's Army, got the chance to go to college on the GI Bill. My grandmother, who worked on a bomber assembly line, was part of a workforce that turned out the best products on Earth. The two of them shared the optimism of a nation that had triumphed over a depression and fascism. They understood they were part of something larger; that they were contributing to a story of success that every American had a chance to share -- the basic American promise that if you worked hard, you could do well enough to raise a family, own a home, send your kids to college, and put a little away for retirement. The defining issue of our time is how to keep that promise alive. No challenge is more urgent. No debate is more important. We can either settle for a country where a shrinking number of people do really well while a growing number of Americans barely get by, or we can restore an economy where everyone gets a fair shot, and everyone does their fair share, and everyone plays by the same set of rules. (Applause.) What's at stake aren't Democratic values or Republican values, but American values. And we have to reclaim them. Let's remember how we got here. Long before the recession, jobs and manufacturing began leaving our shores. Technology made businesses more efficient, but also made some jobs obsolete. Folks at the top saw their incomes rise like never before, but most hardworking Americans struggled with costs that were growing, paychecks that weren't, and personal debt that kept piling up. In 2008, the house of cards collapsed. We learned that mortgages had been sold to people who couldn't afford or understand them. Banks had made huge bets and bonuses with other people's money. Regulators had looked the other way, or didn't have the authority to stop the bad behavior. It was wrong. It was irresponsible. And it plunged our economy into a crisis that put millions out of work, saddled us with more debt, and left innocent, hardworking Americans holding the bag. In the six months before I took office, we lost nearly 4 million jobs. And we lost another 4 million before our policies were in full effect.
")

text_list_full <- list()
text_list <- list()

#==== Error 1 ====
generate_list_fun <- function(i) {
  for (x in 1:10) {
    t_1 <- word(text, x + i + 10, x + i + 30)
    t_2 <- word(text, x + i + 4,  x + i + 28)
    t_3 <- word(text, x + i + 5,  x + i + 30)
    t_4 <- word(text, x + i + 20, x + i + 50)
    t_5 <- word(text, x + i + 2,  x + i + 50)

    text_list[[x]] <- data.frame(doc_id = paste0("doc_", x + 4 * x + (1:5)),
                                 text   = c(t_1, t_2, t_3, t_4, t_5),
                                 year   = 2000 + i)
  }
  text_list_full[[i]] <- text_list
}

text_list_full <- lapply(1:20, generate_list_fun)

Create the paragraph2vec models:

model_p2v_list <- list()
temp_list <- list()

paragraph2vec_list_fun <- function(i) {
  for (x in 1:length(text_list_full[[i]])) {
    print(showConnections(all = FALSE))
    print(paste(i, "_", x))

    model <- paragraph2vec(text_list_full[[i]][[x]], type = "PV-DBOW",
                           dim = 200, iter = 20,
                           min_count = 3, lr = 0.05, threads = 4)
    temp_list[[x]] <- model
  }
  model_p2v_list[[i]] <- temp_list
}

model_p2v_list <- lapply(1:length(text_list_full), paragraph2vec_list_fun) # this gives the error "training data file not found"

model_p2v_list <- lapply(1:7, paragraph2vec_list_fun) # this works, but afterwards it starts giving the error again

model_p2v_list_2 <- lapply(5:7, paragraph2vec_list_fun)
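A hedged mitigation sketch, assuming the error comes from per-model temporary training files or from finalizers that have not yet run (the $data$file element is visible in the paragraph2vec_trained structure shown further down this page):

fit_one <- function(dat) {
  model <- paragraph2vec(dat, type = "PV-DBOW", dim = 200, iter = 20,
                         min_count = 3, lr = 0.05, threads = 4)
  unlink(model$data$file)  # drop the temporary training text file (assumption: no longer needed after training)
  gc()                     # run finalizers so lingering file handles are released
  model
}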

Possible Memory Leak in top2vec

I have been running doc2vec and top2vec on a Unix server. However, as I increase the data size, I run into the following error:

*** caught segfault ***
address 0x7ef3c55842c8, cause 'memory not mapped'

This is happening when calling the following top2vec code:

t2v <- top2vec(d2v,
               control.dbscan = list(minPts = 25),
               control.umap = list(n_neighbors = 100L, n_components = 2,
                                   metric = "cosine"),
               umap = tumap,
               trace = FALSE)

This uses the following doc2vec model (note that I have also tried models with fewer dimensions (50) and iterations (25)):

d2v <- paragraph2vec(x = sample_text, type = "PV-DBOW", dim = 100, iter = 50,
                     min_count = 10, lr = 0.05, threads = 6)

This only happens when running t2v on my "full" data, which has 137,649 rows and 4 columns (doc_id, date, origin, text) and takes up around 200 MB. When running on a subset of the data (a 20% sample), I do not run into this error. With the full data, doc2vec runs correctly; the issue is only with top2vec.

This seems to happen regardless of the options I specify for t2v (I've tried different combinations of minPts, n_neighbors, and n_components). I've also tried increasing the amount of RAM: with this same dataset, I've tried using as much as 600 GB at a time, with the same error.
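A hedged back-of-the-envelope check (an assumed cause, not a confirmed diagnosis): dbscan::hdbscan() builds a full pairwise distance matrix, and at this document count the number of entries exceeds the 32-bit integer range, so an index overflow inside the clustering code could explain a 'memory not mapped' segfault regardless of available RAM.

n       <- 137649                  # number of documents in the full data
n_pairs <- n * (n - 1) / 2         # entries in the lower-triangular dist object
n_pairs                            # ~9.47e9 entries
n_pairs > .Machine$integer.max     # TRUE: beyond 32-bit indexing
n_pairs * 8 / 1024^3               # ~70.6 GiB just to store the distances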

I am happy to provide any other information that may be useful, and can email the data itself if that would be helpful.

Here is the traceback:

Traceback:
1: mrd(xdist, core_dist)
2: (function (x, minPts, gen_hdbscan_tree = FALSE, gen_simplified_tree = FALSE) { ... body of dbscan::hdbscan elided ... })(minPts = 25, x = c(-0.0329070008023855, -0.0510561382993338, 0.31927777168907, ... [long numeric vector truncated] ...))
3: do.call(dbscan::hdbscan, control.dbscan)
4: top2vec(d2v, control.dbscan = list(minPts = 25), control.umap = list(n_neighbors = 100L, n_components = 2, metric = "cosine"), umap = tumap, trace = FALSE)
An irrecoverable exception occurred. R is aborting now ...

Here is the output of sessionInfo():

Matrix products: default
BLAS: /software/free/R/R-4.0.0/lib/R/lib/libRblas.so
LAPACK: /software/free/R/R-4.0.0/lib/R/lib/libRlapack.so

Random number generation:
RNG: L'Ecuyer-CMRG
Normal: Inversion
Sample: Rejection

locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] ranger_0.12.1 vctrs_0.3.7 rlang_0.4.10
[4] mosaicCore_0.9.0 yardstick_0.0.8 workflowsets_0.0.2
[7] workflows_0.2.2 tune_0.1.5 tidyr_1.1.3
[10] tibble_3.1.1 rsample_0.0.9 recipes_0.1.16
[13] purrr_0.3.4 parsnip_0.1.5 modeldata_0.1.0
[16] infer_0.5.4 ggplot2_3.3.3 dplyr_1.0.5
[19] dials_0.0.9 scales_1.1.1 broom_0.7.6
[22] tidymodels_0.1.3 lubridate_1.7.10 gsubfn_0.7
[25] proto_1.0.0 data.table_1.13.6 dbscan_1.1-8
[28] uwot_0.1.10 Matrix_1.3-2 stringr_1.4.0
[31] doc2vec_0.2.0 futile.logger_1.4.3

loaded via a namespace (and not attached):
[1] splines_4.0.0 foreach_1.5.1 here_0.1
[4] prodlim_2019.11.13 assertthat_0.2.1 conflicted_1.0.4
[7] GPfit_1.0-8 globals_0.14.0 ipred_0.9-11
[10] pillar_1.6.0 backports_1.2.0 lattice_0.20-41
[13] glue_1.4.2 pROC_1.17.0.1 digest_0.6.27
[16] pryr_0.1.4 hardhat_0.1.5 colorspace_2.0-0
[19] plyr_1.8.6 timeDate_3043.102 pkgconfig_2.0.3
[22] lhs_1.1.1 DiceDesign_1.9 listenv_0.8.0
[25] RSpectra_0.16-0 gower_0.2.2 lava_1.6.9
[28] generics_0.1.0 ellipsis_0.3.1 withr_2.3.0
[31] furrr_0.2.2 nnet_7.3-14 cli_2.4.0
[34] survival_3.2-7 magrittr_1.5 crayon_1.3.4
[37] memoise_1.1.0 ps_1.4.0 fansi_0.4.1
[40] future_1.21.0 parallelly_1.24.0 MASS_7.3-53
[43] class_7.3-17 tools_4.0.0 formatR_1.7
[46] lifecycle_1.0.0 munsell_0.5.0 lambda.r_1.2.4
[49] compiler_4.0.0 grid_4.0.0 rstudioapi_0.13
[52] iterators_1.0.13 RcppAnnoy_0.0.18 gtable_0.3.0
[55] codetools_0.2-18 DBI_1.1.0 R6_2.5.0
[58] utf8_1.1.4 rprojroot_1.3-2 futile.options_1.0.1
[61] stringi_1.5.3 parallel_4.0.0 Rcpp_1.0.6
[64] rpart_4.1-15 tidyselect_1.1.0

space part of dictionary

Find out why "" is part of the word dictionary, and why, if nothing is provided in a tokenised list, we still get the "" embedding.
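A minimal reproduction sketch of the observation (the training data is an illustrative assumption; the predict() usages mirror the examples further down this page):

library(doc2vec)
corpus <- data.frame(doc_id = 1, text = "a few words to train on")
model  <- paragraph2vec(corpus, type = "PV-DBOW", dim = 5, iter = 1, min_count = 0)
# "" shows up in the word embedding lookup
predict(model, newdata = c("", "words"), type = "embedding", which = "words")
# an empty tokenised document still yields an embedding rather than NA
predict(model, newdata = list(empty = character()), type = "embedding")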

fix variable-length arrays

rcpp_doc2vec.cpp: In function ‘Rcpp::DataFrame paragraph2vec_nearest(SEXP, std::string, std::size_t, std::string)’:
rcpp_doc2vec.cpp:96:14: warning: ISO C++ forbids variable length array ‘knn_items’ [-Wvla]
   96 |   knn_item_t knn_items[top_n];
      |              ^~~~~~~~~
rcpp_doc2vec.cpp: In function ‘Rcpp::List paragraph2vec_nearest_sentence(SEXP, Rcpp::List, std::size_t)’:
rcpp_doc2vec.cpp:148:16: warning: ISO C++ forbids variable length array ‘knn_items’ [-Wvla]
  148 |     knn_item_t knn_items[top_n];
      |                ^~~~~~~~~

default parameter values

Finally, a solid doc2vec implementation in R. Many thanks! I have a relatively minor suggestion: I feel that the default parameter values might be underselling the power of this method. I know everyone can change the default settings, but in reality most users just want to "press play". When I look at most doc2vec applications in Python, the go-to text analysis language for most people, they go for more demanding settings. For example, the top2vec module uses roughly the following default parameter values (from https://github.com/ddangelov/Top2Vec/blob/master/top2vec/Top2Vec.py):

model <- paragraph2vec(x = x, type = "PV-DBOW", dim = 300, iter = 40, hs = TRUE, window = 15, negative = 0, sample = 0.00001)

These values are surely not 100% scientific, but I think the authors experimented quite a bit before arriving at them. I think they are a useful starting point.

The default values as you have them now make the process very fast, but the resulting embeddings may often be quite poor. Negative sampling, in particular, has in some contexts been associated with hurting the quality of the semantic space. I can also say that in my use case the default settings are not ideal, while the ones above yield pretty solid results within a reasonable time. Just a suggestion.

Feature request : possibility to use a pretrained word vector as starting point for doc2vec

Hi,

And thank you for bringing doc2vec to R!

This issue is a feature request: do you think it would be possible to allow the doc2vec algorithm to use pretrained word vectors?

It would be interesting, for instance, if you have learned word vectors on a large corpus and then want to use them as a starting point for the doc2vec algorithm on a smaller corpus.
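A hedged sketch of what the requested workflow could look like, using a hypothetical embeddings argument (the argument name is an assumption, not the package's actual API; large_corpus and small_corpus are placeholders):

library(word2vec)
library(doc2vec)
# learn word vectors on the large corpus
w2v <- word2vec(x = large_corpus$text, type = "skip-gram", dim = 50, iter = 20)
wv  <- as.matrix(w2v)
# hypothetical: warm-start paragraph2vec on the smaller corpus
d2v <- paragraph2vec(x = small_corpus, type = "PV-DBOW", dim = 50,
                     embeddings = wv)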

I hope this is clear and appropriate.

Kind regards,
Dominique

boom

(Same reproduction as in "predict.paragraph2vec crashes with words greater than 103 chars long" above.)

posix_memalign

Fix posix_memalign for Windows and Solaris (look at the similar setup in crfsuite), and make sure the memory is released.

test

@pprablanc, would you be interested in testing out this package, which provides document vectors?

as.matrix(model, which = "docs") also returns (some) words

While working with the package I encountered what I assume is an error. When extracting document embeddings using as.matrix(model, which = "docs"), the resulting matrix contains both documents and words (not all words). I assume it should only contain document vectors. You mention in the README that documents should not be longer than 1000 words; mine are longer, however. Could this cause the problem? The model seems to work fine otherwise.
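A hedged sketch for checking which rows of the "docs" matrix are not genuine documents, using only functions shown elsewhere on this page (model is the trained model, x the training data frame):

emb <- as.matrix(model, which = "docs")
# rows of the "docs" matrix that are not document identifiers
setdiff(rownames(emb), x$doc_id)
# rows that also appear in the word vocabulary
intersect(rownames(emb), summary(model, type = "vocabulary", which = "words"))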

valgrind memory / Address Sanitizer checks

Valgrind

==1826472== Memcheck, a memory error detector
==1826472== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==1826472== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
==1826472== Command: /data/blackswan/ripley/R/R-devel-vg/bin/exec/R --vanilla
==1826472== 

R Under development (unstable) (2020-12-09 r79601) -- "Unsuffered Consequences"
Copyright (C) 2020 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> pkgname <- "doc2vec"
> source(file.path(R.home("share"), "R", "examples-header.R"))
> options(warn = 1)
> library('doc2vec')
> 
> base::assign(".oldSearch", base::search(), pos = 'CheckExEnv')
> base::assign(".old_wd", base::getwd(), pos = 'CheckExEnv')
> cleanEx()
> nameEx("as.matrix.paragraph2vec")
> ### * as.matrix.paragraph2vec
> 
> flush(stderr()); flush(stdout())
> 
> ### Name: as.matrix.paragraph2vec
> ### Title: Get the document or word vectors of a paragraph2vec model
> ### Aliases: as.matrix.paragraph2vec
> 
> ### ** Examples
> 
> ## Don't show: 
> if(require(tokenizers.bpe) & require(udpipe)){
+ ## End(Don't show)
+ library(tokenizers.bpe)
+ library(udpipe)
+ data(belgium_parliament, package = "tokenizers.bpe")
+ x <- subset(belgium_parliament, language %in% "french")
+ x <- subset(x, nchar(text) > 0 & txt_count(text, pattern = " ") < 1000)
+ 
+ model <- paragraph2vec(x = x, type = "PV-DM",   dim = 15,  iter = 5)
+ 
+ embedding <- as.matrix(model, which = "docs")
+ embedding <- as.matrix(model, which = "words")
+ embedding <- as.matrix(model, which = "docs", normalize = FALSE)
+ embedding <- as.matrix(model, which = "words", normalize = FALSE)
+ ## Don't show: 
+ } # End of main if statement running only if the required packages are installed
Loading required package: tokenizers.bpe
Loading required package: udpipe
==1826472== Warning: set address range perms: large range [0x2ffcf040, 0x47d47440) (undefined)
> ## End(Don't show)
> 
> 
> 
> cleanEx()

detaching ‘package:udpipe’, ‘package:tokenizers.bpe’

> nameEx("paragraph2vec")
> ### * paragraph2vec
> 
> flush(stderr()); flush(stdout())
> 
> ### Name: paragraph2vec
> ### Title: Train a paragraph2vec also known as doc2vec model on text
> ### Aliases: paragraph2vec
> 
> ### ** Examples
> 
> ## Don't show: 
> if(require(tokenizers.bpe) & require(udpipe)){
+ ## End(Don't show)
+ library(tokenizers.bpe)
+ library(udpipe)
+ ## Take data and standardise it a bit
+ data(belgium_parliament, package = "tokenizers.bpe")
+ str(belgium_parliament)
+ x <- subset(belgium_parliament, language %in% "french")
+ x$text   <- tolower(x$text)
+ x$text   <- gsub("[^[:alpha:]]", " ", x$text)
+ x$text   <- gsub("[[:space:]]+", " ", x$text)
+ x$text   <- trimws(x$text)
+ x$nwords <- txt_count(x$text, pattern = " ")
+ x <- subset(x, nwords < 1000 & nchar(text) > 0)
+ 
+ ## Build the model
+ model <- paragraph2vec(x = x, type = "PV-DM",   dim = 15,  iter = 5)
+ str(model)
+ embedding <- as.matrix(model, which = "words")
+ embedding <- as.matrix(model, which = "docs")
+ head(embedding)
+ 
+ ## Get vocabulary
+ vocab <- summary(model, type = "vocabulary",  which = "docs")
+ vocab <- summary(model, type = "vocabulary",  which = "words")
+ ## Don't show: 
+ } # End of main if statement running only if the required packages are installed
Loading required package: tokenizers.bpe
Loading required package: udpipe
'data.frame':	2000 obs. of  3 variables:
 $ doc_id  : chr  "http://data.dekamer.be/v0/qrva/54-B144-14-1021-2017201819553" "http://data.dekamer.be/v0/qrva/54-B141-4-1075-2017201820260" "http://data.dekamer.be/v0/qrva/54-B143-4-1074-2017201820256" "http://data.dekamer.be/v0/qrva/54-B143-4-1076-2017201820265" ...
 $ text    : chr  "Percentage vrouwen met een eenoudergezin. \n\n In Wallonie werden de eenoudergezinnen onlangs gescreend. Daarui"| __truncated__ "Bescherming van de gegevens van kinderen. \n\n Op 25 mei 2018 zal de Algemene Verordening Gegevensbescherming ("| __truncated__ "Snel breedbandinternet. \n\n In het kader van het Plan voor ultrasnel internet in Belgie hebt u in 2015 uw voor"| __truncated__ "Rapport van UNICEF. - 'Danger in the air'. \n\n UNICEF heeft recent een nieuw rapport over de impact van luchtv"| __truncated__ ...
 $ language: Factor w/ 2 levels "dutch","french": 1 1 1 1 1 1 1 1 1 1 ...
==1826472== Invalid read of size 8
==1826472==    at 0x17343E09: WMD::~WMD() (packages/tests-vg/doc2vec/src/doc2vec/WMD.cpp:27)
==1826472==    by 0x1733A8A4: Doc2Vec::~Doc2Vec() (packages/tests-vg/doc2vec/src/doc2vec/Doc2Vec.cpp:31)
==1826472==    by 0x17349F0F: standard_delete_finalizer<Doc2Vec> (R-devel/site-library/Rcpp/include/Rcpp/XPtr.h:30)
==1826472==    by 0x17349F0F: standard_delete_finalizer<Doc2Vec> (R-devel/site-library/Rcpp/include/Rcpp/XPtr.h:29)
==1826472==    by 0x17349F0F: finalizer_wrapper<Doc2Vec, Rcpp::standard_delete_finalizer<Doc2Vec> > (R-devel/site-library/Rcpp/include/Rcpp/XPtr.h:47)
==1826472==    by 0x17349F0F: void Rcpp::finalizer_wrapper<Doc2Vec, &(void Rcpp::standard_delete_finalizer<Doc2Vec>(Doc2Vec*))>(SEXPREC*) (R-devel/site-library/Rcpp/include/Rcpp/XPtr.h:34)
==1826472==    by 0x52BD9C: R_RunWeakRefFinalizer (svn/R-devel/src/main/memory.c:1469)
==1826472==    by 0x52BFC8: RunFinalizers.isra.0 (svn/R-devel/src/main/memory.c:1536)
==1826472==    by 0x4DC8D4: bc_check_sigint (svn/R-devel/src/main/eval.c:5529)
==1826472==    by 0x4DC8D4: bcEval (svn/R-devel/src/main/eval.c:6723)
==1826472==    by 0x4EFFB7: Rf_eval (svn/R-devel/src/main/eval.c:727)
==1826472==    by 0x4F19CD: R_execClosure (svn/R-devel/src/main/eval.c:1897)
==1826472==    by 0x4F26C3: Rf_applyClosure (svn/R-devel/src/main/eval.c:1823)
==1826472==    by 0x532645: dispatchMethod (svn/R-devel/src/main/objects.c:436)
==1826472==    by 0x5329F2: Rf_usemethod (svn/R-devel/src/main/objects.c:486)
==1826472==    by 0x532DA4: do_usemethod (svn/R-devel/src/main/objects.c:565)
==1826472==  Address 0x1afb26c0 is 48 bytes inside a block of size 80 free'd
==1826472==    at 0x483BEDD: operator delete(void*) (/builddir/build/BUILD/valgrind-3.16.1/coregrind/m_replacemalloc/vg_replace_malloc.c:584)
==1826472==    by 0x1733A892: Doc2Vec::~Doc2Vec() (packages/tests-vg/doc2vec/src/doc2vec/Doc2Vec.cpp:30)
==1826472==    by 0x17349F0F: standard_delete_finalizer<Doc2Vec> (R-devel/site-library/Rcpp/include/Rcpp/XPtr.h:30)
==1826472==    by 0x17349F0F: standard_delete_finalizer<Doc2Vec> (R-devel/site-library/Rcpp/include/Rcpp/XPtr.h:29)
==1826472==    by 0x17349F0F: finalizer_wrapper<Doc2Vec, Rcpp::standard_delete_finalizer<Doc2Vec> > (R-devel/site-library/Rcpp/include/Rcpp/XPtr.h:47)
==1826472==    by 0x17349F0F: void Rcpp::finalizer_wrapper<Doc2Vec, &(void Rcpp::standard_delete_finalizer<Doc2Vec>(Doc2Vec*))>(SEXPREC*) (R-devel/site-library/Rcpp/include/Rcpp/XPtr.h:34)
==1826472==    by 0x52BD9C: R_RunWeakRefFinalizer (svn/R-devel/src/main/memory.c:1469)
==1826472==    by 0x52BFC8: RunFinalizers.isra.0 (svn/R-devel/src/main/memory.c:1536)
==1826472==    by 0x4DC8D4: bc_check_sigint (svn/R-devel/src/main/eval.c:5529)
==1826472==    by 0x4DC8D4: bcEval (svn/R-devel/src/main/eval.c:6723)
==1826472==    by 0x4EFFB7: Rf_eval (svn/R-devel/src/main/eval.c:727)
==1826472==    by 0x4F19CD: R_execClosure (svn/R-devel/src/main/eval.c:1897)
==1826472==    by 0x4F26C3: Rf_applyClosure (svn/R-devel/src/main/eval.c:1823)
==1826472==    by 0x532645: dispatchMethod (svn/R-devel/src/main/objects.c:436)
==1826472==    by 0x5329F2: Rf_usemethod (svn/R-devel/src/main/objects.c:486)
==1826472==    by 0x532DA4: do_usemethod (svn/R-devel/src/main/objects.c:565)
==1826472==  Block was alloc'd at
==1826472==    at 0x483AE7D: operator new(unsigned long) (/builddir/build/BUILD/valgrind-3.16.1/coregrind/m_replacemalloc/vg_replace_malloc.c:342)
==1826472==    by 0x1733B912: Doc2Vec::train(char const*, int, int, int, int, int, int, float, float, int, int, int) (packages/tests-vg/doc2vec/src/doc2vec/Doc2Vec.cpp:85)
==1826472==    by 0x17345C02: paragraph2vec_train(char const*, int, int, int, int, int, int, double, double, int, int, int) (packages/tests-vg/doc2vec/src/rcpp_doc2vec.cpp:15)
==1826472==    by 0x1734E7E6: _doc2vec_paragraph2vec_train (packages/tests-vg/doc2vec/src/RcppExports.cpp:26)
==1826472==    by 0x49CDC5: R_doDotCall (svn/R-devel/src/main/dotcode.c:645)
==1826472==    by 0x49D3E3: do_dotcall (svn/R-devel/src/main/dotcode.c:1281)
==1826472==    by 0x4D34A6: bcEval (svn/R-devel/src/main/eval.c:7115)
==1826472==    by 0x4EFFB7: Rf_eval (svn/R-devel/src/main/eval.c:727)
==1826472==    by 0x4F19CD: R_execClosure (svn/R-devel/src/main/eval.c:1897)
==1826472==    by 0x4F26C3: Rf_applyClosure (svn/R-devel/src/main/eval.c:1823)
==1826472==    by 0x4DF55D: bcEval (svn/R-devel/src/main/eval.c:7083)
==1826472==    by 0x4EFFB7: Rf_eval (svn/R-devel/src/main/eval.c:727)
==1826472== 
==1826472== Invalid read of size 8
==1826472==    at 0x17343E37: WMD::~WMD() (packages/tests-vg/doc2vec/src/doc2vec/WMD.cpp:27)
==1826472==    by 0x1733A8A4: Doc2Vec::~Doc2Vec() (packages/tests-vg/doc2vec/src/doc2vec/Doc2Vec.cpp:31)
==1826472==    by 0x17349F0F: standard_delete_finalizer<Doc2Vec> (R-devel/site-library/Rcpp/include/Rcpp/XPtr.h:30)
==1826472==    by 0x17349F0F: standard_delete_finalizer<Doc2Vec> (R-devel/site-library/Rcpp/include/Rcpp/XPtr.h:29)
==1826472==    by 0x17349F0F: finalizer_wrapper<Doc2Vec, Rcpp::standard_delete_finalizer<Doc2Vec> > (R-devel/site-library/Rcpp/include/Rcpp/XPtr.h:47)
==1826472==    by 0x17349F0F: void Rcpp::finalizer_wrapper<Doc2Vec, &(void Rcpp::standard_delete_finalizer<Doc2Vec>(Doc2Vec*))>(SEXPREC*) (R-devel/site-library/Rcpp/include/Rcpp/XPtr.h:34)
==1826472==    by 0x52BD9C: R_RunWeakRefFinalizer (svn/R-devel/src/main/memory.c:1469)
==1826472==    by 0x52BFC8: RunFinalizers.isra.0 (svn/R-devel/src/main/memory.c:1536)
==1826472==    by 0x4DC8D4: bc_check_sigint (svn/R-devel/src/main/eval.c:5529)
==1826472==    by 0x4DC8D4: bcEval (svn/R-devel/src/main/eval.c:6723)
==1826472==    by 0x4EFFB7: Rf_eval (svn/R-devel/src/main/eval.c:727)
==1826472==    by 0x4F19CD: R_execClosure (svn/R-devel/src/main/eval.c:1897)
==1826472==    by 0x4F26C3: Rf_applyClosure (svn/R-devel/src/main/eval.c:1823)
==1826472==    by 0x532645: dispatchMethod (svn/R-devel/src/main/objects.c:436)
==1826472==    by 0x5329F2: Rf_usemethod (svn/R-devel/src/main/objects.c:486)
==1826472==    by 0x532DA4: do_usemethod (svn/R-devel/src/main/objects.c:565)
==1826472==  Address 0x1afb26c0 is 48 bytes inside a block of size 80 free'd
==1826472==    at 0x483BEDD: operator delete(void*) (/builddir/build/BUILD/valgrind-3.16.1/coregrind/m_replacemalloc/vg_replace_malloc.c:584)
==1826472==    by 0x1733A892: Doc2Vec::~Doc2Vec() (packages/tests-vg/doc2vec/src/doc2vec/Doc2Vec.cpp:30)
==1826472==    by 0x17349F0F: standard_delete_finalizer<Doc2Vec> (R-devel/site-library/Rcpp/include/Rcpp/XPtr.h:30)
==1826472==    by 0x17349F0F: standard_delete_finalizer<Doc2Vec> (R-devel/site-library/Rcpp/include/Rcpp/XPtr.h:29)
==1826472==    by 0x17349F0F: finalizer_wrapper<Doc2Vec, Rcpp::standard_delete_finalizer<Doc2Vec> > (R-devel/site-library/Rcpp/include/Rcpp/XPtr.h:47)
==1826472==    by 0x17349F0F: void Rcpp::finalizer_wrapper<Doc2Vec, &(void Rcpp::standard_delete_finalizer<Doc2Vec>(Doc2Vec*))>(SEXPREC*) (R-devel/site-library/Rcpp/include/Rcpp/XPtr.h:34)
==1826472==    by 0x52BD9C: R_RunWeakRefFinalizer (svn/R-devel/src/main/memory.c:1469)
==1826472==    by 0x52BFC8: RunFinalizers.isra.0 (svn/R-devel/src/main/memory.c:1536)
==1826472==    by 0x4DC8D4: bc_check_sigint (svn/R-devel/src/main/eval.c:5529)
==1826472==    by 0x4DC8D4: bcEval (svn/R-devel/src/main/eval.c:6723)
==1826472==    by 0x4EFFB7: Rf_eval (svn/R-devel/src/main/eval.c:727)
==1826472==    by 0x4F19CD: R_execClosure (svn/R-devel/src/main/eval.c:1897)
==1826472==    by 0x4F26C3: Rf_applyClosure (svn/R-devel/src/main/eval.c:1823)
==1826472==    by 0x532645: dispatchMethod (svn/R-devel/src/main/objects.c:436)
==1826472==    by 0x5329F2: Rf_usemethod (svn/R-devel/src/main/objects.c:486)
==1826472==    by 0x532DA4: do_usemethod (svn/R-devel/src/main/objects.c:565)
==1826472==  Block was alloc'd at
==1826472==    at 0x483AE7D: operator new(unsigned long) (/builddir/build/BUILD/valgrind-3.16.1/coregrind/m_replacemalloc/vg_replace_malloc.c:342)
==1826472==    by 0x1733B912: Doc2Vec::train(char const*, int, int, int, int, int, int, float, float, int, int, int) (packages/tests-vg/doc2vec/src/doc2vec/Doc2Vec.cpp:85)
==1826472==    by 0x17345C02: paragraph2vec_train(char const*, int, int, int, int, int, int, double, double, int, int, int) (packages/tests-vg/doc2vec/src/rcpp_doc2vec.cpp:15)
==1826472==    by 0x1734E7E6: _doc2vec_paragraph2vec_train (packages/tests-vg/doc2vec/src/RcppExports.cpp:26)
==1826472==    by 0x49CDC5: R_doDotCall (svn/R-devel/src/main/dotcode.c:645)
==1826472==    by 0x49D3E3: do_dotcall (svn/R-devel/src/main/dotcode.c:1281)
==1826472==    by 0x4D34A6: bcEval (svn/R-devel/src/main/eval.c:7115)
==1826472==    by 0x4EFFB7: Rf_eval (svn/R-devel/src/main/eval.c:727)
==1826472==    by 0x4F19CD: R_execClosure (svn/R-devel/src/main/eval.c:1897)
==1826472==    by 0x4F26C3: Rf_applyClosure (svn/R-devel/src/main/eval.c:1823)
==1826472==    by 0x4DF55D: bcEval (svn/R-devel/src/main/eval.c:7083)
==1826472==    by 0x4EFFB7: Rf_eval (svn/R-devel/src/main/eval.c:727)
==1826472== 
==1826472== Warning: set address range perms: large range [0x2ffcf028, 0x47d47458) (noaccess)
==1826472== Warning: set address range perms: large range [0x2f7cf040, 0x47547440) (undefined)
List of 3
 $ model  :<externalptr> 
 $ data   :List of 4
  ..$ file        : chr "/tmp/RtmpEAhEtr/textspace_1bdea875e9828c.txt"
  ..$ n           : num 203713
  ..$ n_vocabulary: num 4254
  ..$ n_docs      : num 999
 $ control:List of 9
  ..$ min_count: int 5
  ..$ dim      : int 15
  ..$ window   : int 5
  ..$ iter     : int 5
  ..$ lr       : num 0.05
  ..$ skipgram : logi FALSE
  ..$ hs       : int 0
  ..$ negative : int 5
  ..$ sample   : num 0.001
 - attr(*, "class")= chr "paragraph2vec_trained"
> ## End(Don't show)
> 
> 
> 
> cleanEx()

detaching ‘package:udpipe’, ‘package:tokenizers.bpe’

> nameEx("paragraph2vec_similarity")
> ### * paragraph2vec_similarity
> 
> flush(stderr()); flush(stdout())
> 
> ### Name: paragraph2vec_similarity
> ### Title: Similarity between document / word vectors as used in
> ###   paragraph2vec
> ### Aliases: paragraph2vec_similarity
> 
> ### ** Examples
> 
> x <- matrix(rnorm(6), nrow = 2, ncol = 3)
> rownames(x) <- c("word1", "word2")
> y <- matrix(rnorm(15), nrow = 5, ncol = 3)
> rownames(y) <- c("doc1", "doc2", "doc3", "doc4", "doc5")
> 
> paragraph2vec_similarity(x, y)
            doc1       doc2      doc3       doc4       doc5
word1 -0.6364508  0.3676014  1.760565 -0.5530176 -0.6067031
word2  0.7247061 -1.6298525 -4.101116  1.2512209 -0.5480451
> paragraph2vec_similarity(x, y, top_n = 1)
  term1 term2 similarity rank
1 word2  doc4   1.251221    1
2 word1  doc3   1.760565    1
> paragraph2vec_similarity(x, y, top_n = 2)
  term1 term2 similarity rank
1 word2  doc4  1.2512209    1
2 word2  doc1  0.7247061    2
3 word1  doc3  1.7605649    1
4 word1  doc2  0.3676014    2
> paragraph2vec_similarity(x, y, top_n = +Inf)
   term1 term2 similarity rank
1  word2  doc4  1.2512209    1
2  word2  doc1  0.7247061    2
3  word2  doc5 -0.5480451    3
4  word2  doc2 -1.6298525    4
5  word2  doc3 -4.1011158    5
6  word1  doc3  1.7605649    1
7  word1  doc2  0.3676014    2
8  word1  doc4 -0.5530176    3
9  word1  doc5 -0.6067031    4
10 word1  doc1 -0.6364508    5
> paragraph2vec_similarity(y, y)
           doc1       doc2      doc3        doc4       doc5
doc1  0.3898270  0.1024135 -0.596029  0.28007612 0.70449051
doc2  0.1024135  1.8218900  2.576073 -0.36378296 2.01146409
doc3 -0.5960290  2.5760733  5.910824 -2.17949696 1.72465356
doc4  0.2800761 -0.3637830 -2.179497  1.71145042 0.03355426
doc5  0.7044905  2.0114641  1.724654  0.03355426 3.13202074
> paragraph2vec_similarity(y, y, top_n = 1)
  term1 term2 similarity rank
1  doc5  doc5  3.1320207    1
2  doc4  doc4  1.7114504    1
3  doc3  doc3  5.9108240    1
4  doc2  doc3  2.5760733    1
5  doc1  doc5  0.7044905    1
> paragraph2vec_similarity(y, y, top_n = 2)
   term1 term2 similarity rank
1   doc5  doc5  3.1320207    1
2   doc5  doc2  2.0114641    2
3   doc4  doc4  1.7114504    1
4   doc4  doc1  0.2800761    2
5   doc3  doc3  5.9108240    1
6   doc3  doc2  2.5760733    2
7   doc2  doc3  2.5760733    1
8   doc2  doc5  2.0114641    2
9   doc1  doc5  0.7044905    1
10  doc1  doc1  0.3898270    2
> paragraph2vec_similarity(y, y, top_n = +Inf)
   term1 term2  similarity rank
1   doc5  doc5  3.13202074    1
2   doc5  doc2  2.01146409    2
3   doc5  doc3  1.72465356    3
4   doc5  doc1  0.70449051    4
5   doc5  doc4  0.03355426    5
6   doc4  doc4  1.71145042    1
7   doc4  doc1  0.28007612    2
8   doc4  doc5  0.03355426    3
9   doc4  doc2 -0.36378296    4
10  doc4  doc3 -2.17949696    5
11  doc3  doc3  5.91082401    1
12  doc3  doc2  2.57607334    2
13  doc3  doc5  1.72465356    3
14  doc3  doc1 -0.59602900    4
15  doc3  doc4 -2.17949696    5
16  doc2  doc3  2.57607334    1
17  doc2  doc5  2.01146409    2
18  doc2  doc2  1.82189002    3
19  doc2  doc1  0.10241352    4
20  doc2  doc4 -0.36378296    5
21  doc1  doc5  0.70449051    1
22  doc1  doc1  0.38982695    2
23  doc1  doc4  0.28007612    3
24  doc1  doc2  0.10241352    4
25  doc1  doc3 -0.59602900    5
> 
> 
> 
> cleanEx()
> nameEx("predict.paragraph2vec")
> ### * predict.paragraph2vec
> 
> flush(stderr()); flush(stdout())
> 
> ### Name: predict.paragraph2vec
> ### Title: Predict functionalities for a paragraph2vec model
> ### Aliases: predict.paragraph2vec
> 
> ### ** Examples
> 
> ## Don't show: 
> if(require(tokenizers.bpe) & require(udpipe)){
+ ## End(Don't show)
+ library(tokenizers.bpe)
+ library(udpipe)
+ data(belgium_parliament, package = "tokenizers.bpe")
+ x <- belgium_parliament
+ x <- subset(x, language %in% "dutch")
+ x <- subset(x, nchar(text) > 0 & txt_count(text, pattern = " ") < 1000)
+ x$doc_id <- sprintf("doc_%s", 1:nrow(x))
+ x$text   <- tolower(x$text)
+ x$text   <- gsub("[^[:alpha:]]", " ", x$text)
+ x$text   <- gsub("[[:space:]]+", " ", x$text)
+ x$text   <- trimws(x$text)
+ 
+ ## Build model
+ model <- paragraph2vec(x = x, type = "PV-DM",   dim = 15,  iter = 5)
+ 
+ sentences <- list(
+   example = c("geld", "diabetes"),
+   hi = c("geld", "diabetes", "koning"),
+   test = c("geld"),
+   nothing = character(), 
+   repr = c("geld", "diabetes", "koning"))
+   
+ ## Get embeddings (type =  'embedding')
+ predict(model, newdata = c("geld", "koning", "unknownword", NA, "</s>", ""), 
+                type = "embedding", which = "words")
+ predict(model, newdata = c("doc_1", "doc_10", "unknowndoc", NA, "</s>"), 
+                type = "embedding", which = "docs")
+ predict(model, sentences, type = "embedding")
+ 
+ ## Get most similar items (type =  'nearest')
+ predict(model, newdata = c("doc_1", "doc_10"), type = "nearest", which = "doc2doc")
+ predict(model, newdata = c("geld", "koning"), type = "nearest", which = "word2doc")
+ predict(model, newdata = c("geld", "koning"), type = "nearest", which = "word2word")
+ predict(model, newdata = sentences, type = "nearest", which = "sent2doc", top_n = 7)
+ 
+ ## Similar way on extracting similarities
+ emb <- predict(model, sentences, type = "embedding")
+ emb_docs <- as.matrix(model, type = "docs")
+ paragraph2vec_similarity(emb, emb_docs, top_n = 3)
+ ## Don't show: 
+ } # End of main if statement running only if the required packages are installed
Loading required package: tokenizers.bpe
Loading required package: udpipe
==1826472== Warning: set address range perms: large range [0x2f7cf028, 0x47547458) (noaccess)
==1826472== Warning: set address range perms: large range [0x2f7cf040, 0x47547440) (undefined)
     term1   term2 similarity rank
1     test doc_285  0.9791637    1
2     test doc_807  0.9765480    2
3     test doc_195  0.9696226    3
4     repr doc_424  0.9917132    1
5     repr doc_101  0.9901011    2
6     repr doc_199  0.9894081    3
7  nothing doc_523  0.7684932    1
8  nothing doc_807  0.6819410    2
9  nothing doc_923  0.6805237    3
10      hi doc_424  0.9917132    1
11      hi doc_101  0.9901011    2
12      hi doc_199  0.9894081    3
13 example doc_424  0.9853744    1
14 example doc_199  0.9818371    2
15 example doc_790  0.9786496    3
> ## End(Don't show)
> 
> 
> 
> cleanEx()

detaching ‘package:udpipe’, ‘package:tokenizers.bpe’
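
Note in the output above that the sentences hi and repr, which contain exactly the same tokens, receive identical similarity scores against every document, so inference here appears deterministic for a given token sequence. A quick sketch, reusing model and sentences from the example, to check this on the embeddings directly:

# assumption: the inferred embedding depends only on the tokens supplied
emb <- predict(model, sentences, type = "embedding")
all.equal(emb["hi", ], emb["repr", ])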

> nameEx("read.paragraph2vec")
> ### * read.paragraph2vec
> 
> flush(stderr()); flush(stdout())
> 
> ### Name: read.paragraph2vec
> ### Title: Read a binary paragraph2vec model from disk
> ### Aliases: read.paragraph2vec
> 
> ### ** Examples
> 
> ## Don't show: 
> if(require(tokenizers.bpe) & require(udpipe)){
+ ## End(Don't show)
+ library(tokenizers.bpe)
+ library(udpipe)
+ data(belgium_parliament, package = "tokenizers.bpe")
+ x <- subset(belgium_parliament, language %in% "french")
+ x <- subset(x, nchar(text) > 0 & txt_count(text, pattern = " ") < 1000)
+ 
+ ## Don't show: 
+ model <- paragraph2vec(x = head(x, 5), 
+                        type = "PV-DM", dim = 5, iter = 1, min_count = 0)
+ ## End(Don't show)
+ path <- "mymodel.bin"
+ ## Don't show: 
+ path <- tempfile(pattern = "paragraph2vec", fileext = ".bin")
+ ## End(Don't show)
+ write.paragraph2vec(model, file = path)
+ model <- read.paragraph2vec(file = path)
+ 
+ vocab <- summary(model, type = "vocabulary", which = "docs")
+ vocab <- summary(model, type = "vocabulary", which = "words")
+ embedding <- as.matrix(model, which = "docs")
+ embedding <- as.matrix(model, which = "words")
+ ## Don't show: 
+ file.remove(path)
+ ## End(Don't show)
+ ## Don't show: 
+ } # End of main if statement running only if the required packages are installed
Loading required package: tokenizers.bpe
Loading required package: udpipe
==1826472== Warning: set address range perms: large range [0x60efa040, 0x78c72440) (undefined)
==1826472== Warning: set address range perms: large range [0x87155040, 0x9eecd440) (undefined)
[1] TRUE
> ## End(Don't show)
> 
> 
> 
> cleanEx()

detaching ‘package:udpipe’, ‘package:tokenizers.bpe’
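
The example above only checks that the model can be written and reloaded. A small follow-up sketch, reusing model and path, that also verifies the reloaded model carries the same document embeddings (assumption: the binary format stores the vectors losslessly):

before <- as.matrix(model, which = "docs")
write.paragraph2vec(model, file = path)
model  <- read.paragraph2vec(file = path)
all.equal(before, as.matrix(model, which = "docs"))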

> nameEx("write.paragraph2vec")
> ### * write.paragraph2vec
> 
> flush(stderr()); flush(stdout())
> 
> ### Name: write.paragraph2vec
> ### Title: Save a paragraph2vec model to disk
> ### Aliases: write.paragraph2vec
> 
> ### ** Examples
> 
> ## Don't show: 
> if(require(tokenizers.bpe) & require(udpipe)){
+ ## End(Don't show)
+ library(tokenizers.bpe)
+ library(udpipe)
+ data(belgium_parliament, package = "tokenizers.bpe")
+ x <- subset(belgium_parliament, language %in% "french")
+ x <- subset(x, nchar(text) > 0 & txt_count(text, pattern = " ") < 1000)
+ 
+ ## Don't show: 
+ model <- paragraph2vec(x = head(x, 5), 
+                        type = "PV-DM", dim = 5, iter = 1, min_count = 0)
+ ## End(Don't show)
+ path <- "mymodel.bin"
+ ## Don't show: 
+ path <- tempfile(pattern = "paragraph2vec", fileext = ".bin")
+ ## End(Don't show)
+ write.paragraph2vec(model, file = path)
+ model <- read.paragraph2vec(file = path)
+ 
+ vocab <- summary(model, type = "vocabulary", which = "docs")
+ vocab <- summary(model, type = "vocabulary", which = "words")
+ embedding <- as.matrix(model, which = "docs")
+ embedding <- as.matrix(model, which = "words")
+ ## Don't show: 
+ file.remove(path)
+ ## End(Don't show)
+ ## Don't show: 
+ } # End of main if statement running only if the required packages are installed
Loading required package: tokenizers.bpe
Loading required package: udpipe
==1826472== Warning: set address range perms: large range [0x87155028, 0x9eecd458) (noaccess)
==1826472== Warning: set address range perms: large range [0x60efa028, 0x78c72458) (noaccess)
==1826472== Warning: set address range perms: large range [0x2f7cf028, 0x47547458) (noaccess)
==1826472== Warning: set address range perms: large range [0x2f7cf040, 0x47547440) (undefined)
==1826472== Warning: set address range perms: large range [0x60efa040, 0x78c72440) (undefined)
[1] TRUE
> ## End(Don't show)
> 
> 
> 
> ### * <FOOTER>
> ###
> cleanEx()

detaching ‘package:udpipe’, ‘package:tokenizers.bpe’

> options(digits = 7L)
> base::cat("Time elapsed: ", proc.time() - base::get("ptime", pos = 'CheckExEnv'),"\n")
Time elapsed:  626.076 12.991 644.965 0 0 
> grDevices::dev.off()
null device 
          1 
> ###
> ### Local variables: ***
> ### mode: outline-minor ***
> ### outline-regexp: "\\(> \\)?### [*]+" ***
> ### End: ***
> quit('no')
==1826472== 
==1826472== HEAP SUMMARY:
==1826472==     in use at exit: 1,481,452,553 bytes in 107,402 blocks
==1826472==   total heap usage: 4,495,675 allocs, 4,388,273 frees, 5,369,422,879 bytes allocated
==1826472== 
==1826472== 9 bytes in 1 blocks are possibly lost in loss record 13 of 2,458
==1826472==    at 0x483CAE9: calloc (/builddir/build/BUILD/valgrind-3.16.1/coregrind/m_replacemalloc/vg_replace_malloc.c:760)
==1826472==    by 0x17342FAF: Vocabulary::addWordToVocab(char const*) (packages/tests-vg/doc2vec/src/doc2vec/Vocab.cpp:87)
==1826472==    by 0x1734353A: Vocabulary::loadFromTrainFile(char const*) (packages/tests-vg/doc2vec/src/doc2vec/Vocab.cpp:67)
==1826472==    by 0x1734391C: Vocabulary::Vocabulary(char const*, int, bool) (packages/tests-vg/doc2vec/src/doc2vec/Vocab.cpp:15)
==1826472==    by 0x1733B8DF: Doc2Vec::train(char const*, int, int, int, int, int, int, float, float, int, int, int) (packages/tests-vg/doc2vec/src/doc2vec/Doc2Vec.cpp:83)
==1826472==    by 0x17345C02: paragraph2vec_train(char const*, int, int, int, int, int, int, double, double, int, int, int) (packages/tests-vg/doc2vec/src/rcpp_doc2vec.cpp:15)
==1826472==    by 0x1734E7E6: _doc2vec_paragraph2vec_train (packages/tests-vg/doc2vec/src/RcppExports.cpp:26)
==1826472==    by 0x49CDC5: R_doDotCall (svn/R-devel/src/main/dotcode.c:645)
==1826472==    by 0x49D3E3: do_dotcall (svn/R-devel/src/main/dotcode.c:1281)
==1826472==    by 0x4D34A6: bcEval (svn/R-devel/src/main/eval.c:7115)
==1826472==    by 0x4EFFB7: Rf_eval (svn/R-devel/src/main/eval.c:727)
==1826472==    by 0x4F19CD: R_execClosure (svn/R-devel/src/main/eval.c:1897)
==1826472== 
==1826472== 20 bytes in 4 blocks are definitely lost in loss record 25 of 2,458
==1826472==    at 0x483CAE9: calloc (/builddir/build/BUILD/valgrind-3.16.1/coregrind/m_replacemalloc/vg_replace_malloc.c:760)
==1826472==    by 0x17342FAF: Vocabulary::addWordToVocab(char const*) (packages/tests-vg/doc2vec/src/doc2vec/Vocab.cpp:87)
==1826472==    by 0x1734358C: Vocabulary::loadFromTrainFile(char const*) (packages/tests-vg/doc2vec/src/doc2vec/Vocab.cpp:45)
==1826472==    by 0x1734391C: Vocabulary::Vocabulary(char const*, int, bool) (packages/tests-vg/doc2vec/src/doc2vec/Vocab.cpp:15)
==1826472==    by 0x1733B8DF: Doc2Vec::train(char const*, int, int, int, int, int, int, float, float, int, int, int) (packages/tests-vg/doc2vec/src/doc2vec/Doc2Vec.cpp:83)
==1826472==    by 0x17345C02: paragraph2vec_train(char const*, int, int, int, int, int, int, double, double, int, int, int) (packages/tests-vg/doc2vec/src/rcpp_doc2vec.cpp:15)
==1826472==    by 0x1734E7E6: _doc2vec_paragraph2vec_train (packages/tests-vg/doc2vec/src/RcppExports.cpp:26)
==1826472==    by 0x49CDC5: R_doDotCall (svn/R-devel/src/main/dotcode.c:645)
==1826472==    by 0x49D3E3: do_dotcall (svn/R-devel/src/main/dotcode.c:1281)
==1826472==    by 0x4D34A6: bcEval (svn/R-devel/src/main/eval.c:7115)
==1826472==    by 0x4EFFB7: Rf_eval (svn/R-devel/src/main/eval.c:727)
==1826472==    by 0x4F19CD: R_execClosure (svn/R-devel/src/main/eval.c:1897)
==1826472== 
==1826472== 44 bytes in 1 blocks are possibly lost in loss record 36 of 2,458
==1826472==    at 0x483CAE9: calloc (/builddir/build/BUILD/valgrind-3.16.1/coregrind/m_replacemalloc/vg_replace_malloc.c:760)
==1826472==    by 0x17343C76: Vocabulary::load(_IO_FILE*) (packages/tests-vg/doc2vec/src/doc2vec/Vocab.cpp:278)
==1826472==    by 0x1733B701: Doc2Vec::load(_IO_FILE*) (packages/tests-vg/doc2vec/src/doc2vec/Doc2Vec.cpp:292)
==1826472==    by 0x1734525A: paragraph2vec_load_model(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) (packages/tests-vg/doc2vec/src/rcpp_doc2vec.cpp:58)
==1826472==    by 0x1734DE3D: _doc2vec_paragraph2vec_load_model (packages/tests-vg/doc2vec/src/RcppExports.cpp:48)
==1826472==    by 0x49CF2F: R_doDotCall (svn/R-devel/src/main/dotcode.c:598)
==1826472==    by 0x49D3E3: do_dotcall (svn/R-devel/src/main/dotcode.c:1281)
==1826472==    by 0x4D34A6: bcEval (svn/R-devel/src/main/eval.c:7115)
==1826472==    by 0x4EFFB7: Rf_eval (svn/R-devel/src/main/eval.c:727)
==1826472==    by 0x4F19CD: R_execClosure (svn/R-devel/src/main/eval.c:1897)
==1826472==    by 0x4F26C3: Rf_applyClosure (svn/R-devel/src/main/eval.c:1823)
==1826472==    by 0x4DF55D: bcEval (svn/R-devel/src/main/eval.c:7083)
==1826472== 
==1826472== 112 bytes in 2 blocks are definitely lost in loss record 62 of 2,458
==1826472==    at 0x483B582: operator new[](unsigned long) (/builddir/build/BUILD/valgrind-3.16.1/coregrind/m_replacemalloc/vg_replace_malloc.c:431)
==1826472==    by 0x17343D61: WMD::WMD(Doc2Vec*) (packages/tests-vg/doc2vec/src/doc2vec/WMD.cpp:14)
==1826472==    by 0x1733B82E: Doc2Vec::load(_IO_FILE*) (packages/tests-vg/doc2vec/src/doc2vec/Doc2Vec.cpp:307)
==1826472==    by 0x1734525A: paragraph2vec_load_model(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) (packages/tests-vg/doc2vec/src/rcpp_doc2vec.cpp:58)
==1826472==    by 0x1734DE3D: _doc2vec_paragraph2vec_load_model (packages/tests-vg/doc2vec/src/RcppExports.cpp:48)
==1826472==    by 0x49CF2F: R_doDotCall (svn/R-devel/src/main/dotcode.c:598)
==1826472==    by 0x49D3E3: do_dotcall (svn/R-devel/src/main/dotcode.c:1281)
==1826472==    by 0x4D34A6: bcEval (svn/R-devel/src/main/eval.c:7115)
==1826472==    by 0x4EFFB7: Rf_eval (svn/R-devel/src/main/eval.c:727)
==1826472==    by 0x4F19CD: R_execClosure (svn/R-devel/src/main/eval.c:1897)
==1826472==    by 0x4F26C3: Rf_applyClosure (svn/R-devel/src/main/eval.c:1823)
==1826472==    by 0x4DF55D: bcEval (svn/R-devel/src/main/eval.c:7083)
==1826472== 
==1826472== 120 bytes in 2 blocks are possibly lost in loss record 63 of 2,458
==1826472==    at 0x483CAE9: calloc (/builddir/build/BUILD/valgrind-3.16.1/coregrind/m_replacemalloc/vg_replace_malloc.c:760)
==1826472==    by 0x17342FAF: Vocabulary::addWordToVocab(char const*) (packages/tests-vg/doc2vec/src/doc2vec/Vocab.cpp:87)
==1826472==    by 0x17343497: Vocabulary::loadFromTrainFile(char const*) (packages/tests-vg/doc2vec/src/doc2vec/Vocab.cpp:53)
==1826472==    by 0x1734391C: Vocabulary::Vocabulary(char const*, int, bool) (packages/tests-vg/doc2vec/src/doc2vec/Vocab.cpp:15)
==1826472==    by 0x1733B904: Doc2Vec::train(char const*, int, int, int, int, int, int, float, float, int, int, int) (packages/tests-vg/doc2vec/src/doc2vec/Doc2Vec.cpp:84)
==1826472==    by 0x17345C02: paragraph2vec_train(char const*, int, int, int, int, int, int, double, double, int, int, int) (packages/tests-vg/doc2vec/src/rcpp_doc2vec.cpp:15)
==1826472==    by 0x1734E7E6: _doc2vec_paragraph2vec_train (packages/tests-vg/doc2vec/src/RcppExports.cpp:26)
==1826472==    by 0x49CDC5: R_doDotCall (svn/R-devel/src/main/dotcode.c:645)
==1826472==    by 0x49D3E3: do_dotcall (svn/R-devel/src/main/dotcode.c:1281)
==1826472==    by 0x4D34A6: bcEval (svn/R-devel/src/main/eval.c:7115)
==1826472==    by 0x4EFFB7: Rf_eval (svn/R-devel/src/main/eval.c:727)
==1826472==    by 0x4F19CD: R_execClosure (svn/R-devel/src/main/eval.c:1897)
==1826472== 
==1826472== 309 bytes in 7 blocks are definitely lost in loss record 127 of 2,458
==1826472==    at 0x483CAE9: calloc (/builddir/build/BUILD/valgrind-3.16.1/coregrind/m_replacemalloc/vg_replace_malloc.c:760)
==1826472==    by 0x17343C01: Vocabulary::load(_IO_FILE*) (packages/tests-vg/doc2vec/src/doc2vec/Vocab.cpp:272)
==1826472==    by 0x1733B71A: Doc2Vec::load(_IO_FILE*) (packages/tests-vg/doc2vec/src/doc2vec/Doc2Vec.cpp:294)
==1826472==    by 0x1734525A: paragraph2vec_load_model(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) (packages/tests-vg/doc2vec/src/rcpp_doc2vec.cpp:58)
==1826472==    by 0x1734DE3D: _doc2vec_paragraph2vec_load_model (packages/tests-vg/doc2vec/src/RcppExports.cpp:48)
==1826472==    by 0x49CF2F: R_doDotCall (svn/R-devel/src/main/dotcode.c:598)
==1826472==    by 0x49D3E3: do_dotcall (svn/R-devel/src/main/dotcode.c:1281)
==1826472==    by 0x4D34A6: bcEval (svn/R-devel/src/main/eval.c:7115)
==1826472==    by 0x4EFFB7: Rf_eval (svn/R-devel/src/main/eval.c:727)
==1826472==    by 0x4F19CD: R_execClosure (svn/R-devel/src/main/eval.c:1897)
==1826472==    by 0x4F26C3: Rf_applyClosure (svn/R-devel/src/main/eval.c:1823)
==1826472==    by 0x4DF55D: bcEval (svn/R-devel/src/main/eval.c:7083)
==1826472== 
==1826472== 960 bytes in 24 blocks are possibly lost in loss record 196 of 2,458
==1826472==    at 0x483CAE9: calloc (/builddir/build/BUILD/valgrind-3.16.1/coregrind/m_replacemalloc/vg_replace_malloc.c:760)
==1826472==    by 0x17343632: Vocabulary::createHuffmanTree() (packages/tests-vg/doc2vec/src/doc2vec/Vocab.cpp:169)
==1826472==    by 0x1733B8DF: Doc2Vec::train(char const*, int, int, int, int, int, int, float, float, int, int, int) (packages/tests-vg/doc2vec/src/doc2vec/Doc2Vec.cpp:83)
==1826472==    by 0x17345C02: paragraph2vec_train(char const*, int, int, int, int, int, int, double, double, int, int, int) (packages/tests-vg/doc2vec/src/rcpp_doc2vec.cpp:15)
==1826472==    by 0x1734E7E6: _doc2vec_paragraph2vec_train (packages/tests-vg/doc2vec/src/RcppExports.cpp:26)
==1826472==    by 0x49CDC5: R_doDotCall (svn/R-devel/src/main/dotcode.c:645)
==1826472==    by 0x49D3E3: do_dotcall (svn/R-devel/src/main/dotcode.c:1281)
==1826472==    by 0x4D34A6: bcEval (svn/R-devel/src/main/eval.c:7115)
==1826472==    by 0x4EFFB7: Rf_eval (svn/R-devel/src/main/eval.c:727)
==1826472==    by 0x4F19CD: R_execClosure (svn/R-devel/src/main/eval.c:1897)
==1826472==    by 0x4F26C3: Rf_applyClosure (svn/R-devel/src/main/eval.c:1823)
==1826472==    by 0x4DF55D: bcEval (svn/R-devel/src/main/eval.c:7083)
==1826472== 
==1826472== 1,144 bytes in 2 blocks are possibly lost in loss record 209 of 2,458
==1826472==    at 0x483B582: operator new[](unsigned long) (/builddir/build/BUILD/valgrind-3.16.1/coregrind/m_replacemalloc/vg_replace_malloc.c:431)
==1826472==    by 0x173406E4: UnWeightedDocument::UnWeightedDocument(Doc2Vec*, TaggedDocument*) (packages/tests-vg/doc2vec/src/doc2vec/TaggedBrownCorpus.cpp:123)
==1826472==    by 0x1734401A: WMD::loadFromDoc2Vec() (packages/tests-vg/doc2vec/src/doc2vec/WMD.cpp:67)
==1826472==    by 0x17345C02: paragraph2vec_train(char const*, int, int, int, int, int, int, double, double, int, int, int) (packages/tests-vg/doc2vec/src/rcpp_doc2vec.cpp:15)
==1826472==    by 0x1734E7E6: _doc2vec_paragraph2vec_train (packages/tests-vg/doc2vec/src/RcppExports.cpp:26)
==1826472==    by 0x49CDC5: R_doDotCall (svn/R-devel/src/main/dotcode.c:645)
==1826472==    by 0x49D3E3: do_dotcall (svn/R-devel/src/main/dotcode.c:1281)
==1826472==    by 0x4D34A6: bcEval (svn/R-devel/src/main/eval.c:7115)
==1826472==    by 0x4EFFB7: Rf_eval (svn/R-devel/src/main/eval.c:727)
==1826472==    by 0x4F19CD: R_execClosure (svn/R-devel/src/main/eval.c:1897)
==1826472==    by 0x4F26C3: Rf_applyClosure (svn/R-devel/src/main/eval.c:1823)
==1826472==    by 0x4DF55D: bcEval (svn/R-devel/src/main/eval.c:7083)
==1826472== 
==1826472== 5,502 bytes in 662 blocks are definitely lost in loss record 303 of 2,458
==1826472==    at 0x483CAE9: calloc (/builddir/build/BUILD/valgrind-3.16.1/coregrind/m_replacemalloc/vg_replace_malloc.c:760)
==1826472==    by 0x17343C01: Vocabulary::load(_IO_FILE*) (packages/tests-vg/doc2vec/src/doc2vec/Vocab.cpp:272)
==1826472==    by 0x1733B701: Doc2Vec::load(_IO_FILE*) (packages/tests-vg/doc2vec/src/doc2vec/Doc2Vec.cpp:292)
==1826472==    by 0x1734525A: paragraph2vec_load_model(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) (packages/tests-vg/doc2vec/src/rcpp_doc2vec.cpp:58)
==1826472==    by 0x1734DE3D: _doc2vec_paragraph2vec_load_model (packages/tests-vg/doc2vec/src/RcppExports.cpp:48)
==1826472==    by 0x49CF2F: R_doDotCall (svn/R-devel/src/main/dotcode.c:598)
==1826472==    by 0x49D3E3: do_dotcall (svn/R-devel/src/main/dotcode.c:1281)
==1826472==    by 0x4D34A6: bcEval (svn/R-devel/src/main/eval.c:7115)
==1826472==    by 0x4EFFB7: Rf_eval (svn/R-devel/src/main/eval.c:727)
==1826472==    by 0x4F19CD: R_execClosure (svn/R-devel/src/main/eval.c:1897)
==1826472==    by 0x4F26C3: Rf_applyClosure (svn/R-devel/src/main/eval.c:1823)
==1826472==    by 0x4DF55D: bcEval (svn/R-devel/src/main/eval.c:7083)
==1826472== 
==1826472== 7,319 bytes in 662 blocks are definitely lost in loss record 320 of 2,458
==1826472==    at 0x483CAE9: calloc (/builddir/build/BUILD/valgrind-3.16.1/coregrind/m_replacemalloc/vg_replace_malloc.c:760)
==1826472==    by 0x17343CA8: Vocabulary::load(_IO_FILE*) (packages/tests-vg/doc2vec/src/doc2vec/Vocab.cpp:280)
==1826472==    by 0x1733B701: Doc2Vec::load(_IO_FILE*) (packages/tests-vg/doc2vec/src/doc2vec/Doc2Vec.cpp:292)
==1826472==    by 0x1734525A: paragraph2vec_load_model(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) (packages/tests-vg/doc2vec/src/rcpp_doc2vec.cpp:58)
==1826472==    by 0x1734DE3D: _doc2vec_paragraph2vec_load_model (packages/tests-vg/doc2vec/src/RcppExports.cpp:48)
==1826472==    by 0x49CF2F: R_doDotCall (svn/R-devel/src/main/dotcode.c:598)
==1826472==    by 0x49D3E3: do_dotcall (svn/R-devel/src/main/dotcode.c:1281)
==1826472==    by 0x4D34A6: bcEval (svn/R-devel/src/main/eval.c:7115)
==1826472==    by 0x4EFFB7: Rf_eval (svn/R-devel/src/main/eval.c:727)
==1826472==    by 0x4F19CD: R_execClosure (svn/R-devel/src/main/eval.c:1897)
==1826472==    by 0x4F26C3: Rf_applyClosure (svn/R-devel/src/main/eval.c:1823)
==1826472==    by 0x4DF55D: bcEval (svn/R-devel/src/main/eval.c:7083)
==1826472== 
==1826472== 8,000 bytes in 1 blocks are possibly lost in loss record 1,151 of 2,458
==1826472==    at 0x483CAE9: calloc (/builddir/build/BUILD/valgrind-3.16.1/coregrind/m_replacemalloc/vg_replace_malloc.c:760)
==1826472==    by 0x17340461: TaggedDocument::TaggedDocument() (packages/tests-vg/doc2vec/src/doc2vec/TaggedBrownCorpus.cpp:88)
==1826472==    by 0x17340B7F: TaggedBrownCorpus::TaggedBrownCorpus(char const*, long long, long long) (packages/tests-vg/doc2vec/src/doc2vec/TaggedBrownCorpus.cpp:12)
==1826472==    by 0x1733B637: Doc2Vec::initTrainModelThreads(char const*, int, int) (packages/tests-vg/doc2vec/src/doc2vec/Doc2Vec.cpp:123)
==1826472==    by 0x1733B99D: Doc2Vec::train(char const*, int, int, int, int, int, int, float, float, int, int, int) (packages/tests-vg/doc2vec/src/doc2vec/Doc2Vec.cpp:91)
==1826472==    by 0x17345C02: paragraph2vec_train(char const*, int, int, int, int, int, int, double, double, int, int, int) (packages/tests-vg/doc2vec/src/rcpp_doc2vec.cpp:15)
==1826472==    by 0x1734E7E6: _doc2vec_paragraph2vec_train (packages/tests-vg/doc2vec/src/RcppExports.cpp:26)
==1826472==    by 0x49CDC5: R_doDotCall (svn/R-devel/src/main/dotcode.c:645)
==1826472==    by 0x49D3E3: do_dotcall (svn/R-devel/src/main/dotcode.c:1281)
==1826472==    by 0x4D34A6: bcEval (svn/R-devel/src/main/eval.c:7115)
==1826472==    by 0x4EFFB7: Rf_eval (svn/R-devel/src/main/eval.c:727)
==1826472==    by 0x4F19CD: R_execClosure (svn/R-devel/src/main/eval.c:1897)
==1826472== 
==1826472== 14,560 bytes in 91 blocks are possibly lost in loss record 1,199 of 2,458
==1826472==    at 0x483CAE9: calloc (/builddir/build/BUILD/valgrind-3.16.1/coregrind/m_replacemalloc/vg_replace_malloc.c:760)
==1826472==    by 0x17343645: Vocabulary::createHuffmanTree() (packages/tests-vg/doc2vec/src/doc2vec/Vocab.cpp:170)
==1826472==    by 0x1733B8DF: Doc2Vec::train(char const*, int, int, int, int, int, int, float, float, int, int, int) (packages/tests-vg/doc2vec/src/doc2vec/Doc2Vec.cpp:83)
==1826472==    by 0x17345C02: paragraph2vec_train(char const*, int, int, int, int, int, int, double, double, int, int, int) (packages/tests-vg/doc2vec/src/rcpp_doc2vec.cpp:15)
==1826472==    by 0x1734E7E6: _doc2vec_paragraph2vec_train (packages/tests-vg/doc2vec/src/RcppExports.cpp:26)
==1826472==    by 0x49CDC5: R_doDotCall (svn/R-devel/src/main/dotcode.c:645)
==1826472==    by 0x49D3E3: do_dotcall (svn/R-devel/src/main/dotcode.c:1281)
==1826472==    by 0x4D34A6: bcEval (svn/R-devel/src/main/eval.c:7115)
==1826472==    by 0x4EFFB7: Rf_eval (svn/R-devel/src/main/eval.c:727)
==1826472==    by 0x4F19CD: R_execClosure (svn/R-devel/src/main/eval.c:1897)
==1826472==    by 0x4F26C3: Rf_applyClosure (svn/R-devel/src/main/eval.c:1823)
==1826472==    by 0x4DF55D: bcEval (svn/R-devel/src/main/eval.c:7083)
==1826472== 
==1826472== 29,232 bytes in 661 blocks are definitely lost in loss record 1,679 of 2,458
==1826472==    at 0x483CAE9: calloc (/builddir/build/BUILD/valgrind-3.16.1/coregrind/m_replacemalloc/vg_replace_malloc.c:760)
==1826472==    by 0x17343C76: Vocabulary::load(_IO_FILE*) (packages/tests-vg/doc2vec/src/doc2vec/Vocab.cpp:278)
==1826472==    by 0x1733B701: Doc2Vec::load(_IO_FILE*) (packages/tests-vg/doc2vec/src/doc2vec/Doc2Vec.cpp:292)
==1826472==    by 0x1734525A: paragraph2vec_load_model(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) (packages/tests-vg/doc2vec/src/rcpp_doc2vec.cpp:58)
==1826472==    by 0x1734DE3D: _doc2vec_paragraph2vec_load_model (packages/tests-vg/doc2vec/src/RcppExports.cpp:48)
==1826472==    by 0x49CF2F: R_doDotCall (svn/R-devel/src/main/dotcode.c:598)
==1826472==    by 0x49D3E3: do_dotcall (svn/R-devel/src/main/dotcode.c:1281)
==1826472==    by 0x4D34A6: bcEval (svn/R-devel/src/main/eval.c:7115)
==1826472==    by 0x4EFFB7: Rf_eval (svn/R-devel/src/main/eval.c:727)
==1826472==    by 0x4F19CD: R_execClosure (svn/R-devel/src/main/eval.c:1897)
==1826472==    by 0x4F26C3: Rf_applyClosure (svn/R-devel/src/main/eval.c:1823)
==1826472==    by 0x4DF55D: bcEval (svn/R-devel/src/main/eval.c:7083)
==1826472== 
==1826472== 100,400 bytes in 1,004 blocks are possibly lost in loss record 2,144 of 2,458
==1826472==    at 0x483CAE9: calloc (/builddir/build/BUILD/valgrind-3.16.1/coregrind/m_replacemalloc/vg_replace_malloc.c:760)
==1826472==    by 0x17340482: TaggedDocument::TaggedDocument() (packages/tests-vg/doc2vec/src/doc2vec/TaggedBrownCorpus.cpp:89)
==1826472==    by 0x17340B7F: TaggedBrownCorpus::TaggedBrownCorpus(char const*, long long, long long) (packages/tests-vg/doc2vec/src/doc2vec/TaggedBrownCorpus.cpp:12)
==1826472==    by 0x1733B637: Doc2Vec::initTrainModelThreads(char const*, int, int) (packages/tests-vg/doc2vec/src/doc2vec/Doc2Vec.cpp:123)
==1826472==    by 0x1733B99D: Doc2Vec::train(char const*, int, int, int, int, int, int, float, float, int, int, int) (packages/tests-vg/doc2vec/src/doc2vec/Doc2Vec.cpp:91)
==1826472==    by 0x17345C02: paragraph2vec_train(char const*, int, int, int, int, int, int, double, double, int, int, int) (packages/tests-vg/doc2vec/src/rcpp_doc2vec.cpp:15)
==1826472==    by 0x1734E7E6: _doc2vec_paragraph2vec_train (packages/tests-vg/doc2vec/src/RcppExports.cpp:26)
==1826472==    by 0x49CDC5: R_doDotCall (svn/R-devel/src/main/dotcode.c:645)
==1826472==    by 0x49D3E3: do_dotcall (svn/R-devel/src/main/dotcode.c:1281)
==1826472==    by 0x4D34A6: bcEval (svn/R-devel/src/main/eval.c:7115)
==1826472==    by 0x4EFFB7: Rf_eval (svn/R-devel/src/main/eval.c:727)
==1826472==    by 0x4F19CD: R_execClosure (svn/R-devel/src/main/eval.c:1897)
==1826472== 
==1826472== 118,577 bytes in 13,315 blocks are definitely lost in loss record 2,180 of 2,458
==1826472==    at 0x483CAE9: calloc (/builddir/build/BUILD/valgrind-3.16.1/coregrind/m_replacemalloc/vg_replace_malloc.c:760)
==1826472==    by 0x17342FAF: Vocabulary::addWordToVocab(char const*) (packages/tests-vg/doc2vec/src/doc2vec/Vocab.cpp:87)
==1826472==    by 0x1734353A: Vocabulary::loadFromTrainFile(char const*) (packages/tests-vg/doc2vec/src/doc2vec/Vocab.cpp:67)
==1826472==    by 0x1734391C: Vocabulary::Vocabulary(char const*, int, bool) (packages/tests-vg/doc2vec/src/doc2vec/Vocab.cpp:15)
==1826472==    by 0x1733B8DF: Doc2Vec::train(char const*, int, int, int, int, int, int, float, float, int, int, int) (packages/tests-vg/doc2vec/src/doc2vec/Doc2Vec.cpp:83)
==1826472==    by 0x17345C02: paragraph2vec_train(char const*, int, int, int, int, int, int, double, double, int, int, int) (packages/tests-vg/doc2vec/src/rcpp_doc2vec.cpp:15)
==1826472==    by 0x1734E7E6: _doc2vec_paragraph2vec_train (packages/tests-vg/doc2vec/src/RcppExports.cpp:26)
==1826472==    by 0x49CDC5: R_doDotCall (svn/R-devel/src/main/dotcode.c:645)
==1826472==    by 0x49D3E3: do_dotcall (svn/R-devel/src/main/dotcode.c:1281)
==1826472==    by 0x4D34A6: bcEval (svn/R-devel/src/main/eval.c:7115)
==1826472==    by 0x4EFFB7: Rf_eval (svn/R-devel/src/main/eval.c:727)
==1826472==    by 0x4F19CD: R_execClosure (svn/R-devel/src/main/eval.c:1897)
==1826472== 
==1826472== 128,591 bytes in 3,000 blocks are definitely lost in loss record 2,215 of 2,458
==1826472==    at 0x483CAE9: calloc (/builddir/build/BUILD/valgrind-3.16.1/coregrind/m_replacemalloc/vg_replace_malloc.c:760)
==1826472==    by 0x17342FAF: Vocabulary::addWordToVocab(char const*) (packages/tests-vg/doc2vec/src/doc2vec/Vocab.cpp:87)
==1826472==    by 0x17343497: Vocabulary::loadFromTrainFile(char const*) (packages/tests-vg/doc2vec/src/doc2vec/Vocab.cpp:53)
==1826472==    by 0x1734391C: Vocabulary::Vocabulary(char const*, int, bool) (packages/tests-vg/doc2vec/src/doc2vec/Vocab.cpp:15)
==1826472==    by 0x1733B904: Doc2Vec::train(char const*, int, int, int, int, int, int, float, float, int, int, int) (packages/tests-vg/doc2vec/src/doc2vec/Doc2Vec.cpp:84)
==1826472==    by 0x17345C02: paragraph2vec_train(char const*, int, int, int, int, int, int, double, double, int, int, int) (packages/tests-vg/doc2vec/src/rcpp_doc2vec.cpp:15)
==1826472==    by 0x1734E7E6: _doc2vec_paragraph2vec_train (packages/tests-vg/doc2vec/src/RcppExports.cpp:26)
==1826472==    by 0x49CDC5: R_doDotCall (svn/R-devel/src/main/dotcode.c:645)
==1826472==    by 0x49D3E3: do_dotcall (svn/R-devel/src/main/dotcode.c:1281)
==1826472==    by 0x4D34A6: bcEval (svn/R-devel/src/main/eval.c:7115)
==1826472==    by 0x4EFFB7: Rf_eval (svn/R-devel/src/main/eval.c:727)
==1826472==    by 0x4F19CD: R_execClosure (svn/R-devel/src/main/eval.c:1897)
==1826472== 
==1826472== 323,824 (224 direct, 323,600 indirect) bytes in 4 blocks are definitely lost in loss record 2,365 of 2,458
==1826472==    at 0x483AE7D: operator new(unsigned long) (/builddir/build/BUILD/valgrind-3.16.1/coregrind/m_replacemalloc/vg_replace_malloc.c:342)
==1826472==    by 0x1733B621: Doc2Vec::initTrainModelThreads(char const*, int, int) (packages/tests-vg/doc2vec/src/doc2vec/Doc2Vec.cpp:123)
==1826472==    by 0x1733B99D: Doc2Vec::train(char const*, int, int, int, int, int, int, float, float, int, int, int) (packages/tests-vg/doc2vec/src/doc2vec/Doc2Vec.cpp:91)
==1826472==    by 0x17345C02: paragraph2vec_train(char const*, int, int, int, int, int, int, double, double, int, int, int) (packages/tests-vg/doc2vec/src/rcpp_doc2vec.cpp:15)
==1826472==    by 0x1734E7E6: _doc2vec_paragraph2vec_train (packages/tests-vg/doc2vec/src/RcppExports.cpp:26)
==1826472==    by 0x49CDC5: R_doDotCall (svn/R-devel/src/main/dotcode.c:645)
==1826472==    by 0x49D3E3: do_dotcall (svn/R-devel/src/main/dotcode.c:1281)
==1826472==    by 0x4D34A6: bcEval (svn/R-devel/src/main/eval.c:7115)
==1826472==    by 0x4EFFB7: Rf_eval (svn/R-devel/src/main/eval.c:727)
==1826472==    by 0x4F19CD: R_execClosure (svn/R-devel/src/main/eval.c:1897)
==1826472==    by 0x4F26C3: Rf_applyClosure (svn/R-devel/src/main/eval.c:1823)
==1826472==    by 0x4DF55D: bcEval (svn/R-devel/src/main/eval.c:7083)
==1826472== 
==1826472== 531,160 bytes in 13,279 blocks are definitely lost in loss record 2,403 of 2,458
==1826472==    at 0x483CAE9: calloc (/builddir/build/BUILD/valgrind-3.16.1/coregrind/m_replacemalloc/vg_replace_malloc.c:760)
==1826472==    by 0x17343632: Vocabulary::createHuffmanTree() (packages/tests-vg/doc2vec/src/doc2vec/Vocab.cpp:169)
==1826472==    by 0x1733B8DF: Doc2Vec::train(char const*, int, int, int, int, int, int, float, float, int, int, int) (packages/tests-vg/doc2vec/src/doc2vec/Doc2Vec.cpp:83)
==1826472==    by 0x17345C02: paragraph2vec_train(char const*, int, int, int, int, int, int, double, double, int, int, int) (packages/tests-vg/doc2vec/src/rcpp_doc2vec.cpp:15)
==1826472==    by 0x1734E7E6: _doc2vec_paragraph2vec_train (packages/tests-vg/doc2vec/src/RcppExports.cpp:26)
==1826472==    by 0x49CDC5: R_doDotCall (svn/R-devel/src/main/dotcode.c:645)
==1826472==    by 0x49D3E3: do_dotcall (svn/R-devel/src/main/dotcode.c:1281)
==1826472==    by 0x4D34A6: bcEval (svn/R-devel/src/main/eval.c:7115)
==1826472==    by 0x4EFFB7: Rf_eval (svn/R-devel/src/main/eval.c:727)
==1826472==    by 0x4F19CD: R_execClosure (svn/R-devel/src/main/eval.c:1897)
==1826472==    by 0x4F26C3: Rf_applyClosure (svn/R-devel/src/main/eval.c:1823)
==1826472==    by 0x4DF55D: bcEval (svn/R-devel/src/main/eval.c:7083)
==1826472== 
==1826472== 799,848 (24,144 direct, 775,704 indirect) bytes in 1,006 blocks are definitely lost in loss record 2,415 of 2,458
==1826472==    at 0x483AE7D: operator new(unsigned long) (/builddir/build/BUILD/valgrind-3.16.1/coregrind/m_replacemalloc/vg_replace_malloc.c:342)
==1826472==    by 0x17344008: WMD::loadFromDoc2Vec() (packages/tests-vg/doc2vec/src/doc2vec/WMD.cpp:67)
==1826472==    by 0x17345C02: paragraph2vec_train(char const*, int, int, int, int, int, int, double, double, int, int, int) (packages/tests-vg/doc2vec/src/rcpp_doc2vec.cpp:15)
==1826472==    by 0x1734E7E6: _doc2vec_paragraph2vec_train (packages/tests-vg/doc2vec/src/RcppExports.cpp:26)
==1826472==    by 0x49CDC5: R_doDotCall (svn/R-devel/src/main/dotcode.c:645)
==1826472==    by 0x49D3E3: do_dotcall (svn/R-devel/src/main/dotcode.c:1281)
==1826472==    by 0x4D34A6: bcEval (svn/R-devel/src/main/eval.c:7115)
==1826472==    by 0x4EFFB7: Rf_eval (svn/R-devel/src/main/eval.c:727)
==1826472==    by 0x4F19CD: R_execClosure (svn/R-devel/src/main/eval.c:1897)
==1826472==    by 0x4F26C3: Rf_applyClosure (svn/R-devel/src/main/eval.c:1823)
==1826472==    by 0x4DF55D: bcEval (svn/R-devel/src/main/eval.c:7083)
==1826472==    by 0x4EFFB7: Rf_eval (svn/R-devel/src/main/eval.c:727)
==1826472== 
==1826472== 2,109,120 bytes in 13,182 blocks are definitely lost in loss record 2,435 of 2,458
==1826472==    at 0x483CAE9: calloc (/builddir/build/BUILD/valgrind-3.16.1/coregrind/m_replacemalloc/vg_replace_malloc.c:760)
==1826472==    by 0x17343645: Vocabulary::createHuffmanTree() (packages/tests-vg/doc2vec/src/doc2vec/Vocab.cpp:170)
==1826472==    by 0x1733B8DF: Doc2Vec::train(char const*, int, int, int, int, int, int, float, float, int, int, int) (packages/tests-vg/doc2vec/src/doc2vec/Doc2Vec.cpp:83)
==1826472==    by 0x17345C02: paragraph2vec_train(char const*, int, int, int, int, int, int, double, double, int, int, int) (packages/tests-vg/doc2vec/src/rcpp_doc2vec.cpp:15)
==1826472==    by 0x1734E7E6: _doc2vec_paragraph2vec_train (packages/tests-vg/doc2vec/src/RcppExports.cpp:26)
==1826472==    by 0x49CDC5: R_doDotCall (svn/R-devel/src/main/dotcode.c:645)
==1826472==    by 0x49D3E3: do_dotcall (svn/R-devel/src/main/dotcode.c:1281)
==1826472==    by 0x4D34A6: bcEval (svn/R-devel/src/main/eval.c:7115)
==1826472==    by 0x4EFFB7: Rf_eval (svn/R-devel/src/main/eval.c:727)
==1826472==    by 0x4F19CD: R_execClosure (svn/R-devel/src/main/eval.c:1897)
==1826472==    by 0x4F26C3: Rf_applyClosure (svn/R-devel/src/main/eval.c:1823)
==1826472==    by 0x4DF55D: bcEval (svn/R-devel/src/main/eval.c:7083)
==1826472== 
==1826472== LEAK SUMMARY:
==1826472==    definitely lost: 2,954,310 bytes in 45,784 blocks
==1826472==    indirectly lost: 1,099,304 bytes in 4,002 blocks
==1826472==      possibly lost: 125,237 bytes in 1,126 blocks
==1826472==    still reachable: 1,477,273,702 bytes in 56,490 blocks
==1826472==         suppressed: 0 bytes in 0 blocks
==1826472== Reachable blocks (those to which a pointer was found) are not shown.
==1826472== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==1826472== 
==1826472== For lists of detected and suppressed errors, rerun with: -s
==1826472== ERROR SUMMARY: 3041 errors from 22 contexts (suppressed: 0 from 0)
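
All of the leak records above originate in paragraph2vec_train and paragraph2vec_load_model, i.e. the C++ code behind paragraph2vec() and read.paragraph2vec(). A minimal script exercising both code paths, as a sketch for re-running the check locally (the file name leaks.R and the valgrind flags are just an example invocation):

# run as: R -d "valgrind --leak-check=full" -f leaks.R
library(doc2vec)
x <- data.frame(doc_id = sprintf("doc_%s", 1:5),
                text   = c("geld diabetes", "geld koning", "geld",
                           "diabetes koning", "geld diabetes koning"),
                stringsAsFactors = FALSE)
model <- paragraph2vec(x, type = "PV-DM", dim = 5, iter = 1, min_count = 0)  # hits Doc2Vec::train
path  <- tempfile(fileext = ".bin")
write.paragraph2vec(model, file = path)
model <- read.paragraph2vec(file = path)                                     # hits Doc2Vec::load
file.remove(path)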
