Comments (22)

mkleinert commented on June 17, 2024

@bstarling Can you clarify "Can handle volumes provided in background"?

bstarling commented on June 17, 2024

@mkleinert Edited to be more clear. I was referring to this section

  • An average popular news archive is about ~250,000 rows (800 MB of raw JSON, unzipped), adding roughly 75,000-100,000 rows a year.
  • Our largest identified source is a web forum with about 3-5 million historical posts, adding thousands per day (TBD how much of the archive we store in the live data set).

commented on June 17, 2024

So, I'd like to get the conversation started. First, I want to say I would love to help regardless of the technology chosen.

  1. Most of this data, I sense, we would want to keep for a while (possibly permanently); if that's the case, volume would be a strong constraint.
  2. Since we will be retrieving data from lots of sources, variety is a strong consideration as well. Though it looks like we mostly need text data stores, and blobs and images are less of a concern.
  3. One I am not sure about is velocity and the temperature of the data: will we be doing a lot of simultaneous reading and writing, and is there a correlation between the recency of data and its need to be queried?
  4. Will we need "near real-time" or streaming capabilities? That would mean non-blocking asynchronous I/O access, or some additional streaming layer on top.
  5. Complexity of queries (joins, complex aggregations, object-graphs?)
  6. Are we wedded to some cloud provider solution (probably), and are we further constrained, do you think, to AWS (also probably)?

That being said, it sounds like we need something with a very flexible or "schemaless" structure (i.e. schema-on-read), something that is used to being part of an analytics pipeline for both batch and streaming, and something that can deal with JSON natively or near natively. I don't think we need to worry too much about consistency or transactions; let me know if that sounds wrong. My thoughts at this point would be to avoid relational and graph databases for now and focus on document databases or other NoSQL DBs, meaning not simply a filesystem like Hadoop or S3.

That means DynamoDB, HBase, or Mongo kinds of solutions, maybe even Cassandra (though it's a flexible wide-column store rather than a document store). I believe these are all offered by Amazon in some form (except maybe Mongo).

bstarling commented on June 17, 2024
  1. Agree. In discussing our options I think we should not throw out a candidate just because it cannot handle every foreseeable volume requirement. Perhaps we decide to archive anything over 2 years old. I could see a world where we go with the easy solution over storing everything in existence (i.e. forum posts from 2007 are probably not that important).
  2. Good point! There has been talk of storing images, and I can see many use cases for doing that. We could store images on S3 and just provide an asset location in the data store? Not sure what options we have here.
  3. I do not think the data has to be super available. I would say ease of use over pure availability. Even the fastest scrapers produce no more than 250 records a minute.
  4. For the tweets/social media stuff, probably yes, but we have not considered whether tweets and the web data would be stored together or separately.
  5. Right now all the article/text data is in one object; I am not sure what kind of joins would be possible if we wanted to combine Twitter/Facebook/article data together. Hopefully others will chime in.
  6. Only because we already have a process set up, AWS > others, but I think with a good long-term solution we could work with any provider. Managing our own servers and services might be too much for a community/volunteer org though.

Would be good to get some feedback from others who have worked with the data or have a vision for how people within our org (or outside) may use or want to use the data. Tagging in @gati @wwymak @hadoopjax @nataliaking @rachelanddata @zacherybohon to see if they have more to add.

wwymak commented on June 17, 2024

I'm not a data engineer by trade, but from my work here are some opinions I have:

  1. The large volumes of data may incur extra costs, but if we go with a very cheap option for the data that is very far in the past (and zip the stored data!) we don't spend a fortune on storage and it's still there for whoever needs it (e.g. S3 has the Glacier store for archives, which is much cheaper than the higher-availability options).
  2. Since the data we have from different sources is so varied, my personal choice would be NoSQL such as Mongo (AWS doesn't have Mongo as a service, but it is trivial to set one up in a virtual machine). For other blob data such as images it makes more sense to store in S3, since databases aren't really for storing non-standard data -- e.g. in MongoDB each document isn't allowed to exceed about 16 MB.
    In regards to joins etc. in the database -- MongoDB isn't really set up for joins, but if you need to combine datasets from multiple collections, then as long as there are fields that are standardised (e.g. timestamps stored as ISO datetimes) it's not hard to have a layer on top of the database that does these combinations for you (see the sketch after this list). Obviously this may be less optimal than doing it in the database itself...
    If we need text searching, Mongo also has a connector to Elasticsearch (and also to Neo4j if we want to connect our text data to a graph db).
  3. For ease of access -- again, if we are using databases then they should handle simultaneous requests without issues?
  4. Cloud services such as AWS, Google, Azure etc. have fairly similar data store/compute options, e.g. they more or less all have something similar to S3, Lambda, virtual servers etc. So as long as we keep our 'save data to xx' as a separate module in our tools, whoever needs to use them can swap in whatever they need (including streaming to their own local setup). This means we are not tied to one cloud provider.
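To make point 2 concrete, here is a minimal sketch of that "combine collections in a layer above the database" idea using pymongo. The database, collection, and field names are hypothetical; it only assumes the collections share an ISO-format `timestamp` field.

```python
# Sketch only: merge two hypothetical collections into one time-ordered stream,
# relying on a shared ISO-8601 "timestamp" field rather than a database-side join.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed dev instance
db = client["d4d"]                                  # hypothetical database name

def combined_timeline(start_iso, end_iso):
    """Return documents from both collections, sorted by their shared timestamp."""
    query = {"timestamp": {"$gte": start_iso, "$lt": end_iso}}
    docs = [dict(d, source="twitter") for d in db["tweets"].find(query)]
    docs += [dict(d, source="forum") for d in db["forum_posts"].find(query)]
    return sorted(docs, key=lambda d: d["timestamp"])

for doc in combined_timeline("2017-01-01T00:00:00Z", "2017-02-01T00:00:00Z"):
    print(doc["source"], doc["timestamp"])
```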

mkleinert commented on June 17, 2024

I think we are all leaning towards a NoSQL solution, but I don't want to put words in anyone's mouth. That would be my opinion given the requirements. As much as I am not a big fan of Mongo, it may be the best solution. I don't have any exposure to DynamoDB, but that sounds like an equally good option. My only hesitation with Dynamo is that it would tie the solution to AWS.

Also, I am guessing we are looking for something that could possibly support an unknown number and type of use cases. Mongo seems to have become a bit of a standard, and most tools and systems support some type of Mongo connector or integration. But if we do have a few specific use cases, it may be good to outline them now to see if the solution easily accomplishes them.

rachelanddata commented on June 17, 2024

My vote would be for MongoDB. I think it's important to look at future use cases as well. Many use Hadoop for cheap storage/archiving as well as for any kind of ETL/batch processing - so if there's going to be any of that in the future, I would strongly recommend Hadoop with a MongoDB instance. Those familiar with SQL may also much prefer Hadoop's Pig to Mongo's query language (especially for joins), so that could come in handy.

bstarling commented on June 17, 2024

Not a final decision but who has experience installing/configuring mongoDB? If we can get a group together I think it makes sense to do a trial run with a subset of the data. I can provide the AWS instance, just let me know what you need.

From this conversation I also gather we should keep the historic raw files somewhere for future use/loading into other tools. Given how cheap S3 (or even Glacier) is, I think this is definitely doable.

Great input everyone. Let's keep the discussion going.

FWIW I played around with DynamoDB briefly; while it is nice that AWS manages everything, it does look like it could get expensive pretty fast. Probably overkill for our current requirements.

rachelanddata commented on June 17, 2024

bstarling commented on June 17, 2024

Great, thanks @rachelanddata! Anyone else willing to lend a hand? @wwymak perhaps? I had assumed we would run it on an AWS EC2 instance, but are there better options?

wwymak commented on June 17, 2024

Yeah, I am more than happy to help set up and populate the relevant MongoDBs :) If our pipeline is mostly in AWS then it makes sense to use AWS EC2. (We could do something like using a Lambda function triggered every time new data arrives to write it to the database; a rough sketch is below.) There are other options, e.g. mLab, if we really don't want to use AWS ;)
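A minimal sketch of that Lambda idea, assuming an S3 "object created" trigger; the database/collection names and the MONGO_URI environment variable are illustrative, not the project's actual setup:

```python
# Sketch only: Lambda handler fired by an S3 object-created event; it loads the new
# JSON file and inserts its records into MongoDB.
import json
import os

import boto3
from pymongo import MongoClient

s3 = boto3.client("s3")
collection = MongoClient(os.environ["MONGO_URI"])["d4d"]["articles"]  # hypothetical names

def handler(event, context):
    inserted = 0
    for record in event["Records"]:                      # standard S3 event structure
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        docs = json.loads(body)                          # assumes the file is a JSON array
        if isinstance(docs, dict):
            docs = [docs]
        collection.insert_many(docs)
        inserted += len(docs)
    return {"inserted": inserted}
```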

metame commented on June 17, 2024

While I'm not a big fan of Mongo for production apps (although I happen to be a MongoDB Certified Developer 😄), I think this is a pretty good use case for it. And it would allow us to switch hosts if we ever get some donated infrastructure or something like that. Not sure what my other time commitments will allow, but I'd like to help where possible.

rachelanddata commented on June 17, 2024

wwymak commented on June 17, 2024

How do we want to go about this? Maybe @bstarling can set up an EC2 instance, we set up the MongoDB, and we build a parser that saves one of the #far-right JSON files to the Mongo (something like the sketch below)? And also set up e.g. authentication? Then once we have a proof of concept we build a pipeline that automatically streams new data into the db?
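For discussion, a minimal sketch of that proof-of-concept loader; the host, file name, and database/collection names are placeholders, and it assumes the file holds a JSON array of posts:

```python
# Sketch only: load one archive JSON file into MongoDB running on the EC2 instance.
import json

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # replace with the EC2 host; auth TBD
collection = client["d4d"]["far_right"]              # hypothetical db/collection names

with open("far_right_archive.json") as f:            # placeholder file name
    records = json.load(f)                            # assumes a JSON array of post objects

result = collection.insert_many(records)
print(f"inserted {len(result.inserted_ids)} documents")
```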

ASRagab commented on June 17, 2024

I did a very rough proof of concept over in this branch in the discursive repo. It takes the index_twitter_stream flow that writes the file to S3/ES and also dumps the tweets into a MongoDB. My particular instance was a sandboxed env hosted in AWS but managed by mLab; they offer free 0.5 GB sandboxes. As you can see below (hopefully), it looks like it parses the tweet fairly well, even getting the types right, I think (double check). I think maybe the next step is to write some kind of ETL out of S3 and into Mongo?
[screenshot: parsed tweet document in the mLab console, 2017-01-30]
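For discussion, a rough sketch of what that "ETL out of S3 and into Mongo" step could look like as a batch job. The bucket, prefix, and collection names are assumptions, not the project's real values, and it assumes each S3 object holds a JSON array of tweets:

```python
# Sketch only: list tweet files under an S3 prefix and bulk-insert them into MongoDB.
import json

import boto3
from pymongo import MongoClient

s3 = boto3.client("s3")
collection = MongoClient("mongodb://localhost:27017")["d4d"]["tweets"]  # hypothetical names

BUCKET = "d4d-far-right"   # assumed bucket
PREFIX = "tweets/"         # assumed key prefix

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
        tweets = json.loads(body)
        if isinstance(tweets, dict):
            tweets = [tweets]
        collection.insert_many(tweets)
```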

wwymak commented on June 17, 2024

Are we going with just dumping everything from the tweet model into Mongo? If we are going to do that, then I think what we should be saving to S3 is the whole tweet object rather than the model?

Or are we doing any extra parsing on top of the tweet model before saving to Mongo?

In any case @bstarling has set up an EC2 instance and I have got Mongo set up on the default port on it -- let @bstarling know so he can give SSH access and we can test out various processes?

bstarling commented on June 17, 2024

Re the tweet model, I still think it's best to let the user define the data they want to save. For most purposes the main fields are fine; I think the only projects that would be using the D4D data store are the defined long-term collection efforts. Ad hoc exploratory gathering can still go to CSV/JSON/SQLite etc.

Separately, as part of the master plan (evil laugh), we are working on a configurable option which will dump the entire tweet to a local/S3 file (roughly along the lines of the sketch below).
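To illustrate (this is not the actual implementation), one way such a configurable raw dump could look; the config keys and bucket name are made up for the example:

```python
# Sketch only: write the unmodified tweet JSON either to a local file or to S3,
# depending on a config flag.
import json

import boto3

config = {"dump_full_tweet": True, "dump_target": "s3", "bucket": "d4d-raw-tweets"}  # hypothetical

def dump_raw_tweet(tweet: dict, key: str) -> None:
    if not config["dump_full_tweet"]:
        return
    payload = json.dumps(tweet)
    if config["dump_target"] == "s3":
        boto3.client("s3").put_object(Bucket=config["bucket"], Key=key, Body=payload)
    else:
        with open(key.replace("/", "_"), "w") as f:   # simple local fallback
            f.write(payload)
```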

mkleinert commented on June 17, 2024

@bstarling should we add pipeline code in the scraping code for Mongo? Also, should the solution be something like Docker container(s) or Puppet/Chef scripts? That way it takes some of the work off whatever users need to do to get this up and running.

bstarling commented on June 17, 2024

For the spiders we'll need a more central infrastructure. We should get to a point where we have deploy scripts, but I don't know if we need to make it so friendly that anyone can create an instance. We definitely have the option of piping stuff straight into Mongo using pipelines (I've already done it with DynamoDB); a rough sketch of what that could look like is below. Whether we go that route or batch/file load later is still TBD.
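A minimal sketch, assuming Scrapy plus pymongo, of such an item pipeline; the MONGO_URI/MONGO_DATABASE setting names and the one-collection-per-spider convention are assumptions for illustration:

```python
# Sketch only: Scrapy item pipeline that inserts each scraped item into MongoDB.
import pymongo

class MongoPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI / MONGO_DATABASE are assumed project settings
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://localhost:27017"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "d4d"),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[spider.name].insert_one(dict(item))   # one collection per spider
        return item
```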

kgorman commented on June 17, 2024

Hey guys, @gati pinged me on this, I thought I would chime in a bit, hopefully helpful.

First and foremost I am excited to help out with the project in any capacity. Good work this.

My company (http://www.eventador.io/) is happy to donate a production-quality cluster to the cause. I suspect it will more than handle the archive/query needs as well as provide the easy real-time capabilities that were mentioned above. We have notebooks built in, and a SQL backend (or MongoDB or Elastic if you want).

To be quite honest, we would love to partner so we can get more real-world use cases and you guys can stretch our thinking and make us better at what we do. We are passionate about data, and this project is awesome. Hopefully win/win.

We are a real-time data processing platform, which, by design, is a superset of simply storing data. To start, you can immediately put data in the front door in JSON format, and it would flow into PostgreSQL/PipelineDB. If you really wanted MongoDB we could make that happen; we have a history with it. ;-) We have built-in Jupyter notebooks as well. Also, it's secured via IP whitelist, so no open-door MongoDB mistakes. We are currently adding Apache Storm compatibility to our platform, so you guys can use that too.

That said, if the relationship didn't work out, the data is in MongoDB format or whatever, and you can simply move it somewhere else. The project owns the data, as any customer would.

So if you guys think it's a fit then we are game. If not, no worries, I still want to help!

bstarling commented on June 17, 2024

Taking @kgorman up on this awesome offer. Join us in #eventador if you'd like to participate.

bstarling commented on June 17, 2024

Closing this as it looks like Eventador is our path forward.
