submitted 4 months ago by Dust0741@lemmy.world to c/selfhost@lemmy.ml

18 comments fedilink hide all child comments

Say I have a large txt or CSV file with data I want to search. And say I have several files.

What is the best way to index and make this data searchable? I've been using grep, but it is not ideal.

Is there any self hostable docker container for indexing and searching this? Or maybe should I use SQL?

top 18 comments

sorted by: hot top controversial new old

[-] h0bbl3s@lemmy.world 13 points 4 months ago* (last edited 4 months ago)

You can import CSV files directly into an SQLite database. Then you are free to do whatever sql searches or manipulations you want. Python and golang both have great SQLite libraries from personal experience, but I'd be surprised if there is any language that doesn't have a decent one.

If you are running Linux most distros have an SQLite gui browser in the repos that is pretty powerful.

[-] Ephera@lemmy.ml 1 points 4 months ago

I'd be surprised if there is any language that doesn't have a decent one.

Yeah, SQLite provides a library implemented in C. Because C doesn't require a runtime, it's possible for other languages to call into this C library. All you need is a relatively thin wrapper library, which provides an API that feels good to use in the respective language.

[-] Zannsolo@lemmy.world 5 points 4 months ago

Import it into access(Satan whispers quietly into you ear)

[-] eager_eagle@lemmy.world 5 points 4 months ago* (last edited 4 months ago)

Excel / OnlyOffice?

I love self-hosted tools, but you can do a lot on a spreadsheet.

Btw, if the files are not too large, you can query them using SQL without even hosting a database just by using Pandas. This avoids the problem of updating entries and handling migrations in case the CSVs change over time.

[-] Dust0741@lemmy.world 2 points 4 months ago

Files won't change and are hundreds of GBs

[-] brownmustardminion@lemmy.ml 1 points 4 months ago

Are they roughly 55GB compressed?

[-] Dust0741@lemmy.world 2 points 4 months ago

👀

[-] brownmustardminion@lemmy.ml 1 points 4 months ago

🕵️

[-] GissaMittJobb@lemmy.ml 1 points 4 months ago

Spill the beans!

[-] morbidcactus@lemmy.ca 1 points 4 months ago* (last edited 4 months ago)

Could use Polars, afaik it supports streaming from CSVs too, and frankly the syntax is so much nicer than pandas coming from spark land.

Do you need to persist? What are you doing with them? A really common pattern for analytics is landing those in something like Parquet, Delta, less frequently seen Avro or ORC and then working right off that. If they don't change, it's an option. 100 gigs of CSVs will take some time to write to a database depending on resources, tools, db flavour, tbf writing into a compressed format takes time too, but saves you managing databases (unless you want to, just presenting some alternates)

Could look at a document db, again, will take time to ingest and index, but definitely another tool, I've touched elastic and stood up mongo before, but Solr is around and built on top of lucene which I knew elastic was but apparently so is mongo.

Edit: searchable? I'd look into a document db, it's quite literally what they're meant for, all of those I mentioned are used for enterprise search.

[-] eager_eagle@lemmy.world 1 points 4 months ago

ok, database it is then

[-] helios@social.ggbox.fr 4 points 4 months ago

Importing that data into a RDBMS would be ideal. I'd use PostgreSQL for this but any other would work.

[-] morbidcactus@lemmy.ca 3 points 4 months ago* (last edited 4 months ago)

Postgres runs well in a container in my experience and is nice to work with, def support that. I know sqlite works well, no complaints from me

[-] F04118F@feddit.nl 3 points 4 months ago

Why bother setting up a hosted DB server when you can get all of the RDBMS optimizations in an in-process service? DuckDB is pretty cool

https://duckdb.org/

[-] AnnaFrankfurter@lemmy.ml 3 points 4 months ago* (last edited 4 months ago)

Depends on the size of data, use case like will you be doing any constant updates to it or just reading, you mentioned you have several files so do you need joins if so what will be an approx max number of joins you'll be doing on per query basis, I guess you said CSV so I'm assuming it is structured data and not semi structured or unstructured.

Few more questions, do you need a fast indexing but are not planning on doing any complex operations, areyoiu going to do a lot of OLTP operations and you need ACID. Or are you going OLAP route. are you planning on distributed database if so then which 2 do you want from CAP, do you want batch processing or stream processing,

I've few dozen other questions also

[-] django@discuss.tchncs.de 3 points 4 months ago

Nushell can be helpful to sift through structured data: https://www.nushell.sh/

You can just open a csv file and filter and select what you want.

[-] megaman@discuss.tchncs.de 2 points 4 months ago

Datasette is a neat tool intended to publish static data in a sqlite database on the web with a helpful gui and a bunch of extensions available. I havent come across a good enough reason to do it myself, but may do what you want.

You can spin it up locally and it wont be on the web at all, just accessed via your browser if thats what you want.

[-] some_guy@lemmy.sdf.org 1 points 4 months ago

Perhaps this:

https://csvkit.readthedocs.io/en/latest/index.html

this post was submitted on 20 Aug 2024

20 points (100.0% liked)

Self Hosted - Self-hosting your services.

11582 readers

3 users here now

A place to share alternatives to popular online services that can be self-hosted without giving up privacy or locking you into a service you don't control.

Rules

No harassment
crossposts from c/Open Source & c/docker & related may be allowed, depending on context
Video Promoting is allowed if is within the topic.
No spamming.
Stay friendly.
Follow the lemmy.ml instance rules.
Tag your post. (Read under)

Important

Beginning of January 1st 2024 this rule WILL be enforced. Posts that are not tagged will be warned and if not fixed within 24h then removed!

Lemmy doesn't have tags yet, so mark it with [Question], [Help], [Project], [Other], [Promoting] or other you may think is appropriate.

Cross-posting

!everything_git@lemmy.ml is allowed!
!docker@lemmy.ml is allowed!
!portainer@lemmy.ml is allowed!
!fediverse@lemmy.ml is allowed if topic has to do with selfhosting.
!selfhosted@lemmy.ml is allowed!

If you see a rule-breaker please DM the mods!

founded 3 years ago

MODERATORS

Zoe8338@lemmy.ml

dogmuffins@lemmy.ml

testman@lemmy.ml