
Something I'm going to be meditating on, that I wouldn't mind hearing from others, is thinking about a good way to consistently divide up books into segments that are smaller than chapters but bigger than paragraphs.

Some books, especially some technical ones, are pretty convenient and divide into sections that are quite small. Many programming books will have a section that is three paragraphs and a code block. It's very clear when one concept starts and when it ends.

A more standard book just has 30 or 40 paragraphs to a chapter with no significant segmentation.

Why is this relevant? Because I want to put a lot of books into RAG. I'm a believer in the idea that the less material you put into context, and the more strictly relevant it is, the higher the quality of the output. I also believe that the user needs to maintain a certain percentage of the context to really be in charge of where an LLM goes. This is why too much output (which becomes context) and too much inclusion of reference text are bad things. You can overdo it.

If a person asked you a question about a book, what its opinion on X was, and you emailed them back, you wouldn't give them the whole chapter. You are more likely to give them a three-paragraph segment.

So I think it's an interesting fuzzy software problem, which seems to be the new area that humanity is trying to tackle. How would you segment a lot of text so that every segment is as small as possible while still focused on one idea, but not so small that it feels disjointed and stripped of context?

[-]x0x7

What I came up with while I was sleeping, in order of implementation complexity:

Method 1: Staggered segmenting

This just means picking a chunk size and then advancing the window by half a chunk. Any text that was caught in a seam will be in the middle of the next window. Kind of taken from the fast Fourier transform, except that one offsets by a quarter and covers the whole content.

Cons: The grammar issues caused by the abrupt cut-off might wig out the AI. Oh boy, is it eager to correct things or complete things. When the content is added to the modified prompt, your prompt is going to have some incomplete words, sentences, paragraphs, and thoughts. The AI might pay more attention to that than it should.

Mitigation: Round up to the end of a paragraph
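
A rough sketch of this in Python. The 2000-character chunk size and the blank-line paragraph convention are assumptions, not anything tested:

    def staggered_chunks(text, chunk_size=2000):
        """Method 1: overlapping windows. Each window advances by half
        the chunk size, so text caught at one seam lands mid-window in
        the next pass."""
        chunks = []
        start = 0
        while start < len(text):
            end = start + chunk_size
            # Mitigation: round the cut up to the end of a paragraph,
            # assuming paragraphs are separated by blank lines.
            para_end = text.find("\n\n", end)
            end = para_end if para_end != -1 else len(text)
            chunks.append(text[start:end])
            start += chunk_size // 2  # the half-chunk stagger
        return chunks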

Method 2: Just have AI do it

Literally have a normal AI identify what it thinks are useful cuts. Likely the way it wants to generate that content is by repeating all of the content and then indicating a cut, but with the right proompting it could do it. You do still have to pre-cut the text into segments that the AI can repeat, but those can (and would have to be) larger than your final excerpt size.

Cons: You are likely going to regenerate the whole book verbatim. Cost-wise that's probably still OK.

Mitigation: Label each paragraph with a letter, and ask it between which letters the most sensible cuts fall.
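
That labeling trick might look something like this; the prompt wording is just a guess at what would work:

    import string

    def lettered_cut_prompt(paragraphs):
        """Method 2 mitigation: label paragraphs A, B, C... and ask the
        model for cut points between labels, so it never has to repeat
        the text. Assumes at most 26 paragraphs per batch."""
        labeled = "\n\n".join(
            f"[{letter}] {para}"
            for letter, para in zip(string.ascii_uppercase, paragraphs)
        )
        return (
            "Each paragraph below is labeled with a letter. Between which "
            "letters are the most sensible topic cuts? Answer only with "
            "pairs like C/D, one per line.\n\n" + labeled
        )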

Method 3: Hierarchy and filtering

In this method you make embeddings at different scales. To get to the excerpt you want, the key and the value need to match the whole way down. For example, you might make an embedding of a whole book (I wonder how expensive that is? Definitely cheaper than regenerating it, by something like 1000x). Then you have a library of books. You take the top 20% that match and throw out the rest. Now you have a set of chapters, each with embeddings representing them. You take the top 20% that match best. Now you match against paragraphs, and take the 20% that match best (or fewer). Then you might expand to neighboring paragraphs for context. And then you might also go down to the sentence level: just the sentences that really relate.

Tuning the selection factor would be critical, or you might have an exploding number of things to look at. But I suppose you are actually guaranteed to be doing fewer comparisons than you would against one giant flat database of excerpts.

Cons: Might produce too many results. Makes the database more complicated. Making embeddings over huge amounts of text will cost O(n^2), since attention scales quadratically with input length. Another con is that there is text that will never be selected. If a paragraph matches well against your query but its book or chapter doesn't, that text is as good as dead. Maybe it serves the author right for having text that doesn't match the broader theme of their book.

Mitigation: For the most unwieldy levels of text it might be possible to just sum or average the embeddings of the sub-components.
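
A sketch of the cascade; the nested (embedding, children) structure, plain cosine similarity, and the 20% keep-fraction are all assumptions for illustration:

    import numpy as np

    def top_fraction(items, query_vec, frac=0.2):
        """Keep the top `frac` of (embedding, payload) pairs ranked by
        cosine similarity to the query vector."""
        qn = query_vec / np.linalg.norm(query_vec)
        ranked = sorted(
            items,
            key=lambda it: float(it[0] @ qn) / np.linalg.norm(it[0]),
            reverse=True,
        )
        return ranked[: max(1, int(len(ranked) * frac))]

    def hierarchical_search(books, query_vec):
        """Method 3: filter books -> chapters -> paragraphs, keeping the
        top 20% at each level. Each level is a list of (embedding,
        children); a paragraph's children is just its text. Mitigation:
        a book's embedding can be the mean of its chapter embeddings
        rather than one embedding of the whole text at once."""
        hits = []
        for _, chapters in top_fraction(books, query_vec):
            for _, paragraphs in top_fraction(chapters, query_vec):
                hits.extend(text for _, text in top_fraction(paragraphs, query_vec))
        return hits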

Method 4: Trained model and transformers fork

This is obviously the hardest to implement. What you do is add null separator tokens between each normal token. The fork of transformers is because you want to adjust the positional embeddings of these separator tokens to be half-integer. If you don't know, these large language models deal with token position by passing the position index into a function P that returns a vector. One of the simpler versions of this kind of function just returns sin(range(0,n)*idx). The result is a vector that is more likely to have a positive dot product with a neighboring position's vector. Without this positional embedding vector added to a token's initial vector, it wouldn't matter what order tokens are in; that order-blind setup is called the "bag of words" model, and the positional embedding is what fixes it.

The problem is that these models learn the significance of two tokens being right next to each other. Let's say we have the token series ('trans','former') vs ('trans',<null>,'former'). For the model to understand that trans and former create an actual word that has a particular meaning, it has to understand the significance of those two being right next to each other. So if you split them up with a null token, it's going to completely fuck with a pre-trained model. Giving the separators half-integer positions leaves every real token at its original integer position, which is the whole point of the fork.

OK. Now the point. The goal is to train the model to fill in these null tokens with a vector that points in the "separator" direction, with a strength that indicates the separator strength: 0 if it is inside a word, 1 if it is between two words, 5 for a sentence, 25 for a paragraph, 125 for a chapter. Now you have something in which you can look for separators that fall somewhere between a chapter and a paragraph. The "separator" direction can also be a learned value, to pick a direction the model works with well.
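
For reference, the standard sinusoidal positional function looks roughly like this sketch; evaluating it at half-integer positions for separator tokens is this method's proposal, not an existing transformers feature:

    import numpy as np

    def positional_encoding(pos, d_model=64):
        """Standard sinusoidal positional encoding. `pos` is normally an
        integer token index; giving separators half-integer positions
        (e.g. 2.5, between tokens 2 and 3) is this method's tweak."""
        i = np.arange(d_model // 2)
        angles = pos / 10000 ** (2 * i / d_model)
        return np.concatenate([np.sin(angles), np.cos(angles)])

    token_a   = positional_encoding(2.0)
    separator = positional_encoding(2.5)  # the inserted null separator token
    token_b   = positional_encoding(3.0)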

Then you just do binary division on the text: split at the largest separator. If either half is still too large for your target, split at the biggest separator within that half, and so on.
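
The division step is simple once you have per-gap separator strengths; the scores here would come from the trained model, and everything else is just recursion:

    def binary_split(scores, max_len, lo=0, hi=None):
        """Method 4's final step. scores[i] is the predicted separator
        strength between token i and token i+1 (0 inside a word, 1
        between words, 5 sentence, 25 paragraph, 125 chapter). Cut at
        the strongest separator, recurse until segments fit max_len."""
        if hi is None:
            hi = len(scores) + 1  # total token count
        if hi - lo <= max_len:
            return [(lo, hi)]
        # strongest separator strictly inside the current span
        gap = max(range(lo, hi - 1), key=lambda i: scores[i])
        return (binary_split(scores, max_len, lo, gap + 1)
                + binary_split(scores, max_len, gap + 1, hi))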

Cons: Needing to override a model's positional embedding function to handle half-integer positions. Actual training. Adding null vectors or learned initial vectors might entirely fuck with a language model.

Mitigation strategy: Don't do it.

Picks

I like methods 1 and 3.

[-]JasonCarswell

LOL. I wrote a whole thing about overlapping, then went back to find your first idea was essentially the same, but shorter.

I also like 1 and 3 (they made more sense to me), but I wonder: why not try them all, then combine them as layers?

Just as the human mind has the conscious mind, ego, id, subconscious, etc., these layers of processing might stack up to a "superconscious".

In this way, ideally, if one layer conflicts with the others, more analysis can occur to resolve the issue, or to recognize that there's a paradox, or that sometimes there is no correct or single answer. Also, if you're looking for an exact quote, it won't be lost in the summary compression but can be found by another layer.

Wikipedia
In addition to books, it would be great to get a scrape of Wikipedia. A decade ago it was a thing to have it on a thumb drive. I might even still have a copy somewhere.

Subtitles
There are also sites that share subtitles for movies. Some are good, some are not, but it's all better than nothing. I've wanted to scrape them all to be able to search through them for quotes with timestamps (to save time looking for "that quote"). A problem is that not all subtitles are contiguous, so a line may be divided across time.

Auto-transcribe
YouTube has improved A LOT with their auto-generated transcriptions - but they're still not perfect, don't seem to consider context, and, worst of all, don't bother with any sentence structure. Again, better than nothing. I have dozens of hard drives full of earlier YouTube videos. Many are now censored or lost to time, most virtually worthless, but there might be some gold in those mountains. I stopped hoarding the way I once did after YouTube-DLG was subverted. I still save what I can, but I am by no means the huge digital hoarder that I once was. The best I've done for a few years is set up monthly YouTube playlists like "2025-01 To Download" which include everything I've seen and meant to see. Unfortunately, anything that's been removed/hidden, voluntarily or not, can't ever be downloaded - such as early COVID stuff that got censored.

[-]LarrySwinger

I like methods 1 and 3.

Are those your High Quality Paragraphing Picks?

[-]JasonCarswell

Speaking theoretically and abstractly, without any hands-on A.I. experience (easier said than done)...

Consider a different matrix of information, with branches of fields and roots of immutable original sources. On Wikipedia you have articles, sections, and subsections. Some sections/subsections have "Main article: link to topical article". Before any sections, at the top of every article, you have the introductory summary of key points, akin to a TL;DR.

I'd consider breaking up your chapters like that.

It could be helpful if each section and subsection had its own summary. Often the sub/section is already short, even a single line. Thus questions arise. Is it short enough? Is it comprehensible? Should a short sub/section also be summarized? How long would a sub/section need to be to warrant summarizing? How much does the complexity of the concept influence things? Hard numbers or grey areas?

If your A.I. can generate summaries about this original material, it might be worth always checking and tweaking these summaries to ever evolve and improve. You might want it constantly working over older material when not preoccupied with prioritized tasks, at which time it would work over the material at hand to verify and improve it if possible. You don't want to waste time and energy, so it might be worth making a system to decide when and how deep this check-and-tweak process is employed rather than just referencing: if the subject matter has nothing to do with the task, skip it; if it's close, drill down a bit; if it's on target, then verify as much as possible. Perhaps build variable efficiency and/or depth options and sliders (a grade-school student wouldn't need the depth that a university student would).

I'd imagine these summaries might be further boiled down into meta-concepts or even key-words and meta-tags for faster processing, likely with another interconnected mesh layer.

Just as the human mind has the conscious mind, ego, id, subconscious, etc., these layers of processing might stack up to a "superconscious".

I trust you've explored WikiData?

I've torrented millions of books, filling a 4 TB drive, if ever you need them.

[-]x0x7

I agree. That would be the ideal way to have text split up. I guess the part that I missed saying explicitly was to do the splitting without significant human labor.

I have four methods I've come up with since I've slept on it. Yes I would love to deploy some of these on your books. Especially because your books are probably exactly the ones I want. We could make a RAG server.

[-]JasonCarswell

I guess the part that I missed saying explicitly was to do the splitting without significant human labor.

Agreed. I said... [but neglected to include]

If your A.I. can generate summaries about this original material, it might be worth always [having the A.I.] checking and tweaking these summaries to ever evolve and improve. [No human would be expected to check summaries for all of Wikipedia or the books.]

I'm assuming that the more contextual understanding it gets, the better it might get at improving the summaries and its general understanding: a cyclical process that includes reviewing and revising.

This reminds me of the last few steps of my friend Ray's Hierarchy of Improvement: 1) Enthusiastic, 2) Energetic, 3) Effective, 4) Efficient, 5) Elegant. This may apply to organizing Freedom Movement stuff, A.I., or anything.

I suspect more surveillance and guidance might be necessary at the beginning of this A.I. development.

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.

At my old place, d3rr, @LarrySwinger, and I had set up a Lubuntu server named "Cassandra" on an old PC I'd used for directing animation at a studio in Vancouver around 2008-2009. I still have it. It hosted RabbitHole.wf until something happened and Larry couldn't remotely access it, nor wanted to troubleshoot it over that time. It still works; it ran for many years, essentially unused. I put a few empty 3 TB drives in it and could put my books drive in there too - and ideally SnapRAID them. We could fix Cassandra or install another setup on another drive (let Larry scoop out his old files, and maybe I could back up/retrieve my old GiraffeIdeas.wiki) and start fresh. The biggest difference now is that I'm only on WiFi with fiber here, versus my old wired setup.

Or I could get a newer refurbished box that we can better customize.

[-]LarrySwinger

Get a wired connection again; it'll be useful.

[-]JasonCarswell

If we're serious, and we have strong reason to be, I could probably set up the new or old server by the router. But it wouldn't have a screen and keyboard, and I wouldn't be able to access it much. Or maybe I could drop a line down from Luke's room, where the router is, to the basement. Sunday mornings everyone's at church, and I could check the wiring configuration and potential spaces.

KLEP for Text Segmentation

The Key Lock Executable Process (KLEP) can serve as a novel framework for dynamically segmenting text. By using words as "keys" to agitate "locks" and validate an "executable," this approach offers a causal, context-aware method for dividing text into coherent, meaningful chunks.


Core Concepts

Keys (Words):

Words act as stimuli, triggering responses within the segmentation process. Certain words or combinations of words increase agitation for specific locks. For example:

  • Transition words (however, in contrast, firstly) may indicate shifts in ideas.
  • Thematic clusters (syntax, function, class for coding topics) build momentum toward validating a specific lock.

Locks (Thresholds/Ideas):

Locks represent concept thresholds. They are agitated when relevant keys accumulate within a given span of text. Once agitation crosses a defined threshold, the lock validates, signaling a logical segmentation point. Locks can track:

  • Thematic consistency: Are enough related words clustering in this region?
  • Transitions: Do keys suggest the beginning or end of a concept?
  • Density: Is the current segment too dense or too sparse?

Locks aren’t static; they adapt to the text’s structure. For example, dense technical material might require more frequent segmentation, while narrative text might tolerate longer sections.

Executables (Segmentation Functions):

Once a lock validates, an executable fires, segmenting the text at the current position. The segmentation function doesn’t simply mark a cut; it also releases new keys to influence subsequent locks. This creates a causal chain, where each segment dynamically affects the next.


The KLEP Process in Action

Step 1: Scanning

As the text is scanned word by word, keys accumulate relevance for various locks. For instance:

  • In a book about programming, encountering syntax, function, and class agitates a “Code” lock.
  • Encountering however or on the other hand agitates a “Transition” lock.

Step 2: Evaluation

Each lock tracks its agitation level. Once a lock’s threshold is crossed (e.g., enough relevant keys are accumulated, or the presence of a clear transition signal is detected), it validates.

Step 3: Segmentation

When a lock validates, the segmentation function:

  • Divides the text at the current position.
  • Propagates any relevant keys to the next segment, ensuring context continuity.
  • Resets the lock, ready for the next section of text.
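
A toy sketch of that scan/evaluate/segment loop; the single "Transition" lock, its threshold, the decay rate, and the key list are all illustrative assumptions rather than anything specified by KLEP:

    TRANSITION_KEYS = {"however", "firstly", "meanwhile", "conversely"}  # illustrative

    def klep_segment(words, threshold=5.0, decay=0.9, min_len=20):
        """Toy KLEP loop with one 'Transition' lock. Keys agitate the
        lock; when agitation crosses the threshold the executable fires:
        it cuts the text, propagates the last few words forward as
        released keys, and resets the lock."""
        segments, current, agitation = [], [], 0.0
        for word in words:
            current.append(word)
            agitation *= decay  # old agitation fades as the scan moves on
            if word.lower().strip(".,;:!?") in TRANSITION_KEYS:
                agitation += 3.0  # a key agitates the lock
            if agitation >= threshold and len(current) >= min_len:
                segments.append(" ".join(current))
                current = current[-5:]  # propagate keys for continuity
                agitation = 0.0  # reset the lock
        if current:
            segments.append(" ".join(current))
        return segments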

Why KLEP for Segmentation?

Context-Awareness:

KLEP mirrors how humans read and process text, focusing on logical divisions rather than arbitrary cuts. By dynamically responding to keys, it preserves the natural flow of ideas.

Dynamic Thresholds:

Locks can adjust thresholds based on the text’s density, ensuring that segmentation is tailored to the content. Dense sections might produce shorter segments, while narrative passages remain longer.

Causal Flow:

The system creates a chain of causality. Each segment influences the next by propagating keys, allowing the segmentation process to remain context-sensitive across the entire text.


Practical Refinements

  1. Overlapping Locks: To preserve context at segment boundaries, overlapping locks can ensure that information at the edges of one segment is available in the next.

  2. Hierarchical Locks: Locks can exist at multiple levels (e.g., paragraph, section, chapter). High-level locks validate based on the agitation of their subordinate locks, creating a hierarchy of segmentation (a toy sketch follows this list).

  3. Dynamic Adjustment: Locks and thresholds can be tuned based on metadata (e.g., average paragraph length, sentence density) to adapt to different writing styles and content types.
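
A toy sketch of refinement 2, hierarchical locks; the class shape, the thresholds, and the summing rule are illustrative assumptions:

    class Lock:
        """Toy lock: accumulates agitation; a parent lock validates on
        the combined agitation of itself and its subordinate locks."""
        def __init__(self, threshold, children=()):
            self.threshold = threshold
            self.children = list(children)
            self.agitation = 0.0

        def agitate(self, amount):
            self.agitation += amount

        def validated(self):
            total = self.agitation + sum(c.agitation for c in self.children)
            return total >= self.threshold

    # Hypothetical hierarchy: paragraph locks feed a section lock, so a
    # section boundary needs broader agitation than any single paragraph.
    paragraph_locks = [Lock(threshold=5.0) for _ in range(3)]
    section_lock = Lock(threshold=12.0, children=paragraph_locks)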


Applications

  1. Retrieval-Augmented Generation (RAG): Segmentation ensures that retrieved text is both focused and relevant, minimizing irrelevant context while maximizing clarity.

  2. Efficient Embedding: By segmenting at logical points, embeddings can be generated for coherent chunks, improving the accuracy of similarity searches.

  3. Dynamic Querying: The causal, lock-driven nature of segmentation supports context-sensitive querying, allowing users to retrieve highly specific information without losing broader relevance.


Conclusion

KLEP offers a dynamic, human-like approach to text segmentation. By leveraging words as keys, thresholds as locks, and causal executables for segmentation, it creates coherent, context-aware divisions of text. This method balances precision and flexibility, ensuring that segmented text maintains its logical structure while remaining computationally efficient. Whether for AI-driven retrieval or text analysis, KLEP’s process aligns with how humans naturally interpret and divide information.