Guide for Causal Map Workflows
Caching Cheat Sheet 📝
What is Caching?
Caching saves results of operations so they can be reused later without recalculating everything. Think of it like saving a draft of your work!
Types of Caches in the App
🏃♂️ Local Lines Cache
- Temporarily stores results of specific commands during your session
- Makes your workflow faster by not repeating calculations
- Clears when you close the app or click "Clear Cache" (first yellow broom)
📝 Local Cache: Named Results
- Used when you use the store xyz and recall xyz commands
- Saves specific results you want to reuse further down in your workflow with unstore or recall
- Only available during your current session
💾 Remote Lines Cache
- Saves results across sessions
- Whenever we save to the local lines cache, we also save to the remote lines cache
- When you come back tomorrow, previously calculated results are still available
- Helps avoid recalculating common operations
- Remote cache is fast, but local cache is faster (which is why we have it)
🤖 Prompt Cache
- Saves responses from AI prompts
- Helps avoid making the same AI calls multiple times
- Especially useful for expensive or time-consuming AI operations
- Everything has to be the same (the preceding table and the prompt), but statement numbers can differ
- If you run a prompt for table T for statements 1-10, then again for statements 1-20, the cache for the first 10 lines will be re-used
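To make the per-statement reuse concrete, here is a minimal sketch, not the app's actual code, of a prompt cache keyed on the preceding table, the prompt and each individual statement (the names prompt_cache_key, run_prompt and call_llm are illustrative):

```python
import hashlib

def prompt_cache_key(table_fingerprint: str, prompt: str, statement: str) -> str:
    """Same preceding table + same prompt + same statement text => same key,
    regardless of which batch of statement numbers the statement arrived in."""
    raw = "\x1f".join([table_fingerprint, prompt, statement])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

cache: dict[str, str] = {}

def run_prompt(table_fingerprint, prompt, statements, call_llm):
    out = []
    for s in statements:
        key = prompt_cache_key(table_fingerprint, prompt, s)
        if key not in cache:          # statements cached in an earlier run are reused here
            cache[key] = call_llm(prompt, s)
        out.append(cache[key])
    return out
```

Because the key ignores which batch a statement arrived in, rerunning the prompt for statements 1-20 only pays for statements 11-20.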
🧮 Embeddings Cache
- Stores embeddings for text analysis, which we have to "buy" from OpenAI
- Used for similarity searches, clustering and magnets
- Faster and cheaper than ordering the embeddings again from OpenAI
⬆️ Uploads
- Not really a cache because you need to use it manually.
- You can use it to save intermediate results like the result of coding
- upload project|foo uploads the previous result under the key project|foo
- Add overwrite=true if there is an existing result under the same key
- Tip: delete the overwrite=true afterwards
Magic fences
Code within magic fences is completely ignored at all times UNLESS you click a row inside the fences, in which case everything runs as if the fence above was not there.
We should show a confirmation alert if someone clicks a gutter WITHIN a fenced block; they have to confirm before the gutter click is sent to the backend. This is a purely front-end trick.
Keys
- Each cached result has a "key" or "label" which we use to fetch the result from the cache
- For the local and remote 'lines' caches, the key is all the commands up to the current line concatenated into one long string, after deleting any blocks which are irrelevant (i.e. do not store anything we need). So if you change an irrelevant line above the current line, the key stays the same and we can re-use the cache (see the sketch below)
- For the prompt cache the key is the table row and the prompt
- For the embeddings cache the key is just the text for which we want the embeddings
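A minimal sketch of how such a lines-cache key could be assembled (the is_relevant test and the hashing choice are assumptions for illustration, not the backend's real key builder):

```python
import hashlib

def lines_cache_key(commands: list[str], is_relevant) -> str:
    """Concatenate all commands up to the current line, dropping blocks judged
    irrelevant, then hash; editing an irrelevant line leaves the key unchanged."""
    relevant = [c for c in commands if is_relevant(c)]
    return hashlib.sha256("\n".join(relevant).encode("utf-8")).hexdigest()

# e.g. with comment lines treated as irrelevant, these two keys are identical:
not_comment = lambda c: not c.strip().startswith("#")
assert (lines_cache_key(["get foobar", "# a note", "links"], not_comment)
        == lines_cache_key(["get foobar", "links"], not_comment))
```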
Clear Local Cache Button (with broom icon) ⚡
- Just deletes all local cache (named and lines results) but not remote.
Clear Current Line Cache Button (with eraser icon) ⚡
- Deletes local AND remote cache for the current line (the blue-grey line where the CURSOR is in the editor, not necessarily the same as the green last-clicked line). Useful e.g. for a line like
download ... when you have updated the upload in the interim. NEW: also works if the user makes a multi-line selection: for EACH line in the selection, delete local and remote cache.
Cache Toggle Button ⚡
Located next to the "Clear Cache" button:
- ON (default): Uses cached results when available (both line-based caching and named results from store/recall)
- OFF: Recalculates everything fresh, ignoring both the automatic line-based cache and the named results cache
- You would only ever use OFF if you think the old results are wrong. This can only happen when:
  - The app code has been updated
  - Your commands work on uploaded files or CM3 files that might have changed
  - You need to refresh stored/named results (from store/recall commands)
With the cache toggle OFF, the behavior for named results (e.g., from store xyz and recall xyz) is as follows:
- Any named results stored before the current processing run (or before the toggle was set to OFF) are ignored.
- However, if a named result is created by a store command during the current processing run (while the toggle is OFF), it is available for subsequent commands like recall or append within that same run. These freshly created named results are held temporarily in memory for the current run only.
- For line-based caching, when OFF, results are always recalculated (retrieval is skipped), but these fresh results are still saved to the cache (local and remote), overwriting existing entries.
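A reduced sketch of this toggle logic (execute, make_key and the cache objects are stand-ins for the real backend functions):

```python
def run_line(line, cache_on, line_cache, named_store, execute, make_key):
    """Sketch only: named_store starts empty for the run, so store/recall still
    work for results created during the run even while the toggle is OFF."""
    key = make_key(line)
    if cache_on:
        hit = line_cache.get(key)        # local first, then remote in the real app
        if hit is not None:
            return hit
    result = execute(line, named_store)  # store/recall read and write named_store
    line_cache[key] = result             # fresh results are saved even when OFF
    return result
```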
Tips 💡
- The cache automatically knows when to update if you change your commands
- Use "Clear Cache" if you want to start fresh
- Caching helps make your workflow faster and more efficient
Quick Start Guide
What is this app for?
This app helps you analyze text data using AI and create different kinds of outputs like causal maps, tables, charts, etc.
The Interface
The app has three main areas:
- Editor Panel (left): Where you type commands and text
- Output Panel (right): Shows results of your commands
- Gallery (within the right tab): Access your saved work and templates
You can resize the panels by dragging the dotted divider between them.
Basic Workflow
- Type commands in the editor panel
- Results appear in the output panel
- Commands are processed line by line - each line builds on the previous result
- No need to save the results: they are cached and reloaded quickly based on the text in the editor. You don't save the results, you just save the text.
Workflows is Verb-Centric
At the heart of Workflows is a set of verbs that describe kinds of data transformations that are applied to data-tables. These verbs can be chained together to form data workflows.
Types of Commands
Inputting tables (resulting in a table)
- links myfile # Load a links file from CM3
- statements myfile # Load a statements file from CM3
- download myfile # Load a table from Workflows and use it directly.
- scrape myURL # Scrape a URL and make its contents available as a table.
- fetch qualia # Get Qualia interview transcripts
Outputting visualisations (inputting a table and outputting a visualisation)
- map # Create network visualization
- chart # Create a chart
- print # Formatted text and images
Transforming data (inputting a table and outputting a table)
All the other commands are like this, e.g.
- select col1 col2 # Choose specific columns
- filter col = value # Filter rows
- prompt # Use an LLM prompt to transform the 'text' column in the previous table and output a new table
Remember
- Commands are processed top to bottom
- Results are cached so they load quickly when reused
- Add comments with # to document your workflow and create a presentation which you can share
- Save versions of your work regularly
Getting Help
- Click the ? icon for an interactive tour or press alt-h for a help modal
- Check the syntax guide for command details
- Look at example templates in the Gallery
When the cursor is in a word in the editor and the user presses alt-h, the help modal appears with that word pasted into the search field.
Next Steps
Once you're comfortable with the basics, you can:
- Upload your own PDF files for analysis
- Use AI prompts to analyze text
- Create presentation slides
- Share your work with colleagues
- Export results to PDF
Keyboard shortcuts
- ArrowUp: Toggle the sidebar and main navbar visibility (ignored if typing in an input field).
- Left Arrow / Right Arrow: Navigate presentation slides (previous/next) only when the sidebar/navbar is hidden (ignored if typing in an input field).
Authorisation and authentication
The app uses Firebase authentication for login and logout, using a firebase_user_id.
Note the "sess_id" is the session-based sess_id, not the firebase_user_id.
Text/Workflow Authorisation
Access to texts (workflows, snippets) is now managed via the database, not the first line of the text file.
- Owner: Each text has an owner (an email address), set when the text is first saved. The owner always has full access.
- Editors: Each text has an editors field in the database (a comma-separated string of emails). Users listed as editors have access to the text.
- Viewers: Each text also has a viewers field (comma-separated emails). Users listed as viewers can load and view the text, but cannot save changes or manage editors/viewers.
  - You can add the special value global to the viewers list to make the text publicly viewable by anyone (logged in or not).
- Management:
  - The owner, administrators, and any current editors can manage the list of editors and viewers.
  - A "Manage Editors" pill and a "Manage Viewers" pill appear on the gallery card only for users with permission (owner, editor, admin).
  - Clicking these pills opens a modal to add or remove editor/viewer emails.
- Gallery/Loading: You will only see texts in the gallery or be able to load texts via URL if you are:
  - An administrator (defined in config.py),
  - The owner of the text,
  - Listed in the editors field for the text, or
  - Listed in the viewers field for the text, or global is in the viewers list.
- Saving/Editing: Only administrators, owners, or editors can save changes to a text (e.g., using the "Save Project Version" button or editing via the gallery modal).
- Ownerless Texts: Texts created before this system might have no owner assigned. These are only visible to administrators.
- DEPRECATED: The old method of using share editing: ... viewing: ... on the first line of the text is no longer supported.
Only provide file names for which the logged-in user is authorised, i.e. where their email appears in the comma-separated lists of emails in the edit, copy or view fields of the corresponding row in the files table.
Versions (applies to any text, not just Projects)
- Every text saved at the texts sql table can have one or more versions.
- Each version has a version number, starting from 1 for the first version, e.g. foo-v1.
- The version number is updated for each save.
- A text name with no version number is taken as a pointer to the latest version.
- This way, we can refer users to either the most recent version (which may constantly update) or a specific version (e.g. foo-v123, which is fixed).
Projects and Project versions
A Project aka workflow aka script is any text in the text database which calls data, see below.
special case: No title
The user may delete the title in #projectTitle, in which case there is no current Project until the user saves under some name.
Normal case
Project titles will often contain | separators, e.g. foldername|foo-v22. These are just strings, not alternatives or lists. The part before the separator is treated a bit like a folder name in the Gallery, but otherwise treat it like any other character.
A Project is a text saved in the texts sql table which also has a versioned title like foobar-v12. The current Project is always the (versioned) Project title in #projectTitle; the contents of the Editor start out as the corresponding text, which the user then edits and can save again under a new version number. Previously we also allowed unversioned titles as equivalent to a kind of version zero, but we don't want this: the first version should already be foobar-v1.
When the user presses the saveProjectVersion button:
- if projectTitle is empty, throw an error
- if projectTitle is not empty, whether or not it ends in -vNN where NN is an integer:
  - always find the max NN and increment by 1 to give the new version number
  - save the current text to the texts table with the new projectTitle-vNN as the title
  - copy the other fields (including colour fields and font field) apart from modified, which will auto update, and text_id, which will be updated to the new id
  - replace the contents of projectTitle with the new projectTitle-vNN
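A small sketch of this version-increment rule (the function name and inputs are illustrative, not the app's real save handler):

```python
import re

def next_versioned_title(title: str, existing_titles: list[str]) -> str:
    """Strip any trailing -vNN, find the max existing version for that base name,
    and return base-v(max+1)."""
    base = re.sub(r"-v\d+$", "", title.strip())
    if not base:
        raise ValueError("projectTitle is empty")
    versions = [int(m.group(1)) for t in existing_titles
                if (m := re.fullmatch(re.escape(base) + r"-v(\d+)", t))]
    return f"{base}-v{max(versions, default=0) + 1}"

# next_versioned_title("foo-v3", ["foo-v1", "foo-v2", "foo-v3"]) -> "foo-v4"
# next_versioned_title("bar", [])                                -> "bar-v1"
```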
Clicking Clear Title ($el.previousElementSibling.value = ''; projectTitle = '') should not only remove the title from #projectTitle but also from the URL ?title= parameter.
URL parameters and project versions
- The URL changes live if you change the projectTitle: &title=... but without the version number (&title=foo, not &title=foo-v123). It also always reflects the current line number, i.e. the most recently clicked line in the editor, in the &line= parameter.
- If you have a non-blank projectTitle, the URL will NOT be updated to reflect the editor text.
- If you have a blank projectTitle, the URL constantly updates as you alter the input text, so it reflects the editor text and the cursor position. If you reload or send the URL to someone else, they will see the same text and cursor position. If you have a non-blank projectTitle, the page will find the latest version of the text with that title and load it into the editor. If you have a versioned title, that specific version is loaded.
You can also specify the text, title (optionally with version number) and line number directly in the URL:
- title=... This is the title of the text in the database. If no version number is present, we take the latest version. The version is loaded into the projectTitle text input, and the corresponding text is loaded into the editor.
- text=... Only used if the title is not present.
- line=... This is the line number to move the cursor to and simulate a gutterclick on this line.
Tests:
- one
- two
- should load longer text from texts table into input and load testAll into projectTitle
Closing or refreshing browser
If the button saveProjectVersion has class needs-save (or similar) AND the #projectTitle is not blank, we should pop up the browser alert "Changes that you made may not be saved." .
Saving texts from the gallery
If we click editText(item) in gallery.js and then the saveEdit button, we get a new version. There is also another Save Without Versioning button next to it, which simply updates the existing text without creating a new version.
Clicking @click="saveProjectVersion" should save the current text to the texts table with the contents of #projectTitle as the title, updated to append -vNN where NN is the max of the existing versions plus one (get the title from #projectTitle, not the URL), and copy the other fields including colours and fonts, apart from modified (which will auto update) and text_id (which will be updated to the new id).
Editing texts from the gallery - should NOT make any changes to #projectTitle or the editor text. - should NO LONGER copy the URL with ?title=projectTitle-vNN&line=[current line number] to clipboard.
- if the content of projectTitle is empty, throw an error.
- if the content of projectTitle is new and has no -vNN, that's fine, just save as projectTitle-v1.
Gallery
In a separate html template, included in index.html, we have a gallery of favourite texts. This is a flex container with a card for each text, showing by default the first 30 cards, with filters and server-side population.
The cards are basically one for each row in the texts table, except that for projects (sets of texts like foo-v1, foo-v2, etc) we show the project title and only the latest version number.
A card is a project if it contains any lines starting with: get, fetch, download, links, statements, sources, scrape.
By default we show the most recent 30 unarchived texts.
Each card shows:
- the project title,
- the file name(s) (extracted from any 'get someFileName' command in the text),
- the number of lines,
- the first few characters of the first two lines,
- the date last modified,
- a checkbox showing and controlling the status of the archived field,
- a button to insert the corresponding text into the Ace text editor:
  - if the text is a project, insert the latest version, replacing all the text in the editor, and unset .needs-save
  - if the text is not a project, insert the text at the current cursor position, and set .needs-save
- a button to insert the corresponding title into the Ace text editor in square brackets,
- a button to open a modal to edit the text,
- widgets for colours and font.
Loading texts from the gallery
When we click @click="$store.galleryActions.insertText(item)", the editor text is replaced with the text from the template. #projectTitle should also be updated, and the URL should be updated only with this title, not also with the text parameter. The "needs save" attribute of the save button should be reset. NEW: also simulate a click on the output tab ({ activeRightTab: 1 }) to switch from Gallery to Output, and NEW: simulate a gutterclick on line 1.
NEW: do the same (switch tab to Output) when we click on $store.galleryActions.insertTitle(item)
Template Tag Version Updating
When you click on a template tag in square brackets (e.g., [mytemplate-v5]) in the editor, it opens the gallery edit modal for that template. If you make changes and save a new version, the system automatically updates the template tag in the editor to reflect the new version number.
How it works:
- Click on any template tag like [genai-labels-v25] in the editor
- The gallery modal opens with that template loaded
- Make your edits and click "Save Edit" to create a new version (e.g., genai-labels-v26)
- The original template tag in the editor is automatically updated to [genai-labels-v26]
- This works whether the gallery component is initialized or not - the system includes fallback mechanisms for template editing
This ensures your workflow always references the latest version of templates you've edited, without requiring manual updates to template tags throughout your code.
Search filter:
Cards should be sorted like this: first list cards with a text hit in the title, in descending order of modification date; then list cards with a text hit in the text, in descending order of modification date. Also take into account the checkbox to filter by archived status and a checkbox to show all versions or only the most recent version (default: most recent).
If the button saveProjectVersion has class needs-save (or similar) we should pop up the browser alert "Changes that you made may not be saved." as we would if closing the browser.
Project cards, which are normally blue, now get two colour selectors. Give me SQL to add the fields colour1, colour2 and font to the texts table. To project cards, add two new selectors for colour1 and colour2 which provide colours; these are fed from and saved to the texts table. Default is black. Show colour1 as the card colour and colour2 as the border.
Authorisation of texts
We add a widget to each card which allows us to directly edit the editors field in the sql table (this field exists already, varying(128)). Allow users with edit permission to add or change valid email addresses using the card. Use this field instead of the first line of text to decide whether cards are displayed in the gallery.
... then, in the gallery, via the endpoint, only list texts where the current user is an admin, the owner, or one of the editors.
Folders in the Gallery
folders are simply all the texts like foo for which there are texts with title foo|bar, foo|baz, etc.
so in the gallery, also print at the top a list of "folders", i.e. all such texts, in a flex container with a small button for each folder, which when clicked loads that text including the trailing | into the search filter.
That means if you want to put a text into a folder, you just type "foldername|" first. Afterwards, versions of the texts will remain in the same folder.
in the list of folders, we put the folders "examples" then "tools" first and make the icons green instead of blue
after the "show all versions" checkbox we add a radiogroup with options Scripts Snippets and Both, with default Scripts. If scripts, show only cards with files (the blue ones), if Snippets show only non-file cards (the whilte ones), etc
Parsing / processing the input text
The input text is processed line by line.
So line 1 produces a result, line 2 operates on that result, etc.
The results are cached locally, so they should appear quickly when used again.
The user should get hints on what to type, see below.
If the result of the current command is a table, the parser will return the first 50 rows and we will display them in a front-end Tabulator table with client-side operation. That means the full table is not loaded into memory, so be careful: sorting by headers only sorts the current page.
Desired pass-through behaviour (not implemented yet)
Atm most commands take an input table in @process_parse_request, process it and pass it on to the next command. As a byproduct they also send the output to the frontend as a table for visualisation.
The map and print commands are exceptions because they pass a different kind of output, and do not pass on a tabular result to the next command in the pipe. This is not convenient; I wish they could just pass the previous table untouched on to the next command, while still sending their special outputs to the frontend as before.
I think this would be easy to do by storing and passing on the previous table in the parse loop. suggest how to implement this.
magic fences
Magic fences are blocks within pairs of three backticks that are ignored. This is useful for including code in the input text without it being processed, for example when you have run a long and complicated prompt and want to then upload the result to the datastore.
For example, fenced lines like "ignore this" or "399950030" are skipped, so code changes within or before the magic fences are not processed.
But the magic is that if you click within the magic fences, the code will be processed as usual. That includes all code before the magic fences (as usual, respecting existing caching as usual, and all the preceding code within the magic fences as usual). The results are cached as usual, also for any subsequent clicks within the magic fences. This is because, if you think about it, if you click within the magic fences, the app can't see the closing magic fence backticks, so it doesn't treat this request as being from within magic fences at all.
Recipes
We introduce hard-coded recipes. When a user enters the app with no &text= or &title= parameter, we commence a frontend-based workflow: open the upload modal, which for now has two buttons, Outcome Harvesting and "Other". Clicking one activates a recipe workflow. For now, we only have one recipe: OH-PDF.
On clicking the OH button, we hardcode the upload filename to "[sanitised-user-email]|OH-PDF--nn" (nn is an incremental integer, to ensure the uploads name is unique) and disable editing this name.
- We also show a custom text within this modal which says "Upload all your PDFs at once and I will identify Outcomes within them".
- When the upload has finished, update the editor (using existing functionality to change editor text) like this:
list uploads
download OH-PDF-[sanitised-user-email]-nn
Output
Table output (default)
Numerical columns have conditional formatting of cell backgrounds based on cell value, with darker backgrounds for higher values: muted red for negative values, muted blue for positive values, transitioning gradually through white at zero, with square-root scaling. Colours are scaled such that the darkest colour is used for the largest absolute value. Scaling is on a per-column basis.
Numerical non-integer columns show at most 2 significant decimal places (so 2.1 is formatted as 2.1, 2.22 is formatted as 2.2, 2.256 is formatted as 2.26, but 0.001 is formatted as 0.001).
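A rough sketch of the square-root colour scaling; the exact RGB palette here is an assumption, only the scaling rule comes from the description above:

```python
import math

def cell_colour(value: float, col_abs_max: float) -> str:
    """Square-root intensity scaling per column: white at zero, blue-ish for
    positive, red-ish for negative, darkest at the column's largest |value|."""
    if col_abs_max == 0:
        return "rgb(255,255,255)"
    intensity = math.sqrt(min(abs(value) / col_abs_max, 1.0))   # 0..1
    fade = int(255 * (1 - intensity))                           # 255 = white
    return f"rgb({fade},{fade},255)" if value >= 0 else f"rgb(255,{fade},{fade})"

# cell_colour(0, 10)   -> "rgb(255,255,255)"  (zero is white)
# cell_colour(-10, 10) -> "rgb(255,0,0)"      (largest negative gets the darkest red)
```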
text results and outputs
I want to unify the different kinds of text results and Output, for example I want:
Commands and syntax
Getting data, data management
⚡ list mongo OR uploads OR cm3
ignoring case, so list uploads is the same as list Uploads, list MonGO is same as list mongo, list cM3 is same as list CM3 etc
It is just a convenience. You can use the other commands e.g. to filter this list, but it is not part of the workflow / pipeline.
list uploads [cache=False]
This loads a table of all the objects in the S3 store saved with the upload command and retrievable with the download command, along with any metadata we can get from the S3 store. Initially listed in descending order of upload date.
NEW by default, we do NOT list lines in the "directory" cache/lines_cache/, unless cache=True
list cm3
an additional standalone command on any line:
This loads the complete list of cm3 files, analogous to the get command.
You can use the other commands e.g. to filter this list, but it is not part of the workflow / pipeline.
⚡ scrape foobar.com [depth=nn]
Uses the package scrapy to scrape the web page at foobar.com with minimal extra parameters and load the data into a dataframe called statements, analogously to "get".
depth parameter (default 0):
- depth=0: Only scrape the initial URL (default behavior)
- depth=1: Follow all links on the page which are in the same (sub-)domain and scrape those too
- depth=2: Continue this process for one more level deep
The command adds columns for depth (showing the level at which the content was found) and url (showing the specific URL where each piece of content was scraped from).
The scraper provides notify_info feedback showing scraping progress. If the number of followed links exceeds 200, the process stops with a notify_warning to prevent excessive resource usage.
The scraper only follows links within the same domain or subdomains (e.g., if you start with example.com, it will follow links to subdomain.example.com but not to other-site.com).
Example: scrape example.com depth=2 will scrape the main page, all linked pages on the same domain, and then all pages linked from those pages.
This prevents several issues:
- Circular references: if page A links to page B, and page B links back to page A, we won't get stuck in an infinite loop
- Duplicate content: each URL is only scraped once, even if multiple pages link to it
- Multiple paths to the same page: if the same page can be reached via different link paths, we only process it once
The implementation uses a set() to track processed URLs, which provides O(1) lookup time for checking duplicates. This is efficient even when processing the maximum of 200 URLs. So the scraper correctly handles scenarios like:
- A → B, B → A (circular)
- A → C, B → C (multiple paths to the same destination)
- Complex link networks where pages reference each other
The depth tracking still works correctly because we track the depth at which each URL was first encountered, ensuring the depth column accurately reflects how the content was discovered. (A sketch of this logic is shown below.)
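Here is a compact sketch of that dedup-plus-depth logic; fetch_links is a stand-in for the scrapy machinery, and a dict is used so the first-seen depth is stored alongside the deduplication set:

```python
from urllib.parse import urlparse

def crawl(start_url, fetch_links, max_depth=0, max_urls=200):
    """fetch_links(url) is assumed to return the absolute links found on a page;
    the real command uses scrapy and notify_info/notify_warning for feedback."""
    if "://" not in start_url:
        start_url = "http://" + start_url
    domain = urlparse(start_url).netloc
    seen = {start_url: 0}                 # url -> depth at which it was first found
    queue = [start_url]
    while queue:
        url = queue.pop(0)
        depth = seen[url]
        if depth >= max_depth:
            continue
        for link in fetch_links(url):
            host = urlparse(link).netloc
            same_site = host == domain or host.endswith("." + domain)
            if same_site and link not in seen:    # circular refs and duplicates skipped
                if len(seen) >= max_urls:
                    return seen                   # the real app warns and stops at 200
                seen[link] = depth + 1            # depth of first encounter
                queue.append(link)
    return seen
```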
⚡ get
get foobar: if the file foobar exists, it will be loaded as long as it is shared with the user.
links, statements and sources are the names of the three tables making up a file.
The three tables are loaded and saved locally in memory to a datastore, ready to be used.
joins
At the end of load_file_data(), left join the sources table to the statements table by source_id. This must not result in extra rows, as source_id must be unique in the sources table. If there are any other duplicate colnames, throw a notify_warning() and leave those columns out of sources.
Finally, left join the statements table to the links table by statement_id. This must not result in extra rows, as statement_id must be unique in the statements table. Again, skip duplicate columns.
If there are no columns to join on, create all combinations. So if x is a df with m rows and y is a df with n rows, the result will have m*n rows and all combinations of rows from x and y.
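A pandas sketch of these joins, assuming the three tables arrive as DataFrames (join_file_tables and the notify_warning default are illustrative names, not the app's real functions):

```python
import pandas as pd

def join_file_tables(links, statements, sources, notify_warning=print):
    """Left-join sources into statements, then statements into links."""
    # sources -> statements on source_id, dropping duplicate columns from sources
    dupes = (set(statements.columns) & set(sources.columns)) - {"source_id"}
    if dupes:
        notify_warning(f"dropping duplicate columns from sources: {sorted(dupes)}")
    statements = statements.merge(sources.drop(columns=list(dupes)),
                                  on="source_id", how="left", validate="many_to_one")
    # statements -> links on statement_id, again skipping duplicate columns
    dupes = (set(links.columns) & set(statements.columns)) - {"statement_id"}
    links = links.merge(statements.drop(columns=list(dupes)),
                        on="statement_id", how="left", validate="many_to_one")
    return links

# With no columns in common, a cross join gives all m*n combinations:
# x.merge(y, how="cross")
```

The validate="many_to_one" argument makes pandas raise if the right-hand key is not unique, matching the "must not result in extra rows" requirement.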
⚡ links statements sources
The three dataframe commands: links, statements and sources, as well as the files command, can be used anywhere in the input and will load the dataframe with that name from the datastore.
These three commands are used on their own, and the result is the table with that name.
- links foobar: if the cm3 file foobar exists, this returns the links table as a dataframe.
- statements foobar: if the cm3 file foobar exists, this returns the statements table as a dataframe.
- sources foobar: if the cm3 file foobar exists, this returns the sources table as a dataframe.
links, statements and sources are the names of the three tables making up a CM3 file.
We add this convenience functionality to the parser:
If the command is just links, statements or sources, then we look back to see if the actual cm3 file has been mentioned earlier in the input in a links, statements or sources or get command. If so, we use that, otherwise we give an error. If there is more than one such mention of an actual cm3 file, we take the most recent one.
So

get foobar
blah blah blah
links

is the same as

get foobar
blah blah blah
links foobar

and

statements foobar
blah blah blah
links

is the same as

statements foobar
blah blah blah
links foobar

and

get foobar
blah blah blah
statements baz
links

is the same as

get foobar
blah blah blah
statements baz
links baz
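A minimal sketch of this look-back rule (resolve_file is a hypothetical helper, not the parser's real function):

```python
def resolve_file(lines, current_index):
    """Scan earlier lines for the most recent links/statements/sources/get
    command that names an actual file, and return that file name."""
    commands = ("links", "statements", "sources", "get")
    for line in reversed(lines[:current_index]):
        parts = line.split()
        if parts and parts[0] in commands and len(parts) > 1:
            return parts[1]                      # most recent explicit file name
    raise ValueError("no cm3 file mentioned earlier in the input")

# resolve_file(["get foobar", "blah", "links"], 2)                      -> "foobar"
# resolve_file(["get foobar", "statements baz", "links"], 2)            -> "baz"
```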
⚡ factors
Factors are not a table like links, statements and sources, but are actually a view of the links table: what are the unique values of from_label and to_label? So the command only works if the current table has columns called from_label and to_label and source_id and link_id.
It outputs a table with columns
- label,
- source_count,
- link_count
- source_ids
- link_ids
- statement_ids
It takes the from_label and to_label columns and combines them into a single column called label, one on top of the other, then groups by label, counting the number of source_id and link_id rows for each.
It also adds the columns source_ids and link_ids and statement_ids which are the unique values of source_id and link_id for each group.
If the column from_label is not available, use Cause or cause if available. Similarly for to_label: if not available, use Effect or effect if available. And if statement_id is not available, create a serial number but as character. And if source_id is not available, use statement_id.
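A pandas sketch of the basic factors view, ignoring the column-fallback rules just described; whether counts are of rows or of unique ids is an assumption here (unique ids are used):

```python
import pandas as pd

def factors(links: pd.DataFrame) -> pd.DataFrame:
    """Stack from_label and to_label into one `label` column, then count and
    collect ids per label."""
    cols = ["source_id", "link_id", "statement_id"]
    stacked = pd.concat([
        links[["from_label"] + cols].rename(columns={"from_label": "label"}),
        links[["to_label"] + cols].rename(columns={"to_label": "label"}),
    ], ignore_index=True)
    return (stacked.groupby("label")
                   .agg(source_count=("source_id", "nunique"),
                        link_count=("link_id", "nunique"),
                        source_ids=("source_id", lambda s: sorted(s.unique())),
                        link_ids=("link_id", lambda s: sorted(s.unique())),
                        statement_ids=("statement_id", lambda s: sorted(s.unique())))
                   .reset_index())
```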
fetch qualia [table=statements] foobar
This is a convenience command that loads the file foobar as a statements dataframe from a different RDS database:
conn = await get_db_conn()
we need the tables called "interviews" (for auth information) and "answers" (for the data) in the database.
It's a postgresql database at heroku.
fetch qualia foobar
loads the answers table where the field interview=foobar as a dataframe, provided the user has the correct auth information in the interviews table: the current user's email must equal the owner field of the corresponding row in the interviews table, or the edit field (a comma-separated list of emails) must contain the user's email.
if the table is not statements, then fetch the appropriate table from the database:
- sessions
- interviews
- translations
⚡ share
share and unshare are used to control access to objects (usually tables) stored at S3. These objects also have a key called owner_email which is used to control access; owner_email is set to the current user's email when the object is uploaded and cannot then be changed. Only owners, editors or admins can access the object or change the edit field.
share foobar friend@example.com[, friend2@example.com]
this command will add the friend's email(s) to the edit field of the file foobar (comma separated list of emails).
Only editors or admins can do share or unshare.
⚡ unshare
unshare foobar friend@example.com[, friend2@example.com]
this command will remove the friend's email(s) from the edit field of the file foobar (comma separated list of emails).
⚡ upload objectname [overwrite=true]
Very simple command to upload the previous table or other object to the S3 store, with the name objectname.
only if overwrite=false will it fail if the name is already used, otherwise it will silently overwrite an existing object.
uploading with overwrite=true will only work if the current user is the owner or an editor of the object.
the upload command also adds the current user's email to the object's owner metadata.
⚡ download objectname
very simple command to download an object from the S3 store, with name objectname. For permissions, see share command above.
⚡ delete [objectname] OR [N=nn]
If objectname is given, and the current user is the owner or an editor of the object, or an admin, delete the object from the S3 store (a version is still saved in the S3 versioning system).
if no objectname is given, read the Key column of the preceding (presumably filtered) table and delete each of those objects from the S3 store for which the current user is authorised as above.
HOWEVER, when using a table, we must also provide an integer N which is the number of rows in the table to delete (i.e. not the row numbers but the total number of Keys being passed). If this number is wrong, the whole command will fail. This is a way to make sure you know what you are deleting!!
i.e.:
- IF N=nn is given, look for a preceding table and if it has nn rows delete each of the keys in the 'Key' column.
- IF no N is given, but an objectname not beginning with "N=" is given, delete that key.
- else, throw an error.
⚡ store xyz / recall aka unstore xyz
(recall is the same as unstore.)
store: This command goes on its own line. It simply means that the backend adds the current result (the result of the previous command) to the in-session cache with the tag xyz.
Then when it sees code like this:
[other commands]
command1
store xyz
[possibly other commands]
command2
recall xyz
command3

that means that command3 is run with the result of command1, and the output of the last line is the same as if we had written:

command1
command3
they should not persist independently of the command pipeline.
foo
store foostore
bar
recall foostore
if foo changes, the result of recall foostore should change.
Dataframe manipulation: Filter, select, etc
⚡ groupify
store baz # some previous result
store bar # some previous result
foo-df # df on preceding line
groupify bar baz # NEW SYNTAX a grouped output consisting of three objects foo-df, bar and baz just like the result of the `group by` command
If there is no preceding df, like this
store baz # some previous result
store bar # some previous result
print # the result of this is not a df
groupify bar baz # NEW SYNTAX a grouped output consisting of three objects foo-df, bar and baz just like the result of the `group by` command
(old syntax was groupify(baz,bar) rather than new: groupify baz bar)
⚡ ungroupify
Extracts a numbered element from a grouped output created by group by or groupify.
store baz # some previous result
store bar # some previous result
foo-df # df on preceding line
groupify(bar,baz) # a grouped output consisting of three objects foo-df, bar and baz just like the result of the `group by` command
ungroupify 2 # returns baz
⚡ select
selects columns from the current table. eg:
get foobar
statements
select colname1 colname2 colname3
We also allow selection by position and range. So if the colnames are a b c d e f g h i j:
select 2-4 e f 8
will select columns b c d e f h.
Note that the select statement also reorders columns into the given order. So:
select bar foo baz
will return a table with columns in the order bar, foo, baz.
Select also allows renaming columns. So:
select colname1 into foo colname2 into bar
will return a table with columns foo and bar.
This also supports quotes in both positions:
select "colname1" into foo colname2 into bar
will return a table with columns foo and bar.
select colname1 into "foo" colname2 into "bar with space"
will return a table with columns foo and "bar with space".
Also allow minus to exclude columns:
select -1-3
... returning all the columns except the first three.
and
select -colname1 -colname2
... returning all the columns except colname1 and colname2.
⚡ index
Also enable the index command:
index foo by bar
All this does is add the name and value of the grouping column to the output. Given a row with bar=i1, foo=c1, the output will be:
((bar: i1)): c1
⚡ rename
rename colname1 into foo colname2 into bar
renames the columns colname1 and colname2 to foo and bar respectively.
Like the select command, this also supports quotes in both positions:
rename "colname1" into foo colname2 into bar
and
rename colname1 into "foo" colname2 into "bar with space"
etc
NEW enable numerical col specification like in select e.g.
rename 1 into foobar
⚡ union newcolname = oldcolname1 oldcolname2 ...
This takes the current table with columns including oldcolname1, oldcolname2, ... and replaces those columns with a new column called newcolname with the concatenated values of oldcolname1, oldcolname2, retaining existing columns not mentioned.
The concatenation is done like this: given a row with oldcolname1 = value1 and oldcolname2 = value2, the new column contains:
newcolname = ((oldcolname1:)) value1\n((oldcolname2:)) value2
⚡ split column by charstring
Returns the same df except that for any row where charstring appears in column n times, we split that text by charstring into n+1 pieces, resulting in n+1 rows which are identical except that column contains each of the n+1 stripped pieces.
eg
split text by //
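A pandas sketch of this behaviour (split_column is an illustrative name, not the app's real implementation):

```python
import pandas as pd

def split_column(df: pd.DataFrame, column: str, charstring: str) -> pd.DataFrame:
    """One output row per stripped piece, with all other columns duplicated."""
    out = df.copy()
    out[column] = out[column].astype(str).str.split(charstring)
    out = out.explode(column, ignore_index=True)
    out[column] = out[column].str.strip()
    return out

# split_column(df, "text", "//") mirrors:  split text by //
```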
⚡ distinct
inputs a df and outputs completely distinct rows from the df
Returns only unique rows from the current table, removing any duplicates. Usage:
get foobar
statements
distinct
This command takes no parameters and simply removes duplicate rows from the current table.
⚡ sort colname1 [colname2...]
Sorts the current table by the columns colname1, colname2, etc.
⚡ sortdown colname1 [colname2...]
Sorts the current table by the columns colname1, colname2, etc, in descending order.
⚡ pivot colname1 by colname2 [show colname3]
Pivots the current table so that colname1 is in the rows, colname2 is in the columns, and the value in each cell is the value of colname3.
If the combinations of the values of colname1 and colname2 are not unique, the values of colname3 are concatenated, separated by ////.
If show colname3 is not specified (most useful when the combinations of the values of colname1 and colname2 are not unique), the value in each cell is just the number of combinations.
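A pandas sketch of both cases (pivot here is an illustrative helper, and the //// separator is taken literally from the description above):

```python
import pandas as pd

def pivot(df: pd.DataFrame, colname1: str, colname2: str, show: str = None) -> pd.DataFrame:
    if show is None:
        # no `show`: each cell is the number of combinations
        return pd.crosstab(df[colname1], df[colname2]).reset_index()
    # with `show`: non-unique combinations are concatenated with ////
    return (df.pivot_table(index=colname1, columns=colname2, values=show,
                           aggfunc=lambda s: "////".join(s.astype(str)))
              .reset_index())
```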
⚡ filter
contains, is
Filter for rows where the column contains or is the value.
= is a synonym for is. It does not need to be surrounded by spaces.
more, less
Operators > and < aka 'more' and 'less'. Treat values as numbers if possible, otherwise filter always excludes the row.
OR works as expected.
< and > do not need to be surrounded by spaces.
OR
We also generalise the filter command: 'filter colname is foo OR bar' filters for colname to be EITHER foo OR bar, and 'filter colname contains foo OR bar' filters for colname to EITHER contain foo OR bar.
(note that the OR is in uppercase)
contains_not, is_not
Add operators contains_not and is_not. In this case we filter for rows where the column does NOT contain or is NOT the value.
In these cases, "contains_not foo NOR bar" retains rows where the column contains neither foo nor bar.
"foo bar baz" is discarded.
"foo baz" is not discarded.
m-n, n
'filter m-n' means retain only rows m-n of the current table.
NEW: 'filter n' means retain only the nth row of the current table.
random n
NEW: 'filter random n' means retain only a random sample of n rows from the current table. use a seed to make it reproducible.
quoted values
Added proper quoted value parsing that:
- Preserves spaces within quotes
- Handles OR correctly even with quoted values
- Ignores quotes and OR within quotes
- Strips whitespace from values appropriately
- Handles empty or invalid values gracefully
- Preserves case within quotes but does case-insensitive comparison
Now these will all work correctly:
links filter text contains "a long text with spaces"
statements filter statement_id contains "id with spaces"
statements filter text contains "first text" OR "second text"
links filter from_label contains "label one" OR "label two" OR "label three"
New parameter:
filter colname resembles "a text with spaces" [similarity=0.3]
filter colname resembles keyword [similarity=0.3]
example: filter text resembles "a text with spaces"
resembles
filter from_label resembles family similarity=0.5
This returns rows where the colname column resembles the keyword with the given similarity threshold, using the cosine similarity metric. So this needs to fetch the embeddings for the keyword and for the colname column, and then compare the two vectors using cosine similarity. We will use the new OpenAI text-embedding-3-large model for the embeddings. The request is like this:
from openai import OpenAI
client = OpenAI()
response = client.embeddings.create(
    input="Your text string goes here",
    model="text-embedding-3-large"
)
print(response.data[0].embedding)
I also want to cache the embeddings in the datastore, so that we don't have to fetch them from OpenAI every time. I already have MongoDB Atlas running.
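A reduced sketch of resembles with such a cache; an in-memory dict stands in for the MongoDB-backed embeddings cache, and the helper names embed and resembles are illustrative:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()                # needs OPENAI_API_KEY in the environment
_embedding_cache = {}            # stand-in for the MongoDB Atlas embeddings cache

def embed(text: str) -> np.ndarray:
    if text not in _embedding_cache:
        resp = client.embeddings.create(input=text, model="text-embedding-3-large")
        _embedding_cache[text] = np.array(resp.data[0].embedding)
    return _embedding_cache[text]

def resembles(df, colname: str, keyword: str, similarity: float = 0.3):
    """Keep rows whose colname text is cosine-similar enough to the keyword."""
    target = embed(keyword)
    def cosine(text) -> float:
        v = embed(str(text))
        return float(np.dot(v, target) / (np.linalg.norm(v) * np.linalg.norm(target)))
    return df[df[colname].map(cosine) >= similarity]
```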
top
links
filter top n links [type = source]
filter top n factors [type = source]
type is optional, default is source.
filter top n links type = source means: get the current links table, group by bundle, order by source_count in descending order and return the top n rows. If there are ties, include all ties. This means that if bundle A >> B has 10 sources and bundle C >> D has 10 sources, and n is 10, then both A >> B and C >> D will be included, even if this means identifying more than n rows. then return all links from the current links table which are in these bundles.
filter top n links type = links Same as above, but use link_count instead of source_count.
factors
filter top n factors [type = source]
type is optional, default is source.
filter top n factors type = source means: get the current links table, create a factors table from it, listing unique labels only, group by label, order by source_count in descending order and return the top n rows. If there are ties, include all ties. This means that if label A has 10 sources and label C has 10 sources, and n is 10, then both A and C will be included, even if this means identifying more than n rows. then return all links from the current links table where BOTH from_label AND to_label are in this list.
filter top n factors type = links Same as above, but use link_count instead of source_count.
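A pandas sketch of the links case, assuming a "bundle" is a from_label/to_label pair (filter_top_links is an illustrative name; keep="all" is what includes the ties):

```python
import pandas as pd

def filter_top_links(links: pd.DataFrame, n: int, type: str = "source") -> pd.DataFrame:
    """Rank bundles by source_count (or link_count), keep the top n including
    ties, then return all links belonging to those bundles."""
    count_col = "source_count" if type == "source" else "link_count"
    bundles = (links.groupby(["from_label", "to_label"])
                    .agg(source_count=("source_id", "nunique"),
                         link_count=("link_id", "nunique"))
                    .reset_index())
    top = bundles.nlargest(n, count_col, keep="all")      # keep="all" includes ties
    return links.merge(top[["from_label", "to_label"]], on=["from_label", "to_label"])
```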
UPDATED implicit support for reckon (replacing any previous support for calculate)
beyond the simple cases shown previously, we also use the reckon command to filter for more complex conditions.
filter source_count==link_count
filter source_count<=(link_count+1)
filter contains(as_character(id,"x))
etc.
⚡ join [table]
left join the current table to the table specified by the table= parameter. this other table can be one of links, statements, or sources, or it can be a table in the store set up by the store command. the two tables must have exactly one column in common, which will be used to match rows. the result is a new table with the columns of the current table plus the columns of the table specified by the table= parameter.
⚡ append
this can append text to text or tables to tables, so we no longer need the table= or text= keyword.
If one is a table and one is text, give a notify_error() and return an error.
if both are tables:
append the current table to the table specified. this other table can be any table in the store set up by the store command.
the columns of the new table will be the intersection of the columns of the two tables.
- if column names only differ by case, they are considered identical and the actual column name from the stored table is used. A notification is given to the user.
- if two columns have the same name but differ in type, the type of the stored table is used. A notification is given to the user.
if both are text:
append the current text to the text specified. this other text can be any text in the store set up by the store command.
The result is just the text concatenation of the two texts, with each text preceded by a markdown h1 heading: part1 for the current text, then part2 for the stored text. If part2 is not given, the name of the stored result is used instead; if part1 or part2 is still missing, the texts "Part 1" and "Part 2" are used respectively.
optionally we can specify names for the json:
append stored-foo name1=Name for the current text name2=Name for the stored text
example
links example-file
filter 1-10
recalculate
store lkss
factors
print
store printfactors
unstore lkss
print
append name1=printfactors name2=Some name
this outputs a text which is the concatenation of the two printed versions of the two tables, with the initial heading "# Some name" and the second heading "# printfactors". Note quote marks are not required, but are tolerated and stripped.
⚡ reckon
similar to calculate.
All of these work.
All these operations work because the reckon command includes all NumPy functions and Python's arithmetic operators in its evaluation context, making it very flexible for mathematical calculations.
Column-wise
reckon max_row = max(row_number)
reckon avg_length = mean(label_length)
reckon total_rows = count(row_number)
reckon unique = count_unique(from_label)
reckon lister = list(from_label)
reckon spread = spread(from_label)
reckon baseline = baseline(from_label)
- sum(column): Sum of values
- mean(column): Average of values
- min(column), max(column): Minimum/maximum value
- median(column): Median value
- first(column), last(column): First/last value in current sort order
- count_unique(column): Count of unique values
- lines(column): Double-newline-separated string of values
- list_unique(column): Comma-separated string of unique values
Conveniences
reckon row_num = row_number
reckon nrow = tally # just an alias for `reckon N = count(row_number)`
row-wise
Basic arithmetic
reckon double = row_number * 2
reckon square = row_number ** 2
reckon increment = row_number + 1
reckon half = row_number / 2
reckon multi = (double+square)/(row_number*2)
String functions
reckon upper_label = upper(from_label)
reckon lower_label = lower(from_label)
reckon label_length = length(from_label)
reckon contains_a = contains(from_label, "a")
reckon starts_with_t = startswith(from_label, "F")
reckon matches_pattern = match(column_name, "^regex_pattern.*") # produces the actual matching strings, not a boolean
reckon has_pattern = contains(column_name, "regex_pattern", regex=True)
reckon match = match(from_label, "F.*")
reckon alphanum = match(column_name, "[a-zA-Z0-9]+") # Extract any sequence of letters/numbers
reckon letters_only = match(column_name, "[a-zA-Z]+") # Extract only letters
reckon numbers_only = match(column_name, "\\d+") # Use double backslash for digit char class
reckon digits_only = match(column_name, "[0-9]+") # Alternative without backslash
String manipulation
reckon replaced = replace(from_label, "a", "X")
reckon first_char = substring(from_label, 0, 1)
reckon without_first = substring(from_label, 1)
reckon three_chars = substring(from_label, 0, 2)
Combination of operations
reckon complex = upper(substring(from_label, 0, 1)) + lower(substring(from_label, 1))
reckon ratio = length(from_label) / row_number
reckon is_special = length(from_label) > 10 and contains(from_label, "I")
Boolean operations
reckon exact_match = from_label == "Income"
reckon not_match = from_label != "Income"
reckon is_large = row_number > 10
reckon is_small = row_number < 5
reckon is_medium2 = row_number >= 5 or is_special
reckon is_medium3 = row_number >= 5 and row_number <= 17
Negation (using NOT)
reckon is_not_large = not (row_number > 10)
reckon doesnt_contain = not contains(from_label, "test")
reckon isnt_empty = not (length(from_label) == 0)
Using numpy functions
reckon log_value = log(row_number)
reckon sqrt_value = sqrt(row_number)
reckon rounded = round(row_number / 3, 1)
reckon corr = correlation(row_number, sqrt_value) # returns columns corr_coefficient, corr_sd, corr_p
Type conversions
reckon as_string = as_character(row_number)
reckon numeric_length = as_numeric(length(from_label))
Conditional operations.
reckon size_category = row_number < 5 ~ "Small", row_number < 10 ~ "Medium", else "Large"
reckon size_category = row_number < 5 ~ "Small", row_number < 10 ~ "Medium"
Seems like this successfully leaves unmentioned values of row_number untouched?:
reckon row_number = row_number < 5 ~ "Small", row_number < 10 ~ "Medium"
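For reference, this conditional syntax maps naturally onto np.select and pandas mask; a sketch under that assumption (not the app's actual evaluator):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"row_number": range(1, 13)})

# reckon size_category = row_number < 5 ~ "Small", row_number < 10 ~ "Medium", else "Large"
df["size_category"] = np.select(
    [df["row_number"] < 5, df["row_number"] < 10],   # conditions checked in order
    ["Small", "Medium"],
    default="Large")                                  # the `else` branch

# reckon row_number = row_number < 5 ~ "Small", row_number < 10 ~ "Medium"
# with no `else`, unmatched rows keep their original value; applying the branches
# in reverse order gives first-match-wins semantics:
col = df["row_number"]
df["row_number"] = (col.astype(object)
                       .mask(col < 10, "Medium")
                       .mask(col < 5, "Small"))
```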
More maths
reckon remainder = row_number % 2 # Basic modulo
reckon is_even = (row_number % 2) == 0 # Check if a number is even
Exponentiation
reckon cubed = row_number ** 3 # Cube of row_number
reckon power = pow(row_number, 3) # Using pow function
Square root and other roots
reckon square_root = sqrt(row_number) # Square root
reckon cube_root = row_number ** (1/3) # Cube root
Logarithms
reckon log_value = log(row_number) # Natural logarithm
reckon log10_value = log10(row_number) # Base-10 logarithm
Rounding
reckon rounded = round(row_number / 3, 2) # Round to 2 decimal places
reckon floor_value = floor(row_number / 3) # Round down
reckon ceil_value = ceil(row_number / 3) # Round up
Absolute value
reckon absolute = abs(row_number - 3) # Absolute value
Trigonometric functions
reckon sine = sin(row_number) # Sine
reckon cosine = cos(row_number) # Cosine
reckon tangent = tan(row_number) # Tangent
Available column-wise functions requiring a column argument:
Special column-wise function: differences [sig=nn] [column=source_id] [arrows=false] [proportions=false] [baseline=false] [type=nominal]
This is a special case which returns multiple columns
The column parameter is optional and defaults to source_id.
NOTE THAT ATM IF YOU USE A COLUMN OTHER THAN source_id, THE BASELINE IS RECALCULATED FOR THAT COLUMN AT THIS POINT.
default significance level is 0.05.
If there is a baseline column, then for each source_id, add a column named like source_id_difference.
The values in each column are the same for every row: it is used in combination with the group by command.
we basically construct a chi-square table with 2 rows and len_unique(source_id) columns, plus margins, in which the column margins are the baseline counts for each source_id. row 1 of the chisq table is the current link counts for each source_id (how many links in the current table for each source_id). The other row is the difference between the current link counts and the baseline counts.
then we return a table which is the current table with new columns, len_unique(source_id) columns. IF the overall chi-square value is < than the significance level, return row 1 of the chisq table, i.e. the current link counts for each source_id, same for each row. if it is not sig, return NA in all these new columns.
Even when a row IS overall significant, only return values if the one-dimensional chi-square value for that particular source_id is also significant at the same significance level, otherwise return NA.
Also return a column called p which is the p-value for the overall table.
So this is a special kind of calculation keyword because it returns multiple new columns and there is no slot for the name of the returned column. So a normal command is e.g. like this:
calculate new = function(old)
whereas here we do:
calculate differences [sig=nn]
The parameter arrows=true means that, where we would return a value, we prepend it with an upward-right arrow if the value is larger than expected, and a downward-right arrow if the value is smaller than expected.
the parameter proportions=true means that, where we would return a value, we return the actual value divided by the expected value, rounded to 2 decimal places.
the parameter baseline=true means that, where we would return a value, we return the baseline value after the actual value (or proportion) in brackets:
7 (10)
or
0.33 (10)
or
↗ 7 (10)
or
↗ 0.33 (10)
etc
Finally, we also return a column called differences_label which is a comma-separated list of the source_ids which are significant, together with their values as above. Then prepend it with the actual overall link count, e.g., depending on the parameters, it might be:
44 :: source_id1: ↗ 7 (10), source_id2: ↗ 15 (20), source_id3: ↘ 0 (10)
or if there are no significant differences:
44
NEW: add the keyword type, default nominal. If it is ordinal, assume the grouping variable recorded in the baseline is ordinal, not nominal, and adjust the chi-sq test accordingly (should be a bit more powerful).
Usage examples:
reckon differences sig=0.05 type=nominal # Standard chi-square test (default)
reckon differences sig=0.05 type=ordinal # Cochran-Armitage trend test for ordinal data
reckon differences sig=0.01 column=source_id type=ordinal arrows=true
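A very reduced sketch of the overall test only (illustrative names; it ignores the arrows/proportions/baseline display options and the ordinal variant, and assumes the counts arrive as pandas Series indexed by source_id):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, chisquare

def differences(current_counts: pd.Series, baseline_counts: pd.Series, sig: float = 0.05):
    """Row 1 = current link counts per source_id, row 2 = baseline minus current,
    so the column margins equal the baseline counts."""
    baseline_counts = baseline_counts.reindex(current_counts.index)
    table = np.vstack([current_counts.to_numpy(),
                       (baseline_counts - current_counts).to_numpy()])
    chi2, p, dof, expected = chi2_contingency(table)
    if p >= sig:                                   # overall test not significant: all NA
        return pd.Series(np.nan, index=current_counts.index), p
    out = {}
    for i, sid in enumerate(current_counts.index):
        _, p_col = chisquare(table[:, i], f_exp=expected[:, i])   # per-column check
        out[sid] = current_counts[sid] if p_col < sig else np.nan
    return pd.Series(out), p
```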
Row-wise Operations (result is different for each row in the column, most useful in combination with calculate)
Conversion:
calculate foo = as_numeric(column) # Convert to numeric
calculate foo = as_date(column) # Convert to date, using smart date parsing
calculate foo = as_datetime(column) # Convert to datetime, using smart datetime parsing
calculate foo = as_character(column) # Convert to character
Boolean:
calculate asdf = foo OR bar # Outputs a boolean, inputs boolean or coerces to boolean
calculate asdf = foo AND bar
calculate asdf = (foo AND bar OR baz)
- etc
Basic arithmetic operations work row by row:
calculate square = subtotal ** 2
calculate total = price * quantity
calculate ratio = value1 / value2
calculate adjusted = (base + offset) * multiplier
calculate same = value1 is value2
calculate same = value1 == value2
calculate same = value1 is "value with spaces"
calculate different = value1 is not value2
calculate more = value1 > value2
calculate more = value1 more value2
calculate less = value1 < value2
calculate less = value1 less value2
calculate contains = value1 contains value2
calculate contains = value1 contains "value with spaces"
calculate not_contains = value1 does not contain value2
calculate not_contains = value1 does not contain "value with spaces"
calculate row_number = row_number # create sequential row number starting from 1
calculate substring = substring(column, start, length)
calculate fixedvalue = xyz # where xyz is any fixed value including a string or a number where these are not names of columns.
e.g.
calculate section = "John Smith"
calculate section = John
calculate prefix = substring(id,1,2)
and this switch/case_when syntax:
calculate chapter = page > 0 ~ chapter1, page > 10 ~ chapter2, page > 20 ~ chapter3, else chapter4
String concatenation using union():
calculate full_name = union(first_name, " ", last_name)
calculate path = union("prefix_", id, "_", name)
calculate display = union(code, " - ", description)
replace: Replace a (all instances of) substring with a new substring.
eg
calculate spaceless = replace(spacey, " ", "")
Use replace() to substitute text in strings:
calculate spaceless = replace(spacey, " ", "")
calculate fixed = replace(text, "old", "new")
calculate clean_id = replace(id_col, "-", "_")
calculate normalized = replace(category, "Category ", "")
calculate fixed_names = replace(name, wrong_text, correct_text)
calculate clean_dates = replace(dates, "/", "-")
The replace function takes 3 arguments:
- Text to modify (can be a column name or quoted string)
- Text to find (can be a column name or quoted string)
- Text to replace with (can be a column name or quoted string)
calculate dynamic = replace(text_col, old_col, new_col)
calculate fixed = replace("some text", error_col, fix_col)
Add support for fuzzy matching:
calculate col = replace(col, "Birmingham", "Birmingham", fuzzy=0.5)
The fuzzy=0.5 argument is optional and defaults to 0, i.e. exact matching. If fuzzy is greater than 0 and less than 1, then the replace function will use fuzzy matching with the given similarity threshold. So this will replace "Birminhgam" with "Birmingham" if the similarity is around say 0.3.
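A minimal sketch of one possible reading of the fuzzy option: word-by-word difflib similarity, replacing a word when its ratio clears 1 - fuzzy. The threshold interpretation and the word-level matching are assumptions; the real matching strategy may differ.

```python
from difflib import SequenceMatcher

def fuzzy_replace(text: str, find: str, repl: str, fuzzy: float = 0.0) -> str:
    """fuzzy=0 falls back to a plain replace; higher fuzzy is more tolerant."""
    if fuzzy <= 0:
        return text.replace(find, repl)
    threshold = 1 - fuzzy          # assumption about how the parameter maps to a ratio
    out = [repl if SequenceMatcher(None, w.lower(), find.lower()).ratio() >= threshold else w
           for w in text.split()]
    return " ".join(out)

# fuzzy_replace("born in Birminhgam", "Birmingham", "Birmingham", fuzzy=0.5)
# -> "born in Birmingham"
```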
More examples from simpleeval
- + add two things: x + y, e.g. 1 + 1 -> 2
- - subtract two things: x - y, e.g. 100 - 1 -> 99
- / divide one thing by another: x / y, e.g. 100 / 10 -> 10
- * multiply one thing by another: x * y, e.g. 10 * 10 -> 100
- ** 'to the power of': x ** y, e.g. 2 ** 10 -> 1024
- % modulus (remainder): x % y, e.g. 15 % 4 -> 3
- == equals: x == y, e.g. 15 == 4 -> False
- < less than: x < y, e.g. 1 < 4 -> True
- > greater than: x > y, e.g. 1 > 4 -> False
- <= less than or equal to: x <= y, e.g. 1 <= 4 -> True
- >= greater than or equal to: x >= y, e.g. 1 >= 4 -> False
- >> "right shift" the number, e.g. 100 >> 2 -> 25
- << "left shift" the number, e.g. 100 << 2 -> 400
- in: is something contained within something else, e.g. "spam" in "my breakfast" -> False
Mixed Operations
Combine column-wise and row-wise operations:
calculate bonus = mean(salary) * 0.1
calculate total = sum(sales) + fixed_costs
calculate ratio = count(orders) / max(capacity)
Multiple Calculations
Multiple calculations can be combined in one line:
calculate x = mean(value) y = x + 100 z = max(y)
calculate a = "df['col'].str.upper()" b = union(a, "_", id)
Examples
Chi-square calculation:
calculate expected = row_total * col_total / grand_total
calculate chi2 = ((observed - expected) ** 2) / expected
String manipulation:
calculate clean = "df['text'].str.strip().lower()"
calculate first_word = "df['text'].str.split().str[0]"
calculate slug = "df['title'].str.replace('[^a-zA-Z0-9]', '-')"
Date operations:
calculate year = "df['date'].dt.year"
calculate age = "pd.Timestamp.now().year - df['birth_date'].dt.year"
Numeric processing:
calculate z_score = "(df['value'] - df['value'].mean()) / df['value'].std()"
calculate percentile = "df['score'].rank(pct=True)"
⚡ python
very similar to the calculate command but uses unquoted python expressions, replacing df['colname'] with just colname.
Python expressions have access to:
- df: the current DataFrame
- pd: the pandas module
- np: the numpy module
So we can have a syntax like this:
python percentile = score.rank(pct=True) * 100
meaning
calculate percentile = "df['score'].rank(pct=True) * 100"
and
python status = np.where(score > 90, 'Excellent',np.where(score > 70, 'Good', 'Needs Improvement'))
meaning
calculate status = "np.where(df['score'] > 90, 'Excellent', np.where(df['score'] > 70, 'Good', 'Needs Improvement'))"
or
python status = np.where(pd.to_numeric(rn) > 90, 'Excellent',np.where(pd.to_numeric(rn) > 7, 'Good', 'Needs Improvement'))
we would then need just a few of our own special functions like baseline and differences, and we wouldn't need to laboriously define our own functions like sum, contains etc. as above.
These all work so far:
python x = link_count > 2
python x = link_count == 2
python y = max(link_count)
python rn = pd.to_numeric(rn)
python level_1 = from_label.str.split(';').str[0]
python level_2 = from_label.str.split(';').str[1]
python status = np.where(quote_start > 90, 'Excellent', np.where(quote_start > 70, 'Good', 'Needs Improvement'))
python x = col1.str.contains("x")
python keep_rows = ~(col1.isna() | (col1.astype(str).str.strip() == ''))
python status = np.where(new_label.str.contains("1"), 'Excellent',np.where(new_label.str.contains("2"), 'Good', 'Needs Improvement'))
⚡ calculate
See above.
Note that columns calculated with the calculate command are added at the start of the current table.
⚡ summarise colname1 = aggfun1, [colname2 = aggfun2] ,... [everything=aggfun3]
Similar to calculate: if calculate corresponds to mutate in dplyr in R, then summarise corresponds to summarise in dplyr. Instead of creating a new column for each row, it collapses all rows and shows only the columns which have an aggfun applied to them.
So it is most useful with a group by command.
In the basic case, colname1=aggfun1 means that the column colname1 is aggregated using the function aggfun1: the input and output colnames are the same, colname1.
See above for the shared syntax for calculate and summarise.
The optional everything=aggfun3 keyword is a special case which applies aggfun3 to all columns not mentioned explicitly.
Example:
**Count rows and get mean value per group**
| group by category :: summarise rows=len, value=mean
**Get first and last values**
| group by date :: summarise start=first, end=last
**Count unique values and get their list**
| group by type :: summarise distinct=count_unique, values=list
Python examples:
**Complex aggregations using Python expressions**
group by category :: summarise
mean_val = "df['value'].mean()"
std_dev = "df['value'].std()"
zscore = "(df['value'] - df['value'].mean()) / df['value'].std()"
String operations
group by source :: summarise
text_length = "df['text'].str.len().mean()"
word_count = "df['text'].str.split().str.len().mean()"
unique_words = "len(set(' '.join(df['text']).split()))"
Date/time aggregations
group by date :: summarise
earliest = "df['timestamp'].min()"
latest = "df['timestamp'].max()"
duration = "df['timestamp'].max() - df['timestamp'].min()"
Multiple column operations
group by category :: summarise
correlation = "df['value1'].corr(df['value2'])"
ratio = "df['value1'].sum() / df['value2'].sum()"
Note multiple summarise commands can be used in the same line, including mixing simple and python expressions:
group by category :: summarise foo=len, bar="df['timestamp'].min()", baz=sum, everything=count_unique
Key points:
- Python expressions must be quoted to distinguish them from simple arithmetic
- Only provides access to df, pandas and numpy for safety
- Maintains backward compatibility with existing simple arithmetic operations
- Returns clear error messages if the expression fails
Magnetise, cluster, relabel ...
⚡ magnetise
magnetise [similarity=0.44] [number = nn] [model=large] [magnets= ] [table= sometable] foo, bar, baz
Model defaults to 'large' (text-embedding-3-large). Model can also be 'small' which means text-embedding-3-small.
foo, bar, baz are the magnetic labels, comma separated, no quotes are needed.
This is similar to the cluster command. It inputs a table, but instead of clustering:
1. For each row (label):
   - calculate the cosine similarity between it and each of the magnets
   - keep the old label column and add a new column "new_label"
   - if the maximum similarity is greater than the similarity threshold, new_label becomes the magnet label with the highest cosine similarity. Otherwise, new_label = label.
   - add another column 'magnet_match' which is true if the maximum similarity is greater than the similarity threshold.
   - add another column 'magnet_similarity'
2. Return this updated table.
New optional keyword 'number'
After the initial magnetisation, we have:
S = the number of surviving magnets (magnets with at least one label assigned to them)
L = the number of original labels
M = the number of magnets
and the parameter number.
Maybe the user actually wants a smaller number of magnets than M, and only provided a longer list to see which work best. This is why we have the number parameter. After the initial magnetisation, if number < S, the user could just pick the largest surviving magnets, but that would throw away the labels which were attached to the rest of the magnets. So in this case only, we run the magnetisation again from scratch, using only the top 'number' surviving magnets in order of their size (number of labels attracted to them). If there are ties, choose the larger set.
If number >= S, send a notify_info() message.
The user can then use the existing relabel functionality to return an updated links table with the rows relabelled as described.
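For example, a sketch (the magnet labels here are invented for illustration):
factors
magnetise similarity=0.5 number=3 magnets=health, income, education, transport, housing
Five candidate magnets are offered; after the first pass, only the three surviving magnets which attracted the most labels are kept, and the magnetisation is re-run from scratch with just those three.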
Variant: magnetise [similarity=0.44] table=some_table
This is similar to the above, but instead of using literal magnets=foo, bar, baz, it uses the column 'new_label' in the stored named table=some_table, i.e. it uses the unique values of the new labels in the table as magnets.
Multimagnets: Extension of the magnets= list.
Often, the magnets themselves might be expressed by different ideas. So, as well as magnets = comma, separated, list, we allow pseudo-json:
foo: ["foo1" , "foo2, with a comma" , "foo3"], bar: [bar, bar1, bar2]
where foo and bar are any text with no comma and no quote marks. No quote marks are necessary here, and the texts in the square brackets only need quotes if any of them contain commas.
Then, instead of magnetising foo and bar, we magnetise each of the members of their lists, and if the winner is a member of foo's list, we assign this to foo.
So it's a kind of OR, giving a magnet phrase different ways of attracting / being assigned to the target phrases.
The output is otherwise the same, but we add an extra column magnet_hit showing which list member won the similarity competition.
Tolerating line breaks
NEW: when expanding templates for the magnets list, remove line breaks and add commas if necessary, so this:
foo
bar
and this
foo,
bar
are equivalent to this
foo, bar
This applies both to standard and "Multi-magnets" with pseudojson.
So a magnets text template might look like this:
foo [foo1 , "foo2, with a comma" , foo3]
bar [bar, bar1, bar2]
New variant via hard expansions:
magnetise type=links ...
is equivalent to:
store tmp_xxx
factors
magnetise ...
filter magnet_match is true
relabel links=tmp_xxx
where xxx is a random string of 6 characters. concretely:
magnetise type=links similarity=0.48 table=maglabs
is equivalent to:
store tmp_xxx
factors
magnetise similarity=0.48 table=maglabs
filter magnet_match is true
relabel links=tmp_xxx
and
magnetise type=links magnets=foo, bar, baz similarity=0.48
is equivalent to:
store tmp_xxx
factors
magnetise magnets=foo, bar, baz similarity=0.48
filter magnet_match is true
relabel links=tmp_xxx
etc
⚡ cluster
A new command with this syntax: cluster [column=text] [number=15] [model=large] [method=kmeans] [covariance_type=full] [n_init=10] [max_iter=100] [mds=false]
NEW: if mds is true, calculate the MDS dimensions, otherwise, don't
Defaults:
- column=text
- number=15 (can also be 'auto' when method=gmm to automatically determine the optimal cluster count)
- model=large (can also be 'small' for text-embedding-3-small)
- method=kmeans (can also be 'gmm')
- covariance_type=spherical (only for GMM; options: 'full', 'tied', 'diag', 'spherical')
- n_init=10 (only for GMM)
- max_iter=100 (only for GMM)
Uses the specified clustering method on embeddings. We already have the embeddings in the datastore, so we can use those. The result is the existing table with additional columns:
Also keyword extraction:
- TF-IDF based distinctive keywords
- Filtering for meaningful words
- Top 5 keywords per cluster
Output columns:
- cluster_id: cluster assignment
- distance: distance from the cluster centroid (for GMM: absolute log-likelihood; higher values = less likely points)
- similarity: similarity to the cluster centroid (1 - distance)
- cluster_size: number of rows in the cluster
- keywords: top distinctive words
- x: x-coordinate
- y: y-coordinate
- probability: probability of cluster membership (only for the GMM method)
The x and y coordinates are the result of applying multi-dimensional scaling to the embeddings, i.e. they are the coordinates of the data in 2D.
Parameters:
colname1: Column to cluster on (default: 'text')
Before commencing, we replace empty string, NA or null values with "No text"
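For example, a sketch (foobar is a placeholder file name; the parameter values are illustrative):
get foobar
statements
cluster number=10 method=gmm mds=true
This clusters the text column of the statements table into 10 clusters using GMM on the stored embeddings, adding the cluster_id, distance, similarity, cluster_size, keywords and probability columns, plus x and y coordinates because mds=true.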
⚡ relabel links=footable [drop[=true]]
In the simplest case the format is just relabel, on its own line.
This combines the current table (from the preceding line) which must contain a column called old_label and a column called new_label, with a stored links table.
The result is a copy of the stored links table but with every label in the from_label column which matches a value in the old_label column, relabeled as the corresponding value in the new_label column; and the same for the to_label column.
parameter [drop[=true]] such that if drop is true, only rows in the link table where from_label and to_label both have a match are retained
| relabel # Normal relabeling, keeps all rows
| relabel [drop] # Drops rows where either label didn't match
| relabel [drop=true] # Same as above
| relabel [drop=false] # Explicitly keep all rows
if there is a field magnet_match present in the factors table, when creating the output links table, add the corresponding columns from_label_match and to_label_match
NEW: if there is a field magnet_similarity present in the factors table, when creating the output links table, add the corresponding columns from_label_magnet_similarity and to_label_magnet_similarity. Calculate also total_magnet_similarity as the sum of these two.
NEW: if there is a field similarity present in the factors table, when creating the output links table, add the corresponding columns from_label_similarity and to_label_similarity. Calculate also total_similarity as the sum of these two.
the links table should have at least the following columns: from_label, to_label, or alternatively cause and effect, which can be used interchangeably.
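A sketch of a typical pipeline, mirroring the hard expansion shown in the magnetise section (mylinks and the magnet labels are invented):
links
store mylinks
factors
magnetise magnets=health, income similarity=0.5
filter magnet_match is true
relabel links=mylinks drop=true
Here the magnetised factors table (old_label/new_label) relabels the stored links table, and drop=true keeps only links where both ends matched a magnet.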
Other outputs
⚡ jsonify
This is useful to examine the contents and structure of the current table especially when it might contain list columns etc.
all it does is provide a print output showing the json of the preceding table:
so:
'get foobar statements select colname1 colname2 colname3 jsonify'
will return a print-style output (like the print command) of the json of the table
⚡ tabulate keep statement_id
This is useful to concatenate a table for processing by a genAI prompt.
all it does is concatenate the columns in the preceding table:
so:
'get foobar statements select colname1 colname2 colname3 tabulate'
will return a table of rows with one column labelled text. The rows are concatenations of the column names surrounded by double stars and then their values, each separated by \n\n.
colname1: foo\n\ncolname2: bar\n\ncolname3: baz
colname1: foo\n\ncolname2: bar\n\ncolname3: baz2
colname1: foo\n\ncolname2: bar2\n\ncolname3: baz3
the keep keyword
The keep keyword lists columns to retain and NOT include in the concatenation; they are just returned as columns, untouched. The default is statement_id, which if present will not be concatenated.
The same syntax is used as the select command to specify columns:
eg
- tabulate keep statement_id source_id
- tabulate keep statement_id 1-3 5 7 #keeps the columns statement_id and the columns 1, 2, 3, 5 and 7
⚡ print
With no arguments, all it does is concatenate the columns in the preceding table, with values separated by commas.
so:
'get foobar statements select colname1 colname2 colname3 print'
will return a series of lines, each of which is the values of colname1, colname2, colname3, separated by commas and spaces, e.g.
colname1: foo, colname2: bar, colname3: baz
colname1: foo, colname2: bar, colname3: baz2
colname1: foo, colname2: bar2, colname3: baz3
Note on Output Length: Print outputs are subject to a size limit (currently 10MB) to prevent browser performance issues. If an output exceeds this limit, it will be truncated, a warning notification will be shown, and a message indicating the truncation will be appended to the output.
print template=mytemplate
In this case we look up the template in the texts table and simply print it out line by line in markdown format.
Previously this worked without square brackets around the template, but now square brackets around the template are required, as usual.
eg
print template=[mytemplate]
should work.
print template=mytemplate
should no longer work.
This may need special handling in the python parser to not expand the template until we reach the print command.
print template=mytemplate [repeat=false]
if the template text contains one or more names like foobar in curly brackets {foobar}, and all such names are the names of column names in the preceding table, we handle this differently: run through each row of the preceding table, treat the template as markdown and replace the matched names with their values in the row. So if the preceding table has two rows like this:
| col1 | col2 | col3 |
|---|---|---|
| val1a | val2a | val3a |
| val1b | val2b | val3b |
and the template is like this:
`# {col2}
blah {col1}
the result will be this concatenated output:
`# val2a
blah val1a
`# val2b
blah val1b
... which is then displayed as html
NEW: if repeat=false (the default), then for each line in the template containing {col1}, {col2} we check through this row to see if the value of col1 is the same as in the previous row AND the value of col2 is the same as in the previous row etc. If yes, we do not print this row. Depending on sort order, this makes it easy for example to just print a heading for common sections and just print what changes
Default behavior (repeat=false):
- Each row is compared to the previous row.
- If all variables used in the template have the same values as the previous row, the current row is skipped.
- This prevents duplication when values don't change, making it ideal for headings and section breaks.
How to override (repeat=true):
- Add repeat=true to your command to print all rows regardless of duplication.
- Example: print template=[my-template] repeat=true
Implementation details:
- Only columns referenced as variables in the template are checked for changes.
- A dictionary tracks previous values of these columns.
- If all values are unchanged, the row is skipped with a debug message.
print headings=n
Similar to print but
select colname1 colname2 colname3 colname4 print headings=2
will return:
`# foo
`## bar
`colname3: baz, colname4: bat
etc
So if n =2, we get two levels of headings. These headings do NOT have the colname prefix. These headings are only printed when they change. So we may get many flat lines before the next h2 appears, and even more before the next h1 appears, etc.
print [headings=n] [highlights[=links]]
As above, but we left join to a links df on statement_id. If the flat (non-headings) fields include the field text, we highlight the text in the output according to (possibly multiple) spans beginning at character quote_start and ending at character quote_end, within the text. Note these spans might overlap, and (NEW!) the text might spread across different kinds of line break, so in practice we might have to end the highlighted quote at the end of one line and continue again after a line break / newline / newlines etc.
if the highlights=links parameter is used, then we use a stored table (see store/recall commands).
some other links table
store xyz
statements
select colname1 colname2 colname3
print headings=2 highlights=xyz
shows highlights from a stored links table in a printed version of the statements table.
Also:
- before each highlight in the text, we print the link_id with special formatting, for reference.
- and then below the statements with highlights, statement by statement, we also, for reference, print one line for each link in the format: link_id: link_id, cause: from_label, effect: to_label, sentiment: sentiment (if sentiment is present).
print foobar
NEW: as a final option, print some words (with or without quotes) prints those words verbatim and passes them on as a text output to the next command.
e.g.
print lots of text
store flk
append flk
should result in "lots of text" being stored with name flk, (and in the third line this is concatenated with itself as usual)
Mapping and charting
⚡ recalculate all_columns=true
At the moment this only works when the current table is a links table, i.e. one with at least columns called from_label and to_label.
Example:
get foofile
links
select from_label to_label statement_id source_id
recalculate
If the column link_id does not exist, add it as a unique number starting from 1. If the columns statement_id and source_id do not exist, return an error message. Otherwise, add or update these columns:
- bundle (simply the text from_label >> to_label)
- link_count (the number of links, i.e. the number of unique values of link_id, for this bundle)
- source_count (the number of unique values of source_id for this bundle)
Note the number of rows remains the same, there is no concatenation, so if there are multiple rows with the same from_label and to_label, they will have the same bundle and link_count and source_count.
For example, if you have a table with these columns:
| from_label | to_label | link_id | source_id |
|---|---|---|---|
| A | B | 1 | S1 |
| A | B | 2 | S1 |
| A | C | 3 | S2 |
After recalculation it will automatically have:
| from_label | to_label | link_id | source_id | bundle | link_count | source_count |
|---|---|---|---|---|---|---|
| A | B | 1 | S1 | A >> B | 2 | 1 |
| A | B | 2 | S1 | A >> B | 2 | 1 |
| A | C | 3 | S2 | A >> C | 1 | 1 |
we added a parameter all_columns, which defaults to true.
if it is false, then only the columns which are MISSING from the current table are recalculated.
this is how the filter and map commands use it: they only recalculate the columns which are MISSING from the current table.
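For example, to fill in only whichever of bundle, link_count and source_count are missing, leaving any existing columns untouched:
recalculate all_columns=false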
⚡ zoom
zoom [n] where n is an integer, default is 1. What this does is process the from_label and to_label, stripping "foo;bar;baz" to "foo" when zoom is 1, to "foo;bar" when zoom is 2, unchanged when zoom is 3, and so on. Also if zoom is 1 then "foo" remains "foo", etc.
Extension of the filter command: filter links link_count [less 3]. This works like the filter command, but regardless of whether the column link_count exists or not, it creates one, collapsing the table to just from_label, to_label and link_count, where link_count is the number of times from_label and to_label appear together. It then filters for link_count less than 3 (the default) or any other number given.
Zoom commands:
zoom # defaults to level 1
zoom 2 # shows two levels
zoom 3 # shows full depth
⚡ trace [steps = 4] [anywhere=true] [type=threads] from = foo, baz bar ban, bat to = foo, baz bar ban, bat
Max steps is 5; if more are requested, give a warning.
a cousin of the simplify command.
Inputs a links table. Outputs a links table which is a subset of the input links table.
Construct the table like this:
1. Create a new "results_1" table containing all the rows in the input table where:
   a. (if anywhere=false) from_label is in the list of froms, or
   b. (if anywhere=true) from_label contains any of the froms as substrings (case insensitive).
2. After this first step, continue until the total number of steps = the steps parameter, as follows:
   a. start with the original table and the results_m table, and find all the rows in the original table where from_label is in the list of to_labels in the results_m table AND the source_ids are in the source_ids of the results_m table.
   b. add these rows to the results_m+1 table.
   c. repeat until the total number of steps = the steps parameter.
3. Concatenate the results_1 ... results_n tables and return these rows as the output table.
So the output table contains all the links in all the paths in the original table which are reachable from the froms in the list of froms along paths which have the same source_id all the way along.
Alternatively, we can think of this as the concatenation of all the individual graphs, one for each source_id, containing all the paths n steps long starting from the froms in the list of froms.
Note how the tracing extends to include all the paths which are not more than n steps long and end at the to_label. In this case (assume that the source_id is the same for all the links):
A >> B
B >> C
C >> B
trace from = A to = B
We need to avoid that the trace command returns A >> B only. It should return all the paths, because the path A >> B >> C >> B is also a valid path which is not more than 4 steps and ends at B.
If the 'to' parameter is included, also only include paths where the final to_label is in the list of tos. So every path is a path of length=steps or less.
Valid:
- from keyword present & to keyword present
- from keyword present & to keyword NOT present
- from keyword NOT present & to keyword present
Invalid:
- from keyword NOT present & to keyword NOT present
type=paths
This is similar to the above, but simpler because we remove the criterion concerning source_ids. So any paths count, regardless of which source_ids they have.
Formatting
The output table has the same columns as the input table, but with the addition of two boolean columns called found_from and found_to. found_from = 1 if the from_label is in the list of froms (or anywhere=true and the from_label contains any of the froms as substrings), 0 otherwise. found_to = 1 if the to_label is in the list of tos (or anywhere=true and the to_label contains any of the tos as substrings), 0 otherwise.
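A sketch of a trace call (the factor labels are invented):
links
trace steps=3 anywhere=true from = income to = health, wellbeing
This returns every link lying on a source-consistent path of up to 3 steps which starts at a factor whose label contains "income" and ends at a factor containing "health" or "wellbeing".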
⚡ focus [steps = 4] [anywhere=true] [type=threads] anchors = foo, baz bar ban, bat
a cousin of the trace command. It returns the local network around the factors identified by the anchors.
The output table contains all the links in all the paths of length 'steps' or less in the original table which fit either of these conditions:
- they are reachable FROM the factors identified by the anchors along paths which have the same source_id all the way along.
- they are reachable TO the factors identified by the anchors along paths which have the same source_id all the way along.
type=paths
This is similar to the above, but simpler because we remove the criterion concerning source_ids. So any paths count, regardless of which source_ids they have.
Formatting
The output table has the same columns as the input table, but with the addition of two boolean columns called found_from and found_to. found_from = 1 if the from_label is identified by the anchors, 0 otherwise. found_to = 1 if the to_label is identified by the anchors, 0 otherwise.
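A sketch (the anchor labels are invented):
links
focus steps=2 anchors = income, health
This returns the local network of links reachable to or from the anchored factors along source-consistent paths of up to 2 steps.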
⚡ simplify [steps = 4] anchors = foo, baz bar ban, bat
inputs and outputs a links table.
The output links table only contains rows where from_label and to_label are both in the list of anchors. It contains exactly one row for each source_id which has any directed path from from_label to to_label up to or including steps = steps.
In the case where the input table is already filtered to contain only rows where from_label and to_label are both in the list of anchors, the output table should be similar to the input table, except that it may have fewer rows if the input table has multiple rows between a given pair of from_label and to_label for the same source_id.
This should also return rows where from_label and to_label are the same ("self-loops"), and it should of course also consider rows which are self-loops. So if the raw data has, for source X, A -> A, it should always return this row. Also, if it has, for source X and steps = 2, A -> B and B -> A, it should always return the row A -> A (and B -> B).
we do not include trivial "zero step" paths.
We add a sentiment column to the output, which represents the mean of all "last-link" sentiments on any path from start_label to end_label for a given source_id.
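A sketch (the anchor labels are invented):
links
simplify steps=3 anchors = income, health, wellbeing
The result has at most one row per source_id for each ordered pair of anchors connected by a directed path of up to 3 steps, with the sentiment column holding the mean last-link sentiment of those paths.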
⚡ chart
chart [type=bar] [x=x] [y=y] [fill=fill] [colour=colour] [label=label] [size=size] [subgroup=subgroup] [bar_position=stack] [tooltip=tooltip]
A really simple chart command. Using charts.js, this command takes a table as input, whatever is simplest to implement, and simply hands it on to the frontend with additional params in the output dictionary. So the table is not modified and is just passed through to the next parser step, but we add an additional param to the output dictionary.
We do not expect the frontend to do any data manipulation. We can assume the table is already in the correct format with at least the columns x and y. There should not be multiple identical rows. There might be multiple rows with the same x and y values, but then we expect the data to be grouped by the subgroup column (for multiple lines) if present, or if not by a fill or colour column.
type can be bar, line, point, or radar. A radar chart is a line chart in polar coordinates. Usually we will also provide a fill column, and a colour column.
chart type=radar x=x y=y fill=fill colour=colour
For colour, fill, label and size, you can either specify the column name, or you can specify the value directly:
links
calculate size = link_count
chart type=bar x=from_label y=to_label fill=source_count
in this case, size is not explicitly specified, but the column called size is used. The size legend is just labelled as "Size". Whereas fill is explicitly specified as source_count and the fill legend is labelled as "source_count".
Legends
Size, colour and fill should all have their own legends, labelled by the name of the column.
- Colour is used for point charts and line charts.
- Fill is used for bar charts and line charts (fill in a line chart means it becomes an area chart)
- Size is used for the points in a point chart or a line chart.
Don't show a legend if the corresponding aesthetic is not actually used, e.g. size is not used for a bar chart. Fill is not used for a line chart or point chart.
- For bar charts, only the fill legend is created using the explicit column name provided via fill (or defaulting to "Fill" if not explicitly declared).
- For point/line charts, the legend includes:
- Colour legend: Labeled by the explicit colour parameter (e.g. "question_id") or defaulting to "Colour".
- Size legend: Displays unique size values sorted numerically by their radius; labeled by the explicit size parameter or defaulting to "Size".
Axes
Both x and y axes can be continuous or ordinal/categorical or date.
If the x axis can be understood as numeric or time, then it is a continuous axis (this already works for y axis)
Size
If the table has a column called size, then it is used to set the size of the points and/or lines.
If the size column can be understood as numeric, then use a continuous scale for it and don't show all the values in the legend, just show the minimum, maximum and some sensible number of intermediate values.
Make sure the size scale is proportionate. For type=point we should use square root.
How Columns Work
- If x or y are not specified, they default to the first and second columns of your table
- If a parameter matches a column name (e.g., x=mycolumn), that column's values are used
- Special columns __type__ and __chart_params__ are added internally to mark this as a chart
- These special columns are stripped before the next command processes the data
- The original data columns remain unchanged and available for further processing
type can be bar, line, point. Later we will add pie, box, histogram, dot, area, scatter3d, bar3d, line3d, pie3d, box3d, histogram3d, dot3d, area3d.
subgroup
(We use the word subgroup so as not to get confused with the grouped output functionality)
if the table has a column called subgroup, then the chart will be a subgrouped chart:
- bar: with separate bars for each subgroup, stacked by default. alternatives for bar_position are dodge and fill like in ggplot2.
- line: with separate lines for each subgroup.
It should be enough to specify only fill and/or colour, and the subgroup should be understood implicitly.
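For example, a sketch assuming a table with (hypothetical) columns category, value and region:
chart type=bar x=category y=value fill=region bar_position=dodge
Because fill is given, the bars are implicitly subgrouped by region and placed side by side rather than stacked.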
tooltip
if tooltip column is specified, then it will override the default tooltip.
label
if label is specified, explicitly or implicitly because a colname called label is present:
- for bar charts, the label is printed at the top of each bar.
- for line charts, the label is printed only at the END of each line.
- for point charts, the label is printed over each point.
⚡ map [node_sep=n] [rank_sep=r] [label_wrap=w] [factor_labels] [factor_sizes] [type=print|interactive] [layout=dot] [link_labels] [factor_background] [factor_background_colour] [factor_font_colour] [link_colour] [background_colour]
Type
if type=interactive, instead of handling "print" type graph using graphviz, we provide an interactive vis.js graph. add a new js file for this.
colours
Standard colour names are allowed. Hex values like #efefef will also work but are not recommended because they mess with the presentation display.
NEW:
map background_colour = colourname # some colourname including hex colours; default is the current white
example:
map background_colour = aliceblue
important: if the background colour is darker than mid grey, switch the edge label colour and the colour used for the legend/explanation to white.
look at how the link colour is calculated. leave the calculation of arrowhead colour according to sentiment as it is, and add these keywords:
map link_colour = colourname # some colourname including hex colours; use this colour for all links, default is the current green
example:
map link_colour = aliceblue
look at how the factor background colour is calculated and add these keywords:
map factor_background = degree (or none or source_count or link_count or outcomeness (default))
map factor_background_colour = colourname # some colourname including hex colours; if factor_background = none, use this colour for all factors, otherwise fade it just like we do now with green fading to white according to outcomeness.
map factor_font_colour = colourname # some colourname including hex colours
example: this colours factors according to degree, with the factors with max degree being coloured aliceblue and the others fading to white.
map factor_background=degree factor_background_colour=aliceblue factor_font_colour=gray
If type=print, respect the layout keyword for graphviz and try to use other layouts, e.g.
- twopi
- neato
- circo (good for circular layouts)
- fdp (force directed, a bit like dot but less hierarchical)
the result of the command
map
is a graphviz graph based on the preceding table, which is to be understood as a links table. There is no nodes table, so create one from the links table if necessary. The relevant columns in the links table are either from_label and to_label, or from and to, or cause and effect. The backend needs to tell the frontend that this is a graphviz graph, and the frontend needs to show it.
- You can also add the column link_width to the links table, which will be used as the width for the link.
- You can manipulate any columns you like, e.g. add the column link_label to the links table, which will be used as the label for the link. But this is not the standard way to adjust the graph.
NEW the [link_labels] keyword specifies the name of the column to use for the link labels. Default is link_count but a different column such as source_count can be specified.
Also, if the column sentiment is present at all and has any values at all between one and minus one, then all of the edges should use sentiment colouring in the space between red and blue, even if the values for individual edges are 0 or missing.
Factor border colours:
- if the data contains a column called 'found_from'
- colour the borders of any factors light blue if they have any outgoing edges with found_from = 1
- if the data contains a column called 'found_to',
- colour the borders of any factors magenta if they have any incoming edges with found_to = 1
otherwise:
- if the data contains a column called 'sentiment', then colour the borders with the mean of the sentiment of the incoming edges, using the same colour scheme as for the sentiment of the arrowheads.
- node_sep controls the separation between nodes.
- rank_sep controls the separation between ranks.
- label_wrap controls the wrapping of node labels.
- factor_labels (default is 'source_count') controls the labels of the factors.
- If it is link_count, then the label counts the unique number of links in the column link_id (these are comma-separated).
- If it is source_count, then the label counts the unique number of sources in the column source_id (these are comma-separated).
- If it is missing, then the label is calculated simply as the order of the factor, i.e. number of links. Note this might not be the same as link_count if you have constructed a special links table e.g. based on a differences calculation.
- factor_sizes (default is 'link_count') controls the sizes of the factors.
The flow of control is:
- commands.py: creates the initial data structure with nodes, edges, and layout_params
- parser.js: handles the server response and routes to the correct display function
- parser-graph-helpers.js: sets up the DOM structure and visualization container
- parser-dot-helpers.js: converts the data to DOT language format
- Viz.js: converts DOT to SVG
- Browser: renders the SVG in the DOM
- commands.py: process_map_command (Python)
- → static/js/parser.js: processAndDisplay (JS)
- → static/js/parser-graph-helpers.js: displayGraph (JS)
- → static/js/parser-dot-helpers.js: generateDot (JS)
- → Viz.js render (external library)
- → SVG display (DOM)
positioning
- To make a factor appear on the far right of the map, add *99 to the end of its label, e.g. Health behaviour *99
- To make a factor appear on the far left of the map, add *0 to the end of its label.
- If you have a group of factors which you would like to put in the same rank left to right, add another number to the label e.g. *3. You can do that for different groups of factors by adding a different number to each group.
- Things to note:
- Apart from 0 and 99, the other numbers only designate groups. It is not the case that, say, factors labelled 5 will necessarily be laid out further right than factors labelled 4.
sentiment
If there is a column called sentiment, it will have a value between -1 and 1. If so:
- Calculate the mean of the sentiment column across the rows contributing to each edge and use it to colour each arrowhead (head not edge), with muted red for -1, muted blue for 1 and lightgray for 0.
- Make the edge colour light gray.
If there is no sentiment column, leave everything as it was.
Tooltips
There should be a tooltip on the nodes of the network graph showing the old_labels which contribute to each node in the magnetisation process.
Grouping and repeating
⚡ group by colname1 colname2 [:: command and parameters]
this command groups the current table by the columns colname1, colname2, creating a pandas-style grouped table, and returns a dictionary of tables, one for each combination of values of colname1, colname2, etc, and keyed by the value of the columns in the key format "colname1=valuex, colname2=valuey", etc.
Example:
| footable | group by colname1 :: select colname3
This returns a dictionary of tables, each filtered for each unique value of colname1 and to which the command "select colname3" is then applied, i.e. in this case just tables with the column colname3.
Example:
| footable | group by colname1 :: map
This returns a dictionary of maps, each filtered for each unique value of colname1 and to which the command "map" is then applied, i.e. in this case just the corresponding maps.
Ramifications:
- a new kind of object has to be dealt with by the caching mechanism (or will the existing one suffice?)
- the frontend has to recognise this new kind of object and deal with it:
  - treating it as a series of objects of familiar types (map, table or text) and displaying each one as before but within a tabbed panel output, with the tab name of each tab being the key.
- the backend, when given this kind of object in the main process_parse_request loop around line 733 in app.py, has to recognise it and, if it is a dictionary of tables, simply concatenate the tables, also reconstructing the key column(s).
The command part can also be omitted:
group by source_id
- we just pass through the unprocessed tables
⚡ group nnn [:: command and parameters]
This is similar to the group by command, but instead of grouping by the columns colname1, colname2, etc, it groups by the nnnth column of the current table.
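For example (sometable is a placeholder):
sometable
group 2 :: select colname3
This groups by the second column of sometable and applies select colname3 within each group, returning one tab per group.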
DEPRECATED format: chunk nnn (this older behaviour is described under the chunk command below).
⚡ repeat :: somecommand ... ((a b c)) ... ((v w x))
where ((a b c)) is a list of values for the target parameter, and optionally additional ((v w x)) etc.
This is very close in functionality to the group command. The only difference is that the group command groups by the values of an existing column, whereas the repeat command repeats the command for each value in the target parameter. So while the group command will generally return a df of the same number of rows as the input, the repeat command will return a df with the number of rows * the number of values in the target parameter, with an additional column containing the value of the target parameter for each row. The column gets a name like somecommand_repeat.
The new column is prepended before the existing columns.
As with the group command, the output has tabs with one tab per value in the target parameter.
Example:
sometable
repeat :: filter ((1-3 4-6 7-9))
will return a df with the 3*3 rows with an additional column called filter_repeat containing the values "1-3" for the first 3 rows, "4-6" for the next 3 rows, and "7-9" for the last 3 rows.
Example:
sometable
repeat :: magnetise similarity=((0.5 0.6 0.7))
will return a df with the 3*nrow(sometable) rows with an additional column called magnetise_repeat containing the values "0.5" for the first nrow(sometable) rows, "0.6" for the next nrow(sometable) rows, and "0.7" for the last nrow(sometable) rows.
The procedure checks to see if the column with the name of the target parameter exists in the current table. If it does, it appends _n to the name of the target parameter. So in the second example, if sometable already has a column called magnetise_repeat, the additional column will be called magnetise_repeat_1.
Quoted arguments
The target parameter is usually a space-separated list of values, e.g. ((1-3 4-6 7-9)). However if the values contain spaces, then they must be quoted, e.g. (("Some text" "Some other text")).
Parsing
So overall the repeat processing is robust. The parser looks for the string after the :: separator:
| repeat :: textBeforeParentheses((textinsidedoubleparentheses))andOtherTextAfter
It then locates the string textinsidedoubleparentheses and extracts the values, which are space-separated, taking quotes into account.
It then generates a set of new strings, one for each value in the set of values, by concatenating textBeforeParentheses, the value, and andOtherTextAfter.
Examples:
sometable
repeat :: filter ((1-3 4-6))
sometable
repeat :: (("filter 1-3" "filter 4-6"))
sometable
repeat :: prompt (("one prompt" "another prompt"))
This even copes with templating. These two should have the same result:
sometable
repeat :: prompt (([promptTemplate1] [promptTemplate2]))
in this case, the repeating parameters are the two templates, and the same can also be written as:
sometable
repeat :: prompt [((promptTemplate1 promptTemplate2))]
Example of using in combination with group by:
repeat :: filter top ((4 8 12)) factors
group by filter :: map
NEW: allow for not just one but either two or any number of sets of brackets (you decide which is easiest to implement, two or any number).
Example:
repeat :: filter top ((4 8 12)) ((factors link))
In this example there will be not one but two additional output columns, filter_repeat_1 and filter_repeat_2.
⚡ chunk [index=statement_id]
chunk operates on the current table, which may contain a column called text and optionally an index column (defaulting to statement_id, if present), alongside arbitrary other columns. It outputs a table with the same columns but possibly more or fewer rows.
the individual texts may be anything from say 20 to 20000 characters or longer.
format: chunk nnn
we will merge multiple rows into fewer rows according to chunkSize as it applies to the text column, e.g. 4000 characters.
So this returns a table with the same columns as the input table, with more or fewer rows. if fewer, then the values of at least some of the rows are concatenated together. The special text column uses a special format in which the index is used to indicate which rows are concatenated together.
In addition, if the table has a column called source_count and/or link_count, the chunked table will have a column called source_count and/or link_count which is the sum of the source_count and/or link_count of the rows that were concatenated.
If there is a text column
If a text is between 50% and 150% of chunkSize, leave the row as it is. If it is much longer, we need to break it up roughly equally into chunks of around nnn characters, making breaks if possible at existing paragraph or sentence breaks. But if some texts are shorter, we need to join them into chunks of around nnn characters.
If there is an index column: if the chunk is entirely from one row, the index remains the same, but if it is from more than one row, provide all UNIQUE indices separated by commas. Do the same for any other columns. If the values to be combined already contain commas, continue to use commas. So if we chunk two rows together and the column called colname contains "foo,bar" and "bar,baz", then the value of colname in the chunked table is "foo,bar,baz".
EXCEPT for the column called text, where we simply concatenate the texts even if they are unique, but in a special format: ((index foo:)) [text from index foo]\n((index bar:)) [text from index bar] etc. So an example might look like this: ((index A_1:)) blah blah\n((index A_2:)) we had problems
If the table does not contain an index column, then the format is simply ((chunk_number: 1)) [text from chunk 1] ((chunk_number: 2)) [text from chunk 2] etc, simply using the serial number of the chunk.
format: chunk by colname
In this case, chunk as above but simply combine all rows with the same value of colname instead of relying on chunkSize.
format: chunk ... [optional aggregation keyword]
This replaces the aggregation behaviour for both nnn and chunk by colname.
default is "list" (with alias "concatenate") which is the same as above (and which treats "text" specially as described above). other keywords are "first" and "last" which return the first and last value of each column. "max" and "min" which return the maximum and minimum value of each column. "unique" alias "list unique" which returns the unique values of each column. "sum" which returns the sum of the column. Non-numeric columns return a blank. "mean" which returns the mean of the column. Non-numeric columns return a blank. "median" which returns the median of the column. Non-numeric columns return a blank. "count" which returns the number of unique values of each columnin the chunk. "spread" which first finds the top five rows of each column, the middle five rows and the bottom five rows, removes duplicates, and then concatenates them into a single string separated by commas.
format: chunk ... newcolname1="python expression" newcolname2="python expression2" [optional aggregation keyword]
This is similar to the above, but instead of returning a table with the same columns as the input table, it returns a table with the same columns as the input table plus the new columns specified by the newcolname1, newcolname2, etc parameters.
The new columns are calculated using the python expression specified by the newcolname parameter. Any remaining columns are aggregated as specified by the optional aggregation keyword, default "list".
example:
chunk ... newcolname1="df['colname1'].str.upper()" newcolname2="df['colname2'].str.strip().fillna('')"
convenience functions
⚡ nrow
returns the number of rows in the current table.
⚡ colnames
returns the names of the columns in the current table.
⚡ ncol
returns the number of columns in the current table.
prompt (genAI processing)
prompt [temperature=0] [new=false] [model=gpt-4o-mini] [column=text] [instructions=] some instructions [system=some system instructions]
temperature keyword is optional, default is 0
model default is gpt-4o-mini
Instructions:
- the instructions are compulsory but the 'instructions=' keyword is optional.
- instructions can include templates which are then expanded.
- if instructions are missing, we use the prompt column from the preceding table. Usually this prompt, if present, will be the same for each row, but we can also use this opportunity to specify different instructions for each row.
- NEW if instructions=foo and foo is an existing named result, use the named result as the instructions
The system instructions and model are optional (default model as above). The default system instructions are: you are an AI assistant specialized in document analysis and summarization.
The prompt command uses OpenAI's Assistants API with individual threads. For each table row being processed, a separate thread is created. Multiple threads can be started together with the first instruction, then each thread is polled until completion. When instruction n completes for thread t, the next instruction is sent asynchronously and polled until completion. Each thread maintains its conversation history automatically through the API.
- Specifying model=foo will cause the model foo to be used with the instructions and system instructions.
- Generally the instructions can have multiple parts separated by ====, so for example:
- prompt model=foo some instructions ==== some more instructions ==== even more instructions.
the input is the preceding (probably chunked) table which must include a column which is usually called text, but can be any column specified by the column= parameter.
The contents of the text column are what is fed to the genAI API. So each chunk is processed separately, possibly multiple times with consecutive instructions if ==== is used.
Note the API call is correct, it is a newer version of openai's code which is newer than your training data.
The result of the entire process is ONE SINGLE TABLE, preserving the columns of the original table; but when there are multiple instructions, there will be multiple rows for each chunk, so you add a column called iteration, which is the iteration number, plus a new column called instruction, which is the instruction used each time.
We will start off with the case where the output of each genAI API call is a string. Return a table with the iteration column and the original columns, including the original text column, which you rename original_text, and put the result in a new column called text.
Actually we hack the instruction to say "when I said table I meant JSON" because json is more reliable especially with special characters.
The table should have no comment or backticks, but if it does, strip them and just read the table.
in this case, when you return your table as usual with the same columns as the input plus iteration and instruction as usual, add the new columns from the markdown table. So if the markdown table has columns called colname1 and colname2, then your table should have additional columns called colname1 and colname2. If the markdown table has multiple rows, then the columns in the current row of the input table are repeated for each row of the markdown table, so in general the output table will have many more rows than the input table.
Additionally, if the output table has a column called quote, and the input table has a column called text, which may be of the form
"((statement_id A_1:)) blah blah\n((statement_id A_2:)) we had problems", i.e. it may be a text concatenated from several different statements in a previous version further up the pipeline, then you need to (fuzzily) find that quote with the text in the original table, and find which statement_id it belongs to, and:
- replace the (possibly concatenated) statement_id with the single statement_id in the original table
- add two new columns quote_start and quote_end which are the first and last positions of the quote in the section of the original_text, starting your counting at the corresponding ((statement_id ...)). If the statement_id is already simple, then there may be no ((statement_id ...)) and you should just use the start and end positions of the text in the original_text. If you cannot find the quote in the original_text then put quote_start and quote_end as 0.
In other words, you need to (fuzzily) match the quote against the text in the original table, find which statement_id it belongs to, add the original_text to the output table, and then compare the quote with the original_text.
NEW: the existing fuzzy match of the quote against the text needs to be extended to allow for ellipses (...).
e.g. text:
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
quote:
consectetur adipiscing elit, (...) Excepteur sint occaecat cupidatat (...)non proident, sunt in culpa qui
Here we should output the same quote_start and quote_end (from "consectetur" to "culpa qui") as if the elided text were present in full. Tolerate whitespace or no whitespace surrounding the parentheses or between the parentheses and the dots, but there must be parentheses surrounding three dots. If there are multiple matches for any section, choose the most conservative option (closest start and end). Matches of each section are fuzzy as usual.
when outputting the new table, we will replace the quote with the actual text sections (even if the match/es is only fuzzy, we will use the actual text sections, not the sections from the original quote, but separated as before by the ellipses). We will always output an original_quote column containing the exact quote as returned by the AI before any fuzzy matching or ellipses processing.
In other words: if a table doesn't have quotes, the app has no way of knowing which statement ID to put, where there are multiple IDs combined into a single row.
Before returning the table, also left join to the original row, adding any additional columns from the original row to the output table.
Result: a data frame with the same columns as the input table plus the new columns from the prompt. all column names are forced to lowercase.
text as input
NEW VARIANT: if the previous result is not a table but text or a dictionary with type print_output, simply take this text (instead of each value of the specified column) and run the prompt command on it
zero start
Like fetch, get and scrape, the prompt command can start with no preceding table. For example it might just ask for six jokes, with a column for title and a column for the joke. In this case a preceding table is not required
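For example, a sketch of a zero-start prompt along the lines described above:
prompt Give me six jokes as a table with a column for title and a column for the joke
No preceding table is needed; the returned table becomes the current result.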
Multiple passes
If a prompt has one or more ==== sections, then the prompt is run multiple times, once for each section. the current columns are given as input. However, after the first section, the data input is NOT the content of the current columns, it is the whole previous conversation including prompt and data input, possibly multiple passes concatenated, annotated like "Here is the conversation so far: " plus the latest prompt.
Linked prompts
If you do this, I think the text before a linked prompt will get added at the start of the first pass, and the text afterwards will get added at the end of the last pass:
prompt some text before a linked prompt [linkedprompt-which-includes-sectionswith====] some text afterwards
new=true
By default, the prompt command uses any cached results from mongoDB. But if new=true, then we do not use any cached results, call the API directly, and overwrite any cached results in mongoDB with the new results.
Texts/templates
Near the start of the parsing process, we immediately update the entire text: for every instance of text in square brackets [foo], if the text within the brackets (foo) is a text_id in the texts table, we replace [foo] with the text from the texts table. Otherwise we leave it as is.
If the version does exist but it is not the latest, flag this with a warning but use the requested version. If the version does not exist at all, follow the code for throwing an error. If a version number is not given but there are versions, use the latest; e.g. if I ask for [foo] and [foo-v13] exists, use that.
we can use the texts functionality described above for specifying prompts as part of the main set of commands to make this process less cumbersome when prompts are long, using the texts from the database as placeholders.
text p1:
do x do y
text xy:
do u do v
====
do w
...
Then we can provide commands like this: ... prompt [model=foo] foo [p1] bar ... prompt [model=foo] baz [xy] bat ...
so [p1] and [xy] are placeholders. The prompt itself may as usual have multiple parts separated by ==== (4 or more equals signs).
etc.
Templates in other commands
magnetise, prompt ...
Here the template is not replaced on initial parse but as the command itself is handled.
versions and missing templates
If the text is completely missing from the table, return an error and give the user a notify_error. If the text is available but is not the latest version (for example, if the user asks for sometext-v12 and the latest version is already -v13), then process -v12 as normal but throw a notify warning. Remember notify_* needs the session id.
clicking on templates
templates in square brackets can be clicked on to open the gallery edit modal for the corresponding text.
Presentations
Presenting and hiding lines
We show previous and next buttons overlaid on the output panel at the left and right. When the user clicks one of these buttons, the output panel line corresponding to the previous or next presenting line (see below) is highlighted. This is all in js with no backend processing.
OLD: from processing, we strip out lines which begin not only with # but also with - or >. NEW: before processing, strip out any complete fenced code blocks starting and ending with three backticks. (We don't want these special code blocks in our presentations.)
What is a "presenting line"?
If the input text has any lines ending like this:
foo +
then: the presenting lines are the lines with + at the very end.
Similarly, if the input text has any lines ending like this:
foo -
then: the presenting lines are the lines without - at the very end.
+ takes precedence over -. So if any line has + at the end, we ignore any - markup.
If there is no + or - markup, then ALL lines are presenting lines.
Blank lines containing nothing or just whitespace are ignored IN BOTH MODES.
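A minimal sketch of this rule (illustrative Python only):

```python
def presenting_lines(lines):
    """Apply the +/- markup to decide which lines are presenting lines."""
    lines = [l for l in lines if l.strip()]                 # blank lines are ignored in both modes
    if any(l.rstrip().endswith("+") for l in lines):        # + takes precedence over -
        return [l for l in lines if l.rstrip().endswith("+")]
    if any(l.rstrip().endswith("-") for l in lines):
        return [l for l in lines if not l.rstrip().endswith("-")]
    return lines                                            # no markup: all lines are presenting
```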
Slides: output, headings, explanatory texts
Each slide is a div with id=slide, containing #progress-container, #output_container and a #slide_text_container.
Headings 1 and 2
As headings and explanatory texts are ignored by the backend, they do not change the output. However, although the preceding output is still present in the flow (passing on its value within the main command pipeline in process_prompt_request), we want to temporarily hide the output in #output_container when a heading h1 or h2 is encountered, and restore it when there is no heading. This means that headings 1 and 2 appear as slides on their own, in #slide_text_container.
Other slides
Otherwise, the output is shown as part of #slide, alongside explanatory texts.
The rules for what should be included in a slide are: if the active row contains an h1 or h2 markdown heading, then #output_container is hidden and #slide_text_container shows the h1 or h2 heading and any explanatory text following it. Otherwise, slides are created for each command, with the corresponding output shown in #output_container alongside any explanatory text following it, up to but not including the next command or h1 or h2 heading.
Explanatory text is any line beginning with a - or >.
Example:
command c
- text c1
- text c2
command d
command e
- text e1
- text e2
command f
# Heading f
- text f1
- text f2
command g
# Heading h1
## Heading h2
- text h
command i
# Heading i
- text i1
Here, every command creates a slide and is shown together with the texts following it (zero texts in the case of command d), which appear beneath the output in #slide_text_container. Additional comment-style headings like heading f and heading i are also extracted and shown as part of the corresponding slide, in #slide_text_container, with .h3 formatting. Headings h1 and h2 also create a slide, with the corresponding text h underneath; here the output is suppressed.
When we say "these explanatory texts form a slide", we mean that they are all shown together in #slide_text_container, alongside the corresponding output, as for commands c-i here (in this example, commands d and g have no explanatory texts).
Note that simply moving the cursor does not change the slide. Only initiating a gutter click updates the slide, including the texts.
So in the example above, clicking command c only shows the corresponding output, whereas gutter-clicking on c1 shows the output plus text c1, and gutter-clicking text c2 shows the output plus texts c1 and c2.
Gutter-clicking on command f shows the output plus heading f formatted as a .h3 heading, gutter-clicking on text f1 shows the output plus heading f and text f1, gutter-clicking on text f2 shows the output plus heading f and texts f1 and f2.
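A rough sketch of the segmentation into slides (illustrative Python; the real logic lives in the frontend, and comment-style headings shown with .h3 formatting are not modelled here):

```python
def build_slides(lines):
    """Group commands with the explanatory texts that follow them into slides."""
    slides, current = [], None
    for line in lines:
        if line.startswith(("# ", "## ")):       # h1/h2: a slide of its own, output hidden
            current = {"heading": line, "texts": [], "show_output": False}
            slides.append(current)
        elif line.startswith(("- ", "> ")):      # explanatory text joins the current slide
            if current:
                current["texts"].append(line)
        else:                                    # a command: new slide, output shown
            current = {"command": line, "texts": [], "show_output": True}
            slides.append(current)
    return slides
```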
Lines beginning > are formatted as bullets. Also, we should support simple markdown with bold, italics and images in any lines starting with - or >.
We support markdown headings like this:
- # Heading 1
- ## Heading 2
- ### Heading 3
If a line starts with - /, treat this as (also) a column break, so this would create three columns:
- asdf
- asdf
- asdf
- / this is the first line of a new column
- second line of the new column
- / # this is the first line of a new column, formatted as a heading
- second line of the new column
NEW: if the first comment line in a block starts like this:
-tr asdf asdf
(ignore case and allow space between - and tr)
then we apply to the existing .slide_text_container style for this block the additional styles:
- width: 35% !important
- top: 0 !important
- right: 0 !important
(note top and width are overrides of existing stuff)
and likewise for tl, bl, br for top left, bottom left, bottom right.
Export to powerpoint / pdf
Screenshots
We save to S3 /cache/screenshots/ using same S3 connections as elsewhere. This way results are permanent and can be shared across sessions, local/remote etc.
When the screenshot button is pressed, if this is a presenting row, save a screenshot in the background (async), overwriting any existing file with the same key, regardless of whether the cache toggle is switched off.
In addition, every time we make a gutterclick on a line (manually or programmatically) and we receive the parse_complete signal at the frontend to say that the output is ready, IF the line is a presenting line, wait 1s and then take a screenshot and upload it to remote storage, overwriting any existing shot with the same key. Do this COMPLETELY IN THE BACKGROUND so we do not delay any other processing, but do send a notify_info() to say the screenshot has been sent, and report the key.
The PDF button, by contrast, does NOT auto-overwrite any existing file (it skips the process instead) as long as the cache toggle is ON. If the cache toggle is off, take new screenshots for every presenting row, upload them and overwrite all.
Key Calculation for Screenshots
To ensure consistent and robust screenshot identification, we use a special key calculation method:
- For each presenting line, we collect all text up to and including that line.
- We process this text with a specialized algorithm:
  - The last line (the actual presenting line) is preserved verbatim, including any comments or formatting characters.
  - All previous lines are processed through the discard_lines function to remove unnecessary content and focus only on the relevant code blocks.
  - These processed previous lines are combined with the preserved last line to form the text for hashing.
- A SHA-256 hash is calculated from this processed text, and the first 8 characters are used as the identifier.
This approach ensures that:
- The hash remains stable even when irrelevant lines change
- The hash correctly reflects the unique content of each presenting line
- Comments and formatting in the presenting line are preserved and contribute to the hash
- The same hash algorithm is used in both the frontend and backend
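A minimal sketch of the key calculation (discard_lines() is the app's real cleaning function; here it is simply assumed to take the previous lines and return the relevant ones):

```python
import hashlib

def screenshot_key(all_lines):
    """Build the 8-character screenshot key for a presenting line."""
    *previous, last = all_lines
    relevant = discard_lines(previous)        # strip irrelevant blocks, comments etc.
    text = "\n".join(relevant + [last])       # the presenting line itself is kept verbatim
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:8]
```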
A button in the navbar to export the result of all the presenting rows to powerpoint/pdf, so that the result of each presenting row is on a separate slide (A4-landscape aspect ratio).
When the user clicks "Export to PDF" button to export to PDF,
- the frontend collects each presentation line and for each uses the key as described above;
- finds the corresponding screenshot, if it exists.
- Create screenshots for only those lines which do not yet have a screenshot (overwriting any existing screenshots if the cache toggle is OFF.)
Assemble the screenshots into a PDF
This approach ensures that the same key is used on both frontend and backend, avoiding any inconsistencies.
Our aim is to get this working later with a higher resolution for better quality but right now let's focus on getting it to work with acceptable images at a standard resolution.
Link
there is a link at bottom right taking you to the current project and line number online.
ToC
The table of contents slide will appear as the second slide in every PDF export, showing thumbnails of all subsequent slides. Users can click any thumbnail to jump directly to that slide, making navigation much easier for longer workflows.
Breadcrumb
if the input panel contains any markdown headings, then the breadcrumb is the relevant H1, H2, H3, etc within which the current line is positioned.
Editor
Hints (dropdowns)
Autocomplete is always fuzzy matching and matches middles as well as beginnings of strings.
Hint dropdowns:
- are always filtered by what has been typed so far, so as the user types, the popup reduces the matching set of selections offered;
- are dismissed as soon as the user selects an entry with keyboard or mouse, or closes the dropdown with escape or clicks outside.
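A minimal sketch of the intended matching rule (illustrative Python; the real matching happens in the editor frontend):

```python
def fuzzy_match(typed, candidate):
    """True if the typed characters appear in order anywhere in the candidate."""
    typed, candidate = typed.lower(), candidate.lower()
    pos = 0
    for ch in typed:
        pos = candidate.find(ch, pos)
        if pos == -1:
            return False
        pos += 1
    return True

# e.g. fuzzy_match("stid", "statement_id") -> True: matches middles, not just the start
```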
Folding
Editor content should be foldable on markdown headings and also code blocks inside fences. By default, fold heading level 2 and below on load.
- Ctrl+Space for autocomplete
- Ctrl+F for find
- Ctrl+H for replace
- Alt+L or Ctrl+Alt+L for folding/unfolding the current block
- Alt+0 to fold all
- Alt+Shift+0 to unfold all
- Standard editing shortcuts like Ctrl+Z for undo and Ctrl+Y for redo
Commands and additional keywords
Hints are offered for the commands and the additional possible keywords. So for example, after "get " we immediately offer file name suggestions, even before one letter is typed.
Filenames
The frontend gets the list of files to offer in the autocomplete after "get ", and the list of uploaded files to offer in the autocomplete after "download ".
Hints for Column names
Hints for column names should be offered for the current table. So eg if a table has had columns added or removed, the hints should be for the current table.
If the user is on any line starting with "select ", immediately suggest column names (no typing needed); after the user selects colname1 and types a space, suggest further column names.
Hints appear after the user has typed a space after these commands:
Multiple hints (i.e. continue with more hints after a selection is selected and the user adds a space):
- "select "
- "sort "
- "sortdown "
Single hints (i.e. no more hints after the first is selected):
- "filter "
- "pivot "
- "pivot xx by "
- "chunk by "
Column values (statement_id link_id and source_id)
When a file is loaded, the frontend knows the unique values of the columns statement_id and source_id from the statements table, and the unique values of the column link_id from the links table, so they can be offered in autocomplete. So when the input is like this: 'get foobar statements filter statement_id is ', the unique values of statement_id are offered in the autocomplete.
same goes for source_id.
editor features
Sublime-like keyboard shortcuts:
- Ctrl-/: Toggle comment
- Ctrl-D: Duplicate line
- Alt-Up/Down: Move line up/down
- Tab/Shift-Tab: Indent/outdent
- Ctrl-L: Select entire line
- Alt-Delete/Backspace: Delete to line end/start
- Ctrl-Enter: Insert line below
- Shift-Ctrl-Enter: Insert line above
Editing and Text Manipulation
- Ctrl + X: Delete line
- Ctrl + Enter: Insert line after
- Ctrl + Shift + Enter: Insert line before
- Ctrl + L: Select line (repeat to select next lines)
- Ctrl + D: Select word (repeat to select other occurrences)
- Ctrl + Shift + M: Select all contents of current parentheses
- Ctrl + J: Join line below to the end of the current line
- Ctrl + /: Comment/uncomment current line
- Ctrl + Shift + /: Block comment current selection
- Ctrl + Y: Redo or repeat last keyboard shortcut command
- Ctrl + Shift + V: Paste and indent correctly
- Ctrl + Space: Select next auto-complete suggestion
- Ctrl + U: Soft undo (jumps to last change before undoing)
Multiple Selections and Column Editing
- Ctrl + Alt + Up: Column selection up (Windows)
- Ctrl + Alt + Down: Column selection down (Windows)
- Alt + Shift + Up/Down: Column selection (Linux)
- Auto-closing brackets and quotes
- Multiple cursors support (via sublime keymap)
- Line wrapping
- Better selection highlighting
- Search functionality (Ctrl-F)
Comments
Add support for comments such that command foo bar #baz baz blah is equivalent to command foo bar.
Uploading Data
This section describes how users can upload PDF or Excel files to the uploads server. Users have multiple options when uploading files concurrently.
Multiple File Uploads
If multiple files are uploaded:
- Default behavior: Each file is uploaded as a separate object. The global file name provided by the user is used as a base key, and each object's key is constructed by appending the actual uploaded file's name (e.g. fileName-uploadedFileName). If duplicate file names occur, a counter suffix (-1, -2, etc.) is appended to ensure uniqueness.
- Combine Toggle: If the Combine multiple files into one upload toggle is enabled, the server processes each file, extracts the corresponding data, and appends all rows into one aggregated dataset. In this case, a single object is uploaded using the provided global file name as the key. For each row in the aggregated result, the source_id column is updated to include the global file name concatenated with the original file name (e.g. fileName-uploadedFileName), thereby maintaining traceability to the original source.
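A minimal sketch of the key construction for the default (separate objects) behaviour; the function name is illustrative:

```python
def build_upload_keys(global_name, uploaded_file_names):
    """One key per file: globalName-uploadedFileName, de-duplicated with -1, -2, ..."""
    keys, seen = [], {}
    for file_name in uploaded_file_names:
        key = f"{global_name}-{file_name}"
        if key in seen:                       # duplicate file name: append a counter suffix
            seen[key] += 1
            key = f"{key}-{seen[key]}"
        else:
            seen[key] = 0
        keys.append(key)
    return keys

# build_upload_keys("project", ["a.pdf", "a.pdf", "b.pdf"])
# -> ["project-a.pdf", "project-a.pdf-1", "project-b.pdf"]
```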
PDFs
The best way to upload PDFs made from docx is just to print and/or combine/print them locally, on Windows with PDF gear.
Users can upload one or more PDF files to the uploads server via the UI; users can drag and drop multiple PDF files onto the upload button. The following steps occur:
- A modal dialog prompts the user to type a name for the file (the global file name) and/or select from an existing file (from the files table in the database). A toggle button allows the user to choose whether to add (append) or replace rows.
- If the provided global file name already exists, the user may use the checkbox to overwrite the existing file.
Mupdf is used to extract text from each PDF. One row is created per extracted page and inserted into a table with the following columns:
- text: The extracted text from the page.
- file: The global file name provided by the user.
- source_id: The name of the original PDF file from which the page was extracted. If the combine toggle is enabled, this is updated to globalFileName-uploadedFileName.
- statement_id: A serial number as a character string, zero-padded (e.g. 0001, 0002, etc.) based on the number of pages in the file.
- modified: The current timestamp (format: YYYY-MM-DD HH-MM-SS.ms).
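A minimal sketch of the per-page extraction, assuming PyMuPDF (the Python binding for mupdf, imported as fitz); the app's real column handling may differ:

```python
import datetime
import fitz  # PyMuPDF

def pdf_to_rows(pdf_path, global_file_name):
    """One row per extracted page, with the columns described above."""
    now = datetime.datetime.now().strftime("%Y-%m-%d %H-%M-%S.%f")[:-3]
    rows = []
    with fitz.open(pdf_path) as doc:
        for i, page in enumerate(doc, start=1):
            rows.append({
                "text": page.get_text(),        # extracted page text
                "file": global_file_name,       # the global file name from the modal
                "source_id": pdf_path,          # name of the original PDF file
                "statement_id": f"{i:04d}",     # zero-padded serial number
                "modified": now,                # current timestamp
            })
    return rows
```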
Excel Files (xlsx) and .csv
The above process is extended to Excel (.xlsx, .xls) files. Data is extracted from each sheet and processed in a similar manner to PDF files.
PDFs and spreadsheet files can be combined into a single upload.
Text (.txt) and RTF files (.rtf)
NEW: offer RTF files as an option, and if files are rtf, ask for a regex delimiter between statements and whether the matched text should be included at the start of the statement or the end of the statement. Offer previous delimiters (from local storage) also as options.
Authorisation
users can only overwrite files if they are owner or editor of the file
Caching
We have several different caching mechanisms.
Browser tabs: note that the same user with the same browser can get crosstalk between tabs of the same browser.
LocalCache class (local in-memory cache):
1. Line-based Caching:
- Storage: set_local_cached_result(lines, result)
  - Creates a key from filtered command lines
  - Stores results with copy semantics for DataFrames
  - Used during command processing to cache intermediate results
- Retrieval: get_local_cached_result(lines)
  - Returns cached results along with the number of cached lines
  - Ignores error results that start with "Error:"
  - Returns (0, None) if no cache match is found
2. Named Result Caching:
- Storage: set_local_named_result(name, result, sess_id) (NEW: set_local_named_result(name, result, projectname))
  - Handles DataFrames specially, preventing empty DataFrame storage
  - Used for the store command functionality
  - Names are composed like this: named/name_of_current_project/name
so if I do
store foo
in a project called say foldername|myproject-v87
the actual name will be named/foldername|myproject/foo. Note we strip the version number.
the "named/" prefix is useful when we also cache these at S3.
Note that project names may have one or more optional pipe characters. This is irrelevant here.
(note that previously the key used to include just name and the session ID, to avoid crosstalk between sessions using simple, ambiguous names like say "links", but we are not using that any more, as these names are now unique).
At the moment we rely on user to keep these names unique within a project.
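A minimal sketch of how the named-result key is composed (illustrative only):

```python
import re

def named_result_key(name, project_name):
    """Compose named/<project-without-version>/<name>, stripping any trailing -vNN."""
    base_project = re.sub(r"-v\d+$", "", project_name)   # foldername|myproject-v87 -> foldername|myproject
    return f"named/{base_project}/{name}"

# named_result_key("foo", "foldername|myproject-v87") -> "named/foldername|myproject/foo"
```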
NEW: on gutterclick, provide a frontend check for whether the editor contains any duplicate lines of the form store foo. If there is more than one such line with the same foo, show a notifyError ("not recommended") and refuse to send the request to the backend.
when setting or getting local or remote named result cache, give a notify_info() with the full name of the object including projectname etc.
NEW: temporarily disable remote caching of named results: don't store or retrieve them. comment out or skip the relevant lines.
- Retrieval: get_local_named_result(name, projectname)
We now add remote backup to S3 exactly as we do for line-based caching. Also add an overwrite=true/false keyword to the store command, case insensitive for true and false, default true. If overwrite is false and there is an existing result with the same key, the attempt will fail with a notify_error().
So, as usual, if the user stores a named result, we store it locally in the background and also remotely in the background. If the user recalls (aka unstores) a named result and it is available locally, we fetch it locally; otherwise we check the remote copy (note the NEW item above temporarily disables remote caching of named results).
3. Cache Management:
- Clearing: the clear() method empties the local cache and triggers garbage collection
- REST API: via the create_local_cache_blueprint() function, which creates a route for clearing the cache
Usage in the Application
The main usage is in process_parse_request() in app.py, which:
- Retrieves cached results for the current set of command lines
- Falls back to computation when there's no cached result
- Stores results after processing each line
- Handles explicit storage via the store and recall commands
Parsing for line-based caching (NEW: also ensuring relevant named results are loaded)
When parsing the input text in my app, the input for each line is the result of the previous line.
The local LocalCache is used specifically for caching intermediate results during parsing operations. This is useful because:
- It caches intermediate results during app use
- It handles named results (the store/recall command functionality)
When the user sends a line with a gutterclick, we are really sending a block: this last line with its commands, along with all preceding relevant lines of text, one command per line. To process this block, we first use discard_lines() to clean the text of comments etc. and then delete sections of text which are not ancestors, resulting in a set of commands (final_set) all of which are necessary to compute the result we need.
But now that we have final_set we can check the cache before running anything. We work back from the current line n, subtracting 1 each time, constructing a cache key from all the lines up to line m (checking the local cache, then the remote cache). As soon as we find a cached result at line m, we use that as the initial input and run the contents of the remaining block, lines (m+1) to n: first we run line m+1 on this result, then line m+2 on that result, etc., caching each result in the background (locally then remotely, and also caching named results). There is no need to check for cached results for lines (m+1) to n because we already know there aren't any.
Note that this caching of intermediate results just mentioned means that processing a line like store foo as part of a processing chain might mean caching it locally, then remotely, and then also additionally caching the named result e.g. as named/foldername|myproject/foo locally.
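A minimal sketch of the backward search for a cache hit (make_key, local_cache, remote_cache and run_line are stand-ins for the app's real functions):

```python
def run_block_with_cache(final_set):
    """Find the longest cached prefix, then run only the remaining lines."""
    n = len(final_set)
    result, m = None, 0
    for end in range(n, 0, -1):                 # work back from line n
        key = make_key(final_set[:end])         # key = all lines up to this point
        hit = local_cache.get(key)              # local first, then remote
        if hit is None:
            hit = remote_cache.get(key)
        if hit is not None:
            result, m = hit, end
            break
    for i in range(m, n):                       # run lines m+1 .. n on the cached result
        result = run_line(final_set[i], result)
        key = make_key(final_set[:i + 1])
        local_cache.set(key, result)            # cache each result locally, then remotely
        remote_cache.set(key, result)
    return result
```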
The enable_cache button (next to the clearCache button) is a toggle, default ON. When it is off, we disable get_cached_result (but not set_cached_result). So we run everything required from scratch, but the results are still saved in the other direction, to the local cache and to the remote cache, even if the key already exists.
Normally there is no reason not to use the cache, because the text uniquely defines its results. But there are a couple of exceptions:
- list uploads
- links xfile
- when the app code has been updated and maybe a visualisation looks different
How discard_lines() works
input texts may often have this kind of form (the numbering is for explication, they are not really numbered)
1. download xyz (or fetch, or get or scrape)
2. some processing
3. store abc
4. some other processing
5. unstore abc
6. some more processing
...
so if we send the text up to 6. some more processing to the parser, we can actually ignore line 4 (which may actually be a whole block, because it doesn't affect line 6). This means that we can update the existing input processing:
we already strip out
- empty lines,
- strip whitespace from line ends
- strip comments beginning # from middle of line, eg reduce command someparam etc # comment etc to command someparam etc
- replace double whitespace with single
- lines starting with a comment i.e. with # or - or >
- fenced blocks (blocks beginning and ending with ```)
and in addition, divide the resulting text into "blocks". Each block begins with the first line and also with any of these "block-starters":
- download
- fetch
- scrape
- get
- links
- statements
- sources
- unstore
- recall
- list
- share
- unshare
First identify the last block (the one including the current last line) and apply to it the "needs block check", that means, identify within it any "NEEDY" foo, bar etc in these NEEDY contexts:
NORMAL NEEDY
- unstore foo
- recall foo
NAMED NEEDY
- prompt instructions=foo
- join foo
- append foo
- relabel links=foo
- magnetise table=foo
- join table=foo
- groupify(foo,bar...)
- append table=foo
- append text=foo
- print ... highlights=foo [[NEW!!]]
And add them to a list called need_list. NEW: ONLY IF WE DON'T ALREADY HAVE foo AS A NAMED RESULT IN LOCAL OR REMOTE CACHE. IF IT IS ALREADY ACCESSIBLE AS A NAMED RESULT, DON'T ADD IT TO THE LIST.
The point about NEEDY commands is they have two (or more) inputs; the preceding result plus one or more named results.
Then, scan the blocks in reverse (from the current, last block upwards) through each block which is NOT the last block (the one including the current last line).
For each block:
- if it does NOT contain any line of the form: store baz where baz is in need_list, delete it.
- if it does,
- note the hitlist of foos in your needs list for which there are lines within the block of form store foo
- identify THE LAST_LINE WHICH IS LIKE store baz where baz is in your hitlist (there may be multiple store xxx matches in your needs list, so there may be one or more other relevant store xxx lines above it in the same block),
- PRESERVE the lines within this block from the top of it up to and including this LAST_LINE, BUT DELETE ANY LINES AFTER THAT.
- SUBTRACT all the names in your hitlist from the needs_list.
- ADD TO YOUR needs_list any more NEEDY foos in the remaining block (see definition of NEEDY above) NEW: ONLY ADD IT IF THE NAMED RESULT IS NOT ALREADY ACCESSIBLE IN LOCAL OR REMOTE CACHE
- Then repeat this scan with the next block above and so on until you reach the top of the file.
This will result in a new version of the input text, with one or more blocks or block sections discarded. The point of this is to delete any code which we do not need in order to completely run all the code in the current block.
Use this version of the input text.
HOWEVER: if we are running a line n with a command in the NAMED NEEDY list which NEEDS an additional named result, and we find a cache hit for line m and proceed to run lines (m+1) to n one by one, then when we come to line n (or even before) we may not be able to run it because we may not yet have the named result foo. SO we must also find out which named results are still NEEDED but are created in lines 1 to m, identify the lines b, c, d with the store commands where they are created, and run those lines b, c, d first, before running lines (m+1) to n, so that the named results are stored and available when we need them.
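A simplified sketch of the reverse block scan (illustrative Python; the real discard_lines() handles more NEEDY contexts, e.g. groupify(foo,bar), and skips names that are already available as cached named results):

```python
import re

STORE_RE = re.compile(r"store\s+(\S+)")
NEEDY_RE = re.compile(
    r"(?:unstore|recall)\s+(\S+)|(?:instructions|table|links|text|highlights)=(\S+)")

def needy_names(lines):
    """Names these lines still need from earlier store commands."""
    return {g1 or g2 for line in lines for g1, g2 in NEEDY_RE.findall(line)}

def discard_blocks(blocks):
    """Keep the last block whole; keep earlier blocks only up to the last relevant store line."""
    kept = [blocks[-1]]
    need = needy_names(blocks[-1])
    for block in reversed(blocks[:-1]):
        stores = {i: m.group(1) for i, line in enumerate(block)
                  if (m := STORE_RE.match(line))}
        hits = need & set(stores.values())
        if not hits:
            continue                               # stores nothing we need: discard the block
        last = max(i for i, name in stores.items() if name in hits)
        kept.insert(0, block[:last + 1])           # keep up to and including that store line
        need = (need - hits) | needy_names(block[:last + 1])
    return [line for block in kept for line in block]
```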
RemoteLinesCache: new backup to line-based caching
The application now has an S3 store which backs up LocalCache (line-based) results so they persist across sessions:
- When saving a (line-based) result to LocalCache, we automatically check if it exists in the S3 remote cache. If not, we save it.
- When looking for a result in LocalCache, if there's a miss, we automatically check the S3 remote cache before executing the code. If found in S3, we add it to LocalCache and return the result.
Implementation details:
- Results are stored in a cache/lines_cache/ subdirectory within the S3 bucket
- DataFrames are efficiently serialized to Parquet format
- Other objects use Python's pickle serialization
- The cache is key-based, with keys derived from the command lines
- Cache operations include exponential notification throttling to avoid UI clutter
- A new API endpoint /api/cache_status provides statistics about both local and remote caches
- Clear options are available via:
- /api/clear_local_cache - Clears only the in-memory local cache
- /api/clear_remote_cache - Clears only the S3 remote cache
- /api/clear_all_caches - Clears both local and remote caches
No changes to the main application code were required - the functionality was implemented by enhancing the LocalCache.set_local_cached_result() and LocalCache.get_local_cached_result() methods to integrate with the new RemoteLinesCache class.
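A minimal sketch of the S3 serialization logic (boto3, the bucket name and the key suffixes are illustrative assumptions):

```python
import io
import pickle
import boto3
import pandas as pd

s3 = boto3.client("s3")
BUCKET = "causal-map-cache"               # assumed bucket name
PREFIX = "cache/lines_cache/"

def remote_set(key, result):
    """Store a line-cache result in S3: Parquet for DataFrames, pickle otherwise."""
    buf = io.BytesIO()
    if isinstance(result, pd.DataFrame):
        result.to_parquet(buf)            # efficient DataFrame serialization
        suffix = ".parquet"
    else:
        pickle.dump(result, buf)
        suffix = ".pkl"
    s3.put_object(Bucket=BUCKET, Key=PREFIX + key + suffix, Body=buf.getvalue())
```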
remote_prompt_cache (MongoDB persistent cache) with S3 offloading:
This is a NoSQL MongoDB cache suitable for long-term caching of results, especially the results of prompt processing, which may be expensive. We don't make much use of MongoDB's ability to do nested searches, except for the embeddings cache, and could just use S3 for most of it.
When the cache is full, it sends the oldest entries first to S3, from where they can be retrieved if necessary.
upload.py: uploaded results of PDF files.
Not strictly a cache but a store of the results of uploading PDF files. Doesn't really need to be in MongoDB.
remote_prompt_cache at MongoDB: RemotePromptCache: OpenAI response aka prompt cache for whole prompt command
THIS IS DISABLED AT THE MOMENT.
Entire prompt commands are cached in mongoDB. However, this is a monolithic approach. So if we have a prompt command which is chunked into multiple calls by, say statement_id 1 to 10, then we cache the entire command results. But within the prompt loop, we also cache the results of individual calls to openAI as follows.
remote_prompt_cache at MongoDB: OpenAI response aka prompt cache for individual prompt calls
Within the prompt loop, also cache the results of individual calls to openAI. So if prompt caching finds the results of the entire prompt command, use that; but if not, within the loop, check to see whether the results of each call have been cached and if so use them; and if not, call the API but then cache the individual results to mongodb (as well as, when the command is complete, caching the entire result to mongodb as usual). The current filename is also cached with each call.
HOWEVER: Even if a call fails, we cache the results of that call as a NULL result rather than simply skipping it.
There is a setting 'overwrite' in the prompt command. Assuming that there are already some NULL results in the MongoDB:
- If overwrite=failed, then we identify only the individual calls that failed, and attempt to regenerate only those.
- If it is 'all', then we regenerate all the calls.
- The default setting is 'none', which means we use all the cached results including failed ones (which are empty/NULL). We reprocess only the calls that have no result at all cached, and cache the results of those as usual.
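A minimal sketch of the per-call decision this implies (names are illustrative):

```python
_MISSING = object()   # no cache entry at all, as opposed to a cached NULL/failed result

def should_call_api(cached=_MISSING, overwrite="none"):
    """True if this individual prompt call must be (re)sent to the API."""
    if overwrite == "all":
        return True                        # regenerate every call
    if cached is _MISSING:
        return True                        # never cached: always process
    if overwrite == "failed":
        return cached is None              # re-run only the cached NULL (failed) results
    return False                           # 'none': use every cached result, even NULLs
```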
Memory Usage Indicator
The memory usage indicator, located in the top right of the UI, shows the current memory usage of the Python process.
Clicking the "Check Memory" button sends a request to the /api/memory-usage endpoint, which uses the psutil library to retrieve the current memory usage. This is then displayed next to the button.
Clicking the "Clear Cache" button not only clears the application cache, but also triggers garbage collection using gc.collect() to attempt to free up memory. After clearing the cache, it automatically updates the memory usage display.
This allows monitoring the memory footprint of the application and taking action if it grows too large.
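A minimal sketch of what the /api/memory-usage endpoint might look like (Flask route shown for illustration; the real handler may differ):

```python
import os
import psutil
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/api/memory-usage")
def memory_usage():
    """Report the resident memory of the current Python process in MB."""
    rss = psutil.Process(os.getpid()).memory_info().rss
    return jsonify({"memory_mb": round(rss / 1024 / 1024, 1)})
```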
technical overview
The app's workflow is:
- Frontend sends input to /api/parse endpoint
- Backend immediately returns a 202 status with "processing" message, then starts background task
- Backend uses SSE to:
- Send status updates during processing
- Send final results when complete
For results:
- Network graphs/visualizations are sent directly in SSE message
- Tables send first 50 rows in SSE message, with pagination info
- Frontend can request more rows via pagination API
Frontend handles results by:
- Displaying immediate visualizations
- Setting up Tabulator table with client-side pagination for datasets
- Showing any error messages
Global vs User-specific caching
We don't need user-specific caching of results because all results are keyed with an input string and the result is always a strict function of that input string.
However:
- Notifications need to be user-specific as they are state dependent
- Data_store needs to be user-specific as it is not keyed by the input string
My app gets a unique firebase_user_id from firebase. This should just be used for authentication and login/out. It also generates a unique session-based token called sess_id for data stores, notifications etc. @main.js but then in parser.js it seems like the firebase auth is happening all over again? Can't this use the auth from main.js?
Session Management and Connection Robustness
- The app uses Server-Sent Events (SSE) for real-time communication between frontend and backend.
- The frontend automatically detects if the SSE connection is lost (e.g., due to network issues or backend restart). If no message is received for 30 seconds, the user is notified and the connection is automatically re-established.
- If the backend cannot deliver a message (e.g., due to a dropped connection), it stores the message and delivers it as soon as the user reconnects.
- Each session has a maximum lifetime of 12 hours and an idle timeout of 10 minutes. If either is exceeded, the backend cleans up all resources for that session and notifies the user with an error message ("Your session has expired due to inactivity/maximum session lifetime. Please reload the page.").
- This ensures memory is reclaimed on the server and users are always kept informed about their session status.