Commit b26d26c

author
jravenel
committed
feat: Update ArXiv agent components for improved modularity and maintainability
1 parent 06b530c commit b26d26c

1 file changed

+154
-0
lines changed

@@ -0,0 +1,154 @@
# ArXiv Agent Module - Standard Operating Procedure

## 1. Purpose and Scope
This document provides standard operating procedures for using the ArXiv Agent module to search, download, analyze, and maintain a knowledge base of scientific papers from ArXiv.org. It covers all operations from basic paper discovery to advanced knowledge graph queries.

## 2. Module Overview
The ArXiv Agent module is a suite of tools for working with ArXiv's scientific paper repository. It lets you search for papers, extract their metadata, build a knowledge graph from that metadata, and run queries across your personal research library.

## 3. System Components

### 3.1 ArXiv Assistant
An intelligent assistant that helps users discover, organize, and analyze research papers. The assistant integrates all the module's capabilities into a conversational interface.

### 3.2 ArXiv Integration
Connects to the ArXiv API to search for papers and retrieve detailed metadata.

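As a rough illustration of what the integration layer does, the sketch below builds a search URL for the public ArXiv API endpoint and parses the Atom feed it returns. The helper names `build_query_url` and `parse_feed` are hypothetical and are not taken from `ArXivIntegration.py`; only the endpoint and the `search_query`, `start`, and `max_results` parameters come from the ArXiv API itself.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"  # public ArXiv API endpoint
ATOM = "{http://www.w3.org/2005/Atom}"           # Atom XML namespace


def build_query_url(terms: str, max_results: int = 5) -> str:
    """Build a free-text search URL (hypothetical helper)."""
    params = {"search_query": f"all:{terms}", "start": 0, "max_results": max_results}
    return f"{ARXIV_API}?{urllib.parse.urlencode(params)}"


def parse_feed(atom_xml: str) -> list[dict]:
    """Extract id, title, and author names from each Atom <entry>."""
    root = ET.fromstring(atom_xml)
    papers = []
    for entry in root.findall(f"{ATOM}entry"):
        papers.append({
            "id": entry.findtext(f"{ATOM}id", default="").strip(),
            # Titles may wrap across lines in the feed; collapse the whitespace.
            "title": " ".join(entry.findtext(f"{ATOM}title", default="").split()),
            "authors": [a.findtext(f"{ATOM}name", default="").strip()
                        for a in entry.findall(f"{ATOM}author")],
        })
    return papers
```

To fetch live results, pass the URL to `urllib.request.urlopen` and hand the decoded response body to `parse_feed`.
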
### 3.3 ArXiv Paper Pipeline
Processes papers by:
- Extracting metadata (authors, categories, publication dates, etc.)
- Converting data into a structured knowledge graph
- Storing the graph as TTL files
- Downloading PDF versions of papers

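The metadata-to-graph step can be pictured with a minimal sketch that serializes one paper as Turtle text. It assumes the `abi:` namespace, the `abi:ArXivPaper` class, and `rdfs:label` as they appear in the example query in section 6.5; the `abi:hasAuthor` and `abi:hasCategory` properties, the paper IRI pattern, and the function names are hypothetical stand-ins, and the real pipeline may well use an RDF library instead of string building.

```python
PREFIXES = (
    "@prefix abi: <http://ontology.naas.ai/abi/> .\n"
    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n"
)


def escape_literal(text: str) -> str:
    """Escape backslashes, double quotes, and line breaks for a Turtle string."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r"))


def paper_to_turtle(paper: dict) -> str:
    """Serialize one paper's metadata as a Turtle document."""
    # Hypothetical IRI pattern for paper nodes.
    subject = f"<http://ontology.naas.ai/abi/paper/{paper['id']}>"
    predicates = [f'rdfs:label "{escape_literal(paper["title"])}"']
    # abi:hasAuthor and abi:hasCategory are assumed property names.
    predicates += [f'abi:hasAuthor "{escape_literal(a)}"'
                   for a in paper.get("authors", [])]
    predicates += [f'abi:hasCategory "{escape_literal(c)}"'
                   for c in paper.get("categories", [])]
    body = " ;\n    ".join(predicates)
    return f"{PREFIXES}\n{subject} a abi:ArXivPaper ;\n    {body} .\n"
```

The returned string could then be written to a `.ttl` file under the triplestore directory listed in section 5.
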
### 3.4 ArXiv Query Workflow
Supports the following queries against your stored papers:
- Find authors of specific papers
- Find papers by author or category
- Rank authors by how often they appear across stored papers
- Execute custom SPARQL queries

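The author frequency analysis boils down to counting how many stored papers each author appears on. The sketch below shows that idea in plain Python over a list of metadata dictionaries; the function name and the input shape are assumptions for illustration, and the workflow itself presumably derives the same counts from the knowledge graph.

```python
from collections import Counter


def author_frequency(papers: list[dict], top_n: int = 10) -> list[tuple[str, int]]:
    """Return the top_n (author, paper_count) pairs, most frequent first."""
    counts = Counter(
        author
        for paper in papers
        # set() ensures an author listed twice on one paper counts once.
        for author in set(paper["authors"])
    )
    return counts.most_common(top_n)
```
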
## 4. Directory Structure

src/custom/modules/arxiv_agent/
- assistants/
  - ArXivAssistant.py - Assistant implementation
- integrations/
  - ArXivIntegration.py - ArXiv API connection
- ontologies/
  - ArXivOntology.ttl - ArXiv ontology schema
- pipelines/
  - ArXivPaperPipeline.py - Paper processing pipeline
- workflows/
  - ArXivQueryWorkflow.py - Knowledge graph query workflow
- README.md - This documentation

## 5. Data Storage Locations

    storage/triplestore/application-level/arxiv/   # TTL metadata files
    datastore/application-level/arxiv/             # PDF document files

## 6. Operating Procedures

### 6.1 Starting the ArXiv Agent

From the project root directory, run:

```
make chat-arxiv-agent
```

### 6.2 Paper Discovery

1. **Basic Search**
   ```
   Search for papers about quantum computing
   ```

2. **Targeted Search**
   ```
   Find papers by Yoshua Bengio about deep learning
   ```

3. **Recent Papers Search**
   ```
   What are the latest papers on transformers?
   ```

### 6.3 Paper Storage

1. **Adding Papers to the Knowledge Graph**
   ```
   Save paper 2201.08239 to my knowledge graph
   ```

2. **Downloading a Paper's PDF**
   ```
   Download the PDF for paper 2201.08239
   ```

### 6.4 Knowledge Graph Queries

1. **Paper Author Lookup**
   ```
   Who wrote the paper "Attention Is All You Need"?
   ```

2. **Author's Papers Lookup**
   ```
   What papers do I have by Geoffrey Hinton?
   ```

3. **Category Search**
   ```
   Find papers in the cs.AI category
   ```

4. **Author Frequency**
   ```
   Which authors appear most frequently in my papers?
   ```

### 6.5 Advanced SPARQL Queries

Execute this query:

    PREFIX abi: <http://ontology.naas.ai/abi/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?paper ?title WHERE {
      ?paper a abi:ArXivPaper ;
             rdfs:label ?title .
    }

## 7. Troubleshooting

### 7.1 Common Issues and Solutions

| Issue | Possible Cause | Solution |
|-------|---------------|----------|
| Paper not found | Incorrect ID or API error | Verify the ID and try again |
| Query returns empty results | No papers stored or query mismatch | Check the storage directory for TTL files |
| PDF download fails | Network issue or invalid URL | Check the connection and try again |
| SPARQL query error | Syntax error or namespace issue | Ensure proper PREFIX declarations |

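For the "Paper not found" case, a quick format check on the identifier can rule out typos before retrying. The sketch below validates the two published ArXiv identifier schemes (new style such as `2201.08239`, optionally with a version suffix, and old style such as `hep-th/9901001`) and builds the matching PDF URL. The function names are hypothetical, and a well-formed ID can still refer to a paper that does not exist.

```python
import re

# New-style IDs: YYMM.NNNN or YYMM.NNNNN, optional version (e.g. 2201.08239v2).
NEW_STYLE = re.compile(r"\d{4}\.\d{4,5}(v\d+)?")
# Old-style IDs: archive[.SUBJECT]/YYMMNNN, optional version (e.g. hep-th/9901001).
OLD_STYLE = re.compile(r"[a-z\-]+(\.[A-Z]{2})?/\d{7}(v\d+)?")


def is_valid_arxiv_id(paper_id: str) -> bool:
    """Check the ID format only; this does not contact ArXiv."""
    paper_id = paper_id.strip()
    return bool(NEW_STYLE.fullmatch(paper_id) or OLD_STYLE.fullmatch(paper_id))


def pdf_url(paper_id: str) -> str:
    """Return the PDF URL for a well-formed ID, or raise ValueError."""
    if not is_valid_arxiv_id(paper_id):
        raise ValueError(f"Not a well-formed ArXiv ID: {paper_id!r}")
    return f"https://arxiv.org/pdf/{paper_id.strip()}"
```
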
### 7.2 Error Messages

| Error Message | Meaning | Action |
|--------------|---------|--------|
| "Unknown namespace prefix" | Missing PREFIX in SPARQL | Add required PREFIX declarations |
| "No TTL files found" | Empty storage directory | Add papers using the pipeline first |
| "Query execution failed" | Malformed SPARQL | Check query syntax |

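An "Unknown namespace prefix" error can often be caught before the query is submitted. The sketch below is a simple regex-based check, not a SPARQL parser: it lists prefixes that a query uses but never declares. The function name is hypothetical, and unusual constructs (for example `<` used as a comparison operator without surrounding spaces) may confuse it.

```python
import re


def undeclared_prefixes(query: str) -> set[str]:
    """Return prefixes used in prefixed names but missing a PREFIX line."""
    declared = set(re.findall(r"PREFIX\s+(\w*):\s*<", query, flags=re.IGNORECASE))
    # Drop full IRIs and string literals so their contents are not scanned.
    stripped = re.sub(r"<[^<>\s]*>", " ", query)
    stripped = re.sub(r'"[^"\n]*"', " ", stripped)
    # A prefixed name looks like prefix:localName (e.g. abi:ArXivPaper).
    used = set(re.findall(r"\b([A-Za-z][\w-]*):[A-Za-z_0-9]", stripped))
    return used - declared
```

An empty result means every prefix in the query has a matching declaration; any names returned are the ones to add as `PREFIX` lines.
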
## 8. Maintenance

### 8.1 File Management

- TTL files are retained indefinitely in the storage directory
- Consider periodically backing up the storage directories
- PDF files can be large; monitor disk space usage

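Both maintenance tasks can be scripted with the standard library. The sketch below measures how much space a storage directory occupies and archives it as a zip file; the function names and the backup location are assumptions, and the directories to pass in are the ones listed in section 5.

```python
import shutil
from pathlib import Path


def directory_size_bytes(directory: Path) -> int:
    """Total size of all regular files under the directory, in bytes."""
    return sum(p.stat().st_size for p in directory.rglob("*") if p.is_file())


def backup_directory(directory: Path, backup_dir: Path, name: str) -> Path:
    """Zip the directory into backup_dir/<name>.zip and return the archive path.

    backup_dir should live outside the directory being archived.
    """
    backup_dir.mkdir(parents=True, exist_ok=True)
    archive = shutil.make_archive(str(backup_dir / name), "zip", root_dir=str(directory))
    return Path(archive)
```

Free space on the volume is available from `shutil.disk_usage(path).free`, which can be compared against the PDF directory size to decide when to clean up.
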
### 8.2 Updating the Agent

When updating the agent code:
1. Restart the agent after code changes
2. Existing data remains accessible after the restart
3. Run test queries against existing data to verify functionality

## 9. References

- ArXiv API Documentation: https://arxiv.org/help/api/
- SPARQL Query Language: https://www.w3.org/TR/sparql11-query/
- RDF Turtle Format: https://www.w3.org/TR/turtle/
