Javascript doesn't work #4
I am trying to save a webpage to file:
./edbrowse https://example.com/ > out.htm
no ssl certificate file specified; secure connections cannot be verified
15848
Unable to exec edbrowse-js, javascript has been disabled.
1351
Of course edbrowse-js is in the same directory and has exec rights.
I notice you exec via ./edbrowse; perhaps your $PATH does not contain ".".
It sometimes doesn't out of the box,
in which case the execlp() call would fail.
You say this is an attempt to "save a web page to file",
but out.htm would simply contain the output of edbrowse.
You want to bring in example.com, then from within the edbrowse session:
w out
That saves the formatted web page.
Or perhaps ub, then w out, if you want the raw html.
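For reference, that command sequence can also be fed to edbrowse from the shell; a minimal sketch (the filenames out and out-raw.htm are only examples):

```shell
# Commands for one edbrowse session: browse, save the formatted
# text, unbrowse back to raw html, save that too, then quit.
cmds='b https://example.com/
w out
ub
w out-raw.htm
q'
printf '%s\n' "$cmds"
# printf '%s\n' "$cmds" | edbrowse   # run where edbrowse is installed
```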
Karl Dahlke
|
Thanks, I did export PATH and now this error is gone. But can I just use it like wget? Download a file and save the final output (after running JS) to file?
can I just use it like wget? Download a file and save the final output
(after running JS) to file?
Yes, you can save the original html or the formatted text.
w/
is a convenient command; it saves to the filename part of the url.
Karl Dahlke
|
Thanks! Can I do this non-interactively? Like edbrowse http://google.com -w output.htm?
Can I do this non-interactively?
Like edbrowse http://google.com -O output.htm
Well not like that.
edbrowse http://google.com <<!
w google-home.browse
ub
w google-home.htm
q
!
You can also write edbrowse scripts in $HOME/.ebrc to do various tasks,
somewhat like shell functions in $HOME/.bashrc.
See the sample config file in the documentation,
or edbrowse.wiki under GitHub user cmb.
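For instance, the heredoc above could be wrapped as an ordinary shell function in $HOME/.bashrc; a sketch only (the name ebsave and the two-file output are illustrative, not an edbrowse convention):

```shell
# Emit the command script for one edbrowse session: browse the url,
# save the rendered page, unbrowse, save the raw html, quit.
ebsave_cmds() {
  printf '%s\n' "b $1" "w $2.browse" 'ub' "w $2.htm" 'q'
}

# Wrapper: ebsave <url> [basename]  (runs edbrowse if installed)
ebsave() {
  ebsave_cmds "$1" "${2:-page}" | edbrowse
}
```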
Karl Dahlke
|
Unfortunately the unbrowse command …
In the meantime there is PhantomJS, which has more complete DOM support, but it is not as lightweight as edbrowse. For example, in Python (adapted from Web Adjuster): …
… but none of these considerations apply if you merely wanted to view the formatted text of pages that don't need things like read access to …
Well there's a lot of information here.
First, with all your skills and experience, I wish we could recruit you as a programmer.
It's just a couple of us working in our spare time, no compensation, etc.
We're looking at PhantomJS, but it's almost a start-over, and I don't know that any of us has the time for that.
I have so many personal issues right now, I barely have time to write this email.
I don't think I follow when you say innerHTML is not implemented.
You can read it and write it.
I just verified this with my test programs and jdb.
I pushed a button which changes the innerHTML under a <div>, and then when I read innerHTML I get the new html back again.
What you can't get is the html that might have built an entire reconstructed tree of nodes.
As when you built it using createElement and appendChild etc etc.
Karl Dahlke
|
Ah, I misremembered the order of parameters when I saw
the call to JS_DefineProperty. Should have double checked.
It does in fact give innerHTML an initial value of empty
but does not rig up a getter to return empty. The getter
is left as null and the setter is rigged up. That means
for one thing you can set innerHTML to any value you like
and then read back the value you set. But the next question
is, can you read its value before you even set it? And
it turns out you can, in some cases, but not all. The code
I should have looked at is src/decorate.c and its calls to
the function establish_inner. This is called only for 9
specific element types, namely,
input, td, div, object, span, sup, sub, ovb, and P.
Any elements whose types are in this list will have a
correct initial value of innerHTML, but any other element
(including document.body) will not have an initial innerHTML.
For more completeness, I would suggest deleting the 6 calls
to establish_inner in decorate.c's switch statement, and
instead add a catch-all after the final close brace of
that switch block, like this:
establish_inner(t->jv, t->value, 0, action == TAGACT_INPUT);
That should make innerHTML work for a lot more elements.
It would of course take up more memory and slow us down a
little bit, but not as much as PhantomJS, and it would buy
compatibility with more Javascripty sites.
|
Your idea of calling establish_inner after the switch, to cover all tags, seems reasonable,
but for caution's sake I will wait until we have stamped a new version, which we expect to do in a week or so.
After that I'll put this modest yet important change at the top of the list.
Discussions like this should probably be on the developers mailing list
Edbrowse-dev@lists.the-brannons.com
Not all developers see these github messages,
and I'm sure they want to be in the loop.
Karl Dahlke
|
OK, I'll see if I can sign up to that list at some point. Not today, though, as I am a bit overloaded at the moment. One more thing I should mention is that edbrowse's support of default innerHTML is also limited by length, and if it is too long it will not be set at all. It's not immediately obvious from the code where this length limit is coming from. What it means is that scripts that try to do "search and replace" on the entire document by accessing a wrapper element's innerHTML will fail unless it is a short test document. It also means we cannot get a DOM tree out of current versions of edbrowse simply by adding a DIV element around the entire body, in case anyone was thinking of trying that.
innerHTML is limited by length,
Fixed. There is a small impact on performance, which I described on list.
Karl Dahlke
|
Update: in edbrowse 3.7.4, …
So as per previous comments on this thread, you can write the original page source to a file, or you can write the final version of the rendered text to a file, but there is not yet a way to write out a with-markup version of the DOM after it has been changed by Javascript.
There are many topics in this thread.
This reply only addresses one of them.
I implemented element.getAttributeNames, because it was easy to do.
I'll look at the other issues later.
Karl Dahlke
|
I have a query. I want to download a webpage with JavaScript using edbrowse, to make an offline copy. How can I achieve this? When I browse that site, the JavaScript content is not loaded: no text or links are loaded.
I'm not sure what you are asking.
If you want a local copy of a web page, with local javascript files and local css files, it is theoretically possible; we do this a lot when debugging,
but it's not easy, has some caveats, and most users don't do that.
Call up the debugging page in the edbrowse wiki, and look for the word snapshot.
If you're just saying there's a web page wherein js isn't working properly, well there are a lot of those, let us know which one and we'll add it to the list.
Karl Dahlke
|
I am looking for a way to archive/back up a fully JavaScript-DOM-loaded website for offline backup. edbrowse is the only text-based browser that supports JavaScript. So what should the command be?
i am looking for a way to archive/backup fully javascript dom loaded website
This is a complicated question, and we would have to start with some requirements.
* There are crawlers that gather all the html files that are directly linked by <a href=> and <iframe src=> tags.
google and other search engines have gone beyond this, because some html is brought in dynamically by scripts.
I'm guessing you would want something like that.
* Sounds like you want more than just the html files, but also to archive the js files.
* What about the css files?
* what about the json files?
In general, json is fetched dynamically, by scripts, which happens also for other scripts and sometimes even html, so this is not unusual,
however, json is often timely, like the articles of the day, or other information that is topical, or relevant today but perhaps not tomorrow.
Example: nasa.gov presents only a template, then fetches its articles and other things as json files through xhr, and pastes them in place.
So I'm guessing json files are not to be archived.
That would make it easier.
* Then there is the question of archiving files from your website or all files referenced.
A website often accesses common libraries from other domains, e.g. css fonts that google provides as public, or jquery libraries that are public.
Would you want to archive these off-site files, or just the files that are on the domain of the main html page, on the same web server if you will.
* Some javascript, and this is sadly more and more common, uses timers and promise jobs to fetch follow-on html or javascript or json data.
So you have to allow those timers to run.
In other words, it can never be a single command to edbrowse to do this.
You might have to send it commands, with a call to sleep in just the right place, so that the timers can run and the additional scripts or html can be fetched. A human naturally pauses until he feels the page has fully loaded, but that's hard to automate.
Err on the conservative side I suppose, and sleep for 30 seconds, and hope for the best -
but edbrowse can be slow at times, and combined with internet delays, sometimes even 30 seconds isn't enough.
* As I mentioned, when you think the page is loaded, you can enter the commands jdb and snapshot() and you will have local copies of all the css and js that were used to build that page, plus a jslocal file to map those to the urls where they came from.
Or,
browse with db3 and scrape the output for javascript source, css source, *redirect, xhr send, and other keywords,
capture the urls on those lines,
then use curl or wget to download all those files and put them in whatever names and locations you wish.
* However you do it, this is just one page.
It's not a crawler.
I don't follow the <A href=> tags to pull down other pages, and then the javascript that those pages might employ, which is sometimes the same js files and sometimes not.
I don't know if you want all the pages that might be reasonably referenced by this page, or just this page.
At the end I think it's a nontrivial development project, for which edbrowse is a great start, but perhaps only a start.
I'd need to know more of what you want to do, as per the questions in this writeup, and even then I don't think I have the time to take it on, but I can certainly consult.
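The db3-scraping route sketched above might start like this; the log format is an assumption, so adjust the pattern to whatever your build of edbrowse actually prints:

```shell
# Pull http(s) urls out of a captured edbrowse db3 log, deduplicate,
# and hand them to wget. db3.log is a hypothetical capture file.
extract_urls() {
  grep -Eo 'https?://[^"[:space:]]+' "$1" | sort -u
}

# extract_urls db3.log | wget -i - -x   # -x keeps the host/path layout
```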
Karl Dahlke
|
Thanks. What I want is to execute the external JavaScript that was in html src tags, update the DOM accordingly, and then scrape the final updated html/DOM.
Well that is considerably simpler than the project I was imagining.
A script like this might be a start.
# get a rendered web page
# The url is the only argument.
(
echo showall+
echo "b $1"
sleep 30
echo ,p
echo q
) | edbrowse > outfile
Then do whatever you wish with the output file.
You'll recognize ,p as the ed command to print the entire page.
And q of course to quit.
But there are a lot of caveats.
There are still a lot of websites wherein edbrowse doesn't handle the javascript very well.
And I already commented on timers updating the page, thus sleep 30 to give the timers time to run.
I tested this on http://www.mathreference.com which isn't a great test because it doesn't use much javascript, but it does use some.
You can test it on whatever.com and play around with it.
Karl Dahlke
|
If you want to back-convert the final rendered DOM into HTML, so for example if the site says …
Ok, I think I see where you are headed, and it is quite interesting.
I think you could do it from jdb, which is our interactive javascript debugger.
After the delay of 30 seconds, which I have already talked about:
jdb
document.documentElement.outerHTML ^> rendered_html
This is a standard feature of dom, which I made a tentative first step at implementing.
It slightly worked before, I could see it adding the <a> tag in your example, and it works a little better after my latest commit,
now bringing in all the attributes.
This is largely untested and unused, so if you wanted to play with it, and point out problems, I'd be happy to fix it;
some real world js on a web page might depend on this working some day, so it would help if it all worked properly.
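The jdb session above could be scripted like the earlier sleep-30 example; a sketch, with two assumptions flagged: that a lone period exits jdb, and that 30 seconds is enough for the page's timers:

```shell
# Browse a url, wait for timers, then dump the rendered DOM to the
# file rendered_html via jdb.
dump_dom() {
  (
    echo "b $1"
    sleep 30      # let timers and follow-on fetches run
    echo jdb
    echo 'document.documentElement.outerHTML ^> rendered_html'
    echo .        # assumption: a lone period exits jdb
    echo q
  ) | edbrowse
}
```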
Karl Dahlke
|
That's great. If anyone reading this gets …
To compile a more recent …
|
And yes, some sites do depend on outerHTML, but more depend on innerHTML, and some of them expect innerHTML to be dynamic (like outerHTML currently is). They also expect both innerHTML and outerHTML to have setters which re-parse an HTML fragment and repopulate part of the DOM.
Silas S. Brown wrote on Mon, Nov 15, 2021 at 02:47:19PM -0800:
* Fedora 34 is still stuck on edbrowse 3.7.4 (and it won't install on
Fedora 35 because the `libtidy` package is missing; maybe we should
file a bug report with Fedora);
hm? There is no edbrowse package on fedora.
It looks like there's an external package from "rpmsphere", whatever that
is, but it's not part of fedora.
libtidy is also very much present, but got a soname update, so that
edbrowse package, which requires libtidy.so.5, can't find it, because
fedora 35 provides libtidy.so.58 instead: that external package just
needs to be rebuilt for fedora 35; there's nothing wrong with it.
Ideally getting an edbrowse package upstream instead of an external repo
would fix all that, need to start packaging quickjs first though...
|
Yes, I realised my comment was wrong (I'd forgotten I'd installed RPM Fusion on the box), so I edited my comment shortly after writing it. But GitHub still sent the wrong version to anyone subscribed to this thread by email. Sorry about that.
Thank you for the comments on edbrowse packaging and various distros.
We should continue to "encourage" distros to package and release the latest edbrowse, for the benefit of the average user.
They are always quite a bit behind, if they provide it at all, and so much has been added recently.
In talking about outerHTML, sure it was added in 3.7.6, but didn't work very well, and I even made some changes to it recently, as per this thread, that aren't in any "version".
Folks should try to clone and build edbrowse from source, it isn't hard to do.
There are step by step instructions in the wiki, for 32 bit, for 64 bit, for the pi, etc.
I am often responsive, fixing bugs and problems quickly, but that only helps if you follow the latest.
Also, we do want to provide static binaries more often, not just on the releases but maybe weekly or some other schedule.
We'll keep you up to date if that happens.
Karl Dahlke
|
Just filed a ticket at MacPorts asking them to update, with a scripted version of the above instructions (they might say "oh that's not how we write our scripts at MacPorts" but hopefully they can adapt it).
I've updated edbrowse in MacPorts to 3.8.2.1 and listed myself as the maintainer, so I should notice any future versions becoming available and update the port in short order. If I fail to do so, please file a MacPorts ticket or send a MacPorts pull request.