Readability

Turn any web page into a clean view. This module is based on arc90's readability project.

Features

  1. Optimized for more websites.
  2. Supports HTML5 tags (article, section) and the Microdata API.
  3. Focuses on both accuracy and performance: 4x faster than arc90's version.
  4. Supports encodings such as GBK and GB2312.
  5. Automatically converts relative URLs to absolute ones for images and links (thanks to Guillermo Baigorria & Tom Sutton).

Example

Before -> After

Install

$ npm install node-readability

Note that from v2.0.0, this module only works with Node.js >= 2.0. If you use an older Node.js version, you can still install a release in the 1.x series (npm install node-readability@1).

Usage

read(html [, options], callback)

Where

  • html is a URL or HTML code.
  • options is an optional options object.
  • callback is the callback to run: callback(error, article, meta)

Example

var read = require('node-readability');

read('http://howtonode.org/really-simple-file-uploads', function(err, article, meta) {
  // Main Article
  console.log(article.content);
  // Title
  console.log(article.title);

  // HTML Source Code
  console.log(article.html);
  // DOM
  console.log(article.document);

  // Response Object from Request Lib
  console.log(meta);

  // Close article to clean up jsdom and prevent leaks
  article.close();
});

NB: If the page is marked with a charset other than utf-8, it will be converted automatically. Charsets such as GBK and GB2312 are also supported.
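For callers who prefer promises, the callback signature above can be wrapped. The following is a minimal sketch, not part of the module: readAsync is a hypothetical helper name, and read is passed in as an argument (in real use it would be require('node-readability')) so the wrapper can be exercised with a stub.

```javascript
// Hypothetical helper: wraps read(html [, options], callback) in a Promise.
// `read` is expected to be the function exported by node-readability; it is
// injected here so the wrapper can be tried without network access.
function readAsync(read, html, options) {
  return new Promise(function (resolve, reject) {
    read(html, options || {}, function (err, article, meta) {
      if (err) {
        return reject(err);
      }
      // Hand back both callback values; the caller is still responsible
      // for calling article.close() when finished.
      resolve({ article: article, meta: meta });
    });
  });
}
```

The resolved value keeps article and meta together, so the caller can read article.title or article.content and then call article.close() as in the example above.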

Options

node-readability passes the options directly to request. See the request library's documentation for all available options.
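Because the options object is handed to request, ordinary request settings such as timeout and headers can be supplied in it. A small sketch: buildRequestOptions is a hypothetical helper, and the user-agent string and timeout value are illustrative only.

```javascript
// Hypothetical helper that builds an options object for read().
// `timeout` (in milliseconds) and `headers` are standard options of the
// request library; the concrete values are chosen by the caller.
function buildRequestOptions(userAgent, timeoutMs) {
  return {
    timeout: timeoutMs,
    headers: { 'User-Agent': userAgent }
  };
}

var options = buildRequestOptions('my-reader/1.0', 5000);
console.log(options);
// The object is then passed as the second argument:
//   read(url, options, callback)
```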

node-readability has two additional options:

  • cleanRulers, which lets you set your own validation rules for tags.

Each rule is a function called with an element and its tag name. If a rule returns true, the tag is treated as valid; otherwise it is not. options.cleanRulers = [callback(obj, tagName)]

read(url, {
  cleanRulers: [
    function(obj, tag) {
      if(tag === 'object') {
        if(obj.getAttribute('class') === 'BrightcoveExperience') {
          return true;
        }
      }
    }
  ]}, function(err, article, response) {
    //...
  });
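When several tags need the same kind of check, the rule can be produced by a small factory. This is only a sketch under the signature shown above: keepTagWithClass is a hypothetical helper, and the 'embed' / 'player' pair is an invented example value.

```javascript
// Hypothetical factory: returns a rule that reports true for <tagName>
// elements whose class attribute equals className exactly, mirroring the
// comparison used in the example above.
function keepTagWithClass(tagName, className) {
  return function (obj, tag) {
    return tag === tagName && obj.getAttribute('class') === className;
  };
}

var options = {
  cleanRulers: [
    keepTagWithClass('object', 'BrightcoveExperience'),
    keepTagWithClass('embed', 'player') // illustrative values only
  ]
};
```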

  • preprocess, which should be a function that checks or modifies the downloaded source before it is passed to readability.

options.preprocess = callback(source, response, contentType, callback);

read(url, {
  preprocess: function(source, response, contentType, callback) {
    // maxBodySize is a size limit that you define yourself
    if (source.length > maxBodySize) {
      return callback(new Error('too big'));
    }
    callback(null, source);
  }
}, function(err, article, response) {
  //...
});
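The size check above can be packaged so the limit is explicit rather than a free variable. A minimal sketch, assuming the preprocess signature shown above; limitSourceLength is a hypothetical name and the 1 MiB limit is an arbitrary illustration.

```javascript
// Hypothetical factory: builds a preprocess function that rejects sources
// longer than maxLength and passes everything else through unchanged.
function limitSourceLength(maxLength) {
  return function (source, response, contentType, callback) {
    if (source.length > maxLength) {
      return callback(new Error('too big'));
    }
    callback(null, source);
  };
}

// Illustrative limit; choose whatever suits your application.
var options = { preprocess: limitSourceLength(1024 * 1024) };
```

The resulting options object is passed to read(url, options, callback) in the same way as in the example above.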

article object

content

The article content of the web page. Its value is false if extraction failed.

title

The article title of the web page. It may differ from the text in the <title> tag.

textBody

A string containing all the text found on the page.

html

The original HTML of the web page.

document

The document of the web page generated by jsdom. You can use it to access the DOM directly (for example, article.document.getElementById('main')).
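The fields above can be copied into a plain object before the article is closed, so nothing keeps a reference to the jsdom document afterwards. A sketch only: summarizeArticle is a hypothetical helper, and returning null on failure is a choice made for this example.

```javascript
// Hypothetical helper: copies the needed fields from an article object
// into a plain object, then closes the article to release jsdom resources.
function summarizeArticle(article) {
  try {
    if (article.content === false) {
      return null; // extraction failed, per the `content` description above
    }
    return {
      title: article.title,
      content: article.content,
      text: article.textBody
    };
  } finally {
    // Runs on both the success and the failure path.
    article.close();
  }
}
```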

meta object

The response object from the request library. It is useful if you need the final URL after all redirects or want to read response headers.
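As an illustration, the values mentioned above might be pulled out of meta like this. The sketch rests on assumptions about the request library's response object (the post-redirect URL under meta.request.uri.href, header names in lowercase); describeResponse is a hypothetical helper, and it defensively returns null when meta is missing.

```javascript
// Hypothetical helper: extracts a few commonly needed values from meta.
// Assumes meta is a request response object: headers use lowercase names
// and meta.request.uri.href holds the URL after redirects.
function describeResponse(meta) {
  if (!meta) {
    return null; // defensive: no response object was provided
  }
  return {
    status: meta.statusCode,
    finalUrl: meta.request && meta.request.uri ? meta.request.uri.href : null,
    contentType: meta.headers ? meta.headers['content-type'] : undefined
  };
}
```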

Why not Cheerio

This library uses jsdom instead of cheerio to parse HTML because some data, such as image size and element visibility, cannot be obtained with cheerio, and losing it would significantly affect the result.

Contributors

https://github.com/luin/node-readability/graphs/contributors

License

This code is licensed under the Apache License 2.0. http://www.apache.org/licenses/LICENSE-2.0