What is a LinkCrawler Rule?
LinkCrawler Rules can be used to automatically process URLs, as long as those URLs are not already handled by a plugin.
To find out if a website is supported by a plugin, read this article.
You can add as many rules as you like, and they can be chained, e.g. the results of rule 1 are processed again by rule 2.
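For illustration, here is a minimal sketch of two chained rules; all domains and patterns are invented placeholders (the individual fields are explained further below). The first rule rewrites a hypothetical short URL into a target URL, and that result then matches the second rule, which crawls the target page:
[
  {
    "enabled": true,
    "name": "rule 1: rewrite a hypothetical short URL",
    "pattern": "https?://short\\.example/(\\w+)",
    "rule": "REWRITE",
    "rewriteReplaceWith": "https://target.example/view/$1"
  },
  {
    "enabled": true,
    "name": "rule 2: crawl the rewritten URL",
    "pattern": "https?://target\\.example/view/\\w+",
    "rule": "DEEPDECRYPT",
    "packageNamePattern": "<title>(.*?)</title>"
  }
]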
LinkCrawler Rules are part of JDownloader's advanced features.
You can find them under Settings -> Advanced Settings -> LinkCrawler: LinkCrawlerRules
Click into the Value field to edit it and replace its content with your rule(s).
Also make sure that the "LinkCrawlerRules" checkbox setting is enabled.
There is no GUI available for this feature.
If you only want to add a pre-made LinkCrawler Rule to JD, you can stop reading here. If you want to learn how to create your own LinkCrawler Rules, continue reading.
Here is a list of LinkCrawler Rule types and what they can be used for.

| Rule type | What it does | Type-specific field(s) |
| --- | --- | --- |
| DEEPDECRYPT | Crawls URLs (and other information) out of the HTML code of matching pages. | packageNamePattern, passwordPattern, deepPattern |
| DIRECTHTTP | Treats matching URLs as directly downloadable files. | |
| FOLLOWREDIRECT | Returns all found redirects as results. | |
| REWRITE | Rewrites matching URLs into new URLs. | rewriteReplaceWith |
| SUBMITFORM | Finds and submits an HTML form. | formPattern |

No matter which type of rule you use, JD will afterwards auto-grab URLs matching your defined "pattern" (see below), also via clipboard observation.
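For example, here is a minimal sketch of a FOLLOWREDIRECT rule, assuming a hypothetical redirect service at redirect.example; JD would follow the redirect of every matching URL and return the final URL as result:
[
  {
    "enabled": true,
    "name": "follow redirects of a hypothetical shortener",
    "pattern": "https?://redirect\\.example/[a-z0-9]+",
    "rule": "FOLLOWREDIRECT"
  }
]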
Preparation
Our knowledgebase contains common examples, but if you need to create more complicated rules you may find examples in our support forum, and of course you can contact our staff if you get stuck.
Basic example of the structure of a LinkCrawler Rule:
[
  {
    "enabled": true,
    "cookies": [
      ["key1", "value1"],
      ["key2", "value2"]
    ],
    "updateCookies": true,
    "logging": false,
    "maxDecryptDepth": 1,
    "name": "example first rule in list of rules",
    "pattern": "https://(?:www\\.)?example\\.com/(.+)",
    "rule": "DEEPDECRYPT",
    "packageNamePattern": "<title>(.*?)</title>",
    "passwordPattern": null,
    "formPattern": null,
    "deepPattern": null,
    "rewriteReplaceWith": "https://example2.com/$1"
  },
  {
    "enabled": true,
    "logging": false,
    "maxDecryptDepth": 1,
    "name": "example second rule in list of rules",
    "pattern": "https://support\\.jdownloader\\.org/Knowledgebase/Article/GetAttachment/\\d+/\\d+",
    "rule": "DIRECTHTTP"
  }
]
LinkCrawler Rules are stored as a JSON array.
Especially if you have multiple rules, it can be a good idea to use a JSON editor to work on them, e.g. jsoneditoronline.org or jsonformatter.org.
JD will only allow you to add rules with a valid JSON structure!
Make sure that special characters like quotation marks are correctly escaped so that your JSON stays valid!
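As a small illustration of the escaping rules: every backslash that belongs to the regular expression has to be doubled for JSON, and every literal quotation mark inside a pattern has to be written as \". Both patterns below are made-up examples:
"pattern": "https?://(?:www\\.)?example\\.com/id/(\\d+)",
"packageNamePattern": "<meta name=\"title\" content=\"(.*?)\">"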
Explanation of all possible fields:
Depending on the type of your LinkCrawler rule, only some of these fields are required.
While some fields are optional for the user, JDownloader may auto-generate them after the rule is added, for example the fields "id" and "enabled".
| Field name | Description | Data-type / example | Usable for rule-type(s) |
| --- | --- | --- | --- |
| enabled | Enables/disables this rule. | boolean | ALL |
| cookies | Your personal cookies, e.g. login cookies of websites you want to crawl content from (this only makes sense if the content is not accessible without an account). If "updateCookies" is enabled, JD will update these with all cookies it receives from the website(s) that match "pattern". | List[List[String, String]], e.g. "cookies": [["phpssid", "ffffffffffvoirg7ffffffffff"]] | DIRECTHTTP, DEEPDECRYPT, SUBMITFORM, FOLLOWREDIRECT |
| updateCookies | If the target website returns new cookies, save them inside this rule and update the rule. | boolean | DIRECTHTTP, DEEPDECRYPT, SUBMITFORM, FOLLOWREDIRECT |
| logging | Enable this for support purposes. Logs of your LinkCrawler Rules can be found in your JD install dir under logs/: LinkCrawlerRule.<RuleID>.log.0 and LinkCrawlerDeep.* | boolean | ALL |
| maxDecryptDepth | How many layers deep your rule is allowed to crawl (e.g. if the rule returns URLs that match the same rule again, how often may this chain repeat?). | int | ALL |
| id | Unique ID of the rule. Gets auto-generated when you initially add the rule. | int | ALL |
| name | Name of the rule. | String | ALL |
| pattern | RegEx: This rule will be used for all URLs matching this pattern. | String, e.g. https://(?:www\\.)?example\\.com/(.+) | ALL |
| rule | Type of the rule. | String: one of DEEPDECRYPT, REWRITE, DIRECTHTTP, FOLLOWREDIRECT, SUBMITFORM | ALL |
| packageNamePattern | HTML RegEx: All URLs crawled by this rule will go into one package if the RegEx returns a result. | String, e.g. <title>(.*?)</title> | DEEPDECRYPT |
| passwordPattern | HTML RegEx: Pattern to find extraction passwords. | String, e.g. password:([^>]+)> | DEEPDECRYPT |
| formPattern | HTML RegEx: Finds the HTML form to submit. | String, e.g. <form id="example">(.*?)</form> | SUBMITFORM |
| deepPattern | HTML RegEx: Which URLs this rule should return from the HTML code. null = auto-scan and return all supported URLs found in the HTML code. | String | DEEPDECRYPT |
| rewriteReplaceWith | Pattern for the new URL. | String, e.g. https://example2.com/$1 | REWRITE |
| headers | Headers as key-value pairs. Warning: Provide cookies either via the "cookies" field or via a header, not both at the same time! | List[List[String, String]], e.g. "headers": [["YourHeaderKey", "YourHeaderValue"], ["User-Agent", "JD"]] | DEEPDECRYPT |
| propertyPatterns (STILL_UNDER_DEVELOPMENT!) | Property patterns as a map of single regular expressions or lists of regular expressions. If multiple regular expressions are provided, the first result will be set as the property value. This can be useful as a fallback, or if a specific piece of information is sometimes available in one place and sometimes in another. Each result is accessible via the property "lc_<yourKey>". The property keys of the example here would be lc_videoTitle and lc_videoDurationSeconds. You can use them e.g. to create custom filenames, see the Packagizer docs. | Map, e.g. "propertyPatterns": { "videoTitle": ["data-v-title=\"([^\"]+)\"", "<title>([^<]+)</title>"], "videoDurationSeconds": "duration:([0-9]+)" } | DEEPDECRYPT |
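To tie the fields together, here is a sketch of a SUBMITFORM rule plus a DEEPDECRYPT rule that uses "headers" and "propertyPatterns". All domains, the form id, the header value and the crawled page layout are invented placeholders, not tested values:
[
  {
    "enabled": true,
    "name": "submit a hypothetical confirmation form",
    "pattern": "https?://forms\\.example/download/\\d+",
    "rule": "SUBMITFORM",
    "formPattern": "<form id=\"confirm\">(.*?)</form>"
  },
  {
    "enabled": true,
    "name": "crawl a hypothetical video page with custom header and properties",
    "pattern": "https?://videos\\.example/watch/\\w+",
    "rule": "DEEPDECRYPT",
    "headers": [
      ["User-Agent", "JD"]
    ],
    "propertyPatterns": {
      "videoTitle": [
        "data-v-title=\"([^\"]+)\"",
        "<title>([^<]+)</title>"
      ],
      "videoDurationSeconds": "duration:([0-9]+)"
    }
  }
]
The resulting properties of the second rule would be accessible as lc_videoTitle and lc_videoDurationSeconds, e.g. in Packagizer rules for custom filenames.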